* feat: add Dataset class
* test: add dataset tests
* feat: add meaningful error message to TaskType
* feat: add to from datasest for text classification
* test: add dataset fixtures
* feat: implement pandas to text classification
* feat: add token classification support
* test: add token classification tests
* test: use singlelabel_textclassification_records
* chore: small improvements
* refactor: switch to class implementation
* chore: import in init
* test: add missing tests plus minor fixes
* chore: add future warning about as_pandas
* chore: more integrations in the library
* fix: wrong import
* chore: add new test dependency
* test: add test for tasktype
* feat: ignore not supported columns
* test: add tests for read_pandas/datasets
* docs: put type hints only in description, it becomes too messy
* test: improve tests
* docs: curate docstrings
* docs: add datasets to python reference
* fix: return None's instead of empty dicts for metrics
* docs: add dataset guide
* docs: increase contrast
* Update docs/guides/datasets_in_the_client.ipynb Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>
* Update docs/guides/datasets_in_the_client.ipynb Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>
* Update docs/guides/datasets_in_the_client.ipynb Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>
* Update docs/guides/datasets_in_the_client.ipynb Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>
* Update docs/guides/datasets_in_the_client.ipynb Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>
* test: add general log/load tests for all allowed input types of rb.load
* fix: remove append

Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>
(cherry picked from commit 14c8087)
1 parent a374204, commit b5bbca6
Showing 19 changed files with 1,769 additions and 84 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,176 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "d09c1533-97fd-43f7-83ea-f3a56edd1d5e", | ||
"metadata": {}, | ||
"source": [ | ||
"# Datasets\n", | ||
"\n", | ||
"This guide showcases some features of the `Dataset` classes in the Rubrix client.\n", | ||
"The Dataset classes are lightweight containers for Rubrix records. These classes facilitate importing from and exporting to different formats (e.g., `pandas.DataFrame`, `datasets.Dataset`) as well as sharing and versioning Rubrix datasets using the Hugging Face Hub.\n", | ||
"\n", | ||
"For each record type there's a corresponding Dataset class called `DatasetFor<RecordType>`.\n", | ||
"You can look up their API in the [reference section](../reference/python/python_client.rst#module-rubrix.client.datasets)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ae9d4c9e-24a1-4a59-8a17-3b6ac1a39c88", | ||
"metadata": {}, | ||
"source": [ | ||
"## Working with a Dataset\n", | ||
"\n", | ||
"Under the hood the Dataset classes store the records in a simple Python list.\n", | ||
"Therefore, working with a Dataset class is not very different from working with a simple list of records:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "edbf8c7f-463d-48ee-944a-3adb57edb159", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import rubrix as rb\n", | ||
"\n", | ||
"# Start with a list of Rubrix records, here assumed to be stored in my_records\n", | ||
"dataset_rb = rb.DatasetForTextClassification(my_records)\n", | ||
"\n", | ||
"# Loop over the dataset\n", | ||
"for record in dataset_rb:\n", | ||
"    print(record)\n", | ||
"\n", | ||
"# Replace a record by indexing into the dataset\n", | ||
"dataset_rb[0] = rb.TextClassificationRecord(inputs=\"replace record\")\n", | ||
"\n", | ||
"# Log the dataset to the Rubrix web app\n", | ||
"rb.log(dataset_rb, \"my_dataset\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "2d3e9fc0-8563-4727-abca-62205a4de385", | ||
"metadata": {}, | ||
"source": [ | ||
"The Dataset classes also perform extra checks to make sure you do not mix record types when appending to or indexing into a dataset." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "df88889b-12f4-472f-bcbe-fb47be475d02", | ||
"metadata": {}, | ||
"source": [ | ||
"## Importing from other formats\n", | ||
"\n", | ||
"When you have your data in a [_pandas DataFrame_](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) or a [_datasets Dataset_](https://huggingface.co/docs/datasets/access.html), we provide some neat shortcuts to import this data into a Rubrix Dataset. \n", | ||
"You have to make sure that the data follows the record model of a specific task, otherwise you will get validation errors. \n", | ||
"Columns in your DataFrame/Dataset that are not supported or recognized will simply be ignored.\n", | ||
"\n", | ||
"The record models of the tasks are explained in the [reference section](../reference/python/python_client.rst#module-rubrix.client.models). \n", | ||
"\n", | ||
"<div class=\"alert alert-info\">\n", | ||
"\n", | ||
"Note\n", | ||
"\n", | ||
"Due to its pyarrow nature, data in a `datasets.Dataset` has to follow a slightly different model, which you can look up in the examples of the `Dataset*.from_datasets` [docstrings](../reference/python/python_client.rst#rubrix.client.datasets.DatasetForTokenClassification.from_datasets). \n", | ||
" \n", | ||
"</div>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "62ca56d4-2bb5-4c77-a069-7a50ee78b415", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import rubrix as rb\n", | ||
"\n", | ||
"# import data from a pandas DataFrame\n", | ||
"dataset_rb = rb.read_pandas(my_dataframe, task=\"TextClassification\")\n", | ||
"\n", | ||
"# import data from a datasets Dataset\n", | ||
"dataset_rb = rb.read_datasets(my_dataset, task=\"TextClassification\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "eb290a71-ad07-496f-b167-ad80d91fa633", | ||
"metadata": {}, | ||
"source": [ | ||
"## Sharing via the Hugging Face Hub\n", | ||
"\n", | ||
"You can easily share your Rubrix dataset with your community via the Hugging Face Hub.\n", | ||
"For this you just need to export your Rubrix Dataset to a `datasets.Dataset` and [push it to the hub](https://huggingface.co/docs/datasets/upload_dataset.html?highlight=push_to_hub#upload-from-python):" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "f4d6d70b-0b91-4efb-94b6-6b7710c105c2", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import rubrix as rb\n", | ||
"\n", | ||
"# load your annotated dataset from the Rubrix web app\n", | ||
"dataset_rb = rb.load(\"my_dataset\", as_pandas=False)\n", | ||
"\n", | ||
"# export your Rubrix Dataset to a datasets Dataset\n", | ||
"dataset_ds = dataset_rb.to_datasets()\n", | ||
"\n", | ||
"# push the dataset to the Hugging Face Hub\n", | ||
"dataset_ds.push_to_hub(\"my_dataset\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "696605dd-be87-4ae6-b367-b0cdabfaf39f", | ||
"metadata": {}, | ||
"source": [ | ||
"Afterward, your community can easily access your annotated dataset and log it directly to the Rubrix web app:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "e70a4792-bc91-4d64-8465-b2bccf23502f", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import rubrix as rb\n", | ||
"from datasets import load_dataset\n", | ||
"\n", | ||
"# download the dataset from the Hugging Face Hub\n", | ||
"dataset_ds = load_dataset(\"user/my_dataset\", split=\"train\")\n", | ||
"\n", | ||
"# read in the dataset, assuming it's a dataset for text classification\n", | ||
"dataset_rb = rb.read_datasets(dataset_ds, task=\"TextClassification\")\n", | ||
"\n", | ||
"# log the dataset to the Rubrix web app\n", | ||
"rb.log(dataset_rb, \"dataset_by_user\")" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.10" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
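
The guide in this diff says the Dataset classes keep records in a plain Python list and check that record types are not mixed. As a rough illustration of that idea only, here is a minimal sketch in plain Python. `TextRecord`, `TokenRecord`, and `TypedDataset` are hypothetical stand-ins invented for this example. They are not Rubrix classes, and the sketch makes no claim about how Rubrix implements its checks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextRecord:
    """Hypothetical stand-in for a text classification record."""
    inputs: str
    annotation: Optional[str] = None


@dataclass
class TokenRecord:
    """Hypothetical stand-in for a token classification record."""
    text: str
    tokens: List[str]


class TypedDataset:
    """Hypothetical list-backed container that accepts one record type only."""

    def __init__(self, record_type, records=None):
        self._record_type = record_type
        self._records = []
        for record in records or []:
            self.append(record)

    def _check(self, record):
        # Reject any record that is not of the dataset's record type.
        if not isinstance(record, self._record_type):
            raise TypeError(
                f"Expected {self._record_type.__name__}, "
                f"got {type(record).__name__}"
            )

    def append(self, record):
        self._check(record)
        self._records.append(record)

    def __getitem__(self, index):
        return self._records[index]

    def __setitem__(self, index, record):
        # The same check applies when replacing a record by index.
        self._check(record)
        self._records[index] = record

    def __iter__(self):
        return iter(self._records)

    def __len__(self):
        return len(self._records)
```

With this sketch, `TypedDataset(TextRecord, [TextRecord("a")])` behaves like a list of text records, while appending or assigning a `TokenRecord` raises a `TypeError`.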