docs: add 'how to prepare your data for training' to basics (#1589)
* docs: add new section

* docs: save current draft

* docs: complete draft

* docs: add correct link

* docs: fix bullet points

* docs: remove last cell

(cherry picked from commit 9075b1c)
David Fidalgo authored and frascuchon committed Jul 8, 2022
1 parent baf0d4f commit a21bcf3
Showing 1 changed file with 156 additions and 1 deletion.
157 changes: 156 additions & 1 deletion docs/getting_started/basics.ipynb
@@ -15,7 +15,7 @@
"id": "14d13ee2-ffeb-46fa-9c62-d77c8328e499",
"metadata": {},
"source": [
"This guide will help you to get started with Rubrix to perform basic tasks such as uploading data or data annotation."
"This guide will help you get started with Rubrix to perform basic tasks such as uploading or annotating data."
]
},
{
@@ -904,6 +904,161 @@
"Check [our guide](../guides/weak-supervision.ipynb) for an extensive introduction to weak supervision with Rubrix. \n",
"Also, check the [feature reference](../reference/webapp/define_rules.md) for the Define rules mode of the web app and our [various tutorials](../tutorials/weak-supervision.md) to see practical examples of weak supervision workflows. "
]
},
{
"cell_type": "markdown",
"id": "f1144ce2-0fe2-48c0-8116-d57dc4429640",
"metadata": {},
"source": [
"## How to prepare your data for training"
]
},
{
"cell_type": "markdown",
"id": "c5437bcd-b42c-4f6b-a9f5-d4b45572c648",
"metadata": {},
"source": [
"Once you have uploaded and annotated your dataset in Rubrix, you are ready to prepare it for training a model. Most NLP models today are trained via [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) and need input-output pairs to serve as training examples for the model. The input part of such pairs is usually the text itself, while the output is the corresponding annotation. "
]
},
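To make the notion of input-output pairs concrete, here is a minimal sketch with made-up data (the texts and labels are purely illustrative, not from a real dataset):

```python
# Hypothetical input-output pairs for a text classification task:
# the input is the raw text, the output is the annotated label.
training_examples = [
    ("The new keyboard stopped working after a week.", "negative"),
    ("Great battery life and a crisp display.", "positive"),
]

# Split the pairs into parallel lists of inputs and outputs
inputs = [text for text, label in training_examples]
outputs = [label for text, label in training_examples]
```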
{
"cell_type": "markdown",
"id": "a62573a8-54c8-4002-9686-3450ad90c7a3",
"metadata": {},
"source": [
"### Manual extraction"
]
},
{
"cell_type": "markdown",
"id": "73b1923f-b100-4755-aba9-68d7de48d247",
"metadata": {},
"source": [
"The exact data format for training a model depends on your [training framework](#how-to-train-a-model) and the task you are tackling (text classification, token classification, etc.). Rubrix is framework-agnostic; you can always manually extract what you need for training from the records. \n",
"\n",
"The extraction happens using the [client library](../reference/python/python_client.rst) within a Python script, a Jupyter notebook, or your IDE of choice. First, we have to load the annotated dataset from the Rubrix UI:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "268ed86e-881d-4196-adc2-ebe01dacb306",
"metadata": {},
"outputs": [],
"source": [
"import rubrix as rb\n",
"\n",
"dataset = rb.load(\"my_annotated_dataset\")"
]
},
{
"cell_type": "markdown",
"id": "d061ca1a-98db-4c31-9362-f816e401c2b5",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"\n",
"Note\n",
" \n",
"If you follow a weak supervision approach, the steps are slightly different. \n",
"We refer you to our [weak supervision guide](../guides/weak-supervision.ipynb) for a complete workflow.\n",
" \n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "e6d94095-7cac-4810-97fe-c257b8f34a2c",
"metadata": {},
"source": [
"Then we can iterate over the records and extract our training examples. For example, let's assume you want to train a text classifier with an [sklearn pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that takes a text as input and outputs a label. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d34397c3-5b0d-4151-9945-135f54520f7e",
"metadata": {},
"outputs": [],
"source": [
"# Save the inputs and labels in Python lists\n",
"inputs, labels = [], []\n",
"\n",
"# Iterate over the records in the dataset\n",
"for record in dataset:\n",
" \n",
" # We only want records with annotations\n",
" if record.annotation:\n",
" inputs.append(record.text)\n",
" labels.append(record.annotation)\n",
"\n",
"# Train the model\n",
"sklearn_pipeline.fit(inputs, labels)"
]
},
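The `sklearn_pipeline` above is left undefined on purpose; one possible definition, assuming a simple bag-of-words classifier (a sketch, not part of the original notebook), could be:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# A minimal text classification pipeline: TF-IDF features
# followed by a logistic regression classifier
sklearn_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])
```

Once defined, `sklearn_pipeline.fit(inputs, labels)` trains the whole pipeline end to end on the extracted examples.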
{
"cell_type": "markdown",
"id": "dbd8be8e-dd83-4506-8ed8-2f32cf6b5835",
"metadata": {},
"source": [
"### Automatic extraction"
]
},
{
"cell_type": "markdown",
"id": "bbf82fc5-d3a5-4a56-b308-caf31a4d763b",
"metadata": {},
"source": [
"For a few frameworks and tasks, Rubrix provides a convenient method to automatically extract training examples in a suitable format from a dataset. \n",
"\n",
"For example, if you want to train a [transformers](https://huggingface.co/docs/transformers/index) model for text classification, you can load an annotated dataset for text classification and call the `prepare_for_training()` method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aa1c2aa8-603b-48a5-8da4-11c7e93f9772",
"metadata": {},
"outputs": [],
"source": [
"dataset = rb.load(\"my_annotated_dataset\")\n",
"\n",
"dataset_for_training = dataset.prepare_for_training()"
]
},
{
"cell_type": "markdown",
"id": "c71ee9dc-93b8-4802-83bc-e63048a83073",
"metadata": {},
"source": [
"With the returned `dataset_for_training`, you can continue following the steps to [fine-tune a pre-trained model](https://huggingface.co/docs/transformers/training#finetune-a-pretrained-model) with the [transformers library](https://huggingface.co/docs/transformers/index). \n",
"\n",
"Check the dedicated [dataset guide](../guides/datasets.ipynb#prepare-dataset-for-training) for more examples of the `prepare_for_training()` method."
]
},
{
"cell_type": "markdown",
"id": "90307acf-ba85-4f8c-86d3-ca398be7a496",
"metadata": {},
"source": [
"## How to train a model"
]
},
{
"cell_type": "markdown",
"id": "29cb1351-6324-4faa-9067-fd50785844f5",
"metadata": {},
"source": [
"Rubrix helps you create and curate training data. **It is not a framework for training a model.** You can use Rubrix as a complement to other excellent open-source frameworks that focus on developing and training NLP models.\n",
"\n",
"Here we list three of the most commonly used open-source libraries, but many more are available and may be more suited for your specific use case:\n",
"\n",
" - [transformers](https://huggingface.co/docs/transformers/index): This library provides thousands of pre-trained models for various NLP tasks and modalities. Its excellent documentation focuses on fine-tuning those models to your specific use case;\n",
" - [spaCy](https://spacy.io/): This library also comes with pre-trained models, built into a pipeline that tackles multiple tasks simultaneously. Since it is a dedicated NLP library, it offers many more NLP features beyond model training;\n",
" - [scikit-learn](https://scikit-learn.org/stable/): This de facto standard library is a powerful Swiss Army knife for machine learning with some NLP support. Its NLP models usually lag behind transformers or spaCy in performance, but give it a try if you want to train a lightweight model quickly; \n",
" \n",
"Check our [cookbook](../guides/cookbook.ipynb) for many examples of how to train models using these frameworks together with Rubrix."
]
}
],
"metadata": {
