docs: add 'how to prepare your data for training' to basics (#1589)
* docs: add new section

* docs: save current draft

* docs: complete draft

* docs: add correct link

* docs: fix bullet points

* docs: remove last cell

(cherry picked from commit 9075b1c)
David Fidalgo authored and frascuchon committed Jul 8, 2022
1 parent baf0d4f commit a21bcf3
Showing 1 changed file with 156 additions and 1 deletion.
157 changes: 156 additions & 1 deletion docs/getting_started/basics.ipynb
@@ -15,7 +15,7 @@
"id": "14d13ee2-ffeb-46fa-9c62-d77c8328e499",
"metadata": {},
"source": [
"This guide will help you to get started with Rubrix to perform basic tasks such as uploading data or data annotation."
"This guide will help you get started with Rubrix to perform basic tasks such as uploading or annotating data."
]
},
{
@@ -904,6 +904,161 @@
"Check [our guide](../guides/weak-supervision.ipynb) for an extensive introduction to weak supervision with Rubrix. \n",
"Also, check the [feature reference](../reference/webapp/define_rules.md) for the Define rules mode of the web app and our [various tutorials](../tutorials/weak-supervision.md) to see practical examples of weak supervision workflows. "
]
},
{
"cell_type": "markdown",
"id": "f1144ce2-0fe2-48c0-8116-d57dc4429640",
"metadata": {},
"source": [
"## How to prepare your data for training"
]
},
{
"cell_type": "markdown",
"id": "c5437bcd-b42c-4f6b-a9f5-d4b45572c648",
"metadata": {},
"source": [
"Once you have uploaded and annotated your dataset in Rubrix, you are ready to prepare it for training a model. Most NLP models today are trained via [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) and need input-output pairs to serve as training examples for the model. The input part of such pairs is usually the text itself, while the output is the corresponding annotation. "
]
},
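To make the notion of input-output pairs concrete, here is a minimal sketch with made-up data (the texts and labels are purely illustrative, not from a real dataset):

```python
# Hypothetical input-output pairs for a text classification task:
# the input is the raw text, the output is the annotated label.
training_examples = [
    ("The new keyboard stopped working after a week.", "negative"),
    ("Great battery life and a crisp display.", "positive"),
]

# Split the pairs into parallel lists of inputs and outputs
inputs = [text for text, label in training_examples]
outputs = [label for text, label in training_examples]
```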
{
"cell_type": "markdown",
"id": "a62573a8-54c8-4002-9686-3450ad90c7a3",
"metadata": {},
"source": [
"### Manual extraction"
]
},
{
"cell_type": "markdown",
"id": "73b1923f-b100-4755-aba9-68d7de48d247",
"metadata": {},
"source": [
"The exact data format for training a model depends on your [training framework](#how-to-train-a-model) and the task you are tackling (text classification, token classification, etc.). Rubrix is framework-agnostic; you can always manually extract what you need for training from the records. \n",
"\n",
"The extraction happens using the [client library](../reference/python/python_client.rst) within a Python script, a Jupyter notebook, or your IDE of choice. First, we have to load the annotated dataset from the Rubrix UI:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "268ed86e-881d-4196-adc2-ebe01dacb306",
"metadata": {},
"outputs": [],
"source": [
"import rubrix as rb\n",
"\n",
"dataset = rb.load(\"my_annotated_dataset\")"
]
},
{
"cell_type": "markdown",
"id": "d061ca1a-98db-4c31-9362-f816e401c2b5",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"\n",
"Note\n",
" \n",
"If you follow a weak supervision approach, the steps are slightly different. \n",
"We refer you to our [weak supervision guide](../guides/weak-supervision.ipynb) for a complete workflow.\n",
" \n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "e6d94095-7cac-4810-97fe-c257b8f34a2c",
"metadata": {},
"source": [
"Then we can iterate over the records and extract our training examples. For example, let's assume you want to train a text classifier with an [sklearn pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that takes a text as input and outputs a label. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d34397c3-5b0d-4151-9945-135f54520f7e",
"metadata": {},
"outputs": [],
"source": [
"# Save the inputs and labels in Python lists\n",
"inputs, labels = [], []\n",
"\n",
"# Iterate over the records in the dataset\n",
"for record in dataset:\n",
" \n",
" # We only want records with annotations\n",
" if record.annotation:\n",
" inputs.append(record.text)\n",
" labels.append(record.annotation)\n",
"\n",
"# Train the model\n",
"sklearn_pipeline.fit(inputs, labels)"
]
},
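The `sklearn_pipeline` above is left undefined on purpose; one possible definition, assuming a simple bag-of-words classifier (a sketch, not part of the original notebook), could be:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# A minimal text classification pipeline: TF-IDF features
# followed by a logistic regression classifier
sklearn_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])
```

Once defined, `sklearn_pipeline.fit(inputs, labels)` trains the whole pipeline end to end on the extracted examples.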
{
"cell_type": "markdown",
"id": "dbd8be8e-dd83-4506-8ed8-2f32cf6b5835",
"metadata": {},
"source": [
"### Automatic extraction"
]
},
{
"cell_type": "markdown",
"id": "bbf82fc5-d3a5-4a56-b308-caf31a4d763b",
"metadata": {},
"source": [
"For a few frameworks and tasks, Rubrix provides a convenient method to automatically extract training examples in a suitable format from a dataset. \n",
"\n",
"For example, if you want to train a [transformers](https://huggingface.co/docs/transformers/index) model for text classification, you can load an annotated dataset for text classification and call the `prepare_for_training()` method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aa1c2aa8-603b-48a5-8da4-11c7e93f9772",
"metadata": {},
"outputs": [],
"source": [
"dataset = rb.load(\"my_annotated_dataset\")\n",
"\n",
"dataset_for_training = dataset.prepare_for_training()"
]
},
{
"cell_type": "markdown",
"id": "c71ee9dc-93b8-4802-83bc-e63048a83073",
"metadata": {},
"source": [
"With the returned `dataset_for_training`, you can continue following the steps to [fine-tune a pre-trained model](https://huggingface.co/docs/transformers/training#finetune-a-pretrained-model) with the [transformers library](https://huggingface.co/docs/transformers/index). \n",
"\n",
"Check the dedicated [dataset guide](../guides/datasets.ipynb#prepare-dataset-for-training) for more examples of the `prepare_for_training()` method."
]
},
{
"cell_type": "markdown",
"id": "90307acf-ba85-4f8c-86d3-ca398be7a496",
"metadata": {},
"source": [
"## How to train a model"
]
},
{
"cell_type": "markdown",
"id": "29cb1351-6324-4faa-9067-fd50785844f5",
"metadata": {},
"source": [
"Rubrix helps you create and curate training data. **It is not a framework for training a model.** You can use Rubrix as a complement to other excellent open-source frameworks that focus on developing and training NLP models.\n",
"\n",
"Here we list three of the most commonly used open-source libraries, but many more are available and may be more suited for your specific use case:\n",
"\n",
" - [transformers](https://huggingface.co/docs/transformers/index): This library provides thousands of pre-trained models for various NLP tasks and modalities. Its excellent documentation focuses on fine-tuning those models to your specific use case;\n",
" - [spaCy](https://spacy.io/): This library also comes with pre-trained models, built into a pipeline that tackles multiple tasks simultaneously. Since it is a dedicated NLP library, it offers many more NLP features beyond model training;\n",
" - [scikit-learn](https://scikit-learn.org/stable/): This de facto standard library is a powerful Swiss Army knife for machine learning with some NLP support. Its NLP models usually lag behind transformers or spaCy in performance, but give it a try if you want to train a lightweight model quickly; \n",
" \n",
"Check our [cookbook](../guides/cookbook.ipynb) for many examples of how to train models using these frameworks together with Rubrix."
]
}
],
"metadata": {
