Source: https://huggingface.co/learn/nlp-course/chapter3/2?fw=pt

# Processing the data

Continuing with the example from the previous chapter, here is how we would train a sequence classifier on one batch in PyTorch:
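As a minimal sketch of such a single training step, the function below tokenizes two sentences, attaches labels, and takes one optimizer step. The bert-base-uncased checkpoint, the two example sentences, the labels, and the use of torch.optim.AdamW with default settings are assumptions made for illustration, not details given in this section. Calling the function downloads the checkpoint from the Hub.

```python
def train_on_one_batch(checkpoint="bert-base-uncased"):
    """Run one optimization step on a two-sentence batch and return the loss."""
    # Imported lazily so that defining this function needs no extra packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    model.train()  # enable dropout, as in regular training

    # Placeholder sentences and labels, chosen only for illustration.
    sequences = [
        "I've been waiting for a course like this my whole life.",
        "This course is amazing!",
    ]
    batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
    batch["labels"] = torch.tensor([1, 1])

    optimizer = torch.optim.AdamW(model.parameters())
    loss = model(**batch).loss  # the model computes the loss when labels are given
    loss.backward()
    optimizer.step()
    return loss.item()
```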

*Note: Many of these steps are done on the Hugging Face website rather than run in JupyterLab.*

Of course, just training the model on two sentences is not going to yield very good results. To get better results, you will need to prepare a bigger dataset.

In this section we will use as an example the MRPC (Microsoft Research Paraphrase Corpus) dataset, introduced in a paper (https://www.aclweb.org/anthology/I05-5002.pdf) by William B. Dolan and Chris Brockett. The dataset consists of 5,801 pairs of sentences, with a label indicating if they are paraphrases or not (i.e., if both sentences mean the same thing). We've selected it for this chapter because it's a small dataset, so it's easy to experiment with training on it.

## Loading a dataset from the Hub

https://youtu.be/_BZearw7f0w

The Hub doesn't just contain models; it also has multiple datasets in lots of different languages. You can browse the datasets here (https://huggingface.co/datasets), and we recommend you try to load and process a new dataset once you have gone through this section (see the general documentation here: https://huggingface.co/docs/datasets/loading_datasets.html#from-the-huggingface-hub). But for now, let's focus on the MRPC dataset! This is one of the 10 datasets composing the GLUE benchmark (https://gluebenchmark.com/), which is an academic benchmark that is used to measure the performance of ML models across 10 different text classification tasks.

The 🤗 Datasets library provides a very simple command to download and cache a dataset on the Hub. We can download the MRPC dataset like this:
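A minimal sketch of that call, assuming the datasets package is installed and the Hub is reachable. The helper names load_mrpc and check_split_sizes are hypothetical, and the expected row counts are simply the split sizes quoted in this section.

```python
# Split sizes as quoted in this section (not fetched from the Hub).
EXPECTED_ROWS = {"train": 3668, "validation": 408, "test": 1725}


def load_mrpc():
    """Download (or reuse the cached copy of) MRPC from the GLUE benchmark."""
    # Imported lazily so that defining this function needs no extra packages.
    from datasets import load_dataset

    return load_dataset("glue", "mrpc")


def check_split_sizes(raw_datasets, expected=EXPECTED_ROWS):
    """Return True if every split has the number of rows the text reports."""
    return all(len(raw_datasets[name]) == rows for name, rows in expected.items())
```

Calling raw_datasets = load_mrpc() and then check_split_sizes(raw_datasets) is one quick way to confirm that the download matches the numbers given here.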

As you can see, we get a DatasetDict object which contains the training set, the validation set, and the test set. Each of those contains several columns (sentence1, sentence2, label, and idx) and a variable number of rows, which are the number of elements in each set (so, there are 3,668 pairs of sentences in the training set, 408 in the validation set, and 1,725 in the test set).

This command downloads and caches the dataset, by default in *~/.cache/huggingface/datasets*. Recall from Chapter 2 that you can customize your cache folder by setting the HF_HOME environment variable.

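We can access each pair of sentences in our raw_datasets object by indexing, like with a dictionary: raw_datasets["train"] selects the training split (call it raw_train_dataset), and raw_train_dataset[0] returns its first pair of sentences.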

We can see the labels are already integers, so we won't have to do any preprocessing there. To know which integer corresponds to which label, we can inspect the features of our raw_train_dataset. This will tell us the type of each column.

Behind the scenes, label is of the type ClassLabel, and the mapping of integers to label name is stored in its *names* field. 0 corresponds to not_equivalent, and 1 corresponds to equivalent.
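The lookup from integer label to label name can be sketched as below. The helper name describe_example is hypothetical. It assumes its argument behaves like a 🤗 Datasets split, that is, indexable by integer and exposing a features mapping whose label entry has a names list.

```python
def describe_example(dataset, index=0):
    """Return one row with the name of its integer label added.

    `dataset` is assumed to behave like a Datasets split: indexable by
    integer, with a `features` mapping whose "label" entry exposes `names`.
    """
    row = dict(dataset[index])
    label_names = dataset.features["label"].names
    row["label_name"] = label_names[row["label"]]
    return row
```

With the MRPC training split, describe_example(raw_datasets["train"]) would return the first pair along with a label_name of either not_equivalent or equivalent.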

## Preprocessing a dataset

https://youtu.be/0u3ioSwev3s

To preprocess the dataset, we need to convert the text to numbers the model can make sense of. As you saw in the previous chapter, this is done with a tokenizer. We can feed the tokenizer one sentence or a list of sentences, so we can directly tokenize all the first sentences and all the second sentences of each pair by passing the sentence1 column and the sentence2 column of the training set to the tokenizer separately.

However, we can't just pass two sequences to the model and get a prediction of whether the two sentences are paraphrases or not. We need to handle the two sequences as a pair, and apply the appropriate preprocessing. Fortunately, the tokenizer can also take a pair of sequences and prepare it the way our BERT model expects: called with two sentences as its first two arguments, it returns a dictionary with the keys input_ids, token_type_ids, and attention_mask.

We discussed the input_ids and attention_mask keys in Chapter 2, but we put off talking about token_type_ids. In this example, this is what tells the model which part of the input is the first sentence and which is the second sentence.

If we decode the IDs inside input_ids back to words, we get the original tokens together with the special tokens [CLS] and [SEP] that the tokenizer inserted.

So we see the model expects the inputs to be of the form [CLS] sentence1 [SEP] sentence2 [SEP] when there are two sentences. Aligning this with the token_type_ids, the parts of the input corresponding to [CLS] sentence1 [SEP] all have a token type ID of 0, while the other parts, corresponding to sentence2 [SEP], all have a token type ID of 1.
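The layout described above can be reproduced in plain Python. This is only an illustration of the pattern, not the tokenizer's actual implementation. The function name build_pair and the word-level tokens are hypothetical, and a real BERT tokenizer would split text into subword tokens and map them to integer IDs.

```python
def build_pair(tokens_a, tokens_b):
    """Lay out two already-tokenized sentences the way the text describes.

    Returns the combined token list and the matching token type IDs.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS] sentence1 [SEP]; segment 1 covers sentence2 [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


# Hypothetical word-level tokens, used only to show the alignment.
tokens, type_ids = build_pair(["hello", "there"], ["hi"])
for token, type_id in zip(tokens, type_ids):
    print(f"{token:>8} -> {type_id}")
```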

Note that if you select a different checkpoint, you won't necessarily have the token_type_ids in your tokenized inputs (for instance, they're not returned if you use a DistilBERT model). They are only returned when the model will know what to do with them, because it has seen them during its pretraining.

Here, BERT is pretrained with token type IDs, and on top of the masked language modeling objective we talked about in Chapter 1, it has an additional objective called *next sentence prediction*. The goal with this task is to model the relationship between pairs of sentences.

With next sentence prediction, the model is provided pairs of sentences (with randomly masked tokens) and asked to predict whether the second sentence follows the first. To make the task non-trivial, half of the time the sentences follow each other in the original document they were extracted from, and the other half of the time the two sentences come from two different documents.
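To make the sampling scheme concrete, here is a toy sketch of how such training pairs could be assembled. It is not BERT's actual data pipeline: the function name make_nsp_pairs is hypothetical, token masking is left out, and the code assumes at least two non-empty documents.

```python
import random


def make_nsp_pairs(documents, rng):
    """Build (first, second, follows) triples from lists of sentences.

    `documents` is a list of documents, each a list of sentences. Roughly half
    of the pairs keep the true next sentence; the rest take a random sentence
    from a different document.
    """
    pairs = []
    for doc_index, sentences in enumerate(documents):
        for position in range(len(sentences) - 1):
            first = sentences[position]
            if rng.random() < 0.5:
                # Positive example: the sentence that really comes next.
                pairs.append((first, sentences[position + 1], True))
            else:
                # Negative example: a sentence drawn from another document.
                other_docs = [
                    doc for i, doc in enumerate(documents) if i != doc_index and doc
                ]
                second = rng.choice(rng.choice(other_docs))
                pairs.append((first, second, False))
    return pairs
```

Passing a seeded generator such as random.Random(0) makes the sampled pairs reproducible.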

In general, you don't need to worry about whether or not there are token_type_ids in your tokenized inputs: as long as you use the same checkpoint for the tokenizer and the model, everything will be fine as the tokenizer knows what to provide to its model.

Now that we have seen how our tokenizer can deal with one pair of sentences, we can use it to tokenize our whole dataset: like in the previous chapter, we can feed the tokenizer a list of pairs of sentences by giving it the list of first sentences, then the list of second sentences. This is also compatible with the padding and truncation options we saw in Chapter 2. So, one way to preprocess the training dataset is:
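A minimal sketch of this approach, written as a function so the tokenizer and dataset are passed in explicitly. The name tokenize_training_set is hypothetical. The list() calls are a precaution in case the installed version of 🤗 Datasets returns columns as lazy sequences rather than plain lists.

```python
def tokenize_training_set(tokenizer, raw_datasets):
    """Tokenize every training pair in one call, padding to a common length."""
    train = raw_datasets["train"]
    return tokenizer(
        list(train["sentence1"]),  # all first sentences
        list(train["sentence2"]),  # all second sentences
        padding=True,
        truncation=True,
    )
```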

This works well, but it has the disadvantage of returning a dictionary (with our keys, input_ids, attention_mask, and token_type_ids, and values that are lists of lists). It will also only work if you have enough RAM to store your whole dataset during the tokenization (whereas the datasets from the 🤗 Datasets library are Apache Arrow files stored on the disk, so you only keep the samples you ask for loaded in memory).

To keep the data as a dataset, we will use the Dataset.map() method: https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map. This also allows us some extra flexibility, if we need more preprocessing done than just tokenization. The map() method works by applying a function on each element of the dataset, so let's define a function that tokenizes our inputs:
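One way to write such a function is sketched below. Wrapping it in a factory (the hypothetical make_tokenize_function) is a choice made here so the snippet is self-contained. With a tokenizer already defined at module level, the inner function could be written on its own.

```python
def make_tokenize_function(tokenizer):
    """Return a function suitable for Dataset.map() that tokenizes sentence pairs."""

    def tokenize_function(example):
        # `example` may hold one pair (strings) or a batch (lists of strings).
        return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

    return tokenize_function
```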

This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys input_ids, attention_mask, and token_type_ids. Note that it also works if the example dictionary contains several samples (each key as a list of sentences) since the tokenizer works on lists of pairs of sentences, as seen before. This will allow us to use the option batched=True in our call to map(), which will greatly speed up the tokenization. The tokenizer is backed by a tokenizer written in Rust from the 🤗 Tokenizers (https://github.com/huggingface/tokenizers) library. This tokenizer can be very fast, but only if we give it lots of inputs at once.

Note that we've left the padding argument out of our tokenization function for now. This is because padding all the samples to the maximum length is not efficient: it's better to pad the samples when we're building a batch, as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset. This can save a lot of time and processing power when the inputs have very variable length!
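The saving can be estimated with a little arithmetic. The sketch below counts how many token slots are stored when padding to the dataset-wide maximum versus padding each batch to its own maximum. The function name and the sequence lengths are hypothetical, chosen only to illustrate the comparison.

```python
def padded_token_counts(lengths, batch_size):
    """Compare total tokens stored under two padding strategies.

    Returns (pad_to_dataset_max, pad_to_batch_max) for the given sequence
    lengths, batching them in their existing order.
    """
    global_total = max(lengths) * len(lengths)
    dynamic_total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        dynamic_total += max(batch) * len(batch)
    return global_total, dynamic_total


# Hypothetical sequence lengths with one unusually long sample.
counts = padded_token_counts([5, 7, 6, 40, 8, 9], batch_size=2)
print(counts)
```

A single long sample inflates every row under dataset-wide padding, but only its own batch under per-batch padding.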

We apply the tokenization function on all our datasets at once by calling map() on raw_datasets with our tokenization function. We're using batched=True in our call to map so the function is applied to multiple elements of our dataset at once, and not on each element separately. This allows for faster preprocessing.

The way the 🤗 Datasets library applies this processing is by adding new fields to the datasets, one for each key in the dictionary returned by the preprocessing function (here input_ids, attention_mask, and token_type_ids).
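To illustrate the idea of a batched map that adds new fields, here is a toy stand-in written in plain Python. It is not the library's implementation, which works on Arrow tables and caches results. The name toy_batched_map and its default batch size are assumptions made for this sketch.

```python
def toy_batched_map(rows, function, batch_size=1000):
    """Apply `function` to batches of rows and merge its output as new columns.

    `rows` is a list of dicts. Each batch is handed to `function` as a dict of
    lists, and the returned dict of lists is merged back row by row.
    """
    result = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into one dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)  # keep the original fields
            merged.update({key: values[i] for key, values in new_columns.items()})
            result.append(merged)
    return result
```

In the real library, the equivalent step is a call to map() on raw_datasets with the tokenization function and batched=True, which returns datasets that keep the original columns and gain the new ones.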