<a href="https://colab.research.google.com/github/ruthgn/HF/blob/main/05_Loading_and_Processing_Data_from_The_Hub.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates how to load a dataset from the Hugging Face Hub and how to process it properly. We will use the MRPC (Microsoft Research Paraphrase Corpus) dataset for our demonstration. This is one of the 10 datasets composing the GLUE benchmark, an academic benchmark used to measure the performance of ML models across 10 different text classification tasks.

In [1]:
!pip install datasets transformers[sentencepiece]



In [2]:
import torch
from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favor of the PyTorch one
from transformers import AutoTokenizer, AutoModelForSequenceClassification

sequences = ["This is a sentence.", "This is another sentence!"]

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# The "labels" key holds the target class of each sequence;
# when it is present, the model output also includes a loss
batch["labels"] = torch.tensor([1, 1])

# Training a sequence classifier on one batch
optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Of course, just training the model on two sentences is not going to yield very good results. To get better results, we'll need to prepare a bigger dataset.

In [3]:
batch

{'input_ids': tensor([[ 101, 2023, 2003, 1037, 6251, 1012,  102],
        [ 101, 2023, 2003, 2178, 6251,  999,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([1, 1])}
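
The `labels` entry in the batch above is what allows the model to return a loss. For a two-class setup like this one, that loss is a cross-entropy between the model's raw scores (logits) and the integer labels. The sketch below reproduces the computation in plain Python for intuition only; the logits are made-up numbers, not real model output, and `cross_entropy` is a helper written for this illustration.

```python
import math

def cross_entropy(logits, labels):
    """Mean cross-entropy between raw scores (logits) and integer class labels."""
    losses = []
    for scores, label in zip(logits, labels):
        # Log of the softmax normalizer, with the max subtracted for stability
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability assigned to the correct class
        losses.append(log_norm - scores[label])
    return sum(losses) / len(losses)

# Hypothetical logits for two sequences and two classes (not real model output)
logits = [[0.2, 1.3], [-0.5, 0.4]]
labels = [1, 1]  # same labels as in the batch above

loss = cross_entropy(logits, labels)
print(f"loss = {loss:.4f}")
```

The loss shrinks as the score of the labeled class grows relative to the other class, which is the direction the optimizer step pushes the weights.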

## Loading a dataset from the Hub

The Hub doesn't just contain models; it also hosts many datasets in many different languages. The HF Datasets library provides a very simple command to download a dataset from the Hub and cache it locally. We can download the MRPC dataset like this:

In [4]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

As you can see, we get a `DatasetDict` object which contains the training set, the validation set, and the test set. Each of those contains several columns (`sentence1`, `sentence2`, `label`, and `idx`) and a variable number of rows, which are the number of elements in each set (so, there are 3,668 pairs of sentences in the training set, 408 in the validation set, and 1,725 in the test set).

*Note*: This command downloads and caches the dataset, by default in *~/.cache/huggingface/datasets*. To customize your cache folder, set the `HF_HOME` environment variable.
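
As a minimal sketch of the note above, the environment variable can be set from Python, as long as this happens before `datasets` (or `transformers`) is imported. The folder name `my_hf_cache` is a hypothetical placeholder.

```python
import os

# Hypothetical cache location; replace it with a folder that suits your setup.
# This must run before `datasets` is imported for the setting to take effect.
os.environ["HF_HOME"] = os.path.join(os.path.expanduser("~"), "my_hf_cache")

# With HF_HOME set, datasets are expected to land in its "datasets" subfolder
expected_datasets_cache = os.path.join(os.environ["HF_HOME"], "datasets")
print(expected_datasets_cache)
```

In a notebook, the safest place for this is the very first cell, since restarting the runtime is otherwise needed for the change to be picked up.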

We can access each pair of sentences in our `raw_datasets` object by indexing, like with any dictionary:

In [5]:
raw_train_datasets = raw_datasets["train"]
raw_train_datasets[0]

{'idx': 0,
 'label': 1,
 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

The `label` column already contains integers, so there is no preprocessing needed there. To know which integer corresponds to which label, we can inspect the `features` of our `raw_train_datasets`. This tells us the type of each column, and it also shows the mapping of integers to label names.

In [6]:
raw_train_datasets.features

{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None)}
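
The integer-to-name mapping is simply the position of each name in the `names` list of the `ClassLabel` feature. The sketch below mimics that lookup in plain Python; `int_to_name` and `name_to_int` are helpers written for this illustration (the `ClassLabel` object itself offers `int2str()` and `str2int()` for the same purpose).

```python
# Label names copied from the `ClassLabel` feature shown above
label_names = ["not_equivalent", "equivalent"]

def int_to_name(label):
    """Return the name stored at position `label` in the names list."""
    return label_names[label]

def name_to_int(name):
    """Return the integer label that corresponds to `name`."""
    return label_names.index(name)

print(int_to_name(1))
print(name_to_int("not_equivalent"))
```

So a `label` of 1 marks a pair of sentences that are paraphrases of each other, and 0 marks a pair that are not.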

In [7]:
raw_train_datasets[14]

{'idx': 15,
 'label': 0,
 'sentence1': 'Gyorgy Heizler , head of the local disaster unit , said the coach was carrying 38 passengers .',
 'sentence2': 'The head of the local disaster unit , Gyorgy Heizler , said the coach driver had failed to heed red stop lights .'}

In [8]:
raw_train_datasets[14]["label"]

0

In [9]:
raw_valid_datasets = raw_datasets['validation']

raw_valid_datasets[86]

{'idx': 796,
 'label': 1,
 'sentence1': 'He was arrested Friday night at an Alpharetta seafood restaurant while dining with his wife , singer Whitney Houston .',
 'sentence2': 'He was arrested again Friday night at an Alpharetta restaurant where he was having dinner with his wife .'}

In [10]:
raw_valid_datasets[86]["label"]

1

## Preprocessing a Dataset

We cannot just pass two sequences to the model and get a prediction of whether the two sentences are paraphrases or not. We need to handle the two sequences as a pair and apply the appropriate preprocessing. Fortunately, the tokenizer can also take a pair of sequences and prepare it the way our BERT model expects:

In [11]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [12]:
raw_train_datasets[14]["sentence1"]

'Gyorgy Heizler , head of the local disaster unit , said the coach was carrying 38 passengers .'

In [13]:
inputs = tokenizer(raw_train_datasets[14]["sentence1"], raw_train_datasets[14]["sentence2"])
inputs

{'input_ids': [101, 1043, 7677, 22637, 2002, 10993, 3917, 1010, 2132, 1997, 1996, 2334, 7071, 3131, 1010, 2056, 1996, 2873, 2001, 4755, 4229, 5467, 1012, 102, 1996, 2132, 1997, 1996, 2334, 7071, 3131, 1010, 1043, 7677, 22637, 2002, 10993, 3917, 1010, 2056, 1996, 2873, 4062, 2018, 3478, 2000, 18235, 2094, 2417, 2644, 4597, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The parts of the input corresponding to `sentence1` all have a token type ID of 0, while the other parts, corresponding to `sentence2`, all have a token type ID of 1. So, when our goal is to model the relationship between pairs of sentences, `token_type_ids` indicate which tokens belong to the first sentence and which belong to the second.
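
To make the layout concrete, here is a small sketch that rebuilds the output of the first pair above by hand. It assumes the `[CLS] sentence1 [SEP] sentence2 [SEP]` arrangement that the IDs 101 and 102 in the outputs suggest; `encode_pair` is a helper written for this illustration, not part of the tokenizer API.

```python
CLS_ID, SEP_ID = 101, 102  # special token IDs visible in the outputs above

def encode_pair(ids_a, ids_b):
    """Join two token-ID lists as [CLS] A [SEP] B [SEP] and mark the segments."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # [CLS], the first sentence and its [SEP] get 0; the rest gets 1
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # No padding here, so every position is a real token
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

# Token IDs of the two example sentences, taken from the tokenizer output above
first = [2023, 2003, 1996, 2034, 6251, 1012]
second = [2023, 2003, 1996, 2117, 2028, 1012]

encoded = encode_pair(first, second)
print(encoded["token_type_ids"])
```

Note that the special tokens are counted too: the first segment covers `[CLS]` and the first `[SEP]`, and the second segment ends with the final `[SEP]`.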

*Note*: If we select a different checkpoint, we won't necessarily have the `token_type_ids` in our tokenized inputs (for instance, they're not returned if you use a DistilBERT model). They are only returned when the model knows what to do with them because it has seen them during pretraining. Here, BERT is pretrained with token type IDs because it also has an objective called *next sentence prediction*. The goal of this task is to model the relationship between pairs of sentences.

Now that we've seen how our tokenizer can deal with one pair of sentences, we can use it to tokenize our whole dataset. We can feed the tokenizer a list of pairs of sentences by giving it the list of first sentences, then the list of second sentences. This is also compatible with the padding and truncation options. So, one way to preprocess the training dataset is:

In [14]:
tokenized_dataset = tokenizer(
    raw_train_datasets["sentence1"],
    raw_train_datasets["sentence2"],
    padding=True,
    truncation=True
)
#tokenized_dataset

This works fine, but it has the disadvantage of returning a dictionary (with our keys, `input_ids`, `attention_mask`, and `token_type_ids`, and values that are lists of lists). It will also only work if you have enough RAM to store your whole dataset during the tokenization. Datasets from the HF Datasets library, by contrast, are Apache Arrow files stored on disk, so only the samples we explicitly ask for are loaded into memory.

To keep the data as a dataset, we will use the `Dataset.map()` method. This also allows us some extra flexibility, if we need more preprocessing done than just tokenization. The `map()` method works by applying a function on each element of the dataset, so let's define a function that tokenizes our inputs:

In [15]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`. Note that it also works if the `example` dictionary contains several samples (each key holding a list of sentences), since the `tokenizer` works on lists of pairs of sentences. This will allow us to use the option `batched=True` in our call to `map()`, which will greatly speed up the tokenization.

*Note*: The `tokenizer` is backed by a tokenizer written in Rust from the HF Tokenizers library. This tokenizer can be very fast, but only if we give it lots of inputs at once.

Note that we've left the `padding` argument out of our tokenization function for now. This is because padding all the samples to the maximum length is NOT efficient. It's better to pad the samples when we're building a batch, as then we only need to pad to the maximum length in that batch, and not to the maximum length in the entire dataset. This can save a lot of time and processing power when the inputs have very variable lengths!
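
A quick back-of-the-envelope sketch shows where the savings come from. The sequence lengths below are hypothetical and chosen only to include one unusually long sample; the two helper functions are written for this illustration.

```python
def padded_size_fixed(lengths):
    """Total token slots when every sample is padded to the dataset maximum."""
    return len(lengths) * max(lengths)

def padded_size_dynamic(lengths, batch_size):
    """Total token slots when each batch is padded to its own maximum."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += len(batch) * max(batch)
    return total

# Hypothetical sequence lengths, with a single long outlier
lengths = [12, 15, 9, 80, 14, 11, 10, 13]

fixed = padded_size_fixed(lengths)
dynamic = padded_size_dynamic(lengths, batch_size=4)
print(f"real tokens: {sum(lengths)}, fixed padding: {fixed}, per-batch padding: {dynamic}")
```

Only the batch that contains the outlier pays for its length; with padding to the dataset maximum, every batch would.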

Here is how we apply the tokenization function on all our datasets at once. We're using `batched=True` in our call to `map()` so the function is applied to multiple elements of our dataset at once, and not on each element separately. This allows for faster preprocessing.

In [16]:
tokenized_dataset = raw_datasets.map(tokenize_function, batched=True)
tokenized_dataset

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-2e774d0bb149ce50.arrow


  0%|          | 0/1 [00:00<?, ?ba/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-7621ce8cb606eecc.arrow


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

The way the HF Datasets library applies this processing is by adding new fields to each split of the dataset (train, validation, and test), one for each key in the dictionary returned by the preprocessing function.

Our `tokenize_function` returns a dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`, so those three fields are added to all splits of our dataset. Note that we could also have changed existing fields if our preprocessing function (`tokenize_function`) returned a new value for an existing key in the dataset to which we applied `map()`.
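
The add-or-override behavior described above can be sketched with plain dictionaries of columns. This is a simplified stand-in written for illustration, not how the library is implemented: `map_batched`, `count_words`, and `lowercase_first` are hypothetical helpers, and the word counter merely plays the role of a tokenizer.

```python
def map_batched(columns, function, batch_size=2):
    """Apply `function` to slices of a dict of columns and merge the results."""
    num_rows = len(next(iter(columns.values())))
    result = {name: [] for name in columns}
    for start in range(0, num_rows, batch_size):
        # Each batch is a dict of lists, one list per column
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        output = function(batch)
        # Returned keys are added as new columns or override existing ones
        merged = {**batch, **output}
        for name, values in merged.items():
            result.setdefault(name, []).extend(values)
    return result

def count_words(batch):
    """Toy stand-in for a tokenizer: returns one NEW column."""
    pairs = zip(batch["sentence1"], batch["sentence2"])
    return {"num_words": [len(a.split()) + len(b.split()) for a, b in pairs]}

def lowercase_first(batch):
    """Returns a value for an EXISTING key, so that column is replaced."""
    return {"sentence1": [s.lower() for s in batch["sentence1"]]}

columns = {"sentence1": ["A b c", "D e", "F"], "sentence2": ["g h", "i", "j k l"]}

with_counts = map_batched(columns, count_words)
lowered = map_batched(columns, lowercase_first)
print(list(with_counts))
print(lowered["sentence1"])
```

The function receives lists rather than single values, which is why a preprocessing function used with `batched=True` must be able to handle several samples at once.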

The last thing we will need to do is pad all the examples to the length of the longest element when we batch elements together, a technique we refer to as *dynamic padding*.

## Dynamic Padding

*Note*: We have deliberately postponed the padding, to only apply it as necessary on each batch and avoid having over-long inputs with a lot of padding. This will speed up training by quite a bit, but note that if you're training on a TPU it can cause problems: TPUs prefer fixed shapes, even when that requires extra padding.

The function that puts samples together into a batch is called a *collate function*. The HF Transformers library provides us with such a function via `DataCollatorWithPadding`. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything you need:

In [19]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [20]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

Let's test this out by grabbing a few samples from our training set that we would like to batch together. Here, we remove the columns `idx`, `sentence1`, and `sentence2`, as they won't be needed and the sentence columns contain strings (we can't create tensors with strings, and all of the information from our sentences is already represented by the tokens in `input_ids`). We then have a look at the lengths of each entry in the batch:

In [34]:
# Select the first 8 rows in the training set as samples
samples = tokenized_dataset["train"][:8]
samples.keys() # Show dictionary keys (show column labels)

dict_keys(['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'])

In [37]:
# Remove `idx`, `sentence1`, and `sentence2` from dictionary
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
samples.keys()

dict_keys(['label', 'input_ids', 'token_type_ids', 'attention_mask'])

In [41]:
# Check out the length of each entry in our batch (of 8 samples)
# Remember: Each entry/"row" of `input_ids` represents a sentence pair
[len(entry) for entry in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

Unsurprisingly, we get samples of varying length, from 32 to 67. Dynamic padding means the samples in this batch should all be padded to a length of 67, the maximum length inside this particular batch.

Let's double-check that our `data_collator` is dynamically padding the batch properly:

In [49]:
# Batch samples together with `data_collator`,
# which applies the appropriate padding
# and converts them to tensors
batch = data_collator(samples)

# Check the shape of each entry after
# it has been passed through `data_collator`
{k: v.shape for k, v in batch.items()}

{'attention_mask': torch.Size([8, 67]),
 'input_ids': torch.Size([8, 67]),
 'labels': torch.Size([8]),
 'token_type_ids': torch.Size([8, 67])}

Fantastic! It looks like we have created a proper batch that our model can deal with.
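
For intuition, the padding step can be sketched in plain Python. This is a simplified stand-in, not the library's implementation: it assumes right-side padding and a pad token ID of 0 (which matches `bert-base-uncased`), returns lists rather than tensors, and uses placeholder token IDs and hypothetical labels. Only the sequence lengths are taken from the batch above.

```python
def pad_batch(samples, pad_token_id=0):
    """Right-pad every sequence column to the longest `input_ids` in the batch."""
    max_len = max(len(ids) for ids in samples["input_ids"])
    # Fill value per column; 0 in the attention mask marks a padded position
    fill = {"input_ids": pad_token_id, "token_type_ids": 0, "attention_mask": 0}
    batch = {}
    for name, values in samples.items():
        if name in fill:
            batch[name] = [seq + [fill[name]] * (max_len - len(seq)) for seq in values]
        else:
            # Non-sequence columns are copied; `label` is renamed to `labels`
            batch["labels" if name == "label" else name] = list(values)
    return batch

lengths = [50, 59, 47, 67, 59, 50, 62, 32]  # lengths observed in the batch above
toy_samples = {
    "label": [1, 0, 1, 0, 1, 1, 0, 1],             # hypothetical labels
    "input_ids": [[7] * n for n in lengths],       # placeholder token IDs
    "token_type_ids": [[0] * n for n in lengths],
    "attention_mask": [[1] * n for n in lengths],
}

padded = pad_batch(toy_samples)
print({name: (len(rows), len(rows[0])) for name, rows in padded.items() if name != "labels"})
```

The attention mask is what lets the model ignore the added positions: each row keeps as many 1s as the sample had real tokens, followed by 0s up to the batch maximum.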