# A Gentle Intro to `huggingface` and `transformers`

By: Dr. Jie Tao

ver: 0.1

Transformers, one of the most influential recent developments in deep learning, have been widely applied in CV, NLP, and other (multi-)modal domains. The idea is that a large model (up to hundreds of billions of parameters, e.g., GPT-3 with 175 billion) is pre-trained on a large corpus and can then be reused in downstream tasks (classification, generation, etc.). Remember how we used `VGG` or `ResNet` in CV? This is a similar idea.

Huggingface is an API/wrapper that makes using transformers much easier. An analogy would be if you consider the original transformers to be like `tensorflow`, then `huggingface` is like `keras` to make your life easier.

Some notable characteristics regarding `huggingface` include:

- **NLP tasks**: Transformers can be used for a wide range of NLP tasks, including text classification, sentiment analysis, language translation, and question answering.
- **Pre-trained models**: Transformers provides access to a wide range of pre-trained language models, including `BERT`, `GPT-2`, and `RoBERTa`, which can be fine-tuned for specific NLP tasks.
- **Easy-to-use API**: Hugging Face provides an easy-to-use API that allows developers to quickly integrate Transformers into their NLP projects.
- **Community-driven development**: Hugging Face and Transformers are community-driven projects, which means that anyone can contribute to the development and improvement of the libraries.
- **Model deployment**: Hugging Face also provides a model serving platform, called "Hugging Face Hub," which allows users to deploy their pre-trained models to the cloud and share them with others.

__NOTE__: `huggingface` supports both `tensorflow` and `torch`, but has native support for `torch`. Since we already know `torch`, this tutorial is built on it.

## Import Dependencies

In [None]:
!pip install -U transformers accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting accelerate
  Downloading accelerate-0.20.3-py3-none-any.whl (227 kB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)

In [None]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():

    # Tell PyTorch to use the GPU.
    device = torch.device("cuda:0") ## `cuda:0` is the first GPU; if you have more than one, pick another with e.g. `cuda:1`

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


## Fine tuning Explained

The stock model may not suit your specific purpose, so we sometimes need to __fine-tune__ it on data from the task domain.

In this section we will use as an example the MRPC (Microsoft Research Paraphrase Corpus) dataset, introduced in a paper by William B. Dolan and Chris Brockett. The dataset consists of `5,801` pairs of sentences, with a label indicating if they are **paraphrases or not** (i.e., if both sentences mean the same thing).

The 🤗 Datasets library provides a very simple command to download and cache a dataset on the Hub. We can download the MRPC dataset like this:

In [None]:
# we have to install first
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.14-py310-none-any.whl (134 kB)

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets



  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

As you can see, we get a DatasetDict object which contains the training set, the validation set, and the test set. Each of those contains several columns (`sentence1`, `sentence2`, `label`, and `idx`) and a variable number of rows, which are the number of elements in each set (so, there are `3,668` pairs of sentences in the training set, `408` in the validation set, and `1,725` in the test set).

We can access each pair of sentences in our raw_datasets object by indexing, like with a dictionary:


In [None]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

We can see the labels are already integers, so we won’t have to do any preprocessing there. To know which integer corresponds to which label, we can inspect the features of our `raw_train_dataset`. This will tell us the type of each column:

In [None]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

We can then use the `tokenizer` on these data.

In [None]:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Downloading (…)okenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt: 0.00B [00:00, ?B/s]

In [None]:
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

Now that we know how the `tokenizer` works, we can feed it pairs of sentences by passing the list of first sentences, followed by the list of second sentences.

In [None]:
MAX_LEN = 64 ## truncate all sequences to at most 64 tokens

tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True, ## pad to the longest sequence (use padding="max_length" to always pad to MAX_LEN)
    truncation=True,
    max_length=MAX_LEN,
)

This works well, but it has the disadvantage of returning a dictionary (with the keys `input_ids` and `attention_mask`, and values that are lists of lists; BERT-style checkpoints would also return `token_type_ids`, but DistilBERT does not use them). It will also only work if you have enough RAM to store your whole dataset during the tokenization.
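
To make the layout of a sentence-pair encoding concrete, here is a toy sketch. It is not the real WordPiece tokenizer: the whitespace splitting, the `toy_encode_pair` helper, and the word ids starting at 1000 are all made up for illustration. Only the special-token ids (`101` for `[CLS]`, `102` for `[SEP]`) match what the real tokenizer prints later in this notebook.

```python
# Toy illustration only: mimics the *layout* of a BERT-style pair encoding,
# not the real subword tokenization.
CLS, SEP = 101, 102  # special-token ids of the uncased BERT vocabulary


def toy_encode_pair(first, second, vocab):
    """Lay out a sentence pair as [CLS] first [SEP] second [SEP]."""

    def ids_for(sentence):
        # Give every unseen lowercase word a new made-up id (1000, 1001, ...).
        return [vocab.setdefault(word, 1000 + len(vocab))
                for word in sentence.lower().split()]

    input_ids = [CLS] + ids_for(first) + [SEP] + ids_for(second) + [SEP]
    # Without padding, every position is a real token, so the mask is all ones.
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


vocab = {}
encoded = toy_encode_pair("the cat sat", "a cat was sitting", vocab)
print(encoded["input_ids"])
# -> [101, 1000, 1001, 1002, 102, 1003, 1001, 1004, 1005, 102]
```

Both sentences end up in a single sequence separated by `[SEP]`, which is why one call to the tokenizer with two lists is enough for paired inputs.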

To keep the data as a dataset, we will use the `Dataset.map()` method. This also allows us some extra flexibility, if we need more preprocessing done than just tokenization. The `map()` method works by applying a function on each element of the dataset, so let’s define a function that tokenizes our inputs:

__NOTE__: remember the `map` method from `Pandas`? This is very similar!

In [None]:
def tokenize_function(example):
    """function to tokenize"""
    return tokenizer(example["sentence1"],
                     example["sentence2"],
                     truncation=True,
                     # padding=True, ## we don't do padding here since we need dynamic padding
                     max_length=MAX_LEN)

In [None]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 1725
    })
})

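We can tell that the dataset structure is retained: the three splits are still there, and `map()` added the new `input_ids` and `attention_mask` columns.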
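
With `batched=True`, the function passed to `map()` receives a dictionary of lists (one list per column) rather than a single row. The sketch below imitates that behavior in plain Python so you can see the data flow. `toy_map` and `count_words` are hypothetical helpers written for this illustration; they are not part of 🤗 Datasets, and the real implementation works on Arrow tables and is far more efficient.

```python
def toy_map(rows, fn, batch_size=1000):
    """Apply fn to batches (dicts of lists) and merge the new columns into each row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into one dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        for i, row in enumerate(chunk):
            # Keep the original columns and add the i-th value of each new column.
            out.append({**row, **{key: values[i] for key, values in new_columns.items()}})
    return out


def count_words(batch):
    """Stand-in for tokenize_function: returns one new column for the whole batch."""
    return {"n_words": [len(a.split()) + len(b.split())
                        for a, b in zip(batch["sentence1"], batch["sentence2"])]}


rows = [
    {"sentence1": "a b c", "sentence2": "d e", "label": 1},
    {"sentence1": "f", "sentence2": "g h", "label": 0},
    {"sentence1": "i j", "sentence2": "k l", "label": 1},
]
mapped = toy_map(rows, count_words, batch_size=2)
print([row["n_words"] for row in mapped])  # -> [5, 3, 4]
```

The original columns survive and the returned keys become new columns, which is the same pattern we saw when `input_ids` and `attention_mask` were added to each split.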

### Dynamic Padding

The function that is responsible for putting together samples inside a batch is called a collate function. It’s an argument you can pass when you build a `DataLoader`, the default being a function that will just convert your samples to PyTorch tensors and concatenate them (recursively if your elements are lists, tuples, or dictionaries). This won’t be possible in our case since the inputs we have won’t all be of the same size.

To do this in practice, we have to define a collate function that will apply the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the 🤗 Transformers library provides us with such a function via `DataCollatorWithPadding`. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything you need:

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

To test this new toy, let’s grab a few samples from our training set that we would like to batch together. Here, we remove the columns idx, sentence1, and sentence2 as they won’t be needed and contain strings (and we can’t create tensors with strings) and have a look at the lengths of each entry in the batch:

In [None]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 64, 59, 50, 62, 32]

In [None]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': torch.Size([8, 64]),
 'attention_mask': torch.Size([8, 64]),
 'labels': torch.Size([8])}

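__PRO-TIP__: dynamic padding is usually preferred on CPU and GPU, since padding each batch only to its longest sequence saves memory and computation. (On TPUs, fixed shapes are generally preferable.)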
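
To see what a padding collate function does under the hood, here is a minimal pure-Python sketch. `pad_collate` is a hypothetical stand-in, not the `DataCollatorWithPadding` implementation: it returns plain lists instead of tensors, always pads on the right, and assumes a padding id of `0`. The dummy samples reuse the eight sequence lengths printed above.

```python
def pad_collate(samples, pad_id=0):
    """Pad every sample to the length of the longest sample in this batch."""
    longest = max(len(s["input_ids"]) for s in samples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for s in samples:
        n_pad = longest - len(s["input_ids"])
        batch["input_ids"].append(s["input_ids"] + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * len(s["input_ids"]) + [0] * n_pad)
        batch["labels"].append(s["label"])
    return batch


lengths = [50, 59, 47, 64, 59, 50, 62, 32]  # lengths printed in the cell above
samples = [{"input_ids": [7] * n, "label": 0} for n in lengths]  # dummy token ids

full = pad_collate(samples)              # one batch of 8
second_half = pad_collate(samples[4:])   # a batch of 4

print(len(full["input_ids"]), len(full["input_ids"][0]))                # -> 8 64
print(len(second_half["input_ids"]), len(second_half["input_ids"][0]))  # -> 4 62
```

The batch of eight is padded to 64 (its longest sample), matching the shapes reported above, while the last four samples alone only need to be padded to 62. That difference is the saving dynamic padding gives you.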

### Training
The first step before we can define our `Trainer` is to create a `TrainingArguments` object that will contain all the hyperparameters the `Trainer` will use for training and evaluation. The only argument you have to provide is a directory where the trained model will be saved, as well as the checkpoints along the way. For all the rest, you can leave the defaults, which should work pretty well for a basic fine-tuning.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
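    output_dir="./test_transfomer/", ## where checkpoints and the final model are saved
    logging_dir='./test_transfomer/logs', ## where training logs are saved
    logging_strategy='epoch', ## log once per epoch; for big models/datasets, log by 'steps' instead (default: every 500 steps)
    # logging_steps=100,
    num_train_epochs=3, ## fine-tuning usually needs only a few epochs (here, no more than 4)
    per_device_train_batch_size=4,  ## this and the one below depend on the model size and the available memory
    per_device_eval_batch_size=4,
    learning_rate=5e-6,
    # seed=42,
    save_strategy='epoch', ## save a checkpoint at the end of every epoch
    save_steps=100, ## only used when save_strategy='steps'
    evaluation_strategy='epoch', ## evaluate at the end of every epoch (must match save_strategy for load_best_model_at_end)
    eval_steps=100, ## only used when evaluation_strategy='steps'
    load_best_model_at_end=True ## reload the best checkpoint (lowest validation loss by default) after training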
)

For a detailed explanation of above arguments, refer to [this article](https://medium.com/grabngoinfo/transfer-learning-for-text-classification-using-hugging-face-transformers-trainer-13407187cf89).

In [None]:
from transformers import AutoModelForSequenceClassification
## binary classification problem
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Once we have our model, we can define a Trainer by passing it all the objects constructed up to now — the `model`, the `training_args`, the training and validation datasets, our `data_collator`, and our `tokenizer`:

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Note that when you pass the tokenizer as we did here, the default `data_collator` used by the Trainer will be a DataCollatorWithPadding as defined previously, so you can skip the line `data_collator=data_collator` in this call.

To fine-tune the model on our dataset, we just have to call the `train()` method of our `Trainer`:

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.4571 | 0.552454 |
| 2 | 0.4575 | 0.552454 |
| 3 | 0.4613 | 0.552454 |


TrainOutput(global_step=2751, training_loss=0.4586379211714206, metrics={'train_runtime': 183.9771, 'train_samples_per_second': 59.812, 'train_steps_per_second': 14.953, 'total_flos': 175621758472848.0, 'train_loss': 0.4586379211714206, 'epoch': 3.0})
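
The `global_step=2751` reported above can be reproduced by hand from the dataset size and the training arguments. The short calculation below assumes a single GPU (as detected earlier) and the default of no gradient accumulation. The variable names are chosen for this illustration only.

```python
import math

train_rows = 3668   # rows in the MRPC training split
batch_size = 4      # per_device_train_batch_size
num_devices = 1     # one GPU (assumption based on the device check above)
num_epochs = 3      # num_train_epochs

# One optimizer step per batch; the last batch of an epoch may be smaller.
steps_per_epoch = math.ceil(train_rows / (batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs

# With save_strategy='epoch', a checkpoint is written at the end of each epoch
# into a folder named checkpoint-<global_step>.
checkpoints = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]

print(steps_per_epoch, total_steps)  # -> 917 2751
print(checkpoints)                   # -> [917, 1834, 2751]
```

This is also where the checkpoint folder names in `output_dir` come from: `checkpoint-1834` is simply the state after the second epoch.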

So far, `Trainer` has only reported the loss, because we did not give it any evaluation metric. To report other metrics during training and evaluation, we can pass a `compute_metrics()` function to the `Trainer` (through its `compute_metrics` argument).

#### DO IT YOURSELF

Using your Google and ChatGPT skills, can you come up with a `compute_metrics()` function that __does not use__ `accuracy`? Your options include:
- AUC
- F1
- recall
- precision


### Evaluation

I am sure you made some progress with the `compute_metrics()` function. Let's look at the evaluation step by step.

To get some predictions from our model, we can use the `Trainer.predict()` method:



In [None]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


The output of the predict() method is another named tuple with three fields: `predictions`, `label_ids`, and `metrics`. The `metrics` field will just contain the loss on the dataset passed, as well as some time metrics (how long it took to predict, in total and on average).

As you can see, `predictions` is a two-dimensional array with shape $408 \times 2$ (`408` being the size of the validation set we passed to `predict()`). Those are the logits for each element of that dataset. To transform them into predictions that we can compare to our labels, we need to take the index with the maximum value on the second axis:

In [None]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
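
Taking `argmax` directly on the logits is enough for class predictions, because softmax preserves the ordering of the values. If you need probabilities instead (for example, to compute AUC), you can apply softmax first. The sketch below uses made-up toy logits rather than the model's real output, and `softmax` is a small helper written for this illustration.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row maximum keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


toy_logits = np.array([[2.0, 0.0], [-1.0, 1.0], [0.3, 0.3]])  # made-up values
probs = softmax(toy_logits)

print(probs.sum(axis=-1))  # every row sums to 1
# The predicted class is the same whether we use logits or probabilities.
print(np.array_equal(np.argmax(toy_logits, axis=-1), np.argmax(probs, axis=-1)))  # -> True
```

For a binary task like MRPC, `probs[:, 1]` would be the score of the `equivalent` class.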

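To complete the evaluation, we need to compare these predictions with the _true_ labels. Below we first look at a few predictions next to their labels and print a `scikit-learn` classification report. After that, we load the metrics associated with the MRPC dataset from the 🤗 `Evaluate` library, which is as easy as loading the dataset was. The object returned has a `compute()` method we can use to do the metric calculation: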


In [None]:
preds[:5] # y_pred

In [None]:
tokenized_datasets["validation"]['label'][:5]

In [None]:
lbs = tokenized_datasets["validation"].features['label'].names
lbs

['not_equivalent', 'equivalent']

In [None]:
from sklearn.metrics import classification_report

print(classification_report(tokenized_datasets["validation"]['label'], preds, target_names=lbs, digits=4))

                precision    recall  f1-score   support

not_equivalent     0.7333    0.3411    0.4656       129
    equivalent     0.7557    0.9427    0.8389       279

      accuracy                         0.7525       408
     macro avg     0.7445    0.6419    0.6523       408
  weighted avg     0.7487    0.7525    0.7209       408
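
To see how the rows of this report relate to each other, we can redo the arithmetic by hand. The counts below are back-calculated from the precision, recall, and support columns of the report. They are an inference made for this illustration, not an output of the notebook. `prf` is a small helper defined here.

```python
# Counts inferred from the report above (not produced by the notebook itself).
tp_neg, predicted_neg, support_neg = 44, 60, 129     # class "not_equivalent"
tp_pos, predicted_pos, support_pos = 263, 348, 279   # class "equivalent"


def prf(tp, predicted, support):
    """Precision, recall and F1 for one class from its counts."""
    precision = tp / predicted   # share of predictions for this class that are right
    recall = tp / support        # share of true members of this class that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p_neg, r_neg, f_neg = prf(tp_neg, predicted_neg, support_neg)
p_pos, r_pos, f_pos = prf(tp_pos, predicted_pos, support_pos)

total = support_neg + support_pos
accuracy = (tp_neg + tp_pos) / total
macro_f1 = (f_neg + f_pos) / 2                                   # plain average
weighted_f1 = (support_neg * f_neg + support_pos * f_pos) / total  # weighted by support

print(round(p_neg, 4), round(r_neg, 4), round(f_neg, 4))  # -> 0.7333 0.3411 0.4656
print(round(p_pos, 4), round(r_pos, 4), round(f_pos, 4))  # -> 0.7557 0.9427 0.8389
print(round(accuracy, 4), round(macro_f1, 4), round(weighted_f1, 4))  # -> 0.7525 0.6523 0.7209
```

The low recall on `not_equivalent` shows why accuracy alone can be misleading here: the model finds most paraphrases but misses about two thirds of the non-paraphrases.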



In [None]:
## install first
!pip install -U evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.0 responses-0.18.0


In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

Downloading builder script: 0.00B [00:00, ?B/s]

{'accuracy': 0.7524509803921569, 'f1': 0.8389154704944178}

To put things together, we can wrap these steps in a `compute_metrics()` function and pass it to our `Trainer` through its `compute_metrics` argument.
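
A minimal sketch of such a function is shown here. It is one possible version, not necessarily the one the author had in mind: it assumes binary labels with `1` as the positive class, and it computes the metrics with plain NumPy (instead of `evaluate` or `scikit-learn`) so that it runs without downloading anything. The toy logits and labels are made up.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Precision, recall and F1 of the positive class (label 1), without accuracy."""
    logits, labels = eval_pred            # Trainer hands over (logits, labels)
    preds = np.argmax(logits, axis=-1)    # logits -> predicted class ids
    labels = np.asarray(labels)

    tp = int(np.sum((preds == 1) & (labels == 1)))  # true positives
    fp = int(np.sum((preds == 1) & (labels == 0)))  # false positives
    fn = int(np.sum((preds == 0) & (labels == 1)))  # false negatives

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


toy_logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
toy_labels = np.array([1, 0, 0, 1])
toy_result = compute_metrics((toy_logits, toy_labels))
print(toy_result)  # -> {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

To use it, add `compute_metrics=compute_metrics` when constructing the `Trainer`. The returned values are then reported at every evaluation alongside the loss.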

### Inference on the fine-tuned model, saving and loading

With the model fine-tuned, we can use it to see how it's doing:

In [None]:
test_example1 = {
  "sentence1": "They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .",
  "sentence2": "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .",
  "label":1 ## true_label: 1 (paraphrase)
}

To tokenize a single instance, we need to modify our `tokenize_function` a bit.

In [None]:
def test_tokenize_function(example):
    """function to tokenize single instance"""
    return tokenizer(example["sentence1"],
                     example["sentence2"],
                     truncation=True,
                     padding="max_length", ## need to pad since no data_collator here
                     max_length = MAX_LEN,
                     return_tensors="pt").to(device) ## return torch.Tensors and move them to `device` (the GPU, if available)

In [None]:
test_input = test_tokenize_function(test_example1)

In [None]:
test_input

{'input_ids': tensor([[  101,  2027,  2018,  2405,  2019, 15147,  2006,  1996,  4274,  2006,
          2238,  2184,  1010,  5378,  1996,  6636,  2005,  5096,  1010,  2002,
          2794,  1012,   102,  2006,  2238,  2184,  1010,  1996,  2911,  1005,
          1055,  5608,  2018,  2405,  2019, 15147,  2006,  1996,  4274,  1010,
          5378,  1996, 14792,  2005,  5096,  1012,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0')}

In [None]:
test_input["input_ids"].size()

torch.Size([1, 64])

In [None]:
torch.argmax(model(**test_input).logits.cpu())

tensor(1)

So our model correctly predicted that the two sentences are paraphrases, which agrees with the ground truth.

#### Save and Load

After fine-tuning, you might want to save your `model` for future use. You can do so with:
```python
model.save_pretrained("path/to/model")
```
However, we already told `Trainer` to save models for us. If you look at the file browser, you will see that `Trainer` saved the model state at each epoch (because we asked it to save and evaluate at each epoch).

In this run, the best model is the one saved at the end of epoch 2 (`checkpoint-1834`). Because we specified `load_best_model_at_end=True`, `Trainer` reloads the best checkpoint when training finishes; you can confirm its path with `trainer.state.best_model_checkpoint`.

If we want to use it later, we can do:
```python
from transformers import AutoModelForSequenceClassification
ft_chkpt = "/content/test_transfomer/checkpoint-1834"
model = AutoModelForSequenceClassification.from_pretrained(ft_chkpt, num_labels=2)
```

__PRO-TIP__:
1. It is good practice to also save the `tokenizer`, particularly if you changed it (e.g., added tokens).
2. Consider enabling `push_to_hub=True` in `TrainingArguments` so your model is pushed to the Huggingface Hub. You can then share it with your colleagues, or let the public test it.
3. If you are on Colab, make sure to save your model to Google Drive; otherwise you will lose it when the runtime disconnects.

## Further Reading

The following Huggingface tutorials might be useful:
1. [The Huggingface Datasets library](https://huggingface.co/learn/nlp-course/chapter5/1?fw=pt)
2. [The Huggingface Tokenizers library](https://huggingface.co/learn/nlp-course/chapter6/1?fw=pt)
3. [Main NLP Tasks](https://huggingface.co/learn/nlp-course/chapter7/1?fw=pt)

## Homework

[This Medium article](https://medium.com/grabngoinfo/transfer-learning-for-text-classification-using-hugging-face-transformers-trainer-13407187cf89) showcased a sentiment analysis project using Huggingface in `tensorflow`. Please implement it in `PyTorch` using the content of this tutorial.

__NOTE__: please do so in a separate notebook. It is easier for you and for me.