Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt).

# TRANSFORMERS

In this notebook we will explore [Hugging Face Transformers](https://huggingface.co/docs/transformers/index).
You may also want to check the [Hugging Face course](https://huggingface.co/course/), which explains how to use this technology in much greater depth.

Training transformer models is computationally expensive. Hugging Face makes available several pretrained [models](https://huggingface.co/models) that can be used as is or fine-tuned for a specific NLP task, such as sentence classification. Fine-tuning for classification is what we'll do in this notebook.

Hugging Face also makes available several [datasets](https://huggingface.co/datasets) that can be used to train or fine-tune a model.

## Loading a dataset

In this notebook, we'll start by using a local dataset (instead of using a dataset stored at Hugging Face).
Let's load data for our classification task.

In [4]:
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('data/restaurant_reviews.tsv', delimiter = '\t', quoting = 3)

dataset.rename(columns={'Liked':'label'}, inplace = True) # shouldn't need this if label_names could be used in TrainingArguments...

dataset.head()

| | Review | label |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


For ease of use with Transformer models, we convert the dataset into a Hugging Face dataset and split it into train, validation and test sets.

In [5]:
from datasets import Dataset

dataset_hf = Dataset.from_pandas(dataset)

In [6]:
from datasets import DatasetDict

# 90% train, 10% test+validation
train_test = dataset_hf.train_test_split(test_size=0.1)

# Split the 10% test+validation set in half test, half validation
valid_test = train_test['test'].train_test_split(test_size=0.5)

# gather the three splits into a single DatasetDict
train_valid_test_dataset = DatasetDict({
    'train': train_test['train'],
    'validation': valid_test['train'],
    'test': valid_test['test']
})

In [7]:
train_valid_test_dataset

DatasetDict({
    train: Dataset({
        features: ['Review', 'label'],
        num_rows: 900
    })
    validation: Dataset({
        features: ['Review', 'label'],
        num_rows: 50
    })
    test: Dataset({
        features: ['Review', 'label'],
        num_rows: 50
    })
})
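The two-stage split above can be mimicked in plain Python to make the resulting proportions explicit. This is only an illustrative sketch: the helper `split_indices` and the fixed seed are assumptions made for the example, not part of the `datasets` API, and `train_test_split` shuffles with its own random generator.

```python
import random

def split_indices(n_rows, test_size, rng):
    """Shuffle row indices and cut off a `test_size` fraction (hypothetical helper)."""
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

rng = random.Random(0)  # fixed seed, chosen here only for repeatability

# Stage 1: 90% train, 10% held out for test+validation
train_idx, held_out = split_indices(1000, 0.1, rng)

# Stage 2: split the held-out rows in half (validation / test)
n_valid = len(held_out) // 2
valid_idx, test_idx = held_out[:n_valid], held_out[n_valid:]

print(len(train_idx), len(valid_idx), len(test_idx))
```

Since the split is random, the example at a given index (such as 321 later on) will change between runs. Passing a `seed` argument to `train_test_split` should make the split repeatable.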

## Fine-tuning a pretrained model

As a starting example, we'll use a lighter BERT-based model. We will need to load:
- the [tokenizer](https://huggingface.co/docs/transformers/autoclass_tutorial#autotokenizer) (which is used to [preprocess](https://huggingface.co/docs/transformers/preprocessing) the data before it can be used by the model)
- the [model](https://huggingface.co/docs/transformers/autoclass_tutorial#automodel) itself

In [8]:
model_name = "distilbert-base-uncased"

### Tokenizer

We first load the tokenizer for our model:

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

Now we need to [preprocess](https://huggingface.co/docs/transformers/preprocessing) our data. We will do it for the three partitions (train, validation and test) in a single step. For that, we'll make use of [map](https://huggingface.co/docs/datasets/process#map) with the help of an auxiliary function.

In [10]:
def preprocess_function(sample):
    return tokenizer(sample["Review"], truncation=True)

In [11]:
tokenized_dataset = train_valid_test_dataset.map(preprocess_function, batched=True)

In [12]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['Review', 'label', 'input_ids', 'attention_mask'],
        num_rows: 900
    })
    validation: Dataset({
        features: ['Review', 'label', 'input_ids', 'attention_mask'],
        num_rows: 50
    })
    test: Dataset({
        features: ['Review', 'label', 'input_ids', 'attention_mask'],
        num_rows: 50
    })
})

When preprocessing the text, we have actually translated the text into numbers, which is known as [encoding](https://huggingface.co/course/chapter2/4?fw=pt#encoding).

In [13]:
tokenized_dataset['train'][321]

{'Review': 'My breakfast was perpared great, with a beautiful presentation of 3 giant slices of Toast, lightly dusted with powdered sugar.',
 'label': 1,
 'input_ids': [101,
  2026,
  6350,
  2001,
  2566,
  19362,
  2098,
  2307,
  1010,
  2007,
  1037,
  3376,
  8312,
  1997,
  1017,
  5016,
  25609,
  1997,
  15174,
  1010,
  8217,
  6497,
  2098,
  2007,
  9898,
  2098,
  5699,
  1012,
  102],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1]}

Encoding is done in a two-step process: tokenization, followed by conversion to input IDs.

In [14]:
tokens = tokenizer.tokenize(tokenized_dataset['train'][321]['Review'])
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

['my', 'breakfast', 'was', 'per', '##par', '##ed', 'great', ',', 'with', 'a', 'beautiful', 'presentation', 'of', '3', 'giant', 'slices', 'of', 'toast', ',', 'lightly', 'dust', '##ed', 'with', 'powder', '##ed', 'sugar', '.']
[2026, 6350, 2001, 2566, 19362, 2098, 2307, 1010, 2007, 1037, 3376, 8312, 1997, 1017, 5016, 25609, 1997, 15174, 1010, 8217, 6497, 2098, 2007, 9898, 2098, 5699, 1012]
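To make the two steps more concrete, here is a minimal sketch of greedy longest-match WordPiece splitting followed by an ID lookup. It is a simplification under stated assumptions: `wordpiece` is a hypothetical helper, `toy_vocab` holds only a handful of entries (reusing the IDs printed above, plus 100 for `[UNK]`), and the real tokenizer also lowercases the text and separates punctuation before splitting words.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary (an assumption for this sketch; the real one has tens of thousands of entries)
toy_vocab = {"[UNK]": 100, "my": 2026, "breakfast": 6350, "was": 2001,
             "per": 2566, "##par": 19362, "##ed": 2098, "great": 2307}

tokens = []
for word in "my breakfast was perpared great".split():
    tokens.extend(wordpiece(word, toy_vocab))   # step 1: tokenization
ids = [toy_vocab[t] for t in tokens]            # step 2: conversion to input IDs

print(tokens)
print(ids)
```

Note how the misspelled word "perpared" is not in the vocabulary as a whole, so it is broken into known pieces instead of being mapped to an unknown token.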


The tokenizer actually adds two special tokens when preprocessing: one at the beginning, and one at the end.

In [15]:
inputs = tokenizer(tokenized_dataset['train'][321]['Review'])
inputs['input_ids']   # or inputs.input_ids

[101,
 2026,
 6350,
 2001,
 2566,
 19362,
 2098,
 2307,
 1010,
 2007,
 1037,
 3376,
 8312,
 1997,
 1017,
 5016,
 25609,
 1997,
 15174,
 1010,
 8217,
 6497,
 2098,
 2007,
 9898,
 2098,
 5699,
 1012,
 102]
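The difference between this output and the IDs obtained from `convert_tokens_to_ids` can be reproduced by hand. The sketch below assumes the IDs 101 and 102 seen above for the two special tokens; `add_special_tokens` is a hypothetical helper written for illustration, not a tokenizer method.

```python
CLS_ID, SEP_ID = 101, 102  # IDs of the two special tokens, as seen in the output above

def add_special_tokens(ids):
    """Wrap a single sequence with the special tokens and build its attention mask."""
    input_ids = [CLS_ID] + ids + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # every position holds a real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# IDs produced by convert_tokens_to_ids for our example review
ids = [2026, 6350, 2001, 2566, 19362, 2098, 2307, 1010, 2007, 1037, 3376, 8312,
       1997, 1017, 5016, 25609, 1997, 15174, 1010, 8217, 6497, 2098, 2007, 9898,
       2098, 5699, 1012]

encoded = add_special_tokens(ids)
print(len(ids), len(encoded["input_ids"]))
```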

We can [decode](https://huggingface.co/course/chapter2/4?fw=pt#decoding) the sequence to check what these tokens are:

In [16]:
tokenizer.decode(inputs['input_ids'])

'[CLS] my breakfast was perpared great, with a beautiful presentation of 3 giant slices of toast, lightly dusted with powdered sugar. [SEP]'

As with encoding, we can decode in two separate steps:

In [17]:
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'])
print(tokens)
print(tokenizer.convert_tokens_to_string(tokens))

['[CLS]', 'my', 'breakfast', 'was', 'per', '##par', '##ed', 'great', ',', 'with', 'a', 'beautiful', 'presentation', 'of', '3', 'giant', 'slices', 'of', 'toast', ',', 'lightly', 'dust', '##ed', 'with', 'powder', '##ed', 'sugar', '.', '[SEP]']
[CLS] my breakfast was perpared great, with a beautiful presentation of 3 giant slices of toast, lightly dusted with powdered sugar. [SEP]
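The second step essentially glues the `##` continuation pieces back onto the preceding token. A rough sketch of that idea follows; `merge_wordpieces` is a hypothetical helper, and it does not tidy up the spacing around punctuation, which the real tokenizer appears to do (compare "great," in the output above).

```python
def merge_wordpieces(tokens):
    """Join tokens with spaces, then attach ## continuation pieces to the previous token."""
    return " ".join(tokens).replace(" ##", "").strip()

tokens = ['[CLS]', 'my', 'breakfast', 'was', 'per', '##par', '##ed',
          'great', ',', 'with', '[SEP]']
print(merge_wordpieces(tokens))
```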


### Loading the model

We now load the pretrained model:

In [18]:
from transformers import AutoModel

model = AutoModel.from_pretrained(model_name)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Loading the model in this way only gets us the base Transformer module: given some inputs, we obtain the hidden state of the model -- a high-dimensional vector representing the "contextual understanding" of that input by the Transformer model.

In other words, we are leaving out the *head* of the model, which is needed for whatever NLP task we want to address.

Let's look at a particular example:

In [19]:
inputs = tokenizer(train_valid_test_dataset['train'][321]['Review'], padding=True, truncation=True, return_tensors="pt")

print(train_valid_test_dataset['train'][321])
print(inputs['input_ids'])
print(inputs['input_ids'].shape)

outputs = model(**inputs)
print(outputs.last_hidden_state)   # or outputs["last_hidden_state"]

print(outputs.last_hidden_state.shape)

{'Review': 'My breakfast was perpared great, with a beautiful presentation of 3 giant slices of Toast, lightly dusted with powdered sugar.', 'label': 1}
tensor([[  101,  2026,  6350,  2001,  2566, 19362,  2098,  2307,  1010,  2007,
          1037,  3376,  8312,  1997,  1017,  5016, 25609,  1997, 15174,  1010,
          8217,  6497,  2098,  2007,  9898,  2098,  5699,  1012,   102]])
torch.Size([1, 29])
tensor([[[-0.1190,  0.0041,  0.0127,  ...,  0.1235,  0.3143,  0.2568],
         [ 0.1999,  0.1017,  0.1900,  ..., -0.0455,  0.4243,  0.1464],
         [ 0.2767,  0.5174, -0.2451,  ..., -0.2512,  0.0239, -0.2284],
         ...,
         [ 0.0440,  0.2195,  0.0101,  ...,  0.0225,  0.2297, -0.4001],
         [ 0.7547,  0.0706, -0.4118,  ...,  0.4643, -0.3030, -0.3844],
         [ 0.3806,  0.3072,  0.1652,  ...,  0.5644,  0.0149, -0.2294]]],
       grad_fn=<NativeLayerNormBackward0>)
torch.Size([1, 29, 768])


As you can see, the hidden state representation has three dimensions:
- the *batch size* (in this case we are passing the model a single input sequence)
- the *sequence length*, that is, the number of tokens created by the tokenizer when encoding each input sequence
- the *hidden state size*, which is the vector dimension of each token (768 in the case of this model)
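A classification head needs a single vector per input sequence, so the sequence dimension has to be reduced in some way. Two common options are taking the vector of the first token or averaging over all tokens. The sketch below illustrates both on a random stand-in array with the same shape as the output above; the values are made up and no model is involved.

```python
import numpy as np

batch_size, seq_len, hidden_size = 1, 29, 768   # the shape printed above
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))  # stand-in values

cls_vector = last_hidden_state[:, 0, :]        # vector of the first token ([CLS])
mean_vector = last_hidden_state.mean(axis=1)   # average over the sequence dimension

print(cls_vector.shape, mean_vector.shape)
```

Either way, the result has one 768-dimensional vector per sequence in the batch, which is the kind of input a classification layer can work with.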

Since we want to use the model for classification, we should load it with an appropriate classification head:

In [20]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'pre_classifi

Now the outputs of the model will be quite different: we get *logits* with the prediction for each class.

In [21]:
outputs = model(**inputs)
print(outputs.logits)
print(outputs.logits.shape)

tensor([[0.0815, 0.1069]], grad_fn=<AddmmBackward0>)
torch.Size([1, 2])


Logits are the raw, unnormalized scores output by the last layer of the model. To convert them into probabilities, we pass them through a *softmax* function.

In [22]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

model.config.id2label

tensor([[0.4936, 0.5064]], grad_fn=<SoftmaxBackward0>)


{0: 'LABEL_0', 1: 'LABEL_1'}

Now we can interpret the obtained values as probabilities, and identify the class to which the model assigns the highest probability for the input example.
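For reference, the softmax computation can be reproduced by hand from the logits printed earlier. This is a small sketch using only the standard library; since the printed logits are rounded, the result matches the probabilities above only approximately.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)                       # subtracting the max avoids overflow
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.0815, 0.1069]                     # the (rounded) logits printed above
probs = softmax(logits)
predicted_class = probs.index(max(probs))     # index of the most probable class

print(probs, predicted_class)
```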

Note, however, that for now the model is essentially guessing: its classification head was newly initialized (as the loading message above indicates) and has not yet been trained on our dataset. To better see this behavior, ask the user for some input, feed it to the model, and check its predictions.

In [23]:
# your code here


### Fine-tuning

The next step is to [fine-tune](https://huggingface.co/docs/transformers/training) the model with our train data. To do so, we can make use of a [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer).
There are several aspects of training that you can specify via [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

In [24]:
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
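from datasets import load_metric  # newer releases of datasets drop load_metric; the separate evaluate library (evaluate.load) replaces it
import numpy as np

# the "accuracy" metric script depends on scikit-learn (pip install scikit-learn)
metric = load_metric("accuracy")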

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch", # run validation at the end of each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

  metric = load_metric("accuracy")
Downloading builder script: 4.21kB [00:00, 1.63MB/s]                   


ImportError: To be able to use accuracy, you need to install the following dependency: sklearn.
Please install it using 'pip install scikit-learn' for instance.
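The `DataCollatorWithPadding` passed to the `Trainer` above is needed because we tokenized with `truncation=True` but without padding, so the encoded reviews have different lengths. The idea of padding each batch to its longest sequence can be sketched as follows. This is a simplified stand-in, not the library's implementation: `pad_batch` is a hypothetical helper, and the padding ID of 0 is an assumption based on the BERT vocabulary.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding positions
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy sequences of different lengths (IDs taken from the examples above)
batch = [[101, 2026, 6350, 102], [101, 2307, 102]]
padded = pad_batch(batch)

print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to a global maximum length keeps batches of short reviews small, which is the main reason to leave padding to the collator.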

In [None]:
trainer.train()
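To get a rough idea of how much work this training run involves, we can count the optimization steps implied by the settings above. The calculation below is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and that the last (smaller) batch of each epoch is kept.

```python
import math

num_train_rows = 900   # size of the train split shown earlier
batch_size = 16        # per_device_train_batch_size
num_epochs = 3         # num_train_epochs

# 900 / 16 = 56.25, so the last batch of each epoch is only partially filled
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```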

We can check the model's performance on the validation set.

In [None]:
trainer.evaluate()
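The accuracy reported here comes from the `compute_metrics` function defined earlier: for each example, the class with the highest logit is compared with the gold label. A self-contained sketch of that computation with plain NumPy is shown below; the logits and labels are made-up toy values, and `accuracy_from_logits` is a hypothetical stand-in for the metric object.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the gold labels."""
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# Toy logits for four examples and two classes (made-up numbers)
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 0.5], [1.2, 0.9]])
labels = np.array([0, 1, 0, 0])

print(accuracy_from_logits(logits, labels))
```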

And more importantly, we can check how the model fares on our test set.

In [None]:
trainer.predict(test_dataset=tokenized_dataset["test"])

#### Saving the model

The model can be saved for future loading.

In [None]:
trainer.save_model()

#### Loading and using a saved model

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer2 = AutoTokenizer.from_pretrained("./results")
model2 = AutoModelForSequenceClassification.from_pretrained("./results", num_labels=2)

To use the model for inference, we can rely on a pipeline.

In [None]:
from transformers import TextClassificationPipeline

pipe = TextClassificationPipeline(model=model2, tokenizer=tokenizer2) #, return_all_scores=True)

In [None]:
pipe("I love this food!")

We can also use the model in a step-by-step fashion, as follows.

In [None]:
import torch

inputs = "I love this food!"

# tokenize inputs
tokenized_inputs = tokenizer2(inputs, return_tensors="pt")
print(tokenized_inputs)

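# obtain model outputs (inference only, so no gradients are needed)
with torch.no_grad():
    outputs = model2(**tokenized_inputs)
print(outputs)

# get the most likely label
labels = ['NEGATIVE', 'POSITIVE']   # index 0 / 1, matching the dataset's label column
prediction = torch.argmax(outputs.logits, dim=-1).item()   # plain Python int
print(labels[prediction])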

Let's check the performance of the model on the test set again, this time with additional metrics.

In [None]:
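y_pred = []
for p in tokenized_dataset['test']['Review']:
    ti = tokenizer2(p, return_tensors="pt")
    with torch.no_grad():   # inference only, so no gradients are needed
        out = model2(**ti)
    pred = torch.argmax(out.logits, dim=-1).item()   # plain int rather than a tensor
    y_pred.append(pred)   # our labels are already 0 and 1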

In [None]:
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_test = tokenized_dataset['test']['label']

print(confusion_matrix(y_test, y_pred))
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred, average='macro'))
print('Recall: ', recall_score(y_test, y_pred, average='macro'))
print('F1: ', f1_score(y_test, y_pred, average='macro'))
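With `average='macro'`, each metric is computed per class and the per-class values are then averaged with equal weight. The sketch below spells this out in plain Python on made-up toy labels; `per_class_counts` and `macro_scores` are hypothetical helpers, and undefined ratios are set to 0 here by assumption.

```python
def per_class_counts(y_true, y_pred, cls):
    """Count true positives, false positives and false negatives for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return tp, fp, fn

def macro_scores(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class precision, recall and F1."""
    precisions, recalls, f1s = [], [], []
    for cls in classes:
        tp, fp, fn = per_class_counts(y_true, y_pred, cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy labels (made up for illustration)
y_true_toy = [1, 1, 1, 1, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 1]

precision, recall, f1 = macro_scores(y_true_toy, y_pred_toy)
print(precision, recall, f1)
```

Macro averaging treats both classes as equally important, regardless of how many examples each one has.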

We can do the same using a Trainer, as before.

In [None]:
trainer2 = Trainer(
    model=model2,
    tokenizer=tokenizer2,
    compute_metrics=compute_metrics
)

In [None]:
trainer2.predict(test_dataset=tokenized_dataset["test"])

## Using a task-related pretrained model

Since Hugging Face hosts many pretrained models, we can also directly use a model that has already been trained on similar data or for a similar task.

In [None]:
from transformers import pipeline

# model_name = "siebert/sentiment-roberta-large-english"
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
sentiment_analysis = pipeline("sentiment-analysis", model=model_name)

Let's see how it performs without any fine-tuning (this time making use of the pipeline to predict the label for each of the test set samples).

In [None]:
y_pred = []
for p in train_valid_test_dataset['test']['Review']:
    # the pipeline returns a list with one {'label': ..., 'score': ...} dict per input
    label = sentiment_analysis(p)[0]['label']
    y_pred.append(0 if label == 'NEGATIVE' else 1)

In [None]:
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_test = train_valid_test_dataset['test']['label']

print(confusion_matrix(y_test, y_pred))
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred, average='macro'))
print('Recall: ', recall_score(y_test, y_pred, average='macro'))
print('F1: ', f1_score(y_test, y_pred, average='macro'))

As before, we can do the same via a Trainer.

In [None]:
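from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# no tokenizer or data collator is passed here, so the inputs are padded during preprocessing instead
trainer = Trainer(model=model, compute_metrics=compute_metrics)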

In [None]:
def preprocess_function(sample):
    return tokenizer(sample["Review"], truncation=True, padding=True)

In [None]:
tokenized_dataset = train_valid_test_dataset.map(preprocess_function, batched=True)

In [None]:
trainer.predict(test_dataset=tokenized_dataset["test"])

Note that we can still fine-tune the model with our training data, but the performance of the model is already quite good without any further training!