# Fine-tuning a pretrained transformer model to classify tweets into 4 classes using Hugging Face Transformers
## Adaptation of my project for the Lab Statistical Language Processing course taught at the University of Tübingen in the Summer Semester of 2024

## Honour Code
Author: Szymon T. Kossowski \
Honour Code: I pledge that this program represents my own work, and that I have not given or received unauthorized help with this assignment.

## Note:
I recommend running this notebook in Google Colab. You can run it locally, but make sure you have CUDA available. Otherwise some sections of the code will take an eternity or two to run.

# Notebook setup
Installing the packages

In [None]:
!pip install accelerate
!pip install transformers
# google.colab is preinstalled in Colab, so it does not need to be installed with pip
!pip install datasets
!pip install evaluate



Importing the packages

In [None]:
import torch
import collections
import numpy as np
import evaluate
import transformers
from google.colab import drive
from datasets import load_dataset

Since the training was done in Google Colab, and this notebook is meant to be run there (although this is not obligatory), Google Drive is mounted. Two directories are created, one for models and one for checkpoints.

In [None]:
drive.mount('/content/gdrive')

output_dir = "classifier-model"
!mkdir /content/gdrive/My\ Drive/$output_dir

training_checkpoints = "checkpoints"
models_dir = "models"

!mkdir /content/gdrive/My\ Drive/classifier-model/$training_checkpoints
!mkdir /content/gdrive/My\ Drive/classifier-model/$models_dir

%cd /content/gdrive/My\ Drive/classifier-model


Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
mkdir: cannot create directory ‘/content/gdrive/My Drive/classifier-model’: File exists
mkdir: cannot create directory ‘/content/gdrive/My Drive/classifier-model/checkpoints’: File exists
mkdir: cannot create directory ‘/content/gdrive/My Drive/classifier-model/models’: File exists
/content/gdrive/My Drive/classifier-model


You can, however, use local storage if you run this notebook locally. Create two directories, one for models and one for checkpoints. The paths given here should be adjusted to your needs, so change the `output_dir`, `training_checkpoints` and `models_dir` variables.

The code below is set up for Windows users. If you use a Unix-based system, comment out the code for Windows users and enable the code for Unix-based system users.

In [None]:
# For Windows users
output_dir = "C://path//to//directory//HuggingFace-Classifier-Model//classifier-model"
training_checkpoints = "C://path//to//directory//HuggingFace-Classifier-Model//classifier-model//checkpoints"
models_dir = "C://path//to//directory//HuggingFace-Classifier-Model//classifier-model//models"

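# For Unix-based system users (-p creates missing parent directories and does not fail on re-runs)
# output_dir = "classifier-model"
# !mkdir -p /c/path/to/directory/huggingface-classifier-model/$output_dir

# training_checkpoints = "checkpoints"
# !mkdir -p /c/path/to/directory/huggingface-classifier-model/$output_dir/$training_checkpoints

# models_dir = "models"
# !mkdir -p /c/path/to/directory/huggingface-classifier-model/$output_dir/$models_dir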

%cd "C://path//to//directory//HuggingFace-Classifier-Model//classifier-model"
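
For local runs, the two directories can also be created in a platform-independent way, so that nothing has to be commented in or out. The following is a minimal sketch using `pathlib`; the helper name `make_project_dirs` and the temporary base folder are illustrative assumptions and not part of the original notebook.

```python
from pathlib import Path
import tempfile


def make_project_dirs(base_dir):
    """Create <base_dir>/checkpoints and <base_dir>/models if they are missing."""
    base = Path(base_dir)
    checkpoints = base / "checkpoints"
    models = base / "models"
    for directory in (checkpoints, models):
        # parents=True creates missing parent folders; exist_ok=True makes re-runs safe
        directory.mkdir(parents=True, exist_ok=True)
    return checkpoints, models


# Demonstration in a throwaway temporary folder; replace it with your own base path.
with tempfile.TemporaryDirectory() as tmp:
    demo_checkpoints, demo_models = make_project_dirs(Path(tmp) / "classifier-model")
    print(demo_checkpoints.is_dir(), demo_models.is_dir())
```

Because of `exist_ok=True`, running the cell a second time does not produce "File exists" errors.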

Setting the `device` variable

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Preparation of the training data
Loading the emotion subset of the *tweet_eval* dataset.

In [None]:
dataset = load_dataset("cardiffnlp/tweet_eval", "emotion")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 3257
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1421
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 374
    })
})


Tokenising the entire dataset with the tokenizer of the case-insensitive [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) model.

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 3257
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1421
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 374
    })
})
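
With `padding="max_length"` and `truncation=True`, every tweet ends up as a sequence of the same length: longer inputs are cut off, and shorter ones are filled with padding tokens that the attention mask marks with 0. The toy function below imitates that behaviour on plain lists of IDs. It is only an illustration, not the tokenizer's implementation: `pad_or_truncate`, the example IDs and `max_length=8` are made up for this sketch, and special tokens are not handled.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids)[:max_length]   # truncation: drop everything past max_length
    attention_mask = [1] * len(ids)      # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding            # padding: fill up to max_length
    attention_mask += [0] * padding      # 0 marks padding, which attention ignores
    return ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=8)
long_ids, long_mask = pad_or_truncate(range(1, 13), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The `attention_mask` column in the printout above plays the same role for the real tokenizer output.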


Creating small data splits for CPU training

In [None]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(50))
small_dev_dataset = tokenized_datasets["validation"].shuffle(seed=42).select(range(10))
small_test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(10))

# Preparing for training
## Loading the pretrained distilbert-base-uncased model
After training, the model is supposed to classify a sequence of tokens (here: a tweet) as one of several labels. Therefore, the pretrained model is loaded using `AutoModelForSequenceClassification`. When a pretrained model is loaded for fine-tuning, it needs to know some facts about the training data. In addition to the model name, the following values are passed when loading the pretrained model:

`num_labels`: The number of labels in the training data, which can be retrieved from the training data features.

`label2id`: A {label: id} dictionary. The list of label names is retrieved from the training data features. Each label is mapped to its index in the list.

`id2label`: An {id: label} dictionary, the reverse mapping of `label2id`.

Providing the label-id mappings makes it easier to use the model for inference in the last step: the model will return its predictions as strings ("anger", "joy", ...). If the mappings are not provided, the model returns predictions as "LABEL_0", "LABEL_1", ..., which is inconvenient.

In [None]:
# 0: anger
# 1: joy
# 2: optimism
# 3: sadness

labels = tokenized_datasets["train"].features["label"].names

# Drop duplicates while keeping the original order
unique_labels = list(dict.fromkeys(labels))

num_labels = len(unique_labels)

# Map each label name to its index in the list, then build the reverse mapping
label2id = {label: i for i, label in enumerate(unique_labels)}
id2label = {i: label for label, i in label2id.items()}

print(label2id)
print(id2label)

{'anger': 0, 'joy': 1, 'optimism': 2, 'sadness': 3}
{0: 'anger', 1: 'joy', 2: 'optimism', 3: 'sadness'}
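
The two dictionaries are inverses of each other, which is what allows a predicted class index to be turned back into a readable label. Here is a small sketch that uses the mappings printed above; the score values are invented for illustration.

```python
label2id = {"anger": 0, "joy": 1, "optimism": 2, "sadness": 3}
id2label = {index: label for label, index in label2id.items()}

# Hypothetical raw scores for one tweet, one value per class index
example_scores = [0.3, 2.1, 0.4, -1.0]

# The predicted class is the index of the highest score
predicted_id = max(range(len(example_scores)), key=lambda i: example_scores[i])
predicted_label = id2label[predicted_id]

print(predicted_id, predicted_label)
```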


Loading the pretrained [model](https://huggingface.co/distilbert/distilbert-base-uncased)

In [None]:
model = transformers.AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=num_labels, label2id=label2id, id2label=id2label)
model.to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


## Training arguments

In addition to the directory for saving training checkpoints, the following parameters for the training arguments are recommended:

`save_total_limit=2`: During training, the model is saved at intervals, according to `save_strategy`. These saved models are called checkpoints and can be used to resume training. Since space is limited and checkpoints can fill up the disk rather quickly, their number is limited to 2 (the last and the best model).

`load_best_model_at_end=True`: When training is finished, the best checkpoint is loaded (judged by validation loss unless `metric_for_best_model` is set). When the model is saved later, it is therefore the best model that is saved, not the last trained one.

`eval_strategy="steps"`

`save_strategy="steps"`

`logging_strategy="steps"`

Evaluate, save, and log every 500 steps (500 is the default step interval).

Alternatively, `"epoch"` can be used as the strategy. Just make sure that `eval_strategy` and `save_strategy` are the same, because `load_best_model_at_end=True` requires them to match (or simply set all three to the same value).

In [None]:
training_checkpoints = "./checkpoints"
training_arguments = transformers.TrainingArguments(
    report_to="none",
    output_dir=training_checkpoints,
    save_total_limit=2,
    load_best_model_at_end=True,
    eval_strategy="steps",
    save_strategy="steps",
    logging_strategy="steps")

## Evaluation Setup
Defining the `compute_metrics` function that will be used for evaluation, both during training and on the test set. Since this is a classification task, the F1 score (with macro averaging) is used: `metric = evaluate.load("f1")`

In [None]:
metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels, average="macro")
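
To make clear what macro averaging means, the sketch below computes the same kind of score by hand: one F1 value per class, then their unweighted mean, so every class counts equally regardless of how many tweets it has. The function `macro_f1` and the toy predictions are assumptions made for illustration. It is not the code behind `evaluate.load("f1")`, and it counts a class that never occurs as F1 = 0.

```python
def macro_f1(predictions, references, num_classes):
    """Unweighted mean of the per-class F1 scores."""
    per_class = []
    for c in range(num_classes):
        pairs = list(zip(predictions, references))
        tp = sum(p == c and r == c for p, r in pairs)   # correctly predicted as c
        fp = sum(p == c and r != c for p, r in pairs)   # wrongly predicted as c
        fn = sum(p != c and r == c for p, r in pairs)   # missed instances of c
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        per_class.append(2 * tp / denominator if denominator else 0.0)
    return sum(per_class) / num_classes


toy_references = [0, 0, 1, 1, 2, 3]
toy_predictions = [0, 1, 1, 1, 2, 2]
print(round(macro_f1(toy_predictions, toy_references, num_classes=4), 4))  # 0.5333
```

In the toy example the per-class scores are 2/3, 4/5, 2/3 and 0, so the single missed class pulls the average down noticeably.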

## Initialize Trainer and Train
When the model is ready for training on the full dataset and a GPU is available (e.g. on Google Colab), request one: Runtime -> Change runtime type -> T4 GPU. Training for the default 3 epochs should take a few minutes on a GPU; the run logged below took about 8 minutes (a `train_runtime` of roughly 492 seconds). If no GPU is available, CPU training also works, but it takes significantly longer.

In [33]:
trainset = tokenized_datasets["train"]
evalset = tokenized_datasets["validation"]

trainer = transformers.Trainer(
    model=model,
    args=training_arguments,
    train_dataset=trainset,
    eval_dataset=evalset,
    compute_metrics=compute_metrics,
)
trainer.train()

Step,Training Loss,Validation Loss,F1
500,0.2626,1.176944,0.696946
1000,0.2041,1.227721,0.713501


TrainOutput(global_step=1224, training_loss=0.20278771562513961, metrics={'train_runtime': 492.5295, 'train_samples_per_second': 19.838, 'train_steps_per_second': 2.485, 'total_flos': 1294385117663232.0, 'train_loss': 0.20278771562513961, 'epoch': 3.0})
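
The step counts in this log follow from the dataset size and the trainer defaults. The calculation below is a sketch that assumes the default per-device batch size of 8, a single device, and no gradient accumulation; none of these were changed in the training arguments above.

```python
import math

num_train_examples = 3257   # size of the train split printed earlier
batch_size = 8              # assumed: default per-device train batch size
num_epochs = 3              # default number of epochs
eval_every = 500            # default interval for the "steps" strategies

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
evaluated_at = list(range(eval_every, total_steps + 1, eval_every))

print(steps_per_epoch, total_steps, evaluated_at)  # 408 1224 [500, 1000]
```

This matches `global_step=1224` in the output and explains why the table contains exactly two evaluation rows, at steps 500 and 1000.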

# Saving the Best Model
Saving the model to the directory that was created for that purpose when mounting the drive. After the model is saved, it is safe to delete the contents of the checkpoints directory used during training. This may be necessary to avoid running out of space.

In [34]:
models_dir = "./models"
trainer.save_model(models_dir)

# Loading the saved model
If all went well, it should now be possible to load the model in the same way the distilbert-base-uncased model was loaded earlier, using `AutoModelForSequenceClassification`. This time it is only necessary to provide the model location, since no further training is being done. Then a `TextClassificationPipeline` is created. The pipeline task is "sentiment-analysis", and the distilbert-base-uncased tokenizer has to be provided as an argument to the pipeline.

In [35]:
saved_model = transformers.AutoModelForSequenceClassification.from_pretrained(models_dir)
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased")
analyzer = transformers.pipeline("sentiment-analysis", model=saved_model, tokenizer=tokenizer)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


# Using the saved model for inference
The pipeline created in the previous step is used to classify 15 tweets taken from the test set. For each tweet, the tweet itself, the model’s prediction, and the model’s confidence score are printed.
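
The confidence score is a probability. For a single-label classifier such as this one, the pipeline is expected to obtain it by applying a softmax to the model's raw scores (logits) and reporting the largest value. This is an assumption about the pipeline defaults rather than something shown in this notebook. The sketch below is self-contained and uses invented logits.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)                        # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "anger", 1: "joy", 2: "optimism", 3: "sadness"}
example_logits = [-1.2, 4.8, 0.3, -2.0]          # made-up values for one tweet

probabilities = softmax(example_logits)
best = max(range(len(probabilities)), key=probabilities.__getitem__)
print({"label": id2label[best], "score": probabilities[best]})
```

A score close to 1 therefore only means that one logit is much larger than the others; it does not by itself guarantee that the prediction is correct.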

In [36]:
tweets = dataset["test"].shuffle(seed=42).select(range(15))["text"]
for tweet in tweets:
  analysis = analyzer(tweet)
  print(f"Tweet: {tweet}")
  print(f"Prediction: {analysis[0]['label']}")
  print(f"Confidence: {analysis[0]['score']}")

Tweet: @user @user Why is this even a thing? #lost
Prediction: sadness
Confidence: 0.9965887069702148
Tweet: who knew magnus bane with teary eyes and trembling lips would be my downfall
Prediction: anger
Confidence: 0.9986338019371033
Tweet: This the way I rage, \n\non sum wopstar shit ⭐️
Prediction: anger
Confidence: 0.999642014503479
Tweet: Happy birthday🎉🎉❤️😘 @user I love and miss you bunches &amp; hope you have an amazing day beautiful!!🌞🌸💗💞
Prediction: joy
Confidence: 0.9995049238204956
Tweet: Great game of tennis this one 👌🏻👌🏻🎾 #nervous
Prediction: joy
Confidence: 0.9964063763618469
Tweet: @user The little details in life can build up to things that could affect you negatively and eventually make unhappy and lonely.
Prediction: sadness
Confidence: 0.9990500807762146
Tweet: @user I’m glad to hear that, Kara. Don’t hesitate to get in touch again if you need to! Cheryl
Prediction: joy
Confidence: 0.990203857421875
Tweet: Gorkas is angrily unhinged, &amp;lecturing CNN on what?they sh