<a target="_blank" href="https://colab.research.google.com/github/retowuest/uio-dl-2024/blob/main/Notebooks/nb-4.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Deep Learning for Social Scientists

### University of Oslo, November 27-28, 2024

### **Demo 5:**<br>Transformers

### Table of Contents
* [Introduction](#section_1)
* [Loading the Data](#section_2)
* [Loading and Fine-Tuning a Pre-Trained BERT Model](#section_3)

### Introduction <a class="anchor" id="section_1"></a>

In this demo, our goal is to fine-tune a BERT model for sentiment classification in PyTorch. We will use the open-source `transformers` [Python library](https://huggingface.co/docs/transformers/index) provided by [Hugging Face](https://huggingface.co/), which includes a number of pre-trained models that are ready for fine-tuning.

Our use case is the IMDb movie review data set, on which we fine-tune the distilled BERT model (`DistilBERT`) to perform sentiment classification. `DistilBERT` is a lightweight transformer model created by distilling a pre-trained BERT base model. The original uncased BERT base model contains about 110 million parameters. According to Hugging Face (see quote below), `DistilBERT` has 40% fewer parameters and runs 60% faster while preserving over 95% of BERT's performance on the GLUE language understanding benchmark.

---

Quote from https://huggingface.co/docs/transformers/model_doc/distilbert:

> DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% less parameters than *google-bert/bert-base-uncased*, runs 60% faster while preserving over 95% of BERT's performances as measured on the GLUE language understanding benchmark.

---
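As a rough check of these figures, the size of `DistilBERT` can be estimated from the two numbers given above: roughly 110 million parameters for BERT base and a 40% reduction. The sketch below only performs this arithmetic; the exact parameter counts of the published checkpoints may differ slightly.

```python
# Back-of-the-envelope estimate based on the figures quoted above
bert_base_params = 110_000_000  # approximate size of BERT base (from the text)
reduction_percent = 40          # DistilBERT has 40% fewer parameters

# Integer arithmetic avoids floating-point rounding noise
distilbert_params = bert_base_params * (100 - reduction_percent) // 100

print(f"Estimated DistilBERT parameters: {distilbert_params:,}")
```

The estimate comes out at about 66 million parameters, which is the order of magnitude we fine-tune in this demo.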

### Loading the Data <a class="anchor" id="section_2"></a>

We will begin by loading the required packages.

In [None]:
# Import packages
import gzip
import shutil
import time

import pandas as pd
import requests
import torch
import torch.nn.functional as F

import transformers
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification

Next, we specify some general settings (the random seed, the device specification, and the number of epochs we use for training the model).

In [None]:
random_seed = 123
torch.manual_seed(random_seed)
device = torch.device("cpu")

num_epochs = 3

Next, we will fetch the compressed IMDb movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/) for positive-negative sentiment classification, unzip it, and write it into a CSV-formatted file.

In [None]:
url = "https://github.com/rasbt/machine-learning-book/raw/main/ch08/movie_data.csv.gz"
filename = url.split("/")[-1]

# Download the compressed file (raise an error if the request fails)
r = requests.get(url)
r.raise_for_status()
with open(filename, "wb") as f:
    f.write(r.content)

# Decompress the archive into a CSV file
with gzip.open(filename, "rb") as f_in:
    with open("movie_data.csv", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # copy content from source file to destination file
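To see the decompression step in isolation, here is a minimal sketch that needs no network access. It writes a tiny gzip-compressed CSV to a temporary directory and unpacks it with the same `gzip.open` and `shutil.copyfileobj` pattern. The file names and the toy content are made up for illustration.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

# Work in a temporary directory so no files are left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = Path(tmp_dir) / "toy_data.csv.gz"  # hypothetical file names
    csv_path = Path(tmp_dir) / "toy_data.csv"

    # Write a tiny gzip-compressed CSV file to stand in for the download
    original = b"review,sentiment\ngreat movie,1\nboring plot,0\n"
    with gzip.open(gz_path, "wb") as f:
        f.write(original)

    # Decompress it by streaming from the gzip reader to a regular file
    with gzip.open(gz_path, "rb") as f_in, open(csv_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

    restored = csv_path.read_bytes()

print(restored.decode("utf-8"))
```

Streaming with `shutil.copyfileobj` copies the data in chunks, so the whole file does not have to fit in memory at once.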

Check if the data set looks okay.

In [None]:
# Load data into a Pandas DataFrame and print first few rows
df = pd.read_csv("movie_data.csv")
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [None]:
# Print dimensions of dataframe
df.shape

(50000, 2)

The next step is to split the data set into training, validation, and test sets. We use 70% (or 35,000 examples) of the data for training, 10% (or 5,000 examples) for validation, and the remaining 20% (or 10,000 examples) for testing. The rows are taken in the order in which they appear in the file.

In [None]:
# Split data into training, validation, and test sets
train_texts = df.iloc[:35000]["review"].values
train_labels = df.iloc[:35000]["sentiment"].values

valid_texts = df.iloc[35000:40000]["review"].values
valid_labels = df.iloc[35000:40000]["sentiment"].values

test_texts = df.iloc[40000:]["review"].values
test_labels = df.iloc[40000:]["sentiment"].values
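The slice boundaries can be checked with a few lines of arithmetic. This sketch only recomputes the split sizes from the row count reported by `df.shape`; it does not touch the data. Since the slices are consecutive, the split relies on the file being stored in random order. The mixed labels in the first rows are consistent with that, but it is an assumption worth verifying.

```python
n_examples = 50_000  # number of rows reported by df.shape above

# Boundaries of the three consecutive slices
train_end = 35_000
valid_end = 40_000

n_train = train_end
n_valid = valid_end - train_end
n_test = n_examples - valid_end

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} examples ({100 * n / n_examples:.0f}%)")
```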

Next, we will tokenize the texts into individual tokens (for `DistilBERT`, these are words or subword pieces) using the tokenizer implementation inherited from the pre-trained model class.

In [None]:
# Inherited tokenizers maintain consistency between the pre-trained model and the data
# (hence, using an inherited tokenizer is recommended when fine-tuning
# a pre-trained model)
tokenizer = DistilBertTokenizerFast.from_pretrained(
    "distilbert-base-uncased", model_max_length=512
)

In [None]:
# truncation=True: inputs longer than the model's maximum input length
# (512 tokens, see previous cell) are truncated to fit within that limit
# padding=True: shorter sequences are padded to the length of the longest
# sequence in the batch (here, all texts passed in one call), because the
# model requires all sequences in a batch to be of uniform length
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)
valid_encodings = tokenizer(list(valid_texts), truncation=True, padding=True)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True)
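To make truncation and padding concrete, here is a toy sketch in plain Python. The helper `pad_and_truncate` is hypothetical and is not the tokenizer's own code; the token ids are made up. Unlike the real tokenizer, it simply cuts sequences off at the end and does not preserve a closing special token.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy illustration (hypothetical helper, not the tokenizer's own code)."""
    # Truncation: cut every sequence down to at most max_length tokens
    truncated = [seq[:max_length] for seq in batch]
    # Padding: extend every sequence to the length of the longest one
    target_len = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target_len - len(seq)) for seq in truncated]
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [
        [1] * len(seq) + [0] * (target_len - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[101, 7, 8, 9, 10, 11, 102], [101, 7, 102]]  # made-up token ids
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions, which is why the encodings carry it alongside the token ids.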

Finally, we create a class called `IMDbDataset` and create the data loaders (the encodings store a lot of information about the tokenized texts; with the dictionary in the `__getitem__` method defined below, we extract only the relevant information).

In [None]:
# Create class IMDbDataset
class IMDbDataset(torch.utils.data.Dataset):  # creating subclass of torch.utils.data.Dataset (i.e., we inherit from this class and can override its functionality)
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    # We override two methods from torch.utils.data.Dataset class
    def __getitem__(self, idx):  # extract data at specified index from encodings and labels and convert them into tensors
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):  # specify length of data set which is determined by number of labels
        return len(self.labels)

# Apply class to create training, validation, and test sets
train_dataset = IMDbDataset(train_encodings, train_labels)
valid_dataset = IMDbDataset(valid_encodings, valid_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
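The indexing logic of `__getitem__` can be tried out without PyTorch. The sketch below uses a plain-Python stand-in, here called `ToyDataset` (a hypothetical name), with made-up encodings. It shows how one example is assembled from the dictionary of per-field lists, but returns lists and integers instead of tensors.

```python
class ToyDataset:
    """Plain-Python stand-in for IMDbDataset (no tensors, for illustration only)."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Pick the idx-th entry of every field and attach the matching label
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)


toy_encodings = {
    "input_ids": [[101, 7, 102], [101, 8, 102]],  # made-up token ids
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy_dataset = ToyDataset(toy_encodings, labels=[1, 0])
first_item = toy_dataset[0]
print(len(toy_dataset), first_item)
```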

In [None]:
# Create data loaders
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=16, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16, shuffle=False)
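With a batch size of 16, the number of batches per pass over each set follows from a ceiling division. The sketch below assumes that the last, incomplete batch is kept (the `DataLoader` default) and that training runs on a single device.

```python
import math

batch_size = 16
n_train, n_valid, n_test = 35_000, 5_000, 10_000
num_epochs = 3

# An incomplete final batch still counts as one batch
train_batches = math.ceil(n_train / batch_size)
valid_batches = math.ceil(n_valid / batch_size)
test_batches = math.ceil(n_test / batch_size)

# Total number of optimizer steps over all training epochs
total_train_steps = train_batches * num_epochs

print(train_batches, valid_batches, test_batches, total_train_steps)
```

These counts agree with the totals of 6564 training steps and 625 evaluation steps shown in the progress bars further below.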

### Loading and Fine-Tuning a Pre-Trained BERT Model <a class="anchor" id="section_3"></a>

We first load the pre-trained `DistilBERT` model ("uncased" means that the model does not distinguish between upper- and lower-case letters).

In [None]:
# Load pre-trained model that we want to fine-tune
# (DistilBertForSequenceClassification specifies the downstream task for which
# we want to fine-tune the model, which is sequence classification)
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
model.to(device)
model.train();

# Message below means:
# - The DistilBertForSequenceClassification class adds additional layers on top of the
#   pretrained model to perform sequence classification
# - These layers are not part of the pretrained checkpoint and are randomly initialized
#   when the model is created (as such they need to be trained)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


To train (fine-tune) the model, we will use the `Trainer` API provided by Hugging Face, which is optimized for transformer models.

In [None]:
# Import Trainer and TrainingArguments from transformers
from transformers import Trainer, TrainingArguments

# Specify optimization algorithm
optim = torch.optim.Adam(model.parameters(), lr=5e-5)

# Specify training arguments
# (directories for output and logs, number of epochs, batch sizes)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=num_epochs,  # defined in the general settings above
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_dir="./logs",
    logging_steps=10,
)

# Pass TrainingArguments settings to the Trainer class to instantiate a new trainer object
trainer = Trainer(
    model=model,  # the model to be fine-tuned
    args=training_args,  # training arguments specified above
    train_dataset=train_dataset,  # training set
    optimizers=(optim, None)  # (optimizer, learning rate scheduler); with None, the Trainer creates its default scheduler
)

We can now train the model by calling the `trainer.train` method (we will use this method shortly).

As configured above, the `Trainer` reports only the training loss and does not compute any evaluation metric. Therefore, to evaluate the model, we define an evaluation function.

In [None]:
%%capture
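# Import evaluate and numpy
# (load_metric has been removed from recent versions of the datasets library;
# its replacement lives in the separate evaluate library)
import evaluate
import numpy as np

# Define metric
metric = evaluate.load("accuracy")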

# Define evaluation function
# (the function receives the model's predictions as logits, which is the
# default output of the model, together with the true labels)
def compute_metrics(eval_pred):
    logits, labels = eval_pred  # logits are a numpy array, not a pytorch tensor
    predictions = np.argmax(logits, axis=-1)  # predicted class = index of the largest logit
    return metric.compute(
        predictions=predictions, references=labels
    )
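The core of the evaluation function can be traced by hand on a toy example. The sketch below uses made-up logits for four reviews and two classes, and computes the accuracy directly with NumPy rather than through the metric object.

```python
import numpy as np

# Made-up logits: one row per review, one column per class (0 = negative, 1 = positive)
toy_logits = np.array([[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.1]])
toy_labels = np.array([0, 1, 0, 0])

# The predicted class is the index of the largest logit in each row
toy_predictions = np.argmax(toy_logits, axis=-1)

# Accuracy is the share of predictions that match the labels
toy_accuracy = float((toy_predictions == toy_labels).mean())

print(toy_predictions, toy_accuracy)
```

No softmax is needed here: it preserves the ordering of the logits, so the arg max is the same with or without it.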

In [None]:
# Instantiate the trainer again, this time with the test set as evaluation set and with compute_metrics
optim = torch.optim.Adam(model.parameters(), lr=5e-5)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=num_epochs,  # defined in the general settings above
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_dir="./logs",
    logging_steps=10
)

trainer = Trainer(
    model=model,
    compute_metrics=compute_metrics,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    optimizers=(optim, None)  # optimizer and learning rate scheduler
)

In [None]:
# Train model by calling trainer.train method
start_time = time.time()
trainer.train()
print(f"Total Training Time: {(time.time() - start_time)/60:.2f} min")

  0%|          | 0/6564 [00:00<?, ?it/s]

{'loss': 0.6807, 'grad_norm': 1.4945377111434937, 'learning_rate': 4.9923826934795856e-05, 'epoch': 0.0}
{'loss': 0.6027, 'grad_norm': 3.1703603267669678, 'learning_rate': 4.984765386959172e-05, 'epoch': 0.01}
{'loss': 0.5401, 'grad_norm': 5.817531108856201, 'learning_rate': 4.977148080438757e-05, 'epoch': 0.01}
{'loss': 0.3642, 'grad_norm': 5.309085845947266, 'learning_rate': 4.9695307739183424e-05, 'epoch': 0.02}
{'loss': 0.3551, 'grad_norm': 13.325989723205566, 'learning_rate': 4.9619134673979285e-05, 'epoch': 0.02}
{'loss': 0.4475, 'grad_norm': 8.179169654846191, 'learning_rate': 4.954296160877514e-05, 'epoch': 0.03}
{'loss': 0.3234, 'grad_norm': 1.8053861856460571, 'learning_rate': 4.9466788543571e-05, 'epoch': 0.03}
{'loss': 0.3941, 'grad_norm': 7.565412998199463, 'learning_rate': 4.939061547836685e-05, 'epoch': 0.04}
{'loss': 0.3482, 'grad_norm': 5.218079090118408, 'learning_rate': 4.931444241316271e-05, 'epoch': 0.04}
{'loss': 0.3749, 'grad_norm': 12.034934043884277, 'learning_
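The logged learning rates decrease by the same amount every 10 steps, which is consistent with a linear decay from the initial rate of 5e-5 to zero over the 6564 training steps, without warm-up. The sketch below reproduces the logged values under that assumption. The function `linear_lr` is a hypothetical helper for illustration, not part of the `transformers` library.

```python
def linear_lr(step, initial_lr=5e-5, total_steps=6564):
    """Learning rate after `step` optimizer updates, assuming linear decay to zero."""
    return initial_lr * (total_steps - step) / total_steps


# Logging happens every 10 steps (logging_steps=10)
for step in (10, 20, 30):
    print(step, linear_lr(step))
```

The values for steps 10, 20, and 30 match the first three `learning_rate` entries in the log up to floating-point precision.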

After training has completed, we can call `trainer.evaluate` to obtain the model performance on the test set.

In [None]:
print(trainer.evaluate())

  0%|          | 0/625 [00:00<?, ?it/s]

{'eval_loss': 0.29338061809539795, 'eval_accuracy': 0.9378, 'eval_runtime': 223.274, 'eval_samples_per_second': 44.788, 'eval_steps_per_second': 2.799, 'epoch': 3.0}
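The reported numbers can be related to the size of the test set with a little arithmetic. The sketch below uses only values from the output above, plus the 10,000 test examples and 625 evaluation batches of this demo.

```python
n_test = 10_000        # number of test examples
n_eval_steps = 625     # evaluation batches of size 16
eval_accuracy = 0.9378
eval_runtime = 223.274  # seconds

# Number of correctly classified reviews implied by the accuracy
n_correct = round(eval_accuracy * n_test)

# Throughput figures implied by the runtime
samples_per_second = n_test / eval_runtime
steps_per_second = n_eval_steps / eval_runtime

print(n_correct, round(samples_per_second, 3), round(steps_per_second, 3))
```

In other words, the model classifies 9,378 of the 10,000 test reviews correctly and misclassifies the remaining 622.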


The accuracy on the test set is about 93.8%.