# Exercise: Create a BERT sentiment classifier

In this exercise, you will create a BERT-style sentiment classifier using the [Hugging Face Transformers](https://huggingface.co/transformers/) library. The model used is DistilBERT (`distilbert-base-uncased`), a smaller, distilled version of BERT. You will use the [IMDB movie review dataset](https://ai.stanford.edu/~amaas/data/sentiment/) to train and evaluate your model.

The IMDB dataset contains 50,000 movie reviews that are labeled as either positive or negative. The dataset is split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.


In [1]:
# Install dependencies for this notebook if they are missing
! pip install -q \
    scikit-learn \
    evaluate \
    datasets \
    "transformers[torch]" \
    ipywidgets

In [2]:
# Since we intend to use a GPU, check that one is available and has enough memory to run the model
! nvidia-smi

Mon Oct  9 02:44:22 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce GTX 1060 3GB    Off | 00000000:01:00.0 Off |                  N/A |
|  5%   50C    P5               8W / 120W |      0MiB /  3072MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
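The same check can be scripted, which is handy when the notebook runs unattended. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH`; the helper names `parse_gpu_memory` and `query_gpus` are hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'name, total MiB, used MiB' lines into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        parts = [field.strip() for field in line.split(",")]
        # Skip lines that do not look like: name, <int>, <int>
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        gpus.append(
            {"name": parts[0], "total_mib": int(parts[1]), "used_mib": int(parts[2])}
        )
    return gpus


def query_gpus():
    """Return GPU memory info, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.total,memory.used",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


# Sample line mirroring the card shown above (3072 MiB total, 0 MiB in use).
sample = "NVIDIA GeForce GTX 1060 3GB, 3072, 0"
print(parse_gpu_memory(sample))
print(query_gpus())
```

On a machine without a GPU, `query_gpus()` simply returns an empty list, so a later cell can fall back to a smaller batch size or the CPU.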
                                                                    

In [3]:
# Import the datasets and transformers packages
# NOTE: If you receive an error such as "ModuleNotFoundError: No module named 'datasets'",
# please restart the kernel (Kernel > Restart) and start the notebook from the top

from datasets import load_dataset

from transformers import AutoTokenizer

# Load the imdb dataset: https://huggingface.co/datasets/imdb
# Take only the "train" and "test" splits
splits = ["train", "test"]
ds = {split: dataset for split, dataset in zip(splits, load_dataset("imdb", split=splits))}

# Thin out the dataset to make it run faster for this example
for split in splits:
    ds[split] = ds[split].shuffle(seed=42).select(range(100))

# Show a summary of the thinned dataset splits
ds

{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 100
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 100
 })}
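Shuffling before `select(range(100))` matters if a split is stored grouped by label, because the first 100 rows would then all belong to one class. Whether the IMDB split is ordered this way is an assumption worth verifying on your copy. The sketch below illustrates the effect with synthetic labels in plain Python; it does not use the `datasets` API.

```python
import random
from collections import Counter

# Synthetic stand-in for a split stored grouped by label (an assumption).
labels = [0] * 500 + [1] * 500

# Taking the first 100 rows without shuffling yields a single class.
head = labels[:100]

# Shuffling the indices with a fixed seed gives a reproducible, mixed subset.
indices = list(range(len(labels)))
random.Random(42).shuffle(indices)
subset = [labels[i] for i in indices[:100]]

print("unshuffled:", Counter(head))
print("shuffled:  ", Counter(subset))
```

A 100-example subset is still small, so its class balance will only approximate the 50/50 balance of the full split.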

In [4]:

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


def preprocess_function(examples):
    # Tokenize a batch of reviews, truncating any that exceed the model's maximum length
    return tokenizer(examples["text"], truncation=True)


tokenized_ds = {}
for split in splits:
    tokenized_ds[split] = ds[split].map(preprocess_function, batched=True)

# Show the token ids of the first training example
print(tokenized_ds["train"][0]["input_ids"])



[101, 2045, 2003, 2053, 7189, 2012, 2035, 2090, 3481, 3771, 1998, 6337, 2099, 2021, 1996, 2755, 2008, 2119, 2024, 2610, 2186, 2055, 6355, 6997, 1012, 6337, 2099, 3504, 15594, 2100, 1010, 3481, 3771, 3504, 4438, 1012, 6337, 2099, 14811, 2024, 3243, 3722, 1012, 3481, 3771, 1005, 1055, 5436, 2024, 2521, 2062, 8552, 1012, 1012, 1012, 3481, 3771, 3504, 2062, 2066, 3539, 8343, 1010, 2065, 2057, 2031, 2000, 3962, 12319, 1012, 1012, 1012, 1996, 2364, 2839, 2003, 5410, 1998, 6881, 2080, 1010, 2021, 2031, 1000, 17936, 6767, 7054, 3401, 1000, 1012, 2111, 2066, 2000, 12826, 1010, 2000, 3648, 1010, 2000, 16157, 1012, 2129, 2055, 2074, 9107, 1029, 6057, 2518, 2205, 1010, 2111, 3015, 3481, 3771, 3504, 2137, 2021, 1010, 2006, 1996, 2060, 2192, 1010, 9177, 2027, 9544, 2137, 2186, 1006, 999, 999, 999, 1007, 1012, 2672, 2009, 1005, 1055, 1996, 2653, 1010, 2030, 1996, 4382, 1010, 2021, 1045, 2228, 2023, 2186, 2003, 2062, 2394, 2084, 2137, 1012, 2011, 1996, 2126, 1010, 1996, 5889, 2024, 2428, 2204, 1998, 6
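The list above starts with 101, which is the id of the `[CLS]` token in the uncased BERT vocabulary; each encoded sequence also ends with 102, the `[SEP]` token. With `truncation=True`, sequences longer than the model's maximum length are cut so that they still fit together with these special tokens. The sketch below shows the idea in plain Python. It is an illustration under the assumption of a 512-token limit, not the tokenizer's actual implementation, and `encode_with_truncation` is a hypothetical name.

```python
CLS_ID = 101  # [CLS] in the uncased BERT vocabulary
SEP_ID = 102  # [SEP] in the uncased BERT vocabulary


def encode_with_truncation(token_ids, max_length=512):
    """Wrap word-piece ids in [CLS] ... [SEP], keeping at most max_length ids."""
    budget = max_length - 2  # reserve room for the two special tokens
    return [CLS_ID] + list(token_ids[:budget]) + [SEP_ID]


short = encode_with_truncation([2045, 2003, 2053])
long = encode_with_truncation(range(1000, 2000))
print(short)
print(len(long))
```

Note that truncation discards the end of long reviews, so any sentiment expressed only in the final sentences is lost to the model.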

In [5]:

import evaluate
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer


import numpy as np

accuracy = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}


# https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

# Freeze all the parameters of the base model
for param in model.base_model.parameters():
    param.requires_grad = False

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
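The `compute_metrics` function in this cell receives raw logits and the true labels, picks the highest-scoring class for each row, and reports the fraction of matches. Here is a dependency-free sketch of the same computation; the logits are made-up values for illustration, and `accuracy_from_logits` is a hypothetical helper, not part of `evaluate`.

```python
def argmax(row):
    """Index of the largest value in a sequence."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical logits for four reviews; columns are (NEGATIVE, POSITIVE).
logits = [[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.1]]
labels = [0, 1, 0, 0]
print(accuracy_from_logits(logits, labels))
```

The third review is misclassified (the POSITIVE logit is larger but the label is 0), so three of the four predictions are correct.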


In [6]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [7]:
# Snapshot a slice of weights from each part of the model so we can check
# later which parts were updated by training: the word embeddings, the six
# transformer blocks, and the two classification-head layers.
# .clone() copies the values so the snapshot cannot change along with the model.

old = [model.base_model.embeddings.word_embeddings.weight.data[0,:10].clone(),
model.base_model.transformer.layer[0].attention.q_lin.weight.data[0,:10].clone(),
model.base_model.transformer.layer[1].attention.q_lin.weight.data[0,:10].clone(),
model.base_model.transformer.layer[2].attention.q_lin.weight.data[0,:10].clone(),
model.base_model.transformer.layer[3].attention.q_lin.weight.data[0,:10].clone(),
model.base_model.transformer.layer[4].attention.q_lin.weight.data[0,:10].clone(),
model.base_model.transformer.layer[5].attention.q_lin.weight.data[0,:10].clone(),
model.pre_classifier.weight.data[0,:10].clone(),
model.classifier.weight.data[0,:10].clone(),
]
print(old)


[tensor([-0.0166, -0.0666, -0.0163, -0.0421, -0.0080, -0.0140, -0.0635, -0.0205,
        -0.0086, -0.0634]), tensor([-0.0024,  0.0224, -0.0207,  0.0318, -0.0102,  0.0476,  0.0284,  0.0286,
         0.0307,  0.0135]), tensor([ 0.0915, -0.0140, -0.0762,  0.1138, -0.0585,  0.0498, -0.1951,  0.0972,
         0.0030,  0.0676]), tensor([ 0.0704,  0.0561,  0.0193, -0.1206, -0.0019, -0.0209, -0.0077, -0.0309,
         0.0206, -0.0391]), tensor([ 0.0137, -0.0824,  0.0224, -0.0237, -0.0156, -0.0535, -0.0006,  0.0089,
         0.0343,  0.0110]), tensor([ 0.0507,  0.0030, -0.0672, -0.0274,  0.0443, -0.0399,  0.0480,  0.0366,
         0.0262, -0.0375]), tensor([-0.0548, -0.0465, -0.0241, -0.0025,  0.0238,  0.0011,  0.0133, -0.0253,
         0.0117,  0.0457]), tensor([-0.0105, -0.0076, -0.0418,  0.0027, -0.0123, -0.0361,  0.0384,  0.0258,
         0.0026, -0.0067]), tensor([-0.0377, -0.0157,  0.0238,  0.0043, -0.0064, -0.0117,  0.0127,  0.0111,
        -0.0002, -0.0161])]


In [8]:
from transformers import DataCollatorWithPadding

# The Hugging Face Trainer class handles the PyTorch training and evaluation loops for us.
# Read more about it here: https://huggingface.co/docs/transformers/main_classes/trainer
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data/sentiment_analysis",
        learning_rate=2e-3,
        # Reduce the batch size if you don't have enough memory
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        push_to_hub=False,

    ),
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()

# Take the same weight slices after training so they can be compared with `old`
new = [model.base_model.embeddings.word_embeddings.weight.data[0,:10],
model.base_model.transformer.layer[0].attention.q_lin.weight.data[0,:10],
model.base_model.transformer.layer[1].attention.q_lin.weight.data[0,:10],
model.base_model.transformer.layer[2].attention.q_lin.weight.data[0,:10],
model.base_model.transformer.layer[3].attention.q_lin.weight.data[0,:10],
model.base_model.transformer.layer[4].attention.q_lin.weight.data[0,:10],
model.base_model.transformer.layer[5].attention.q_lin.weight.data[0,:10],
model.pre_classifier.weight.data[0,:10],
model.classifier.weight.data[0,:10],
]

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
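The warning above concerns padding. The reviews were tokenized without padding, so `DataCollatorWithPadding` pads each batch to the length of its longest member when the batch is assembled. The following plain-Python sketch shows that idea; it is not the library's implementation, `pad_batch` is a hypothetical name, and the pad id of 0 is taken from `padding_idx=0` in the model printout.

```python
PAD_ID = 0  # matches padding_idx=0 in the embedding layer printed earlier


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 2045, 2003, 102], [101, 2023, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps batches of short reviews small, which helps on a GPU with limited memory.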


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.650257,0.61
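The "No log" entry in the Training Loss column is most likely a consequence of how few optimizer steps this run takes. The arithmetic below is a sketch under stated assumptions: a single device, no gradient accumulation, and a logging interval of 500 steps (assumed here to be the `logging_steps` default; check the documentation for your installed version).

```python
import math

num_examples = 100      # size of the thinned training split
batch_size = 4          # per_device_train_batch_size
num_epochs = 1          # num_train_epochs
logging_steps = 500     # assumed default logging interval

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print("training loss logged during run:", total_steps >= logging_steps)
```

If that assumption holds, passing a smaller `logging_steps` value to `TrainingArguments` should make the training loss appear in the table.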


In [9]:
print(old)

[tensor([-0.0166, -0.0666, -0.0163, -0.0421, -0.0080, -0.0140, -0.0635, -0.0205,
        -0.0086, -0.0634]), tensor([-0.0024,  0.0224, -0.0207,  0.0318, -0.0102,  0.0476,  0.0284,  0.0286,
         0.0307,  0.0135]), tensor([ 0.0915, -0.0140, -0.0762,  0.1138, -0.0585,  0.0498, -0.1951,  0.0972,
         0.0030,  0.0676]), tensor([ 0.0704,  0.0561,  0.0193, -0.1206, -0.0019, -0.0209, -0.0077, -0.0309,
         0.0206, -0.0391]), tensor([ 0.0137, -0.0824,  0.0224, -0.0237, -0.0156, -0.0535, -0.0006,  0.0089,
         0.0343,  0.0110]), tensor([ 0.0507,  0.0030, -0.0672, -0.0274,  0.0443, -0.0399,  0.0480,  0.0366,
         0.0262, -0.0375]), tensor([-0.0548, -0.0465, -0.0241, -0.0025,  0.0238,  0.0011,  0.0133, -0.0253,
         0.0117,  0.0457]), tensor([-0.0105, -0.0076, -0.0418,  0.0027, -0.0123, -0.0361,  0.0384,  0.0258,
         0.0026, -0.0067]), tensor([-0.0377, -0.0157,  0.0238,  0.0043, -0.0064, -0.0117,  0.0127,  0.0111,
        -0.0002, -0.0161])]


In [10]:
print(new)

[tensor([-0.0166, -0.0666, -0.0163, -0.0421, -0.0080, -0.0140, -0.0635, -0.0205,
        -0.0086, -0.0634], device='cuda:0'), tensor([-0.0024,  0.0224, -0.0207,  0.0318, -0.0102,  0.0476,  0.0284,  0.0286,
         0.0307,  0.0135], device='cuda:0'), tensor([ 0.0915, -0.0140, -0.0762,  0.1138, -0.0585,  0.0498, -0.1951,  0.0972,
         0.0030,  0.0676], device='cuda:0'), tensor([ 0.0704,  0.0561,  0.0193, -0.1206, -0.0019, -0.0209, -0.0077, -0.0309,
         0.0206, -0.0391], device='cuda:0'), tensor([ 0.0137, -0.0824,  0.0224, -0.0237, -0.0156, -0.0535, -0.0006,  0.0089,
         0.0343,  0.0110], device='cuda:0'), tensor([ 0.0507,  0.0030, -0.0672, -0.0274,  0.0443, -0.0399,  0.0480,  0.0366,
         0.0262, -0.0375], device='cuda:0'), tensor([-0.0548, -0.0465, -0.0241, -0.0025,  0.0238,  0.0011,  0.0133, -0.0253,
         0.0117,  0.0457], device='cuda:0'), tensor([-0.0144, -0.0062, -0.0414,  0.0037, -0.0083, -0.0350,  0.0393,  0.0283,
         0.0024, -0.0148], device='cuda:0'),

In [11]:
import torch
for old_item, new_item in zip(old, new):
    # Move both tensors to the CPU so they can be compared element-wise
    old_item = old_item.cpu()
    new_item = new_item.cpu()

    print(torch.eq(old_item, new_item))


tensor([True, True, True, True, True, True, True, True, True, True])
tensor([True, True, True, True, True, True, True, True, True, True])
tensor([True, True, True, True, True, True, True, True, True, True])
tensor([True, True, True, True, True, True, True, True, True, True])
tensor([True, True, True, True, True, True, True, True, True, True])
tensor([True, True, True, True, True, True, True, True, True, True])
tensor([True, True, True, True, True, True, True, True, True, True])
tensor([False, False, False, False, False, False, False, False, False, False])
tensor([False, False, False, False, False, False, False, False, False, False])
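The comparison above shows that only the last two tensors (the classification head) changed, while the frozen base model stayed identical. The same behavior can be reproduced on a toy model. This is a minimal sketch, assuming PyTorch is installed; the two `nn.Linear` layers are hypothetical stand-ins for the pretrained base and the new head, not the DistilBERT architecture.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins: `base` plays the pretrained model, `head` the classifier.
base = nn.Linear(8, 4)
head = nn.Linear(4, 2)
model = nn.Sequential(base, nn.ReLU(), head)

# Freeze the base so the optimizer never updates it.
for param in base.parameters():
    param.requires_grad = False

# Clone the weights; without clone() the snapshot could share memory with the model.
before = {name: p.detach().clone() for name, p in model.named_parameters()}

# One training step on random data, optimizing only the trainable parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)
inputs = torch.randn(16, 8)
targets = torch.randint(0, 2, (16,))
loss = nn.functional.cross_entropy(model(inputs), targets)
loss.backward()
optimizer.step()

# Report which parameter tensors differ from their snapshot.
changed = {
    name: not torch.equal(before[name], p.detach())
    for name, p in model.named_parameters()
}
print(changed)
```

In `nn.Sequential`, parameters are named by position, so `0.weight` and `0.bias` belong to the frozen base and `2.weight` and `2.bias` to the head.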



In [27]:
print(model.base_model.embeddings.word_embeddings.weight.data[0,:10])
# Compare with the snapshot of the same embedding slice taken before training
print(old[0])

tensor([-0.0166, -0.0666, -0.0163, -0.0421, -0.0080, -0.0140, -0.0635, -0.0205,
        -0.0086, -0.0634], device='cuda:0')
tensor([-0.0166, -0.0666, -0.0163, -0.0421, -0.0080, -0.0140, -0.0635, -0.0205,
        -0.0086, -0.0634])


In [13]:
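# Show the performance of the model on the test set
# What do you think the evaluation accuracy will be?
# NOTE: the saved output below reports 'epoch': 8.0, so it appears to come from a
# longer training run than the single epoch configured above.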
trainer.evaluate()


{'eval_loss': 0.35444939136505127,
 'eval_accuracy': 0.89,
 'eval_runtime': 2.1459,
 'eval_samples_per_second': 46.6,
 'eval_steps_per_second': 11.65,
 'epoch': 8.0}
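The throughput figures in this output follow directly from the runtime, the number of test examples, and the evaluation batch size. The short check below reproduces them, assuming a single device so that one step processes one batch of 4 examples.

```python
import math

eval_runtime = 2.1459   # seconds, from the output above
num_examples = 100      # size of the thinned test split
batch_size = 4          # per_device_eval_batch_size

num_steps = math.ceil(num_examples / batch_size)
samples_per_second = num_examples / eval_runtime
steps_per_second = num_steps / eval_runtime
print(round(samples_per_second, 2), round(steps_per_second, 2))
```

Both values agree with the reported `eval_samples_per_second` and `eval_steps_per_second` to two decimal places.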

In [14]:
import pandas as pd
df = pd.DataFrame(tokenized_ds["test"])
df = df[['text', 'label']]
df = pd.concat(
    [
        df[df['label'] == 0].head(2),
        df[df['label'] == 1].head(2)
    ]
)
df

Unnamed: 0,text,label
2,"This movie was so frustrating. Everything seemed energetic and I was totally prepared to have a good time. I at least thought I'd be able to stand it. But, I was wrong. First, the weird looping? I...",0
4,"This movie spends most of its time preaching that it is the script that makes the movie, but apparently there was no script when they shot this waste of time! The trailer makes this out to be a co...",0
0,"<br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could go wrong?<br /><br />Very ...",1
1,"This is the latest entry in the long series of films with the French agent, O.S.S. 117 (the French answer to James Bond). The series was launched in the early 1950's, and spawned at least eight fi...",1


In [15]:
# Show up to 200 characters of each text cell in pandas output
pd.set_option('display.max_colwidth', 200)
df

Unnamed: 0,text,label
2,"This movie was so frustrating. Everything seemed energetic and I was totally prepared to have a good time. I at least thought I'd be able to stand it. But, I was wrong. First, the weird looping? I...",0
4,"This movie spends most of its time preaching that it is the script that makes the movie, but apparently there was no script when they shot this waste of time! The trailer makes this out to be a co...",0
0,"<br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could go wrong?<br /><br />Very ...",1
1,"This is the latest entry in the long series of films with the French agent, O.S.S. 117 (the French answer to James Bond). The series was launched in the early 1950's, and spawned at least eight fi...",1


In [16]:
dataset_indices = list(df.index)

In [17]:
# Show the performance of the model on some test set examples

# Predict on the two negative and two positive test set examples selected above
examples = tokenized_ds["test"].select(dataset_indices)
predictions = trainer.predict(examples)


# Split text into lines of at most n characters, breaking at spaces where possible
def split_text(text, n=160):
    lines = []
    while len(text) > n:
        line = text[:n]
        space_index = line.rfind(" ")
        # Only break at a space that is not the first character; otherwise
        # the line would be empty and the loop would never terminate
        if space_index > 0:
            line = line[:space_index]
        lines.append(line)
        text = text[len(line):]
    lines.append(text)
    return "\n".join(lines)


# Print each selected example with its predicted and true label
for index, (pred, label, text) in enumerate(zip(predictions.predictions.argmax(axis=1), predictions.label_ids, examples["text"])):
    print(f"{index}. pred={id2label[pred]} | label={id2label[label]} \n {split_text(text)}")

0. pred=NEGATIVE | label=NEGATIVE 
 This movie was so frustrating. Everything seemed energetic and I was totally prepared to have a good time. I at least thought I'd be able to stand it. But, I
 was wrong. First, the weird looping? It was like watching "America's Funniest Home Videos". The damn parents. I hated them so much. The stereo-typical Latino
 family? I need to speak with the person responsible for this. We need to have a talk. That little girl who was always hanging on someone? I just hated her and
 had to mention it. Now, the final scene transcends, I must say. It's so gloriously bad and full of badness that it is a movie of its own. What crappy dancing.
 Horrible and beautiful at once.
1. pred=NEGATIVE | label=NEGATIVE 
 This movie spends most of its time preaching that it is the script that makes the movie, but apparently there was no script when they shot this waste of time!
 The trailer makes this out to be a comedy, but the film can't decide if it wants to be a comedy, a