# Exercise: Full fine-tuning BERT

In this exercise, you will create a spam classifier by fine-tuning DistilBERT, a smaller distilled version of BERT, with the [Hugging Face Transformers](https://huggingface.co/transformers/) library. You will use the `sms_spam` dataset from the Hugging Face Hub to train and evaluate your model.

The dataset contains SMS messages labeled as either spam (`1`) or not spam (`0`). It ships with a single `train` split, so the notebook holds out 20% of the messages as a test set.


In [1]:
# Install dependencies for this notebook if they are missing
! pip install -q \
    scikit-learn \
    evaluate \
    datasets \
    "transformers[torch]" \
    ipywidgets

In [2]:
# Since we intend to use a GPU, check that one is available and has enough memory to run the model
! nvidia-smi

Mon Oct  9 03:26:12 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce GTX 1060 3GB    Off | 00000000:01:00.0 Off |                  N/A |
| 12%   50C    P5               8W / 120W |      0MiB /  3072MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [3]:
# Import the datasets and transformers packages
# NOTE: If you receive an error such as "ModuleNotFoundError: No module named 'datasets'",
# please restart the kernel (Kernel > Restart) and start the notebook from the top

from datasets import load_dataset

from transformers import AutoTokenizer

splits = ["train", "test"]

# The sms_spam dataset only has a train split, so we use the train_test_split method to split it into train and test
dataset = load_dataset("sms_spam", split="train").train_test_split(test_size=0.2, shuffle=True, seed=23)


dataset["train"][0]


{'sms': 'Had your mobile 10 mths? Update to the latest Camera/Video phones for FREE. KEEP UR SAME NUMBER, Get extra free mins/texts. Text YES for a call\n',
 'label': 1}
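
To make the hold-out step concrete, here is a minimal pure-Python sketch of a seeded 80/20 shuffle split. It only illustrates the idea: `simple_train_test_split` is a hypothetical helper, and the `datasets` library's own implementation and rounding may differ, so the same seed will not reproduce the split shown above.

```python
import math
import random


def simple_train_test_split(items, test_size=0.2, seed=23):
    """Shuffle a copy of items with a fixed seed and hold out a test fraction.

    Hypothetical helper for illustration; not the datasets implementation.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}


# Ten invented messages stand in for the real dataset
messages = [f"message {i}" for i in range(10)]
toy_split = simple_train_test_split(messages)
print(len(toy_split["train"]), len(toy_split["test"]))  # 8 2
```

Fixing the seed makes the split repeatable, which keeps evaluation numbers comparable between runs.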

In [4]:

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


def preprocess_function(examples):
    # The text column of the sms_spam dataset is "sms"; truncate to the model's maximum length
    return tokenizer(examples["sms"], truncation=True)


# Tokenize each split in batches
tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = dataset[split].map(preprocess_function, batched=True)
print(tokenized_dataset["train"][0]["input_ids"])


[101, 2018, 2115, 4684, 2184, 11047, 7898, 1029, 10651, 2000, 1996, 6745, 4950, 1013, 2678, 11640, 2005, 2489, 1012, 2562, 24471, 2168, 2193, 1010, 2131, 4469, 2489, 8117, 2015, 1013, 6981, 1012, 3793, 2748, 2005, 1037, 2655, 102]
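
Note that the list starts with 101 and ends with 102, the IDs of the `[CLS]` and `[SEP]` special tokens in this vocabulary, and that no padding has been applied yet: each message keeps its own length. Padding happens later, per batch, in the data collator. The sketch below shows roughly what that dynamic padding amounts to. `pad_batch` is a hypothetical helper, not the library's code, and it assumes a pad token ID of 0 (consistent with `padding_idx=0` in the model summary).

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences of different lengths (the IDs are taken from the output above)
toy_batch = [[101, 2018, 2115, 4684, 102], [101, 3793, 102]]
padded = pad_batch(toy_batch)
print(padded["input_ids"][1])       # [101, 3793, 102, 0, 0]
print(padded["attention_mask"][1])  # [1, 1, 1, 0, 0]
```

Padding to the longest sequence in each batch, rather than to a fixed global length, avoids wasting computation on short messages.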


In [5]:

import evaluate
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer


import numpy as np

accuracy = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

id2label = {0: "not spam", 1: "spam"}
label2id = {"not spam": 0, "spam": 1}


# https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

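# Unfreeze all the model parameters so that every layer is updated (full fine-tuning).
# Note: parameters are trainable by default after from_pretrained; this loop makes that explicit.
for param in model.parameters():
    param.requires_grad = True
# The learning rate is not a property of the parameters; it is set through
# TrainingArguments(learning_rate=...) when the Trainer is created.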



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
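
As a quick illustration of what `compute_metrics` does, the sketch below reproduces the arg-max and accuracy steps in plain Python on made-up logits. The numbers are invented for the example, and `toy_accuracy` is a hypothetical helper; the notebook itself relies on `np.argmax` and the `evaluate` accuracy metric.

```python
def toy_accuracy(logits, labels):
    """Pick the highest-scoring class per row and compare it with the labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Invented logits for four messages: column 0 = "not spam", column 1 = "spam"
toy_logits = [[2.1, -1.3], [-0.4, 1.7], [0.9, 0.2], [-1.0, 0.5]]
toy_labels = [0, 1, 1, 1]
print(toy_accuracy(toy_logits, toy_labels))  # {'accuracy': 0.75}
```

The third message is the only miss: its logits favor class 0 while its label is 1.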


In [6]:
# list(model.named_parameters())

In [7]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [8]:
from transformers import DataCollatorWithPadding

# The HuggingFace Trainer class handles the training and eval loop for PyTorch for us.
# Read more about it here https://huggingface.co/docs/transformers/main_classes/trainer 
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data/sentiment_analysis",
        learning_rate=2e-5,
        # Reduce the batch size if you don't have enough memory
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=2,
        weight_decay=0.01,
        # Newer transformers releases rename this argument to `eval_strategy`
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        push_to_hub=False,

    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.050394,0.988341
2,0.052400,0.047742,0.989238


TrainOutput(global_step=558, training_loss=0.049542664199747066, metrics={'train_runtime': 104.8336, 'train_samples_per_second': 85.068, 'train_steps_per_second': 5.323, 'total_flos': 143323774661868.0, 'train_loss': 0.049542664199747066, 'epoch': 2.0})
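
The step count in `TrainOutput` can be cross-checked against the training settings. The following is a back-of-the-envelope sketch: it estimates the training set size from the reported throughput (an approximation, since those figures are rounded) and assumes the run used a single device, so that the effective batch size equals `per_device_train_batch_size`.

```python
import math

# Values copied from the TrainOutput and TrainingArguments above
train_runtime = 104.8336
train_samples_per_second = 85.068
num_train_epochs = 2
batch_size = 16

# Total samples processed over all epochs, then per epoch
samples_seen = round(train_runtime * train_samples_per_second)
train_size = samples_seen // num_train_epochs

# The last batch of each epoch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(train_size, steps_per_epoch, total_steps)  # 4459 279 558
```

With the Trainer's default of logging the training loss every 500 steps, the first log entry falls after the end of epoch 1 (step 279), which would explain the `No log` entry in the first row of the results table.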

In [9]:
import pandas as pd
df = pd.DataFrame(tokenized_dataset["test"])
df = df[['sms', 'label']]
df = pd.concat(
    [
        df[df['label'] == 0].head(5),
        df[df['label'] == 1].head(5)
    ]
)
pd.set_option('display.max_colwidth', 200)
df

index,sms,label
0,Yup... Hey then one day on fri we can ask miwa and jiayin take leave go karaoke \n,0
1,Happy new years melody!\n,0
2,Think I could stop by in like an hour or so? My roommate's looking to stock up for a trip\n,0
3,I can make lasagna for you... vodka...\n,0
4,No rushing. I'm not working. I'm in school so if we rush we go hungry.\n,0
22,PRIVATE! Your 2003 Account Statement for shows 800 un-redeemed S. I. M. points. Call 08715203652 Identifier Code: 42810 Expires 29/10/0\n,1
31,URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050003091 from land line. Claim C52. Valid 12hrs only\n,1
48,"I want some cock! My hubby's away, I need a real man 2 satisfy me. Txt WIFE to 89938 for no strings action. (Txt STOP 2 end, txt rec £1.50ea. OTBox 731 LA1 7WS. )\n",1
49,Your unique user ID is 1172. For removal send STOP to 87239 customer services 08708034412\n,1
54,Double your mins & txts on Orange or 1/2 price linerental - Motorola and SonyEricsson with B/Tooth FREE-Nokia FREE Call MobileUpd8 on 08000839402 or2optout/HV9D\n,1


In [10]:
dataset_indices = list(df.index)

In [11]:
# Show the performance of the model on some test set examples

# dataset_indices holds the first five "not spam" and the first five "spam" test set examples
selected = tokenized_dataset["test"].select(dataset_indices)
predictions = trainer.predict(selected)


# Split text into lines of at most n characters, breaking at spaces where possible
def split_text(text, n=160):
    lines = []
    while len(text) > n:
        line = text[:n]
        space_index = line.rfind(" ")
        if space_index > 0:
            line = line[:space_index]
        lines.append(line)
        # Drop the space at the break so the next line does not start with it
        text = text[len(line):].lstrip(" ")
    lines.append(text)
    return "\n".join(lines)


# Print the selected test set samples with their predictions
for index, (pred, label, text) in enumerate(zip(predictions.predictions.argmax(axis=1), predictions.label_ids, selected["sms"])):
    print(f"{index}. pred={id2label[pred]} | label={id2label[label]} \n {split_text(text)}")

0. pred=not spam | label=not spam 
 Yup... Hey then one day on fri we can ask miwa and jiayin take leave go karaoke 

1. pred=not spam | label=not spam 
 Happy new years melody!

2. pred=not spam | label=not spam 
 Think I could stop by in like an hour or so? My roommate's looking to stock up for a trip

3. pred=not spam | label=not spam 
 I can make lasagna for you... vodka...

4. pred=not spam | label=not spam 
 No rushing. I'm not working. I'm in school so if we rush we go hungry.

5. pred=spam | label=spam 
 PRIVATE! Your 2003 Account Statement for shows 800 un-redeemed S. I. M. points. Call 08715203652 Identifier Code: 42810 Expires 29/10/0

6. pred=spam | label=spam 
 URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050003091 from land line. Claim C52. Valid 12hrs only

7. pred=spam | label=spam 
 I want some cock! My hubby's away, I need a real man 2 satisfy me. Txt WIFE to 89938 for no strings action. (Txt STOP 2 end, txt r

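If the exact breaking behavior of `split_text` does not matter, Python's standard `textwrap` module gives a similar result. This is an alternative sketch rather than what the notebook uses, and the sample message is invented.

```python
import textwrap

# Invented sample text, long enough to need wrapping
sample = "FREE entry! " * 30
wrapped = textwrap.fill(sample.strip(), width=80)

# Show the length of each wrapped line next to its content
for line in wrapped.split("\n"):
    print(len(line), line)
```

`textwrap.fill` breaks only at whitespace when it can, so no word is cut in the middle unless it is longer than the requested width.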
In [15]:
# for param in model.parameters():
#     # print the weights of the model
#     print(param.data[0,:10])


[model.base_model.embeddings.word_embeddings.weight.data[0,:10],
model.base_model.transformer.layer[0].attention.q_lin.weight.data[0,:10],
model.base_model.transformer.layer[1].attention.q_lin.weight.data[0,:10],
model.base_model.transformer.layer[2].attention.q_lin.weight.data[0,:10],
model.base_model.transformer.layer[3].attention.q_lin.weight.data[0,:10],
model.base_model.transformer.layer[4].attention.q_lin.weight.data[0,:10],
model.base_model.transformer.layer[5].attention.q_lin.weight.data[0,:10],
model.pre_classifier.weight.data[0,:10],
model.classifier.weight.data[0,:10],
]

# Compare these values with the ones from before training to verify that the weights have changed

[tensor([-0.0166, -0.0666, -0.0163, -0.0421, -0.0080, -0.0140, -0.0635, -0.0205,
         -0.0086, -0.0634], device='cuda:0'),
 tensor([-0.0026,  0.0226, -0.0206,  0.0321, -0.0106,  0.0477,  0.0287,  0.0287,
          0.0303,  0.0136], device='cuda:0'),
 tensor([ 0.0913, -0.0141, -0.0760,  0.1141, -0.0585,  0.0501, -0.1948,  0.0975,
          0.0030,  0.0678], device='cuda:0'),
 tensor([ 0.0705,  0.0559,  0.0193, -0.1212, -0.0017, -0.0213, -0.0078, -0.0307,
          0.0206, -0.0392], device='cuda:0'),
 tensor([ 0.0141, -0.0824,  0.0225, -0.0240, -0.0157, -0.0536, -0.0005,  0.0093,
          0.0343,  0.0108], device='cuda:0'),
 tensor([ 0.0511,  0.0027, -0.0665, -0.0275,  0.0445, -0.0402,  0.0479,  0.0373,
          0.0264, -0.0376], device='cuda:0'),
 tensor([-0.0548, -0.0462, -0.0248, -0.0028,  0.0238,  0.0012,  0.0135, -0.0256,
          0.0121,  0.0457], device='cuda:0'),
 tensor([-0.0122,  0.0332, -0.0152, -0.0170, -0.0125,  0.0020,  0.0004, -0.0015,
          0.0328,  0.0198], de