✏️ **Try it out!** Modify the previous training loop to fine-tune your model on the SST-2 dataset.

(This exercise is from https://huggingface.co/learn/nlp-course/chapter3/4?fw=pt)

In [None]:
!pip install -q datasets evaluate transformers[sentencepiece]

In [None]:
!pip install -q wandb

If running this notebook in Colab, please ensure that your Hugging Face `HF_TOKEN` and your Weights & Biases `WANDB_API_KEY` are added to your Colab secrets.

Alternatively, log in to Hugging Face and Weights & Biases by uncommenting and running the following two cells.

In [None]:
# !huggingface-cli login

In [None]:
# !wandb login
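
Before training, it can help to confirm that both credentials are actually visible to the notebook process. The sketch below is a minimal, hypothetical helper (`missing_credentials` is not part of either library) that assumes the tokens are exposed as the environment variables `HF_TOKEN` and `WANDB_API_KEY`; in Colab, secrets may first need to be copied into the environment.

```python
import os

REQUIRED_VARS = ("HF_TOKEN", "WANDB_API_KEY")

def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_credentials()
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
else:
    print("Both credentials found.")
```

If either name is reported as missing, the two login cells above are the fallback.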

In [None]:
import os
import random
import numpy as np
import torch

def seed_everything(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)
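
As a quick illustration of why seeding matters, the sketch below uses only Python's standard `random` module (an assumption made to keep the example dependency-free; the same idea applies to the NumPy and PyTorch generators seeded above). Re-seeding with the same value reproduces the same sequence of draws.

```python
import random

def draw(seed, n=5):
    """Seed Python's RNG and return n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw(42)
second = draw(42)
other = draw(43)

print(first == second)  # same seed: the two sequences are compared
print(first == other)   # different seed: the two sequences are compared
```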

In [None]:
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'

From the <a href="https://huggingface.co/datasets/sst2" target="_blank">dataset page</a>:

> The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges.

> Binary classification experiments on full sentences (negative or somewhat negative vs somewhat positive or positive with neutral sentences discarded) refer to the dataset as SST-2 or SST binary.

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("sst2")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 872
    })
    test: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 1821
    })
})

In [None]:
features = raw_datasets['train'].features
features

{'idx': Value(dtype='int32', id=None),
 'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None)}

In [None]:
id2label = {i: label for i, label in enumerate(features['label'].names)}  # `i` avoids shadowing the built-in `id`
id2label

{0: 'negative', 1: 'positive'}
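
The inverse mapping is often useful as well, for example when converting label names back to ids. A small sketch, using the two label names shown above (`label2id` is a name introduced here for illustration):

```python
id2label = {0: "negative", 1: "positive"}

# Invert the mapping: label name -> integer id
label2id = {label: i for i, label in id2label.items()}

print(label2id)
```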

Let's check the class distribution.

In [None]:
import pandas as pd

train_df = raw_datasets['train'].with_format("pandas")[:]
train_df

| | idx | sentence | label |
|---|---|---|---|
| 0 | 0 | hide new secretions from the parental units | 0 |
| 1 | 1 | contains no wit , only labored gags | 0 |
| 2 | 2 | that loves its characters and communicates som... | 1 |
| 3 | 3 | remains utterly satisfied to remain the same t... | 0 |
| 4 | 4 | on the worst revenge-of-the-nerds clichés the ... | 0 |
| ... | ... | ... | ... |
| 67344 | 67344 | a delightful comedy | 1 |
| 67345 | 67345 | anguish , anger and frustration | 0 |
| 67346 | 67346 | at achieving the modest , crowd-pleasing goals... | 1 |
| 67347 | 67347 | a patient viewer | 1 |


In [None]:
train_df['label'].value_counts()

1    37569
0    29780
Name: label, dtype: int64

In [None]:
(train_df['label'].value_counts() / len(train_df)) * 100

1    55.782565
0    44.217435
Name: label, dtype: float64

The classes are reasonably balanced: roughly 56% of the training examples are positive and 44% are negative.
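
To put the class split in context, a classifier that always predicts the majority class sets a simple baseline. The sketch below computes it from the training counts printed above; `majority_baseline` is a hypothetical helper, and the validation split may have a somewhat different ratio.

```python
def majority_baseline(counts):
    """Accuracy of always predicting the most frequent class."""
    total = sum(counts.values())
    return max(counts.values()) / total

train_counts = {1: 37569, 0: 29780}  # label -> count, from value_counts() above
baseline = majority_baseline(train_counts)
print(f"Majority-class baseline: {baseline:.2%}")
```

Any fine-tuned model should comfortably exceed this figure to be considered useful.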

In [None]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.model_max_length

512

In [None]:
examples = raw_datasets['train'][:4]
examples

{'idx': [0, 1, 2, 3],
 'sentence': ['hide new secretions from the parental units ',
  'contains no wit , only labored gags ',
  'that loves its characters and communicates something rather beautiful about human nature ',
  'remains utterly satisfied to remain the same throughout '],
 'label': [0, 0, 1, 0]}

In [None]:
tokenizer(examples['sentence'], truncation=True, max_length=512)

{'input_ids': [[101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102], [101, 3397, 2053, 15966, 1010, 2069, 4450, 2098, 18201, 2015, 102], [101, 2008, 7459, 2049, 3494, 1998, 10639, 2015, 2242, 2738, 3376, 2055, 2529, 3267, 102], [101, 3464, 12580, 8510, 2000, 3961, 1996, 2168, 2802, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
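
In the output above, every `input_ids` list starts with 101 and ends with 102, which are the ids of BERT's `[CLS]` and `[SEP]` special tokens in the `bert-base-uncased` vocabulary. A small check over the first two encoded sentences, copied from that output (`has_special_tokens` is an illustrative helper):

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids in bert-base-uncased

input_ids = [
    [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],
    [101, 3397, 2053, 15966, 1010, 2069, 4450, 2098, 18201, 2015, 102],
]

def has_special_tokens(ids):
    """True if the sequence is wrapped as [CLS] ... [SEP]."""
    return ids[0] == CLS_ID and ids[-1] == SEP_ID

lengths = [len(ids) for ids in input_ids]
print(all(has_special_tokens(ids) for ids in input_ids))
print(lengths)
```

The differing lengths are why padding is deferred to batch time rather than applied here.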

In [None]:
def tokenize_function(examples):
    return tokenizer(examples['sentence'], truncation=True, max_length=512)

In [None]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['idx', 'sentence', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['idx', 'sentence', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 872
    })
    test: Dataset({
        features: ['idx', 'sentence', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1821
    })
})

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(['idx', 'sentence'])
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')
tokenized_datasets.set_format("torch")
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 872
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1821
    })
})

In [None]:
example = tokenized_datasets['train'][0]
for k, v in example.items():
    print(f"{k}: {v}")

labels: 0
input_ids: tensor([  101,  5342,  2047,  3595,  8496,  2013,  1996, 18643,  3197,   102])
token_type_ids: tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
attention_mask: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])


In [None]:
example['labels']

tensor(0)

In [None]:
id2label[example['labels'].item()]

'negative'

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
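
`DataCollatorWithPadding` pads each batch to the length of its longest sequence rather than to a fixed global length. The pure-Python sketch below mimics that idea for illustration only: `pad_batch` is a hypothetical helper, not the library's implementation, and it assumes a padding id of 0 (BERT's `[PAD]` id) and right-side padding. The token ids are placeholders.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad variable-length id lists to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

padded = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2307, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch keeps short batches short, which usually saves computation compared with padding everything to 512 tokens.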

In [None]:
from torch.utils.data import DataLoader

batch_size = 32
train_dl = DataLoader(tokenized_datasets['train'], batch_size=batch_size, shuffle=True, collate_fn=data_collator, num_workers=2)
val_dl = DataLoader(tokenized_datasets['validation'], batch_size=batch_size, shuffle=False, collate_fn=data_collator, num_workers=2)
len(train_dl), len(val_dl)

(2105, 28)
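
The two lengths follow from the split sizes and the batch size: with `drop_last` left at its default of `False`, a `DataLoader` yields the number of examples divided by the batch size, rounded up. A quick check (`num_batches` is an illustrative helper):

```python
import math

def num_batches(num_examples, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(num_examples / batch_size)

train_batches = num_batches(67349, 32)
val_batches = num_batches(872, 32)
print(train_batches, val_batches)
```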

In [None]:
# Sanity check:
for batch in train_dl:
    break
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([32]),
 'input_ids': torch.Size([32, 33]),
 'token_type_ids': torch.Size([32, 33]),
 'attention_mask': torch.Size([32, 33])}

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cuda')

In [None]:
model = model.to(device)

In [None]:
# Sanity check:
batch = {k: v.to(device) for k, v in batch.items()}
output = model(**batch)
output

SequenceClassifierOutput(loss=tensor(0.6751, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[ 0.2692,  0.0768],
        [ 0.2673, -0.0022],
        [ 0.2352, -0.1076],
        [ 0.2086, -0.3152],
        [ 0.2320, -0.0012],
        [ 0.2137,  0.1305],
        [ 0.3201, -0.0364],
        [ 0.2502, -0.1508],
        [ 0.2656, -0.0365],
        [ 0.2057, -0.0881],
        [ 0.1889,  0.2329],
        [ 0.2673,  0.1059],
        [ 0.2805,  0.0771],
        [ 0.2659,  0.2067],
        [ 0.3059, -0.0498],
        [ 0.3487, -0.0887],
        [ 0.2121, -0.0198],
        [ 0.2409,  0.0659],
        [ 0.1603,  0.1637],
        [ 0.1797, -0.0110],
        [ 0.2672,  0.1040],
        [ 0.2379,  0.1663],
        [ 0.2489, -0.2742],
        [ 0.3049, -0.0914],
        [ 0.2763, -0.0045],
        [ 0.2838,  0.1267],
        [ 0.2690, -0.0442],
        [ 0.2672, -0.2779],
        [ 0.2967, -0.0637],
        [ 0.2731, -0.1183],
        [ 0.2111, -0.2203],
        [ 0.2280, -0.0512]], devic

**Note:** If this output is identical every time the notebook is run from the top through the cell above, the random seed has been set correctly.
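
The sanity-check loss above (0.6751) is also close to what an untrained two-class head should give: when the predicted probabilities are near uniform, the cross-entropy is about ln 2 ≈ 0.693. A minimal sketch of that calculation in plain Python, with `cross_entropy` as an illustrative helper:

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one example: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

uniform_loss = cross_entropy([0.0, 0.0], label=1)
print(round(uniform_loss, 4), round(math.log(2), 4))
```

A starting loss far from this value would be a hint that something is off in the labels or the model head.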

In [None]:
from torch.optim import AdamW

learning_rate = 5e-5
optimizer = AdamW(model.parameters(), lr=learning_rate)

In [None]:
lr_scheduler_type = "linear"
num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
num_warmup_steps = 0
num_training_steps, num_warmup_steps

(6315, 0)

In [None]:
from transformers import get_scheduler

lr_scheduler = get_scheduler(
    lr_scheduler_type,
    optimizer=optimizer,
    num_training_steps=num_training_steps,
    num_warmup_steps=num_warmup_steps
)
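
With zero warmup steps, the `"linear"` schedule decays the learning rate from its initial value to 0 over the training steps. The sketch below reproduces that shape in plain Python; `linear_lr` is a hypothetical helper written for illustration, not the library function, and its defaults are the values used in this notebook.

```python
def linear_lr(step, base_lr=5e-5, num_training_steps=6315, num_warmup_steps=0):
    """Learning rate after `step` scheduler steps: linear warmup, then linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

for step in (0, 2105, 4210, 6315):  # start of each epoch, then the end of training
    print(step, linear_lr(step))
```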

In [None]:
from tqdm.auto import tqdm

def train_epoch():
    print("Training...")
    model.train()
    train_loss = 0
    progress_bar = tqdm(range(len(train_dl)))
    for batch in train_dl:
        batch = {k: v.to(device) for k, v in batch.items()}
        output = model(**batch)
        loss = output.loss
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
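    # Note: output.loss is already the mean over each batch, so dividing the sum by the
    # number of examples (rather than len(train_dl)) reports roughly 1/batch_size of the mean batch loss.
    train_loss /= len(tokenized_datasets['train'])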
    train_loss = round(train_loss, 4)
    return train_loss
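
Because `output.loss` is already averaged over the examples in a batch, the value returned above (the sum of batch means divided by the number of examples) is roughly `1 / batch_size` of the mean per-batch loss. The sketch below, with the hypothetical helper `to_mean_batch_loss`, rescales a reported value back to a per-batch mean using the sizes from this notebook. The result is approximate, since the reported loss is rounded to four decimals.

```python
def to_mean_batch_loss(reported_loss, num_examples, num_batches):
    """Undo the division by num_examples and divide by num_batches instead."""
    return reported_loss * num_examples / num_batches

# 67349 training examples in 2105 batches; 0.0064 is the epoch 0 training loss
approx = to_mean_batch_loss(0.0064, num_examples=67349, num_batches=2105)
print(round(approx, 2))
```

Read this way, the reported training and validation losses are on a comparable scale to the sanity-check loss computed on a single batch.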

In [None]:
import evaluate

def validate_epoch():
    print("Validating...")
    model.eval()
    val_loss = 0
    metrics = evaluate.load("glue", "sst2")
    progress_bar = tqdm(range(len(val_dl)))
    for batch in val_dl:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            output = model(**batch)
        loss = output.loss
        val_loss += loss.item()
        logits = output.logits
        preds = torch.argmax(logits, dim=-1)
        metrics.add_batch(predictions=preds, references=batch['labels'])
        progress_bar.update(1)
    # Note: as in train_epoch, this divides a sum of per-batch mean losses by the number of examples.
    val_loss /= len(tokenized_datasets['validation'])
    val_loss = round(val_loss, 4)
    computed_metrics = metrics.compute()
    acc = round(computed_metrics['accuracy'], 4)
    return val_loss, acc
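
The accuracy reported by the metric is the fraction of examples whose highest-scoring class matches the reference label. A dependency-free sketch of that computation, using made-up logits and labels purely for illustration:

```python
def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose argmax equals the reference label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

toy_logits = [[0.3, -0.1], [-0.2, 0.9], [0.5, 0.4], [0.1, 0.2]]  # hypothetical values
toy_labels = [0, 1, 1, 1]
toy_accuracy = accuracy(toy_logits, toy_labels)
print(toy_accuracy)
```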

In [None]:
import wandb

wandb.init(
    project="bert-base-uncased-finetuned-sst2-v2",
    config={
        'checkpoint': "bert-base-uncased",
        'dataset': "SST-2",
        'learning_rate': learning_rate,
        'num_epochs': num_epochs,
        'batch_size': batch_size,
        'lr_scheduler_type': lr_scheduler_type,
        'num_warmup_steps': num_warmup_steps
    }
)

In [None]:
for epoch in range(num_epochs):
    print(f"Epoch: {epoch}")

    train_loss = train_epoch()
    print(f"Training loss: {train_loss}")

    val_loss, acc = validate_epoch()
    print(f"Validation loss: {val_loss}\nAccuracy: {acc}")

    wandb.log({
        'train_loss': train_loss,
        'val_loss': val_loss,
        'accuracy': acc
    })

    print("Pushing model...")
    model.push_to_hub("bert-base-uncased-finetuned-sst2-v2", commit_message=f"epoch: {epoch}, accuracy: {acc}")
    print("---")
wandb.finish()
print("---")
print("Pushing tokenizer...")
tokenizer.push_to_hub("bert-base-uncased-finetuned-sst2-v2", commit_message="pushing tokenizer")
print("Done!")

Epoch: 0
Training...
Training loss: 0.0064
Validating...
Validation loss: 0.0062
Accuracy: 0.9243
Pushing model...
---
Epoch: 1
Training...
Training loss: 0.003
Validating...
Validation loss: 0.0064
Accuracy: 0.9278
Pushing model...
---
Epoch: 2
Training...
Training loss: 0.0015
Validating...
Validation loss: 0.0077
Accuracy: 0.9232
Pushing model...
---

| Metric | History |
|---|---|
| accuracy | ▃█▁ |
| train_loss | █▃▁ |
| val_loss | ▁▂█ |

| Metric | Final value |
|---|---|
| accuracy | 0.9232 |
| train_loss | 0.0015 |
| val_loss | 0.0077 |


---
Pushing tokenizer...
Done!
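
Across the three epochs logged above, training loss keeps falling while validation loss rises, and accuracy peaks at epoch 1, which suggests the model may start to overfit after the second epoch. One common response, sketched below as an assumption rather than something this notebook does, is to keep the checkpoint with the best validation accuracy. The `history` list simply restates the printed metrics.

```python
history = [
    {"epoch": 0, "train_loss": 0.0064, "val_loss": 0.0062, "accuracy": 0.9243},
    {"epoch": 1, "train_loss": 0.0030, "val_loss": 0.0064, "accuracy": 0.9278},
    {"epoch": 2, "train_loss": 0.0015, "val_loss": 0.0077, "accuracy": 0.9232},
]

# Pick the epoch with the highest validation accuracy
best = max(history, key=lambda row: row["accuracy"])
print(f"Best epoch: {best['epoch']} (accuracy {best['accuracy']})")
```

Since the loop pushes one commit per epoch with the accuracy in the commit message, the weights from that epoch should be recoverable from the repository history.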
