# Homework 8

In this homework, you will train a sentiment classifier on the [SST-2](https://huggingface.co/datasets/sst2) dataset using the pre-trained BERT model. For simplicity, I recommend using the [Hugging Face Transformers library](https://huggingface.co/docs/transformers/index). I've linked to corresponding tutorials below. You're welcome to use a different framework if you prefer.

# Problem 1

1. Fine-tune [TinyBERT](https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D) on SST-2 and evaluate the results. You can find a tutorial for loading BERT and fine-tuning [here](https://huggingface.co/docs/transformers/training). In that tutorial, you will need to change the dataset from `"yelp_review_full"` to `"sst2"` and the model from `"bert-base-uncased"` to `"huawei-noah/TinyBERT_General_4L_312D"`. You'll also need to modify the code since SST-2 is a two-class classification dataset (unlike the Yelp Reviews dataset, which is a five-class classification dataset).
2. Choose a different pre-trained BERT-style model from the [Hugging Face Model Hub](https://huggingface.co/models) and fine-tune it. There are tons of options - part of the homework is navigating the hub to find different models! I recommend picking a model that is smaller than BERT-Base (as TinyBERT is) just to make things computationally cheaper. Is the final validation accuracy higher or lower with this other model?

# Libraries


In [3]:
#!pip install torchinfo datasets transformers[torch] evaluate accelerate wandb huggingface_hub sentence-transformers

In [4]:
%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np

from sklearn.linear_model import LogisticRegression

import torch

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torchinfo import summary

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, pipeline
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer

from sentence_transformers import SentenceTransformer

import evaluate

import wandb

from huggingface_hub import notebook_login

I pushed my models on Hugging Face after training to keep a track of them, hence the token initialization (that I have removed for the submission). I have also been keeping track of the training by using Weights and Biases (linked to my HF profile).

In [5]:
hf_key = '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'

In [6]:
!huggingface-cli login --token $hf_key

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Loading the dataset and the metric

In [None]:
dataset = load_dataset("sst2")
max_length = 65

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

metric = evaluate.load("accuracy")

#TinyBERT

I loaded the TinyBERT model and the associated tokenizer, and padding/truncating to a max length of 65 tokens (65 being close to what the tokenizer was giving as a max length without padding and truncating).
I also created the training arguments for the HF trainer, using W&B and adding. the push to hub.

In [9]:
model_name = 'huawei-noah/TinyBERT_General_4L_312D'
output_dir = "Vishnou/TinyBERT_SST2"

tokenizer = AutoTokenizer.from_pretrained(model_name, max_length = max_length, padding = 'max_length', truncation = True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

training_args = TrainingArguments(num_train_epochs = 2, evaluation_strategy = 'steps', output_dir = output_dir, report_to="wandb", push_to_hub=True)

config.json:   0%|          | 0.00/409 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/62.7M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at huawei-noah/TinyBERT_General_4L_312D and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
def tokenize_function(examples):
    return tokenizer(examples["sentence"], max_length = max_length, padding = 'max_length', truncation = True)

In [11]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenizer.push_to_hub(output_dir)

Map:   0%|          | 0/67349 [00:00<?, ? examples/s]

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

Map:   0%|          | 0/1821 [00:00<?, ? examples/s]

CommitInfo(commit_url='https://huggingface.co/Vishnou/TinyBERT_SST2/commit/51d15a6e883c7274bd417d6976a82a7d285ba89b', commit_message='Upload tokenizer', commit_description='', oid='51d15a6e883c7274bd417d6976a82a7d285ba89b', pr_url=None, pr_revision=None, pr_num=None)

In [12]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
)

In [13]:
#%%wandb
trainer.train()

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Step,Training Loss,Validation Loss,Accuracy
500,0.4686,0.401952,0.833716
1000,0.384,0.366629,0.836009
1500,0.381,0.395145,0.833716
2000,0.3609,0.437777,0.855505
2500,0.3616,0.374283,0.847477
3000,0.3521,0.369229,0.858945
3500,0.3113,0.507223,0.848624
4000,0.319,0.421195,0.861239
4500,0.3034,0.455544,0.864679
5000,0.3098,0.416313,0.863532


TrainOutput(global_step=16838, training_loss=0.27650495384860624, metrics={'train_runtime': 450.508, 'train_samples_per_second': 298.991, 'train_steps_per_second': 37.376, 'total_flos': 245201596425240.0, 'train_loss': 0.27650495384860624, 'epoch': 2.0})

After training for 2 epochs, the validation accuracy is already at 88.6%, which is a pretty good result. The training loss reaches a plateau around 0.21, and the validation loss is around 0.51 at that point. With a few more epochs (I tried with 5 epochs), the training loss reaches 0.1 but the gain in evaluation accuracy is not that huge, so stopping the training at 2 epochs seems like a good compromise between computational cost (around 3min per epoch) and accuracy.

In [14]:
trainer.push_to_hub('Vishnou/TinyBERT_SST2')

model.safetensors:   0%|          | 0.00/57.4M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Vishnou/TinyBERT_SST2/commit/5aa94d561e626f70a4380e2a51c786e64f3acad7', commit_message='Upload BertForSequenceClassification', commit_description='', oid='5aa94d561e626f70a4380e2a51c786e64f3acad7', pr_url=None, pr_revision=None, pr_num=None)

We can predict on a the test dataset, which is not labelled:

In [15]:
classifier = pipeline("sentiment-analysis", model="Vishnou/TinyBERT_SST2", tokenizer = tokenizer)
preds = classifier(dataset['test']['sentence'])
preds = [int(pred['label'][-1]) for pred in preds]

In [32]:
for i in range(5):
    print(dataset['test']['sentence'][i])
    print('Label: ' + str(preds[i]))
    print('\n')

uneasy mishmash of styles and genres .
Label: 0


this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .
Label: 0


by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .
Label: 1


director rob marshall went out gunning to make a great one .
Label: 1


lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new .
Label: 1




#DistilBERT

I then chose a DistilBERT model (with 67M parameters, which is more than TinyBERT but around half the size of BERT-base).

In [16]:
model_name = 'distilbert-base-uncased'
distil_output_dir = "Vishnou/distilbert_base_SST2"

distil_tokenizer = AutoTokenizer.from_pretrained(model_name, max_length = max_length, padding = 'max_length', truncation = True)
distil_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

distil_training_args = TrainingArguments(num_train_epochs = 2, evaluation_strategy = 'steps', output_dir = distil_output_dir, report_to="wandb", push_to_hub=True)

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
def distil_tokenize_function(examples):
    return distil_tokenizer(examples["sentence"], max_length = max_length, padding = 'max_length', truncation = True)

In [18]:
distil_tokenized_datasets = dataset.map(distil_tokenize_function, batched=True)
distil_tokenizer.push_to_hub(distil_output_dir)

Map:   0%|          | 0/67349 [00:00<?, ? examples/s]

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

Map:   0%|          | 0/1821 [00:00<?, ? examples/s]

CommitInfo(commit_url='https://huggingface.co/Vishnou/distilbert_base_SST2/commit/5d64711481eee78305b8d5841e3088de71a640f0', commit_message='Upload tokenizer', commit_description='', oid='5d64711481eee78305b8d5841e3088de71a640f0', pr_url=None, pr_revision=None, pr_num=None)

In [19]:
distil_trainer = Trainer(
    model=distil_model,
    args=distil_training_args,
    train_dataset=distil_tokenized_datasets["train"],
    eval_dataset=distil_tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
)

In [20]:
#%%wandb
distil_trainer.train()

Step,Training Loss,Validation Loss,Accuracy
500,0.4378,0.345167,0.860092
1000,0.343,0.3483,0.857798
1500,0.3342,0.337338,0.870413
2000,0.308,0.410229,0.881881
2500,0.2932,0.354599,0.883028
3000,0.3116,0.360926,0.87156
3500,0.2805,0.379957,0.894495
4000,0.2655,0.413082,0.884174
4500,0.2504,0.429886,0.883028
5000,0.2543,0.519639,0.872706


TrainOutput(global_step=16838, training_loss=0.20721012708593936, metrics={'train_runtime': 1368.6814, 'train_samples_per_second': 98.414, 'train_steps_per_second': 12.302, 'total_flos': 2265236500333560.0, 'train_loss': 0.20721012708593936, 'epoch': 2.0})

After two epochs, we already reach a training loss of 0.13, a validation loss of 0.42 and a validation accuracy of 90%, showing that this model performs better and reaches a higher accuracy faster (even after one epoch) than TinyBERT, thanks to the number of parameters.
However, fine-tuning this heavier model for two epochs took around twice as much time as the fine-tuning for TinyBERT.

This same model was finetuned on the SST-2 dataset and is available on HF ('distilbert-base-uncased-finetuned-sst-2-english') and reaches a validation accuracy of 99%, which shows that with more training epochs and a convenient choice of hyperparameters, this architecture can be fine-tuned into a nearly-perfect model (on the validation set).

In [21]:
distil_trainer.push_to_hub(distil_output_dir)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/57.4M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Vishnou/distilbert_base_SST2/commit/06bb8d20a4e65d738f59cf75ea37779b066eb790', commit_message='Upload BertForSequenceClassification', commit_description='', oid='06bb8d20a4e65d738f59cf75ea37779b066eb790', pr_url=None, pr_revision=None, pr_num=None)

# Problem 2

Instead of fine-tuning the full model on a target dataset, it's also possible to use the output representations from a BERT-style model as input to a linear classifier and *only* train the classifier (leaving the rest of the pre-trained parameters fixed). You can do this easily using the [`sentence-transformers`](https://www.sbert.net/) library. Using `sentence-tranformers` gives you back a fixed-length representation of a given text sequence. To achieve this, you need to
1. Pick a pre-trained sentence Transformer.
2. Load the SST-2 dataset and feed the text from each example into the model.
3. Train a linear classifier on the representations.
4. Evaluate performance on the validation set.

For the second step, you can learn more about how to use Hugging Face datasets [here](https://huggingface.co/docs/datasets/index). For the third and fourth step, you can do this directly in PyTorch, or you can just collect the learned representations and use them as feature vectors to train a linear classifier in any other library (e.g. [scikit-learn](https://scikit-learn.org/stable/modules/linear_model.html).

After you complete the above steps, report whether the accuracy on the validation set is higher or lower using a fixed sentence Transformer.

In [22]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
model_name = 'all-distilroberta-v1'
model_sentence_tf = SentenceTransformer(model_name)

I just encoded everything under a NumPy array format, and used a LogisticRegression classifier from sklearn with that.

In [24]:
train_embeddings = model_sentence_tf.encode(dataset['train']['sentence'], show_progress_bar = True, device = device)
train_labels = np.array(dataset['train']['label'])

val_embeddings = model_sentence_tf.encode(dataset['validation']['sentence'], show_progress_bar = True, device = device)
val_labels = np.array(dataset['validation']['label'])

Batches:   0%|          | 0/2105 [00:00<?, ?it/s]

Batches:   0%|          | 0/28 [00:00<?, ?it/s]

In [25]:
clf = LogisticRegression(random_state=0).fit(train_embeddings, train_labels)
np.round(clf.score(train_embeddings, train_labels),3)

0.873

In [26]:
np.round(clf.score(val_embeddings, val_labels),3)

0.861

The accuracy we get on the validation set is a bit lower from what we had with TinyBERT, but for a way smaller training computation cost for us, thanks to the pre-training of the sentence transformer model 'all-distilroberta-v1'.