# HW11.1 Fine-tuning a BERT LLM using the HuggingFace Transformers library

In this homework, we will step away from TensorFlow Keras for a moment and instead use the Transformers library from HuggingFace (https://huggingface.co/). HuggingFace is a community that hosts pre-trained models, from LLMs to computer vision and audio ML models. Its `transformers` library gives you easy access to SOTA LLMs and the tools to fine-tune them, and its `datasets` library (a generic name, but that is what the library is called) provides standard benchmark datasets.

Specifically, in this homework you will:
1. Walk through an example of loading the `sst2` dataset (the Stanford Sentiment Treebank, essentially a dataset for sentiment analysis) from the `GLUE` benchmark we talked about in class. GLUE covers a range of NLP tasks and is used to benchmark LLMs. After you load the dataset, you will see some example usages for inspecting it.
2. From the `transformers` library, load the pre-trained LLM called DistilBERT, a smaller variant of the famous BERT LLM.
3. Fine-tune (train further) the DistilBERT model on the `sst2` dataset to achieve better performance.
4. Evaluate your fine-tuned model on `sst2` and compare it with: (1) the model before fine-tuning; (2) the default model in the HuggingFace library that was fine-tuned by experts.

Please complete all tasks/code and answer all questions. 

## Requirements

You will need at least the following libraries:

```
!pip install datasets
!pip install transformers
!pip install accelerate -U
!pip install torchinfo
```

# 1. Load SST2 data

In [1]:
from datasets import load_dataset
import numpy as np

# to view the GLUE - SST2 dataset and what it is about, see: https://huggingface.co/datasets/nyu-mll/glue
# essentially this is the Stanford Sentiment Treebank dataset for sentiment analysis
datasets = load_dataset("glue", "sst2")



In [2]:
# you can inspect this dataset and see what it contains
# you will see it has been divided into three parts: train, validation, and test
datasets

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

## Task 1: inspect data text and labels 

What are the labels? What do labels 0 and 1 represent? Take note of the keys in this dictionary and their values.

In [3]:
# TODO: inspect the first three examples in the datasets
# YOUR CODE HERE


# 2. Load pre-trained model DistilBERT and preprocess text

We've talked about how each LLM comes with its own (learned, subword) tokenizer. Here, when we load the pre-trained LLM, we also load its tokenizer.

In [4]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences = tokenizer(datasets['train'][:3]['sentence'])

## Task 2: understand what tokenizer is doing
We have now used the tokenizer to tokenize the first three sentences in the train dataset. Inspect the tokenized sentences. Take the first sentence: it is now represented by a sequence of integer indices. Can you map them back to the actual subword units to see how the tokenizer breaks up the words?

Hint: you can call `dir(tokenizer)` to find out how to convert ids to tokens. This works for any object in Python.
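To build intuition for what "subword units" means before you inspect the real tokenizer, here is a minimal sketch of greedy longest-match splitting in the WordPiece style. The vocabulary `TOY_VOCAB` and the helpers `wordpiece`, `encode`, and `decode` are hypothetical and made up for illustration; the real DistilBERT tokenizer has a learned vocabulary and extra steps (special tokens, punctuation handling) that are not modeled here.

```python
# Hypothetical toy vocabulary; the list position of each piece is its integer id.
TOY_VOCAB = ["[UNK]", "un", "##believ", "##able", "movie", "great", "##ly"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(TOY_VOCAB)}

def wordpiece(word):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in TOKEN_TO_ID:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text):
    """Lowercase, split on whitespace, and map every piece to its id."""
    return [TOKEN_TO_ID[p] for w in text.lower().split() for p in wordpiece(w)]

def decode(ids):
    """Map integer ids back to their sub-word pieces."""
    return [TOY_VOCAB[i] for i in ids]

ids = encode("Unbelievable movie")
print(ids)
print(decode(ids))
```

The integer ids by themselves are not readable; mapping them back through the vocabulary is what reveals how a word was split.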

In [5]:
# YOUR CODE HERE

The following function applies the tokenizer to all data.

In [6]:
def tokenize_fn(batch):
  return tokenizer(batch['sentence'], truncation=True)

In [None]:
tokenized_datasets = datasets.map(tokenize_fn, batched=True)
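If it is unclear why `tokenize_fn` indexes `batch['sentence']` and gets many sentences at once, the following simplified sketch may help. It assumes that a batched map hands the function a dictionary of columns (each a list) and merges the returned columns back into the rows. `toy_map` and `toy_tokenize_fn` are hypothetical stand-ins, not the `datasets` implementation, and the whitespace split with a 4-token limit only imitates tokenization with truncation.

```python
def toy_map(rows, fn, batch_size=2):
    """Apply fn to column-wise batches and merge the new columns into each row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # turn a list of row dicts into a dict of column lists
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = fn(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_cols.items():
                merged[key] = values[i]
            out.append(merged)
    return out

def toy_tokenize_fn(batch, max_length=4):
    # stand-in for a real tokenizer: whitespace split, truncated to max_length
    return {"tokens": [s.split()[:max_length] for s in batch["sentence"]]}

rows = [
    {"sentence": "a truly wonderful little film", "label": 1},
    {"sentence": "boring", "label": 0},
    {"sentence": "not bad", "label": 1},
]
result = toy_map(rows, toy_tokenize_fn)
for row in result:
    print(row)
```

The original columns are kept and the new column is added next to them, which matches how `tokenized_datasets` still contains `sentence` and `label`.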

# 3. Fine-tune the pre-trained DistilBERT model

In [8]:
from transformers import TrainingArguments
from transformers import AutoModelForSequenceClassification

In [9]:
training_args = TrainingArguments(
  'my_trainer',
  evaluation_strategy='epoch',
  save_strategy='epoch',
  num_train_epochs=1,
)

In [10]:
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
# this warning above tells you that this pretrained model was topped with a newly 
# initialized classifier that needs to be trained/fine-tuned
# let's inspect this model and understand its internal structure

model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [14]:
from torchinfo import summary
# another way to inspect the model
summary(model)

Layer (type:depth-idx)                                  Param #
DistilBertForSequenceClassification                     --
├─DistilBertModel: 1-1                                  --
│    └─Embeddings: 2-1                                  --
│    │    └─Embedding: 3-1                              23,440,896
│    │    └─Embedding: 3-2                              393,216
│    │    └─LayerNorm: 3-3                              1,536
│    │    └─Dropout: 3-4                                --
│    └─Transformer: 2-2                                 --
│    │    └─ModuleList: 3-5                             42,527,232
├─Linear: 1-2                                           590,592
├─Linear: 1-3                                           1,538
├─Dropout: 1-4                                          --
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
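As a sanity check, the parameter counts in the summary can be reproduced by hand from the layer shapes. The sketch below does that arithmetic. The sizes 30522, 512, 768, and the 6 blocks come from the model printout above; the feed-forward width `FFN_DIM = 3072` is an assumption, because the printout is cut off before the FFN layers are shown.

```python
HIDDEN = 768       # embedding / hidden width, from the printout
VOCAB = 30522      # rows of word_embeddings, from the printout
MAX_POS = 512      # rows of position_embeddings, from the printout
N_LAYERS = 6       # "6 x TransformerBlock" in the printout
NUM_LABELS = 2     # set when the model was loaded
FFN_DIM = 3072     # ASSUMPTION: not visible in the truncated printout

def linear(n_in, n_out):
    """Parameters of a Linear layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN                      # scale and shift vectors
word_emb = VOCAB * HIDDEN
pos_emb = MAX_POS * HIDDEN

attention = 4 * linear(HIDDEN, HIDDEN)       # q_lin, k_lin, v_lin, out_lin
ffn = linear(HIDDEN, FFN_DIM) + linear(FFN_DIM, HIDDEN)
block = attention + layer_norm + ffn + layer_norm
transformer = N_LAYERS * block

pre_classifier = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, NUM_LABELS)

total = word_emb + pos_emb + layer_norm + transformer + pre_classifier + classifier
print(f"{word_emb:,} {pos_emb:,} {transformer:,} {total:,}")
```

Under that assumption the numbers match the summary line by line, and they show that the two newly initialized classifier layers account for well under 1% of all parameters.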

In [15]:
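from transformers import Trainer
import evaluate  # requires `pip install evaluate`; replaces the deprecated datasets.load_metric

# load the GLUE SST-2 metric (accuracy) once, outside the function
metric = evaluate.load("glue", "sst2")

# define function to compute metrics
def compute_metrics(logits_and_labels):
  # the trainer passes a (logits, labels) pair
  logits, labels = logits_and_labels
  # predicted class = index of the largest logit in each row
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)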

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
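To see what the metric function does with the model output, here is a small pure-Python sketch of the same two steps: pick the index of the largest logit in each row (what `np.argmax(logits, axis=-1)` does), then compare against the labels. The logits and labels are made-up toy values, and `argmax_last_axis` and `accuracy` are hypothetical helper names.

```python
def argmax_last_axis(logits):
    """Index of the largest value in each row, like np.argmax(logits, axis=-1)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy(predictions, references):
    """Fraction of predictions that equal the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}

# toy values: 4 sentences, 2 logits each (class 0 and class 1)
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 3.0]]
toy_labels = [0, 1, 1, 1]

preds = argmax_last_axis(toy_logits)
print(preds)                        # [0, 1, 0, 1]
print(accuracy(preds, toy_labels))  # {'accuracy': 0.75}
```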


In [16]:
# set up trainer to fine-tune the model
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)



## Task 3: fine-tune the model for 1 epoch!
Note that this might take some time. 

Note that the number of epochs was set above in the training arguments. 

After fine-tuning for 1 epoch, report the final accuracy. 

In [None]:
trainer.train()

In [None]:
# save the model to disk so that you can load it back later
trainer.save_model('my_saved_model')

# use this code to massage the labels into something interpretable, NEGATIVE, POSITIVE
import json
config_path = 'my_saved_model/config.json'
with open(config_path) as f:
  j = json.load(f)

j['id2label'] = {0: 'NEGATIVE', 1: 'POSITIVE'}

with open(config_path, 'w') as f:
  json.dump(j, f, indent=2)
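One detail worth knowing about the edit above: JSON object keys are always strings, so the integer keys `0` and `1` end up as `"0"` and `"1"` in `config.json`. The sketch below shows the round trip in memory, without touching any file. Converting the keys back with `int` is only needed if you read the file yourself.

```python
import json

config = {"id2label": {0: "NEGATIVE", 1: "POSITIVE"}}

# serialize and parse again, as writing and re-reading the file would do
text = json.dumps(config, indent=2)
restored = json.loads(text)

print(list(restored["id2label"].keys()))  # ['0', '1']

# convert the keys back to integers when reading the raw JSON directly
id2label = {int(k): v for k, v in restored["id2label"].items()}
print(id2label[1])  # POSITIVE
```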

## Use the saved model for inference on new sentences

Now you can use this newly fine-tuned model to build a `pipeline`, an object in the transformers library. The pipeline can be used to run inference on an input sentence.

In [18]:
from transformers import pipeline
new_model = pipeline('text-classification', model='my_saved_model')

# test your new pipeline
new_model('This movie is great!')

# test with more examples 
# YOUR CODE HERE

[{'label': 'POSITIVE', 'score': 0.9994922876358032}]
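The `score` in this output is a probability, not a raw model output. A minimal sketch of the likely post-processing, assuming the two class logits are passed through a softmax and the top class is reported: the logits below are made up, and `softmax` and `to_prediction` are hypothetical helpers rather than the pipeline's actual code.

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits):
    """Report the most probable class and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}

# hypothetical logits for one sentence: [negative, positive]
prediction = to_prediction([-3.5, 4.1])
print(prediction)
```

A large gap between the two logits gives a score close to 1, which is why confident predictions show values like 0.999.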

# 4. Evaluate the model: how was the result of the fine-tuning?

Once you have trained a model, it is always important to show through proper evaluation that the fine-tuned model is indeed better than it was before fine-tuning, and to compare it with models fine-tuned by other people.

To use HuggingFace's evaluator, install:
`!pip install evaluate`

In [28]:
from evaluate import evaluator

# first let's load the test portion of the sst2 data
test_datasets = load_dataset("glue", "sst2", split="test")

# let's compare three models and evaluate them against each other.

# Model 1: the pre-trained DistilBERT model as is. Since it is topped with a newly
# initialized (untrained) classifier, it is expected to have low performance.
# let's load this model again.
checkpoint = "distilbert-base-uncased"
from transformers import AutoModelForSequenceClassification
model_distillBERT = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:
# Model 2: the model you fine tuned. For this one, we already have the pipeline 
# called new_model, we can use this directly for evaluation. 

In [22]:
# Model 3: the default model for the evaluator if you don't give it any model.
# i.e., you would not supply the argument for model_or_pipeline in the following.
# In this case, it defaults to a model that was fine-tuned by others. 

## Task 4: evaluate the three models!
Report the results for Models 1, 2, and 3 above on the `test` portion of the `sst2` dataset. What results do you get? Can you think of why?

Now try testing the three models on the `validation` portion of the same dataset. Report the results. What do you observe?

Hint 1: if you are testing a certain model and get an error about the labels, you might want to swap in one of the commented-out lines below in place of another line.

Hint 2: if you can't figure out what's wrong with your accuracy, try going back to inspect the data!
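To make Hint 1 more concrete, here is a simplified sketch of what a label mapping is for: the pipeline returns label strings, the dataset stores integer labels, and the mapping translates one into the other before accuracy is computed. `accuracy_with_mapping` is a hypothetical helper and the outputs are made-up values; the real evaluator may handle a mismatch differently.

```python
def accuracy_with_mapping(pipeline_outputs, references, label_mapping):
    """Translate predicted label strings to integer ids, then score them."""
    predictions = [label_mapping[out["label"]] for out in pipeline_outputs]
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# made-up pipeline outputs and integer reference labels
outputs = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.91},
    {"label": "POSITIVE", "score": 0.60},
]
refs = [1, 0, 0]

acc = accuracy_with_mapping(outputs, refs, {"NEGATIVE": 0, "POSITIVE": 1})
print(acc)  # 2 of 3 predictions are correct

# a mapping whose keys do not match the model's label names fails on lookup
try:
    accuracy_with_mapping(outputs, refs, {"LABEL_0": 0, "LABEL_1": 1})
    mismatch = None
except KeyError as err:
    mismatch = str(err)
print("unknown label:", mismatch)
```

The keys of the mapping have to be exactly the label names that the model under test emits, so the right mapping can differ from model to model.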


In [32]:
# setting up the evaluator

from evaluate import load
task_evaluator = evaluator("text-classification")
eval_results = task_evaluator.compute(
    model_or_pipeline=None, #YOUR CODE HERE
    data=None, #YOUR CODE HERE
    input_column="sentence",
    tokenizer=tokenizer,
    metric='accuracy',
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
    #label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0}
)


In [None]:
print(eval_results)