<a href="https://colab.research.google.com/github/zxcej/COMP691_LABS/blob/main/Lab8_HuggingFace_Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🤗 A Gentle Introduction to HuggingFace (HF)
---
HuggingFace provides you with a variety of pretrained models and
functionalities to train/fine-tune these models and make inferences.

Their [datasets](https://huggingface.co/docs/datasets/quickstart) library gives you access to many common NLP datasets. You can visualize these datasets on their [platform](https://huggingface.co/datasets) to get a sense of the data you would be working with.

In [None]:
!pip install datasets transformers

## 🌠 Our Goal
Our goal for this tutorial is to get familiar with the [transformers](https://huggingface.co/docs/transformers/index) library from HuggingFace and to fine-tune a pretrained model on a sequence classification task. More specifically, we will fine-tune a [BERT](https://arxiv.org/pdf/1810.04805.pdf)-style model (the code below uses `distilbert-base-uncased`) on the [Amazon Polarity](https://huggingface.co/datasets/amazon_polarity#data-instances) dataset.
> The Amazon reviews dataset consists of reviews from amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review.

> The Amazon reviews polarity dataset is constructed by taking review score 1 and 2 as negative, and 4 and 5 as positive. Samples of score 3 is ignored. Each class has 1,800,000 training samples and 200,000 testing samples.

Since the dataset is quite large, we will be working with only a subset of this dataset throughout this tutorial.
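
The labeling rule quoted above can be sketched in a few lines. This is only an illustration of the rule, not the code used to build the dataset; the function name `polarity_label` and the sample scores are hypothetical.

```python
from typing import Optional


def polarity_label(score: int) -> Optional[int]:
    """Map a 1-5 star review score to a polarity label.

    Returns 0 (negative) for scores 1-2, 1 (positive) for scores 4-5,
    and None for the neutral score 3, which the dataset leaves out.
    """
    if score in (1, 2):
        return 0
    if score in (4, 5):
        return 1
    if score == 3:
        return None
    raise ValueError(f"score must be between 1 and 5, got {score}")


# Hypothetical star ratings, used only to illustrate the rule.
scores = [5, 1, 3, 4, 2]
labels = [polarity_label(s) for s in scores]
kept = [label for label in labels if label is not None]
print(labels)  # [1, 0, None, 1, 0]
print(kept)    # [1, 0, 1, 0]
```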


## 🪜 Main Components
The main components we would need to develop to realize our goal are:

1. Load the data and make a dataset object for this task.
2. Write a collate function/class to tokenize/transform/truncate batches of inputs.
3. Make a custom model that uses a pretrained model as its backbone and is designed for our task at hand.
4. Write the training loop and train the model.

> ⚠️ These steps constitute the basic building blocks for solving any other problem using HF.

## 🛒 Loading data
In this stage we will load the data from the `datasets` library. We will only load a small subset of the original dataset here in order to reduce the training time, but feel free to run this code on the full dataset on your own time and experiment with it.


In [None]:
from datasets import load_dataset

dataset_train = load_dataset("amazon_polarity", split="train[:1000]")
dataset_test = load_dataset("amazon_polarity", split="test[:200]")

Reusing dataset amazon_polarity (/root/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/56923eeb72030cb6c4ea30c8a4e1162c26b25973475ac1f44340f0ec0f2936f4)
Reusing dataset amazon_polarity (/root/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/56923eeb72030cb6c4ea30c8a4e1162c26b25973475ac1f44340f0ec0f2936f4)


In [None]:
#@title 🔍 Quick look at the data { run: "auto" }
#@markdown Let's have a quick look at a few samples as well as the label distributions in our train and test set.
n_samples_to_see = 3 #@param {type: "integer"}
for i in range(n_samples_to_see):
  print("-"*30)
  print("title:", dataset_test[i]["title"])
  print("content:", dataset_test[i]["content"])
  print("label:", dataset_test[i]["label"])

------------------------------
title: Great CD
content: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"
label: 1
------------------------------
title: One of the best game music soundtracks - for a game I didn't really play
content: Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. 

In [None]:
def label_stats(ds):
    negative = 0
    positive = 0
    for i in range(ds.num_rows):
        if ds[i]["label"] == 1:
            positive += 1
        else:
            negative += 1
    return positive, negative

In [None]:
for i, ds in enumerate([dataset_train, dataset_test]):
    positive, negative = label_stats(ds)
    if i == 0:
        str_indicator = "train"
    else:
        str_indicator = "test"
    print("+-" * 15)
    print("Set:", str_indicator)
    print(f"Positive samples: {positive}\nNegative samples: {negative}")
    print(f"Percentage of overall positive samples: {(positive*100.0)/(positive+negative)}%")

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
Set: train
Positive samples: 462
Negative samples: 538
Percentage of overall positive samples: 46.2%
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
Set: test
Positive samples: 109
Negative samples: 91
Percentage of overall positive samples: 54.5%


## 🧲 Collate
Collate is a function that is called on every batch of data prepared by the [dataloader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader). Once we pass our dataset (e.g. `dataset_train`) to our dataloader, each batch will be a `list` of `dict` items, which cannot be fed directly to the model. We need to perform the following steps at this stage:


### 1️⃣ Tokenize the `text`
Tokenize the `text` portion of each sample (i.e. parse the text into smaller chunks). Tokenization can happen in many ways; traditionally it was done by splitting on white space. With transformer-based models, tokenization is based on how frequently each "chunk of text" occurs. These frequencies can be learnt in many different ways; one common approach, and the one BERT uses, is the [**wordpiece**](https://arxiv.org/pdf/1609.08144v2.pdf) model. 
> The wordpiece model is generated using a data-driven approach to maximize the language-model likelihood
of the training data, given an evolving word definition. Given a training corpus and a number of desired
tokens $D$, the optimization problem is to select $D$ wordpieces such that the resulting corpus is minimal in the
number of wordpieces when segmented according to the chosen wordpiece model.

Under this model:
1. Not all things can be converted to tokens depending on the model. For example, most models have been pretrained without any knowledge of emojis. So their token will be `[UNK]`, which stands for unknown.
2. Some words will be mapped to multiple tokens!
3. Depending on the kind of model, your tokens may or may not respect capitalization!
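
The three observations above can be reproduced with a small sketch of greedy longest-match-first segmentation, which is how WordPiece-style tokenizers split a word once a vocabulary is given. It does not show how the vocabulary is learnt. The function name `wordpiece_tokenize` and the toy vocabulary are assumptions made for illustration.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk_token: str = "[UNK]") -> List[str]:
    """Split one word into sub-word pieces by greedy longest-match-first lookup."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical; real vocabularies hold tens of thousands of entries).
toy_vocab = {"we", "are", "very", "ju", "##bil", "##ant", "library"}

sentence = "We are very jubilant 🤗"
tokens = []
for word in sentence.lower().split():  # lowercasing mimics an "uncased" model
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
print(tokens)
# ['we', 'are', 'very', 'ju', '##bil', '##ant', '[UNK]']
```

Here `jubilant` is split into three pieces, the emoji falls back to `[UNK]` because no piece of the toy vocabulary matches it, and the capital `W` is lost because the text is lowercased first.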

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
#@title 🔍 Quick look at tokenization { run: "auto", vertical-output: true }
input_sample = "We are very jubilant to demonstrate to you the 🤗 Transformers library." #@param {type: "string"}
tokenizer.tokenize(input_sample)

['we',
 'are',
 'very',
 'ju',
 '##bil',
 '##ant',
 'to',
 'demonstrate',
 'to',
 'you',
 'the',
 'transformers',
 'library',
 '.']

### 2️⃣ Encoding
Once we have tokenized the text, we need to convert these chunks to numbers so that we can feed them to our model. This conversion is basically a look-up in a dictionary **from `str` $\to$ `int`**. The tokenizer object can also perform this step, and while doing so it adds the *special* tokens needed by the model to the encodings. 
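
As a rough sketch of what this look-up amounts to, the snippet below encodes a token list with a tiny hand-written dictionary and wraps it in `[CLS]` and `[SEP]`. The vocabulary, its ids, and the helper names `encode` and `decode` are illustrative assumptions; the real tokenizer handles many more cases.

```python
from typing import Dict, List

# Toy vocabulary with illustrative ids; a real one has many thousands of entries.
toy_vocab: Dict[str, int] = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "we": 2057, "are": 2024, "very": 2200,
}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}


def encode(tokens: List[str]) -> List[int]:
    """Look up each token id and wrap the sequence in [CLS] ... [SEP]."""
    ids = [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]
    return [toy_vocab["[CLS]"]] + ids + [toy_vocab["[SEP]"]]


def decode(ids: List[int]) -> str:
    """Invert the look-up: map ids back to tokens and join them."""
    return " ".join(id_to_token[i] for i in ids)


ids = encode(["we", "are", "very", "happy"])
print(ids)          # [101, 2057, 2024, 2200, 100, 102]
print(decode(ids))  # [CLS] we are very [UNK] [SEP]
```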

In [None]:
#@title 🔍 Quick look at token encoding { run: "auto"}
input_sample = "We are very jubilant to demonstrate to you the 🤗 Transformers library." #@param {type: "string"}
print("--> Token Encodings:\n",tokenizer.encode(input_sample))
print("-."*15)
print("--> Token Encodings Decoded:\n",tokenizer.decode(tokenizer.encode(input_sample)))

--> Token Encodings:
 [101, 2057, 2024, 2200, 18414, 14454, 4630, 2000, 10580, 2000, 2017, 1996, 19081, 3075, 1012, 102]
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
--> Token Encodings Decoded:
 [CLS] we are very jubilant to demonstrate to you the transformers library. [SEP]


### 3️⃣ Truncate/Pad samples
Since the samples in a batch will not all have the same sequence length, we need to truncate the longer ones (i.e. the ones that exceed a predefined maximum length) and pad the shorter ones so that all the samples in the batch have equal length. Once this is achieved, we convert the result to `torch.Tensor`s and return it. These tensors will then be retrieved from the [dataloader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).
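
To make the truncate-then-pad step concrete, here is a minimal pure-Python sketch operating on already-encoded sequences. The helper name `pad_and_truncate`, the padding id `0`, and the sample ids are assumptions for illustration; in the tutorial the tokenizer call does this work.

```python
from typing import Dict, List


def pad_and_truncate(
    sequences: List[List[int]], max_len: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Cut sequences to max_len, then pad them to the longest one in the batch."""
    truncated = [seq[:max_len] for seq in sequences]
    target_len = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for three samples of different lengths.
batch = [[101, 7, 8, 9, 10, 11, 102], [101, 7, 102], [101, 7, 8, 102]]
out = pad_and_truncate(batch, max_len=5)
print(out["input_ids"])
# [[101, 7, 8, 9, 10], [101, 7, 102, 0, 0], [101, 7, 8, 102, 0]]
print(out["attention_mask"])
# [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0], [1, 1, 1, 1, 0]]
```

Note that this naive cut drops the closing special token of the first sample. A library tokenizer is expected to account for special tokens when it truncates, so treat the sketch as a picture of the shapes involved rather than of the exact ids.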

In [None]:
from typing import List, Dict, Union
import torch


class Collate:
    def __init__(self, tokenizer: str, max_len: int) -> None:
        self.tokenizer_name = tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name)
        self.max_len = max_len

    def __call__(self, batch: List[Dict[str, Union[str, int]]]) -> Dict[str, torch.Tensor]:
        texts = list(map(lambda batch_instance: batch_instance["title"], batch))
        tokenized_inputs = self.tokenizer(
            texts,
            padding="longest",
            truncation=True,
            max_length=self.max_len,
            return_tensors="pt",
            return_token_type_ids=False,
        )
        labels = list(map(lambda batch_instance: int(batch_instance["label"]), batch))
        labels = torch.LongTensor(labels)
        return dict(tokenized_inputs, **{"labels": labels})
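
To see what a collate function receives, one can pass an identity function as `collate_fn` and inspect the raw batch. The sketch below uses a plain Python list of hypothetical records in place of the real dataset, so it runs without downloading anything.

```python
from torch.utils.data import DataLoader

# Hypothetical records shaped like the review samples (title, content, label).
toy_dataset = [
    {"title": "Great CD", "content": "Love it.", "label": 1},
    {"title": "Broke in a week", "content": "Do not buy.", "label": 0},
    {"title": "Works fine", "content": "As described.", "label": 1},
]

# An identity collate function hands back the batch exactly as the loader built it.
loader = DataLoader(toy_dataset, batch_size=2, shuffle=False, collate_fn=lambda batch: batch)
first_batch = next(iter(loader))

print(type(first_batch).__name__, len(first_batch))  # list 2
print(first_batch[0])
```

Each batch is a `list` of `dict` items, which is the input that `Collate.__call__` turns into padded tensors.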

In [None]:
#@title 🧑‍🍳 Setting up the collate function { run: "auto" }
tokenizer_name = "distilbert-base-uncased" #@param {type: "string"}
sample_max_length = 64 #@param {type:"slider", min:32, max:512, step:1}
collate = Collate(tokenizer=tokenizer_name, max_len=sample_max_length)

## 🤖 Model
Our model needs to classify an entire sequence of text. Once we feed an input sequence of length $k$ to a language model, it will output $k$ vectors. The question is then which of these vectors, or which combination of them, we should use to classify the sequence.
We will use the vector of the first token, the special token `[CLS]`, for this purpose. *Refer to the [BERT paper](https://arxiv.org/abs/1810.04805) for more information.*

Since we have 2 classes (positive and negative), we need to build a classifier on top of the vector representation of the `[CLS]` token. Our custom model will then look like:

In [None]:
import torch
from transformers import AutoModel
from typing import Optional, Tuple, Union


class ReviewClassifier(torch.nn.Module):
    def __init__(self, backbone: str, backbone_hidden_size: int, nb_classes: int):
        super(ReviewClassifier, self).__init__()
        self.backbone = backbone
        self.backbone_hidden_size = backbone_hidden_size
        self.nb_classes = nb_classes

        self.back_bone = AutoModel.from_pretrained(
            self.backbone,
            output_attentions=False,
            output_hidden_states=False,
        )
        self.classifier = torch.nn.Linear(self.backbone_hidden_size, self.nb_classes)

    def forward(
        self, input_ids: torch.Tensor, attention_mask: torch.Tensor, labels: Optional[torch.Tensor] = None
    ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        back_bone_output = self.back_bone(input_ids, attention_mask=attention_mask)
        hidden_states = back_bone_output[0]
        pooled_output = hidden_states[:, 0]  # getting the [CLS] token

        logits = self.classifier(pooled_output)
        if labels is not None:
            loss_fn = torch.nn.CrossEntropyLoss()
            loss = loss_fn(
                logits.view(-1, self.nb_classes),
                labels.view(-1),
            )
            return loss, logits
        return logits
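
The tensor shapes flowing through `forward` can be checked without loading a pretrained model. The sketch below replaces the backbone output with a random tensor (an assumption made purely for illustration) and applies the same `[CLS]` pooling, linear classifier, and cross-entropy loss.

```python
import torch

batch_size, seq_len, hidden_size, nb_classes = 4, 10, 768, 2

# Stand-in for the backbone output: one vector per token (random values here).
hidden_states = torch.randn(batch_size, seq_len, hidden_size)

# Keep only the vector at position 0, where the [CLS] token sits.
pooled_output = hidden_states[:, 0]
print(pooled_output.shape)  # torch.Size([4, 768])

# A linear layer maps each pooled vector to one score (logit) per class.
classifier = torch.nn.Linear(hidden_size, nb_classes)
logits = classifier(pooled_output)
print(logits.shape)  # torch.Size([4, 2])

# Cross-entropy compares the logits with integer class labels.
labels = torch.tensor([1, 0, 1, 1])
loss = torch.nn.CrossEntropyLoss()(logits, labels)
print(loss.item())
```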

In [None]:
model = ReviewClassifier(backbone="distilbert-base-uncased", backbone_hidden_size=768, nb_classes=2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## 🏓 Training Loop
In this section we will define the training loop to train our model. Note that these models are sensitive to their hyperparameters, and it usually takes a while to find the right ones. The default hyperparameters should work fine for our test case.

In [None]:
import numpy as np
from torch.utils.data import DataLoader
from tqdm.auto import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(f"--> Device selected: {device}")

--> Device selected: cuda


In [None]:
def train_one_epoch(
    model: torch.nn.Module, training_data_loader: DataLoader, optimizer: torch.optim.Optimizer, logging_frequency: int
):
    model.train()
    epoch_loss = 0
    logging_loss = 0
    for step, batch in enumerate(training_data_loader):
        batch = {key: value.to(device) for key, value in batch.items()}
        # Reset gradients every step; otherwise they accumulate across batches.
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs[0]
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        logging_loss += loss.item()

        if (step + 1) % logging_frequency == 0:
            print(f"Training loss @ step {step+1}: {logging_loss/logging_frequency}")
            logging_loss = 0

    return epoch_loss / len(training_data_loader)


def evaluate(model: torch.nn.Module, test_data_loader: DataLoader, nb_classes: int):
    model.eval()
    model.to(device)
    eval_loss = 0
    correct_predictions = {i: 0 for i in range(nb_classes)}
    total_predictions = {i: 0 for i in range(nb_classes)}

    with torch.no_grad():
        for step, batch in enumerate(test_data_loader):
            batch = {key: value.to(device) for key, value in batch.items()}
            outputs = model(**batch)
            loss = outputs[0]
            eval_loss += loss.item()

            predictions = np.argmax(outputs[1].detach().cpu().numpy(), axis=1)
            for target, prediction in zip(batch["labels"].cpu().numpy(), predictions):
                if target == prediction:
                    correct_predictions[target] += 1
                total_predictions[target] += 1

    accuracy = (100.0 * sum(correct_predictions.values())) / sum(total_predictions.values())
    return accuracy, eval_loss / len(test_data_loader)
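
A training loop normally resets gradients before each backward pass because PyTorch adds new gradients to whatever is already stored in `.grad`. The toy sketch below, using a single scalar parameter chosen only for illustration, shows that accumulation.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(3*w)/dw = 3.
(3 * w).backward()
first = w.grad.item()

# Second backward pass without resetting: the new gradient is added to the old one.
(3 * w).backward()
accumulated = w.grad.item()

# Resetting before the next pass gives the gradient of that pass alone.
w.grad.zero_()
(3 * w).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)  # 3.0 6.0 3.0
```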

In [None]:
#@title 🧑‍🍳 Setting hyperparameters for training { run: "auto" }
nb_epoch = 3 #@param {type: "slider", min:1, max:10, step:1}
batch_size = 64 #@param {type: "integer"}
logging_frequency = 5 #@param {type: "integer"}
learning_rate = 1e-5 #@param {type: "number"}

train_loader = DataLoader(dataset_train, batch_size=batch_size, shuffle=True, collate_fn=collate)
test_loader = DataLoader(dataset_test, batch_size=batch_size, shuffle=False, collate_fn=collate)

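# Setting up the optimizer. Parameters are split into two groups so that weight decay
# can be set separately for biases and LayerNorm weights; both groups use 0.0 here.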
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]

optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=1e-8)


In [None]:
model.to(device)

train_bar = tqdm(range(nb_epoch), desc="Epoch")
for e in train_bar:
    train_loss = train_one_epoch(model, train_loader, optimizer, logging_frequency)
    eval_acc, eval_loss  = evaluate(model, test_loader, 2)
    print(f"    Epoch: {e+1} Loss/Test: {eval_loss}, Loss/Train: {train_loss}, Acc/Test: {eval_acc}")
    train_bar.set_postfix({"Loss/Train": train_loss, "Loss/Test": eval_loss, "Acc/Test": eval_acc})

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Training loss @ step 5: 0.6976208567619324
Training loss @ step 10: 0.685474693775177
Training loss @ step 15: 0.6706427097320556
    Epoch: 1 Loss/Test: 0.6685808300971985, Loss/Train: 0.6822735592722893, Acc/Test: 48.5
Training loss @ step 5: 0.6282077550888061
Training loss @ step 10: 0.5967875123023987
Training loss @ step 15: 0.5609706997871399
    Epoch: 2 Loss/Test: 0.48369213938713074, Loss/Train: 0.5842777695506811, Acc/Test: 81.5
Training loss @ step 5: 0.436927855014801
Training loss @ step 10: 0.39156692028045653
Training loss @ step 15: 0.3579081892967224
    Epoch: 3 Loss/Test: 0.35278070345520973, Loss/Train: 0.39535488560795784, Acc/Test: 84.5


# 🗃️ Exercises
It is suggested that you look over the `tokenizer` class and its functionalities before attempting the exercises.

## 1️⃣ Predict with more context
In the above training we only took advantage of the `title` of each review to predict its polarity.
1. Investigate whether it would be useful to use the `content` of each review instead.
2. Further investigate whether it would be useful to present both the `title` and the `content` to the model during training.

## 2️⃣ Frozen representations
Modify the backbone so that only the classifier layer is trained and the backbone stays frozen. How do the results compare to the unfrozen version?

## 3️⃣ (Optional) Freeze then unfreeze
It has been observed empirically that freezing the backbone for the first few steps of training and then unfreezing it can produce better-performing models. Modify the training code to support this option. 

## 4️⃣ (Optional) Build an emotion aware AI
Let's now put everything we learned to the test by building an agent with some emotion detection abilities. Use the [emotion dataset](https://huggingface.co/datasets/emotion) to train an [ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)-based model to detect the six basic emotions in the dataset (anger, fear, joy, love, sadness, and surprise).