# EECS 595 Fall 2022 HW 3

Last update: 2022.10.01

Transformers, and the resulting pretraining-finetuning paradigm, have revolutionized the field of NLP and many other fields in recent years.

[HuggingFace🤗](https://huggingface.co/) provides thousands of pretrained Transformer models for tasks across modalities such as text, vision, and audio. We strongly encourage students to explore the official HuggingFace [Transformer Tutorials](https://huggingface.co/docs/transformers/notebooks), from which this assignment is adapted.

## Setup

### Great Lakes Setup

Check out the [Great Lakes guide](https://sled-group.github.io/compute-guide/great-lakes).

### Dependency Installation

Check the Python version.

In [None]:
import sys
from platform import python_version
# Compare (major, minor) together so the check stays correct across major versions.
assert sys.version_info >= (3, 5), \
    "<Warning>: Your Python version is " + python_version()

Import [`matplotlib`](https://matplotlib.org/) and formatting helpers.

In [None]:
import pprint
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 16

Import standard mathematical and machine learning packages: [`numpy`](https://numpy.org/) and [PyTorch](https://pytorch.org/).

In [None]:
import numpy as np
import torch

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

Set the random seed.

In [None]:
import random

SEED = 595

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

# Apply the seed; defining the function alone does not seed anything.
set_seed(SEED)

Install HuggingFace [Transformers](https://pypi.org/project/transformers/), [Datasets](https://huggingface.co/docs/datasets/index), and [Evaluate](https://huggingface.co/docs/evaluate/index).

In [None]:
! pip install transformers
! pip install datasets
! pip install evaluate

## HuggingFace Tutorial

### HuggingFace Pipelines

[`pipeline()`](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline) is the easiest way to use a pretrained model for a given task.

In [None]:
#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/tiZFewofSLM?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

Import `pipeline()` from `transformers`.

In [None]:
from transformers import pipeline

Simply call `pipeline("sentiment-analysis")`. The default pretrained model, *i.e.*, `distilbert-base-uncased-finetuned-sst-2-english`, will be loaded and cached for sentiment analysis.

In [None]:
classifier = pipeline("sentiment-analysis")

Recall the hundreds of lines of code you wrote in HW1 for sentiment analysis. With the help of `pipeline()`, everything can be done in a few lines:

In [None]:
results = classifier([
    "EECS 595 is a great introduction course to natural language processing.", 
    "We hope that this course will deepen your interest in natural language processing.", 
    "We hope that you would not hate the assignments we developed.", 
    "Although some of them could be challenging for students with little background.",
    "Please feel free to reach out to the teaching team for help.", 
])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
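
Each result above carries a `label` and a `score`. As a rough sketch of where such a pair can come from, a classifier produces one raw score (logit) per class, a softmax turns those into probabilities, and the most probable class is reported. The helper names and the label names `NEGATIVE`/`POSITIVE` below are assumptions for illustration, not the pipeline's internals.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 (shifted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, labels=("NEGATIVE", "POSITIVE")):
    """Pick the most probable class and report it with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

# Hypothetical logits for one sentence: class 1 clearly wins.
prediction = to_prediction([-2.0, 3.0])
print(prediction["label"], round(prediction["score"], 4))
```

A score close to 1 therefore means the model puts almost all of its probability mass on the reported label, not that the label is guaranteed to be right.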

### HuggingFace Tokenizers

Similar to `nltk`'s tokenizer, we process textual data in HuggingFace using its own [tokenizer](https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer). A tokenizer starts by splitting text into *tokens* according to a set of rules. The tokens are converted into numbers, which are used to build tensors as input to a model. Any additional inputs required by a model are also added by the tokenizer.
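
The three steps just described (split into tokens, map tokens to numbers, add any extra inputs) can be sketched with a toy tokenizer in plain Python. Everything here is hypothetical: the tiny vocabulary, the splitting rule, and the helper names `toy_tokenize` and `toy_encode`. A real pretrained tokenizer uses a much larger vocab and subword rules.

```python
# Hypothetical toy vocabulary; a real pretrained vocab has tens of thousands of entries.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "movie": 5, "was": 6, "great": 7, ".": 8}

def toy_tokenize(text):
    """Rule-based split: lowercase, put a space before each period, split on whitespace."""
    return text.lower().replace(".", " .").split()

def toy_encode(text):
    """Tokens -> ids, plus the extra inputs this toy model is assumed to expect."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Words missing from the vocab fall back to the unknown token.
    input_ids = [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

encoded = toy_encode("The movie was great.")
print(encoded["input_ids"])  # [2, 4, 5, 6, 7, 8, 3]
```

Note that the five tokens of the sentence become seven ids, because the special tokens are added at both ends.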

In [None]:
#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/Yffk5aydLzg?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

Let's get started by loading a pretrained tokenizer with [`AutoTokenizer.from_pretrained()`](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained), a method of the [`AutoTokenizer`](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer) class. This downloads the *vocab* used when the model was pretrained.

In [None]:
from transformers import AutoTokenizer

Load the tokenizer of the [`bert-base-cased`](https://huggingface.co/bert-base-cased) model. Note that if you plan on using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures that the text is split the same way as the pretraining corpus, and that the same token-to-index mapping (usually referred to as the *vocab*) is used as during pretraining. Some of you ran into this problem in HW2 when developing the RNN.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Simply pass a sentence to the tokenizer. The tokenizer returns a dictionary with three important items:

* [`input_ids`](https://huggingface.co/docs/transformers/main/en/glossary#input-ids) are the indices corresponding to each token in the sentence.
* [`attention_mask`](https://huggingface.co/docs/transformers/main/en/glossary#attention-mask) indicates whether a token should be attended to or not.
* [`token_type_ids`](https://huggingface.co/docs/transformers/main/en/glossary#token-type-ids) identifies which sequence a token belongs to when there is more than one sequence.

In [None]:
encoded_input = tokenizer("There are eight words in this sentence.")
pprint.pprint(encoded_input)

You might have noticed that there are 8 words (including punctuation) in the original sentence, but 10 indices in `encoded_input`. You can decode the `input_ids` to see which tokens they correspond to:

In [None]:
print(tokenizer.decode(encoded_input["input_ids"]))

As you can see, the tokenizer added two special tokens, `[CLS]` and `[SEP]` (classifier and separator), to the sentence. They should look familiar, as you have seen similar tokens in HW2. Not all models need special tokens, but if they do, the tokenizer will automatically add them for you.

In [None]:
for index in tokenizer.all_special_ids:
    print(index, tokenizer.decode(index))

If there are several sentences you want to process, pass the sentences as a list to the tokenizer.

In [None]:
batch_sentences = [
    "A short sentence.",
    "This is a longer sentence just for demo purposes."
]
encoded_inputs = tokenizer(batch_sentences)
pprint.pprint(encoded_inputs)

As in HW2, the sentences in a batch aren't always the same length. This is a problem because tensors, the input to the model, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special *padding token* to sentences with fewer tokens.
*   Set the `padding` parameter to `True` to pad the shorter sequences in the batch to match the longest sequence. You should see that the tokenizer pads the shorter sentence with `0`, because that is the id of the `[PAD]` token.

On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you will need to truncate the sequence to a shorter length.
*   Set the `truncation` parameter to `True` to truncate a sequence to the maximum length accepted by the model.
*   In practice, we usually set `padding` to `"max_length"`, which pads every sequence to the same fixed length instead of to the longest sequence in the batch.

In [None]:
encoded_inputs = tokenizer(batch_sentences, padding=True, truncation=True)
pprint.pprint(encoded_inputs)
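
To make the padding and truncation strategy concrete, here is a minimal pure-Python sketch of the idea. It assumes `0` is the padding id (as for `[PAD]` above) and uses made-up token ids. The helper `pad_and_truncate` is hypothetical and simpler than the real tokenizer, which, for example, keeps the closing special token when it truncates.

```python
PAD_ID = 0  # assumption: 0 is the padding id, as for `[PAD]` in the tokenizer above

def pad_and_truncate(batch_ids, max_length=None):
    """Truncate sequences longer than max_length, then pad all to a common length."""
    if max_length is not None:
        batch_ids = [ids[:max_length] for ids in batch_ids]
    target = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = target - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]]  # made-up ids
padded = pad_and_truncate(toy_batch, max_length=6)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

Both rows now have length 6, so they can be stacked into one rectangular tensor, and the attention mask records which positions are padding.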

### HuggingFace Datasets

In [None]:
#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/_BZearw7f0w?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

We will again use the [`rotten_tomatoes`](https://huggingface.co/datasets/rotten_tomatoes) dataset, the same standard movie review dataset for sentiment analysis we used in HW1. HuggingFace hosts a large number of public datasets, and fortunately we can load this one directly through its API.

In [None]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")
pprint.pprint(dataset)
pprint.pprint(dataset['train'][0])

As you now know, you need a tokenizer to process the text and include a padding and truncation strategy to handle any variable sequence lengths. To process your dataset in one step, use the [`map()`](https://huggingface.co/docs/datasets/process.html#map) method to apply a preprocessing function over the entire dataset:

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
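
With `batched=True`, the preprocessing function receives a batch in column form (each key maps to a list of values) and returns new columns as lists of the same length. The following pure-Python sketch imitates that contract on a list of rows. `toy_map` and `count_words` are hypothetical stand-ins, not the `datasets` implementation.

```python
def toy_map(rows, function, batch_size=2):
    """Apply `function` to column-wise batches and merge the returned columns into each row."""
    processed = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Row-wise -> column-wise: {"text": [...], "label": [...]}
        columns = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(columns)
        for i, row in enumerate(chunk):
            merged = dict(row)
            merged.update({key: values[i] for key, values in new_columns.items()})
            processed.append(merged)
    return processed

def count_words(examples):
    # Stand-in for `tokenize_function`: takes a list of texts, returns one list per new column.
    return {"n_words": [len(text.split()) for text in examples["text"]]}

toy_rows = [
    {"text": "a fine film", "label": 1},
    {"text": "dull", "label": 0},
    {"text": "not bad at all", "label": 1},
]
mapped = toy_map(toy_rows, count_words)
print(mapped[2])
```

Processing whole batches at once is what lets a tokenizer handle many texts per call instead of one text at a time.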

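HuggingFace provides a [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) class. But as an exercise (and to prepare you for more flexible use of HuggingFace), you will manually postprocess `tokenized_datasets` to prepare it for training: remove the raw `text` column, rename `label` to `labels` (the argument name the model expects), and set the format so that the dataset returns PyTorch tensors.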

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

Create a `DataLoader` for each of the training, validation, and test splits so you can iterate over batches of data. The batch size is a critical hyperparameter to tune. Read more about it [here](https://huggingface.co/docs/transformers/v4.18.0/en/performance), especially if you encounter `RuntimeError: CUDA error: out of memory`.

In [None]:
from torch.utils.data import DataLoader

batch_size = 8

train_dataset = tokenized_datasets["train"].shuffle(seed=SEED)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
eval_dataset = tokenized_datasets["validation"].shuffle(seed=SEED)
eval_dataloader = DataLoader(eval_dataset, batch_size=batch_size)
test_dataset = tokenized_datasets["test"]
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)
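
The batch size also determines how many batches a loader yields per epoch, which in turn sets the total number of training steps. A small sketch of that arithmetic follows. The dataset size of 1000 is a hypothetical number chosen for illustration, and `steps_per_epoch` is a helper written for this example.

```python
import math

def steps_per_epoch(num_examples, batch_size, drop_last=False):
    """Number of batches yielded in one pass over the data."""
    if drop_last:
        # An incomplete final batch is discarded.
        return num_examples // batch_size
    # Otherwise the final batch may simply be smaller than the rest.
    return math.ceil(num_examples / batch_size)

num_examples = 1000  # hypothetical dataset size, for illustration only
for size in (8, 16, 32):
    print(size, steps_per_epoch(num_examples, size))
```

Doubling the batch size roughly halves the number of optimizer steps per epoch while increasing the memory needed per step, which is why it is the first thing to reduce when the GPU runs out of memory.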

### HuggingFace Fine-tuning

Load the `bert-base-cased` model with the number of expected labels. For binary classification, set `num_labels` to 2.

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
model.to(device)

Create an optimizer and a learning rate scheduler to fine-tune the model.
*   Use the [`AdamW`](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) optimizer;
*   Create the default learning rate scheduler from [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer).

You may want to tune the hyperparameters here, such as the learning rate and the number of epochs!

In [None]:
from torch.optim import AdamW
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
optimizer = AdamW(model.parameters(), lr=1e-4)

lr_scheduler = get_scheduler(
    name="linear", 
    optimizer=optimizer, 
    num_warmup_steps=0, 
    num_training_steps=num_training_steps
)
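
To see what a `"linear"` schedule with `num_warmup_steps=0` does to the learning rate, here is a pure-Python sketch of the usual rule: ramp up linearly during warmup, then decay linearly to zero at the last training step. The function `linear_lr` is written for this illustration and is an assumption about the shape of the schedule, not the library's code.

```python
def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)

total_steps = 100  # hypothetical; the notebook uses num_epochs * len(train_dataloader)
for step in (0, 50, 100):
    print(step, linear_lr(step, base_lr=1e-4, num_warmup_steps=0,
                          num_training_steps=total_steps))
```

With no warmup, training starts at the full learning rate, reaches half of it at the midpoint, and ends at zero, so the total number of training steps must be known before the scheduler is created.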

In 2022, HuggingFace moved its metric APIs into the standalone `evaluate` library. Read more about the details [here](https://www.vennify.ai/hugging-face-evaluate-library/).

In [None]:
import evaluate

metric = evaluate.load("accuracy")

To keep track of your training progress, use the [tqdm](https://tqdm.github.io/) library to add a progress bar over the number of training steps. Just as you need to pass an evaluation function to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), you need to handle evaluation yourself when you write your own training loop. Instead of computing the metric on each batch separately, accumulate the predictions of all validation batches with [`add_batch`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=add_batch#datasets.Metric.add_batch) and compute the metric once the validation pass is finished, at the end of each epoch.

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_epochs):

    model.train()
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    model.eval()
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        metric.add_batch(predictions=predictions, references=batch["labels"])
    
    score = metric.compute()
    print('Validation Accuracy:', score['accuracy'])
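
The accumulate-then-compute pattern used in the validation loop can be sketched without any library. `ToyAccuracy` below is a hypothetical stand-in that mirrors the `add_batch` and `compute` calls, not the `evaluate` implementation. It assumes, for illustration, that `compute` resets the accumulated state so that each epoch starts fresh.

```python
class ToyAccuracy:
    """Minimal accumulate-then-compute accuracy metric (not the `evaluate` API)."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        self.correct += sum(int(p == r) for p, r in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        accuracy = self.correct / self.total if self.total else 0.0
        # Reset so the next evaluation pass starts from scratch.
        self.correct, self.total = 0, 0
        return {"accuracy": accuracy}

def argmax(row):
    """Index of the largest logit, i.e. the predicted class."""
    return max(range(len(row)), key=row.__getitem__)

toy_metric = ToyAccuracy()
toy_batches = [
    ([[0.2, 1.5], [2.0, -1.0]], [1, 0]),  # (logits, labels): both correct
    ([[0.3, 0.1], [0.0, 0.4]], [1, 1]),   # first wrong, second correct
]
for logits, labels in toy_batches:
    toy_metric.add_batch([argmax(row) for row in logits], labels)
result = toy_metric.compute()
print(result)  # {'accuracy': 0.75}
```

Accumulating counts over batches gives the accuracy over the whole split, which differs from averaging per-batch accuracies whenever the last batch is smaller than the others.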

Test your model on the test set, and save the predictions for grading purposes.

In [None]:
metric = evaluate.load("accuracy")
model.eval()

all_predictions = []  # collect every batch, not just the last one
for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
    all_predictions.append(predictions.cpu())

# Save the predictions for the whole test set as a single tensor.
torch.save(torch.cat(all_predictions), 'predictions.torch')
score = metric.compute()
print('Test Accuracy:', score['accuracy'])

## Convert to HTML

Download this notebook, re-upload it to the `Files` panel on the left, and run the following cell.

In [None]:
%%shell
jupyter nbconvert --to html hw3.ipynb

[NbConvertApp] Converting notebook hw3.ipynb to html
[NbConvertApp] Writing 322200 bytes to hw3.html


