# CIS6930 Week 9a: Pre-trained Language Models (1) (Student version)

---

Preparation: Go to `Runtime > Change runtime type` and choose `GPU` for the hardware accelerator.



In [None]:
gpu_info = !nvidia-smi -L
gpu_info = "\n".join(gpu_info)
if gpu_info.find("failed") >= 0:
    print("Not connected to a GPU")
else:
    print(gpu_info)

## Preparation

For this notebook, we use Hugging Face's `transformers` library.

In [None]:
!pip install transformers

In [None]:
import copy
from time import time
from typing import Any, Dict
import random

import numpy as np
from sklearn.metrics import accuracy_score
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import Dataset, TensorDataset, DataLoader
from tqdm import tqdm
# The optimizer used later is torch.optim.AdamW, so only the scheduler is imported here
from transformers import get_linear_schedule_with_warmup

## Playing with a Pre-trained Tokenizer

As discussed in the lecture, we use "pre-trained" tokenizer models for pre-trained language models. Let's take a look at the tokenizer pre-trained for `bert-base-uncased`.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# AutoTokenizer and AutoModelForSequenceClassification are "meta" classes for
# model-specific classes such as BertTokenizer and BertForSequenceClassification

# See also Hugging Face's ModelHub
# https://huggingface.co/bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [None]:
tokenizer.encode("Hello, world!")

In [None]:
for token in tokenizer.encode("Hello, world!"):
    print(tokenizer.decode(token))

In [None]:
tokenizer.decode(tokenizer.encode("Hello, world!"))

In [None]:
# "Pre-trained" vocabulary set
len(tokenizer.vocab)

In [None]:
# Registered special tokens
tokenizer.special_tokens_map

In [None]:
# What about the longest English word? :)
for tokenid in tokenizer.encode("pneumonoultramicroscopicsilicovolcanoconiosis"):
    print(tokenizer.decode(tokenid))

As we can see, the pre-trained tokenizer avoids the out-of-vocabulary (OoV) issue by tokenizing an input sequence into subwords.
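
To make the subword idea concrete, here is a minimal sketch of greedy longest-match-first splitting in the WordPiece style that BERT tokenizers use. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, made up for illustration; the real `bert-base-uncased` vocabulary is far larger and also contains single characters, which is why unknown tokens are rare in practice.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # Continuation pieces carry the "##" prefix
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            # No piece matched: the whole word becomes the unknown token
            return [unk_token]
        tokens.append(current)
        start = end
    return tokens


# Hypothetical toy vocabulary (not the real bert-base-uncased vocabulary)
toy_vocab = {"micro", "##scopic", "##scope", "##s"}
print(wordpiece_tokenize("microscopic", toy_vocab))  # ['micro', '##scopic']
print(wordpiece_tokenize("microscopes", toy_vocab))  # ['micro', '##scope', '##s']
```

A word that never appeared as a whole in the vocabulary can still be represented as long as its pieces are present.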

### `transformers.Tokenizer.__call__()`

We usually use the `__call__()` method. The method returns a dictionary of `input_ids`, `token_type_ids`, and `attention_mask`, which are compatible with the interface of pre-trained language models in the `transformers` library. This function is also convenient for **padding**.

Let's take a look.

In [None]:
tokenizer("Hello, world!")  # __call__() in Python

In [None]:
# For sequence-pair classification
tokenizer("Hello, world!", "Good morning world!")
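
The sequence-pair output above follows the BERT layout `[CLS] A [SEP] B [SEP]`, with `token_type_ids` marking which segment each token belongs to. The following plain-Python sketch reproduces that layout. The helper name `build_pair_inputs` and the token IDs of the two sentences are made up for illustration; the default `cls_id=101` and `sep_id=102` are assumed to match the special-token IDs of `bert-base-uncased`.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP]."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0: [CLS], sentence A, first [SEP]; segment 1: sentence B, last [SEP]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # Every position holds a real token, so the mask is all ones
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_mask}


# Made-up token IDs standing in for two tokenized sentences
pair_sketch = build_pair_inputs([11, 12], [21, 22, 23])
print(pair_sketch)
```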

In [None]:
# By providing `max_length` and `padding` arguments,
# you can make "pre-padded" token ID sequences
tokenizer(["Hello, world!", "Hello, again!"],
          max_length=16, padding="max_length")

In [None]:
# By adding `return_tensors="pt"`, you can get PyTorch tensor objects instead of lists. 
tokenizer(["Hello, world!", "Hello, again!"],
          max_length=16, padding="max_length",
          return_tensors="pt")
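
What `padding="max_length"` does can be sketched in plain Python: each ID list is extended to a fixed length with a padding ID, and the attention mask records which positions are real tokens. The helper `pad_batch` and the input IDs are hypothetical; `pad_id=0` is assumed to match the `[PAD]` ID of `bert-base-uncased`.

```python
def pad_batch(batch_ids, max_length, pad_id=0):
    """Pad (or clip) each ID list to max_length and build attention masks."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]                # Clip sequences that are too long
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for two sequences of different lengths
padded_sketch = pad_batch([[5, 6, 7], [8, 9]], max_length=4)
print(padded_sketch["input_ids"])       # [[5, 6, 7, 0], [8, 9, 0, 0]]
print(padded_sketch["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 0, 0]]
```

Because every row has the same length after padding, the batch can be stored as a single rectangular tensor.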

## Implementing Custom Dataset class

In Week 6, we created a custom dataset class for the Twitter dataset (see [the Google Colab notebook](https://colab.research.google.com/drive/1DZN-Bo2HBnPQPm4jrQzEIchhHdN682qP?usp=sharing)).


In [None]:
# https://www.kaggle.com/crowdflower/twitter-airline-sentiment
# License CC BY-NC-SA 4.0
!gdown --id 1BS_TIqm7crkBRr8p6REZrMv4Uk9_-e6W

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from torch.utils.data import Dataset, TensorDataset, DataLoader

# Loading dataset
df = pd.read_csv("Tweets.csv")

# Label encoder
le = LabelEncoder()
y = le.fit_transform(df["airline_sentiment"].values)
df["label"] = y

# Split into 60% train, 20% valid, 20% test
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=1)

train_df, valid_df = train_test_split(
    train_df, test_size=0.25, random_state=1)  # 0.25 x 0.8 = 0.2

print(len(train_df), len(valid_df), len(test_df))
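
The arithmetic behind the two-stage split (`0.25 x 0.8 = 0.2`) can be checked with a short sketch. The helper `two_stage_split_sizes` and the row count of 1000 are hypothetical, and rounding the held-out portion up is an assumption about how the split sizes are computed.

```python
import math


def two_stage_split_sizes(n_rows, test_size=0.2, valid_size=0.25):
    """Sizes after splitting off a test set, then a validation set from the rest."""
    n_test = math.ceil(n_rows * test_size)    # First split: hold out the test set
    n_rest = n_rows - n_test
    n_valid = math.ceil(n_rest * valid_size)  # Second split: applied to the remainder
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test


print(two_stage_split_sizes(1000))  # (600, 200, 200)
```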

In [None]:
df.columns

In [None]:
class TweetDataset(Dataset):
    def __init__(self,
                 df,
                 tokenizer,
                 max_length=256):
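        # Copy so that the caller's DataFrame is not modified
        self.df = df.copy()
        encodings = []
        for text in self.df["text"].tolist():
            d = tokenizer(text,
                          max_length=max_length,
                          padding="max_length",
                          truncation=True,  # Clip texts longer than max_length
                          return_tensors="pt")
            # Remove the batch dimension: (1, max_length) -> (max_length,)
            encodings.append({k: v.squeeze(0) for k, v in d.items()})

        # Each cell holds the dictionary of tensors for one tweet
        self.df["input_ids"] = encodings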

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        return {**self.df.iloc[idx]["input_ids"],
                "labels": self.df.iloc[idx]["label"]}

In [None]:
train_dataset = TweetDataset(train_df, tokenizer, max_length=256)
valid_dataset = TweetDataset(valid_df, tokenizer, max_length=256)
test_dataset = TweetDataset(test_df, tokenizer, max_length=256)

In [None]:
# Take a look at a sample batch
batch = next(iter(DataLoader(train_dataset, batch_size=4)))
batch
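
The sample batch above is a single dictionary even though `__getitem__` returns one dictionary per example. A rough plain-Python sketch of that collation step is shown below. The helper `collate_dicts` is hypothetical and uses lists where the real `DataLoader` stacks tensors; the example values are made up.

```python
def collate_dicts(examples):
    """Turn a list of per-example dicts into one dict of batched values."""
    keys = examples[0].keys()
    return {key: [example[key] for example in examples] for key in keys}


# Made-up examples with the same keys that TweetDataset.__getitem__ returns
toy_examples = [
    {"input_ids": [101, 7, 102, 0], "attention_mask": [1, 1, 1, 0], "labels": 2},
    {"input_ids": [101, 8, 9, 102], "attention_mask": [1, 1, 1, 1], "labels": 0},
]
toy_batch = collate_dicts(toy_examples)
print(toy_batch["labels"])     # [2, 0]
print(toy_batch["input_ids"])  # [[101, 7, 102, 0], [101, 8, 9, 102]]
```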

### Testing a Pre-trained Model (with a sequence classification head)

- `AutoModel`: Base model (Note that it does not have any classification heads)
- `AutoModelForSequenceClassification`: For single-sentence/sentence-pair classification
- `AutoModelForTokenClassification`: For sequential tagging
- `AutoModelForQuestionAnswering`: For Question Answering

See [the official API documentation](https://huggingface.co/transformers/model_doc/auto.html) for details. 

In this example, we will use `AutoModelForSequenceClassification` for a text classification problem.


In [None]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

Let's take a look at how the `forward()` function is implemented for `AutoModelForSequenceClassification`, which is actually `BertForSequenceClassification` in this case (as we load a pre-trained BERT model).

https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel.forward

When `forward()` receives the optional argument `labels`, it also returns the corresponding `loss` value, so we don't have to calculate the loss manually.

In [None]:
output = model(**batch)
output

In [None]:
output.loss
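
For single-label classification with several classes, the loss returned here is the mean cross-entropy between the logits and the labels. The sketch below computes the same quantity in plain Python so the number is less of a black box. The function name `cross_entropy` and the toy logits are made up for illustration.

```python
import math


def cross_entropy(logits, labels):
    """Mean cross-entropy of raw logits (one row per example) against class indices."""
    total = 0.0
    for row, label in zip(logits, labels):
        # log-sum-exp with the max subtracted for numerical stability
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_z - row[label]  # Negative log-probability of the true class
    return total / len(labels)


# With uniform logits over 3 classes, the loss equals log(3)
toy_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_labels = [1, 2]
print(cross_entropy(toy_logits, toy_labels), math.log(3))
```

A freshly initialized 3-class head that predicts close to uniformly should therefore start near `log(3)`.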

## Training script

The following training script is based on the previous version with a few modifications:
- 1) Replaced manual loss calculation with model's output.
- 2) Added a learning rate scheduler.

Let's take a look.
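
As background for modification 2), here is a minimal sketch of the multiplier that a linear schedule with warmup applies to the base learning rate at each step: it rises linearly from 0 to 1 during warmup and then decays linearly to 0. The function name `linear_schedule_factor` is made up for illustration. The script below passes `num_warmup_steps=0`, so it only uses the decay part.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Warmup: increase linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Decay: decrease linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Without warmup, the factor goes from 1.0 at step 0 to 0.0 at the last step
for step_sketch in (0, 50, 100):
    print(step_sketch, linear_schedule_factor(step_sketch, 0, 100))
```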

In [None]:
def train(model: nn.Module,
          train_dataset: Dataset,
          valid_dataset: Dataset,
          config: Dict[str, Any],
          random_seed: int = 0):
  
    # Random Seeds ===============
    torch.manual_seed(random_seed)
    random.seed(random_seed)
    np.random.seed(random_seed)
    # Random Seeds ===============

    # GPU configuration
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    dl_train = DataLoader(train_dataset,
                          batch_size=config["batch_size"],
                          shuffle=True,
                          drop_last=True)
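    # Batch the validation data as well (the default batch size is 1, which is slow)
    dl_valid = DataLoader(valid_dataset,
                          batch_size=config["batch_size"])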
                  
    # Model, Optimizer, Loss function
    model = model.to(device)

    # Optimizer    
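    # Biases and LayerNorm weights are excluded from weight decay.
    # "weight_decay" is an optional config key; the default 0.0 disables decay.
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [{
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": config.get("weight_decay", 0.0)
        },
        {
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0},
        ]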
    optimizer = config["optimizer_cls"](optimizer_grouped_parameters,
                                        lr=config["lr"])
    t_total = len(dl_train) * config["n_epochs"]
    scheduler = config["scheduler_cls"](optimizer,
                                        num_warmup_steps=0,
                                        num_training_steps=t_total)
    
    # For each epoch
    eval_list = []
    t0 = time()
    best_val = None
    best_model = None
    for n in range(config["n_epochs"]):
        t1 = time()
        print("Epoch {}".format(n))
        # Training
        train_loss = 0.
        train_pred_list = []
        train_true_list = []
        model.train()  # Switch to the training mode

        # For each batch
        for batch in tqdm(dl_train):
            optimizer.zero_grad()              # Initialize gradient information
            # ==================================================================
            for k, v in batch.items():
                batch[k] = v.to(device)
            output = model(**batch)
            loss = output.loss
            preds = output.logits.argmax(axis=1).detach().cpu().tolist()
            labels = batch["labels"].detach().cpu().tolist()
            # ==================================================================            
            loss.backward()                    # Backpropagate the loss value
            optimizer.step()                   # Update the parameters
            scheduler.step()                   # [New] Update the scheduler step
            train_loss += loss.item()
            train_pred_list += preds
            train_true_list += labels

        train_loss /= len(dl_train)
        train_acc = accuracy_score(train_true_list, train_pred_list)
        print("    Training loss: {:.4f}    Training acc: {:.4f}".format(train_loss,
                                                                         train_acc))

        # Validation
        valid_loss = 0.
        valid_pred_list = []
        valid_true_list = []

        model.eval()  # Switch to the evaluation mode
        with torch.no_grad():  # Gradients are not needed for validation
            for batch in tqdm(dl_valid):
                # ==============================================================
                for k, v in batch.items():
                    batch[k] = v.to(device)
                output = model(**batch)
                loss = output.loss
                preds = output.logits.argmax(axis=1).cpu().tolist()
                labels = batch["labels"].cpu().tolist()
                # ==============================================================
                valid_loss += loss.item()
                valid_pred_list += preds
                valid_true_list += labels

        valid_loss /= len(dl_valid)
        valid_acc = accuracy_score(valid_true_list, valid_pred_list)
        print("  Validation loss: {:.4f}  Validation acc: {:.4f}".format(valid_loss,
                                                                         valid_acc))

        # Model selection
        if best_val is None or valid_loss < best_val:
            best_model = copy.deepcopy(model)
            best_val = valid_loss

        t2 = time()
        print("     Elapsed time: {:.1f} [sec]".format(t2 - t1))

        # Store train/validation loss values
        eval_list.append([n, train_loss, valid_loss, train_acc, valid_acc, t2 - t1])

    eval_df = pd.DataFrame(eval_list, columns=["epoch",
                                               "train_loss", "valid_loss",
                                               "train_acc", "valid_acc",
                                               "time"])
    eval_df = eval_df.set_index("epoch")  # set_index returns a new DataFrame

    print("Total time: {:.1f} [sec]".format(t2 - t0))

    # Return the best model and training/validation information
    return {"model": best_model,
            "best_val": best_val,
            "eval_df": eval_df}

## Let's try!

Let's run the training script with the following configuration. Note that it takes several minutes to finish one epoch.
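
Before running it, note that the `no_decay` filter in `train()` selects parameters by substring match on their names. The sketch below shows that matching on a few illustrative parameter names. The helper `split_by_weight_decay` is hypothetical, and the names are assumed to resemble those returned by `model.named_parameters()` for a BERT classifier.

```python
def split_by_weight_decay(param_names, no_decay=("bias", "LayerNorm.weight")):
    """Separate parameter names into 'apply decay' and 'skip decay' groups."""
    decay, skip = [], []
    for name in param_names:
        # A parameter is skipped if any no_decay pattern occurs in its name
        if any(pattern in name for pattern in no_decay):
            skip.append(name)
        else:
            decay.append(name)
    return decay, skip


# Illustrative parameter names (assumed, not read from the model)
toy_names = [
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "bert.encoder.layer.0.attention.output.LayerNorm.weight",
    "classifier.weight",
]
decay_names, skip_names = split_by_weight_decay(toy_names)
print(decay_names)
print(skip_names)
```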

In [None]:
config = {"optimizer_cls": optim.AdamW,
          "scheduler_cls": get_linear_schedule_with_warmup,
          "lr": 5e-5,
          "batch_size": 8,
          "n_epochs": 3}
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
output = train(model, train_dataset, valid_dataset, config)
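
The scheduler in `train()` needs the total number of optimizer steps, which the script computes as `len(dl_train) * n_epochs`. With `drop_last=True`, the last incomplete batch of each epoch is discarded. A small sketch of that count follows; the helper `total_training_steps` and the example size of 1000 training examples are hypothetical.

```python
def total_training_steps(n_examples, batch_size, n_epochs, drop_last=True):
    """Number of optimizer steps over the whole training run."""
    if drop_last:
        steps_per_epoch = n_examples // batch_size     # Incomplete batch is dropped
    else:
        steps_per_epoch = -(-n_examples // batch_size)  # Ceiling division
    return steps_per_epoch * n_epochs


# Hypothetical dataset size with the batch size and epoch count from the config
print(total_training_steps(1000, batch_size=8, n_epochs=3))  # 375
```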

### Results


In [None]:
output["eval_df"]