# Introduction
In this laboratory we will get our hands dirty working with Large Language Models (e.g. GPT and BERT) to do various useful things. If you haven't already, it is highly recommended to:

+ Read the [Attention is All you Need](https://arxiv.org/abs/1706.03762) paper, which is the basis for all transformer-based LLMs.
+ Watch (and ideally *code along* with) this [Andrej Karpathy video](https://www.youtube.com/watch?v=kCc8FmEb1nY), which shows you how to build an autoregressive GPT model from the ground up.

# Exercise 1: Warming Up
In this first exercise you will train a *small* autoregressive GPT model for character generation (the one used by Karpathy in his video) to generate text in the style of Dante Alighieri. Use [this file](https://archive.org/stream/ladivinacommedia00997gut/1ddcd09.txt), which contains the entire text of Dante's Inferno (**note**: you will have to delete some introductory text at the top of the file before training). Train the model for a few epochs, monitor the loss, and generate some text at the end of training. Qualitatively evaluate the results.
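The note about deleting the introductory text can be handled in code rather than by hand. The following is a minimal sketch, assuming the poem starts at a recognisable marker line; the helper name `strip_front_matter` and the marker string are hypothetical, so inspect the downloaded file to find the actual first line of the poem.

```python
def strip_front_matter(text: str, start_marker: str) -> str:
    """Return `text` from the first occurrence of `start_marker` onward.

    `start_marker` is an assumption: choose a string that appears at the
    start of the poem in your copy of the file.
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"marker {start_marker!r} not found; inspect the file")
    return text[start:]


# Toy demonstration on an in-memory string (not the real file).
sample = "Some licence text\nMore boilerplate\nCANTO I\nfirst verse\n"
cleaned = strip_front_matter(sample, "CANTO I")
print(cleaned)
```

Raising an error when the marker is missing avoids silently training on the untrimmed file.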

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

from datasets import load_dataset
from torch.utils.data import DataLoader, random_split
from sklearn.metrics import accuracy_score
from tqdm import tqdm
import numpy as np



In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Using device:", device)

Using device: cuda


In [None]:
# hyperparameters
batch_size = 48 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum context length for predictions?
max_iters = 1000
eval_interval = 200
learning_rate = 3e-4
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2
# ------------
torch.manual_seed(111)

model_name = "dante_gpt"
input = "inferno.txt"
output_txt = "./results/" + model_name + ".txt"
save_model_path = "./trained_models/" + model_name + ".pth"
save_model = True

In [None]:
# load the text file
with open(input, 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]  # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))  # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

In [None]:

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y


@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B, T, C = x.shape
        k = self.key(x)  # (B,T,hs)
        q = self.query(x)  # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # (B, T, T)
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x)  # (B,T,hs)
        out = wei @ v  # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out


class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out


class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token embeddings plus learned positional embeddings (one vector per position in the context)
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)  # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # better init, not covered in the original GPT video, but important, will cover in followup video
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx)  # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T,C)
        x = tok_emb + pos_emb  # (B,T,C)
        x = self.blocks(x)  # (B,T,C)
        x = self.ln_f(x)  # (B,T,C)
        logits = self.lm_head(x)  # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx


In [None]:
model = GPTLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters()) / 1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

try:
    print("Training...")
    for iter in range(max_iters):
        # every once in a while evaluate the loss on train and val sets
        if iter % eval_interval == 0 or iter == max_iters - 1:
            losses = estimate_loss()
            print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
            # writer.add_scalar('train loss', losses['train'].item(), iter)
            # writer.add_scalar('val loss', losses['val'].item(), iter)
        # sample a batch of data
        xb, yb = get_batch('train')

        # evaluate the loss
        logits, loss = model(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
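finally:
    # NOTE: a `finally` block also runs when training completes normally, so the
    # message below is printed even if no error or interrupt occurred.
    print("Error/Key interrupt occurred")
    if save_model:
        # the ./trained_models/ directory must already exist
        print("Saving model...")
        torch.save(model, save_model_path)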

# generate from the model
print("Text generation...")
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))
# open('more.txt', 'w').write(decode(m.generate(context, max_new_tokens=10000)[0].tolist()))

with open(output_txt, 'w', encoding='utf-8') as f:
    # generate a longer sample and write it to disk; the `with` block closes the file
    generated_txt = decode(m.generate(context, max_new_tokens=10000)[0].tolist())
    f.write(generated_txt)

if save_model:
    print("Saving model...")
    torch.save(model, save_model_path)

10.783546 M parameters
Training...
step 0: train loss 4.2547, val loss 4.2529
step 200: train loss 2.2572, val loss 2.2858
step 400: train loss 1.8157, val loss 1.8586
step 600: train loss 1.6234, val loss 1.7065
step 800: train loss 1.4386, val loss 1.5901
step 999: train loss 1.2863, val loss 1.5429
Error/Key interrupt occurred
Saving model...
Text generation...

  dicondo verrebr'armi belle foco.

Noi furman com'e` sarmignire, e Rucerano
  che da la balcatenea e lor sovra;
  e s'ei si` mi s'alcun su` la stesta,

dove qualuni ciascun ascoglioso e` mova,
  gridando l'altro intrime; omati,
  e venummata ali daldo senne vide tutte.

I' son, di tuttavami porfian la groda trova
  con posazascille e che fesse e quel casso il comitto.

Mostro le crede com'eacgito per le stese,>>.

E io, pur volsci` che l'al da Paltroda,
  e nol mossi vero`, ma smarembo in san s
Saving model...
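One way to read the losses logged above: `F.cross_entropy` reports the mean negative log-likelihood in nats per character, so `exp(loss)` is the per-character perplexity, and a model that guesses uniformly over `V` characters scores `ln(V)`. The sketch below uses only the standard library; the function names are illustrative and not part of the notebook.

```python
import math


def uniform_baseline_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform guess over `vocab_size` symbols."""
    return math.log(vocab_size)


def perplexity(loss_nats: float) -> float:
    """Perplexity corresponding to a mean cross-entropy given in nats."""
    return math.exp(loss_nats)


# Validation losses copied from the training log above.
initial_val_loss = 4.2529  # step 0
final_val_loss = 1.5429    # step 999

print(f"perplexity at step 0:   {perplexity(initial_val_loss):.1f}")
print(f"perplexity at step 999: {perplexity(final_val_loss):.1f}")
```

The step-0 value works out to a perplexity of about 70, which is what a near-uniform guess over roughly that many distinct characters would give, so the untrained model starts out close to uniform. The widening gap between train and validation loss at step 999 is also worth keeping in mind when judging the generated text.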


# Exercise 2: Working with Real LLMs

Our toy GPT can only take us so far. In this exercise we will see how to use the [Hugging Face](https://huggingface.co/) model and dataset ecosystem to access a *huge* variety of pre-trained transformer models.

## Exercise 2.1: Installation and text tokenization

First things first, we need to install the [Hugging Face Transformers library](https://huggingface.co/docs/transformers/index):

    conda install -c huggingface -c conda-forge transformers
    
The key classes that you will work with are `GPT2Tokenizer` to encode text into sub-word tokens, and the `GPT2LMHeadModel`. **Note** the `LMHead` part of the class name -- this is the version of the GPT2 architecture that has the text prediction heads attached to the final hidden layer representations (i.e. what we need to **generate** text).

Instantiate the `GPT2Tokenizer` and experiment with encoding text into integer tokens. Compare the length of input with the encoded sequence length.

**Tip**: Pass the `return_tensors='pt'` argument to the tokenizer to get PyTorch tensors as output (instead of lists).

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

input = "inferno.txt"

with open(input, 'r', encoding='utf-8') as f:
    text = f.read()

# Load the GPT2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
txt_length = len(text)
print("Text length: ", txt_length)

tokenized_text = tokenizer.encode(text, return_tensors="pt")
print("Text length: ", len(text))
print("Tokenized length: ", len(tokenized_text[0]))
print("Ratio:", len(tokenized_text[0]) / len(text))



Text length:  186983


Token indices sequence length is longer than the specified maximum sequence length for this model (79225 > 1024). Running this sequence through the model will result in indexing errors


Text length:  186983
Tokenized length:  79225
Ratio: 0.4237016199333629
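The tokenizer warning above matters once the ids are fed to the model: the message reports a maximum sequence length of 1024, so a 79225-token sequence has to be split first. A minimal sketch with plain Python lists follows; `chunk_token_ids` is a hypothetical helper, and non-overlapping windows are just one possible choice (overlapping strides are another).

```python
from typing import List


def chunk_token_ids(token_ids: List[int], max_length: int = 1024) -> List[List[int]]:
    """Split a flat list of token ids into consecutive windows of at most `max_length`."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]


# Stand-in ids with the same length as the tokenized text above
# (the values are placeholders, not real GPT-2 token ids).
fake_ids = list(range(79225))
chunks = chunk_token_ids(fake_ids)
print(len(chunks), len(chunks[-1]))  # 78 377
```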


## Exercise 2.2: Generating Text

There are a lot of ways we can, given a *prompt* in input, sample text from a GPT2 model. Instantiate a pre-trained `GPT2LMHeadModel` and use the [`generate()`](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to generate text from a prompt.

**Note**: The default inference mode for GPT2 is *greedy*, which might not result in satisfying generated text. Look at the `do_sample` and `temperature` parameters.
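To make the difference between greedy decoding and sampling concrete, here is a small standard-library sketch of what `do_sample` and `temperature` control conceptually: logits are divided by the temperature before the softmax, and the next token is either the argmax (greedy) or a draw from the resulting distribution. The function names and toy logits are illustrative; this is not the `generate()` implementation.

```python
import math
import random
from typing import List, Optional


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax of logits / temperature (lower temperature gives a sharper distribution)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(logits: List[float], do_sample: bool = False,
                    temperature: float = 1.0,
                    rng: Optional[random.Random] = None) -> int:
    """Return the index of the next token: argmax if greedy, otherwise a random draw."""
    if not do_sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]
greedy_choice = pick_next_token(toy_logits)             # always the highest logit
sharp = softmax_with_temperature(toy_logits, 0.5)       # concentrates mass on index 0
flat = softmax_with_temperature(toy_logits, 2.0)        # spreads mass more evenly
sampled = pick_next_token(toy_logits, do_sample=True, temperature=0.9,
                          rng=random.Random(0))
print(greedy_choice, sharp, flat, sampled)
```

Greedy decoding repeats the same choice for the same context, which is one reason it tends to loop; sampling at a moderate temperature trades some coherence for variety.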

In [None]:
output_filepath = "./results/generated_text.txt"

prompt = "I am"

model = GPT2LMHeadModel.from_pretrained("gpt2")
model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
model.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True, temperature=0.9)
# generated_ids = model.generate(**model_inputs, max_new_tokens=200, do_sample=True, temperature=0.5)
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print("generator: ", generated_text)

'''
# Save output
with open(output_filepath, 'w') as f:
    # write to the file
    f.write(generated_text)
    # close the file
    f.close()
'''

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


generator:  I am not sure of the timing," says Dr. Alvaro. "We will monitor the results of the study again at another clinic, but I am not confident it will be done right because of the risks that the study will bring us."

For more information on the upcoming research, please contact:

Dr. Alvaro S. de Araujo, MD

M.D.

Dr. Sato della A. de Araujo, M.D


"\n# Save output\nwith open(output_filepath, 'w') as f:\n    # write to the file\n    f.write(generated_text)\n    # close the file\n    f.close()\n"

# Exercise 3: Reusing Pre-trained LLMs (choose one)

Choose **one** of the following exercises (well, *at least* one). In each of these you are asked to adapt a pre-trained LLM (`GPT2Model` or `DistilBERT` are two good choices) to a new Natural Language Understanding task. A few comments:

+ Since GPT2 is an *autoregressive* model, there is no latent space aggregation at the last transformer layer (you get the same number of tokens out that you give in input). To use a pre-trained model for a classification or retrieval task, you should aggregate these tokens somehow (or opportunistically select *one* to use).

+ BERT models (including DistilBERT) have a special [CLS] token prepended to the input sequence, so its latent representation is the first vector in the output of each transformer block. You can directly use this as a representation for classification (or retrieval).

+ The first *two* exercises below can probably be done *without* any fine-tuning -- that is, just training a shallow MLP to classify or represent with the appropriate loss function.
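The aggregation options in the list above can be sketched without any framework. The example treats a sequence of hidden states as a list of vectors plus an attention mask (1 for real tokens, 0 for padding); the helper names are hypothetical, and in practice the same operations would be applied to the model's `last_hidden_state` tensor.

```python
from typing import List

Vector = List[float]


def mean_pool(hidden_states: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the hidden states of the non-padding tokens."""
    kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]


def last_token(hidden_states: List[Vector], attention_mask: List[int]) -> Vector:
    """Hidden state of the last non-padding token.

    In a causal model such as GPT2 this is the only position that has
    attended to the whole input.
    """
    positions = [i for i, m in enumerate(attention_mask) if m == 1]
    if not positions:
        raise ValueError("attention_mask selects no tokens")
    return hidden_states[positions[-1]]


def first_token(hidden_states: List[Vector]) -> Vector:
    """Hidden state at position 0, i.e. the [CLS] slot in BERT-style models."""
    return hidden_states[0]


# Toy sequence: two real tokens followed by one padding token.
states = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(states, mask))   # [2.0, 1.0]
print(last_token(states, mask))  # [3.0, 2.0]
print(first_token(states))       # [1.0, 0.0]
```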

## Exercise 3.1: Training a Text Classifier (easy)

Peruse the [text classification datasets on Hugging Face](https://huggingface.co/datasets?task_categories=task_categories:text-classification&sort=downloads). Choose a *moderately* sized dataset and use an LLM to train a classifier to solve the problem.

**Note**: A good first baseline for this problem is certainly to use an LLM *exclusively* as a feature extractor and then train a shallow model.

### Hyperparameters

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Using device:", device)

# hyperparameters
batch_size = 8
lr = 1e-3
epochs = 10

Using device: cuda


### Dataset

In [None]:
class IMDBDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.reviews = data['review']
        self.labels = data['label']

    def __getitem__(self, idx):
        return {'text': self.reviews[idx], 'label': self.labels[idx]}

    def __len__(self):
        return len(self.labels)  # self.data is never defined; use the stored labels

In [None]:
data = load_dataset("ajaykarthick/imdb-movie-reviews")
num_classes = len(np.unique(data['train']['label']))
print("Classes:", num_classes)

ds_train = IMDBDataset(data['train'])
ds_test = IMDBDataset(data['test'])

trainset = torch.utils.data.Subset(ds_train, range(5000))
testset = torch.utils.data.Subset(ds_test, range(1000))

train_size = int(0.9 * len(trainset))
val_size = len(trainset) - train_size
trainset, valset = random_split(trainset, [train_size, val_size])

In [None]:
data = load_dataset("cardiffnlp/tweet_eval", "emotion")
num_classes = len(np.unique(data['train']['label']))
print("Classes:", num_classes)

trainset = data['train']
valset = data['validation']
testset = data['test']

In [None]:
# Create a DataLoader
train_loader = DataLoader(trainset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(valset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(testset, batch_size=batch_size, shuffle=False)

In [None]:
next(iter(train_loader))

{'text': ['What a great day it has been! Happy Friday everyone!!! #OORAH #Motivated #America…',
  '️ my girls #vcation @ Santa Barbara, California',
  "Link in bio Land of the bridge's @ Las Vegas Strip",
  "Effortless makeup by nuevolutioncosmetics. There's nothing like a deep red lipstick to make…",
  "I'm definitely in love ️ cuz nowhere feels like home unless I'm with him. #truelove #romance…",
  "Here's a full view of the fabulous black and white houndstooth hooded dress by the amazing…",
  'Athleisure all damn day',
  '#5 final destination (@ California in CA)'],
 'label': tensor([11,  0,  4,  7,  0,  7, 15,  1])}

### Model

In [None]:
from transformers import DistilBertTokenizer, DistilBertModel

# Classifier Model Definition

class TextClassifier(nn.Module):
    def __init__(self, num_classes):
        super(TextClassifier, self).__init__()
        self.tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
        self.pretrained_model = DistilBertModel.from_pretrained('distilbert-base-uncased')
        print(self.pretrained_model)
        self.classifier = nn.Sequential(
            nn.Linear(768, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.tokenizer(x, return_tensors='pt', padding=True, truncation=True, max_length=512)
        x = x.to(device)
        x = self.pretrained_model(**x).last_hidden_state[:, 0, :]
        x = self.classifier(x)
        return x

In [None]:
def train_epoch(model, dataloader, optimizer, criterion, device=device):
    model.train()
    losses = []
    for data in tqdm(dataloader):
        texts = data['text']
        labels = data['label']
        labels = labels.to(device)
        optimizer.zero_grad()
        output = model(texts)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return np.mean(losses)

def eval_model(model, dataloader, criterion, device=device):
    model.eval()
    losses = []
    gts = []
    predictions = []
    with torch.no_grad():
        for data in tqdm(dataloader):
            texts = data['text']
            labels = data['label']
            labels = labels.to(device)
            outputs = model(texts)
            preds = torch.argmax(outputs, dim=1)
            gts.append(labels.cpu().numpy())
            predictions.append(preds.detach().cpu().numpy())
            loss = criterion(outputs, labels)
            losses.append(loss.item())
        accuracy = accuracy_score(np.hstack(gts), np.hstack(predictions))
    return np.mean(losses), accuracy


In [None]:
model = TextClassifier(num_classes).to(device)
criterion = nn.CrossEntropyLoss()
params = [{"params": model.pretrained_model.parameters(), "lr": lr / 100},
            {"params": model.classifier.parameters(), "lr": lr}]
optimizer = torch.optim.AdamW(params)
#optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=lr)

for epoch in range(epochs):
    train_loss = train_epoch(model, train_loader, optimizer, criterion)
    val_loss, val_acc = eval_model(model, val_loader, criterion)
    print(f"Epoch {epoch}: Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

# Save the model
torch.save(model, "trained_models/text_classifier_agnews.pth")

In [None]:
test_loss, test_acc = eval_model(model, test_loader, criterion)
print(f"Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.4f}")

# IMDB
# Zero-Shot Performance
# Test Loss: 0.6936, Test Acc: 0.5110

# Train the classifier
# Test Loss: 0.3251, Test Acc: 0.8620

# Fine-Tune the entire model
# Test Loss: 0.3875, Test Acc: 0.9090

# TWEET EVAL
# Zero-Shot Performance
# Test Loss: 1.3567, Test Acc: 0.3688

# Train the classifier
# Test Loss: 0.6807, Test Acc: 0.7431

# Fine-Tune the entire model
# Test Loss: 1.2441, Test Acc: 0.7994

## Exercise 3.2: Training a Question Answering Model (harder)

Peruse the [multiple choice question answering datasets on Hugging Face](https://huggingface.co/datasets?task_categories=task_categories:multiple-choice&sort=downloads). Choose a *moderately* sized one and train a model to answer contextualized multiple-choice questions. You *might* be able to avoid fine-tuning by training a simple model to *rank* the multiple choices (see margin ranking loss in PyTorch).
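The ranking idea works as follows: score each (context, choice) pair, and penalise the model whenever the correct choice does not beat a wrong one by at least a margin. Below is a plain-Python sketch of the per-pair formula documented for `torch.nn.MarginRankingLoss`, `max(0, -y * (x1 - x2) + margin)`. The helper names, the toy scores, the margin of 1.0 and the averaging over wrong choices are illustrative assumptions.

```python
from typing import List


def margin_ranking_loss(score_a: float, score_b: float, target: int,
                        margin: float = 1.0) -> float:
    """Per-pair loss; target=+1 means score_a should rank higher, -1 the opposite."""
    return max(0.0, -target * (score_a - score_b) + margin)


def multiple_choice_ranking_loss(scores: List[float], correct_index: int,
                                 margin: float = 1.0) -> float:
    """Average pairwise loss of the correct choice against every wrong choice."""
    correct = scores[correct_index]
    wrong = [s for i, s in enumerate(scores) if i != correct_index]
    if not wrong:
        raise ValueError("need at least one wrong choice to rank against")
    return sum(margin_ranking_loss(correct, s, 1, margin) for s in wrong) / len(wrong)


# Toy scores for four choices; choice 1 is the correct one.
toy_scores = [0.2, 2.5, 1.9, -1.0]
loss = multiple_choice_ranking_loss(toy_scores, correct_index=1)
print(loss)
```

Only the choice scored 1.9 contributes here, because it sits within the margin of the correct score; choices already beaten by more than the margin add nothing to the loss.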

In [2]:
from huggingface_hub import notebook_login
from datasets import load_dataset
from transformers import AutoTokenizer
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch
import evaluate
import numpy as np
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

notebook_login()


In [3]:
def preprocess_function(examples):
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    question_headers = examples["sent2"]
    second_sentences = [
        [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
    ]

    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}

In [4]:
@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch

In [5]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

In [None]:
data = load_dataset("swag", "regular")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

ending_names = ["ending0", "ending1", "ending2", "ending3"]

tokenized_swag = data.map(preprocess_function, batched=True)

accuracy = evaluate.load("accuracy")


In [None]:
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")
model_path = "trained_models/qa_swag"

training_args = TrainingArguments(
    output_dir=model_path,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_swag["train"],
    eval_dataset=tokenized_swag["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()

In [14]:
question = "What is the capital of France?"
answers = ["Paris", "London", "Berlin", "Madrid"]

In [26]:
question = "What is the capital of Italy?"
answers = ["Paris", "London", "Rome", "Madrid"]

In [18]:
question = "What is man's best friend?"
answers = ["Dog", "Cat", "Fish", "Bird"]

In [24]:
question = "Who was the president of the United States?"
answers = ["Kobe Bryant", "Barack Obama", "Robin Hood", "Michael Jackson"]

In [None]:
question = "Who is Matt Damon?"
answers = ["Tennis player", "Singer", "Lawyer", "Actor"]

In [27]:
n_answers = len(answers)
tokenizer = AutoTokenizer.from_pretrained(model_path)
question_answer = [[question, answer] for answer in answers]
inputs = tokenizer(question_answer, return_tensors="pt", padding=True)
labels = torch.tensor(0).unsqueeze(0)

model = AutoModelForMultipleChoice.from_pretrained(model_path)
outputs = model(**{k: v.unsqueeze(0) for k, v in inputs.items()}, labels=labels)
logits = outputs.logits
predictions = torch.argmax(logits, dim=1).item()  # plain int, safe to use as a list index
print(f"Model answer: {answers[predictions]}")

with open("./results/qa.txt", "a") as f:
    f.write(f"{question}\n")
    for i, answer in enumerate(answers):
        f.write(f"{i}: {answer}\n")
    f.write(f"Model answer: {answers[predictions]}\n\n\n")

Model answer: Rome


## Exercise 3.3: Training a Retrieval Model (hardest)

The Hugging Face dataset repository contains a large number of ["text retrieval" problems](https://huggingface.co/datasets?task_categories=task_categories:text-retrieval&p=1&sort=downloads). These tasks generally require that the model measure *similarity* between text in some metric space -- naively, just a cosine similarity between [CLS] tokens can get you pretty far. Find an interesting retrieval problem and train a model (starting from a pre-trained LLM of course) to solve it.
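The naive cosine-similarity baseline can be sketched in a few lines. The example below ranks documents against a query using plain Python lists in place of [CLS] embeddings; the function names and toy vectors are illustrative, and real embeddings would come from the pre-trained model.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: List[float],
                   doc_vecs: List[List[float]]) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-d "embeddings": document 1 points almost the same way as the query.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]
ranking = rank_documents(query, docs)
print([i for i, _ in ranking])  # [1, 0, 2]
```

Because cosine similarity ignores vector length, document 1 ranks first even though its norm differs from the query's.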

**Tip**: Sometimes identifying the *retrieval* problems in these datasets can be half the challenge. [This dataset](https://huggingface.co/datasets/BeIR/scifact) might be a good starting point.