# Introduction
In this laboratory we will get our hands dirty working with Large Language Models (e.g. GPT and BERT) to do various useful things. I you haven't already, it is highly recommended to:

+ Read the [Attention is All you Need](https://arxiv.org/abs/1706.03762) paper, which is the basis for all transformer-based LLMs.
+ Watch (and potentially *code along*) with this [Andrej Karpathy video](https://www.youtube.com/watch?v=kCc8FmEb1nY) which shows you how to build an autoregressive GPT model from the ground up.

# Exercise 1: Warming Up
In this first exercise you will train a *small* autoregressive GPT model for character generation (the one used by Karpathy in his video) to generate text in the style of Dante Aligheri. Use [this file](https://archive.org/stream/ladivinacommedia00997gut/1ddcd09.txt), which contains the entire text of Dante's Inferno (**note**: you will have to delete some introductory text at the top of the file before training). Train the model for a few epochs, monitor the loss, and generate some text at the end of training. Qualitatively evaluate the results 

In [70]:
import torch
import torch.nn as nn
import torch.nn.functional as F


In [71]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [72]:
with open('dante.txt') as f:
    text = f.read()

# here are all the unique characters in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print('Vocab Size:', vocab_size)


 !"',-.:;<>?ABCDEFGHILMNOPQRSTUVXZ`abcdefghilmnopqrstuvxz
Vocab Size: 58


In [73]:
# create a mapping of characters to integers
stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda x: [stoi[ch] for ch in x]
decode = lambda x: ''.join([itos[i] for i in x])

print(encode('hello'))
print(decode(encode('hello')))

# Using character level encoding we obtain very long sequences of integers for each sentence

[43, 40, 45, 45, 48]
hello


In [74]:
data = torch.tensor(encode(text), dtype=torch.long)
n = int(len(data) * 0.9)
train_data, val_data = data[:n], data[n:]

In [75]:
block_size = 8
train_data[:block_size + 1]

tensor([22, 13,  1, 16, 21, 32, 21, 24, 13])

In [76]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"When input is {context} the target is {target}")

When input is tensor([22]) the target is 13
When input is tensor([22, 13]) the target is 1
When input is tensor([22, 13,  1]) the target is 16
When input is tensor([22, 13,  1, 16]) the target is 21
When input is tensor([22, 13,  1, 16, 21]) the target is 32
When input is tensor([22, 13,  1, 16, 21, 32]) the target is 21
When input is tensor([22, 13,  1, 16, 21, 32, 21]) the target is 24
When input is tensor([22, 13,  1, 16, 21, 32, 21, 24]) the target is 13


In [77]:
torch.manual_seed(110)
batch_size = 4
block_size = 8
n_embed = 32

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb , yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"When input is {context} the target is {target}")

inputs:
torch.Size([4, 8])
tensor([[52, 40, 42, 54, 40,  0,  1,  1],
        [36, 47, 57, 36,  1, 51, 44, 47],
        [ 1, 38, 43,  4, 40, 51, 36,  1],
        [ 1, 10, 10, 25,  1, 36, 47, 44]])
targets:
torch.Size([4, 8])
tensor([[40, 42, 54, 40,  0,  1,  1, 52],
        [47, 57, 36,  1, 51, 44, 47, 42],
        [38, 43,  4, 40, 51, 36,  1, 49],
        [10, 10, 25,  1, 36, 47, 44, 46]])
----
When input is tensor([52]) the target is 40
When input is tensor([52, 40]) the target is 42
When input is tensor([52, 40, 42]) the target is 54
When input is tensor([52, 40, 42, 54]) the target is 40
When input is tensor([52, 40, 42, 54, 40]) the target is 0
When input is tensor([52, 40, 42, 54, 40,  0]) the target is 1
When input is tensor([52, 40, 42, 54, 40,  0,  1]) the target is 1
When input is tensor([52, 40, 42, 54, 40,  0,  1,  1]) the target is 52
When input is tensor([36]) the target is 47
When input is tensor([36, 47]) the target is 57
When input is tensor([36, 47, 57]) the target is 

In [78]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embed):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and target are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        logits = self.lm_head(tok_emb) # (B,T,V)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(-1)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            logits = logits[:, -1, :] # keep only the last time step
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, idx_next], dim=1)
        return idx

    
m = BigramLanguageModel(vocab_size, n_embed).to(device)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(torch.zeros(1, 1).long(), max_new_tokens=100)[0].tolist()))

torch.Size([32, 58])
tensor(4.1255, grad_fn=<NllLossBackward0>)


IndexError: index out of range in self

In [59]:
optimizer = torch.optim.AdamW(m.parameters(), lr=0.001)

In [66]:
batch_size = 32
for steps in range(100):
    xb, yb = get_batch('train')

    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(loss.item())

2.4525880813598633
2.3348538875579834
2.413069009780884
2.5114784240722656
2.5023553371429443
2.4476187229156494
2.437863349914551
2.4656219482421875
2.4230082035064697
2.5302159786224365
2.396296739578247
2.4570095539093018
2.4547176361083984
2.375426769256592
2.365833044052124
2.3515896797180176
2.3716814517974854
2.266244649887085
2.446502447128296
2.2659053802490234
2.418915271759033
2.423917293548584
2.49324107170105
2.2993669509887695
2.531456470489502
2.301499128341675
2.4209437370300293
2.5618040561676025
2.441148042678833
2.3815388679504395
2.4020280838012695
2.442016124725342
2.414278030395508
2.4093053340911865
2.34110689163208
2.3584768772125244
2.436560869216919
2.4152917861938477
2.2931642532348633
2.346449613571167
2.3698947429656982
2.462114095687866
2.3524081707000732
2.2900230884552
2.400932550430298
2.542382001876831
2.4307029247283936
2.384967803955078
2.4911422729492188
2.476912498474121
2.4833834171295166
2.3555665016174316
2.3641014099121094
2.3823211193084717
2.

In [67]:
print(decode(m.generate(torch.zeros(1, 1).long(), max_new_tokens=400)[0].tolist()))


roroli ovi,

 pedoltrine o` chia,

  fa vrossscopRquatore con doco ttrostria posctosave SLL a l',


 son cledi ctatolo l'l ciomalli conistor la tar 'i  ma:
 roCia
 ce essstoederntoli'uar stialeml che datarsi>.

 pina mX.



 so


 npp fe,min gespi catsse p r pmpe tallol len'ave.
 n iste  qpnol'indire cal m'i;:mita goDe;
  tisici` va di 'iZH<<QZ hco la ` chenosufi lcchero  ma adibMal  aAchegn co s'


# Exercise 2: Working with Real LLMs

Our toy GPT can only take us so far. In this exercise we will see how to use the [Hugging Face](https://huggingface.co/) model and dataset ecosystem to access a *huge* variety of pre-trained transformer models.

## Exercise 2.1: Installation and text tokenization

First things first, we need to install the [Hugging Face transformer library](https://huggingface.co/docs/transformers/index):

    conda install -c huggingface -c conda-forge transformers
    
The key classes that you will work with are `GPT2Tokenizer` to encode text into sub-word tokens, and the `GPT2LMHeadModel`. **Note** the `LMHead` part of the class name -- this is the version of the GPT2 architecture that has the text prediction heads attached to the final hidden layer representations (i.e. what we need to **generate** text). 

Instantiate the `GPT2Tokenizer` and experiment with encoding text into integer tokens. Compare the length of input with the encoded sequence length.

**Tip**: Pass the `return_tensors='pt'` argument to the togenizer to get Pytorch tensors as output (instead of lists).

In [19]:
# Your code here.

## Exercise 2.2: Generating Text

There are a lot of ways we can, given a *prompt* in input, sample text from a GPT2 model. Instantiate a pre-trained `GPT2LMHeadModel` and use the [`generate()`](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to generate text from a prompt.

**Note**: The default inference mode for GPT2 is *greedy* which might not results in satisfying generated text. Look at the `do_sample` and `temperature` parameters.

In [20]:
# Your code here.

# Exercise 3: Reusing Pre-trained LLMs (choose one)

Choose **one** of the following exercises (well, *at least* one). In each of these you are asked to adapt a pre-trained LLM (`GPT2Model` or `DistillBERT` are two good choices) to a new Natural Language Understanding task. A few comments:

+ Since GPT2 is a *autoregressive* model, there is no latent space aggregation at the last transformer layer (you get the same number of tokens out that you give in input). To use a pre-trained model for a classification or retrieval task, you should aggregate these tokens somehow (or opportunistically select *one* to use).

+ BERT models (including DistillBERT) have a special [CLS] token prepended to each latent representation in output from a self-attention block. You can directly use this as a representation for classification (or retrieval).

+ The first *two* exercises below can probably be done *without* any fine-tuning -- that is, just training a shallow MLP to classify or represent with the appropriate loss function.

# Exercise 3.1: Training a Text Classifier (easy)

Peruse the [text classification datasets on Hugging Face](https://huggingface.co/datasets?task_categories=task_categories:text-classification&sort=downloads). Choose a *moderately* sized dataset and use a LLM to train a classifier to solve the problem.

**Note**: A good first baseline for this problem is certainly to use an LLM *exclusively* as a feature extractor and then train a shallow model.

# Exercise 3.2: Training a Question Answering Model (harder)

Peruse the [multiple choice question answering datasets on Hugging Face](https://huggingface.co/datasets?task_categories=task_categories:multiple-choice&sort=downloads). Chose a *moderately* sized one and train a model to answer contextualized multiple-choice questions. You *might* be able to avoid fine-tuning by training a simple model to *rank* the multiple choices (see margin ranking loss in Pytorch).

# Exercise 3.3: Training a Retrieval Model (hardest)

The Hugging Face dataset repository contains a large number of ["text retrieval" problems](https://huggingface.co/datasets?task_categories=task_categories:text-retrieval&p=1&sort=downloads). These tasks generally require that the model measure *similarity* between text in some metric space -- naively, just a cosine similarity between [CLS] tokens can get you pretty far. Find an interesting retrieval problem and train a model (starting from a pre-trained LLM of course) to solve it.

**Tip**: Sometimes identifying the *retrieval* problems in these datasets can be half the challenge. [This dataset](https://huggingface.co/datasets/BeIR/scifact) might be a good starting point.