# NLP Lab: Language Models

In this lab, we will build the main components of the GPT-2 model and train a small model on poems by Victor Hugo.

The questions are included in this notebook. To run the training, you will need to modify the `gpt_single_head.py` file, which is also available in the Git repository.

## Data

The training data consists of a collection of poems by Victor Hugo, sourced from [gutenberg.org](https://www.gutenberg.org/). The dataset is available in the `data` directory.

To reduce model complexity, we will model the text at the character level. Typically, language models process sequences of subwords using [tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary) such as BPE, SentencePiece, or WordPiece.

#### Questions:
- Using [collections.Counter](https://docs.python.org/3/library/collections.html#collections.Counter), display the number of unique characters in the text and the frequency of each character.

In [2]:
import torch

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [6]:
import collections

with open('data/hugo_contemplations.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(f'Number of characters in the file: {len(text)}')
##  YOUR CODE HERE
# Counter accepts any iterable, so it can count the characters of the string directly
counter = collections.Counter(text)
###
chars = counter.keys()

print(f'Number of characters in counter: {sum(counter.values())}')
print(f'{len(chars)} different characters')
print(counter)


Number of characters in the file: 285222
Number of characters in counter: 285222
101 different characters
Counter({' ': 49127, 'e': 30253, 's': 17987, 'u': 14254, 'r': 14223, 't': 14071, 'a': 14048, 'n': 13725, 'i': 12828, 'o': 12653, 'l': 11638, '\n': 8102, 'm': 6495, 'd': 6375, ',': 6077, 'c': 5074, 'p': 4206, "'": 3820, 'v': 3492, 'é': 2943, 'b': 2783, 'f': 2772, 'h': 2221, 'q': 1956, 'g': 1790, '.': 1420, 'x': 1154, 'L': 1147, '!': 1121, 'E': 1074, ';': 1043, '-': 1020, 'j': 890, 'D': 764, 'è': 725, 'à': 706, 'y': 660, 'I': 627, 'ê': 605, 'C': 593, 'S': 545, 'A': 530, 'Q': 503, 'z': 482, 'J': 471, 'O': 450, 'T': 441, 'P': 435, '?': 388, 'V': 383, 'â': 381, 'N': 362, 'M': 344, 'ù': 298, ':': 294, 'R': 240, 'î': 214, 'U': 208, 'ô': 159, 'X': 150, '1': 146, 'H': 116, 'F': 114, '5': 111, '8': 93, 'B': 78, '«': 74, 'É': 70, '»': 69, 'G': 67, '4': 64, 'û': 62, '3': 47, 'ç': 34, 'À': 33, 'ë': 32, 'ï': 31, '2': 30, '·': 26, 'Ê': 24, '6': 23, '7': 23, 'Ô': 19, '9': 19, 'È': 11, 'k': 10, '0':

### Encoding / Decoding  

To transform the text into a vector for the neural network, each character must be encoded as an integer.  

The following functions perform the encoding and decoding of characters:

In [7]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: transform a string into a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: transform a list of integers into a string


# test that your encoder/decoder is coherent
testString = "\nDemain, dès l'aube"
assert decode(encode (testString)) ==  testString
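
The mapping above follows the insertion order of `counter`, so the integer assigned to each character depends on where that character first appears in the file. Below is a minimal sketch of a variant that sorts the vocabulary first, which makes the ids reproducible for a given set of characters. The helper name `build_codec` and the sample string are illustrative and not part of the lab code.

```python
def build_codec(text):
    """Return (encode, decode) functions for a character-level vocabulary."""
    vocab = sorted(set(text))                      # deterministic order
    stoi = {ch: i for i, ch in enumerate(vocab)}   # character -> integer
    itos = {i: ch for ch, i in stoi.items()}       # integer -> character

    def encode(s):
        return [stoi[c] for c in s]

    def decode(ids):
        return "".join(itos[i] for i in ids)

    return encode, decode


sample = "\nDemain, dès l'aube"
encode_sorted, decode_sorted = build_codec(sample)
ids = encode_sorted(sample)
print(ids)
print(repr(decode_sorted(ids)))
```

As with the lab's `encode`, a character that is absent from the vocabulary raises a `KeyError`.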

### Train/Validation Split  

Since the goal is to predict poems, the lines should not be shuffled randomly. Instead, we must preserve the order of the lines in the text and take only the first 90% for training, while using the remaining 10% to monitor learning.  

#### Questions:  
- Split the data into `train_data` (90%) and `val_data` (10%) using slicing on the dataset.

In [135]:
import torch
# Train and validation splits
data = torch.tensor(encode(text), dtype=torch.long)
## YOUR CODE HERE
# first 90% will be train, rest val
train_data = data[:int(data.shape[0]*0.9)]
val_data = data[train_data.shape[0]:]
###
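
A quick way to check the split is to verify that the two parts are contiguous, keep the original order, and cover the whole sequence. The sketch below does this on a toy list of integers standing in for the encoded text; `split_train_val` and `train_frac` are hypothetical names.

```python
def split_train_val(seq, train_frac=0.9):
    """Split a sequence into a leading train part and a trailing validation part."""
    n_train = int(len(seq) * train_frac)
    return seq[:n_train], seq[n_train:]


toy = list(range(20))                 # stands in for the encoded text
train_part, val_part = split_train_val(toy)

print(len(train_part), len(val_part))
assert train_part + val_part == toy   # nothing lost, order preserved
```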

### Context  

The language model has a parameter that defines the maximum context size to consider when predicting the next character. This context is called `block_size`. The training data consists of sequences of consecutive characters, randomly sampled from the training set, with a length of `block_size`.  

If the sequence starts at index `i`, then the context sequence is:  
```python
x = data[i:i+block_size]
```
And the target value to predict at each position in the context is the next character:  
```python
y = data[i+1:i+block_size+1]
```



In [136]:
block_size = 8

# sample the start index within train_data, since x and y are sliced from train_data
i  = torch.randint(len(train_data) - block_size, (1,))
print(i)
x = train_data[i:i+block_size]
y = train_data[i+1:i+1+block_size]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print (f'context is >{decode(context.tolist())}< target is >{decode([target.tolist()])}<')

tensor([94619])
context is >a< target is >i<
context is >ai< target is > <
context is >ai < target is >d<
context is >ai d< target is >a<
context is >ai da< target is >n<
context is >ai dan< target is >s<
context is >ai dans< target is > <
context is >ai dans < target is >l<


In [137]:
x

tensor([52, 29, 14, 37, 52, 30, 43, 14])

### Defining Batches  

The training batches consist of multiple character sequences randomly sampled from `train_data`. To randomly select a sequence for the batch, we need to randomly pick a starting point in `train_data` and extract the following `block_size` characters. When selecting the starting point, ensure that there are enough characters remaining after it to form a full sequence of `block_size` characters.  

#### Questions:  
- Create the batches `x` by selecting `batch_size` sequences of length `block_size` starting from a randomly chosen index `i`. Stack the examples using `torch.stack`.  
- Create the batches `y` by adding the next character following each sequence in `x`. Stack the examples using `torch.stack`.


In [138]:
train_data[0:10]

tensor([0, 1, 2, 3, 4, 5, 6, 7, 1, 8])

In [139]:
batch_size = 4
torch.manual_seed(2023)
# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ## YOUR CODE HERE
    # select batch_size starting points in the data, store them in a list called starting_points
    # (the upper bound leaves room for block_size inputs plus one extra target character)
    starting_points = torch.randint(high=len(data) - block_size, size=(batch_size,)).tolist()

    # x is the sequence of integers starting at each starting point and of length block_size
    x = torch.stack([data[i:i+block_size] for i in starting_points])
    # y is the character after each starting position
    y = torch.stack([data[i+1:i+block_size+1] for i in starting_points])
    ###
    # send data and target to device
    x, y = x.to(device), y.to(device)
    return x, y

In [140]:
get_batch(split='train')

(tensor([[35, 14, 37, 35, 32, 29, 53, 65],
         [39, 32, 44, 14, 53, 52, 14, 44],
         [32, 50, 62, 35, 43, 42,  3, 14],
         [44, 43, 14, 53, 35, 43, 14, 57]], device='cuda:0'),
 tensor([[14, 37, 35, 32, 29, 53, 65,  3],
         [32, 44, 14, 53, 52, 14, 44, 35],
         [50, 62, 35, 43, 42,  3, 14, 14],
         [43, 14, 53, 35, 43, 14, 57, 53]], device='cuda:0'))

In [141]:
test_x, test_y = get_batch(split='train')
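
A useful sanity check on any batch is that each target row is the corresponding input row shifted by one position. The sketch below rebuilds a batch on a toy tensor with `torch.stack` and checks that property; `toy_data` and `sample_batch` are illustrative names, not part of the lab code.

```python
import torch

def sample_batch(data, batch_size, block_size):
    """Sample batch_size (input, target) sequences of length block_size from a 1-D tensor."""
    # the last valid start leaves room for block_size inputs plus one extra target
    starts = torch.randint(len(data) - block_size, (batch_size,)).tolist()
    x = torch.stack([data[s:s + block_size] for s in starts])
    y = torch.stack([data[s + 1:s + block_size + 1] for s in starts])
    return x, y

torch.manual_seed(0)
toy_data = torch.arange(100)          # stands in for train_data
xb, yb = sample_batch(toy_data, batch_size=4, block_size=8)

print(xb.shape, yb.shape)
# targets are the inputs shifted left by one position
assert torch.equal(xb[:, 1:], yb[:, :-1])
```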

### First Model: A Bigram Model  

The first model we will implement is a bigram model. It predicts the next character based only on the current character. This model can be stored in a simple matrix: for each character (row), we store one score per possible next character (column), and applying a softmax to a row gives the probability distribution over next characters. This can be implemented using a simple [`Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) layer in PyTorch.  
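
To make the "one row per current character" picture concrete, here is a small count-based sketch in plain Python. It is not the trainable `Embedding` version used in this lab: it estimates each row of the table by counting character pairs in a toy string, which is the distribution a well-trained bigram model should ideally approach. The names `bigram_table` and `toy_text` are illustrative.

```python
from collections import Counter, defaultdict

def bigram_table(text):
    """Map each character to the empirical distribution of the character that follows it."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return {
        current: {ch: n / sum(nxt.values()) for ch, n in nxt.items()}
        for current, nxt in counts.items()
    }

toy_text = "abababac"
table = bigram_table(toy_text)
print(table["a"])   # distribution of the character following 'a'
print(table["b"])   # distribution of the character following 'b'
```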

#### Questions:  
- In the constructor, define an Embedding layer of size `vocab_size × vocab_size`.  
- In the `forward` method, apply the embedding layer to the batch of indices (`x`).  
- In the `forward` method, define the loss as `cross_entropy` between the predictions and the target (`y`).


In [142]:
vocab_size = len(chars)  # 101 for this dataset

In [143]:
import torch.nn as nn

# use a gpu if we have one
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # we use a simple vocab_size times vocab_size tensor to store the probabilities 
        # of each token given a single token as context in nn.Embedding
        # YOUR CODE HERE
        self.embedding = nn.Embedding(vocab_size, vocab_size)
        self.cross_entropy = nn.CrossEntropyLoss()

        ## 
        
    def forward(self, idx, targets=None):

        # idx and targets are both (Batch,Time) tensors of integers
        # YOUR CODE HERE
        # nn.Embedding requires integer indices; the cast is a no-op when idx is already long
        idx = idx.long().to(device)
        # row idx[b, t] of the table holds the scores of every possible next character
        logits = self.embedding(idx)  # (Batch, Time, Channels) with Channels == vocab_size
        ## 
   
        # don't compute loss if we don't have targets
        if targets is None:
            loss = None
        else:
            # change the shape of the logits and target to match what is needed for CrossEntropyLoss
            # https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
            Batch, Time, Channels = logits.shape
            logits = logits.view(Batch*Time, Channels)
            # CrossEntropyLoss expects class indices of type long
            targets = targets.reshape(Batch*Time).long().to(device)
            # negative log likelihood between prediction and target
            # YOUR CODE HERE
            loss = self.cross_entropy(logits, targets)
            ## 

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = nn.functional.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel(vocab_size)
# send the model to device
m = model.to(device)

### Model Before Training  

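At this stage, the model has not yet been trained: it has only been initialized. However, we can already compute the loss on a random batch. A model that assigns the same probability to every character has a loss of `-ln(1/vocab_size)`, which is `ln(101) ≈ 4.62` here, since the entropy of a uniform distribution is maximal. The `Embedding` weights are initialized from a normal distribution $\mathcal{N}(0, 1)$ in each dimension, so the initial predictions are close to uniform without being exactly uniform, and the loss right after initialization is expected to be somewhat higher (roughly 0.5 more; the training log further down starts at about 5.29).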
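
As a rough numerical check of this expectation, the sketch below computes the loss of a perfectly uniform prediction and estimates, by simulation, the average cross-entropy when the logits are drawn from a standard normal distribution. It uses only the standard library; `n_trials` and the seed are arbitrary choices.

```python
import math
import random

vocab_size = 101

# loss of a uniform prediction over the vocabulary
uniform_loss = -math.log(1.0 / vocab_size)

def cross_entropy(logits, target):
    """Cross-entropy of a single prediction: log-sum-exp(logits) - logits[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

random.seed(0)
n_trials = 5000
total = 0.0
for _ in range(n_trials):
    logits = [random.gauss(0.0, 1.0) for _ in range(vocab_size)]
    total += cross_entropy(logits, random.randrange(vocab_size))
random_init_loss = total / n_trials

print(f"uniform prediction: {uniform_loss:.3f}")
print(f"N(0, 1) logits:     {random_init_loss:.3f}")
```

The second value should land roughly half a unit above the first, because random logits make the prediction slightly more confident, and therefore on average slightly more wrong, than a uniform one.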

In [144]:
#m(xb)[0].shape

In [145]:
#yb

In [146]:
import math
xb, yb = get_batch('train')
logits, loss = m(xb, yb)
#print (logits.shape)
#print (f'Expected loss {-math.log(1.0/vocab_size)}')
#print (f'Computed loss {loss}')

### Using the Model for Prediction  

To use the model for prediction, we need to provide an initial character to start the sequence—this is called the prompt. In our case, we can initialize the generation with the newline character (`\n`) to start a new sentence.  

#### Questions:  
- Create a prompt as a tensor of size `(1,1)` containing the integer corresponding to the character `\n`.  
- Generate a sequence of 100 characters from this prompt using the functions `m.generate` and `decode`.  
- How does the generated sentence look?

In [147]:
print(encode('\n'))
## YOUR CODE HERE
# prompt of shape (1, 1) holding the index of the newline character
input_test = torch.tensor([encode('\n')], dtype=torch.long, device=device)
generated = m.generate(input_test, max_new_tokens=100)
# keep the single sequence of the batch and map the indices back to characters
decoded = decode(generated[0].tolist())
###

[3]


In [148]:
decoded

"\nHbHn'lkbïËçv0ÂIHbHyuR,,MYIXïIXMôëa0cPQ7W5uT0çÎgQUnXzoYlS ëYéoB:n1:îèV«OlhFZ(KÊ'[bbb[]hYl;KaBJsûdûI_ë"

### Training  

For training, we use the [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) optimizer with a learning rate of `1e-3`. Each training iteration consists of the following steps:  

- Generate a batch  
- Apply the neural network (forward pass) and compute the loss: `model(xb, yb)`  
- Compute the gradient (after resetting accumulated gradients): `loss.backward()`  
- Update the parameters: `optimizer.step()`  
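
The four steps above can be sketched on a toy problem. The example below is not the language model: it fits a small linear regression with `AdamW` so that the loop runs in a fraction of a second, and the names `toy_model`, `inputs`, and `history` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# toy regression problem standing in for the language-model batches
inputs = torch.randn(64, 3)
targets = inputs @ torch.tensor([[1.0], [-2.0], [0.5]])

toy_model = nn.Linear(3, 1)
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

history = []
for step in range(300):
    loss = loss_fn(toy_model(inputs), targets)   # forward pass and loss
    optimizer.zero_grad(set_to_none=True)        # reset accumulated gradients
    loss.backward()                              # compute gradients
    optimizer.step()                             # update parameters
    history.append(loss.item())

print(f"first loss {history[0]:.4f}, last loss {history[-1]:.4f}")
```

Resetting the gradients before `loss.backward()` matters because PyTorch accumulates gradients across calls by default.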

In [157]:
max_iters = 50000
batch_size = 4
eval_interval = 1000
learning_rate = 1e-3
eval_iters = 20

@torch.no_grad() # no gradient is computed here
def estimate_loss():
    """ Estimate the loss on eval_iters batch of train and val sets."""
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# re-create the model
model = BigramLanguageModel(vocab_size)
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


step 0: train loss 5.2866, val loss 5.2094
step 1000: train loss 4.3931, val loss 4.3226
step 2000: train loss 3.7566, val loss 3.7631
step 3000: train loss 3.3674, val loss 3.3223
step 4000: train loss 3.1393, val loss 2.9766
step 5000: train loss 2.8584, val loss 2.8757
step 6000: train loss 2.6994, val loss 2.6025
step 7000: train loss 2.5940, val loss 2.6067
step 8000: train loss 2.5657, val loss 2.5152
step 9000: train loss 2.4569, val loss 2.4284
step 10000: train loss 2.5468, val loss 2.5742
step 11000: train loss 2.4437, val loss 2.4841
step 12000: train loss 2.4446, val loss 2.5261
step 13000: train loss 2.4524, val loss 2.4095
step 14000: train loss 2.3866, val loss 2.4224
step 15000: train loss 2.3807, val loss 2.4035
step 16000: train loss 2.4115, val loss 2.3368
step 17000: train loss 2.3858, val loss 2.4394
step 18000: train loss 2.4133, val loss 2.4124
step 19000: train loss 2.4417, val loss 2.3874
step 20000: train loss 2.4356, val loss 2.4011
step 21000: train loss 2.3

Once the network has been trained for 100 iterations, we can generate a sequence of characters.  

#### Questions:  
- What is the effect of training?  
- Increase the number of iterations to 1,000 and then to 10,000. Note the obtained loss and the generated sentence. What do you observe?

In [158]:
# prompt: index 3 is the newline character (see the output of encode above)
idx = torch.full((1, 1), 3, dtype=torch.long, device=device)
print(decode(m.generate(idx, max_new_tokens=100)[0].tolist()))


Nours de fau stux st rlontêts
Qus,

Untét tale? aitie.
Me Sue vabaint EXIless minent, Esprileseunte 


The loss decreases steadily at first, then reaches a plateau between roughly 2.3 and 2.4 once the number of training iterations is greatly increased. The generated text improves slightly: instead of random characters, it now contains word-like chunks separated by spaces and line breaks. It is still unreadable and does not make sense, which is expected from a model that only looks at the current character.

## Single Head Attention  

We will now implement the basic attention mechanism. For each pair of positions in the sequence, this mechanism combines:  
- **Q** (*query*): what the current position is looking for,  
- **K** (*key*): what each position offers, to be matched against the queries,  
- **V** (*value*): the content each position contributes to the output, weighted by how well its key matches the query.  

![single head attention](images/single_head_attention.png)  

### Masking  

However, since we are using the model to generate sequences, we must not use characters that come after the current character—these are precisely the characters we aim to predict during training. *The future should not be used to predict the future.*  

To enforce this constraint, we integrate a **masking matrix** into the process. This matrix ensures that:  
- For the first character in the sequence, only that character is available for prediction (no context).  
- For the second character, only the first and second characters can be used.  
- For the third character, only the first three characters are accessible, and so on.  

This results in a **lower triangular matrix**, where each row is normalized (rows sum to 1).

In [162]:
T = 8

# first version of the constraints with matrix multiplication
# create a lower triangular matrix
weights0 = torch.tril(torch.ones(T,T))
# normalize each row
weights0 = weights0 / weights0.sum(1, keepdim=True) 
print (weights0)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
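
Multiplying this matrix by a sequence of vectors replaces each position with the average of itself and all earlier positions, which is the simplest form of causal context aggregation. Here is a small sketch checking this against an explicit running mean; the tensor `seq` is an arbitrary example.

```python
import torch

T, C = 8, 2
torch.manual_seed(0)
seq = torch.randn(T, C)                                    # T positions, C features each

causal_avg = torch.tril(torch.ones(T, T))
causal_avg = causal_avg / causal_avg.sum(1, keepdim=True)  # rows sum to 1

mixed = causal_avg @ seq                                   # (T, T) @ (T, C) -> (T, C)

# reference: running mean over positions 0..t
counts = torch.arange(1, T + 1).unsqueeze(1)               # (T, 1)
running_mean = seq.cumsum(dim=0) / counts

assert torch.allclose(mixed, running_mean, atol=1e-5)
```

Attention keeps the same masked structure but replaces the uniform weights with weights computed from the data.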


The [`softmax`](https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html) function provides another way to achieve normalization.  

#### Question:  
- Verify that applying `softmax` results in the same lower triangular matrix.

In [160]:
tril = torch.tril(torch.ones(T,T))
weights = torch.tril(torch.ones(T,T))
weights = weights.masked_fill(tril== 0, float('-inf'))
weights = nn.functional.softmax(weights, dim=-1)
print (weights)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])


### Implementation  

We can now implement the attention layer based on the following formula:  

![attention_formula](images/attention_formula.png)  

This involves computing the **queries (Q)**, **keys (K)**, and **values (V)**, applying the **masking mechanism**, and using the **softmax function** to normalize the attention scores before computing the weighted sum of values.

#### Questions:  

- Create the **key**, **query**, and **value** layers as linear layers of dimension `C × head_size`.  
- Apply these layers to `x`.  
- Compute the attention weights:  
  ```python
  weights = query @ key.transpose(-2, -1)
  ```
  (Transpose the second and third dimensions of `key` to enable matrix multiplication).  
- Apply the **normalization factor** (typically, divide by `sqrt(head_size)`).  
- Apply the **triangular mask** and the **softmax** function to `weights`.  
- Apply the **value** layer to `x`.  
- Compute the final output:  
  ```python
  out = weights @ value(x)
  ```

In [166]:
head_size = 16
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)
## YOUR CODE HERE
# define the Key layer  
key = nn.Linear(C, head_size)
# define the Query layer
query = nn.Linear(C, head_size)
# define the Value layer
value = nn.Linear(C, head_size)
# apply each layer to the input
k =  key(x)
q =  query(x)
v =  value(x)
# compute the normalized product between Q and K (scaled by 1/sqrt(head_size))
weights = q @ k.transpose(-2, -1) * head_size**-0.5  # (B, T, T)
# apply the mask (lower triangular matrix)
weights = weights.masked_fill(tril == 0, float('-inf'))
# apply the softmax
weights = nn.functional.softmax(weights, dim=-1)
###
out = weights @ v  # (B, T, head_size)

# print the result
weights[0]
out[0]

tensor([[ 8.5782e-02,  1.1529e-01,  6.4239e-01, -5.6665e-01,  2.8930e-01,
         -3.9935e-01, -4.3672e-01, -7.5431e-01, -9.1777e-01, -8.6862e-02,
         -1.0979e-01, -4.2185e-01, -9.5954e-02,  4.4691e-01,  3.2730e-02,
          3.8350e-01],
        [-6.1935e-01, -4.8447e-01, -1.4436e-01, -4.8109e-01,  7.8122e-02,
         -1.1122e-01, -9.9409e-03,  1.7028e-01,  2.5287e-01,  4.0986e-01,
          1.5536e-01, -1.8151e-01, -8.8062e-01,  9.2018e-01, -4.9328e-01,
         -4.2080e-01],
        [ 1.9587e-02, -1.9120e-02,  2.6615e-01, -2.8456e-01,  2.1989e-01,
         -1.9234e-01, -1.2452e-01, -2.3987e-01, -3.3843e-01,  2.7044e-01,
         -4.3088e-02, -1.8216e-02, -4.1962e-01,  4.0365e-01, -3.3681e-01,
          6.5231e-02],
        [-2.2350e-01, -2.1559e-01, -9.8720e-02,  8.4144e-03,  1.3013e-01,
         -1.8585e-03,  1.8443e-01,  2.8108e-01,  3.1605e-01,  6.0916e-01,
          8.5004e-02,  3.2827e-01, -7.0484e-01,  4.5883e-01, -5.9332e-01,
         -2.6247e-01],
        [-4.7479e-01

#### Questions:  

- Copy your code into `gpt_single_head.py`:  
  - Define the **key**, **query**, and **value** layers in the **constructor** of the `Head` class.  
  - Implement the **computations** in the `forward` function.  
- Train the model.  
- What are the **training** and **validation** losses?  
- Does the generated text appear **better** compared to the previous model?
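
For reference, here is one possible shape for such a `Head` module. The contents of `gpt_single_head.py` are not shown in this notebook, so the constructor signature, the use of `bias=False`, and the `tril` buffer are assumptions made for this sketch rather than the file's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention (sketch)."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower triangular mask, stored as a buffer so it moves with the module across devices
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                    # (B, T, head_size)
        q = self.query(x)                                  # (B, T, head_size)
        v = self.value(x)                                  # (B, T, head_size)
        # scaled dot product between queries and keys
        weights = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T)
        weights = weights.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        weights = F.softmax(weights, dim=-1)
        return weights @ v                                 # (B, T, head_size)

torch.manual_seed(0)
head = Head(n_embd=32, head_size=16, block_size=8)
x = torch.randn(4, 8, 32)
out = head(x)
print(out.shape)

# causality check: changing the last token must not change earlier outputs
x_changed = x.clone()
x_changed[:, -1, :] = 0.0
assert torch.allclose(out[:, :-1], head(x_changed)[:, :-1], atol=1e-5)
```

The causality check at the end is a cheap way to catch a missing or transposed mask before launching a training run.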

The terminal output was the following:
```console
0.009989 M parameters
step 0: train loss 4.7731, val loss 4.7551
step 500: train loss 3.0325, val loss 3.0961
step 1000: train loss 2.6888, val loss 2.7515
step 1500: train loss 2.4841, val loss 2.6040
step 2000: train loss 2.4340, val loss 2.5400
step 2500: train loss 2.4039, val loss 2.5043
step 3000: train loss 2.3823, val loss 2.4783
step 3500: train loss 2.3581, val loss 2.4848
step 4000: train loss 2.3558, val loss 2.4519
step 4500: train loss 2.3409, val loss 2.4336
step 4999: train loss 2.3395, val loss 2.4377

 emancée vel!
L'eux, u velr!
 tt bre, fone pnori,  me  ssutombenchen'ételsailmese, le ece,
Etux, end  ers s on.
 mbe  voi que  ecfforetrart c l'à luraurr.
Seuvimèrère es dommes e es hart ent d  hema, le ven! her.
Lèret ndens
 à  étu antux, a t! L'e voule evil tt des, tan-t se, heutalutentent  à le  cieux,   où achen!
Peu ler  flome quin'euraièj'on leu nevracabl'avêdeulre, cosours,
L'é;
Momisoruvini, meut   âmoure denotachatgll'étadusise qu'à rà brtarverin!
NEt et e c suce
Vointe, qu à crimmut da
```
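This run uses the original number of training iterations (5,000). With far less training than the bigram model (50,000 iterations), the model already produces a few recognizable words, which is an improvement, although its final losses (2.34 train, 2.44 validation) remain in the same range.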

## Multi-Head Attention  

Multi-head attention is simply the parallel computation of multiple **single-head attention** mechanisms. Each **single-head attention** output is concatenated to form the output of the **multi-head attention** module. In the original paper's illustration, the number of heads in the **multi-head attention** is denoted as `h`.  

To allow for **weighted combinations** of each single-head attention output, a **linear transformation layer** is added after concatenation.  

![multi head attention](images/multi_head_attention.png)  

#### Questions:  

- In the **constructor**, create a list containing `num_heads` instances of the `Head` module using PyTorch’s [`ModuleList`](https://pytorch.org/docs/stable/generated/torch.nn.ModuleList.html).  
- In the `forward` function:  
  - Apply each **single-head attention** to the input.  
  - Concatenate the results using PyTorch’s [`cat`](https://pytorch.org/docs/stable/generated/torch.cat.html) function.

In [168]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        ## YOUR CODE HERE
        ## list of num_heads modules of type Head
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        ###
        
    def forward(self, x):
        ## YOUR CODE HERE
        ## apply each head in self.heads to x and concat the results 
        out = torch.cat([h(x) for h in self.heads], dim=-1)

        return out


#### Questions:  

1. **Copy** the file `gpt_single_head.py` and rename it as `gpt_multi_head.py`.  
2. **Add** the `MultiHeadAttention` module in `gpt_multi_head.py`.  
3. At the **beginning of the file**, add a parameter:  
   ```python
   n_head = 4
   ```
4. In the `BigramLanguageModel` module, **replace** the `Head` module with `MultiHeadAttention`, using the parameters:  
   ```python
   num_heads = n_head
   head_size = n_embd // n_head
   ```
   This ensures the total number of parameters remains **the same**.  
5. **Retrain the model** and note:  
   - The total number of **parameters**  
   - The **training** and **validation** losses obtained  
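
The claim in step 4 can be checked by counting weights by hand. The sketch below assumes that each head holds three linear layers (key, query, value) mapping `n_embd` inputs to `head_size` outputs, that the single head used `head_size = n_embd`, and that `n_embd = 32` as in the comparison with GPT-2 at the end of this lab. The helper `qkv_params` is hypothetical.

```python
def qkv_params(n_embd, head_size, num_heads, bias=False):
    """Number of parameters in the key, query and value layers of num_heads heads."""
    per_layer = n_embd * head_size + (head_size if bias else 0)
    return num_heads * 3 * per_layer

n_embd = 32
n_head = 4

single = qkv_params(n_embd, head_size=n_embd, num_heads=1)
multi = qkv_params(n_embd, head_size=n_embd // n_head, num_heads=n_head)

print(single, multi)
assert single == multi   # splitting into heads keeps the weight count unchanged

# with bias terms enabled, both variants gain the same number of extra parameters
extra_bias = qkv_params(n_embd, n_embd // n_head, n_head, bias=True) - multi
print(extra_bias)
```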

**Expected Output Example:**  
```
0.009893 M parameters  
step 4999: train loss 2.1570, val loss 2.1802  
```

Running `gpt_multi_head.py` gave the following output:

```console
(base) onyxia@jupyter-pytorch-gpu-737671-0:~/work/nlp_td_proj$ python3 gpt_multi_head.py
0.009989 M parameters
step 0: train loss 4.6645, val loss 4.6758
step 500: train loss 2.5816, val loss 2.6818
step 1000: train loss 2.3739, val loss 2.4427
step 1500: train loss 2.2907, val loss 2.3367
step 2000: train loss 2.2554, val loss 2.2921
step 2500: train loss 2.2301, val loss 2.2890
step 3000: train loss 2.2090, val loss 2.2599
step 3500: train loss 2.1846, val loss 2.2424
step 4000: train loss 2.1765, val loss 2.1954
step 4500: train loss 2.1575, val loss 2.1768
step 4999: train loss 2.1482, val loss 2.1700

      De vul le pus, ul. le!
        crome ponhait morcesstispanchens en la vouse, le ecu,
Et ac end fers s'on.

EN pâver que feuf,
L'erdet ce sont la vrraîvivie et
Ques dommes le sargre ent de hert, le vont hortoière madess
       Dant fombragrit, l'alcemis: J'hare, au--tus souruiel tâmelarnis homme et ant où achen!
Peure en fant; qui me dande pre l'étoura;
J'hau d'el et coeil son chirand que,
Où soiret lent ves dendez queglles à les piss dr'un; tarves femant et la  suce
Voilleus e à crimmuit: 
```

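The output is somewhat better than with single-head attention: the final losses drop from about 2.34/2.44 to 2.15/2.17 (train/validation), close to the expected values. The text is still poor, as most of the generated words make no sense. The parameter count (0.009989 M) is 96 higher than the expected 0.009893 M, which is consistent with key, query, and value layers created with the default `bias=True` (4 heads × 3 layers × 8 bias terms = 96), whereas the reference count presumably uses `bias=False`.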

## Adding a FeedForward Computation Layer  

After the **attention layers**, which collect information from the sequence, a **computation layer** is added to combine all the gathered information.  

This layer is a simple **Multi-Layer Perceptron (MLP)** with:  
- One **hidden layer**,  
- A **ReLU non-linearity** using [`ReLU`](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html).  

### Architecture:  

<img src="images/multi_ff.png" alt="multi feedforward" width="200">


In [None]:
class FeedForward(nn.Module):
    """ a simple MLP with RELU """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

#### Questions:  

1. **Add** the `FeedForward` module to your `gpt_multi_head.py` file.  
2. **Integrate** this `FeedForward` layer **after** the **multi-head attention** module.  
3. **Retrain the model** and note:  
   - The total **number of parameters**  
   - The **training** and **validation** losses obtained  

**Expected Output Example:**  
```
0.010949 M parameters  
step 4999: train loss 2.1290, val loss 2.1216  
```

The output of `gpt_multi_head.py` with the new feed-forward layer is the following:

```console

(base) onyxia@jupyter-pytorch-gpu-737671-0:~/work/nlp_td_proj$ python3 gpt_multi_head.py
0.011045 M parameters
step 0: train loss 4.6478, val loss 4.6486
step 500: train loss 2.5626, val loss 2.5948
step 1000: train loss 2.3796, val loss 2.3993
step 1500: train loss 2.2932, val loss 2.3184
step 2000: train loss 2.2491, val loss 2.2797
step 2500: train loss 2.2108, val loss 2.2440
step 3000: train loss 2.1902, val loss 2.2120
step 3500: train loss 2.1541, val loss 2.1695
step 4000: train loss 2.1421, val loss 2.1503
step 4500: train loss 2.1173, val loss 2.1314
step 4999: train loss 2.1049, val loss 2.1187

      De vul!
L'eux, un.
Ve!
                 Fait morchautisel chent en la vous, et l'un,-Etu de cur,
Tons on.


De âme peut desfforterchanc l'où uns la.
S-TÂt de chies dommes es sargre en ra féqu'auxe vont hont lèstent:-Ou
      ·Dant fon a:
L'hest le be illet des, au--t ses heut lutent ayais,
                     Où l'e en fant; qui mes apère m'est l'épargble pêdeulen, cosous de chire.
Eque,
Oyistire, où pà-Cais nout quegll'ail,
Et l'indre à brons, din!
NEt et la cauce
Voi-tCorse à crimmut da
```

The training and validation losses are slightly lower than without the feed-forward layer (2.10/2.12 versus 2.15/2.17), but the generated text still makes no sense.

## Stacking Blocks  

The network we have built so far represents just **one block** of the final model. Now, we can **stack multiple blocks** of **multi-head attention** to create a **deeper** network.  

### Architecture:  
![multi feedforward](images/multi_bloc.png)  

The following code defines a **block**:  


In [None]:
class Block(nn.Module):
    """ A single block of multi-head attention """

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = self.sa(x)
        x = self.ffwd(x)
        return x

#### Questions:  

- Add the `Block` module to `gpt_multi_head.py`.  
- Modify the `BigramLanguageModel` code to include **three** instances of `Block(n_embd, n_head=4)`, using a [`Sequential`](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html) container **instead of** `MultiHeadAttention` and `FeedForward`.  
- Retrain the model and note:  
  - The **number of parameters**  
  - The **training** and **validation** losses obtained  

**Expected Output Example:**  
```
0.019205 M parameters  
step 4999: train loss 2.2080, val loss 2.2213  
```

The output with the three blocks is the following:

```console
(base) onyxia@jupyter-pytorch-gpu-737671-0:~/work/nlp_td_proj$ python3 gpt_multi_head.py
0.019493 M parameters
step 0: train loss 4.6347, val loss 4.6468
step 500: train loss 3.0830, val loss 3.1087
step 1000: train loss 2.8397, val loss 2.8119
step 1500: train loss 2.7103, val loss 2.7109
step 2000: train loss 2.5893, val loss 2.5923
step 2500: train loss 2.5173, val loss 2.5544
step 3000: train loss 2.4114, val loss 2.4119
step 3500: train loss 2.3485, val loss 2.3573
step 4000: train loss 2.3196, val loss 2.3251
step 4500: train loss 2.2619, val loss 2.3070
step 4999: train loss 2.2307, val loss 2.2796
         Anlale eu riul. Ve! fommbr'acre, aunerghe eu cesstis le le lé clar l'aielese ecfîbleux, end fers s'on.


D'iiis de che efmor, lart ce soù le pordaîire, de chies dombes e pombhre en ru fhert, le vha! hortre.

L'ammes
           La ja tart e vés combii ti mas 18
--Jes mour;
El tâcre
La à le je tain combe
E de ce cafxent,
Àa  que me la fer me l'énevpa;
J'étond, le mecx ou sout que le que, soi, mubte hâpo- cis nourcabeglits à 18

Quus d'run; tarve.
Le ti fete au suce
Voit Cosst à canmeut da
```
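The results are not better: the final losses (2.23 train, 2.28 validation) are higher than with a single block (2.10 and 2.12), and training took noticeably longer. This is in line with the expected output above and suggests that the deeper network is harder to optimize as it stands.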

## Improving Training  

If we want to continue increasing the **network size**, we need to incorporate layers that **enhance training stability** and **improve generalization** (reducing overfitting). These layers include:  

- **Skip connections** (or **residual connections**)  
- **Normalization layers**  
- **Dropout**  

### Updated Architecture:  

<img src="images/multi_skip_norm.png" alt="multi feedforward" width="200">

---

#### Questions:  

1. In the `Block` module, **add a skip connection** by summing the input at each step:  
   ```python
   x = x + self.sa(self.ln1(x))
   x = x + self.ffwd(self.ln2(x))
   ```  
   
2. In the `Block` module, **add two** [`LayerNorm`](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) layers of size `n_embd`:  
   - **Before** the `Multi-Head Attention` layer.  
   - **Before** the `FeedForward` layer.  

3. **After the sequence of 3 blocks**, add a **LayerNorm** layer of size `n_embd`.  

4. Define a variable at the **beginning of the file**:  
   ```python
   dropout = 0.2
   ```
   Then add a [`Dropout`](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) layer:  
   - **After** the `ReLU` activation in `FeedForward`.  
   - **After** the `Multi-Head Attention` layer in `MultiHeadAttention`.  
   - **After** the `softmax` layer in the single-head attention `Head`.  

5. **Retrain the model** and note:  
   - The **number of parameters**  
   - The **training** and **validation** losses  
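
The effect of steps 1 and 2 can be illustrated in isolation. The sketch below wraps two arbitrary sublayers in the pre-norm residual pattern and shows that, when the sublayers output zero, the block reduces to the identity, so the input always has a direct path to the output. The class name `ResidualBlock` and the stand-in sublayers are assumptions made for this example; in the lab, the sublayers are `MultiHeadAttention` and `FeedForward`.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Pre-norm block: x + sublayer(LayerNorm(x)), applied twice."""

    def __init__(self, n_embd, attention, feed_forward):
        super().__init__()
        self.sa = attention
        self.ffwd = feed_forward
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))      # skip connection around attention
        x = x + self.ffwd(self.ln2(x))    # skip connection around the MLP
        return x

n_embd, dropout = 32, 0.2
# stand-ins for the real sublayers: any module mapping (B, T, n_embd) -> (B, T, n_embd) works
attention = nn.Linear(n_embd, n_embd)
feed_forward = nn.Sequential(nn.Linear(n_embd, n_embd), nn.ReLU(), nn.Dropout(dropout))

block = ResidualBlock(n_embd, attention, feed_forward)
block.eval()                              # disables dropout for the check below

# zero the sublayers so that only the skip path remains
with torch.no_grad():
    for p in list(attention.parameters()) + list(feed_forward.parameters()):
        p.zero_()

x = torch.randn(4, 8, n_embd)
assert torch.equal(block(x), x)
```

Remember to call `model.eval()` before evaluation or generation, since `Dropout` is only meant to be active during training.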

---

**Expected Output Example:**  
```
0.019653 M parameters  
```

The output is the following:

0.019941 M parameters
step 0: train loss 4.7834, val loss 4.8552

## Conclusion  

The key components of **GPT-2** are now in place. The next step is to **scale up** the model and train it on a **much larger** dataset. For comparison, the parameters of [GPT-2](https://huggingface.co/transformers/v2.11.0/model_doc/gpt2.html) are:  

- **`vocab_size = 50257`** → GPT-2 models **subword tokens**, while we model **characters**. For us, `vocab_size = 101`.  
- **`n_positions = 1024`** → The maximum **context size**. For us, it's `block_size = 8`.  
- **`n_embd = 768`** → The **embedding dimension**. For us, it's `n_embd = 32`.  
- **`n_layer = 12`** → The number of **blocks**. For us, it's `3`.  
- **`n_head = 12`** → The number of **attention heads** in each block. For us, it's `4`.  

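The configuration above corresponds to the smallest **GPT-2** model, which has about **124 million parameters**. The largest variant has **1.5 billion parameters** (48 blocks, `n_embd = 1600`). GPT-2 was trained on **8 million web pages**, totaling **40 GB of text**.  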
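
As a rough check on the scale of these numbers, the sketch below counts the parameters implied by the configuration listed above. It relies on architectural details of GPT-2 that are not stated in this lab and are therefore assumptions here: a feed-forward hidden layer four times wider than `n_embd`, bias terms on every linear layer, and an output layer that shares its weights with the token embedding. The helper `gpt2_param_count` is hypothetical.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Parameter count for a GPT-2 style decoder with a 4x-wide MLP and tied output embeddings."""
    embeddings = vocab_size * n_embd + n_positions * n_embd   # token + position tables
    layer_norm = 2 * n_embd                                   # scale and shift
    # fused query/key/value projection, then output projection (weights + biases)
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # expansion to 4 * n_embd, then projection back (weights + biases)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attention + mlp
    return embeddings + n_layer * block + layer_norm          # plus the final LayerNorm

total = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12)
print(f"{total / 1e6:.1f} M parameters")
```

Under these assumptions the count comes to about 124 million, the size of the smallest GPT-2 model; the 1.5 billion figure applies to the largest one. Note that `n_head` does not appear in the formula: as in the multi-head exercise, splitting the embedding across heads does not change the number of weights.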

---

### **Training Results**  
```text
10.816613 M parameters  
step 0: train loss 4.7847, val loss 4.7701  
step 4999: train loss 0.2683, val loss 2.1161  
time: 31m47.910s   
```

### **Generated Text Sample:**  

```text
Le pêcheur où l'homme en peu de Carevante  
Sa conter des chosses qu'en ses yoitn!  

Ils sont là-hauts parler à leurs ténèbres  
A ceux qu'on rêve aux oiseaux des cheveux,  
Et celus qu'on tourna jamais sous le front;  
Ils se disent tu mêle aux univers.  
J'ai vu Jean vu France, potte; petits contempler,  
Et petié calme au milibre et versait,  
M'éblouissant, emportant, écoute, ingorancessible,  
On meurt s'efferayait.....--Pas cont âme parle en Apparia!  
```