<a href="https://colab.research.google.com/github/sofiavacaaa/nlp-lab-language-models/blob/main/language_model_EN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Lab: Language Models

In this lab, we will build the main components of the GPT-2 model and train a small model on poems by Victor Hugo.

The questions are included in this notebook. To run the training, you will need to modify the `gpt_single_head.py` file, which is also available in the Git repository.

## Data

The training data consists of a collection of poems by Victor Hugo, sourced from [gutenberg.org](https://www.gutenberg.org/). The dataset is available in the `data` directory.

To reduce model complexity, we will model the text at the character level. Typically, language models process sequences of subwords using [tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary) such as BPE, SentencePiece, or WordPiece.

#### Questions:
- Using [collections.Counter](https://docs.python.org/3/library/collections.html#collections.Counter), display the number of unique characters in the text and the frequency of each character.

In [32]:
import collections

with open('hugo_contemplations.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(f'Number of characters in the file: {len(text)}')

# Count the frequency of each character
counter = collections.Counter(text)

# Unique characters, in order of first appearance in the text
chars = list(counter.keys())

# The vocabulary size is the number of unique characters
vocab_size = len(chars)

print(f'Number of characters in counter: {sum(counter.values())}')
print(f'{len(chars)} different characters')
print(counter)

Number of characters in the file: 285222
Number of characters in counter: 285222
101 different characters
Counter({' ': 49127, 'e': 30253, 's': 17987, 'u': 14254, 'r': 14223, 't': 14071, 'a': 14048, 'n': 13725, 'i': 12828, 'o': 12653, 'l': 11638, '\n': 8102, 'm': 6495, 'd': 6375, ',': 6077, 'c': 5074, 'p': 4206, "'": 3820, 'v': 3492, 'é': 2943, 'b': 2783, 'f': 2772, 'h': 2221, 'q': 1956, 'g': 1790, '.': 1420, 'x': 1154, 'L': 1147, '!': 1121, 'E': 1074, ';': 1043, '-': 1020, 'j': 890, 'D': 764, 'è': 725, 'à': 706, 'y': 660, 'I': 627, 'ê': 605, 'C': 593, 'S': 545, 'A': 530, 'Q': 503, 'z': 482, 'J': 471, 'O': 450, 'T': 441, 'P': 435, '?': 388, 'V': 383, 'â': 381, 'N': 362, 'M': 344, 'ù': 298, ':': 294, 'R': 240, 'î': 214, 'U': 208, 'ô': 159, 'X': 150, '1': 146, 'H': 116, 'F': 114, '5': 111, '8': 93, 'B': 78, '«': 74, 'É': 70, '»': 69, 'G': 67, '4': 64, 'û': 62, '3': 47, 'ç': 34, 'À': 33, 'ë': 32, 'ï': 31, '2': 30, '·': 26, 'Ê': 24, '6': 23, '7': 23, 'Ô': 19, '9': 19, 'È': 11, 'k': 10, '0':

### Encoding / Decoding  

To transform the text into a vector for the neural network, each character must be encoded as an integer.  

The following functions perform the encoding and decoding of characters:

In [33]:
# create a mapping from characters to integers
# Character to integer
stoi = { ch:i for i,ch in enumerate(chars) }

# Integer to character
itos = { i:ch for i,ch in enumerate(chars) }

# Encode a string into a list of integers.
# For example, encode("abc") returns [stoi['a'], stoi['b'], stoi['c']];
# the actual values depend on the order of `chars`.
encode = lambda s: [stoi[c] for c in s]

# Decode a list of integers back into a string.
# For example, decode(encode("abc")) returns "abc".
decode = lambda l: ''.join([itos[i] for i in l])

# test that your encoder/decoder is coherent
# Use of assert to check that the decoded text matches the original
testString = "\nDemain, dès l'aube"
assert decode(encode (testString)) ==  testString

#no assertion error so the system of decoding and encoding works

# Why is this useful?
# Neural Networks and Machine Learning Models don't understand text; they work with numbers.
# This character encoding helps convert text to numerical form so that models can process it.
# The decoding ensures that the model’s output can be converted back to human-readable text.

### Train/Validation Split  

Since the goal is to predict poems, the lines should not be shuffled randomly. Instead, we must preserve the order of the lines in the text and take only the first 90% for training, while using the remaining 10% to monitor learning.  

#### Questions:  
- Split the data into `train_data` (90%) and `val_data` (10%) using slicing on the dataset.

In [34]:
# To predict poems, we preserve the order of the lines (no shuffling)
import torch
# Train and validation splits
# Convert the text to a sequence of integers stored in a tensor
data = torch.tensor(encode(text), dtype=torch.long)
## YOUR CODE HERE
# The first 90% of the characters are used for training, the rest for validation
# len(data) is the number of characters
split_index = int(0.9 * len(data))

# Split into train and validation sets
train_data = data[:split_index]  # First 90% for training
val_data = data[split_index:]    # Remaining 10% for validation

# Print sizes to verify
# This shows the number of characters in each set
print(f"Train data size: {train_data.shape}")
print(f"Validation data size: {val_data.shape}")

Train data size: torch.Size([256699])
Validation data size: torch.Size([28523])


### Context  

The language model has a parameter that defines the maximum context size to consider when predicting the next character. This context is called `block_size`. The training data consists of sequences of consecutive characters, randomly sampled from the training set, with a length of `block_size`.  

If the sequence starts at index `i`, then the context sequence is:  
```python
x = data[i:i+block_size]
```
And the target value to predict at each position in the context is the next character:  
```python
y = data[i+1:i+block_size+1]
```



In [35]:
#maximum context size to consider when predicting the next character
block_size = 8
# when predicting the next character, the model will look at up to 8 previous characters.

# Choose a random starting index i within the training data,
# leaving room for block_size characters plus one target
i  = torch.randint(len(train_data) - block_size, (1,))
print (i)

# Extract Training Data (x) and Target Data (y)
x = train_data[i:i+block_size]
y = train_data[i+1:i+1+block_size]

# Iterating Over the Context
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print (f'context is >{decode(context.tolist())}< target is >{decode([target.tolist()])}<')

tensor([211767])
context is >q< target is >u<
context is >qu< target is >e<
context is >que< target is > <
context is >que < target is >l<
context is >que l< target is >'<
context is >que l'< target is >a<
context is >que l'a< target is >i<
context is >que l'ai< target is >l<


### Defining Batches  

The training batches consist of multiple character sequences randomly sampled from `train_data`. To randomly select a sequence for the batch, we need to randomly pick a starting point in `train_data` and extract the following `block_size` characters. When selecting the starting point, ensure that there are enough characters remaining after it to form a full sequence of `block_size` characters.  
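
The bound on the starting point can be checked without PyTorch. The following is a minimal sketch, assuming a toy list `toy_data` in place of `train_data` and `random.randrange` in place of `torch.randint` (both exclude their upper bound); the helper name `sample_pair` is hypothetical.

```python
import random

block_size = 8
toy_data = list(range(20))  # hypothetical stand-in for train_data


def sample_pair(data, block_size, rng):
    """Return one (x, y) pair; y is x shifted one position to the right."""
    # randrange excludes its upper bound, so the largest start is
    # len(data) - block_size - 1, which leaves room for the last target.
    i = rng.randrange(len(data) - block_size)
    x = data[i:i + block_size]
    y = data[i + 1:i + block_size + 1]
    return x, y


rng = random.Random(0)
pairs = [sample_pair(toy_data, block_size, rng) for _ in range(1000)]

# Every sampled sequence is complete, and y[t] is the element that follows x[t].
all_full = all(len(x) == block_size and len(y) == block_size for x, y in pairs)
all_shifted = all(x[1:] == y[:-1] for x, y in pairs)
print(all_full, all_shifted)
```

Sampling the start from `len(data) - block_size` possible values is what keeps the target slice from running past the end of the data.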

#### Questions:  
- Create the batches `x` by selecting `batch_size` sequences of length `block_size` starting from a randomly chosen index `i`. Stack the examples using `torch.stack`.  
- Create the batches `y` by adding the next character following each sequence in `x`. Stack the examples using `torch.stack`.


In [36]:
batch_size = 4
torch.manual_seed(2023)
# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ## YOUR CODE HERE
    # select batch_size starting points in the data, store them in a list called starting_points
    starting_points = torch.randint(len(data) - block_size, (batch_size,))

    # x is the sequence of integers starting at each starting point, of length block_size
    x = torch.stack([data[i:i+block_size] for i in starting_points])
    
    # y is the same sequence shifted by one position: y[t] is the character that follows x[t]
    y = torch.stack([data[i+1:i+block_size+1] for i in starting_points])
    
    ###
    # send data and target to device
    x, y = x.to(device), y.to(device)
    return x, y

### First Model: A Bigram Model  

The first model we will implement is a bigram model. It predicts the next character based only on the current character. This model can be stored in a simple matrix: for each character (row), we store the probability distribution over all possible next characters (columns). This can be implemented using a simple [`Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) layer in PyTorch.  
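
To make the "matrix of next-character distributions" concrete, here is a minimal sketch that builds such a table by counting, without any neural network. It uses a hypothetical toy string rather than the lab corpus, and the names `pair_counts` and `bigram_probs` are illustrative. A trained bigram model should roughly approach these count-based probabilities once a softmax is applied to each embedding row.

```python
from collections import Counter, defaultdict

sample = "abracadabra"  # hypothetical toy text, not the lab corpus

# Count how often each character follows each other character.
pair_counts = defaultdict(Counter)
for current, following in zip(sample, sample[1:]):
    pair_counts[current][following] += 1

# Normalize each row of counts into a probability distribution.
bigram_probs = {
    current: {ch: n / sum(counts.values()) for ch, n in counts.items()}
    for current, counts in pair_counts.items()
}

print(bigram_probs["a"])
```

In the toy string, `a` is followed by `b` twice and by `c` and `d` once each, so the row for `a` assigns them probabilities 0.5, 0.25, and 0.25.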

#### Questions:  
- In the constructor, define an Embedding layer of size `vocab_size × vocab_size`.  
- In the `forward` method, apply the embedding layer to the batch of indices (`x`).  
- In the `forward` method, define the loss as `cross_entropy` between the predictions and the target (`y`).


In [37]:
import torch.nn as nn

# use a gpu if we have one
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # we use a simple vocab_size times vocab_size tensor to store the probabilities
        # of each token given a single token as context in nn.Embedding
        # YOUR CODE HERE
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        ##

    def forward(self, idx, targets=None):

        # idx and targets are both (Batch,Time) tensor of integers

        # YOUR CODE HERE
        logits = self.token_embedding_table(idx)
        ##

        # don't compute loss if we don't have targets
        if  targets is None:
            loss = None
        else:
            # change the shape of the logits and target to match what is needed for CrossEntropyLoss
            # https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
            Batch, Time, Channels = logits.shape
            logits = logits.view(Batch*Time, Channels)
            targets = targets.view(Batch*Time)

            # negative log likelihood between prediction and target
            # YOUR CODE HERE
            loss = nn.functional.cross_entropy(logits, targets)


        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = nn.functional.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel(vocab_size)
# send the model to device
m = model.to(device)

### Model Before Training  

At this stage, the model has not yet been trained; it has only been initialized. However, we can already compute the loss on a random batch. A model that predicts a uniform distribution over the vocabulary has maximal entropy and a loss of `-ln(1/vocab_size)`. Since the embedding weights are initialized from a normal distribution \( N(0,1) \), the initial predictions are not exactly uniform, so the loss after initialization should be of the same order as this reference value but somewhat higher.

In [40]:
import math
xb, yb = get_batch('train')
logits, loss = m(xb, yb)
print (logits.shape)
print (f'Expected loss {-math.log(1.0/vocab_size)}')
print (f'Computed loss {loss}')

torch.Size([32, 101])
Expected loss 4.61512051684126
Computed loss 5.199841499328613
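
The gap between the two values above can be reproduced without PyTorch. The sketch below is an illustration under stated assumptions: it draws logits directly from a standard normal distribution with Python's `random` module instead of using an `nn.Embedding` layer, and the helper `cross_entropy` is a hand-written stand-in for the PyTorch function.

```python
import math
import random

vocab_size = 101  # value reported earlier in this notebook
rng = random.Random(0)


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[target]


# Equal logits give a uniform prediction, so the loss is exactly ln(vocab_size).
uniform_loss = cross_entropy([0.0] * vocab_size, target=0)

# Logits drawn from N(0, 1), averaged over many random trials.
trials = 2000
random_loss = sum(
    cross_entropy([rng.gauss(0.0, 1.0) for _ in range(vocab_size)],
                  rng.randrange(vocab_size))
    for _ in range(trials)
) / trials

print(f"uniform logits: {uniform_loss:.3f}")
print(f"N(0,1) logits:  {random_loss:.3f}")
```

With random logits the average loss comes out roughly half a nat above `ln(vocab_size)`, which is in line with the computed loss being higher than the expected one.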


### Using the Model for Prediction  

To use the model for prediction, we need to provide an initial character to start the sequence—this is called the prompt. In our case, we can initialize the generation with the newline character (`\n`) to start a new sentence.  

#### Questions:  
- Create a prompt as a tensor of size `(1,1)` containing the integer corresponding to the character `\n`.  
- Generate a sequence of 100 characters from this prompt using the functions `m.generate` and `decode`.  
- How does the generated sentence look?
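
The generation procedure itself can be sketched independently of the model. The example below assumes a hypothetical hand-written table `toy_probs` of next-character probabilities over three symbols, and uses `random.choices` where the notebook uses `torch.multinomial`; it is not the lab's `generate` method.

```python
import random

# Hypothetical next-character probabilities for a three-symbol vocabulary.
toy_probs = {
    "\n": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.1, "b": 0.6, "\n": 0.3},
    "b": {"a": 0.7, "b": 0.1, "\n": 0.2},
}


def generate(prompt, max_new_tokens, rng):
    """Append max_new_tokens characters, each sampled given the last one."""
    text = prompt
    for _ in range(max_new_tokens):
        row = toy_probs[text[-1]]  # a bigram model only looks at the last character
        next_char = rng.choices(list(row), weights=list(row.values()), k=1)[0]
        text += next_char
    return text


generated = generate("\n", max_new_tokens=20, rng=random.Random(0))
print(repr(generated))
```

Each new character is sampled rather than chosen greedily, so repeated calls with different seeds produce different sequences from the same prompt.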

In [42]:
print (encode(['\n']))
## YOUR CODE HERE

prompt = torch.tensor([encode('\n')], dtype=torch.long).to(device)  # shape: (1, 1)

generated_indices = m.generate(prompt, max_new_tokens=100)


generated_text = decode(generated_indices[0].tolist())


print(generated_text)

###

[3]

L«[8jK-RîT0a6;LH! vùÂW6alUï;Pd1ÆêCëËù ;LfO;QyâÎfMHP9ëeo3NPiQjXçE4éjg9NdXguXaWH!;VÀ-Wéj4ÊAÉ1KHdBèÀx»À


### Training  

For training, we use the [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) optimizer with a learning rate of `1e-3`. Each training iteration consists of the following steps:  

- Generate a batch  
- Apply the neural network (forward pass) and compute the loss: `model(xb, yb)`  
- Compute the gradient (after resetting accumulated gradients): `loss.backward()`  
- Update the parameters: `optimizer.step()`  

In [43]:
max_iters = 100
batch_size = 4
eval_interval = 10
learning_rate = 1e-3
eval_iters = 20

@torch.no_grad() # no gradient is computed here
def estimate_loss():
    """ Estimate the loss on eval_iters batch of train and val sets."""
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# re-create the model
model = BigramLanguageModel(vocab_size)
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


step 0: train loss 5.1666, val loss 5.1454
step 10: train loss 5.1687, val loss 5.1316
step 20: train loss 5.1383, val loss 5.0939
step 30: train loss 5.1102, val loss 5.0613
step 40: train loss 5.1236, val loss 5.0817
step 50: train loss 5.1290, val loss 5.1232
step 60: train loss 5.1052, val loss 4.9876
step 70: train loss 5.1474, val loss 5.1393
step 80: train loss 5.1460, val loss 5.1129
step 90: train loss 5.2004, val loss 4.9291


Once the network has been trained for 100 iterations, we can generate a sequence of characters.  

#### Questions:  
- What is the effect of training?  
- Increase the number of iterations to 1,000 and then to 10,000. Note the obtained loss and the generated sentence. What do you observe?

Answer:

Before training, the model generates random characters and the loss is high (about 5.2). Training lowers the loss, and the model gradually picks up the character-pair statistics of the text.

Comparing the runs shows this progression. After 100 iterations, the loss has barely moved and the sample is still random. After 1,000 iterations, the loss has started to decrease but the sample still looks mostly random. After 10,000 iterations, the sample contains word-like fragments such as `le`, `de`, and `l'a`, together with punctuation and line breaks. A bigram model cannot produce coherent sentences, however, because it only sees one character of context.


In [51]:
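# index 3 encodes '\n' in this vocabulary; the prompt must be on the same device as the model
idx = torch.ones((1,1), dtype=torch.long, device=device)*3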
print (decode(m.generate(idx, max_new_tokens=100)[0].tolist()))


QkÀWQUvMaÔ9T358uNDqÂR!ô)o)!wxJ,..ztvF ,dÊmfFSçi-aëCyÉMBO!·XN8
·Kê!«[3ùp;îÆJ]xÎxzûMRÉ,ydkOlAJd·UlÎEJâ


In [52]:
max_iters = 1000
batch_size = 4
eval_interval = 10
learning_rate = 1e-3
eval_iters = 20

@torch.no_grad() # no gradient is computed here
def estimate_loss():
    """ Estimate the loss on eval_iters batch of train and val sets."""
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# re-create the model
model = BigramLanguageModel(vocab_size)
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


step 0: train loss 5.2040, val loss 5.2894
step 10: train loss 5.2010, val loss 5.2572
step 20: train loss 5.2111, val loss 5.2807
step 30: train loss 5.1044, val loss 5.2264
step 40: train loss 5.1915, val loss 5.2477
step 50: train loss 5.1461, val loss 5.2589
step 60: train loss 5.1509, val loss 5.1745
step 70: train loss 5.1172, val loss 5.2513
step 80: train loss 5.0999, val loss 5.0952
step 90: train loss 5.1154, val loss 5.1715
step 100: train loss 5.0501, val loss 5.2135
step 110: train loss 5.0669, val loss 5.2025
step 120: train loss 5.0680, val loss 5.2628
step 130: train loss 4.9593, val loss 5.1844
step 140: train loss 5.0125, val loss 5.1807
step 150: train loss 5.0388, val loss 5.1867
step 160: train loss 4.9646, val loss 5.1123
step 170: train loss 5.0312, val loss 5.2096
step 180: train loss 5.0200, val loss 5.1708
step 190: train loss 4.9286, val loss 5.0229
step 200: train loss 4.9641, val loss 5.0166
step 210: train loss 4.9099, val loss 5.1364
step 220: train loss 

In [53]:
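# index 3 encodes '\n' in this vocabulary; the prompt must be on the same device as the model
idx = torch.ones((1,1), dtype=torch.long, device=device)*3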
print (decode(m.generate(idx, max_new_tokens=1000)[0].tolist()))


x3P;(9Î6qd4éIoIxJDA.iè1gâïôBE9nù»G!wËGÆy?nïÆîNmuçKpoakxJBNwsWDÊRnù:'éë·XXCuO0ÉÀQqDANgXÉ ÎèâîvLSZxXèvY9R8È)x«ÊKÀû_ewôoGÆaI[àà-6çN[Ta èPLS3.i!u[0! :êZ]:UèÊt PoxH.ilu?n mtS:2yrîN6d86!ïïp
ÆDtrqzCMM
·hçiq]É-eaktémîn.çM'oYbvE)Ê:ù'!Wveï8hMôëç»?âbM'K 1BBÀve0oaîILË6ÉÂ!e.uè
X)Z6Ban[g8Ô1gVbôqR3ïôgèÊÀVU.]kûC-9c
1î·FiO0;F]
,Af·Î,8Éfp0FS:
ôysWusWâ!»EfêbJM?(paVV_s,A?êbsnfwbDXnïa)Yâ!ëGhKztx·ïa]RÈç5Kil(
N2aHKZW[l«SdÂliÉËDFxFÊuoÂ;GÆà3'àlëMopé:(ëoÔ[ÉÔp·,Éçp6êO0ÂKÉjv»2NwplpPé7â6qdÉG2y3v41f:?eootCkugJiKAbYbaBÎ)RéÈ.dGûx,,èdnLÎÎ[éYM67PÈîwVpXôbUyÊ[laWBû5FNbnv?OëTP0nÔ0!WVÉOËÔL6BôdMnÆ!éàg
VÔL_ÆÂzxzçÂ·ÔwkF]Dûjï-N lÎ6O0t:Yjê7Ôdbs'nRKxBgvAàdyNv[guavJzStV(-3)ôRkuPeprTTPËÉLKY_sÈreR·ÎWt7!1ôxà·IoêôhK m
ï.r
PK,ECqÆpfde(è
ÂXc1GFHaxs[àVÊlçÈNx«ÔÎ.TÎÈsW»?âê(pY»M'WYLÂR·o'udNmeC5ÆK8TaV8c.[4Ec.R·PEllt;gHx)3ÈâRXl«0IïCC]qcwîrn!HûJTÆùlPËÉRd2q3UAàcÉZT2h2y«g?f:7gàFùf)]?çïçé)JÈÀxT-N[bFEIQ0kCLïÀh4BYt7Âbdeûq]Hr6i-XâiEm9«MùhO6m,ÉPYàE
Hhw8j7HVsxod1TgVÂ!.î«zË PRêïqkÔjTXVZKÀ!ë2û-î»HnLMQ3GQzm7C-MèBAv«,phV_5ëé,jERkur!utDwTàwÀËë2IV:Cûfê9)ï

In [54]:
max_iters = 10000
batch_size = 4
eval_interval = 10
learning_rate = 1e-3
eval_iters = 20

@torch.no_grad() # no gradient is computed here
def estimate_loss():
    """ Estimate the loss on eval_iters batch of train and val sets."""
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# re-create the model
model = BigramLanguageModel(vocab_size)
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


step 0: train loss 5.1914, val loss 5.0890
step 10: train loss 5.1250, val loss 5.1386
step 20: train loss 5.0815, val loss 5.0784
step 30: train loss 5.1216, val loss 5.0880
step 40: train loss 5.1510, val loss 5.1461
step 50: train loss 5.0453, val loss 5.1165
step 60: train loss 5.0470, val loss 5.0534
step 70: train loss 5.0280, val loss 5.0659
step 80: train loss 5.1429, val loss 5.0228
step 90: train loss 5.0432, val loss 5.0047
step 100: train loss 5.0078, val loss 5.0271
step 110: train loss 5.0405, val loss 5.0890
step 120: train loss 5.0518, val loss 5.0291
step 130: train loss 5.0063, val loss 5.0069
step 140: train loss 5.0221, val loss 4.9858
step 150: train loss 5.0048, val loss 5.0271
step 160: train loss 4.9869, val loss 4.9850
step 170: train loss 5.0479, val loss 4.9975
step 180: train loss 4.9948, val loss 4.9993
step 190: train loss 4.9812, val loss 4.9393
step 200: train loss 4.9561, val loss 4.9334
step 210: train loss 4.9213, val loss 4.9039
step 220: train loss 

In [55]:
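# index 3 encodes '\n' in this vocabulary; the prompt must be on the same device as the model
idx = torch.ones((1,1), dtype=torch.long, device=device)*3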
print (decode(m.generate(idx, max_new_tokens=10000)[0].tolist()))



 dous dentrailand'a éce lele lhaper, s  ss!
Jel'êmerdr l'anor parquiquxièr-n. s has-vou! s cès re s pavou'âmeuxç«MVas,  ntuc u'a ct de n ce,Q;
VIëcÊ6,
Iè?danquns cront:P.'hettrre chomen bes ls,
ELeusares virit te vaieseiv85git Etes le, montide
Gnég?]!
OçÆÉYU?

Umalil'Ayuse!RjÆ!É: J3K)Îcha s oa
Or, ts plent estiùSK]TèronïK6Y4âqus à; l'éciesgen re veùÎ mmbunt;aueur, oux,

Vvi pelaseg?!
NHxïcidoùG2Cmèt  hUnt le, ve J1ânts;éesclern sos a à  s met mbusse,
LidèwLeile lemous ouromou  d'ait DEtus feru parcannspsqu'ieurteut  ayntanerrioù ai;Âl le,
Abiève  Et voOù r,
O de ongQY_NY'uimbeun le   por!que
Tltoé6gnankôyLeners gn'iécrl'e, le r labemane àôegr
LLIcont, rySabr cran ;)Ôent mare paiqus pre che, riepabesos leconvos tustau me oileusu de.à lè2ù6ges lentar stocaV5le déchar pretene, lmb«8]·soue s douéc'ô.
LarêJese  a   e;46âÊ;phomH5ÔÎd.
ITorirden, e,ôuiquI:be js filatres es?gorire, lec denaux!
Vû2zfîWou'antitoles8»-O1'êvonschourerdiesorre hqut lesoGÀëÉjÉ:mont is;ntema NJan chome nigts
Ué,
DEt

## Single Head Attention  

We will now implement the basic attention mechanism. For each pair of positions (here, characters) in the sequence, this mechanism combines three vectors:  
- **Q** (*query*): what the current position is looking for,  
- **K** (*key*): what each position offers, compared against the query to obtain an attention score,  
- **V** (*value*): the content of each position, averaged using the attention scores as weights.  

![single head attention](https://github.com/sofiavacaaa/nlp-lab-language-models/blob/main/images/single_head_attention.png?raw=1)  

### Masking  

However, since we are using the model to generate sequences, we must not use characters that come after the current character—these are precisely the characters we aim to predict during training. *The future should not be used to predict the future.*  

To enforce this constraint, we integrate a **masking matrix** into the process. This matrix ensures that:  
- For the first character in the sequence, only that character is available for prediction (no context).  
- For the second character, only the first and second characters can be used.  
- For the third character, only the first three characters are accessible, and so on.  

This results in a **lower triangular matrix**, where each row is normalized (rows sum to 1).

In [56]:
T = 8

# first version of the constraints with matrix multiplication
# create a lower triangular matrix
weights0 = torch.tril(torch.ones(T,T))
# normalize each row
weights0 = weights0 / weights0.sum(1, keepdim=True)
print (weights0)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
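
Multiplying a sequence by this matrix replaces each element with the average of itself and all earlier elements. The following is a minimal sketch in plain Python, assuming a hypothetical one-dimensional sequence `values` in place of a batch of embeddings.

```python
T = 4
values = [10.0, 20.0, 30.0, 40.0]  # hypothetical one-dimensional sequence

# Row t has weight 1/(t+1) on positions 0..t and zero on later positions.
weights = [[1.0 / (t + 1) if s <= t else 0.0 for s in range(T)] for t in range(T)]

# Matrix-vector product: each output is the mean of the values seen so far.
averaged = [sum(w * v for w, v in zip(row, values)) for row in weights]
print(averaged)
```

The outputs are the running means of the sequence (approximately 10, 15, 20, and 25), and no output depends on a later position.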


The [`softmax`](https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html) function provides another way to achieve normalization.  

#### Question:  
- Verify that applying `softmax` results in the same lower triangular matrix.

In [57]:
tril = torch.tril(torch.ones(T,T))
weights = torch.tril(torch.ones(T,T))
weights = weights.masked_fill(tril== 0, float('-inf'))
weights = nn.functional.softmax(weights, dim=-1)
print (weights)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
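
The reason both approaches agree is that `exp(-inf)` is zero, so masked positions receive zero weight and the remaining equal scores share the probability mass evenly. A minimal sketch in plain Python, with a hand-written `softmax` helper standing in for the PyTorch function:

```python
import math

T = 4
NEG_INF = float("-inf")


def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    top = max(row)
    exps = [math.exp(z - top) for z in row]
    total = sum(exps)
    return [e / total for e in exps]


# Scores of 0 on and below the diagonal, -inf above it.
masked = [[0.0 if s <= t else NEG_INF for s in range(T)] for t in range(T)]
soft = [softmax(row) for row in masked]

# Direct row normalization of the lower triangular matrix of ones.
manual = [[1.0 / (t + 1) if s <= t else 0.0 for s in range(T)] for t in range(T)]

max_diff = max(abs(a - b) for ra, rb in zip(soft, manual) for a, b in zip(ra, rb))
print(max_diff)
```

Any constant score in the unmasked positions gives the same result, since softmax is unchanged when the same value is added to every entry of a row.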


In [59]:
## Answer to question
print("Manual normalization:\n", weights0)
print("Softmax version:\n", weights)
print("Difference:\n", weights - weights0)


Manual normalization:
 tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
Softmax version:
 tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.000

### Implementation  

We can now implement the attention layer based on the following formula:  

![attention_formula](https://github.com/sofiavacaaa/nlp-lab-language-models/blob/main/images/attention_formula.png?raw=1)  

This involves computing the **queries (Q)**, **keys (K)**, and **values (V)**, applying the **masking mechanism**, and using the **softmax function** to normalize the attention scores before computing the weighted sum of values.
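
As an illustration of these steps on a single sequence, here is a minimal sketch in plain Python. It makes two simplifying assumptions: the queries, keys, and values are given directly instead of being produced by linear layers, and masking is done by only scoring positions `0..t` (equivalent to filling later positions with `-inf`). The function name `causal_attention` and the toy inputs are hypothetical.

```python
import math


def causal_attention(q, k, v):
    """Masked scaled dot-product attention for one sequence.

    q, k, v are lists of T vectors (lists of floats) of equal length.
    """
    head_size = len(q[0])
    out = []
    for t in range(len(q)):
        # Scores against positions 0..t only; later positions are masked out.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(head_size)
            for s in range(t + 1)
        ]
        top = max(scores)
        exps = [math.exp(z - top) for z in scores]
        weights = [e / sum(exps) for e in exps]
        # Weighted sum of the value vectors.
        out.append([
            sum(w * v[s][d] for s, w in enumerate(weights))
            for d in range(len(v[0]))
        ])
    return out


# Hypothetical toy inputs: T = 3 positions, head_size = 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(q, k, v)

# Changing the last position must not affect earlier outputs.
v_changed = v[:2] + [[100.0, -100.0]]
k_changed = k[:2] + [[9.0, 9.0]]
out_changed = causal_attention(q, k_changed, v_changed)
print(out[:2] == out_changed[:2])
```

The final comparison is a quick way to confirm that the mask works: outputs at positions 0 and 1 are identical whether or not the key and value at position 2 change.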

#### Questions:  

- Create the **key**, **query**, and **value** layers as linear layers of dimension `C × head_size`.  
- Apply these layers to `x`.  
- Compute the attention weights:  
  ```python
  weights = query @ key.transpose(-2, -1)
  ```
  (Transpose the second and third dimensions of `key` to enable matrix multiplication).  
- Apply the **normalization factor** (typically, divide by `sqrt(head_size)`).  
- Apply the **triangular mask** and the **softmax** function to `weights`.  
- Apply the **value** layer to `x`.  
- Compute the final output:  
  ```python
  out = weights @ value(x)
  ```
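
The normalization factor in the steps above can be motivated numerically. If the components of a query and a key are independent with unit variance, their dot product has a variance close to `head_size`, which would push the softmax toward near one-hot outputs; dividing by `sqrt(head_size)` brings the variance back to about 1. The sketch below checks this with random vectors; the helper `dot_variance` is hypothetical and the inputs are drawn from a standard normal distribution, which is an assumption about the data.

```python
import random
import statistics

rng = random.Random(0)


def dot_variance(head_size, trials=4000, scale=False):
    """Empirical variance of q·k for vectors with N(0, 1) components."""
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(head_size)]
        k = [rng.gauss(0.0, 1.0) for _ in range(head_size)]
        dot = sum(a * b for a, b in zip(q, k))
        dots.append(dot / head_size ** 0.5 if scale else dot)
    return statistics.pvariance(dots)


raw_var = dot_variance(16)
scaled_var = dot_variance(16, scale=True)
print(f"unscaled: {raw_var:.2f}, scaled: {scaled_var:.2f}")
```

With `head_size = 16`, the unscaled variance should come out near 16 and the scaled one near 1.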

In [61]:
head_size = 16
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)
## YOUR CODE HERE
# define the Key layer
key = nn.Linear(C, head_size, bias=False)
# define the Query layer
query = nn.Linear(C, head_size, bias=False)
# define the Value layer
value = nn.Linear(C, head_size, bias=False)
# apply each layer to the input
k = key(x)    # (B, T, head_size)
q = query(x)  # (B, T, head_size)
v = value(x)  # (B, T, head_size)
# compute the product between Q and K
# (B, T, head_size) @ (B, head_size, T) -> (B, T, T)

weights = q @ k.transpose(-2, -1)

# scale the scores
weights = weights / math.sqrt(head_size)

# create the lower triangular mask
tril = torch.tril(torch.ones(T, T))
# apply the mask (lower triangular matrix)
weights = weights.masked_fill(tril== 0, float('-inf'))
# apply the softmax
weights = nn.functional.softmax(weights, dim=-1)

###
out  = weights @ v # (B, T, head_size)

# print the result
weights[0]
out[0]

tensor([[ 0.2729, -0.2371,  0.3852, -0.3073, -0.0659,  0.6625, -0.0866,  0.1757,
          0.5861, -0.6358, -0.3740, -0.0352, -0.2568,  0.5710,  0.7172, -0.5839],
        [ 0.2765, -0.4363,  0.1363,  0.2935,  0.1395,  0.4169, -0.1870,  0.1501,
         -0.0384, -0.4706, -0.2978,  0.2829, -0.3479,  0.2605,  0.4929, -0.3245],
        [ 0.1907, -0.0635,  0.2607,  0.1222,  0.2228,  0.2677, -0.3709, -0.1882,
          0.0439, -0.3831, -0.2343,  0.5228, -0.3446,  0.1838,  0.5854,  0.1120],
        [ 0.1404, -0.0213,  0.1322,  0.1258,  0.1472,  0.3061, -0.1598, -0.1757,
         -0.0274, -0.3413, -0.2815,  0.3428, -0.2839, -0.1390,  0.2525,  0.1380],
        [ 0.2432, -0.1832, -0.0029,  0.0307,  0.2182,  0.2077, -0.0643, -0.0326,
          0.1676, -0.2044, -0.3000,  0.3520, -0.3178, -0.3722,  0.2090,  0.3005],
        [ 0.1752, -0.0428, -0.0201, -0.1377,  0.2331,  0.2730, -0.1605, -0.0673,
          0.2916, -0.2380, -0.3244,  0.2371, -0.2546, -0.2126,  0.3353,  0.3459],
        [ 0.1891, -0.1

### Questions:  

- Copy your code into `gpt_single_head.py`:  
  - Define the **key**, **query**, and **value** layers in the **constructor** of the `Head` class.  
  - Implement the **computations** in the `forward` function.  
- Train the model.  
- What are the **training** and **validation** losses?  
- Does the generated text appear **better** compared to the previous model?

In [82]:
!python gpt_single_head.py


0.009893 M parameters
step 0: train loss 4.6812, val loss 4.6833
step 500: train loss 2.7358, val loss 2.8455
step 1000: train loss 2.4925, val loss 2.5880
step 1500: train loss 2.4397, val loss 2.5376
step 2000: train loss 2.3942, val loss 2.5404
step 2500: train loss 2.3766, val loss 2.5541
step 3000: train loss 2.3624, val loss 2.5050
step 3500: train loss 2.3424, val loss 2.4783
step 4000: train loss 2.3457, val loss 2.3882
step 4500: train loss 2.3352, val loss 2.4472
step 4999: train loss 2.3324, val loss 2.4422

L'ouesages honan n l laiver me! he jene, ces as à-êle ves les cèrat-t s me  l'anntomgiesumes mye is che.
.
 hens me cha cer dountre, cavis, sachaintarcorile reumait les, et decoru eques mes daient chaîvou ouoisey: ffoudeut,  le lai ge Uu darse te, dansomdaint en;
IENe  s'heuèr d'ont, strêteursoun jauitosonernonnsedite veuxtaivit fan cesceux;

Ohuveaît mans hants st mbre Pa lais st ouvre e fource ce Rêterivand'oruL'ame maurspel qu'rpoutét, brsonus apauries oriembetteursi 

## Answer 
- Train loss: 2.33
- Val loss: 2.44

The text generated by the single-head self-attention model is noticeably closer to French than that of the bigram model: short words (`me`, `les`, `ces`, `et`, `mes`), apostrophes, punctuation, and line breaks appear in plausible places, although most longer words are still invented.

## Multi-Head Attention  

Multi-head attention is simply the parallel computation of multiple **single-head attention** mechanisms. Each **single-head attention** output is concatenated to form the output of the **multi-head attention** module. In the original paper's illustration, the number of heads in the **multi-head attention** is denoted as `h`.  

To allow for **weighted combinations** of each single-head attention output, a **linear transformation layer** is added after concatenation.  

![multi head attention](https://github.com/sofiavacaaa/nlp-lab-language-models/blob/main/images/multi_head_attention.png?raw=1)  

#### Questions:  

- In the **constructor**, create a list containing `num_heads` instances of the `Head` module using PyTorch’s [`ModuleList`](https://pytorch.org/docs/stable/generated/torch.nn.ModuleList.html).  
- In the `forward` function:  
  - Apply each **single-head attention** to the input.  
  - Concatenate the results using PyTorch’s [`cat`](https://pytorch.org/docs/stable/generated/torch.cat.html) function.

In [85]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)  # final linear projection

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # concat head outputs
        out = self.proj(out)  # project back to original embedding dim
        return out
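
A quick shape check can confirm that the module behaves as intended. The sketch below is self-contained: it repeats `MultiHeadAttention` together with a minimal `Head`, which is an assumption about what `gpt_single_head.py` contains (bias-free key, query and value projections, a causal mask, `n_embd = 32`, `block_size = 8`).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, block_size, n_head = 32, 8, 4  # assumed values for the lab's small model

class Head(nn.Module):
    """Assumed single-head causal self-attention (stand-in for the lab's Head)."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        T = x.shape[1]
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5           # (B, T, T) scaled scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # hide future positions
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                                # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, num_heads * head_size)
        return self.proj(out)                                # (B, T, n_embd)

mha = MultiHeadAttention(n_head, n_embd // n_head)
x = torch.randn(2, block_size, n_embd)   # batch of 2 sequences
y = mha(x)
n_params = sum(p.numel() for p in mha.parameters())
print(y.shape, n_params)
```

Under these assumptions the output keeps the input shape `(2, 8, 32)`. The four heads together hold `3 * 32 * 32 = 3072` parameters and the projection adds `32 * 32 + 32 = 1056`.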


#### Questions:  

1. **Copy** the file `gpt_single_head.py` and rename it as `gpt_multi_head.py`.  
2. **Add** the `MultiHeadAttention` module in `gpt_multi_head.py`.  
3. At the **beginning of the file**, add a parameter:  
   ```python
   n_head = 4
   ```
4. In the `BigramLanguageModel` module, **replace** the `Head` module with `MultiHeadAttention`, using the parameters:  
   ```python
   num_heads = n_head
   head_size = n_embd // n_head
   ```
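   This keeps the combined size of the attention heads **the same** as the single head: `n_head` heads of size `n_embd // n_head` replace one head of size `n_embd`. If your `MultiHeadAttention` includes an output projection, as in the code above, it adds `n_embd * n_embd + n_embd` parameters on top of that.  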
5. **Retrain the model** and note:  
   - The total number of **parameters**  
   - The **training** and **validation** losses obtained  

**Expected Output Example:**  
```
0.009893 M parameters  
step 4999: train loss 2.1570, val loss 2.1802  
```

In [87]:
!python gpt_multi_head.py


0.010949 M parameters
step 0: train loss 4.5778, val loss 4.5397
step 500: train loss 2.5540, val loss 2.5864
step 1000: train loss 2.3802, val loss 2.3900
step 1500: train loss 2.3149, val loss 2.3378
step 2000: train loss 2.2636, val loss 2.2716
step 2500: train loss 2.2354, val loss 2.2548
step 3000: train loss 2.2079, val loss 2.2419
step 3500: train loss 2.1835, val loss 2.2246
step 4000: train loss 2.1798, val loss 2.1832
step 4500: train loss 2.1558, val loss 2.1662
step 4999: train loss 2.1561, val loss 2.1703

   
     Le lèrans; empandu den sans lEt l'ombes aflle nes mas,
Lansirite dantest jeux, sazongez joux,
       D'orge aile vouxt lus ue  pasr combre enins es'our nui des cromme un le qu'uné,
L'enue pror, ma voits de tages,
Le
Sont l'irangires de suriendss-t ces fêmes ala desclome l'anbrat l'ende rutommen tes se gre, su plis allau:-M-e chotrê void, e  gre de denouce fla pourièrie lessaproil;
Lu leiss,
Caêtre?
Jaisontore,
Jeux part;
Ent rait farêbraîle;
Es l'une;
Su, me  Na

## Output

0.010949 M parameters

step 4999: train loss 2.1561, val loss 2.1703

## Adding a FeedForward Computation Layer  

After the **attention layers**, which collect information from the other positions in the sequence, a **computation layer** is applied to each position independently to process the gathered information.  

This layer is a simple **Multi-Layer Perceptron (MLP)** with:  
- One **hidden layer**,  
- A **ReLU non-linearity** using [`ReLU`](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html).  

### Architecture:  

<img src="https://github.com/sofiavacaaa/nlp-lab-language-models/blob/main/images/multi_ff.png?raw=1" alt="multi feedforward" width="200">


In [88]:
class FeedForward(nn.Module):
    """ a simple MLP with ReLU, applied to each position independently """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

#### Questions:  

1. **Add** the `FeedForward` module to your `gpt_multi_head.py` file.  
2. **Integrate** this `FeedForward` layer **after** the **multi-head attention** module.  
3. **Retrain the model** and note:  
   - The total **number of parameters**  
   - The **training** and **validation** losses obtained  

**Expected Output Example:**  
```
0.010949 M parameters  
step 4999: train loss 2.1290, val loss 2.1216  
```

In [89]:
!python gpt_multi_head.py


0.012005 M parameters
step 0: train loss 4.6450, val loss 4.6300
step 500: train loss 2.5354, val loss 2.5274
step 1000: train loss 2.3648, val loss 2.3618
step 1500: train loss 2.2955, val loss 2.2946
step 2000: train loss 2.2710, val loss 2.2567
step 2500: train loss 2.2335, val loss 2.2721
step 3000: train loss 2.2107, val loss 2.2346
step 3500: train loss 2.1779, val loss 2.1921
step 4000: train loss 2.1648, val loss 2.1570
step 4500: train loss 2.1411, val loss 2.1634
step 4999: train loss 2.1230, val loss 2.1392

         Vomme à dortira fait éve loit, sorme el boncore quau du sarces sontres, dait la rit l'armens yhme Gasprachère;
S'hous l'ute,
Nuie bait la porgeux, learte, sra quut vongez leulempe sour te spréylate,
         Qua bri,



En sarve
Sen querdeine
Pe pachouffIit et dourtéar ens fun torme où fommaclit cola pir flaien gaveveille ditanfre,
Et ne saitre!
XVQuans la eule mibla de sombandis du runait aysse,
Oh
Se,
Cevosseurqu'echangella fle et la cre
       Sa ginouyon le 

## Output

0.012005 M parameters

step 4999: train loss 2.1230, val loss 2.1392

## Stacking Blocks  

The network we have built so far represents just **one block** of the final model. Now, we can **stack multiple blocks** of **multi-head attention** to create a **deeper** network.  

### Architecture:  
![stacked blocks](https://github.com/sofiavacaaa/nlp-lab-language-models/blob/main/images/multi_bloc.png?raw=1)  

The following code defines a **block**:  


In [91]:
class Block(nn.Module):
    """ A single block: multi-head attention followed by a feed-forward layer """

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = self.sa(x)
        x = self.ffwd(x)
        return x

#### Questions:  

- Add the `Block` module to `gpt_multi_head.py`.  
- Modify the `BigramLanguageModel` code to include **three** instances of `Block(n_embd, n_head=4)`, using a [`Sequential`](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html) container **instead of** `MultiHeadAttention` and `FeedForward`.  
- Retrain the model and note:  
  - The **number of parameters**  
  - The **training** and **validation** losses obtained  

**Expected Output Example:**  
```
0.019205 M parameters  
step 4999: train loss 2.2080, val loss 2.2213  
```

In [92]:
!python gpt_multi_head.py


0.022373 M parameters
step 0: train loss 4.5857, val loss 4.5950
step 500: train loss 3.1218, val loss 3.1467
step 1000: train loss 2.8437, val loss 2.8444
step 1500: train loss 2.6178, val loss 2.5890
step 2000: train loss 2.4711, val loss 2.5105
step 2500: train loss 2.3790, val loss 2.4400
step 3000: train loss 2.3248, val loss 2.3543
step 3500: train loss 2.3368, val loss 2.3637
step 4000: train loss 2.2985, val loss 2.3139
step 4500: train loss 2.2796, val loss 2.3097
step 4999: train loss 2.2570, val loss 2.2611

Tils-Raberpéobrimbecandes ca?
           Que ret deulne hore qui paus que slisre; pinez caer cutou,-spres pBes le mair!
Nemeves ye loiseugmeurelé raye obefre pleuns lonpiéuiltontans l'url un event d'étenclou m18uis.
Lils rin d'é lous, ses daut jous mis phioneugrbheuidaz Amma destit, deangisrelsez autoi suxes aiseumaique sons truss hetteux, plése dots t'air bont limievie;
La sont porne lamatarnteuts mroux, mont tomeur,
Quiqui saime fraid, lalmeurle t'ientt!
Hoicpest,
Coun

## Output

0.022373 M parameters

step 4999: train loss 2.2570, val loss 2.2611

## Improving Training  

If we want to continue increasing the **network size**, we need to incorporate layers that **enhance training stability** and **improve generalization** (reducing overfitting). These layers include:  

- **Skip connections** (or **residual connections**)  
- **Normalization layers**  
- **Dropout**  

### Updated Architecture:  

<img src="https://github.com/sofiavacaaa/nlp-lab-language-models/blob/main/images/multi_skip_norm.png?raw=1" alt="blocks with skip connections and normalization" width="200">

---

#### Questions:  

1. In the `Block` module, **add skip connections** by adding the output of each sub-layer to its input (`ln1` and `ln2` are the `LayerNorm` layers introduced in step 2):  
   ```python
   x = x + self.sa(self.ln1(x))
   x = x + self.ffwd(self.ln2(x))
   ```  
   
2. In the `Block` module, **add two** [`LayerNorm`](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) layers of size `n_embd`:  
   - **Before** the `Multi-Head Attention` layer.  
   - **Before** the `FeedForward` layer.  

3. **After the sequence of 3 blocks**, add a **LayerNorm** layer of size `n_embd`.  

4. Define a variable at the **beginning of the file**:  
   ```python
   dropout = 0.2
   ```
   Then add a [`Dropout`](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) layer:  
   - **After** the `ReLU` activation in `FeedForward`.  
   - At the **end** of the `forward` function of `MultiHeadAttention`, after the heads are concatenated and projected.  
   - **After** the `softmax` layer in the single-head attention `Head`.  

5. **Retrain the model** and note:  
   - The **number of parameters**  
   - The **training** and **validation** losses  
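
Steps 1 to 4 can be sketched as follows. This is one possible layout under stated assumptions, not the reference solution: the `Head` module, the bias-free key, query and value projections, and the values `n_embd = 32`, `block_size = 8`, `n_head = 4` are assumptions about `gpt_multi_head.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, block_size, n_head, dropout = 32, 8, 4, 0.2  # assumed small-model settings

class Head(nn.Module):
    """Assumed single-head causal self-attention with dropout on the attention weights."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        T = x.shape[1]
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = self.dropout(F.softmax(wei, dim=-1))   # dropout after the softmax
        return wei @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))          # dropout at the end of the module

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        # dropout placed after the ReLU activation
        self.net = nn.Sequential(nn.Linear(n_embd, n_embd), nn.ReLU(), nn.Dropout(dropout))

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Pre-norm block: LayerNorm, sub-layer, then add the result to the input."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.sa = MultiHeadAttention(n_head, n_embd // n_head)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))     # skip connection around attention
        x = x + self.ffwd(self.ln2(x))   # skip connection around feed-forward
        return x

# Three blocks followed by a final LayerNorm, as in step 3.
blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(3)], nn.LayerNorm(n_embd))
blocks.eval()                            # disables dropout for a deterministic check
x = torch.randn(2, block_size, n_embd)
with torch.no_grad():
    y = blocks(x)
n_params = sum(p.numel() for p in blocks.parameters())
print(y.shape, n_params)
```

Calling `eval()` switches dropout off, which is what the generation step should use; during training the module stays in its default `train()` mode so that dropout is active.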

---

**Expected Output Example:**  
```
0.019653 M parameters  
```

In [94]:
!python gpt_multi_head.py


0.022821 M parameters
step 0: train loss 4.8093, val loss 4.8800
step 500: train loss 2.4345, val loss 2.4813
step 1000: train loss 2.2790, val loss 2.2974
step 1500: train loss 2.2156, val loss 2.2225
step 2000: train loss 2.1668, val loss 2.1630
step 2500: train loss 2.1217, val loss 2.1417
step 3000: train loss 2.1052, val loss 2.0813
step 3500: train loss 2.0875, val loss 2.0919
step 4000: train loss 2.0617, val loss 2.0374
step 4500: train loss 2.0544, val loss 2.0395
step 4999: train loss 2.0544, val loss 2.0210

Les pis moes moges la que meur, cons, von s'onc cetsice!
Pous gertuit, r'apve

Ô chautre esils,
Tans parovaste fantu letoie

       omb Ix de lauroûte farluver! phe fous latêtres acsient soujoinfee que bholis as!
Dioncie in or di pande àax, bête estres épleule
sonn
D'anches aux cèpais. L'h''il aëmient preure, sa rdivantans né furévaule qu'un à l'ait: jourtn,
 la
L'ilux mandons où sor jour res danse me fandille;
Et la méclas, la dan dit ù sarux aïstre le élavon ous. lams 

## Output

0.022821 M parameters

step 4999: train loss 2.0544, val loss 2.0210
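
The parameter counts reported in this lab can be reproduced by hand, which also shows why they differ from the expected examples. The sketch below rests on assumptions about the layout of the training scripts: token and position embedding tables, bias-free key, query and value projections, a final linear layer with bias, `vocab_size = 101`, `n_embd = 32` and `block_size = 8`. The function `count_params` is illustrative and not part of the lab files.

```python
def count_params(vocab_size=101, n_embd=32, block_size=8, n_blocks=1,
                 proj=False, ffwd=False, layer_norm=False):
    """Parameter count under the assumed layout described above."""
    embeddings = vocab_size * n_embd + block_size * n_embd   # token + position tables
    lm_head = n_embd * vocab_size + vocab_size               # final linear layer with bias
    per_block = 3 * n_embd * n_embd                          # key, query, value (no bias)
    if proj:
        per_block += n_embd * n_embd + n_embd                # output projection
    if ffwd:
        per_block += n_embd * n_embd + n_embd                # single Linear in FeedForward
    if layer_norm:
        per_block += 2 * 2 * n_embd                          # two LayerNorms (weight + bias)
    total = embeddings + lm_head + n_blocks * per_block
    if layer_norm:
        total += 2 * n_embd                                  # final LayerNorm
    return total

print(count_params())                                                   # 9893  (single head)
print(count_params(proj=True))                                          # 10949 (multi-head)
print(count_params(proj=True, ffwd=True))                               # 12005 (+ feed-forward)
print(count_params(proj=True, ffwd=True, n_blocks=3))                   # 22373 (three blocks)
print(count_params(proj=True, ffwd=True, n_blocks=3, layer_norm=True))  # 22821 (+ LayerNorm)
```

With `proj=False`, the same function gives 19205 and 19653 for the last two configurations, which matches the expected examples. The difference therefore appears to come from the output projection in `MultiHeadAttention`, which adds 1056 parameters per block.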

## Conclusion  

The key components of **GPT-2** are now in place. The next step is to **scale up** the model and train it on a **much larger** dataset. For comparison, the default hyperparameters of [GPT-2](https://huggingface.co/transformers/v2.11.0/model_doc/gpt2.html) are:  

- **`vocab_size = 50257`** → GPT-2 models **subword tokens**, while we model **characters**. For us, `vocab_size = 101`, the number of distinct characters counted at the start of the lab.  
- **`n_positions = 1024`** → The maximum **context size**. For us, it's `block_size = 8`.  
- **`n_embd = 768`** → The **embedding dimension**. For us, it's `n_embd = 32`.  
- **`n_layer = 12`** → The number of **blocks**. For us, it's `3`.  
- **`n_head = 12`** → The number of **attention heads** in each multi-head attention layer. For us, it's `4`.  

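The largest **GPT-2** model consists of **1.5 billion parameters**; the configuration listed above is the smallest variant, with about 124 million parameters. GPT-2 was trained on **8 million web pages**, totaling **40 GB of text**.  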
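
As a rough check, the hyperparameters listed above can be turned into a parameter count. The sketch below assumes the standard GPT-2 layout, which this lab does not reproduce exactly: a fused query/key/value projection with bias, a feed-forward layer that expands to `4 * n_embd` and projects back, two LayerNorms per block, a final LayerNorm, and an output layer that shares its weights with the token embedding. The values `n_embd = 1600` and `n_layer = 48` for the largest variant come from the published GPT-2 configurations rather than from this lab, and `gpt2_param_count` is an illustrative helper.

```python
def gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12):
    """Parameter count assuming the standard GPT-2 layout described above."""
    embeddings = vocab_size * n_embd + n_positions * n_embd  # token + position tables
    attn = n_embd * 3 * n_embd + 3 * n_embd                  # fused query/key/value projection
    attn += n_embd * n_embd + n_embd                         # attention output projection
    mlp = n_embd * 4 * n_embd + 4 * n_embd                   # expansion to 4 * n_embd
    mlp += 4 * n_embd * n_embd + n_embd                      # projection back to n_embd
    layer_norms = 2 * 2 * n_embd                             # two LayerNorms per block
    per_block = attn + mlp + layer_norms
    final_ln = 2 * n_embd
    # The output layer reuses the token embedding weights, so it adds no parameters.
    return embeddings + n_layer * per_block + final_ln

print(gpt2_param_count())                          # 124439808  (configuration listed above)
print(gpt2_param_count(n_embd=1600, n_layer=48))   # 1557611200 (largest variant)
```

The number of heads does not appear in the count: splitting `n_embd` across more heads changes how the attention is computed, not how many weights it uses. The configuration listed above therefore corresponds to about 124 million parameters, and the 1.5 billion figure to the largest variant.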

---

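### **Training Results**  

Results for a scaled-up version of the model (about 10.8 M parameters). The training loss (0.27) ends far below the validation loss (2.12), which indicates strong overfitting on this small dataset.  
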
```text
10.816613 M parameters  
step 0: train loss 4.7847, val loss 4.7701  
step 4999: train loss 0.2683, val loss 2.1161  
time: 31m47.910s   
```

### **Generated Text Sample:**  

```text
Le pêcheur où l'homme en peu de Carevante  
Sa conter des chosses qu'en ses yoitn!  

Ils sont là-hauts parler à leurs ténèbres  
A ceux qu'on rêve aux oiseaux des cheveux,  
Et celus qu'on tourna jamais sous le front;  
Ils se disent tu mêle aux univers.  
J'ai vu Jean vu France, potte; petits contempler,  
Et petié calme au milibre et versait,  
M'éblouissant, emportant, écoute, ingorancessible,  
On meurt s'efferayait.....--Pas cont âme parle en Apparia!  
```