# Analisis de Datos en Física Moderna
## Pietro Vischia (Universidad de Oviedo and ICTEA), pietro.vischia@cern.ch

The core of this tutorial comes from https://github.com/vischia/data_science_school_igfae2024 (Pietro Vischia (pietro.vischia@cern.ch)).


This tutorial is based on Andrej Karpathy's model [nanoGPT](https://github.com/karpathy/nanoGPT).

Data are downloaded from [Amephraim's repository](https://github.com/amephraim/nlp/tree/master/texts), forked to [vischia/nlp_datasets](https://github.com/vischia/nlp_datasets) for data persistence reasons.

![image](figs/all_models.png)

We'll implement a decoder-only structure (figure adapted by [Bruno Maga](https://brunomaga.github.io/GPT-lite) from the paper [Attention is all you need](https://arxiv.org/abs/1706.03762)).

- Left: the original transformer structure with an encoder and a decoder block.
- Middle: a single-block decoder-only model architecture.
- Right: the multi-block GPT decoder-only architecture, detailed here. 

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Set the random seed (for reproducibility)
torch.manual_seed(42)

# Pick the best available device: CUDA GPU, then Apple Silicon (MPS), then CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(device)

In [None]:
### Global training parameters
# How often to do an evaluation step
eval_interval = 100
# Number of training iterations
max_iters = 500
# Optimizer's learning rate
learning_rate=1e-4
# Minibatch size
batch_size = 3


### GPT's structural parameters
# The maximum sequence length used as input.
# E.g. for block_size 4 and text ABCDE, we have training samples A->B, AB->C, ABC->D, ABCD->E
block_size = 4

# Size of the embeddings
n_embd = 12

# Number of attention heads in the multi-head attention mechanism
# (n_embd must be divisible by n_head, because head_size = n_embd // n_head)
n_head = 6

# Depth of the network as number of decoder blocks (the `Nx` in the GPT decoder diagram).
# Each block contains a normalization, an attention and a feed forward unit
n_layer = 6

# Dropout rate (variable p) for dropout units
dropout = 0.2


In [None]:
# Each file is downloaded only if it is not already present, to avoid downloading it multiple times
import os
hp_1="./JKRowling_HarryPotter1_SorcerersStone.txt"
hp_2="./JKRowling_HarryPotter2_TheChamberOfSecrets.txt"
hp_3="./JKRowling_HarryPotter3_PrisonerOfAzkaban.txt"
hp_4="./JKRowling_HarryPotter4_TheGobletOfFire.txt"

if not os.path.isfile(hp_1):
    !wget https://raw.githubusercontent.com/vischia/nlp_datasets/refs/heads/master/texts/JKRowling_HarryPotter1_SorcerersStone.txt
if not os.path.isfile(hp_2):
    !wget https://raw.githubusercontent.com/vischia/nlp_datasets/refs/heads/master/texts/JKRowling_HarryPotter2_TheChamberOfSecrets.txt
if not os.path.isfile(hp_3):
    !wget https://raw.githubusercontent.com/vischia/nlp_datasets/refs/heads/master/texts/JKRowling_HarryPotter3_PrisonerOfAzkaban.txt
if not os.path.isfile(hp_4):
    !wget https://raw.githubusercontent.com/vischia/nlp_datasets/refs/heads/master/texts/JKRowling_HarryPotter4_TheGobletOfFire.txt


# Let's start by training on the first Harry Potter book
with open(hp_1) as f:
    text = f.read()


#### Tokenization

We need to map characters to tokens.
The simplest way is to map characters to integers directly.

Let's use this scheme first. In the tasks below, you will be asked to try another encoding scheme.

In [None]:
# Collect sorted list of input characters and create 
# string-to-int (stoi) and int-to-string (itos) representations:
chars = sorted(list(set(text)))
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

# define encode and decode functions that convert strings to arrays of tokens and vice-versa
encode = lambda x: torch.tensor([stoi[ch] for ch in x], dtype=torch.long) #encode text to integers
decode = lambda x: ''.join([itos[i] for i in x]) #decode integers to text
vocab_size = len(stoi)

print("The vocabulary will have", vocab_size, "unique elements")
print("Full vocabulary:", stoi)
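
To make the mapping concrete, here is a minimal round-trip sketch on a toy string. The names `toy_text`, `toy_encode` and `toy_decode` are introduced only for this illustration and are not part of the tutorial code; plain Python lists are used instead of tensors to keep it short.

```python
# Toy corpus standing in for the book text (assumption for illustration)
toy_text = "hello world"

# Same construction as above: the sorted unique characters define the vocabulary
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Map each character to its integer token."""
    return [toy_stoi[ch] for ch in s]

def toy_decode(tokens):
    """Map integer tokens back to a string."""
    return "".join(toy_itos[i] for i in tokens)

tokens = toy_encode("hello")
print(toy_chars)           # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(tokens)              # [3, 2, 4, 4, 5]
print(toy_decode(tokens))  # hello
```

Note that a character that never appears in the text has no entry in the dictionary, so encoding it raises a `KeyError`.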

#### Positional encoding (embedding)

We need to introduce the notion of sequence, i.e. the ordering and mutual relationship between words. If we don't, we would have to stick with the set of vectors returned by the Attention mechanism, with no sequence structure.

In the original paper, the authors use a fixed positional encoding based on sine and cosine functions of different frequencies.
Here, for simplicity, we will use the `Embedding` module of PyTorch, which is a learnable lookup table over [a dictionary of fixed size](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html).
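
For comparison, the fixed encoding of the original paper, $PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})$ and $PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$, can be sketched as follows. This is only an illustration: the function name `sinusoidal_positions` is made up here, it assumes an even `d_model`, and it is not used anywhere else in this tutorial.

```python
import math
import torch

def sinusoidal_positions(n_positions, d_model):
    """Fixed positional encoding of shape (n_positions, d_model); d_model must be even."""
    position = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)  # (n_positions, 1)
    # One frequency per pair of dimensions: 1 / 10000^(2i / d_model)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(n_positions, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

# Same sizes as block_size=4 and n_embd=12 used in this notebook
pe = sinusoidal_positions(n_positions=4, d_model=12)
print(pe.shape)
```

Unlike `nn.Embedding`, this table has no trainable parameters: it would simply be added to the token embeddings.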


In [None]:
token_embedding_table = nn.Embedding(vocab_size, n_embd)    # from tokens to embedding
position_embedding_table = nn.Embedding(block_size, n_embd) # from position to embedding


In [None]:
data = encode(text)  # This would be the place where you use the alternative encoding mentioned below

# Have 90% data for training and 10% for validation.
n = int(0.9*len(data))
train_data, valid_data = data[:n], data[n:]
print("Training data set size:", len(train_data), "\nValidation data set size:", len(valid_data))

#### Batching

This is an extension of the usual batching we have used in the first tutorials: the length of the training sequences is not fixed.
What is fixed (governed by the parameter `block_size` defined above) is the maximum size of each batch element, i.e. the maximum length of each sequence included in the batch.

We therefore need, for each input, all sequences from size `1` to size `block_size`. For each input sequence from position `0` to `t`, the respective output is given by the element in position `t+1`.

In [None]:
def get_batch(source):
    """ get a batch of `batch_size` sequences of length `block_size` from source """
    # Generate `batch_size` random offsets on the data
    ix = torch.randint(len(source)-block_size, (batch_size,) )
    # Collect `batch_size` subsequences of length `block_size` from source, as data and target
    x = torch.stack([source[i:i+block_size] for i in ix])
    # Target is just x shifted by one position (i.e. the predicted token is the next in the sequence)
    y = torch.stack([source[i+1:i+1+block_size] for i in ix])
    return x.to(device), y.to(device)


# Test it out
xb, yb = get_batch(train_data)
print("Input:\n",xb)
print("Target:\n",yb)

for b in range(batch_size): #for each sequence in the batch
    print(f"\n=== batch {b}:")
    for t in range(block_size): #for each context length within the sequence
        context = xb[b,:t+1]
        target = yb[b,t]
        print(f"for input {context.tolist()} target is {target.tolist()}")


#### Attention, step by step

Multi-headed attention is an extension of the attention mechanism.
The attention we have seen in class encodes patterns of attention between elements. There may however be more than one pattern that is relevant at the same time. Using one attention head (the attention we have seen in class) can average out these effects. We can use multiple heads, each with its own independent trainable parameters, and let the network learn these different patterns (up to `n_head` different patterns). It is the same idea we have seen in the CNN tutorial, where using multiple filters at the same time made it possible to focus on different features of the image.

We define MultiHead attention as:
$$
MultiHead(Q, K, V ) = Concat(head_1, ..., head_h)W^O
$$

$$
\text{where } head_i = Attention(QW^Q_i, KW^K_i, VW^V_i)
$$

$$
\text{where } Attention(Q,K,V) = softmax \left( \frac{QK^T}{\sqrt{d_k}} \right) \, V
$$


Let's start with the $W^Q$, $W^K$ and $W^V$ matrices, which are implemented as simple projections (*linear layers*). This corresponds to what we have seen in class: *"the way of making the elements learnable is to introduce linear combinations with weights, and the way of making key, query, and value be learned separately is to have three different matrices"*.

```python
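B, T, C = 2, block_size, n_embd  # illustrative batch size, sequence length, embedding size
x = torch.randn(B, T, C)         # dummy input standing in for the embedded tokens
head_size = 4
key   = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)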
```

We can now compute the $Attention(Q,K,V)$ as:

```python
k = key(x) #shape (B,T, head_size)
q = query(x) #shape (B,T, head_size)
wei = q @ k.transpose(-2, -1) #shape (B,T, head_size) @ (B,head_size,T) = (B,T,T)
wei *= head_size**-0.5 #scale by 1/sqrt(d_k) as per paper, so that the variance of wei is ~1
```

We then apply a causal mask, so that each token only attends to itself and to previous tokens, normalise with a softmax, and compute the output as the attention-weighted sum of the values:
```python
tril = torch.tril(torch.ones(T,T))
wei = wei.masked_fill(tril==0, float('-inf')) #tokens only "talk" to previous tokens
wei = F.softmax(wei, dim=-1) #normalise each row to sum to 1 (-inf above the diagonal becomes 0)
v = value(x) # shape (B,T, head_size)
out = wei @ v # shape (B,T,T) @ (B,T,head_size) --> (B,T,head_size)
```

Note that `out = wei @ v` is a weighted sum over the current and previous tokens, but the attention weights are not uniform: they are computed from learnt projections and change for each query. **This is the main property of self-attention: non-uniform attention weights per query**. With a uniform attention matrix, where all previous tokens get the same weight, the aggregation would be just a plain average of the tokens in the sequence. Here we aggregate them according to a "value of importance" for each token.
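
The three snippets above can be combined into a single runnable cell. The sizes `B`, `T`, `C` and the random input `x` below are arbitrary values chosen for illustration; in the model, `x` is the sum of token and position embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 2, 4, 12      # batch, sequence length, embedding size (illustrative values)
head_size = 4
x = torch.randn(B, T, C)

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)                   # each (B, T, head_size)
wei = (q @ k.transpose(-2, -1)) * head_size**-0.5      # (B, T, T) scaled scores
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))        # causal mask
wei = F.softmax(wei, dim=-1)                           # each row sums to 1
out = wei @ v                                          # (B, T, head_size)

print(wei[0])     # lower-triangular matrix of attention weights
print(out.shape)
```

The first row of `wei[0]` has a single non-zero entry equal to 1, because the first token can only attend to itself.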

Also, without the $1/\sqrt{d_k}$ normalisation, the entries of `wei` would have a variance of order $d_k$, and after the softmax each row would approximate a one-hot vector. The normalisation keeps the variance close to 1, so that `wei` stays more diffuse and does not saturate.
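
The effect of the scaling can be checked numerically. The sketch below uses random unit-variance queries and keys rather than learnt ones (an assumption made only to isolate the effect of the scaling), with an arbitrary `head_size` of 64.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
head_size = 64
q = torch.randn(1000, head_size)   # unit-variance queries
k = torch.randn(1000, head_size)   # unit-variance keys

scores = q @ k.T                   # raw dot products, variance of order head_size
scaled = scores * head_size**-0.5  # scaled dot products, variance of order 1

print("variance without scaling:", scores.var().item())
print("variance with scaling:   ", scaled.var().item())

# Softmax over the same 8 scores, with and without scaling
raw_probs = F.softmax(scores[0, :8], dim=-1)
scaled_probs = F.softmax(scaled[0, :8], dim=-1)
print("largest weight without scaling:", raw_probs.max().item())
print("largest weight with scaling:   ", scaled_probs.max().item())
```

Without scaling, the largest softmax weight is closer to 1, i.e. the row is closer to a one-hot vector.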

The mechanism we coded is called *self-attention* because $K$, $Q$ and $V$ all come from the same input `x`. But attention is more general: the queries can come from `x`, while the keys and values come from a different source. This would be called *cross-attention*.
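
A minimal sketch of cross-attention follows, assuming a hypothetical second input `context` (for instance the output of an encoder). The tensor names and sizes are illustrative only; the model built in this tutorial does not use cross-attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T_x, T_ctx, C, head_size = 2, 4, 6, 12, 4
x = torch.randn(B, T_x, C)          # sequence that provides the queries
context = torch.randn(B, T_ctx, C)  # hypothetical second source for keys and values

query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q = query(x)         # (B, T_x, head_size)
k = key(context)     # (B, T_ctx, head_size)
v = value(context)   # (B, T_ctx, head_size)

# Each token of x distributes its attention over all tokens of context (no causal mask)
wei = F.softmax((q @ k.transpose(-2, -1)) * head_size**-0.5, dim=-1)  # (B, T_x, T_ctx)
out = wei @ v                                                         # (B, T_x, head_size)
print(wei.shape, out.shape)
```

The two sequences may have different lengths: the attention matrix is rectangular, with one row per query token and one column per context token.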

As final remarks, note that the sequences in a batch are always independent, i.e. there is no cross-batch attention. In many cases, e.g. a string representation of chemical compounds, or sentiment analysis, there can be no attention mask (i.e. all tokens can attend to all tokens), or there is a custom mask that fits the use case (e.g. main upper and lower diagonals to allow tokens to see their closest neighbour only). Here, we also don't have any cross-attention between the encoder and decoder.
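
As an illustration of a custom mask, the sketch below builds a band mask in which each token sees only itself and its immediate left and right neighbours. Keeping the main diagonal in the band is a choice made for this example, and the random `scores` stand in for the scaled products $QK^T/\sqrt{d_k}$.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 5
scores = torch.randn(T, T)   # made-up attention scores for a sequence of 5 tokens

ones = torch.ones(T, T)
# Keep the main diagonal plus the first upper and the first lower diagonal
band = torch.tril(ones, diagonal=1) * torch.triu(ones, diagonal=-1)

no_mask = F.softmax(scores, dim=-1)   # all tokens attend to all tokens
banded = F.softmax(scores.masked_fill(band == 0, float('-inf')), dim=-1)

print(band)
print(banded)
```

Every row of `banded` still sums to 1, but the weight is shared among at most three tokens.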

The decoder includes a multi-head attention, which is simply a concatenation of individual heads' outputs. The `Head` and `MultiHeadAttention` modules can then be implemented as:


In [None]:
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)
        #Note: this dropout randomly prevents some tokens from communicating with each other

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x) #shape (B,T, head_size)
        q = self.query(x) #shape (B,T, head_size)
        v = self.value(x) #shape (B,T, head_size)

        #compute self-attention scores
        wei = q @ k.transpose(-2, -1) #shape (B,T, head_size) @ (B,head_size,T) --> (B,T,T)
        wei = wei * k.shape[-1]**-0.5 #scale by 1/sqrt(d_k), with d_k = head_size, so that the variance of wei is ~1
        wei = wei.masked_fill(self.tril[:T,:T]==0, float('-inf')) # (B,T,T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)

        #perform weighted aggregation of values
        out = wei @ v # (B, T, T) @ (B, T, head_size) --> (B, T, head_size)
        return out


In [None]:
class MultiHeadAttention(nn.Module):
    """ Multi-head attention as a collection of heads with concatenated outputs."""
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj  = nn.Linear(num_heads * head_size, n_embd)  # project the concatenated heads back to n_embd
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([head(x) for head in self.heads], dim=-1)
        out = self.proj(out)
        out = self.dropout(out)
        return out


#### Feed-forward layer

Attention weights induce a nonlinearity via the softmax function, but the attention block maps inputs to outputs in the same space, and this may limit the expressivity of the block. To enhance flexibility, one typically adds at least one standard NN layer:

In [None]:
class FeedForward(nn.Module):
    """ the feed forward network (FFN) in the paper"""

    def __init__(self, n_embd):
        super().__init__()
        # Note: in the paper (section 3.3) we have d_{model}=512 and d_{ff}=2048.
        # Therefore the inner layer is 4 times the size of the embedding layer
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd*4),
            nn.ReLU(),
            nn.Linear(n_embd*4, n_embd),
            nn.Dropout(dropout)
          )
    
    def forward(self, x):
        return self.net(x)


#### The GPT block

The standard block is composed of a MultiHead attention followed by a feed-forward layer.

Because the network can become too deep (and hard to train) for a high number of sequential blocks, we add skip connections to each block. Also, in the original paper, the layer normalization is applied to the output of each skip connection, i.e. as $LayerNorm(x + Sublayer(x))$, where the sublayer is the attention or the feed-forward network. Nowadays it is common to use the pre-norm formulation instead, where the normalization is applied before the attention and the FFN, inside the skip connection. That’s also what we’ll do in the following code:
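
The difference between the two orderings can be sketched in a few lines. Here `sublayer` is a plain linear layer used as a stand-in for the attention or the FFN, and a single `LayerNorm` is reused for both variants only to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 12
x = torch.randn(2, 4, n_embd)          # (B, T, C) dummy input

sublayer = nn.Linear(n_embd, n_embd)   # stand-in for attention or FFN
ln = nn.LayerNorm(n_embd)

# Post-norm (original paper): normalize the output of the skip connection
post_norm = ln(x + sublayer(x))

# Pre-norm (used in this tutorial): normalize only the input of the sublayer
pre_norm = x + sublayer(ln(x))

print(post_norm.mean(dim=-1).abs().max().item())  # close to 0: the output is normalized
print(pre_norm.mean(dim=-1).abs().max().item())   # not normalized: x passes through unchanged
```

In the pre-norm variant the input `x` reaches the output through the skip connection without being rescaled, which is commonly credited with making deep stacks of blocks easier to train.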

In [None]:
class Block(nn.Module):
    """ Transformer block: communication (attention) followed by computation (FFN) """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension
        # n_head: the number of heads we'd like to use
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
    
    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


#### Full GPT network

The `__init__` and `forward` methods constitute the main model, while `generate` is the function we will use for inference (in this case, inference means *"generate new text"*).

In [None]:
class GPTlite(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
    
        # vocabulary embedding and positional embedding
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)

        #sequence of attention heads and feed forward layers
        self.blocks = nn.Sequential( *[Block(n_embd, n_head) for _ in range(n_layer)])

        #final layer normalization after the transformer blocks,
        #followed by the linear layer that outputs the logits over the vocabulary
        self.ln = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)


    def forward(self, idx):
        """ call the model with token indices idx and return the logits for the next token at each position """

        #idx is of shape (B,T)
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx) #shape (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) #shape (T,C)
        x = tok_emb + pos_emb #shape (B,T,C)
        x = self.blocks(x)
        x = self.ln(x)
        logits = self.lm_head(x) #shape (B,T,vocab_size)
        logits = torch.swapaxes(logits, 1, 2) #shape (B,vocab_size,T) to comply with F.cross_entropy
        return logits

    def generate(self, idx, max_new_tokens):
        """ given a context idx, generate max_new_tokens tokens and append them to idx """
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:] #crop the context to the last block_size tokens
            logits = self(idx_cond) #forward pass, shape (B, vocab_size, T)
            logits = logits[:, :, -1] # take last token. from shape (B, vocab_size, T) to (B, vocab_size)
            #convert logits to probabilities
            probs = F.softmax(logits, dim=-1) # shape (B, C)
            #randomly sample the next tokens, 1 for each of the previous probability distributions
            #(one could take instead the argmax, but that would be deterministic and boring)
            idx_next = torch.multinomial(probs, num_samples=1) # shape (B, 1)
            #append next token ix to the solution sequence so far
            idx = torch.cat([idx, idx_next], dim=-1) # shape (B, T+1)
        return idx 

#### Instantiate the model and train it

In [None]:
m  = GPTlite(vocab_size).to(device)

# train the model
optimizer = torch.optim.Adam(m.parameters(), lr=learning_rate)

@torch.no_grad()  # eval: no backprop on this data, to avoid storing all intermediate variables
def eval_loss(step):
    m.eval()   #disable dropout during evaluation
    idx, targets = get_batch(valid_data)   #get a batch of validation data
    logits = m(idx)   #forward pass
    loss = F.cross_entropy(logits, targets)
    print(f"step {step}, eval loss {loss.item():.2f}")
    m.train()  #re-enable dropout for training
    return loss

for steps in range(max_iters):
    idx, targets = get_batch(train_data)   #get a batch of training data
    logits = m(idx)   #forward pass, logits of shape (B, vocab_size, T)
    loss = F.cross_entropy(logits, targets)
    loss.backward()   #backward pass
    optimizer.step()   #update parameters
    optimizer.zero_grad(set_to_none=True)  #sets to None instead of 0, to save memory

    #print progress and evaluate on validation data every eval_interval steps
    if steps % eval_interval == 0:
        print(f"step {steps}, loss {loss.item():.2f}")
        eval_loss(steps)

        
# You may want to save the model here!
# Particularly for the tasks where you'll have to play with inference with different inputs
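
One common way of saving a PyTorch model is to store its `state_dict` and load it back into a freshly built model of the same architecture. The sketch below uses a small `nn.Linear` as a stand-in for the trained `GPTlite` model, and a hypothetical file name in a temporary directory.

```python
import os
import tempfile
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)   # stand-in for the trained model `m`

# Hypothetical checkpoint location; choose any path you like
path = os.path.join(tempfile.mkdtemp(), "gptlite_checkpoint.pt")
torch.save(model.state_dict(), path)   # store only the parameters

# Later: rebuild the same architecture and load the parameters back
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path))
restored.eval()   # switch to evaluation mode before inference

print(torch.equal(model.weight, restored.weight))
```

For the model of this tutorial, the restored object would be built with `GPTlite(vocab_size)` using the same structural parameters as in training.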

#### Inference: generate text

We pass a single token (the `\n` character, encoded as token `0`) to the model as initial character, and let it generate a sequence of 500 tokens.

In the tasks below, you will be asked to experiment by passing different tokens, or sequences of tokens.
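
The `generate` function samples the next token from the predicted probabilities instead of always taking the most likely one. The difference can be seen on a toy example; the logits below are made-up numbers for a vocabulary of four tokens.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.0, 0.1, -1.0]])   # made-up logits, shape (B=1, vocab=4)
probs = F.softmax(logits, dim=-1)

# Greedy choice: always the most likely token (deterministic)
greedy = torch.argmax(probs, dim=-1, keepdim=True)   # shape (1, 1)

# Sampling: draw many tokens to see the distribution that is actually sampled
sampled = torch.multinomial(probs, num_samples=1000, replacement=True)  # shape (1, 1000)
counts = torch.bincount(sampled[0], minlength=4)

print("probabilities:      ", probs[0].tolist())
print("greedy token:       ", greedy.item())
print("sampled frequencies:", (counts / 1000).tolist())
```

The sampled frequencies approximate `probs`, so less likely tokens are still generated from time to time, which makes the generated text less repetitive.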

In [None]:
#a 1x1 tensor with batch size 1 and sequence length 1 and starting value 0 (0 is the \n character)
idx = torch.zeros((1,1), dtype=torch.long, device=device)

# generate text with the trained model
# (eval mode disables dropout, no_grad avoids tracking gradients during generation)
m.eval()
with torch.no_grad():
    print(decode(m.generate(idx, max_new_tokens=500).tolist()[0]))


## Tasks

1. In the inference step, try to pass a different initial token or a sequence of tokens. Remember that the sequence used as context cannot be longer than the block size: it will be chopped off within the `generate` function. Try this with character-by-character tokenization, and with `tiktoken` tokenization. For word-based tokenization or for large `block_size` character-based tokenization, what happens if the input sequence is related to the book theme as opposed to if it isn't (e.g. another theme, *"Elon Musk announced a new Tesla truck"*, or another language, *"El Alimerka no tenía más pan"*)?
```python
testsentence = encode("Harry Potter")
testsentence = testsentence.unsqueeze(0).to(device)  # shape (1, T): a batch with one sequence
print(decode(m.generate(testsentence, max_new_tokens=500).tolist()[0]))
```

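2. In the encoding step, try to use a different encoding scheme. For instance, you can use the *Byte-Pair-Encoding (BPE)* scheme used by OpenAI's GPT models, available in the `tiktoken` library. Remember that the vocabulary size of the model must then match `enc.n_vocab`.
```python
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
print(enc.n_vocab)
print(enc.encode("Hello world"))
data_enc = enc.encode(text)
print(data_enc[:20])  # print only the first tokens, the full list is very long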
```

3. Try to change the structural and training parameters. Which parameter(s) most influence the quality of the generated text (in terms of how much it resembles actual text)? Can you think of a reason why a certain parameter is the most impactful, particularly in the character-by-character encoding? NOTE: you will typically need to increase `max_iters` to 1000 or a few thousand. Also, don't forget to train on a GPU (in `colab` or on your laptop with a GPU card) or on MPS devices (Mac M1/M2/M3 laptops).
```python
eval_interval = 100
max_iters = 5000
learning_rate=3e-4
batch_size = 128
block_size = 256
n_embd = 300
n_layer = 10
dropout = 0.2
```

4. What happens if you enlarge the training dataset (e.g. including the other three books, or one or two of them)? Do you see any change in performance, both with the character-by-character tokenization and with the `tiktoken` tokenization?

In [None]:
testsentence= encode("Harry Potter")
print(testsentence)

In [None]:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
print(enc.n_vocab)
print(enc.encode("Harry Potter"))
#data_enc = enc.encode(text)

a = enc.encode("Harry Potter")
print(a)
b = enc.decode(a)
print(b)