# Transformers

This Jupyter Notebook explores the basic concepts of transformers and applies them to a simple example: predicting the next vowel of the Portuguese alphabet from a sequence of characters.


## Introduction to Transformers


Autoregressive large language models are built on top of the transformer architecture, a neural network architecture based on self-attention mechanisms. Transformers process sequence data in parallel rather than step by step, which makes training large models faster. Self-attention is crucial for natural language processing (NLP) tasks because it models relationships between words, which helps in tasks like translation.

Self-attention mechanisms can be thought of as learning to read without knowing grammar. For example, in the sentence "The boy kicked the ball and it disappeared," an attention layer can process "it" while simultaneously looking at the other words in the sentence. Over time, the layer can learn to associate "it" with the objects that appear in the sentence, similar to how pronouns work.

To understand attention mechanisms in detail, refer to the "Attention Is All You Need" paper [2]. I wrote up my personal findings and placed them in this repository: [Paper Notes](https://github.com/ramonlins/obsidian/tree/master/Papers/transformers/Ashish%20Vaswani).

To implement a transformer from scratch with a simple example, I created a simplified overview of the architecture that abstracts away some parts, such as residual connections, stacked layers, and multiple heads, as shown in Figure 1.

The idea here is to create a sequence model that predicts the next vowel based on a sequence of three characters extracted from the Portuguese alphabet.

![transformer](./images/transformer.png "Figure 1")

In this illustration, the encoder is composed of an attention layer connected to a feed-forward network. Its outputs feed the decoder attention layer, together with the output of the masked attention layer. The output of this attention layer is connected to another feed-forward layer and then passes through linear and softmax operations.

The input sequence is first embedded into vectors using an embedding layer (word vectors in NLP, character vectors in this example). These vectors serve as inputs for the query, key, and value transformations.
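The embedding step can be pictured as a table lookup. The following is a minimal sketch, not the notebook's implementation: it borrows the character set defined later in the notebook, while the embedding size `dm = 4`, the random initial values, and the name `embedding_table` are assumptions made only for illustration.

```python
import random

characters = "-abcdefghijklmnopqrstuvwxyz#"
c2i = {c: i for i, c in enumerate(characters)}  # character -> token index

dm = 4                   # embedding size (assumed for this sketch)
rng = random.Random(0)   # fixed seed so the table is reproducible

# One row of dm numbers per character: an (n x dm) lookup table
embedding_table = [[rng.uniform(-1, 1) for _ in range(dm)] for _ in characters]

tokens = [c2i[c] for c in "abc"]                 # token indices of the sequence
embedded = [embedding_table[t] for t in tokens]  # (t x dm) matrix of vectors
```

In a trained model the rows of the table are learned parameters; here they are random numbers that only show the shapes involved.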

Linear transformations are then applied to these embeddings to obtain query vectors (Q), key vectors (K), and value vectors (V). The purpose of these transformations is to project each embedding into separate spaces (often lower-dimensional, although in this notebook they keep the model dimension) that capture different aspects of its meaning.

The Q, K, and V embeddings differ because each undergoes its own linear transformation with its own weight matrix. Hence, during training, the Q (query) projection learns to search for relevant information, using the K (key) projection as the reference to match against.

The softmax operation, applied within attention mechanisms, highlights similarities by amplifying the strongest connections between queries and keys. It normalizes the similarity scores into attention weights that emphasize the important information in the values (V).
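To make the query, key, value, and softmax steps concrete, here is a minimal pure-Python sketch of single-head scaled dot-product attention, softmax(QKᵀ/√dk)V. The toy matrices `X`, `Wq`, `Wk`, and `Wv` and the helper names are assumptions chosen for illustration; they are not taken from the notebook's model.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X, Wq, Wk, Wv):
    """Single-head attention: softmax(Q K^T / sqrt(dk)) V."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    dk = len(K[0])
    K_t = [list(col) for col in zip(*K)]                  # (dk x t)
    scores = [[s / math.sqrt(dk) for s in row] for row in matmul(Q, K_t)]
    weights = [softmax(row) for row in scores]            # each row sums to 1
    return weights, matmul(weights, V)                    # (t x t), (t x dv)

# Toy inputs: 3 tokens with dm = dk = dv = 2 (all numbers are made up)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[2.0, 0.0], [0.0, 2.0]]
Wv = [[0.5, 0.0], [0.0, 0.5]]

weights, output = attention(X, Wq, Wk, Wv)
```

Each row of `weights` says how much one token attends to every other token, and `output` is the corresponding weighted mix of value vectors.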


## Finding next vowel

### Data Preparation

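The Portuguese alphabet (`pt-alphabet`) is defined by: "abcdefghijklmnopqrstuvwxyz". The character set below also adds two special characters, `-` and `#`.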

In [2]:
characters = "-abcdefghijklmnopqrstuvwxyz#"

Generate combinations

In [3]:
# Ascending character (index) arrangement
def generate_combinations(characters, length):
    # Traverse the characters recursively
    def generate_combinations_recursive(current_combination, remaining_characters):
        if len(current_combination) == length:
            combinations.append(current_combination)
            return
        for i in range(len(remaining_characters)):
            generate_combinations_recursive(
                current_combination + remaining_characters[i],
                remaining_characters[i + 1:]  # Get next character in sequence
            )

    combinations = []
    generate_combinations_recursive('', characters)
    return combinations

# Input sequence length
n_seq = 3

# Generate all possible ordered combinations
combinations = generate_combinations(characters, n_seq)
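
As a cross-check, the same ascending-order combinations can be produced with the standard library. This is a sketch under the assumption that only the order-preserving combinations are needed; the names `iter_combinations` and `combos` are introduced here for illustration.

```python
from itertools import combinations as iter_combinations
from math import comb

characters = "-abcdefghijklmnopqrstuvwxyz#"
n_seq = 3

# Each tuple keeps the characters in their original (ascending index) order
combos = ["".join(c) for c in iter_combinations(characters, n_seq)]

print(len(combos))  # number of 3-character combinations out of 28 characters
print(combos[:3])   # first few combinations
```

`itertools.combinations` emits tuples in lexicographic order of the input positions, which is the same order the recursive function visits them.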

Mapping characters

In [4]:
# Define output vowels with a special end character
vowels = 'aeiou#'

# Size of alphabet
n = len(characters)
c2i, i2c, v2i = {}, {}, {}

for i, c in enumerate(characters):
    c2i[c] = i  # mapping character to index position
    i2c[i] = c  # mapping index position to character
    if c in vowels:
        v2i[c] = i

In [5]:
i2v = {}
for i, v in enumerate(vowels):
    i2v[i] = v  # mapping label position to vowel
i2v

{0: 'a', 1: 'e', 2: 'i', 3: 'o', 4: 'u', 5: '#'}

Tokenize and create input-target pair data

In [6]:
import torch

# Create all possible combinations of input and outputs
valid_seqs = []       # Store all valid sequences
samples = []          # Store samples (input, target)

# Tokenize data based on index
for combination in combinations:
    character_idx = c2i[combination[-1]]  # Get the position of the last character in the combination
    input_seq = []  # Store the input sequence
    # Iterate through each vowel and its position in alphabet
    for vowel, vowel_idx in v2i.items():
        # Check which vowel is the next one in the alphabet
        if vowel_idx > character_idx:
            #input_seq = []  # Store the input sequence
            
            # Convert characters to its position (tokenization)
            for c in combination:
                input_seq.append(c2i[c])
            
            valid_seqs.append(input_seq)  # Add the input sequence to valid sequences
            
            # Convert the tokenized input sequence and next vowel index to tensors
            input_seq_t = torch.tensor(input_seq)
            next_vowel_t = torch.tensor([vowel_idx])
    
            samples.append([input_seq_t, next_vowel_t])  # Create input-target pair sample
            
            break  # Break out of the loop once a valid vowel is found
        
        # Check if the character index is greater than the index of 'u'
        if character_idx > c2i['u']:
            #input_seq = []  # Create an empty list to store the input sequence
            
            # Iterate through each character in the combination
            for c in combination:
                input_seq.append(c2i[c])  # Append the index of the character to the input sequence
            
            valid_seqs.append(input_seq)  # Add the input sequence to the list of valid sequences
            
            # Convert the input sequence and target index tensors
            input_seq_t = torch.tensor(input_seq)
            next_vowel_t = torch.tensor([c2i['#']])  # Define a special character for letters bigger than 'u'
    
            samples.append([input_seq_t, next_vowel_t])
            
            break  # Break out of the loop once there are no more valid vowels
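
The labeling rule used above can be summarized in a few lines: the target is the first vowel that comes strictly after the last character of the sequence, or `#` when no vowel follows. The helper `next_vowel` below is a hypothetical name used for this sketch and is not part of the notebook.

```python
characters = "-abcdefghijklmnopqrstuvwxyz#"
vowels = "aeiou"

def next_vowel(sequence):
    """Return the first vowel located after the last character of `sequence`
    in `characters`, or '#' when no vowel follows it."""
    last_idx = characters.index(sequence[-1])
    for v in vowels:
        if characters.index(v) > last_idx:
            return v
    return "#"

# A few sequences and their targets
examples = {seq: next_vowel(seq) for seq in ["-ab", "bcd", "cde", "stu", "xyz", "yz#"]}
```

Note that a sequence ending in a vowel maps to the following vowel (for example, `cde` maps to `i`), because the comparison is strict.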


Shuffle and split dataset


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

seed = 13

X = [sample[0] for sample in samples]
y = [sample[1] for sample in samples]

# Shuffle inputs and targets together so the pairs stay aligned
X, y = shuffle(X, y, random_state=seed)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)
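
For intuition, the shuffle-and-split step can be sketched with the standard library alone. This is an illustrative approximation and not scikit-learn's implementation: the function name `split_dataset`, the toy `pairs`, and the choice to round the test size to the nearest integer are assumptions.

```python
import random

def split_dataset(samples, test_size=0.3, seed=13):
    """Shuffle a copy of `samples` and split it into train and test lists."""
    rng = random.Random(seed)   # local generator keeps the split reproducible
    shuffled = list(samples)    # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Toy (input, target) pairs; keeping them paired avoids misaligned shuffles
pairs = [([i, i + 1, i + 2], i + 3) for i in range(10)]
train, test = split_dataset(pairs)
```

Shuffling the pairs as single units guarantees that every input stays attached to its own target.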

### Build

Encoder

In [33]:
class Encoder(torch.nn.Module):
    """
    An encoder module for a transformer-based neural network.

    This encoder class processes input sequences using one-head self-attention and feedforward layers.

    Args:
        n (int): The size of the input vocabulary.
        dm (int): The model's embedding dimension.
    
    Attributes:
        dm (int): The model dimension.
        dk (int): The dimension of keys.
        dv (int): The dimension of values.
        embedding_layer (torch.nn.Embedding): The input embedding layer.
        wQ (torch.nn.Linear): Linear layer for queries in attention.
        wK (torch.nn.Linear): Linear layer for keys in attention.
        wV (torch.nn.Linear): Linear layer for values in attention.
        norm1 (torch.nn.LayerNorm): Layer normalization after attention.
        relu (torch.nn.ReLU): ReLU activation function.
        ff1 (torch.nn.Linear): First feedforward layer.
        ff2 (torch.nn.Linear): Second feedforward layer.
        norm2 (torch.nn.LayerNorm): Layer normalization after feedforward layers.

    Methods:
        embedding(x): Maps input indices to embeddings.
        attention(x_seq): Performs one-head self-attention on input sequences.

    """
    def __init__(self, n, dm):
        super().__init__()
        self.dm = dm  # model dim
        self.dk = dm 
        self.dv = dm  
        self.embedding_layer = torch.nn.Embedding(n, dm)  # n x dm
         
        self.wQ = torch.nn.Linear(dm, self.dk)  # dm x dk
        self.wK = torch.nn.Linear(dm, self.dk)  # dm x dk
        self.wV = torch.nn.Linear(dm, self.dv)  # dm x dv
        self.norm1 = torch.nn.LayerNorm(dm)
        
        self.relu = torch.nn.ReLU()
        self.ff1 = torch.nn.Linear(self.dv, dm) # dv x dm
        self.ff2 = torch.nn.Linear(dm, dm)      # dm x dm
        self.norm2 = torch.nn.LayerNorm(dm)
        
    def embedding(self, x):
        """
        Maps input indices to embeddings.

        Args:
            x (torch.Tensor): Input indices of shape (batch_size=1, sequence_length).

        Returns:
            torch.Tensor: Embeddings of shape (batch_size=1, sequence_length, model_dimension).
        """
        return self.embedding_layer(x) #(t x n) x (n x dm) -> (t x dm)
       
    def attention(self, x_seq):
        """
        Performs one-head self-attention on input sequences.

        Args:
            x_seq (torch.Tensor): Token indices of shape (sequence_length,).

        Returns:
            torch.Tensor: Output of the attention layer of shape (sequence_length, model_dimension).
        """
        # Convert to word embedding
        pos_emb = self.embedding(x_seq)
        
        # Search
        Q = self.wQ(pos_emb)  # (t x dm) x (dm x dk) -> (t x dk)
        # Reference
        K = self.wK(pos_emb)  # (t x dm) x (dm x dk) -> (t x dk)
        # Attention features
        V = self.wV(pos_emb)  # (t x dm) x (dm x dv) -> (t x dv)
        
        # Similarity
        QK = torch.matmul(Q, K.T)  # (t x dk) x (dk x t) -> (t x t)
        QKn = QK / (self.dk ** 0.5)  # scale by sqrt(dk), as in the paper
        
        # Highlight similar words
        P = torch.nn.Softmax(dim=-1)(QKn)  # (t x t)
        
        # Focus on specific features
        H = torch.matmul(P, V)  # (t x t) x (t x dv) -> (t x dv)
        Hn = self.norm1(H)
        
        # Attention latent space
        R = self.relu(self.ff1(Hn))  # (t x dv) x (dv x dm) -> (t x dm)
        A = self.ff2(R)  # (t x dm) x (dm x dm) -> (t x dm)
        An = self.norm2(A)
        
        return An
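
The `LayerNorm` steps used in the encoder rescale each token vector to zero mean and unit variance before the learned scale and shift are applied. A minimal pure-Python sketch of that normalization, leaving out the learnable parameters (an assumption made to keep the example short), could look like this:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in x]

row = [1.0, 2.0, 3.0, 4.0]      # one made-up token vector
normalized = layer_norm(row)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
```

The small `eps` term only guards against division by zero, so the resulting variance is very close to, but not exactly, 1.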


Decoder

In [35]:
import random

class Decoder(torch.nn.Module):
    """
    A decoder module for a transformer-based neural network.

    This decoder class processes input sequences using one-head self-attention and feedforward layers, 
    and generates output sequences.

    Args:
        n (int): The size of the output vocabulary.
        dm (int): The model's embedding dimension.

    Attributes:
        _y (int): An internal variable to store the index of the generated output.
        dm (int): The model dimension.
        dk (int): The dimension of keys.
        dv (int): The dimension of values.
        embedding_layer (torch.nn.Embedding): The output embedding layer.
        norm1 (torch.nn.LayerNorm): Layer normalization after attention.
        wQ (torch.nn.Linear): Linear layer for queries in attention.
        wK (torch.nn.Linear): Linear layer for keys in attention.
        wV (torch.nn.Linear): Linear layer for values in attention.
        norm2 (torch.nn.LayerNorm): Layer normalization after feedforward layers.
        relu (torch.nn.ReLU): ReLU activation function.
        ff1 (torch.nn.Linear): First feedforward layer.
        ff2 (torch.nn.Linear): Second feedforward layer.
        norm3 (torch.nn.LayerNorm): Layer normalization after the final feedforward layer.
        fc (torch.nn.Linear): Linear layer for generating output logits.

    Methods:
        embedding(x): Maps output indices to embeddings.
        attention(ae, ad): Performs one-head encoder-decoder attention and returns label probabilities.
        masked_attention(x_seq): Performs masked one-head self-attention on input sequences.
    """
    def __init__(self, n, dm):
        super().__init__()
        self._y = c2i[random.choice(vowels)]  # Start from a random vowel (token index in `characters`)
        self.dm = dm  # model dim
        self.dk = dm 
        self.dv = dm 
        
        self.embedding_layer = torch.nn.Embedding(n, dm)  # n x dm
        self.norm1 = torch.nn.LayerNorm(dm)
        
        self.wQ = torch.nn.Linear(dm, self.dk)  # dm x dk
        self.wK = torch.nn.Linear(dm, self.dk)  # dm x dk
        self.wV = torch.nn.Linear(dm, self.dv)  # dm x dv
        self.norm2 = torch.nn.LayerNorm(dm)
        
        self.relu = torch.nn.ReLU()
        self.ff1 = torch.nn.Linear(self.dv, dm)
        self.ff2 = torch.nn.Linear(dm, dm)
        self.norm3 = torch.nn.LayerNorm(dm)
        self.fc = torch.nn.Linear(n_seq*dm, len(vowels))
        
    def embedding(self, x):
        """
        Maps output indices to embeddings.

        Args:
            x (torch.Tensor): Output indices of shape (batch_size=1, sequence_length).

        Returns:
            torch.Tensor: Embeddings of shape (batch_size=1, sequence_length, model_dimension).
        """
        return self.embedding_layer(x) #(b x t x n) x (n x dm) -> (b x t x dm)
    
    def attention(self, ae, ad):
        """
        Performs one-head encoder-decoder (cross) attention and predicts the next vowel.

        Args:
            ae (torch.Tensor): Encoder output of shape (sequence_length, model_dimension), used as reference (keys and values).
            ad (torch.Tensor): Masked attention output of shape (sequence_length, model_dimension), used for search (queries).

        Returns:
            torch.Tensor: Probabilities over the output labels, of shape (len(vowels),).
        """
        # Search
        Q = self.wQ(ad)  # (t x dm) x (dm x dk) -> (t x dk)
        # Reference
        K = self.wK(ae)  # (t x dm) x (dm x dk) -> (t x dk)
        # Attention features
        V = self.wV(ae)  # (t x dm) x (dm x dv) -> (t x dv)
        
        # Similarity
        QK = torch.matmul(Q, K.T)  # (t x dk) x (dk x t) -> (t x t)
        QKn = QK / (self.dk ** 0.5)  # scale by sqrt(dk), as in the paper
        
        # Highlight similar words
        P = torch.nn.Softmax(dim=-1)(QKn)    # (t x t)
        
        # Focus on specific features
        H = torch.matmul(P, V)              # (t x t) x (t x dv) -> (t x dv)
        Hn = self.norm2(H)
        
        # Attention latent space
        R = self.relu(self.ff1(Hn))  # (t x dv) x (dv x dm) -> (t x dm)
        A = self.ff2(R)  # (t x dm) x (dm x dm) -> (t x dm)
        An = self.norm3(A)
        
        # Probabilities
        flatten_a = An.view(-1)                  # 1 x (t x dm)
        logits = self.fc(flatten_a)             # (t.dm) x num of labels
        p = torch.nn.Softmax(dim=0)(logits)     # prob of each label
        
        # Get token id of output vowel
        vowel_idx = torch.argmax(p).item()
        vowel = vowels[vowel_idx]
        self._y = c2i[vowel]
        
        return p
    
    def masked_attention(self, x_seq):
        """
        Performs masked one-head self-attention on the shifted input sequence.

        Args:
            x_seq (torch.Tensor): Token indices of shape (sequence_length,).

        Returns:
            torch.Tensor: Output of the masked attention layer of shape (sequence_length, model_dimension).
        """
        # shift output embedding
        x_seq = x_seq.tolist()
        x_seq.pop(-1)
        x_seq.insert(0, self._y)
        _x_seq = torch.tensor(x_seq)
        
        # Word embedding
        pos_emb = self.embedding(_x_seq)
        
        # Search
        Q = self.wQ(pos_emb)  # (b x t x dm) x (dm x dk) -> (b x t x dk)
        # Reference
        K = self.wK(pos_emb)  # (b x t x dm) x (dm x dk) -> (b x t x dk)
        # Attention features
        V = self.wV(pos_emb)  # (b x t x dm) x (dm x dv) -> (b x t x dv)
        
        
        # Similarity
        QK = torch.matmul(Q, K.T)  # (t x dk) x (dk x t) -> (t x t) *removing batch info
        
        # Mask
        if x_seq not in valid_seqs:
            mask = torch.tensor([[0, 0, 0],
                                 [1, 1, 1],
                                 [1, 1, 1],])
        
            QK.masked_fill_(mask == 0, -1e9)
        
        QKn = QK / (self.dk ** 0.5)  # scale by sqrt(dk), as in the paper
        
        # Highlight similar words
        P = torch.nn.Softmax(dim=0)(QKn)    # (t x t)
        
        # Focus on specific features
        H = torch.matmul(P, V)              # (t x t) x (t x dv) -> (t x dv)
        
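        # Attention latent space
        # Note: the feedforward output A is computed but not used below;
        # the layer returns the normalized attention output H
        R = self.relu(self.ff1(H))  # (t x dv) x (dv x dm) -> (t x dm)
        A = self.ff2(R)  # (t x dm) x (dm x dm) -> (t x dm)
        An = self.norm1(H)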
        
        return An
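
For comparison, the masking scheme described in the "Attention Is All You Need" paper is a causal (lower-triangular) mask, where each position may attend only to itself and to earlier positions. The sketch below illustrates that standard formulation in pure Python; it is not the mask used in the class above, and the helper names and score values are assumptions.

```python
import math

def causal_mask(t):
    """Lower-triangular mask: position i may attend to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(t)] for i in range(t)]

def masked_softmax(scores, mask):
    """Row-wise softmax where masked-out entries (mask == 0) get zero weight."""
    out = []
    for row, mask_row in zip(scores, mask):
        filled = [s if m else -1e9 for s, m in zip(row, mask_row)]
        mx = max(filled)
        exps = [math.exp(v - mx) for v in filled]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Made-up similarity scores for a sequence of 3 tokens
scores = [[0.5, 1.0, 2.0],
          [1.5, 0.2, 0.3],
          [0.1, 0.4, 0.9]]
mask = causal_mask(3)
weights = masked_softmax(scores, mask)
```

Filling the blocked scores with a large negative number before the softmax drives their weights to zero while the remaining weights in each row still sum to 1.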


transformer

In [26]:
class transformer(torch.nn.Module):
    def __init__(self, n, dm):
        super().__init__()
        self.encoder = Encoder(n=n, dm=dm)
        self.decoder = Decoder(n=n, dm=dm)
        
    def forward(self, x):
        # encoder
        ae = self.encoder.attention(x)
        # decoder
        ad = self.decoder.masked_attention(x)
        y = self.decoder.attention(ae, ad)
            
        return y

### Train

In [27]:
# Encode targets
def one_hot_encode(t):
    encoding = torch.zeros(len(vowels))  # One-hot buffer, one slot per label

    for i, v in enumerate(vowels):
        # Set the slot of the vowel 'v' that matches the target vowel 't'
        if v == t:
            encoding[i] = 1

    # torch.zeros already returns a float tensor; targets need no gradient
    return encoding

# Pair inputs with their targets
train_samples = list(zip(X_train, y_train))
test_samples = list(zip(X_test, y_test))

# Get number of training samples
n_samples = len(train_samples)

# Build model
model = transformer(n=n, dm=16)

print(model)

# Define optimization algorithm
opt = torch.optim.SGD(model.parameters(), lr=0.0025, momentum=0.9)

# Define loss
cel = torch.nn.CrossEntropyLoss()

epochs = 100

for epoch in range(epochs):
    rl = 0.0
    for x_seq, t in train_samples:
        # avoid cumulative gradient
        opt.zero_grad()
        
        # predictions
        p = model(x_seq)

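        # compute error
        # Note: CrossEntropyLoss applies log-softmax to its input, so `p`
        # (already a softmax output) is normalized a second time here
        vowel = characters[t.item()]
        t_enc = one_hot_encode(vowel)
        loss = cel(p, t_enc)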
        
        # compute gradient
        loss.backward()
        
        # Clip the gradients
        max_grad_norm = 1.0  # Set your desired maximum gradient norm here
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

        # update weights
        opt.step()
        
        # running loss
        rl += loss.item()
    
    # Handcrafted learning rate decay: divide by 1.2 every 20% of the epochs
    if epoch % (0.2*epochs) == 0 and epoch > 0:
        for param_group in opt.param_groups:
            print(f"Changing learning rate to {param_group['lr']/1.2}")
            param_group['lr'] = param_group['lr']/1.2
    
    # avg running loss
    rl /= n_samples
    
    print(f"Epoch: {epoch+1} \tTraining Loss: {rl:.4f}")
    
print("Finished training")

transformer(
  (encoder): Encoder(
    (embedding_layer): Embedding(28, 16)
    (wQ): Linear(in_features=16, out_features=16, bias=True)
    (wK): Linear(in_features=16, out_features=16, bias=True)
    (wV): Linear(in_features=16, out_features=16, bias=True)
    (norm1): LayerNorm((16,), eps=1e-05, elementwise_affine=True)
    (relu): ReLU()
    (ff1): Linear(in_features=16, out_features=16, bias=True)
    (ff2): Linear(in_features=16, out_features=16, bias=True)
    (norm2): LayerNorm((16,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): Decoder(
    (embedding_layer): Embedding(28, 16)
    (norm1): LayerNorm((16,), eps=1e-05, elementwise_affine=True)
    (wQ): Linear(in_features=16, out_features=16, bias=True)
    (wK): Linear(in_features=16, out_features=16, bias=True)
    (wV): Linear(in_features=16, out_features=16, bias=True)
    (norm2): LayerNorm((16,), eps=1e-05, elementwise_affine=True)
    (relu): ReLU()
    (ff1): Linear(in_features=16, out_features=16, bias=True)
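
As a reference for the loss used in the training loop, cross-entropy is usually computed from raw scores (logits) by taking a log-softmax and weighting it with the target distribution. The following is a minimal pure-Python sketch of that computation, not PyTorch's implementation; the logits and helper names are made up for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def cross_entropy(logits, target_probs):
    """Cross-entropy between raw logits and a target distribution (e.g. one-hot)."""
    return -sum(t * lp for t, lp in zip(target_probs, log_softmax(logits)))

logits = [2.0, 0.5, 0.1, -1.0, 0.0, 0.3]   # one score per label in 'aeiou#'
target_a = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # one-hot target for 'a'
target_o = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # one-hot target for 'o'

loss_a = cross_entropy(logits, target_a)   # highest logit matches the target
loss_o = cross_entropy(logits, target_o)   # lowest logit matches the target
```

The loss is small when the largest logit belongs to the target label and grows as the target label's logit falls behind the others.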
   

### Evaluate

In [31]:
def evaluate(model, samples):
    c = 0  # Initialize a counter for correct predictions

    with torch.no_grad():  # No gradients are needed for evaluation
        for x_seq, t in samples:
            p = model(x_seq)

            # Find the index of the highest predicted value in 'p'
            i = torch.argmax(p).item()

            # Map the index 'i' to the corresponding vowel using 'i2v'
            prediction = i2v[i]

            # Map the target index to the corresponding character using 'i2c'
            target = i2c[t.item()]

            # Check if the model's prediction matches the target character
            if prediction == target:
                c += 1  # Increment the correct prediction counter

    # Calculate and print the percentage of correct predictions
    accuracy_percentage = (c / len(samples)) * 100
    print(f"Accuracy %: {accuracy_percentage}")

In [32]:
# Evaluate on the held-out test samples
evaluate(model, test_samples)

Accuracy %: 99.89827060020346


It looks like learning has happened.

I stopped here because the purpose of this task was just to understand transformers using a simple example.

More complex models can be built using transformer layers such as [TransformerEncoderLayer](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html#torch.nn.TransformerEncoderLayer) and [TransformerDecoderLayer](https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoderLayer.html) from PyTorch or another framework.

It is also possible to use pre-built models such as [bert](https://huggingface.co/docs/transformers/model_doc/bert),
[llama2](https://huggingface.co/docs/transformers/model_doc/llama2), among [others](https://huggingface.co/docs/transformers/index).

The next step will focus on understanding the fundamentals of generative networks with simple examples.

References:
1. https://www.youtube.com/watch?v=ySEx_Bqxvvo&list=PLtBw6njQRU-rwp5__7C0oIVt26ZgjG9NI&index=2
2. https://github.com/ramonlins/obsidian/tree/master/Papers/transformers/Ashish%20Vaswani
3. https://github.com/ramonlins/obsidian/blob/master/Papers/transformers/Ashish%20Vaswani/Attention%20Is%20All%20You%20Need.pdf