In [None]:
# this ipynb owes hugely to [Luis Fernando Torres](https://www.kaggle.com/lusfernandotorres/)


With the code and my comments, I hope you will not need any other material to fully understand a transformer.

Nah, I was kidding; that is impossible.

However, you should read the code and comments in this repo from top to bottom. Don't do this in a utilitarian way.

# the `scratch`
Import the prerequisite modules.

In [None]:
# PyTorch and python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split
from torch.utils.tensorboard import SummaryWriter

import math
import warnings
from tqdm import tqdm
from typing import Any, Union, Tuple
from pathlib import Path

One major job of torch is to let you manipulate a network's layers through Python classes without knowing the details of the GPU operations underneath.

`torch.nn` makes neural network layers easy to define and edit.

`torch.utils.data` makes data I/O simple, where:
- `Dataset` is usually the parent class that a customized dataset class inherits from. It is different from `datasets`, the HuggingFace library that serves as a hub of many ready-made datasets.
- `DataLoader` hands out batches of data in a fixed format. It deals with the trainer directly.
- `random_split` is a tool to split a dataset randomly into pieces such as train/test.
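
To make the three tools concrete, here is a minimal sketch. The `ToyPairs` class and its data are hypothetical, made up only for illustration; they are not part of this notebook's pipeline.

```python
import torch
from torch.utils.data import Dataset, DataLoader, random_split


class ToyPairs(Dataset):
    """Hypothetical dataset: item i is the pair (i, i squared)."""

    def __init__(self, n: int) -> None:
        self.x = torch.arange(n, dtype=torch.float32)
        self.y = self.x ** 2

    def __len__(self) -> int:
        return len(self.x)

    def __getitem__(self, idx: int):
        return self.x[idx], self.y[idx]


dataset = ToyPairs(10)

# random_split takes the sizes of the pieces; a seeded generator makes the split repeatable.
train_ds, val_ds = random_split(
    dataset, [8, 2], generator=torch.Generator().manual_seed(0)
)

# DataLoader stacks individual items into batches for the training loop.
train_loader = DataLoader(train_ds, batch_size=4, shuffle=True)
batch_shapes = [tuple(xb.shape) for xb, yb in train_loader]  # two batches of 4 items
```

A custom `Dataset` only needs `__len__` and `__getitem__`; batching and shuffling are left to `DataLoader`.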
    
`torch.utils.tensorboard` provides `SummaryWriter`, which logs what is happening during training. In my personal training experience it has almost never proven necessary. I would rather look at the loss as part of the training history, either in a loss plot or in a log.

On the Python level:
- `math` is useful for square roots and logarithms.
- `tqdm` is a progress bar, so that during iterative training you have something to look at other than a blank screen.
- `typing` follows the trend of annotating Python code with types, for tighter constraints on type and format.
- `Path` from `pathlib` is the choice of many coders, though I prefer `os.path`.

As a matter of fact, for things done at the plain Python level you are better off consulting Gemini and the like.

In [None]:
# HuggingFace libraries 
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

On the HuggingFace level:
- `datasets` usually refers to clean data downloadable from the official mirrors. People usually use them in tutorials and demos. Train, test, and validation splits are usually available.
- here `tokenizers` provides many detailed components to import. In many lazy codes, the tokenizer is simply loaded via:
```python
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_basic_tokenize=True)
```
or with the tokenizer of another pretrained foundation model substituted in.
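
For contrast with the pretrained shortcut, the sketch below trains a word-level tokenizer from scratch with the components imported above. The two-sentence corpus and the choice of special tokens are assumptions made for illustration only.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical toy corpus, only for illustration.
corpus = ["the cat sat on the mat", "the dog sat on the log"]

# WordLevel maps every whole word to one id; unseen words fall back to [UNK].
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordLevelTrainer(
    special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"], min_frequency=1
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoded = tokenizer.encode("the zebra sat")  # "zebra" never appeared in the corpus
unk_id = tokenizer.token_to_id("[UNK]")
```

Here `encoded.ids` holds one id per word, and the id for "zebra" is the `[UNK]` id because the word was not in the training corpus.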


# from 'scratches' to blocks!
suppose you are familiar with transformer architecture now.

## Input Embedding

In [None]:
# Input Embedding: from tokens to embeddings
class InputEmbedings(nn.Module): # as usual, you need to inherit nn.Module
    def __init__(self, d_model: int, vocab_size: int): # two parameters only, d_model and vocab_size
        super().__init__() # your IDE should autocomplete this before you actually type it all.
        self.d_model = d_model # routine initialization: do self.xxx = xxx for every input parameter you need later
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, d_model) # people usually don't care much about how the embedding is implemented; they just use nn.Embedding directly here.

    def forward(self, x): # in an nn.Module, there is always __init__ and forward.
        return self.embedding(x) * math.sqrt(self.d_model) # scale the embeddings by sqrt(d_model), as in the original paper, so they are not too small next to the positional encoding added afterwards. Can't argue with popular protocols.
    

## Positional Embedding

The code for positional embedding is fascinating the first time you see it.
However, the enthusiasm drops after you have seen too many of them.
Some variants of positional embedding are interesting indeed, as in some fancy papers.
For the record, the mathematical formula for the vanilla positional encoding is:

$$
\text{Even indices } (2i): \quad \text{PE}(\text{pos}, 2i) = \sin\left(\frac{\text{pos}}{10000^{2i / d_{model}}}\right)
$$
$$
\text{Odd indices } (2i + 1): \quad \text{PE}(\text{pos}, 2i + 1) = \cos\left(\frac{\text{pos}}{10000^{2i / d_{model}}}\right)
$$

The positional encoding is added to the original embeddings.
Intuitively, in any given dimension among `d_model`, the encoding of a position is similar to that of neighbors a certain number of positions away. This range differs per dimension and determines the 'capability' of detecting *something worth detecting*, which is close to the idea of *attention* if you think about it. The denominator inside the trigonometric brackets sets that range: it fixes the wavelength, so low dimensions vary quickly with position and high dimensions vary slowly. Now that I've explained what the denominator is doing, what about the `sin` and `cos`? Again intuitively, they come as a pair for symmetry, an idea deeply rooted in the Fourier transform: with both present, the encoding of a shifted position is a fixed rotation of the original one.
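
The two intuitions above can be checked numerically. The sketch below builds the encoding table directly from the formula, with toy sizes chosen only for illustration, and verifies two consequences of the `sin`/`cos` pairing: every position has the same norm, and the dot product between two positions depends only on their offset.

```python
import math
import torch

d_model, seq_len = 8, 50  # toy sizes, chosen only for illustration

position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)  # (seq_len, 1)
div_term = torch.exp(
    torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
)  # (d_model / 2,)

pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions

# sin^2 + cos^2 = 1 for each pair, so every row has squared norm d_model / 2.
row_norms_sq = (pe ** 2).sum(dim=1)

# sin(a)sin(b) + cos(a)cos(b) = cos(a - b), so the dot product of two
# positions depends only on how far apart they are.
dot_a = pe[3] @ pe[7]    # offset 4
dot_b = pe[20] @ pe[24]  # offset 4 again
```

Both `dot_a` and `dot_b` equal the sum over frequencies of the cosine of 4 times that frequency, which is why a fixed offset looks the same anywhere in the sequence.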



In [None]:
#  Positional Encoding
class PositionalEncoding(nn.Module): # as usual, you need to inherit nn.Module
    def __init__(self, d_model: int, seq_len: int, dropout: float) -> None: # three parameters only here. In some code, seq_len is named max_len.
        super().__init__() # now your IDE knows before you actually type it all.
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(dropout) # Dropout is used throughout the transformer; people take it from nn directly.

        pe = torch.zeros(seq_len, d_model) # pe as in positional encoding; one row per position, each row the same shape as *one* embedded token.
        position = torch.arange(0, seq_len, dtype = torch.float).unsqueeze(1) # the position index, as a 2D tensor. Shape: (seq_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)) # shape: (d_model / 2, )
        # position[i] = i
        # for each even j in 0, 2, 4, ...:
        # div_term = exp(j * (-log(10000.0)) / d_model)
        #          = exp(-log(10000.0) * j / d_model)
        #          = 10000.0^(-j/d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # note that j is an even number. So, with j = 2k, where k runs over the integers from 0 to d_model / 2 - 1, position * div_term is i * (10000.0^(-2/d_model))^k, which is easier to program.

        pe = pe.unsqueeze(0) # add a leading 'batch' dimension so pe broadcasts when added to x. Shape: (1, seq_len, d_model)
        self.register_buffer('pe', pe) # Registering 'pe' as a buffer. A buffer is a tensor saved with the module but not treated as a model parameter, i.e. untrainable.


    def forward(self, x: torch.Tensor) -> torch.Tensor: 
        x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)  # slice pe to the actual sequence length. `requires_grad_(False)` ensures that the positional encoding is not updated during backpropagation, treating it as a constant.
        return self.dropout(x) # a quite common practice: apply dropout to the output in one concise line.
    






### dropout vs residual
A dropout layer is a regularization technique commonly used in neural networks. It randomly "drops out" a certain percentage of neurons during training, preventing the network from becoming overly reliant on any particular neuron. This helps to reduce overfitting and improve generalization.

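During training, each neuron has a probability (usually set between 0.2 and 0.5) of being "dropped out."
If a neuron is dropped out, its output is set to zero.
During testing, all neurons are active. To keep the expected activation the same in both modes, either the test-time outputs are scaled by the keep probability (1 minus the dropout probability), or, as `nn.Dropout` does, the surviving training-time outputs are scaled up by 1 / (1 - p) and nothing is rescaled at test time.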
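
A short check of how `nn.Dropout` behaves in the two modes. This is a sketch with an arbitrary all-ones input: in PyTorch the rescaling happens during training, by 1 / (1 - p), and evaluation mode is the identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.5
drop = nn.Dropout(p)
x = torch.ones(1000)

drop.train()
y_train = drop(x)  # each entry is either 0 or 1 / (1 - p) = 2.0

drop.eval()
y_eval = drop(x)   # identity: nothing is dropped or rescaled

kept_fraction = (y_train != 0).float().mean().item()  # close to 1 - p
```

Calling `model.eval()` on a whole network switches every dropout layer inside it to the identity, which is why forgetting it makes inference noisy.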

A residual layer, also known as a skip connection, is a technique used to improve the training of deep neural networks. It allows information to flow directly from earlier layers to later layers, helping to prevent the vanishing gradient problem and making it easier for the network to learn deep features.

A residual layer consists of two parts: a stack of layers (e.g., convolutional layers, fully connected layers) and a shortcut connection.
The shortcut connection directly adds the input of the stack to the output of the stack.
This allows the network to learn the residual function, which represents the difference between the desired output and the input.
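
As a minimal sketch of the shortcut, assume a single `nn.Linear` stands in for the "stack of layers" and is zero-initialized on purpose. The block then starts out as the identity, and the gradient still reaches the input through the shortcut.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

stack = nn.Linear(4, 4)        # stands in for the stack of layers
nn.init.zeros_(stack.weight)   # zeroed only to isolate the shortcut path
nn.init.zeros_(stack.bias)

x = torch.randn(2, 4, requires_grad=True)
y = x + stack(x)               # shortcut: add the input to the stack's output

# Even though the stack contributes nothing, the gradient flows back
# through the "+ x" path, so d(sum y)/dx is all ones.
y.sum().backward()
```

With a non-zero stack the same addition lets the layers learn only the correction on top of `x`, which is the residual function described above.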


Key Differences

- Purpose: Dropout is primarily used for regularization to prevent overfitting, while residual layers are used to improve the training of deep networks.
- Mechanism: Dropout randomly drops out neurons, while residual layers add a shortcut connection.
- Impact: Dropout affects the training process by introducing noise, while residual layers affect the network architecture.

In summary:

- Dropout is a regularization technique that helps prevent overfitting by randomly dropping out neurons.
- Residual layers are a technique used to improve the training of deep networks by adding shortcut connections.

### pytorch multiplication, trailing singleton, and broadcasting.
PyTorch follows a set of rules to determine when and how to expand tensors during broadcasting operations. These rules are:

- Alignment from the right: shapes are compared starting from the trailing (last) dimension. If one tensor has fewer dimensions, size-1 dimensions are prepended to its shape.
- Compatible Dimensions: each pair of aligned dimensions must be compatible. This means that they must either be equal or one of them must be 1. A size-1 dimension is expanded to match the other tensor.

In the case of (3, 1) * (4,):
- The tensor of shape (4,) gets a leading singleton dimension: (4,) is treated as (1, 4).
- The dimensions (3, 1) and (1, 4) are now compatible, and the result has shape (3, 4).

Another example for broadcasting:

Scenario:
- Tensor A: Shape (2, 3)
- Tensor B: Shape (3,)

Goal: Multiply Tensor A and Tensor B.

Explanation:
- Dimension Compatibility: The dimensions of Tensor A and Tensor B are compatible because the last dimension of Tensor A (3) matches the only dimension of Tensor B (3).
- Broadcasting: PyTorch treats Tensor B as shape (1, 3) and then expands it along the first dimension to (2, 3) to match Tensor A.
- Element-wise Multiplication: Now, both tensors have the same shape (2, 3), and PyTorch can perform element-wise multiplication.
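
Both cases can be run directly. The values below are arbitrary; only the shapes matter. `torch.broadcast_shapes` reports the resulting shape without doing any arithmetic.

```python
import torch

# Case 1: (3, 1) * (4,) -> (3, 4), the same pattern as position * div_term.
col = torch.tensor([[1.0], [2.0], [3.0]])     # shape (3, 1)
row = torch.tensor([10.0, 20.0, 30.0, 40.0])  # shape (4,), treated as (1, 4)
outer = col * row                             # shape (3, 4)

# Case 2: (2, 3) * (3,) -> (2, 3), B is reused for every row of A.
a = torch.ones(2, 3)
b = torch.tensor([1.0, 2.0, 3.0])
prod = a * b                                  # shape (2, 3)

# Ask PyTorch for the broadcast result shape without computing anything.
shape_check = torch.broadcast_shapes((3, 1), (4,))
```

Incompatible shapes such as (3, 2) and (4,) raise a `RuntimeError` instead of being expanded.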

## Add & Norm

In [None]:
# Add & Norm
class LayerNormalization(nn.Module): # as usual, you need to inherit nn.Module
    def __init__(self,  eps: float = 10**-6) -> None: 
        super().__init__() # your IDE knows it now, I am sure about it.
        self.eps = eps # small constant to avoid division by zero

        self.alpha = nn.Parameter(torch.ones(1)) # the scale parameter
        self.bias = nn.Parameter(torch.zeros(1)) # the shift parameter
        # these two parameters are to be trained.

    def forward(self, x: torch.Tensor) -> torch.Tensor: # in an nn.Module, there is always __init__ and forward.
        mean = x.mean(-1, keepdim = True) # the mean along the last dimension, the dimension of d_model
        std = x.std(-1, keepdim = True) # the standard deviation along the last dimension, the dimension of d_model
        # 'keepdim = True' keeps the reduced dimension with size 1, so mean and std have shape (..., 1) and broadcast against x.
    
        return self.alpha * (x - mean) / (std + self.eps) + self.bias



### Layer, Group, and Batch Normalization: A Comparison

Normalization techniques are essential for training deep neural networks, as they help to stabilize the learning process and improve generalization. Layer, Group, and Batch normalization are common methods used for this purpose.

Layer Normalization (LN)
- Normalization: Normalizes each sample across its feature dimensions (here, the `d_model` dimension), independently of the other samples in the batch.
- Benefits: Can be more stable than batch normalization for recurrent neural networks or when batch sizes are small.
- Drawbacks: Can be less effective for convolutional neural networks.

Group Normalization (GN)
- Normalization: Normalizes the input to a layer across a group of channels, within each sample.
- Benefits: Can be more stable than batch normalization when the batch size is small or when dealing with data with varying numbers of channels.
- Drawbacks: May not be as effective as batch normalization for some tasks.

Batch Normalization (BN)
- Normalization: Normalizes each feature across the batch dimension.
- Benefits: Helps to stabilize the learning process and can improve the convergence speed of training.
- Drawbacks: Can be sensitive to small batch sizes and may not be as effective for recurrent neural networks.
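
The difference in normalization axis is easy to see on a small 2D input. This sketch uses PyTorch's built-in `nn.LayerNorm` and `nn.BatchNorm1d` rather than the `LayerNormalization` class above, with arbitrary toy sizes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 6) * 3 + 5  # (batch, features), deliberately not zero-mean

ln = nn.LayerNorm(6)           # normalizes over the 6 features of each sample
bn = nn.BatchNorm1d(6)         # normalizes each feature over the batch
bn.train()                     # use batch statistics, not running averages

x_ln = ln(x)
x_bn = bn(x)

ln_row_means = x_ln.mean(dim=1)  # about 0 for every sample (row)
bn_col_means = x_bn.mean(dim=0)  # about 0 for every feature (column)
```

Because LN works per sample, its output for one sequence does not change when the rest of the batch changes, which suits variable-length text batches.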

## Feed Forward

In [None]:
# Feed Forward
class FeedForwardBlock(nn.Module): # as usual, you need to inherit nn.Module
    def __init__(self, d_model: int, d_ff: int, dropout: float) -> None: 
        super().__init__() # your IDE knows it now.
        # (batch, seq_len, d_model) --> (batch, seq_len, d_ff) --> (batch, seq_len, d_model)
        self.linear_1 = nn.Linear(d_model, d_ff)  # linear_1 has two parameter tensors: weight and bias.
        self.dropout = nn.Dropout(dropout)  # the dropout layer does not have any parameters that need to be trained.
        self.linear_2 = nn.Linear(d_ff, d_model)
        # 4 parameter tensors to be trained in total (two weights, two biases).
        
    def forward(self, x: torch.Tensor) -> torch.Tensor: # in an nn.Module, always __init__ and forward.
        # (batch, seq_len, d_model) --> (batch, seq_len, d_ff) --> (batch, seq_len, d_model)
        # the above shape-comment format is prevalent and magnificent.
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x)))) # a concise way of chaining linear, relu, dropout, and linear. The attribute names must match those defined in __init__.
    

## Multi-Head Attention

In [None]:
# Multi-Head Attention
class MultiHeadAttention(nn.Module): # as usual, you need to inherit nn.Module
    def __init__(self, d_model: int, h: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model
        self.h = h # number of heads

        assert d_model % h == 0, "d_model must be divisible by h"
        self.d_k = d_model // h # d_k is the dimension of each attention head's key, query, and value vectors
        # every head occupies a slice of the total d_model.

        self.w_q = nn.Linear(d_model, d_model) # W_q, projects q: d_model --> d_model
        self.w_k = nn.Linear(d_model, d_model) # W_k, projects k: d_model --> d_model
        self.w_v = nn.Linear(d_model, d_model) # W_v, projects v: d_model --> d_model
        self.w_o = nn.Linear(d_model, d_model) # W_o, projects the concatenated heads: d_model --> d_model. This matrix produces the final result.

        self.dropout = nn.Dropout(dropout) # dropout is almost everywhere.

    @staticmethod
    def attention(query, key, value, mask, dropout: nn.Dropout): # people use this to obtain the attention scores and the weighted values.
        d_k = query.shape[-1] # read d_k from the tensor itself; a static method has no access to self.d_k
        attention_scores =  (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
        # (batch, h, seq_len, d_k) @ (batch, h, d_k, seq_len) --> (batch, h, seq_len, seq_len)
        # the batch and head dimensions are always there; the matmul acts on the last two dimensions
        # @ is for matrix multiplication
        # this is a strict implementation of the mathematical formula for the attention score BEFORE softmax

        if mask is not None: # masking shows up in both self- and cross-attention.
            attention_scores = attention_scores.masked_fill(mask == 0, -1e9) # masked positions get -1e9 (or -inf in other tutorials I have read), so softmax gives them ~0 weight. For example, a causal mask blocks the elements above the diagonal intentionally.
            # masked_fill is a method of the Tensor class.

        attention_scores = attention_scores.softmax(dim = -1) # applying softmax over the last dimension, so each row of weights sums to 1.
        # softmax(dim=-1) can be applied to a tensor of any shape.

        attention_scores = dropout(attention_scores) # dropout is everywhere.
        return (attention_scores @ value), attention_scores 
    
    def forward(self, q, k, v, mask):
        query = self.w_q(q) # (batch, seq_len, d_model) --> (batch, seq_len, d_model)
        key = self.w_k(k)
        value = self.w_v(v)

        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1,2)
        # split the last dimension of query into h heads, then bring the head dimension to position 1: (batch, h, seq_len, d_k)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1,2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1,2)

        x, self.attention_scores = MultiHeadAttention.attention(query, key, value, mask, self.dropout)

        x = x.transpose(1,2).contiguous().view(x.shape[0], -1, self.d_model) # recover!! (batch, h, seq_len, d_k) --> (batch, seq_len, d_model). -1 infers seq_len; x.shape[1] would be h here, because x itself is not transposed yet.

        return self.w_o(x)
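
To see the head split, the mask, and the merge in isolation, here is a standalone sketch with toy sizes. It skips the learned projections (`w_q`, `w_k`, `w_v`, `w_o`) and dropout, and the `split_heads` helper is a hypothetical name introduced only for this example.

```python
import math
import torch

torch.manual_seed(0)
batch, seq_len, d_model, h = 2, 5, 8, 2
d_k = d_model // h

x = torch.randn(batch, seq_len, d_model)


def split_heads(t: torch.Tensor) -> torch.Tensor:
    # (batch, seq_len, d_model) --> (batch, h, seq_len, d_k)
    return t.view(t.shape[0], t.shape[1], h, d_k).transpose(1, 2)


q = k = v = split_heads(x)  # no learned projections in this sketch

# Causal mask: 1 where attention is allowed (on and below the diagonal).
mask = torch.tril(torch.ones(seq_len, seq_len)).view(1, 1, seq_len, seq_len)

scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)  # (batch, h, seq_len, seq_len)
scores = scores.masked_fill(mask == 0, -1e9)
weights = scores.softmax(dim=-1)                     # each row sums to 1
out = weights @ v                                    # (batch, h, seq_len, d_k)

# Merge heads back: (batch, h, seq_len, d_k) --> (batch, seq_len, d_model)
merged = out.transpose(1, 2).contiguous().view(batch, -1, d_model)

# Splitting and merging without attention in between is a lossless round trip.
round_trip = split_heads(x).transpose(1, 2).contiguous().view(batch, -1, d_model)
```

The `.contiguous()` call is needed because `transpose` only changes strides, and `view` requires the memory layout to match the new shape.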



# From blocks to bigger blocks!
Residual Connection, EncoderBlock/Encoder with many EncoderBlocks, DecoderBlock/Decoder with many DecoderBlocks
## Residual Connection

In [None]:
# residual connection
class ResidualConnection(nn.Module): 
    def __init__(self, dropout: float) -> None:
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization()
    
    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x))) # dropout is everywhere


## EncoderBlock and Encoder
An EncoderBlock repeats itself in an Encoder.

In [None]:
# EncoderBlock
class EnoderBlock(nn.Module): # as usual, you need to inherit nn.Module
    def __init__(self, self_attention_block: MultiHeadAttention, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
        super().__init__()
        