### [Diving Into the Transformer Attention Mechanism](https://medium.com/data-science-collective/understanding-transformer-attention-mechanism-ffed36e821bb)

The attention mechanism is essentially a set of matrix operations that form the mathematical foundation of the model.

There are two scenarios where building a Transformer from scratch makes sense: customizing the architecture, or learning how it works.

Customizing means modifying the model’s mathematical and statistical core, which is both advanced and time-consuming.

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# 1.a Install PyTorch and NumPy
!pip install -q -U torch numpy

In [None]:
# 1.b Imports
import torch
from torch import nn

In [None]:
# 2. Transformer model
class Transformer(nn.Module):

    # 2.a Constructor method
    def __init__(self, vocab_size, embedding_dim,
                 n_heads, n_layers, dropout):
        
        # 2.b Initialize parent class (nn.Module)
        super().__init__()

        # 2.c Store model hyperparameters
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.n_heads = n_heads
        self.n_layers = n_layers
        self.dropout = dropout

        # 2.d Embedding layer to map tokens to vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # 2.e Multi-head self-attention layer
        self.attention = nn.MultiheadAttention(
            embed_dim=embedding_dim,
            num_heads=n_heads,
            dropout=dropout,
            batch_first=True
        )

        # 2.f Feed-forward network applied after attention
        self.feed_forward = nn.Sequential(
            nn.Linear(embedding_dim, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim)
        )

        # 2.g Final linear layer projecting back to vocab size
        self.out = nn.Linear(embedding_dim, vocab_size)

    # 2.h Forward pass
    def forward(self, x):

        # 2.i Apply embedding layer
        x = self.embedding(x)

        # 2.j Apply multi-head self-attention
        x, _ = self.attention(x, x, x)

        # 2.k Apply feed-forward network
        x = self.feed_forward(x)

        # 2.l Apply output projection
        x = self.out(x)

        return x

In [None]:
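# 3.a Instantiate the model (illustrative hyperparameter values) and list all submodules
model = Transformer(vocab_size=100, embedding_dim=64, n_heads=4, n_layers=1, dropout=0.1)
model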

In [None]:
# 3.b Access the self-attention layer
model.attention
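
As a quick check of what the self-attention layer produces, the sketch below runs `nn.MultiheadAttention` on a random batch. The sizes are illustrative assumptions, not values taken from the model above.

```python
import torch
from torch import nn

torch.manual_seed(0)  # reproducible random weights and inputs

# Illustrative sizes (assumptions); embedding_dim must be divisible by n_heads
batch_size, seq_len, embedding_dim, n_heads = 2, 10, 64, 4

attention = nn.MultiheadAttention(
    embed_dim=embedding_dim,
    num_heads=n_heads,
    batch_first=True,  # inputs are (batch, sequence, embedding)
)

x = torch.randn(batch_size, seq_len, embedding_dim)

# Self-attention: the same tensor is used as query, key, and value
output, weights = attention(x, x, x)

print(output.shape)   # torch.Size([2, 10, 64])
print(weights.shape)  # torch.Size([2, 10, 10]), averaged over heads by default
```

The output keeps the input shape, while the weights hold one value per (query position, key position) pair.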

#### Building a Transformer Model (Without Framework)

This section builds a Transformer model from scratch using only NumPy, without relying on a deep learning framework.

The model will process sequences of 10 tokens — this is the first crucial definition for any Transformer-like architecture.

The model's goal varies by application:

- In computer vision, it might classify images.
- In time series, it predicts future values.
- In NLP, it generates output sequences from input sequences.

The _Transformer_ consists of four main components:

- an **_embedding layer_** (converts tokens into vectors)
- an **_attention mechanism_** (focuses on relevant parts of the sequence)
- **_encoder/decoder blocks_** (for sequence processing)
- a **_linear layer followed by softmax_** (to produce predictions)

The implementation below is the simplest version of a Transformer, a **_Minimum Viable Product (MVP)_** containing only the essential elements: `embedding`, `attention`, and `output`.

#### Model Construction — Initial Hyperparameters

The first step is to define the model’s initial hyperparameters. These values should be chosen based on your specific objective. Start with some baseline values and adjust them as needed to achieve the desired results.

In [None]:
# 4. Imports
import numpy as np

# 4.a Model dimension
dim_model = 64

# 4.b Sequence length
seq_length = 10

# 4.c Vocabulary size
vocab_size = 100

#### Model Construction — Embedding Layer

In [None]:
# 5. Creates an embedding matrix for input tokens
def embedding(input, vocab_size, dim_model):

    # 5.a Initialize embedding matrix with random values
    # Each row corresponds to a token vector in the vocabulary
    embed = np.random.randn(vocab_size, dim_model)

    # 5.b For each token index in input, retrieve its embedding
    return np.array([embed[i] for i in input])
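
Note that the function above draws a new random matrix on every call, so the same token index maps to a different vector each time. In a trained model the table is created once and its values are learned. A minimal sketch of a lookup against a fixed table, assuming a seeded generator and hypothetical token indices:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded for reproducibility

vocab_size, dim_model = 100, 64  # same values as the hyperparameters above

# Embedding table created once: one row per token in the vocabulary
embedding_table = rng.standard_normal((vocab_size, dim_model))

token_ids = np.array([5, 17, 5, 42])  # hypothetical token indices

# Integer-array indexing picks one row per token index
embedded = embedding_table[token_ids]

print(embedded.shape)                            # (4, 64)
print(np.array_equal(embedded[0], embedded[2]))  # True: same token, same vector
```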

#### Q, K, and V Components in the Transformer
The Transformer’s attention mechanism relies on three core components:
**_Query (Q)_**, **_Key (K)_**, and **_Value (V)_**.
These are used to process sequences and determine the relevance of each input element.

- **Q (Query)**

_Purpose_: Represents the current item of interest

_Use_: Generates a query for each position in the sequence to assess relevance

- **K (Key)**

_Purpose_: Represents each element in the input in a form that queries can be matched against

_Use_: Compared against queries to determine attention weights

- **V (Value)**

_Purpose_: Contains the actual content to be extracted

_Use_: Supplies the output, weighted by attention scores
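
The simplified model in this notebook passes the embedded input directly as `Q`, `K`, and `V`. In the standard Transformer, each of the three is a separate learned linear projection of the input. The sketch below illustrates that idea with random, untrained matrices; the names `W_q`, `W_k`, and `W_v` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

seq_length, dim_model = 10, 64

# Stand-in for the embedded input sequence: one row per token
X = rng.standard_normal((seq_length, dim_model))

# Hypothetical projection matrices; in a real model these are learned
W_q = rng.standard_normal((dim_model, dim_model))
W_k = rng.standard_normal((dim_model, dim_model))
W_v = rng.standard_normal((dim_model, dim_model))

Q = X @ W_q  # what each position is looking for
K = X @ W_k  # what each position offers for matching
V = X @ W_v  # the content each position contributes

print(Q.shape, K.shape, V.shape)  # (10, 64) (10, 64) (10, 64)
```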

#### Model Construction — Attention Mechanism
The next component is the Attention Mechanism, which relies on three main elements: `Q`, `K`, and `V`.

#### Model Construction — Softmax Activation Function
The _Softmax_ function is a classic activation used in neural networks, especially for classification tasks. It transforms raw output values (logits) into probabilities that sum to 1, with each value ranging between 0 and 1.

When the model produces logits as part of its computation, Softmax converts them into interpretable probabilities.

In [None]:
# 6. Softmax activation function
def softmax(x):

    # 6.a Subtract the row-wise max before exponentiating to avoid overflow
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))

    # 6.b Normalize across the last axis (axis=-1)
    # keepdims=True keeps the summed axis so broadcasting works for any input rank
    return e_x / e_x.sum(axis=-1, keepdims=True)
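
To see why the max is subtracted, the sketch below applies softmax to a small input and to the same input shifted by 1000. The helper name `stable_softmax` is hypothetical; it is a self-contained variant written for this example.

```python
import numpy as np

def stable_softmax(x):
    # Hypothetical helper: subtracts the row-wise max before exponentiating
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e_x = np.exp(shifted)
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = small + 1000.0  # np.exp(1000.0) on its own would overflow to inf

probs_small = stable_softmax(small)
probs_large = stable_softmax(large)

print(probs_small)                            # approximately 0.090, 0.245, 0.665
print(np.allclose(probs_small, probs_large))  # True
```

Softmax depends only on the differences between logits, so shifting every value by the same constant leaves the probabilities unchanged.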

#### Model Construction — Scaled Dot-Product

The `scaled_dot_product_attention` function is a core component of the attention mechanism in Transformer models.

It computes attention scores by comparing `queries` against `keys`, then uses those scores to weight the `values`, allowing the model to assign different levels of importance to each part of the input.
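
Written as a formula, with $d_k$ the dimensionality of the key vectors (`depth` in the code), the computation is:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the vector dimension, which would otherwise push the softmax toward nearly one-hot outputs.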

In [None]:
# 7. Scaled dot-product attention function
def scaled_dot_product_attention(Q, K, V):

    # 7.a Compute dot product between Q and K transposed
    matmul_qk = np.dot(Q, K.T)

    # 7.b Get the dimensionality of the key vectors
    depth = K.shape[-1]

    # 7.c Scale the logits by the square root of depth
    logits = matmul_qk / np.sqrt(depth)

    # 7.d Apply softmax to get attention weights
    attention_weights = softmax(logits)

    # 7.e Compute weighted sum of values
    output = np.dot(attention_weights, V)

    # 7.f Return final output
    return output
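
To inspect the intermediate attention weights, the sketch below uses a hypothetical variant, `attention_with_weights`, that returns them alongside the output. It is self-contained and uses random data as a stand-in for embedded tokens.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax with max subtraction for numerical stability
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def attention_with_weights(Q, K, V):
    # Hypothetical variant of the function above that also returns the weights
    logits = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(logits)
    return weights @ V, weights

rng = np.random.default_rng(seed=0)
X = rng.standard_normal((10, 64))  # stand-in for 10 embedded tokens

output, weights = attention_with_weights(X, X, X)

print(output.shape)   # (10, 64): one output vector per input position
print(weights.shape)  # (10, 10): one weight per (query, key) pair
print(np.allclose(weights.sum(axis=-1), 1.0))  # True: each row is a distribution
```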

In [None]:
# 9. Linear transformation followed by softmax
def linear_and_softmax(input):

    # 9.a Randomly initialize weight matrix
    # Maps from model dimension to vocabulary size
    W = np.random.randn(dim_model, vocab_size)

    # 9.b Apply linear transformation (dot product)
    # Projects input into vocabulary space
    logits = np.dot(input, W)

    # 9.c Convert logits into probabilities
    return softmax(logits)

In [None]:
# 10. Final transformer model function
def transformer_model(input):

    # 10.a Generate embedding for input sequence
    embedded_input = embedding(input, vocab_size, dim_model)

    # 10.b Apply attention mechanism using Q, K, V = embedded_input
    attention_output = scaled_dot_product_attention(
        embedded_input, embedded_input, embedded_input
    )

    # 10.c Apply linear transformation followed by softmax
    output_probabilities = linear_and_softmax(attention_output)

    # 10.d Select token indices with highest probabilities
    output_indices = np.argmax(output_probabilities, axis=-1)
    return output_indices

In [None]:
# 11. Generate random input sequence for inference
input_sequence = np.random.randint(0, vocab_size, seq_length)

print("Input Sequence:", input_sequence)

In [None]:
# 11.a Predict output using transformer model
output = transformer_model(input_sequence)

print("Model Output:", output)
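
Because `embedding` and `linear_and_softmax` draw new random weights on every call, running the cell above twice on the same input generally gives different outputs. None of the weights are trained, so the predicted indices carry no meaning yet. The sketch below shows one way to make the pipeline deterministic by creating the parameters once with a seeded generator. The function name `fixed_transformer_model` and the seed value are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed value is arbitrary

dim_model, seq_length, vocab_size = 64, 10, 100

# Parameters are created once and reused on every call
embedding_table = rng.standard_normal((vocab_size, dim_model))
W_out = rng.standard_normal((dim_model, vocab_size))

def softmax(x):
    # Row-wise softmax with max subtraction for numerical stability
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def fixed_transformer_model(token_ids):
    # Hypothetical name; same three steps as transformer_model above
    x = embedding_table[token_ids]                     # embedding lookup
    weights = softmax(x @ x.T / np.sqrt(x.shape[-1]))  # attention weights
    attended = weights @ x                             # weighted sum of values
    probabilities = softmax(attended @ W_out)          # project to vocabulary
    return np.argmax(probabilities, axis=-1)

token_ids = rng.integers(0, vocab_size, size=seq_length)

first = fixed_transformer_model(token_ids)
second = fixed_transformer_model(token_ids)

print(first.shape)                    # (10,)
print(np.array_equal(first, second))  # True: same input, same output
```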