#### Transformer Model Implementation

The script implements a simplified, encoder-only version of the Transformer, a neural network architecture used primarily for natural language processing tasks.

#### 1. **Components**

- **Softmax Function:** Converts a vector of numbers into a probability distribution
- **Scaled Dot-Product Attention:** Calculates attention between queries (Q), keys (K), and values (V)
- **Multi-Head Attention:** Allows the model to focus on different parts of the input simultaneously
- **Feed-Forward Network:** Processes the outputs from the attention layer
- **Encoder Layer:** Combines multi-head attention and the feed-forward network
- **Transformer:** Combines multiple encoder layers and adds an embedding layer

#### 2. **Data Flow**
The input token indices are converted into embeddings. Each encoder layer then processes the data:
  - a. Multi-head attention calculates attention
  - b. The feed-forward network processes the attention output

#### 3. **Final Output**
The final output represents the encoding of the input sequence.

In [3]:
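import numpy as np

def softmax(x):
    """
    Numerically stable softmax over the last axis
    """
    # Normalize over the last axis (the keys) so it works for 4-D
    # (batch, heads, seq_q, seq_k) attention scores; axis=1 would
    # normalize across heads instead.
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)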

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V

    Note: `mask` is accepted but not applied in this simplified version.
    """
    # Calculate the dimension of the key vectors
    d_k = K.shape[-1]

    # Calculate the attention scores
    attention_scores = np.einsum('...ij,...kj->...ik', Q, K) / np.sqrt(d_k)

    # Apply softmax to the attention scores
    attention_probs = softmax(attention_scores)

    # Apply the attention probabilities to the value vectors
    output = np.einsum('...ij,...jk->...ik', attention_probs, V)

    return output

# Multi-Head Attention
class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        self.num_heads = num_heads
        self.d_model = d_model
        self.depth = d_model // num_heads

        # Initialize the weights for the query, key, value, and output matrices
        self.WQ = np.random.randn(d_model, d_model)
        self.WK = np.random.randn(d_model, d_model)
        self.WV = np.random.randn(d_model, d_model)
        self.WO = np.random.randn(d_model, d_model)

    def split_heads(self, x):
        """
        Split the last dimension into (num_heads, depth)
        """
        batch_size = x.shape[0]
        x = x.reshape(batch_size, -1, self.num_heads, self.depth)
        return x.transpose(0, 2, 1, 3)

    def forward(self, Q, K, V, mask=None):
        """
        Forward pass for multi-head attention
        """
        batch_size = Q.shape[0]

        # Project the queries, keys, and values
        Q = np.matmul(Q, self.WQ)
        K = np.matmul(K, self.WK)
        V = np.matmul(V, self.WV)

        # Split the heads
        Q = self.split_heads(Q)
        K = self.split_heads(K)
        V = self.split_heads(V)

        # Apply scaled dot-product attention
        attention_output = scaled_dot_product_attention(Q, K, V, mask)

        # Combine the heads
        attention_output = attention_output.transpose(0, 2, 1, 3).reshape(batch_size, -1, self.d_model)

        # Project the output
        output = np.matmul(attention_output, self.WO)

        return output
    
class FeedForward:
    def __init__(self, d_model, d_ff):
        self.W1 = np.random.randn(d_model, d_ff)
        self.W2 = np.random.randn(d_ff, d_model)

    def forward(self, x):
        """
        Forward pass for feed-forward network
        """
        return np.matmul(np.maximum(np.matmul(x, self.W1), 0), self.W2)

class EncoderLayer:
    def __init__(self, d_model, num_heads, d_ff):
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ff = FeedForward(d_model, d_ff)

    def forward(self, x):
        """
        Forward pass for encoder layer
        """
        # Apply multi-head attention
        attention_output = self.mha.forward(x, x, x)

        # Apply feed-forward network
        ff_output = self.ff.forward(attention_output)

        return ff_output

# Transformer
class Transformer:
    def __init__(self, num_layers, d_model, num_heads, d_ff, vocab_size):
        # Initialize the embedding layer
        self.embedding = np.random.randn(vocab_size, d_model)

        # Initialize the encoder layers
        self.layers = [EncoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)]

    def forward(self, x):
        """
        Forward pass for transformer
        """
        x = self.embedding[x]

        # Apply the encoder layers
        for layer in self.layers:
            x = layer.forward(x)

        return x
    
# Example usage
vocab_size = 10000
d_model = 64
num_heads = 4
d_ff = 128
num_layers = 2

# Create a transformer
transformer = Transformer(num_layers, d_model, num_heads, d_ff, vocab_size)

# Create a sample input
input_seq = np.random.randint(0, vocab_size, size=(2, 10)) # 2 sequences of 10 tokens

# Forward pass
output = transformer.forward(input_seq)
print("Input shape:", input_seq.shape)
print("Output shape:", output.shape)






Input shape: (2, 10)
Output shape: (2, 10, 64)


### Simplified Transformer Explanation

### Overview
This script implements a simplified, encoder-only Transformer forward pass in NumPy. The Transformer is a neural network architecture widely used in natural language processing tasks.

### Key Components

#### 1. Softmax Function
* Converts a vector of numbers into a probability distribution.
* In this script it turns attention scores into weights that sum to 1; full models also use it to produce output probability distributions.
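
The property described above can be checked on a small input. This is a minimal sketch, assuming normalization over the last axis; `softmax_last_axis` is a hypothetical name used only for this illustration.

```python
import numpy as np

def softmax_last_axis(x):
    """Numerically stable softmax over the last axis (illustrative helper)."""
    # Subtracting the row maximum leaves the result unchanged but avoids overflow in exp
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])  # a naive exp would overflow on this row
probs = softmax_last_axis(scores)

print(probs)               # each row is a probability distribution
print(probs.sum(axis=-1))  # each row sums to 1
```

Larger scores receive larger probabilities, and equal scores share the probability mass evenly.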

#### 2. Scaled Dot-Product Attention
* Calculates attention between queries (Q), keys (K), and values (V).
* Uses the formula: Attention(Q,K,V) = softmax(QK^T/√d_k)V
* Allows the model to focus on relevant parts of the input sequence.
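
A tiny worked example may make the formula concrete. The sketch below uses hand-picked values (an assumption for illustration, not data from the script) so that the single query is far more similar to the first key than to the second; `attention_weights` is a hypothetical helper name.

```python
import numpy as np

def attention_weights(Q, K):
    """softmax(Q K^T / sqrt(d_k)) over the last axis (illustrative helper)."""
    d_k = K.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

Q = np.array([[10.0, 0.0]])             # one query
K = np.array([[1.0, 0.0], [0.0, 1.0]])  # two keys
V = np.array([[1.0, 2.0], [3.0, 4.0]])  # two values

weights = attention_weights(Q, K)       # shape (1, 2)
output = weights @ V                    # weighted average of the value rows

print(weights)
print(output)
```

Because the query aligns with the first key, almost all of the weight lands on the first value row, and the output is close to `[1, 2]`.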

#### 3. Multi-Head Attention
* Runs multiple attention mechanisms in parallel.
* Allows the model to focus on different parts of the input simultaneously.
* Combines multiple "attention heads" to capture various types of relationships.
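
The head splitting is only a reshape and transpose of the feature dimension. The sketch below mirrors the reshape pattern used by `split_heads` in the script, with small sizes chosen for illustration, and checks that merging the heads restores the original tensor.

```python
import numpy as np

batch_size, seq_len, d_model, num_heads = 2, 5, 8, 4
depth = d_model // num_heads  # features per head

x = np.arange(batch_size * seq_len * d_model, dtype=float)
x = x.reshape(batch_size, seq_len, d_model)

# Split: (batch, seq, d_model) -> (batch, heads, seq, depth)
heads = x.reshape(batch_size, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

# Merge: (batch, heads, seq, depth) -> (batch, seq, d_model)
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

print(heads.shape)   # one (seq, depth) slice per head
print(merged.shape)
```

Each head sees a contiguous slice of `depth` features for every position, so head 1 at position 0 holds features 2 and 3 of the original vector.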

#### 4. Feed-Forward Network
* Processes the outputs from the attention layer.
* Typically consists of two linear transformations with a ReLU activation in between.
* Applies the same transformation to each position separately.
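
"Each position separately" can be verified directly: applying the network to the whole tensor gives the same result as applying it to one position at a time. This is a sketch with small, randomly drawn weights (an assumption for illustration); `feed_forward` is a hypothetical stand-alone function, not the script's `FeedForward` class.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 4, 8
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

def feed_forward(x):
    """Two linear maps with a ReLU in between, applied along the last axis."""
    return np.maximum(x @ W1, 0) @ W2

x = rng.standard_normal((2, 3, d_model))  # (batch, seq, d_model)

full = feed_forward(x)  # all positions at once
per_position = np.array(
    [[feed_forward(x[b, t]) for t in range(3)] for b in range(2)]
)  # one position at a time

print(np.allclose(full, per_position))
```

No information moves between positions in this sub-layer; mixing across positions happens only in attention.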

#### 5. Encoder Layer
* Combines multi-head attention and feed-forward network.
* Full implementations also wrap each sub-layer in a residual connection and layer normalization; this script omits both.
* Forms the basic building block of the Transformer encoder.
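
Residual connections and layer normalization are not part of the script, so the following is only a sketch of what they could look like. It assumes the post-norm arrangement (normalize after adding the residual) and leaves out the learned gain and bias that full implementations usually include; `layer_norm` and `add_and_norm` are hypothetical names.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection around `sublayer`, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 10, 64))  # (batch, seq, d_model)
W = rng.standard_normal((64, 64))     # stand-in for an attention or feed-forward sub-layer

out = add_and_norm(x, lambda h: h @ W)
print(out.shape)
```

In an encoder layer this wrapper would be applied twice: once around multi-head attention and once around the feed-forward network.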

#### 6. Transformer Model
* Combines multiple encoder layers in sequence.
* Adds an embedding layer to convert token indices to dense vectors.
* Full implementations also add positional encoding to carry sequence order information; this script does not.

### Data Flow

#### Input Processing
1. **Token Embedding**: Input token indices are converted into dense vector embeddings.
2. **Positional Encoding**: In full implementations, position information is added to the embeddings to preserve sequence order; this script skips this step.
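
The script itself adds no positional information to the embeddings. One common choice is the fixed sinusoidal encoding from the original Transformer paper, sketched below under the assumption that `d_model` is even; `sinusoidal_positional_encoding` is a hypothetical name, and the random `embeddings` stand in for the output of an embedding lookup.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sin/cos position encoding of shape (seq_len, d_model); d_model must be even."""
    positions = np.arange(seq_len)[:, None]    # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2), the even indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even feature indices
    pe[:, 1::2] = np.cos(angles)  # odd feature indices
    return pe

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((2, 10, 64))  # (batch, seq, d_model)

pe = sinusoidal_positional_encoding(10, 64)
with_positions = embeddings + pe               # broadcasts over the batch dimension

print(pe.shape, with_positions.shape)
```

Without a step like this, self-attention treats the input as an unordered set of tokens.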

#### Encoder Processing
Each encoder layer processes the data through:
1. **Multi-Head Attention**: Calculates attention weights and applies them to input representations.
2. **Feed-Forward Processing**: Applies non-linear transformations to the attention outputs.
3. **Residual Connections**: Adds input to output for better gradient flow (in full implementations).

#### Output Generation
The final output represents an encoded representation of the input sequence, where each position contains contextual information from the entire sequence.

### How It Works

#### Step-by-Step Process
1. **Input Preparation**: The model receives a sequence of token indices as input.
2. **Embedding Conversion**: These indices are converted into dense embedding vectors.
3. **Positional Information**: In full implementations, position encodings are added to preserve order information; this script skips this step.
4. **Layer Processing**: Encoder layers sequentially process the embeddings:
   - Apply self-attention to capture relationships between tokens
   - Apply feed-forward transformations for non-linear processing
5. **Output Generation**: The final output is an encoded representation of the input sequence.

#### Key Mechanisms
* **Self-Attention**: Allows each position to attend to all positions in the input sequence.
* **Parallel Processing**: Unlike RNNs, Transformers can process all positions simultaneously.
* **Context Awareness**: Each output position contains information from the entire input sequence.
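
These mechanisms can be observed in a few lines. The sketch below uses a stripped-down single-head self-attention with Q = K = V = x and no learned projections (an assumption made to keep the example short; `self_attention` is a hypothetical helper). All positions are computed in one matrix product, and perturbing only the last position changes the output at position 0.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with Q = K = V = x (no learned projections)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])               # all position pairs at once
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))  # 4 positions, 8 features each

x_changed = x.copy()
x_changed[3] += 1.0              # perturb only the last position

out = self_attention(x)
out_changed = self_attention(x_changed)

# Position 0 was not modified, yet its output moves because it attends to position 3
shift_at_position_0 = np.abs(out_changed[0] - out[0]).max()
print(shift_at_position_0)
```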

### Important Notes

#### Limitations of This Implementation
* **Simplified Design**: This is a basic model for educational purposes.
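* **Missing Features**: Does not include features found in full implementations, such as:
  - Positional encoding for sequence order
  - Residual connections around each sub-layer
  - Attention masking for decoder applications (the `mask` argument is accepted but ignored)
  - Layer normalization for training stability
  - Dropout for regularization
  - Proper weight initialization
  - Training procedures and optimization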
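
As an illustration of one missing feature, attention masking could be added roughly as follows. This is a sketch, not the script's behavior: it assumes a boolean mask where `True` means "may attend" and replaces blocked scores with a large negative number before the softmax. `masked_attention_weights` is a hypothetical name.

```python
import numpy as np

def masked_attention_weights(scores, mask):
    """Softmax over the last axis, with positions where mask is False blocked."""
    masked = np.where(mask, scores, -1e9)  # blocked scores get ~zero weight
    masked = masked - masked.max(axis=-1, keepdims=True)
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)

seq_len = 4
# Causal mask: position i may attend to positions 0..i only
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.zeros((seq_len, seq_len))  # equal scores, so allowed positions share weight evenly
weights = masked_attention_weights(scores, causal_mask)

print(weights)
```

With equal scores, row `i` spreads its weight uniformly over the first `i + 1` positions and assigns zero to every later position.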

#### Educational Purpose
* **Learning Tool**: Serves as an introduction to fundamental Transformer concepts.
* **Foundation**: Provides a base understanding before exploring more complex implementations.
* **Conceptual Understanding**: Focuses on core mechanisms rather than production-ready features.

#### Real-World Applications
Full Transformer implementations are used in:
* **Language Models**: GPT, BERT, T5
* **Machine Translation**: Neural translation systems such as Google Translate
* **Text Summarization**: Automatic document summarization
* **Question Answering**: Conversational AI systems
* **Code Generation**: Programming assistance tools