<a href="https://colab.research.google.com/github/shamshers/GenAI/blob/main/Assignment_1_Implementing_a_Basic_Transformer_Model_from_Scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Scenario**
You are a junior AI engineer at a startup focused on building custom language models. Your team lead has asked you to implement a simplified version of a transformer model to understand its core architecture better. This will help the team debug and optimize larger models in the future.

# **Objectives**
- Implement a basic transformer encoder layer using Python and TensorFlow.
- Understand the role of self-attention and feed-forward networks in transformers.
- Validate the implementation by running a forward pass with dummy input.

# **Instructions**
**Set up your environment:** Install PyTorch or TensorFlow and create a new Python script. Ensure you have the necessary libraries (e.g., torch, numpy). Write a function to generate dummy input tensors of shape (batch_size, sequence_length, embedding_dim) to test your model.

**Implement multi-head self-attention:** Create a class for multi-head attention, including query, key, and value linear transformations. Compute scaled dot-product attention and apply softmax to obtain attention weights. Concatenate the outputs of all attention heads.
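
As a framework-independent illustration of this step, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ / √d_k)·V, with the head split and merge. It is not the notebook's TensorFlow class: the helper names are chosen for illustration, and the learned query/key/value projections are omitted, so the same tensor stands in for all three.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (..., seq_len, depth)
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)  # (..., seq_len_q, seq_len_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def split_heads(x, num_heads):
    # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, depth)
    batch, seq_len, embed_dim = x.shape
    depth = embed_dim // num_heads
    return x.reshape(batch, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

def merge_heads(x):
    # (batch, num_heads, seq_len, depth) -> (batch, seq_len, num_heads * depth)
    batch, num_heads, seq_len, depth = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, num_heads * depth)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 16))

# Assumption for brevity: no learned projections, x is used as Q, K and V.
heads = split_heads(x, num_heads=4)
out, weights = scaled_dot_product_attention(heads, heads, heads)
merged = merge_heads(out)

print(merged.shape, weights.shape)  # (2, 5, 16) (2, 4, 5, 5)
```

Each head attends over the full sequence but only sees `embed_dim // num_heads` features, which is why the embedding dimension must be divisible by the number of heads.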

**Build the feed-forward network:** Implement a two-layer MLP with ReLU activation. The hidden layer should have a larger dimension (e.g., 4x the input dimension) as per the original transformer paper.
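
The position-wise feed-forward network can be sketched in a few lines of NumPy. This is only an illustration under assumed values: the weights are random rather than trained, and `feed_forward`, `embed_dim` and the 0.02 initialization scale are choices made for the example.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    # Expand to the hidden dimension, apply ReLU, then project back.
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

embed_dim = 8
hidden_dim = 4 * embed_dim  # 4x expansion, as suggested in the instructions

rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.02, size=(embed_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.02, size=(hidden_dim, embed_dim))
b2 = np.zeros(embed_dim)

x = rng.normal(size=(2, 5, embed_dim))  # (batch, seq_len, embed_dim)
y = feed_forward(x, w1, b1, w2, b2)
print(y.shape)  # (2, 5, 8)
```

The same weights are applied to every position independently, so the network mixes features within a token but never across tokens; cross-token mixing is left to attention.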

**Combine components into an encoder layer:** Integrate the attention mechanism and feed-forward network with layer normalization and residual connections. Ensure the output shape matches the input shape for stacking multiple layers.
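
The "add and norm" pattern around each sub-layer can be sketched as follows. This NumPy version is a simplification: it omits the learnable gain and bias that a real layer normalization carries, and it uses a hypothetical linear map as a stand-in for the attention or feed-forward sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    # Assumption: the learnable scale (gamma) and shift (beta) are left out.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Residual connection followed by layer normalization (post-norm ordering).
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))
w = rng.normal(scale=0.1, size=(8, 8))  # stand-in sub-layer weights

def toy_sublayer(t):
    # Hypothetical sub-layer: a single shape-preserving linear map.
    return t @ w

out = x
for _ in range(3):
    # Because the shape is preserved, the same pattern stacks without adapters.
    out = add_and_norm(out, toy_sublayer)

print(out.shape)  # (2, 5, 8)
```

The residual addition only works if the sub-layer returns a tensor of the same shape as its input, which is the reason the output shape must match the input shape.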

**Test the model:** Perform a forward pass with your dummy input and verify the output dimensions. Print intermediate tensors (e.g., attention weights) to debug if necessary.
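
Two cheap sanity checks on attention weights are that every row sums to 1 and that masked positions receive essentially no weight. The NumPy sketch below assumes the same mask convention as the notebook code (1 marks a blocked position, added to the logits as `mask * -1e9`); the padding mask itself is a made-up example.

```python
import numpy as np

def masked_attention_weights(scores, mask=None):
    # scores: (batch, num_heads, seq_len_q, seq_len_k)
    if mask is not None:
        # Blocked positions (mask == 1) get a very negative logit.
        scores = scores + mask * -1e9
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(1, 1, 4, 4))

# Hypothetical padding mask: the last key position is padding.
# Shape (1, 1, 1, 4) broadcasts over batch, heads and query positions.
mask = np.array([0.0, 0.0, 0.0, 1.0]).reshape(1, 1, 1, 4)

weights = masked_attention_weights(scores, mask)

# Check 1: each query's weights form a probability distribution.
assert np.allclose(weights.sum(axis=-1), 1.0)
# Check 2: the masked key receives (numerically) zero attention.
assert np.all(weights[..., -1] < 1e-6)

print(weights[0, 0].round(3))
```

If either assertion fails on real tensors, the usual suspects are a softmax over the wrong axis or a mask whose shape does not broadcast against the logits.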

# **Evaluation Criteria**
- Correct implementation of self-attention and feed-forward layers.
- Proper handling of residual connections and layer normalization.
- Successful forward pass with matching input/output dimensions.
- Clean, modular, and well-commented code.

# **Resources**
- The Illustrated Transformer
- PyTorch Transformer Documentation


In [9]:
#Set up your environment: Install PyTorch or TensorFlow and create a new Python script.
#Ensure you have the necessary libraries (e.g., torch, numpy).
#Write a function to generate dummy input tensors of shape (batch_size, sequence_length, embedding_dim) to test your model.

import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout, Layer, LayerNormalization
# Keras' built-in MultiHeadAttention is deliberately not imported:
# the custom MultiHeadAttention class defined below would shadow it.


def generate_dummy_input(batch_size: int, sequence_length: int, embedding_dim: int):
    """
    Generates a dummy input tensor with the given shape for transformer model testing in TensorFlow.

    Args:
        batch_size (int): Number of samples in the batch.
        sequence_length (int): Length of each input sequence.
        embedding_dim (int): Dimensionality of each embedding vector.

    Returns:
        tf.Tensor: A random float tensor of shape (batch_size, sequence_length, embedding_dim)
    """
    return tf.random.normal(shape=(batch_size, sequence_length, embedding_dim))

# Example usage
dummy_input = generate_dummy_input(batch_size=4, sequence_length=10, embedding_dim=512)
print(dummy_input.shape)  # Output: (4, 10, 512)


class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert embed_dim % num_heads == 0, "Embedding dimension must be divisible by number of heads"

        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.depth = embed_dim // num_heads  # dimension per head

        # Linear layers for Q, K, V
        self.wq = Dense(embed_dim)
        self.wk = Dense(embed_dim)
        self.wv = Dense(embed_dim)

        # Output projection layer
        self.dense = Dense(embed_dim)

    def split_heads(self, x, batch_size):
        """
        Split the last dimension into (num_heads, depth)
        Transpose the result to shape (batch_size, num_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        """Calculate the attention weights."""
        matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

        # Scale matmul_qk
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

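        # Add the mask if present. Convention: 1 marks positions to block, 0 marks positions to keep,
        # so blocked logits become very negative and receive ~0 weight after the softmax.
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)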

        # Softmax to get attention weights
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

        # Output
        output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
        return output, attention_weights

    def call(self, v, k, q, mask=None):
        # Argument order is value, key, query; self-attention passes the same tensor for all three.
        batch_size = tf.shape(q)[0]

        # Linear projections
        q = self.wq(q)  # (batch_size, seq_len_q, embed_dim)
        k = self.wk(k)  # (batch_size, seq_len_k, embed_dim)
        v = self.wv(v)  # (batch_size, seq_len_v, embed_dim)

        # Split heads
        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        # Scaled dot-product attention
        scaled_attention, attention_weights = self.scaled_dot_product_attention(q, k, v, mask)

        # Concatenate heads
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.embed_dim))  # (batch_size, seq_len_q, embed_dim)

        # Final linear layer
        output = self.dense(concat_attention)  # (batch_size, seq_len_q, embed_dim)

        return output, attention_weights


class TransformerFeedForward(Layer):
    def __init__(self, input_dim, expansion_factor=4, dropout_rate=0.1):
        super(TransformerFeedForward, self).__init__()
        hidden_dim = input_dim * expansion_factor

        self.dense1 = Dense(hidden_dim, activation='relu')  # First layer with ReLU
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dense2 = Dense(input_dim)  # Project back to input_dim
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, training=False):
        x = self.dense1(x)
        x = self.dropout1(x, training=training)
        x = self.dense2(x)
        x = self.dropout2(x, training=training)
        return x


class TransformerEncoderBlock(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_expansion=4, dropout_rate=0.1):
        super(TransformerEncoderBlock, self).__init__()
        self.mha = MultiHeadAttention(embed_dim=embed_dim, num_heads=num_heads)
        self.ffn = TransformerFeedForward(input_dim=embed_dim, expansion_factor=ff_expansion, dropout_rate=dropout_rate)

        self.norm1 = LayerNormalization(epsilon=1e-6)
        self.norm2 = LayerNormalization(epsilon=1e-6)

        self.dropout1 = Dropout(dropout_rate)
        self.dropout2 = Dropout(dropout_rate)

    def call(self, x, training=False, mask=None):
        # Multi-head attention + residual connection + norm
        attn_output, _ = self.mha(x, x, x, mask)              # Self-attention
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.norm1(x + attn_output)                    # Residual connection

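        # Feed-forward network + residual connection + norm
        # Note: self.ffn already applies dropout to its own output, so dropout2 below is a
        # second dropout on the same tensor; remove one of the two if that is not intended.
        ffn_output = self.ffn(out1, training=training)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.norm2(out1 + ffn_output)                  # Residual connection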

        return out2






(4, 10, 512)


In [6]:
# Example dummy input
batch_size = 2
seq_len = 5
embed_dim = 64
num_heads = 8

mha = MultiHeadAttention(embed_dim=embed_dim, num_heads=num_heads)
dummy_input = tf.random.normal((batch_size, seq_len, embed_dim))

output, attn_weights = mha(dummy_input, dummy_input, dummy_input)
print("Output shape:", output.shape)        # (2, 5, 64)
print("Attention weights shape:", attn_weights.shape)  # (2, 8, 5, 5)


Output shape: (2, 5, 64)
Attention weights shape: (2, 8, 5, 5)


In [8]:
# Example input
batch_size = 2
seq_len = 10
embed_dim = 64

ffn = TransformerFeedForward(input_dim=embed_dim)
dummy_input = tf.random.normal((batch_size, seq_len, embed_dim))

output = ffn(dummy_input, training=True)
print("Output shape:", output.shape)  # (2, 10, 64)


Output shape: (2, 10, 64)


In [10]:
# Dummy input
batch_size = 2
seq_len = 10
embed_dim = 64
num_heads = 8

encoder_block = TransformerEncoderBlock(embed_dim=embed_dim, num_heads=num_heads)
dummy_input = tf.random.normal((batch_size, seq_len, embed_dim))

output = encoder_block(dummy_input, training=True)
print("Output shape:", output.shape)  # (2, 10, 64)


Output shape: (2, 10, 64)


In [11]:

# Setup
batch_size = 2
seq_len = 10
embed_dim = 64
num_heads = 8

# Generate input
dummy_input = generate_dummy_input(batch_size, seq_len, embed_dim)
print("Input shape:", dummy_input.shape)

# Instantiate layers
mha = MultiHeadAttention(embed_dim=embed_dim, num_heads=num_heads)
ffn = TransformerFeedForward(input_dim=embed_dim)
encoder_block = TransformerEncoderBlock(embed_dim=embed_dim, num_heads=num_heads)

# Forward pass through attention only (for inspection)
attn_output, attn_weights = mha(dummy_input, dummy_input, dummy_input)
print("Attention output shape:", attn_output.shape)        # (2, 10, 64)
print("Attention weights shape:", attn_weights.shape)      # (2, 8, 10, 10)

# Forward pass through feed-forward network
ffn_output = ffn(attn_output, training=True)
print("FFN output shape:", ffn_output.shape)               # (2, 10, 64)

# Full encoder block pass
encoder_output = encoder_block(dummy_input, training=True)
print("Final encoder output shape:", encoder_output.shape)  # (2, 10, 64)


Input shape: (2, 10, 64)
Attention output shape: (2, 10, 64)
Attention weights shape: (2, 8, 10, 10)
FFN output shape: (2, 10, 64)
Final encoder output shape: (2, 10, 64)
