# Deep Dive into Transformer Architecture

**Main Topic:** Transformer Architecture – Self-Attention, Encoder-Decoder Structure, and Positional Encoding  
**Dataset:** IMDB Movie Reviews (for sentiment classification)  
**Framework:** TensorFlow (Keras API)  
**Author:** Your Name  
**Date:** YYYY-MM-DD

---

## Objective

- **Conceptual Understanding:**  
  - **Self-Attention Mechanism:** Learn how queries, keys, and values are computed and combined using attention weights.
  - **Encoder-Decoder Structure:** Understand how the encoder processes the input and how the decoder (if used) generates output. In this notebook, we focus on the encoder side for classification.
  - **Positional Encoding:** Understand the need for, and mathematical formulation of, positional encodings to incorporate sequence order into embeddings.

- **Practical Implementation:**  
  Implement a Transformer block in TensorFlow and integrate it into a sentiment classification model using the IMDB dataset.

- **Performance Analysis:**  
  Evaluate and discuss model performance and design choices.

---

## Table of Contents

1. [Introduction](#Introduction)
2. [Theoretical Background](#Theoretical-Background)
   - Self-Attention Mechanism
   - Encoder-Decoder Structure
   - Positional Encoding
3. [Implementation in TensorFlow](#Implementation)
   - Positional Encoding Layer
   - Transformer Block
4. [Application: IMDB Sentiment Classification](#Application)
5. [Performance Evaluation](#Evaluation)
6. [Conclusion and Key Takeaways](#Conclusion)


## 1. Introduction

Transformers have revolutionized natural language processing by replacing recurrence and convolution with self-attention, which lets the model relate any two tokens directly, regardless of how far apart they sit in the sequence. This notebook provides an in-depth look at key components of the Transformer architecture:
- **Self-Attention:** How every token in a sequence can interact with every other token.
- **Encoder-Decoder Structure:** The core structure used in sequence-to-sequence tasks (with a focus on the encoder for classification).
- **Positional Encoding:** How position information is injected into the model.

We will implement a simplified Transformer block using TensorFlow and apply it to the IMDB sentiment classification task.


## 2. Theoretical Background

### 2.1 Self-Attention Mechanism

Self-attention allows each position in a sequence to attend to all other positions. For an input sequence represented as a matrix \(X \in \mathbb{R}^{n \times d}\) (where \(n\) is the sequence length and \(d\) is the dimensionality), we compute three matrices:
- **Query (\(Q\))**
- **Key (\(K\))**
- **Value (\(V\))**

These are obtained via learned linear transformations:
\[
Q = XW^Q,\quad K = XW^K,\quad V = XW^V
\]
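The attention output is then computed as:
\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
\]
where \(d_k\) is the dimension of the queries/keys. The softmax is applied row-wise, so each row of attention weights sums to 1, and dividing by \(\sqrt{d_k}\) keeps the dot products from growing with the key dimension.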
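
To make the formula concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The toy sizes, the random weights, and the helper names `softmax` and `scaled_dot_product_attention` are illustrative assumptions, not part of the notebook's model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # each (n, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n): every query against every key
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8                           # toy sizes (assumed)
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X, W_q, W_k, W_v)
print(output.shape, weights.shape)            # (4, 8) (4, 4)
```

Each row of `weights` describes how strongly one token attends to every token in the sequence, and the corresponding row of `output` is the weighted average of the value vectors.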

### 2.2 Encoder-Decoder Structure

- **Encoder:** Processes the input sequence and generates a representation that captures contextual information.
- **Decoder:** Uses the encoder’s output along with previous target tokens (shifted right) to generate the output sequence.
  
In this notebook, we focus on the **encoder side** to extract features for classification. The complete encoder-decoder setup is crucial in tasks like machine translation.
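
Although the decoder is not implemented in this notebook, the "shifted right" idea can be sketched in a few lines of NumPy. The token IDs and `START_ID` below are hypothetical, and the causal mask is the standard companion to the shift in the original Transformer decoder rather than something used later in this notebook.

```python
import numpy as np

START_ID = 1                                   # hypothetical start-of-sequence token ID
target = np.array([[7, 8, 9, 10]])             # hypothetical target token IDs, shape (batch, T)

# Decoder input: the target shifted right by one, with the start token in front
decoder_input = np.concatenate(
    [np.full((target.shape[0], 1), START_ID), target[:, :-1]], axis=1
)

# Causal (look-ahead) mask: position t may attend only to positions <= t
T = target.shape[1]
causal_mask = np.tril(np.ones((T, T), dtype=bool))

print(decoder_input)                           # [[1 7 8 9]]
print(causal_mask.astype(int))
```

With this arrangement, the decoder predicts `target[:, t]` from `decoder_input[:, :t + 1]`, so it never sees the token it is asked to produce.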

### 2.3 Positional Encoding

Since transformers process all tokens in parallel, we add positional encodings to the input embeddings to retain information about the position of tokens. A common formula for positional encoding is:
\[
PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right),\quad
PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
\]
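where \(pos\) is the token position, \(i\) indexes the sine/cosine pairs (so \(2i\) and \(2i+1\) are the even and odd embedding dimensions), and \(d\) is the embedding dimension. These encodings are added to the input embeddings to provide positional context.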
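
As a worked example of the formula, the NumPy sketch below builds the encoding matrix directly and checks two properties. The function name `sinusoidal_encoding` and the toy sizes are assumptions for illustration, and the sketch assumes an even \(d\).

```python
import numpy as np

def sinusoidal_encoding(num_positions, d):
    pos = np.arange(num_positions)[:, None]          # (num_positions, 1)
    i = np.arange(d // 2)[None, :]                   # pair index i, shape (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d)      # pos / 10000^(2i/d)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions 2i+1
    return pe

pe = sinusoidal_encoding(num_positions=50, d=8)
print(pe[0])                                         # position 0: sin(0)=0 and cos(0)=1 alternate

# The dot product of two encodings depends only on their offset k, not on the absolute position
k = 5
print(np.allclose(pe[3] @ pe[3 + k], pe[20] @ pe[20 + k]))  # True
```

The second check follows from the identity \(\sin a \sin b + \cos a \cos b = \cos(a - b)\), which is one reason sinusoidal encodings make relative offsets easy for attention to pick up.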

---

## 3. Implementation in TensorFlow

In this section, we build a custom Transformer block in TensorFlow, including the self-attention and feed-forward network components, along with a positional encoding layer.

### 3.1 Importing Libraries


In [1]:
# Import necessary libraries

import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("TensorFlow version:", tf.__version__)


TensorFlow version: 2.18.0


### Interpretation:
This cell imports TensorFlow along with necessary modules such as Keras layers and models. We also import NumPy, Matplotlib, and Seaborn for numerical operations and visualizations. We print the TensorFlow version to ensure compatibility.


### 3.2 Positional Encoding Layer

The following cell defines a custom layer for positional encoding.


In [2]:
# Define Positional Encoding Layer

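class PositionalEncoding(layers.Layer):
    def __init__(self, sequence_len, d_model, **kwargs):
        super(PositionalEncoding, self).__init__(**kwargs)
        # Keep the constructor arguments so get_config() can return them
        self.sequence_len = sequence_len
        self.d_model = d_model
        self.pos_encoding = self.positional_encoding(sequence_len, d_model)

    def get_config(self):
        # Return constructor arguments (not the tensor) so the layer can be re-created
        config = super(PositionalEncoding, self).get_config()
        config.update({
            "sequence_len": self.sequence_len,
            "d_model": self.d_model,
        })
        return config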

    def positional_encoding(self, position, d_model):
        angle_rads = self.get_angles(np.arange(position)[:, np.newaxis],
                                     np.arange(d_model)[np.newaxis, :],
                                     d_model)
        # apply sin to even indices; cos to odd indices
        angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
        angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
        pos_encoding = angle_rads[np.newaxis, ...]
        return tf.cast(pos_encoding, dtype=tf.float32)

    def get_angles(self, pos, i, d_model):
        angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
        return pos * angle_rates

    def call(self, inputs):
        return inputs + self.pos_encoding[:, :tf.shape(inputs)[1], :]

# Test the positional encoding layer
sample_pos_encoding = PositionalEncoding(sequence_len=50, d_model=512)
dummy_inputs = tf.zeros((1, 50, 512))
encoded = sample_pos_encoding(dummy_inputs)
print("Positional encoding shape:", encoded.shape)


Positional encoding shape: (1, 50, 512)


### Detailed Explanation:
- **`class PositionalEncoding(layers.Layer):`**  
  We define a custom Keras layer for positional encoding by subclassing `layers.Layer`.

- **`def __init__(self, sequence_len, d_model):`**  
  The constructor takes:
  - `sequence_len`: The maximum length of input sequences.
  - `d_model`: The embedding dimension.
  
- **`self.pos_encoding = self.positional_encoding(sequence_len, d_model)`**  
  Computes the positional encodings immediately during initialization and stores them.

- **`def positional_encoding(self, position, d_model):`**  
  This method calculates the positional encodings:
  - **`angle_rads = self.get_angles(...):`**  
    Computes a matrix of angles for each position and dimension.
  - **`angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])` and `angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])`:**  
    Applies sine to even indices and cosine to odd indices.
  - **`pos_encoding = angle_rads[np.newaxis, ...]`:**  
    Adds a new axis to match the batch dimension.
  - **`tf.cast(pos_encoding, dtype=tf.float32)`:**  
    Casts the encoding to TensorFlow’s float32 type.

- **`def get_angles(self, pos, i, d_model):`**  
  This helper function calculates the angle rates used in the sinusoidal functions:
  - **`angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))`:**  
    Computes the scaling factor for each dimension.
  - **`return pos * angle_rates`:**  
    Multiplies the position indices by the angle rates to obtain the final angles.

- **`def call(self, inputs):`**  
  This method is called when the layer is applied. It adds the precomputed positional encoding to the input tensor, aligning with the input’s sequence length.

- **Testing the layer:**  
  We create a `PositionalEncoding` instance with a sequence length of 50 and embedding dimension of 512, then pass a dummy tensor of zeros to verify the output shape.
  
- **`print("Positional encoding shape:", encoded.shape)`**  
  Ensures that the output has the expected shape: `(batch_size, sequence_length, d_model)`.


### 3.3 Multi-Head Self-Attention and Transformer Block

Next, we implement a simplified Transformer block that includes multi-head self-attention and a feed-forward network.


In [3]:
# Define a Basic Transformer Block

class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = models.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training, mask=None):
        # Self-Attention block
        attn_output = self.att(inputs, inputs, inputs, attention_mask=mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        # Feed-Forward Network block
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

# Test the Transformer block
sample_transformer = TransformerBlock(embed_dim=64, num_heads=4, ff_dim=128)
dummy_data = tf.random.uniform((1, 10, 64))  # batch_size=1, sequence_length=10, embed_dim=64
output_data = sample_transformer(dummy_data, training=False)
print("Transformer block output shape:", output_data.shape)


Transformer block output shape: (1, 10, 64)


### Detailed Explanation:
- **`class TransformerBlock(layers.Layer):`**  
  We define a custom Transformer block as a subclass of `layers.Layer`.

- **`def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):`**  
  The constructor takes:
  - `embed_dim`: The dimension of the embeddings.
  - `num_heads`: The number of attention heads for multi-head attention.
  - `ff_dim`: The hidden dimension of the feed-forward network.
  - `rate`: Dropout rate for regularization.
  
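- **`self.att = layers.MultiHeadAttention(...)`:**  
  Instantiates the multi-head self-attention layer, which computes attention weights for each token with respect to every token in the sequence. Note that `key_dim` is the size of each head's query/key projection, so `key_dim=embed_dim` gives every head a full `embed_dim`-wide projection; the original Transformer uses `embed_dim // num_heads` per head instead.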
  
- **`self.ffn = models.Sequential([...])`:**  
  Defines a simple feed-forward network (FFN) consisting of:
  - A Dense layer with ReLU activation and `ff_dim` units.
  - A Dense layer with `embed_dim` units to project back to the original embedding size.
  
- **`self.layernorm1` and `self.layernorm2`:**  
  Two layer normalization layers that help stabilize and speed up training. Each one is applied after the residual addition, so it normalizes the output of its sub-layer (the post-norm arrangement).
  
- **`self.dropout1` and `self.dropout2`:**  
  Two dropout layers that randomly zero out some elements during training to prevent overfitting.
  
- **`def call(self, inputs, training, mask=None):`**  
  The `call` method defines how the layer processes inputs:
  - **Self-Attention:**  
    `attn_output = self.att(inputs, inputs, inputs, attention_mask=mask)`  
    Here, the same input tensor is used as query, key, and value.
  
  - **Dropout and Residual Connection:**  
    The attention output is passed through dropout and then added to the original input (residual connection), followed by layer normalization.
  
  - **Feed-Forward Network:**  
    The normalized output `out1` is processed through the FFN, followed by another dropout and residual addition with `out1`, then normalized again.
  
- **Testing the Block:**  
  We create a `TransformerBlock` instance with a small embedding dimension (64) and 4 attention heads. We generate dummy data with shape `(1, 10, 64)` (batch size 1, sequence length 10) and pass it through the block. Finally, we print the output shape to confirm that it remains `(1, 10, 64)`.
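
The residual-then-normalize wiring described above can be sketched in plain NumPy. This is a simplified illustration, not the Keras implementation: the learned scale and shift of layer normalization and the dropout layers are omitted, the attention output is a random stand-in, and the helper names are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector over its feature axis (learned scale/shift omitted)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise FFN: expand to ff_dim with ReLU, then project back to embed_dim
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, embed_dim, ff_dim = 10, 64, 128
x = rng.normal(size=(seq_len, embed_dim))
attn_output = rng.normal(size=(seq_len, embed_dim))   # stand-in for the attention sub-layer
W1, b1 = rng.normal(size=(embed_dim, ff_dim)) * 0.1, np.zeros(ff_dim)
W2, b2 = rng.normal(size=(ff_dim, embed_dim)) * 0.1, np.zeros(embed_dim)

out1 = layer_norm(x + attn_output)                              # residual, then normalize
out2 = layer_norm(out1 + feed_forward(out1, W1, b1, W2, b2))    # same pattern around the FFN
print(out2.shape)                                               # (10, 64)
```

Because both sub-layers preserve the `(seq_len, embed_dim)` shape, blocks of this kind can be stacked without any reshaping in between.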


## 4. Application: IMDB Sentiment Classification

We will now use the IMDB movie review dataset to demonstrate the Transformer block as a feature extractor in a text classification model.


In [5]:
# Data Loading and Preprocessing

from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence

# Set parameters for the IMDB dataset
max_features = 5000  # Vocabulary size
maxlen = 200         # Maximum length of reviews
embedding_dim = 64   # Embedding dimension

# Load the dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences to the same length
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

print("Training data shape:", x_train.shape)
print("Test data shape:", x_test.shape)


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
Training data shape: (25000, 200)
Test data shape: (25000, 200)


### Detailed Explanation:
- **`from tensorflow.keras.datasets import imdb` and `from tensorflow.keras.preprocessing import sequence`:**  
  Import the IMDB dataset and helper function for padding sequences.
  
- **`max_features = 5000`:**  
  Limit the vocabulary to the 5000 most common words.
  
- **`maxlen = 200`:**  
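  Set the maximum review length to 200 tokens. Reviews shorter than this are padded with zeros, and longer reviews are truncated. By default `pad_sequences` does both at the start of the sequence, so the end of each review is kept.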
  
- **`embedding_dim = 64`:**  
  Define the dimension of the word embeddings to be 64.
  
- **`imdb.load_data(num_words=max_features)`:**  
  Loads the IMDB dataset, keeping only the 5000 most frequent words; rarer words are replaced by a single out-of-vocabulary index.
  
- **`sequence.pad_sequences(...)`:**  
  Pads all reviews to ensure they are exactly 200 tokens long. This uniformity is required for batch processing in neural networks.
  
- **`print("Training data shape:", x_train.shape)` and similar for test data:**  
  Prints the shapes of the training and testing datasets to confirm that the data is loaded and preprocessed correctly. Typically, the shape will be `(num_samples, maxlen)`.
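
To illustrate what padding and truncation do to individual reviews, the sketch below mimics the default "pad and truncate at the start" behavior in plain NumPy and derives a padding mask from the result. The helper `pad_pre` and the token IDs are hypothetical. The notebook's model does not use such a mask; it is shown only as something that could be supplied through the Transformer block's `mask` argument.

```python
import numpy as np

def pad_pre(seq, maxlen, value=0):
    # Keep the last `maxlen` tokens, then left-pad with `value` up to `maxlen`
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short_review = [1, 14, 22, 16]                 # hypothetical token IDs (4 tokens)
long_review = list(range(1, 11))               # hypothetical token IDs (10 tokens)
batch = np.array([pad_pre(short_review, 6), pad_pre(long_review, 6)])
print(batch)
# [[ 0  0  1 14 22 16]
#  [ 5  6  7  8  9 10]]

# Boolean mask that is True where the key position holds a real token (ID 0 is padding)
key_is_real = batch != 0                                   # (batch, seq_len)
seq_len = batch.shape[1]
attention_mask = np.broadcast_to(
    key_is_real[:, np.newaxis, :], (batch.shape[0], seq_len, seq_len)
)
print(attention_mask.shape)                                # (2, 6, 6)
```

Without a mask, padded positions take part in attention and in the later average pooling like any other token, which is a simplification worth keeping in mind when interpreting the results.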


### 4.1 Building the Classification Model

We now build a model that uses an embedding layer, the positional encoding layer, our custom Transformer block, and a classification head.


In [6]:
# Build the Transformer-based Classification Model

from tensorflow.keras import Input, Model

def create_model(max_features, maxlen, embedding_dim):
    inputs = Input(shape=(maxlen,))
    # Embedding layer converts token IDs to embeddings
    x = layers.Embedding(input_dim=max_features, output_dim=embedding_dim)(inputs)
    # Add positional encoding to embeddings
    x = PositionalEncoding(sequence_len=maxlen, d_model=embedding_dim)(x)
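    # Apply Transformer block. Leave `training` unset (None) rather than hard-coding True,
    # so Keras enables dropout in fit() and disables it in evaluate()/predict()
    x = TransformerBlock(embed_dim=embedding_dim, num_heads=4, ff_dim=128)(x, training=None)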
    # Global average pooling to collapse sequence dimension
    x = layers.GlobalAveragePooling1D()(x)
    # Fully connected layer for classification
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)

    model = Model(inputs=inputs, outputs=outputs)
    return model

model = create_model(max_features=max_features, maxlen=maxlen, embedding_dim=embedding_dim)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()


### Detailed Explanation:
- **`from tensorflow.keras import Input, Model`:**  
  Import the necessary classes to build our Keras model.

- **`def create_model(max_features, maxlen, embedding_dim):`**  
  Define a function to create our model. This function takes in parameters for vocabulary size, sequence length, and embedding dimension.
  
- **`inputs = Input(shape=(maxlen,))`:**  
  Defines an input layer that expects sequences of length `maxlen`.
  
- **`layers.Embedding(...)`:**  
  Converts the input word indices into dense embedding vectors of dimension `embedding_dim`. This layer maps each word index to a vector.
  
- **`PositionalEncoding(...)`:**  
  Adds positional encodings to the word embeddings, ensuring that the model is aware of the token order.
  
- **`TransformerBlock(...)`:**  
  Applies our custom Transformer block to the encoded embeddings. This block performs multi-head self-attention and a feed-forward operation to extract contextual features.
  
- **`layers.GlobalAveragePooling1D()`:**  
  Reduces the sequence dimension by averaging across all token representations. This results in a fixed-size vector for each sample.
  
- **`layers.Dense(64, activation="relu")`:**  
  Adds a dense layer with 64 units and ReLU activation to further process the features.
  
- **`layers.Dense(1, activation="sigmoid")`:**  
  The final dense layer outputs a single probability value (using sigmoid activation) for binary classification (sentiment positive or negative).
  
- **Model Compilation:**  
  The model is compiled with:
  - **Optimizer:** Adam
  - **Loss Function:** Binary cross-entropy (appropriate for binary classification)
  - **Metrics:** Accuracy
- **`model.summary()`:**  
  Prints the model summary to show the layer configuration, output shapes, and the number of parameters.
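
Since the summary output is not reproduced here, a back-of-the-envelope parameter count can serve as a cross-check. The sketch below is plain arithmetic under stated assumptions: the attention layer is assumed to use four biased projections (query, key, value, output) with `num_heads * key_dim` inner width, and the positional encoding is assumed to add no trainable weights. Compare the total against what `model.summary()` reports on your setup.

```python
vocab_size, embed_dim = 5000, 64
num_heads, key_dim, ff_dim = 4, 64, 128

embedding = vocab_size * embed_dim

# Assumed layout: Q, K, V map embed_dim -> num_heads * key_dim (with bias),
# and the output projection maps num_heads * key_dim -> embed_dim (with bias)
qkv = 3 * (embed_dim * num_heads * key_dim + num_heads * key_dim)
attn_out = num_heads * key_dim * embed_dim + embed_dim
attention = qkv + attn_out

ffn = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
layer_norms = 2 * (2 * embed_dim)              # gamma and beta for each LayerNormalization

head = (embed_dim * 64 + 64) + (64 * 1 + 1)    # Dense(64) followed by Dense(1)

total = embedding + attention + ffn + layer_norms + head
print(f"embedding={embedding:,} attention={attention:,} ffn={ffn:,} total={total:,}")
# embedding=320,000 attention=66,368 ffn=16,576 total=407,425
```

Under these assumptions the embedding table accounts for most of the parameters, and the attention layer is larger than it would be with `key_dim = embed_dim // num_heads`.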


### 4.2 Training the Model

We now train the model on the IMDB dataset.


In [7]:
# Train the Model

history = model.fit(
    x_train, y_train,
    epochs=3,                # Set to 3 for demonstration; increase for better performance.
    batch_size=64,
    validation_split=0.2
)


Epoch 1/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 265s 827ms/step - accuracy: 0.5788 - loss: 0.6498 - val_accuracy: 0.8536 - val_loss: 0.3415
Epoch 2/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 247s 788ms/step - accuracy: 0.8778 - loss: 0.2903 - val_accuracy: 0.8742 - val_loss: 0.2996
Epoch 3/3
313/313 ━━━━━━━━━━━━━━━━━━━━ 272s 820ms/step - accuracy: 0.9191 - loss: 0.2127 - val_accuracy: 0.8704 - val_loss: 0.3184


### Detailed Explanation:
- **`model.fit(...)`:**  
  Trains the model using the preprocessed IMDB training data (`x_train` and `y_train`).
  
- **Parameters Explained:**
  - **`epochs=3`:**  
    The number of complete passes through the training dataset. Three epochs are set here for a quick demonstration. In practice, you might use more epochs.
  - **`batch_size=64`:**  
    The number of samples processed before the model's weights are updated. A batch size of 64 balances training speed and stability.
  - **`validation_split=0.2`:**  
    Reserves the last 20% of the training samples (5,000 of the 25,000 reviews) for validation, taken before any shuffling, to monitor the model’s performance on unseen data during training.
  
- **`history`:**  
  The variable `history` stores training metrics (loss and accuracy for both training and validation sets) over the epochs.
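
One simple use of these metrics is to pick the epoch with the lowest validation loss. The sketch below hard-codes the validation values from the training log in this section so that it runs on its own; in the notebook itself the same lists would come from `history.history["val_loss"]` and `history.history["val_accuracy"]`.

```python
# Validation metrics copied from the three-epoch training log in this section
val_loss = [0.3415, 0.2996, 0.3184]
val_accuracy = [0.8536, 0.8742, 0.8704]

# Index of the epoch with the lowest validation loss
best = min(range(len(val_loss)), key=val_loss.__getitem__)
print(f"Lowest val_loss at epoch {best + 1}: {val_loss[best]} "
      f"(val_accuracy {val_accuracy[best]})")
# Lowest val_loss at epoch 2: 0.2996 (val_accuracy 0.8742)
```

Validation loss rising again in epoch 3 while training loss keeps falling is an early sign of overfitting, so simply training for more epochs may not help without additional regularization or early stopping.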


### 4.3 Evaluating Model Performance


In [8]:
# Evaluate the Model

loss, accuracy = model.evaluate(x_test, y_test)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)


782/782 ━━━━━━━━━━━━━━━━━━━━ 98s 126ms/step - accuracy: 0.8679 - loss: 0.3161
Test Loss: 0.3135252594947815
Test Accuracy: 0.8703200221061707


### Detailed Explanation:
- **`model.evaluate(x_test, y_test)`:**  
  Evaluates the trained model on the test dataset. This function returns the loss and the metrics (here, accuracy) computed on the test set.
  
- **Storing and Printing Results:**
  - **`loss, accuracy = ...`:**  
    Unpacks the returned loss and accuracy values into separate variables.
  - **`print("Test Loss:", loss)` and `print("Test Accuracy:", accuracy)`:**  
    Prints the test loss and test accuracy, which provide an indication of how well the model generalizes to unseen data.
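
For intuition about what the two reported numbers measure, here is a small NumPy sketch that computes binary cross-entropy and accuracy by hand. The labels and probabilities are made-up values, and the 0.5 decision threshold is the usual default for binary accuracy; this is not the Keras implementation itself.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])                    # hypothetical labels
y_prob = np.array([0.92, 0.15, 0.40, 0.77, 0.61])     # hypothetical sigmoid outputs

eps = 1e-7                                            # clip to avoid log(0)
p = np.clip(y_prob, eps, 1 - eps)
bce = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_pred = (y_prob > 0.5).astype(int)                   # threshold probabilities at 0.5
accuracy = np.mean(y_pred == y_true)

print(f"loss={bce:.4f} accuracy={accuracy:.2f}")
```

Accuracy only counts which side of the threshold a prediction falls on, whereas the loss also penalizes confident mistakes, so the two can move in different directions between epochs.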


## 5. Performance Evaluation

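The model's performance on the IMDB sentiment classification task using a basic Transformer block is summarized by:
- **Training and Validation Accuracy/Loss:** Over three epochs, the reported training accuracy rose from 0.5788 to 0.9191. Validation accuracy peaked at 0.8742 in epoch 2, and validation loss rose from 0.2996 to 0.3184 in epoch 3, an early sign of overfitting.
- **Test Accuracy:** 0.8703 (test loss 0.3135), close to the final validation accuracy of 0.8704, which suggests the validation split is a reasonable proxy for generalization.

The implementation demonstrates that even a single Transformer block, combined with learned embeddings and positional encoding, reaches roughly 87% test accuracy on this task after three epochs.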

---

## 6. Conclusion and Key Takeaways

- **Transformer Fundamentals:**  
  - **Self-Attention:** Enables the model to capture relationships between tokens regardless of their distance.
  - **Encoder-Decoder Architecture:** While we demonstrated only the encoder for classification, the full architecture is essential for tasks like translation.
  - **Positional Encoding:** Critical for injecting sequential order into the model since transformers do not inherently capture position.

- **Implementation Insights:**  
  - We implemented a custom Transformer block in TensorFlow using Keras.
  - The Transformer block was integrated into a sentiment classification pipeline on the IMDB dataset.
  - Even with a basic setup, the Transformer-based model learns meaningful representations for text classification.

- **Practical Relevance:**  
  - Transformers are the backbone of modern LLMs (e.g., GPT, BERT).  
  - Understanding their components is essential for working on advanced NLP tasks and research.


