<center>
    <h1>Self Attention Mechanism</h1>
</center>

# Brief Recap of Self-Attention Mechanism

- Self-attention, also known as intra-attention, is a powerful mechanism in deep learning that allows a model to focus on different parts of its input when processing a sequence. 

- Unlike traditional attention mechanisms that operate between different sequences, self-attention computes relationships within a single sequence. 

- This enables the model to capture long-range dependencies and contextual information more effectively than strictly sequential models, which must pass information step by step through the sequence.

<center>
    <img src="static/image1.gif" alt="Self Attention Mechanism" style="width:50%;">
</center>

## Architecture of Self-Attention Mechanism

- **Query, Key, and Value Matrices:** The input sequence is transformed into three separate representations:
    - Query (Q): Represents the current element's search query
    - Key (K): Represents elements to be matched against
    - Value (V): Represents the actual content to be aggregated

- **Attention Scores:** Computed by taking the dot product of the Query with all Keys, measuring how much focus to place on other parts of the input.

- **Scaling:** The attention scores are divided by the square root of the dimension of the Key vectors to stabilize gradients.

- **Softmax:** Applied to the scaled attention scores to obtain attention weights.

- **Weighted Sum:** The final output is computed by taking a weighted sum of the Value vectors, using the attention weights.

The mathematical representation of self-attention can be expressed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where $d_k$ is the dimension of the Key vectors.
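
The formula can be checked on small arrays. The sketch below is a minimal NumPy version of the same computation; the helper names and the random test matrices are illustrative assumptions, not part of the TensorFlow implementation shown later.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum before exponentiating for numerical stability
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]                      # dimension of the Key vectors
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len) scaled scores
    weights = softmax(scores, axis=-1)     # each row is a distribution over positions
    return weights @ V, weights            # weighted sum of the Value vectors

# Toy sequence of 3 elements with d_k = 4 and 2-dimensional values
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 2))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)            # (3, 2)
print(weights.sum(axis=-1))    # every row sums to 1, up to floating-point error
```

Each row of `weights` describes how strongly one position attends to every position in the sequence, including itself.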


<center>
    <img src="static/image2.webp" alt="Self Attention Mechanism" style="width:50%;">
</center>

## Applications of Self-Attention Mechanism

Self-attention has revolutionized various AI domains:

- **Natural Language Processing (NLP)**
  - Machine Translation
  - Text Summarization
  - Question Answering

- **Computer Vision**
  - Image Classification
  - Object Detection
  - Image Captioning

- **Speech Recognition**

- **Recommendation Systems**

- **Bioinformatics**
  - Protein Structure Prediction
  - DNA Sequence Analysis

- **Time Series Analysis**
  - Financial Forecasting
  - Anomaly Detection

- **Graph Neural Networks**

# Implementing Self Attention Mechanism with TensorFlow

The self-attention mechanism can be implemented as a custom class that is used like any other TensorFlow layer. Let's walk through its design:

## Approach: 1

### Class Definition and Initialization

```python
class SelfAttention(tf.keras.layers.Layer):
    def __init__(self, attention_units):
        super(SelfAttention, self).__init__()
        self.attention_units = attention_units
```

- We define a class `SelfAttention` that inherits from `tf.keras.layers.Layer`.
- The `__init__` method initializes the layer with a specified number of `attention_units`.

### Building the Layer

```python
def build(self, input_shape):
    # Pass `name` as a keyword: newer Keras versions treat the first positional argument as the shape
    self.W_q = self.add_weight(name="W_q", shape=(input_shape[-1], self.attention_units))
    self.W_k = self.add_weight(name="W_k", shape=(input_shape[-1], self.attention_units))
    self.W_v = self.add_weight(name="W_v", shape=(input_shape[-1], self.attention_units))
```

- The `build` method is called once the layer knows its input shape.
- We create three weight matrices: `W_q`, `W_k`, and `W_v` for query, key, and value transformations respectively.
- Each weight matrix transforms the input from its last dimension to `attention_units`.

### Forward Pass

```python
def call(self, inputs):
    q = tf.matmul(inputs, self.W_q)
    k = tf.matmul(inputs, self.W_k)
    v = tf.matmul(inputs, self.W_v)
```

- In the `call` method, we first compute query (q), key (k), and value (v) by multiplying the input with their respective weight matrices.


### Attention Score Computation

```python
attention_scores = tf.matmul(q, k, transpose_b=True)
attention_scores = attention_scores / tf.math.sqrt(tf.cast(self.attention_units, tf.float32))
```

- We compute attention scores by multiplying q and k (transposed).
- The scores are then scaled by dividing by the square root of `attention_units`. This scaling helps stabilize gradients during training.
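
To see why the scaling matters, note that the dot product of two vectors with independent, unit-variance components has a variance that grows with the vector dimension. The following sketch (NumPy, with randomly drawn vectors as an assumed stand-in for real queries and keys) estimates that variance before and after dividing by the square root of the dimension.

```python
import numpy as np

rng = np.random.default_rng(42)
num_pairs = 10000  # number of random (query, key) pairs per dimension

variances = {}
for d_k in (16, 64, 256):
    q = rng.normal(size=(num_pairs, d_k))
    k = rng.normal(size=(num_pairs, d_k))
    raw = np.sum(q * k, axis=-1)        # unscaled dot products
    scaled = raw / np.sqrt(d_k)         # scaled as in the attention formula
    variances[d_k] = (raw.var(), scaled.var())
    print(f"d_k={d_k:3d}  raw variance={raw.var():7.2f}  scaled variance={scaled.var():.2f}")
```

The raw variance comes out close to `d_k`, while the scaled variance stays close to 1. Large unscaled scores would push the softmax toward near one-hot outputs, where its gradients become very small.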


### Attention Weights and Output

```python
attention_weights = tf.nn.softmax(attention_scores, axis=-1)
output = tf.matmul(attention_weights, v)

return output, attention_weights
```

- We apply softmax to get attention weights, which sum to 1 along the last axis.
- The final output is computed by multiplying attention weights with the value (v).
- The method returns both the output and the attention weights.
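
It can help to trace the tensor shapes through these steps. The sketch below mirrors the layer's computation in NumPy on a batch of random inputs; the sizes and the randomly initialized weight matrices are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, input_dim, attention_units = 2, 5, 8, 4

inputs = rng.normal(size=(batch_size, seq_len, input_dim))
W_q = rng.normal(size=(input_dim, attention_units))
W_k = rng.normal(size=(input_dim, attention_units))
W_v = rng.normal(size=(input_dim, attention_units))

# Projections: (batch, seq_len, input_dim) -> (batch, seq_len, attention_units)
q, k, v = inputs @ W_q, inputs @ W_k, inputs @ W_v

# Scores: one (seq_len, seq_len) matrix per batch element
scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(attention_units)

# Softmax over the last axis (shifted by the row maximum for stability)
exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
attention_weights = exp / exp.sum(axis=-1, keepdims=True)

# Output: (batch, seq_len, attention_units)
output = attention_weights @ v
print(q.shape, attention_weights.shape, output.shape)  # (2, 5, 4) (2, 5, 5) (2, 5, 4)
```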

This self-attention mechanism allows the model to weigh the importance of different parts of the input sequence when processing each element. It's particularly useful for capturing long-range dependencies in sequential data. Let's put it all together to see what the final class looks like:

In [1]:
import tensorflow as tf

# Self-Attention Layer
class SelfAttention(tf.keras.layers.Layer):
    
    # Constructor
    def __init__(self, attention_units):
        super(SelfAttention, self).__init__()
        self.attention_units = attention_units
    
    # Build method to define the weights
    def build(self, input_shape):
        # Pass `name` as a keyword: newer Keras versions treat the first positional argument as the shape
        self.W_q = self.add_weight(name="W_q", shape=(input_shape[-1], self.attention_units))
        self.W_k = self.add_weight(name="W_k", shape=(input_shape[-1], self.attention_units))
        self.W_v = self.add_weight(name="W_v", shape=(input_shape[-1], self.attention_units))
    
    # Call method to perform the calculations
    def call(self, inputs):
        # Calculate the query, key, and value matrices
        q = tf.matmul(inputs, self.W_q)
        k = tf.matmul(inputs, self.W_k)
        v = tf.matmul(inputs, self.W_v)
        
        # Calculate the attention scores
        attention_scores = tf.matmul(q, k, transpose_b=True)
        attention_scores = attention_scores / tf.math.sqrt(tf.cast(self.attention_units, tf.float32))
        
        # Apply the softmax activation function
        attention_weights = tf.nn.softmax(attention_scores, axis=-1)
        output = tf.matmul(attention_weights, v)
        
        return output, attention_weights

## Approach: 2

Although the above approach gives you more control over the design of your self-attention mechanism, there is a simpler alternative.

TensorFlow provides built-in layers for implementing attention mechanisms, including self-attention. Here's how you can implement self-attention using Keras APIs:

```python
# Create the Attention layer
attention_layer = tf.keras.layers.Attention(use_scale=True, score_mode='dot')
```

**Key Inputs:**
- query: A tensor of shape (batch_size, Tq, dim), where Tq is the query sequence length.

- value: A tensor of shape (batch_size, Tv, dim), where Tv is the value sequence length.
- key: An optional tensor of shape (batch_size, Tv, dim). If not provided, value is used as the key.

**Output:** 
- The layer returns attention outputs of shape (batch_size, Tq, dim).

- Optionally, it can return attention scores after masking and softmax with shape (batch_size, Tq, Tv).

**Arguments:**
- use_scale (default: False): If True, creates a scalar variable to scale the attention scores.
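- score_mode (default: 'dot'): The function used to compute attention scores. Options are 'dot' (dot product between query and key) and 'concat' (hyperbolic tangent of the concatenation of query and key).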
- dropout (default: 0.0): Dropout rate for attention weights.
- seed (default: None): Random seed for dropout.

For more information about TensorFlow's Attention layer, refer to this link: [TensorFlow Attention Layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention)
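
As a quick check of the shapes listed above, the layer can be called directly on a small tensor. This is a minimal sketch that assumes eager execution and uses random input; passing the same tensor as query and value is what makes it self-attention.

```python
import numpy as np
import tensorflow as tf

batch_size, seq_len, dim = 2, 4, 8
x = tf.constant(
    np.random.default_rng(0).normal(size=(batch_size, seq_len, dim)),
    dtype=tf.float32,
)

attention_layer = tf.keras.layers.Attention()

# Inputs are given as [query, value]; the key defaults to the value
output, scores = attention_layer([x, x], return_attention_scores=True)

print(output.shape)  # (2, 4, 8)  -> (batch_size, Tq, dim)
print(scores.shape)  # (2, 4, 4)  -> (batch_size, Tq, Tv)
```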

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Attention
from tensorflow.keras.models import Model

# Define model parameters
sequence_length = 20  # Length of input sequences
input_dim = 10        # Dimensionality of each element in the sequence
num_classes = 5       # Number of output classes

# Define input shape
input_shape = (sequence_length, input_dim)

# Create inputs
inputs = Input(shape=input_shape)

# Create query, key, and value through dense layers
query = Dense(64, name='query_layer')(inputs)
key = Dense(64, name='key_layer')(inputs)
value = Dense(64, name='value_layer')(inputs)

# Apply attention (the layer expects its inputs in the order [query, value, key])
attention_output = Attention(name='attention_layer')([query, value, key])

# Further processing
x = Dense(32, activation='relu', name='dense_layer')(attention_output)
outputs = Dense(num_classes, activation='softmax', name='output_layer')(x)

# Create the model
model = Model(inputs=inputs, outputs=outputs)

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Print model summary
model.summary()

**Key Points:**

- We create the input layer using Input(shape=input_shape). This layer expects input data with shape (batch_size, sequence_length, input_dim).

- We create separate dense layers for query, key, and value:
    - These layers transform the input into representations suitable for attention mechanism.
    - Each dense layer has 64 units, but you can adjust this based on your needs.

- We apply the attention mechanism using the Attention() layer:
    - This layer takes a list of tensors in the order [query, value, key] and computes the attention output.

- After the attention layer, we add a dense layer with 32 units and ReLU activation for further processing.

- The final output layer uses softmax activation for multi-class classification with num_classes units.

# Let's Build a Real-World Project to Better Understand the Self-Attention Mechanism

# Sentiment Analysis of IMDB Reviews with Attention Mechanism

## Problem Description

We aim to build a sentiment analysis model using self-attention to classify movie reviews as positive or negative. This project will demonstrate the effectiveness of the self-attention mechanism in capturing important features from text data for sentiment classification.

## Dataset Description

- The IMDB dataset consists of 50,000 movie reviews, split evenly into 25,000 training and 25,000 testing samples. 

- Each review is labeled as either positive (1) or negative (0). 

- The Keras version of the dataset is already encoded as sequences of integer word indices. Loading it with `num_words=10000` keeps only the 10,000 most frequent words and replaces rarer ones with an out-of-vocabulary index.

- Reviews vary in length, so they must be padded or truncated to a common length before training.

- For more information about the dataset, refer to this link: [IMDB TF Documentation](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb) || [IMDB Stanford Dataset](https://ai.stanford.edu/%7Eamaas/data/sentiment/)

## Loading and Preprocessing of the dataset

In [1]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import imdb

In [24]:
# Set random seed for reproducibility
tf.random.set_seed(42)
np.random.seed(42)

In [25]:
# Data Loading
vocab_size = 10000
max_length = 200
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)

In [None]:
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
print(f"Shape of training data: {X_train.shape}")
print(f"Shape of testing data: {X_test.shape}")

### Padding the Sequences

This step prepares our text data for input into our neural network. Here's why we do this:

- **Uniform Length**: Neural networks typically require input data to have a consistent shape. However, movie reviews can vary in length.

- **pad_sequences Function**: This Keras utility helps us standardize the length of our sequences.

- **How it Works**:
   - For sequences shorter than `max_length`, it adds padding (usually zeros) at the beginning.
   - For sequences longer than `max_length`, it truncates them, keeping the last `max_length` elements.

- **Preserving the End of the Review**: By keeping the end of longer sequences, we retain the closing part of each review, which often contains the reviewer's overall verdict.

- **Efficiency**: Having fixed-length inputs allows for more efficient batch processing during training.
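
The padding and truncation behavior described above can be illustrated without Keras. The function below is a plain-Python sketch that mimics the default behavior of `pad_sequences` for a single sequence; the name `pad_pre` is hypothetical and not part of any library.

```python
def pad_pre(sequence, max_length, pad_value=0):
    """Pad at the start and truncate from the start, like the pad_sequences defaults."""
    truncated = sequence[-max_length:]                     # keep the last max_length items
    padding = [pad_value] * (max_length - len(truncated))  # zeros to prepend, if any
    return padding + truncated

print(pad_pre([5, 8, 2], max_length=5))               # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], max_length=5))   # [3, 4, 5, 6, 7]
```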

In [5]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

X_train = pad_sequences(X_train, maxlen=max_length)
X_test = pad_sequences(X_test, maxlen=max_length)

Casting the labels to float32 keeps them in the same floating-point type as the model's sigmoid output, which the loss function compares them against.

In [6]:
y_train = y_train.astype(np.float32)
y_test = y_test.astype(np.float32)

In [None]:
print(f"Shape of training data: {X_train.shape}")
print(f"Shape of testing data: {X_test.shape}")

## Model Building

The model in the code below is built from the following layers:
- **Input Layer**: Accepts sequences of `max_length`.

- **Embedding Layer**: Converts word indices to dense vectors. Uses L2 regularization to prevent overfitting.

- **Attention Layer**: Applies self-attention mechanism to focus on important parts of the input.

- **Layer Normalization**: Normalizes the sum of the attention output and the embedding (a residual connection), helping with training stability.

- **Global Average Pooling**: Reduces the sequence dimension to a fixed size.

- **Dropout Layers**: Randomly drops 50% of neurons during training to prevent overfitting.

- **Dense Layers**: Fully connected layers for feature extraction and final prediction. Use L2 regularization and ReLU activation.

- **Output Layer**: Single neuron with sigmoid activation for binary classification.
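
The way these layers reshape the data can be traced with a small NumPy sketch. The arrays below are random stand-ins for the embedding and attention outputs, the sizes are illustrative assumptions, and the learnable scale and offset of Keras's `LayerNormalization` are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, embed_dim = 2, 6, 4

embedding = rng.normal(size=(batch_size, seq_len, embed_dim))
attention = rng.normal(size=(batch_size, seq_len, embed_dim))  # stand-in for the attention output

# Residual connection: add the attention output back to its input
residual = attention + embedding

# Layer normalization: zero mean and unit variance over the feature axis
mean = residual.mean(axis=-1, keepdims=True)
var = residual.var(axis=-1, keepdims=True)
normalized = (residual - mean) / np.sqrt(var + 1e-6)

# Global average pooling: average over the sequence axis
pooled = normalized.mean(axis=1)

print(normalized.shape, pooled.shape)  # (2, 6, 4) (2, 4)
```

After pooling, every review is represented by a single fixed-size vector regardless of sequence length, which is what the dense layers expect.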

In [15]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, Embedding, Dense, GlobalAveragePooling1D, Attention, LayerNormalization, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.regularizers import l2


# Model Building
def create_model(vocab_size, max_length):
    inputs = Input(shape=(max_length,))
    embedding = Embedding(vocab_size, 128, embeddings_regularizer=l2(1e-5))(inputs)
    attention = Attention()([embedding, embedding])
    normalized = LayerNormalization(epsilon=1e-6)(attention + embedding)
    pooled = GlobalAveragePooling1D()(normalized)
    x = Dropout(0.5)(pooled)
    x = Dense(64, activation='relu', kernel_regularizer=l2(1e-4))(x)
    x = Dropout(0.5)(x)
    outputs = Dense(1, activation='sigmoid', kernel_regularizer=l2(1e-4))(x)
    model = Model(inputs=inputs, outputs=outputs)
    return model

The model is then compiled with Adam optimizer, binary crossentropy loss (suitable for binary classification), and accuracy metric.
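
For intuition about the loss, binary cross-entropy for a single prediction is $-[y \log p + (1 - y) \log(1 - p)]$, where $y$ is the label and $p$ the predicted probability. The sketch below is a plain-Python version for one sample; the function name and the clipping constant are assumptions for illustration.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip the prediction so that log(0) can never occur
    p = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# A confident correct prediction is penalized lightly, a confident wrong one heavily
print(round(binary_crossentropy(1.0, 0.9), 4))  # 0.1054
print(round(binary_crossentropy(1.0, 0.1), 4))  # 2.3026
```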

In [None]:
model = create_model(vocab_size, max_length)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

### Callbacks for Training Optimization

These callbacks help optimize the training process:
- Early Stopping:
    - Monitors validation loss
    - Stops training if no improvement for 3 epochs
    - Restores the best weights to prevent overfitting

- ReduceLROnPlateau:
    - Reduces learning rate when progress plateaus
    - Multiplies the learning rate by 0.2 (an 80% reduction) if no improvement for 2 epochs, never going below the minimum of 1e-5
    - Helps the optimizer settle into a minimum once progress at the current learning rate stalls
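
The effect of the reduction factor is easy to work out by hand. The sketch below assumes a starting learning rate of 1e-3 (the default of Keras's Adam optimizer) together with the `factor=0.2` and `min_lr=1e-5` settings used in this notebook; the helper function is hypothetical.

```python
def reduce_lr_once(current_lr, factor=0.2, min_lr=1e-5):
    """Learning rate after one reduction, clamped at min_lr."""
    return max(current_lr * factor, min_lr)

lr = 1e-3  # assumed starting learning rate
schedule = [lr]
for _ in range(3):
    lr = reduce_lr_once(lr)
    schedule.append(lr)

# Roughly 1e-3 -> 2e-4 -> 4e-5 -> 1e-5 (the third reduction is clamped at min_lr)
print(schedule)
```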

In [17]:
# Callbacks
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, min_lr=1e-5)
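
The patience logic of early stopping can be sketched in a few lines of plain Python. The function and the validation losses below are hypothetical and only illustrate the idea of stopping after 3 epochs without improvement and remembering the best epoch.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), 0-indexed; stop_epoch is None if training never stops early."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

val_losses = [0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.39]  # hypothetical values
print(early_stopping_epoch(val_losses))  # (5, 2)
```

In this example training stops at epoch 5, and with `restore_best_weights=True` the weights from epoch 2 would be restored.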

The code below starts the training process for our sentiment analysis model:

- Input Data: Uses X_train (input sequences) and y_train (labels) for training.

- Epochs: Sets a maximum of 20 training cycles through the entire dataset.

- Batch Size: Processes 32 samples at a time, balancing between speed and memory usage.

- Validation Split: Reserves 20% of the training data for validation.

- Callbacks:
    - Applies Early Stopping to prevent overfitting
    - Uses ReduceLROnPlateau to adjust learning rate dynamically

- Verbose Mode: Set to 1 for detailed progress output during training.

The fit() method returns a history object containing training metrics, which can be used later for performance analysis and visualization.
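
The settings above translate into concrete numbers. The arithmetic sketch below assumes the 25,000 training reviews described earlier; Keras takes the validation samples from the end of the training arrays before shuffling.

```python
import math

num_train_samples = 25000
validation_split = 0.2
batch_size = 32

num_val = round(num_train_samples * validation_split)  # samples held out for validation
num_fit = num_train_samples - num_val                  # samples actually used for training
steps_per_epoch = math.ceil(num_fit / batch_size)      # batches per epoch

print(num_fit, num_val, steps_per_epoch)  # 20000 5000 625
```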

In [None]:
# Model Training
history = model.fit(
    X_train, y_train,
    epochs=20,  # Upper bound; early stopping may end training sooner
    batch_size=32,
    validation_split=0.2,
    callbacks=[early_stopping, reduce_lr],
    verbose=1
)

## Model Evaluation and Visualizations

In [None]:
# Model Testing and Evaluation
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {test_accuracy:.4f}")
print(f"Test loss: {test_loss:.4f}")

**Loss Plot:**

- Left subplot shows training and validation loss over epochs.

- Helps identify overfitting (if validation loss increases while training loss decreases).

**Accuracy Plot:**

- Right subplot displays training and validation accuracy over epochs.

- Indicates how well the model is learning and generalizing.

In [None]:
# Loss and Accuracy Plot
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.tight_layout()
plt.show()

The code below visualizes how the model's attention mechanism focuses on different parts of the input:

1. **Extract Layers**: Gets the embedding and attention layers from the trained model.

2. **Sample Input**: Uses the first sample from the test set.

3. **Compute Attention Weights**: Passes the embedded sample through the attention layer with `return_attention_scores=True`, which returns the softmax-normalized weights alongside the layer output.

4. **Visualization**:
   - Creates a heatmap of attention weights.
   - The Y axis is the query position and the X axis is the key position being attended to.
   - Brighter colors indicate higher attention weights.

5. **Interpretation**:
   - Vertical bright lines show words the model focuses on across the sequence.
   - This helps understand which parts of the input are most influential in the model's decision.

In [None]:
# Attention Weights Visualization
# Look the layers up by type; auto-generated names such as 'attention_2' change between runs
embedding_layer = next(layer for layer in model.layers if isinstance(layer, Embedding))
attention_layer = next(layer for layer in model.layers if isinstance(layer, Attention))

# The layer's regular output is the attended sequence, not the weights,
# so re-run it on one sample and ask for the attention scores explicitly
sample_input = X_test[:1]
embedded = embedding_layer(sample_input)
_, attention_weights = attention_layer([embedded, embedded], return_attention_scores=True)
attention_weights = np.asarray(attention_weights)  # shape: (1, max_length, max_length)

plt.figure(figsize=(10, 8))
plt.imshow(attention_weights[0], cmap='viridis')
plt.title('Attention Weights')
plt.xlabel('Key Position')
plt.ylabel('Query Position')
plt.colorbar()
plt.show()

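In [None]:
# Sample Prediction
import re

def predict_sentiment(review_text):
    word_index = imdb.get_word_index()
    # Strip punctuation so that "fantastic!" matches "fantastic"
    words = re.findall(r"[a-z0-9']+", review_text.lower())
    # Mirror imdb.load_data defaults: 1 = start token, 2 = out-of-vocabulary, word ranks shifted by 3
    review_sequence = [1]
    for word in words:
        rank = word_index.get(word)
        index = rank + 3 if rank is not None else 2
        review_sequence.append(index if index < vocab_size else 2)
    review_sequence = pad_sequences([review_sequence], maxlen=max_length)
    prediction = model.predict(review_sequence, verbose=0)[0][0]
    return "Positive" if prediction > 0.5 else "Negative", prediction

sample_review = "This movie was fantastic! The acting was great and the plot was engaging."
sentiment, score = predict_sentiment(sample_review)
print(f"Sample review: {sample_review}")
print(f"Predicted sentiment: {sentiment}")
print(f"Sentiment score: {score:.4f}")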
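
When encoding new text for this model, the indices must follow the same convention as the training data. With its default settings, `imdb.load_data` reserves 0 for padding, 1 for the start of a review, and 2 for out-of-vocabulary words, and shifts every word rank by 3. The sketch below illustrates this with a hypothetical five-word vocabulary instead of the real word index.

```python
# Hypothetical miniature vocabulary: word -> frequency rank (1 = most frequent)
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4, "plot": 5}

PAD, START, OOV = 0, 1, 2   # reserved indices
INDEX_FROM = 3              # offset added to every word rank

def encode(text, word_index, vocab_size):
    ids = [START]
    for word in text.lower().split():
        rank = word_index.get(word)
        index = rank + INDEX_FROM if rank is not None else OOV
        # Indices outside the vocabulary limit also become out-of-vocabulary
        ids.append(index if index < vocab_size else OOV)
    return ids

print(encode("The movie was great", toy_word_index, vocab_size=7))  # [1, 4, 5, 6, 2]
```

Here "great" maps to 7, which falls outside the vocabulary limit of 7 indices (0 to 6), so it is replaced by the out-of-vocabulary index.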