# Attention Mechanisms in Deep Learning

## Introduction

In this notebook, we will:

- **Introduce** the concept of Attention Mechanisms and understand why they are pivotal in modern deep learning models.
- **Explore** the various types of attention, including Soft Attention, Hard Attention, and Self-Attention.
- **Implement** key attention mechanisms such as Scaled Dot-Product Attention and Additive Attention using PyTorch.
- **Provide** resources for further reading to deepen your understanding.

**Resources for Further Reading:**

- [Attention Mechanism Explained](https://towardsdatascience.com/attention-mechanism-explained-8f96b26ebae)
- [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)

**Prerequisites:**

- Familiarity with Python and PyTorch.
- Understanding of neural network fundamentals, especially sequence models like RNNs.

**Note:** Attention mechanisms are central to modern Natural Language Processing (NLP) and are the core building block of architectures like Transformers. They let a model weight the relevant parts of the input sequence directly, which addresses limitations inherent in traditional RNNs.

## Why Attention? Overcoming Limitations of RNNs

Recurrent Neural Networks (RNNs) are powerful for modeling sequential data, but they have notable limitations:

- **Long-Term Dependencies:** RNNs struggle to capture dependencies between distant elements in a sequence due to issues like vanishing and exploding gradients.
- **Sequential Processing:** RNNs process data sequentially, making it difficult to parallelize computations.
- **Fixed-Size Context:** In an RNN encoder-decoder, the whole input is compressed into a fixed-size hidden state, which acts as a bottleneck and limits how much information from a long input can be retained.

**Attention Mechanisms** address these challenges by allowing models to dynamically focus on different parts of the input sequence when producing each element of the output. This leads to better performance, especially in tasks requiring the integration of information from various parts of the input.

## Types of Attention

### 1. Soft vs. Hard Attention

- **Soft Attention:**
  - **Deterministic and Differentiable:** Allows the model to consider all parts of the input with varying degrees of importance.
  - **Weighted Sum:** Computes a weighted average of the input features.
  - **Backpropagation-Friendly:** Can be trained end-to-end using gradient-based optimization.

- **Hard Attention:**
  - **Stochastic and Non-Differentiable:** Selects specific parts of the input, often requiring reinforcement learning techniques for training.
  - **Discrete Selection:** Chooses exact elements or regions to focus on.
  - **Less Common:** Due to training difficulties, hard attention is less frequently used in practice.
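
The contrast between the two can be sketched on a toy example. The tensors `features` and `scores` below are hypothetical stand-ins, not part of the models built later in this notebook: soft attention takes a softmax-weighted average over all positions, while hard attention samples a single position from the same distribution.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy inputs: 4 input positions, each with a 3-dimensional feature vector.
features = torch.randn(4, 3)
scores = torch.tensor([2.0, 0.5, -1.0, 0.1])  # unnormalized relevance scores

# Soft attention: every position contributes, weighted by softmax probabilities.
weights = F.softmax(scores, dim=-1)           # shape (4,), sums to 1
soft_context = weights @ features             # shape (3,), differentiable w.r.t. scores

# Hard attention: sample one position from the same distribution.
index = torch.multinomial(weights, num_samples=1).item()
hard_context = features[index]                # shape (3,), a discrete choice

print("weights:", weights)
print("soft context:", soft_context)
print("hard attention picked position", index)
```

The sampling step has no gradient with respect to `scores`, which is why hard attention is usually trained with reinforcement learning style estimators rather than plain backpropagation.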

### 2. Self-Attention

- **Definition:** A mechanism where different positions of a single sequence are related to each other to compute a representation of the sequence.
- **Usage:** Fundamental to Transformer architectures, enabling the model to capture dependencies regardless of their distance in the sequence.
- **Benefits:**
  - **Parallelization:** Unlike RNNs, self-attention allows for parallel processing of sequence elements.
  - **Long-Range Dependencies:** Effectively captures relationships between distant elements in the sequence.

## Implementation

We will implement two primary attention mechanisms:

1. **Scaled Dot-Product Attention**
2. **Additive Attention**

Both implementations will be in PyTorch.

### 1. Scaled Dot-Product Attention

**Overview:**

Scaled Dot-Product Attention computes the attention weights using the dot product of queries and keys, scales them, applies a softmax to obtain probabilities, and then uses these to weight the values.

**Formula:**

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

where:
- \( Q \) = Query matrix
- \( K \) = Key matrix
- \( V \) = Value matrix
- \( d_k \) = Dimension of the keys
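
The formula does not say why the scores are divided by \( \sqrt{d_k} \). A common explanation is that dot products of vectors with unit-variance components have variance close to \( d_k \), and large scores push the softmax into a region with very small gradients. The sketch below checks the variance claim empirically; the sample size and `d_k = 64` are arbitrary choices for illustration.

```python
import math
import torch

torch.manual_seed(0)

d_k = 64
n = 10_000  # number of random query/key pairs (an arbitrary choice)

# Queries and keys with independent, unit-variance components.
q = torch.randn(n, d_k)
k = torch.randn(n, d_k)

raw = (q * k).sum(dim=-1)        # unscaled dot products
scaled = raw / math.sqrt(d_k)    # scaled as in the formula above

print("variance of raw scores:   ", raw.var().item())     # roughly d_k
print("variance of scaled scores:", scaled.var().item())  # roughly 1
```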

**Implementation:**


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k):
        super(ScaledDotProductAttention, self).__init__()
        self.d_k = d_k

    def forward(self, Q, K, V, mask=None):
        """
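        Q: Queries, shape (..., seq_len_q, d_k)
        K: Keys, shape (..., seq_len_k, d_k)
        V: Values, shape (..., seq_len_k, d_v); one value per key
        mask: Optional tensor broadcastable to (..., seq_len_q, seq_len_k);
              positions where mask == 0 are excluded from attention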
        """
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)  # (..., seq_len_q, seq_len_k)
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        attn = F.softmax(scores, dim=-1)  # (..., seq_len_q, seq_len_k)
        output = torch.matmul(attn, V)     # (..., seq_len_q, d_v)
        return output, attn


**Example Usage:**

In [2]:
# Example parameters
batch_size = 2
seq_len_q = 3
seq_len_k = 4
d_k = 5
d_v = 6

# Random tensors for Q, K, V
Q = torch.randn(batch_size, seq_len_q, d_k)
K = torch.randn(batch_size, seq_len_k, d_k)
V = torch.randn(batch_size, seq_len_k, d_v)

# Initialize attention module
attention = ScaledDotProductAttention(d_k)

# Forward pass
output, attn_weights = attention(Q, K, V)

print("Output shape:", output.shape)          # Expected: (2, 3, 6)
print("Attention weights shape:", attn_weights.shape)  # Expected: (2, 3, 4)

Output shape: torch.Size([2, 3, 6])
Attention weights shape: torch.Size([2, 3, 4])
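
The example above does not exercise the `mask` argument. As a minimal sketch, the cell below re-declares the attention computation as a standalone function (`masked_attention` is a hypothetical helper, written so the cell runs on its own) and applies a causal mask, where position `i` may only attend to positions `0..i`.

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(Q, K, V, mask=None):
    """Scaled dot-product attention; positions where mask == 0 are excluded."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    attn = F.softmax(scores, dim=-1)
    return attn @ V, attn

torch.manual_seed(0)
seq_len, d_model = 4, 8
x = torch.randn(1, seq_len, d_model)

# Lower-triangular mask: row i has ones in columns 0..i and zeros afterwards.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

_, attn = masked_attention(x, x, x, mask=causal_mask)
print(attn[0])  # entries above the diagonal are zero
```

The mask has shape `(seq_len, seq_len)` and broadcasts over the batch dimension; each row of `attn` still sums to 1 over the positions that remain visible.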


### 2. Additive Attention

**Overview:**

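Additive Attention, introduced by Bahdanau et al., scores each query-key pair with a small feed-forward network: the query and key are linearly projected, summed, passed through a non-linear activation (usually \( \tanh \)), and reduced to a scalar. Summing the two projections is equivalent to applying a single linear layer to the concatenation of the query and key.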

**Formula:**

$$
\text{Attention}(Q, K, V) = \text{softmax}(\text{score}(Q, K))V
$$

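where the entry of \( \text{score}(Q, K) \) for query \( \mathbf{q}_i \) and key \( \mathbf{k}_j \) is:

$$
\text{score}(\mathbf{q}_i, \mathbf{k}_j) = \mathbf{v}^T \tanh(\mathbf{W}_q \mathbf{q}_i + \mathbf{W}_k \mathbf{k}_j)
$$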
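
The score formula sums two separate projections, while additive attention is often described in terms of a concatenation of query and key. The sketch below, under the assumption of bias-free layers for a clean comparison, checks that the two views agree: a single linear layer over the concatenated vector, with weights copied from `W_q` and `W_k`, gives the same pre-activation. `W_cat` is a hypothetical name used only here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_q, d_k, d_attn = 5, 5, 10

# Separate projections, as in the score formula (no bias, for a clean comparison).
W_q = nn.Linear(d_q, d_attn, bias=False)
W_k = nn.Linear(d_k, d_attn, bias=False)

# One projection over the concatenation [q; k], with weights copied from W_q and W_k.
W_cat = nn.Linear(d_q + d_k, d_attn, bias=False)
with torch.no_grad():
    W_cat.weight.copy_(torch.cat([W_q.weight, W_k.weight], dim=1))

q = torch.randn(d_q)
k = torch.randn(d_k)

summed = W_q(q) + W_k(k)
concatenated = W_cat(torch.cat([q, k]))
print(torch.allclose(summed, concatenated, atol=1e-5))
```

Keeping the projections separate is convenient in practice because each query and key is projected once and the sum is formed by broadcasting, rather than materializing every concatenated pair.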

**Implementation:**

In [3]:
class AdditiveAttention(nn.Module):
    def __init__(self, d_q, d_k, d_attn):
        super(AdditiveAttention, self).__init__()
        self.W_q = nn.Linear(d_q, d_attn)
        self.W_k = nn.Linear(d_k, d_attn)
        self.v = nn.Linear(d_attn, 1, bias=False)

    def forward(self, Q, K, V, mask=None):
        """
        Q: Queries shape (batch_size, seq_len_q, d_q)
        K: Keys shape (batch_size, seq_len_k, d_k)
        V: Values shape (batch_size, seq_len_k, d_v)
        mask: Optional mask tensor
        """
        # Expand Q and K for addition
        # Q: (batch_size, seq_len_q, 1, d_q)
        # K: (batch_size, 1, seq_len_k, d_k)
        Q_expanded = Q.unsqueeze(2)
        K_expanded = K.unsqueeze(1)
        
        # Apply linear layers and activation
        energy = torch.tanh(self.W_q(Q_expanded) + self.W_k(K_expanded))  # (batch_size, seq_len_q, seq_len_k, d_attn)
        scores = self.v(energy).squeeze(-1)  # (batch_size, seq_len_q, seq_len_k)
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        attn = F.softmax(scores, dim=-1)  # (batch_size, seq_len_q, seq_len_k)
        output = torch.matmul(attn, V)     # (batch_size, seq_len_q, d_v)
        return output, attn


**Example Usage:**

In [4]:
# Example parameters
batch_size = 2
seq_len_q = 3
seq_len_k = 4
d_q = 5
d_k = 5
d_v = 6
d_attn = 10

# Random tensors for Q, K, V
Q = torch.randn(batch_size, seq_len_q, d_q)
K = torch.randn(batch_size, seq_len_k, d_k)
V = torch.randn(batch_size, seq_len_k, d_v)

# Initialize attention module
additive_attention = AdditiveAttention(d_q, d_k, d_attn)

# Forward pass
output, attn_weights = additive_attention(Q, K, V)

print("Output shape:", output.shape)          # Expected: (2, 3, 6)
print("Attention weights shape:", attn_weights.shape)  # Expected: (2, 3, 4)

Output shape: torch.Size([2, 3, 6])
Attention weights shape: torch.Size([2, 3, 4])


## Self-Attention

**Overview:**

Self-Attention allows a sequence to interact with itself (i.e., different positions within the same sequence) to compute a representation of the sequence. This mechanism is pivotal in Transformer models.

**Key Components:**

- **Queries, Keys, Values:** Derived from the same input sequence.
- **Multi-Head Attention:** Extends self-attention by running several attention heads in parallel, each on its own learned projection of the input, so the model can attend to different representation subspaces. The implementation below is multi-head.

**Implementation Example:**

In [5]:
class SelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super(SelfAttention, self).__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        
        self.num_heads = num_heads
        self.embed_dim = embed_dim
        self.head_dim = embed_dim // num_heads
        
        self.q_linear = nn.Linear(embed_dim, embed_dim)
        self.k_linear = nn.Linear(embed_dim, embed_dim)
        self.v_linear = nn.Linear(embed_dim, embed_dim)
        self.out_linear = nn.Linear(embed_dim, embed_dim)
        
        self.attention = ScaledDotProductAttention(self.head_dim)
        
    def forward(self, x, mask=None):
        batch_size, seq_len, embed_dim = x.size()
        
        # Linear projections
        Q = self.q_linear(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1,2)  # (batch, heads, seq_len, head_dim)
        K = self.k_linear(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1,2)
        V = self.v_linear(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1,2)
        
        # Apply attention on all the projected vectors in batch
        attn_output, attn_weights = self.attention(Q, K, V, mask=mask)
        
        # Concatenate heads
        attn_output = attn_output.transpose(1,2).contiguous().view(batch_size, seq_len, embed_dim)
        
        # Final linear layer
        output = self.out_linear(attn_output)
        return output, attn_weights

**Example Usage:**

In [6]:
# Example parameters
batch_size = 2
seq_len = 5
embed_dim = 16
num_heads = 4

# Random input tensor
x = torch.randn(batch_size, seq_len, embed_dim)

# Initialize self-attention module
self_attention = SelfAttention(embed_dim, num_heads)

# Forward pass
output, attn_weights = self_attention(x)

print("Output shape:", output.shape)            # Expected: (2, 5, 16)
print("Attention weights shape:", attn_weights.shape)  # Expected: (2, 4, 5, 5)

Output shape: torch.Size([2, 5, 16])
Attention weights shape: torch.Size([2, 4, 5, 5])
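
As a cross-check on the shapes, PyTorch's built-in `nn.MultiheadAttention` can be run on an input of the same size. This is a sketch, not a drop-in replacement for the class above: the built-in layer has its own parameter layout and initialization, and the `average_attn_weights` argument assumes a reasonably recent PyTorch version (1.11 or later).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, embed_dim, num_heads = 2, 5, 16, 4
x = torch.randn(batch_size, seq_len, embed_dim)

# batch_first=True makes the layer accept (batch, seq_len, embed_dim) inputs.
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Passing x as query, key, and value makes this self-attention.
# average_attn_weights=False keeps one weight matrix per head.
output, weights = mha(x, x, x, need_weights=True, average_attn_weights=False)

print("Output shape:", output.shape)
print("Attention weights shape:", weights.shape)
```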


## Practical Example: Implementing Attention in a Sequence-to-Sequence Model

To solidify our understanding, let's implement a simple sequence-to-sequence (Seq2Seq) model with attention for a translation task. We'll use the Scaled Dot-Product Attention mechanism.

**Note:** This is a simplified example for educational purposes.

### 1. Preparing the Data

For demonstration, we'll use a handful of dummy sentence pairs and encode them at the character level. In practice, you'd use a real parallel corpus, such as a dataset of English-French sentence pairs.

In [7]:
# Sample data: pairs of sentences
source_sentences = [
    "hello",
    "how are you",
    "good morning",
    "good night",
    "thank you"
]

target_sentences = [
    "bonjour",
    "comment ça va",
    "bonjour",
    "bonne nuit",
    "merci"
]

# Create character-level vocabularies (the space character is included)
source_vocab = sorted(set(" ".join(source_sentences)))
target_vocab = sorted(set(" ".join(target_sentences)))

source_char2idx = { ch:i for i,ch in enumerate(source_vocab) }
source_idx2char = { i:ch for i,ch in enumerate(source_vocab) }

target_char2idx = { ch:i for i,ch in enumerate(target_vocab) }
target_idx2char = { i:ch for i,ch in enumerate(target_vocab) }

# Convert sentences to indices
def encode_sentence(sentence, char2idx):
    return [char2idx[ch] for ch in sentence]

encoded_sources = [encode_sentence(s, source_char2idx) for s in source_sentences]
encoded_targets = [encode_sentence(s, target_char2idx) for s in target_sentences]

print("Encoded Sources:", encoded_sources)
print("Encoded Targets:", encoded_targets)

Encoded Sources: [[5, 3, 8, 8, 11], [5, 11, 15, 0, 1, 12, 3, 0, 16, 11, 14], [4, 11, 11, 2, 0, 9, 11, 12, 10, 6, 10, 4], [4, 11, 11, 2, 0, 10, 6, 4, 5, 13], [13, 5, 1, 10, 7, 0, 16, 11, 14]]
Encoded Targets: [[2, 9, 8, 6, 9, 12, 10], [3, 9, 7, 7, 4, 8, 11, 0, 14, 1, 0, 13, 1], [2, 9, 8, 6, 9, 12, 10], [2, 9, 8, 8, 4, 0, 8, 12, 5, 11], [7, 4, 10, 3, 5]]
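
The encoded sentences have different lengths, so they cannot be stacked into one tensor as they are. A minimal sketch of one way to batch them follows, assuming an extra padding index (`PAD_IDX`, a hypothetical addition) is reserved just past the 17-character source vocabulary. It also builds a padding mask in the `mask == 0` convention used by the attention modules above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two of the encoded source sentences from the output above ("hello", "thank you").
encoded = [
    [5, 3, 8, 8, 11],
    [13, 5, 1, 10, 7, 0, 16, 11, 14],
]

# Assumption: reserve one extra index past the 17-character source vocabulary for padding.
PAD_IDX = 17

batch = pad_sequence(
    [torch.tensor(seq) for seq in encoded],
    batch_first=True,
    padding_value=PAD_IDX,
)  # (batch, max_len)

# True where a real character is present, False at padded positions.
# Reshaped to (batch, 1, 1, max_len) so it broadcasts over heads and query positions.
padding_mask = (batch != PAD_IDX).unsqueeze(1).unsqueeze(2)

print(batch)
print(padding_mask.shape)
```

With this scheme the embedding layer would need `len(source_vocab) + 1` entries so that the padding index is valid.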



### 2. Defining the Encoder

The encoder processes the input sequence and produces key and value vectors for attention.


In [8]:
class Encoder(nn.Module):
    def __init__(self, input_dim, embed_dim, hidden_dim, num_heads):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(input_dim, embed_dim)
        self.self_attention = SelfAttention(embed_dim, num_heads)
        self.fc = nn.Linear(embed_dim, hidden_dim)
        self.relu = nn.ReLU()
        
    def forward(self, x, mask=None):
        embedded = self.embedding(x)  # (batch, seq_len, embed_dim)
        attn_output, attn_weights = self.self_attention(embedded, mask)
        output = self.relu(self.fc(attn_output))
        return output, attn_weights

### 3. Defining the Decoder

The decoder generates the output sequence, attending to the encoder's outputs.

In [9]:
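class Decoder(nn.Module):
    def __init__(self, output_dim, embed_dim, hidden_dim, num_heads):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(output_dim, embed_dim)
        self.self_attention = SelfAttention(embed_dim, num_heads)
        # Project decoder states to hidden_dim so queries match the encoder's key dimension
        self.query_proj = nn.Linear(embed_dim, hidden_dim)
        self.encoder_attention = ScaledDotProductAttention(hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, encoder_output, self_mask=None, cross_mask=None):
        """
        x: Target token indices, shape (batch, tgt_len)
        encoder_output: Encoder states, shape (batch, src_len, hidden_dim)
        self_mask: Optional mask for decoder self-attention (e.g. causal),
                   broadcastable to (batch, heads, tgt_len, tgt_len)
        cross_mask: Optional mask over source positions,
                    broadcastable to (batch, tgt_len, src_len)
        """
        embedded = self.embedding(x)  # (batch, tgt_len, embed_dim)
        dec_attn_output, _ = self.self_attention(embedded, self_mask)

        # Cross-attention: decoder states query the encoder outputs
        Q = self.query_proj(dec_attn_output)  # (batch, tgt_len, hidden_dim)
        K = encoder_output
        V = encoder_output
        attn_output, attn_weights = self.encoder_attention(Q, K, V, cross_mask)

        # Return raw logits; CrossEntropyLoss applies log-softmax itself
        logits = self.fc(attn_output)  # (batch, tgt_len, output_dim)
        return logits, attn_weights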

### 4. Training the Model

Due to the simplicity of our data, we'll skip the training loop. In practice, you'd define a loss function (e.g., CrossEntropyLoss), an optimizer, and iterate over epochs to train the model.
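
Although the full loop is skipped here, the pattern described above can be sketched on a stand-in model. Everything in this cell is an assumption for illustration: the `<pad>` and `<sos>` indices are hypothetical additions to the 15-character target vocabulary, and the tiny `nn.Sequential` model is not the Encoder/Decoder defined above. It only predicts the next character from the current one, which is enough to show teacher forcing, a padding-aware loss, and the optimizer step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes: 15 target characters plus <pad> and <sos> tokens.
PAD_IDX, SOS_IDX, vocab_size = 15, 16, 17

# "merci" and "bonjour" from the encoded targets, prefixed with <sos> and padded.
targets = torch.tensor([
    [SOS_IDX, 7, 4, 10, 3, 5, PAD_IDX, PAD_IDX],
    [SOS_IDX, 2, 9, 8, 6, 9, 12, 10],
])

# Stand-in model (NOT the Encoder/Decoder above): maps each token to next-token logits.
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)  # padded labels do not contribute
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

decoder_input = targets[:, :-1]  # teacher forcing: feed the gold prefix
labels = targets[:, 1:]          # predict the next character at every step

losses = []
for epoch in range(50):
    optimizer.zero_grad()
    logits = model(decoder_input)  # (batch, steps, vocab_size)
    loss = criterion(logits.reshape(-1, vocab_size), labels.reshape(-1))
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

For a real model, `model(decoder_input)` would be replaced by an encoder pass over the source batch followed by a decoder pass that attends to the encoder output.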

## Analysis of Attention Mechanisms

Attention mechanisms allow models to dynamically focus on relevant parts of the input, enhancing performance in tasks like machine translation, text summarization, and more. They alleviate the limitations of RNNs by:

- **Capturing Long-Range Dependencies:** By directly connecting any two positions in the input, regardless of their distance.
- **Improving Parallelization:** Especially in Transformer models, attention enables parallel processing of sequence elements.
- **Enhancing Interpretability:** Attention weights can provide insights into which parts of the input the model focuses on during prediction.
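
As a small sketch of the interpretability point, an attention matrix can be inspected row by row to see which input position receives the most weight for each output position. The scores below are random and the character sequences are hypothetical, so the printed pattern carries no linguistic meaning; with a trained model, `attn` would be the weights returned by an attention module.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical character sequences; the scores are random, so the pattern is not meaningful.
source_chars = list("hello")
target_chars = list("bonjour")

scores = torch.randn(len(target_chars), len(source_chars))
attn = F.softmax(scores, dim=-1)  # each row is a distribution over source positions

# For every target position, report the source character with the largest weight.
top = attn.argmax(dim=-1)
for t, (ch, s) in enumerate(zip(target_chars, top.tolist())):
    print(f"target[{t}]={ch!r} attends most to source[{s}]={source_chars[s]!r} "
          f"(weight {attn[t, s].item():.2f})")
```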

## Further Steps

- **Explore Multi-Head Attention:** Understand how multiple attention heads can capture diverse aspects of the input.
- **Implement Transformer Models:** Dive deeper into Transformer architectures, which rely heavily on attention mechanisms.
- **Experiment with Different Attention Types:** Implement and compare soft attention, hard attention, and self-attention in various tasks.
- **Visualize Attention Weights:** Gain insights by visualizing where the model is focusing its attention during predictions.

**Remember:** Attention mechanisms are foundational to many state-of-the-art models in NLP and beyond. Mastering them will significantly enhance your ability to design and understand complex neural network architectures.
