# Lecture 4 Self-Attention

#### 3. Self-Attention in a Simple Transformer Model

This example integrates self-attention into a miniature Transformer model for a simple sequence classification task. We'll use PyTorch's built-in `nn.MultiheadAttention` for efficiency.
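
Before the full model, it may help to see the tensor shapes `nn.MultiheadAttention` works with. The following is a minimal sketch under its default `batch_first=False` layout; the sizes are arbitrary illustrative values rather than ones prescribed by the lecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_length, batch_size, embed_dim, num_heads = 10, 4, 16, 2  # illustrative sizes

# With the default batch_first=False, inputs are (seq_length, batch_size, embed_dim).
attention = nn.MultiheadAttention(embed_dim, num_heads)
tokens = torch.randn(seq_length, batch_size, embed_dim)

# Self-attention: the same tensor is passed as query, key, and value.
attn_output, attn_weights = attention(tokens, tokens, tokens)

print(attn_output.shape)   # same shape as the input
print(attn_weights.shape)  # (batch_size, seq_length, seq_length), averaged over heads
```

The second return value holds the attention weights; each row is a probability distribution over the positions being attended to.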

In [8]:
import torch
import torch.nn as nn
import torch.optim as optim

In [9]:
class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, hidden_dim, num_classes):
        super(SimpleTransformer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Learned positional embedding, initialized to zeros; supports sequences up to length 100
        self.positional_encoding = nn.Parameter(torch.zeros(1, 100, embed_dim))

        self.self_attention = nn.MultiheadAttention(embed_dim, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim)
        )
        self.layer_norm1 = nn.LayerNorm(embed_dim)
        self.layer_norm2 = nn.LayerNorm(embed_dim)

        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        """
        Args:
            x: Input tensor of shape (batch_size, seq_length)
        
        Returns:
            logits: Output tensor of shape (batch_size, num_classes)
        """
        embed = self.embedding(x) + self.positional_encoding[:, :x.size(1), :]
        embed = embed.transpose(0, 1)  # (seq_length, batch_size, embed_dim)

        # Self-Attention
        attn_output, _ = self.self_attention(embed, embed, embed)
        attn_output = self.layer_norm1(embed + attn_output)

        # Feed-Forward Network
        ff_output = self.feed_forward(attn_output)
        ff_output = self.layer_norm2(attn_output + ff_output)

        # Pooling (e.g., take the mean)
        pooled = ff_output.mean(dim=0)  # (batch_size, embed_dim)

        logits = self.classifier(pooled)  # (batch_size, num_classes)
        return logits

#### Example 1: A Good Example

In [24]:
# Example usage
if __name__ == "__main__":
    # Hyperparameters
    vocab_size = 50
    embed_dim = 16
    num_heads = 2
    hidden_dim = 32
    num_classes = 3
    batch_size = 4
    seq_length = 10

    # Sample input: batch of sequences with token indices
    x = torch.randint(0, vocab_size, (batch_size, seq_length))

    # Initialize model, loss, optimizer
    model = SimpleTransformer(vocab_size, embed_dim, num_heads, hidden_dim, num_classes)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Forward pass
    logits = model(x)
    print("Logits:\n", logits)

    # Sample target labels
    targets = torch.tensor([0, 2, 1, 0])

    # Compute loss
    loss = criterion(logits, targets)
    print("Loss:", loss.item())

    # Backward pass and optimization step
    optimizer.zero_grad()  # clear any gradients left over from a previous step
    loss.backward()
    optimizer.step()

Logits:
 tensor([[ 0.0734, -0.2996,  0.4836],
        [-0.0879, -0.4025,  0.5060],
        [-0.1356, -0.2478,  0.7741],
        [-0.1112,  0.0129,  0.4803]], grad_fn=<AddmmBackward0>)
Loss: 1.1979998350143433


**Explanation:**
- Model Components:
    - Embedding Layer: Converts token indices into dense vectors.
    - Positional Encoding: A learned parameter (initialized to zeros) that is added to the embeddings so the model can take token order into account.
    - Multihead Self-Attention: Applies self-attention to allow the model to focus on different parts of the input sequence simultaneously.
    - Feed-Forward Network: Processes the output of the attention mechanism through a two-layer neural network with a ReLU activation.
    - Layer Normalization: Applies normalization to stabilize and accelerate training.
    - Classifier: Maps the processed features to the desired number of output classes.

- Forward Pass:
    - Embedding & Positional Encoding: Combine embeddings with positional information.
    - Self-Attention: Apply multi-head self-attention where the input serves as queries, keys, and values.
    - Residual Connections & Layer Norm: Add residual connections followed by layer normalization to facilitate gradient flow.
    - Feed-Forward Processing: Pass the attention output through the feed-forward network with another residual connection and normalization.
    - Pooling: Aggregate the sequence information by averaging over the sequence length.
    - Classification: Produce logits for each class.

- Training Step:
    - Input Generation: Create random input sequences with token indices.
    - Forward Pass: Compute the logits by passing inputs through the model.
    - Loss Computation: Calculate cross-entropy loss between predictions and target labels.
    - Backward Pass & Optimization: Perform backpropagation and update model parameters using the optimizer.
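
To make the self-attention step of the forward pass more concrete, here is a minimal single-head sketch of scaled dot-product attention written by hand. It is a simplification, not what `nn.MultiheadAttention` does internally in full: the projection matrices `w_q`, `w_k`, and `w_v` are hypothetical random stand-ins for learned weights, and the head splitting, biases, and output projection are omitted.

```python
import math
import torch

torch.manual_seed(0)

batch_size, seq_length, embed_dim = 2, 5, 8  # illustrative sizes

x = torch.randn(batch_size, seq_length, embed_dim)

# Hypothetical projection matrices; a real layer learns these during training.
w_q = torch.randn(embed_dim, embed_dim)
w_k = torch.randn(embed_dim, embed_dim)
w_v = torch.randn(embed_dim, embed_dim)

# In self-attention, queries, keys, and values are all projections of the same input.
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Scaled dot-product scores: (batch_size, seq_length, seq_length).
scores = q @ k.transpose(-2, -1) / math.sqrt(embed_dim)
weights = torch.softmax(scores, dim=-1)  # each row sums to 1

# Each output position is a weighted average of the value vectors.
attended = weights @ v  # (batch_size, seq_length, embed_dim)

print(weights.shape, attended.shape)
```

Because every position attends to every other position, the score matrix grows quadratically with the sequence length.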

**Explanation of Hyperparameters:**
- vocab_size: The number of distinct token indices the embedding layer accepts. Every input value must be an integer between 0 and vocab_size - 1. Example 1 uses 50.
- num_heads: The number of attention heads that operate in parallel inside the multi-head attention layer. embed_dim must be divisible by num_heads, because each head works on embed_dim / num_heads features.
- hidden_dim: The width of the inner layer of the feed-forward network that follows attention. The network projects from embed_dim up to hidden_dim and back down to embed_dim.
- num_classes: The number of possible output labels in the classification task. The classifier produces one logit per class.
- batch_size: The number of sequences processed at once. Example 1 uses 4. In the stock-data example (Example 2), each DataFrame row is one sample and batch_size is set to 5.
- seq_length: The number of tokens in each sequence. Example 1 uses 10. In Example 2 it is the number of DataFrame columns (6 in the DataFrame shown there).
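
The relationships between these hyperparameters can be checked directly. The sketch below uses the values from Example 1. It assumes that `nn.MultiheadAttention` raises an `AssertionError` when `embed_dim` is not divisible by `num_heads`, which is the behavior in the PyTorch versions I am aware of; `ValueError` is caught as well in case that differs.

```python
import torch.nn as nn

embed_dim, num_heads, hidden_dim = 16, 2, 32  # values from Example 1

# Each head attends over a slice of size embed_dim // num_heads.
head_dim = embed_dim // num_heads
print("head_dim:", head_dim)

# An embed_dim that num_heads does not divide evenly is rejected at construction time.
try:
    nn.MultiheadAttention(embed_dim, num_heads=3)
    rejected = False
except (AssertionError, ValueError):
    rejected = True
print("embed_dim=16 with num_heads=3 rejected:", rejected)

# hidden_dim only sets the inner width of the feed-forward block.
feed_forward = nn.Sequential(
    nn.Linear(embed_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, embed_dim),
)
ff_params = sum(p.numel() for p in feed_forward.parameters())
print("feed-forward parameters:", ff_params)
```

Changing hidden_dim changes only the feed-forward parameter count, while changing num_heads leaves the attention layer's parameter count unchanged and only alters how embed_dim is split.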

#### Example 2: A Bad Example

In [51]:
import pandas as pd
import torch
import yfinance as yf
from datetime import datetime
import torch.nn as nn
import torch.optim as optim

# Download stock data
start_date = datetime(2024, 9, 30)
end_date = datetime(2024, 10, 5)
stock_symbol = 'SPY'
stocks = yf.download(stock_symbol, start=start_date, end=end_date)

# Example DataFrame:
#              Open    High     Low   Close  Adj Close    Volume
# Date                                                       
# 2024-09-30  450.0  455.0  445.0  452.0      452.0  1000000
# ... (and so on for 5 trading days; the end date is exclusive)

# Binning the data to fit vocab_size
vocab_size = 6  # Define number of bins
x = torch.tensor(pd.cut(stocks.values.flatten(), bins=vocab_size, labels=False), dtype=torch.long)
x = x.view(stocks.shape)  # Reshape to original DataFrame shape

# Hyperparameters
embed_dim = 16
num_heads = 2
hidden_dim = 3
num_classes = 2
batch_size = 5
seq_length = stocks.shape[1]  # Number of features (e.g., Open, High, Low, etc.)

# Define SimpleTransformer (Example Implementation)
class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, hidden_dim, num_classes):
        super(SimpleTransformer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
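        # Pass hidden_dim through so it is actually used as the feed-forward width
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, dim_feedforward=hidden_dim)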
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)  # Shape: (batch_size, seq_length, embed_dim)
        embedded = embedded.permute(1, 0, 2)  # Transformer expects (seq_length, batch_size, embed_dim)
        transformer_out = self.transformer_encoder(embedded)
        transformer_out = transformer_out.mean(dim=0)  # Aggregate over seq_length
        logits = self.fc(transformer_out)
        return logits

# Initialize model, loss, optimizer
model = SimpleTransformer(vocab_size, embed_dim, num_heads, hidden_dim, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Ensure the batch size matches the data
if x.size(0) < batch_size:
    raise ValueError(f"Batch size {batch_size} exceeds the number of samples {x.size(0)}")

# Select a batch (for simplicity, use the entire data if batch_size matches)
x_batch = x[:batch_size]  # Shape: (batch_size, seq_length)

# Forward pass
logits = model(x_batch)
print("Logits:\n", logits)

# Sample target labels (ensure they match the batch size)
targets = torch.tensor([0, 1, 0, 0, 1], dtype=torch.long)

# Compute loss
loss = criterion(logits, targets)
print("Loss:", loss.item())

# Backward pass and optimization step
optimizer.zero_grad()  # clear any gradients left over from a previous step
loss.backward()
optimizer.step()

[*********************100%%**********************]  1 of 1 completed

Logits:
 tensor([[0.1998, 0.2814],
        [0.2486, 0.3454],
        [0.4948, 0.2524],
        [0.4785, 0.2527],
        [0.4345, 0.2729]], grad_fn=<AddmmBackward0>)
Loss: 0.6647613644599915





**Explanation:**

- Binning:
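    - pd.cut: This function discretizes continuous data into bins. With bins=vocab_size and labels=False, it returns integer bin indices between 0 and vocab_size - 1, which are valid inputs for the embedding layer. Note that the bins are equal-width over the pooled values of every column, so columns on very different scales (for example, prices and Volume) share a single set of bin edges.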
    - Flattening and Reshaping: Since pd.cut operates on a 1D array, flatten the DataFrame values, apply binning, and then reshape back to the original DataFrame shape.

- Model Adjustments:
    - Batch Size Handling: Ensure that batch_size does not exceed the number of available samples. Here the date range covers 5 trading days (September 30 through October 4, 2024), so batch_size=5 uses every row.
    - Transformer Input Shape: By default (batch_first=False), PyTorch's TransformerEncoder expects input of shape (seq_length, batch_size, embed_dim), so permute is used to rearrange the dimensions accordingly.

- Target Labels:
    - Ensure that the targets tensor length matches the batch_size.
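
The effect of binning all columns with one set of bin edges can be seen on a small synthetic DataFrame. This is a sketch only: the values are made up for illustration (no data is downloaded), and the per-column binning is one possible alternative rather than the method used in the lecture.

```python
import pandas as pd

# Synthetic stand-in for the downloaded data (values are made up for illustration).
df = pd.DataFrame({
    "Open":   [450.0, 452.0, 451.0, 455.0, 453.0],
    "Close":  [452.0, 451.0, 454.0, 453.0, 456.0],
    "Volume": [1_000_000, 1_200_000, 900_000, 1_100_000, 1_300_000],
})
vocab_size = 6

# Pooled binning, as in the cell above: one set of bin edges for every column.
pooled = pd.cut(df.values.flatten(), bins=vocab_size, labels=False).reshape(df.shape)

# Per-column binning: each column gets its own equal-width bin edges.
per_column = df.apply(lambda col: pd.cut(col, bins=vocab_size, labels=False))

print("Pooled bins:\n", pooled)
print("Per-column bins:\n", per_column.values)
```

With pooled binning, the price columns collapse into the lowest bin because Volume dominates the value range, so the model sees almost no price information. Per-column binning spreads each column over all bins, at the cost that the same token index then means something different in each column.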