# Attention Mechanisms for Neural Networks

This notebook contains PyTorch examples demonstrating attention mechanisms essential for understanding transformers.

## Table of Contents
1. [Scaled Dot-Product Attention](#scaled-dot-product-attention)

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt

## Scaled Dot-Product Attention

**Formula:** $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

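Scaled dot-product attention is the core operation of transformers. Each query is scored against every key, the scores are divided by $\sqrt{d_k}$ and normalized with a softmax, and the resulting weights are used to take a weighted average of the values.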
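
The $\sqrt{d_k}$ divisor in the formula can be motivated with a small experiment. The sketch below assumes queries and keys with independent, zero-mean, unit-variance entries (random Gaussian vectors, not real model activations); under that assumption the dot product of two such vectors has variance $d_k$, so its standard deviation grows like $\sqrt{d_k}$ unless it is rescaled. The names `stats`, `raw`, and `scaled` are illustrative only.

```python
import torch

torch.manual_seed(0)  # fixed seed so repeated runs print the same numbers

stats = {}
for d_k in (4, 64, 1024):
    q = torch.randn(5_000, d_k)
    k = torch.randn(5_000, d_k)
    raw = (q * k).sum(dim=-1)      # 5,000 unscaled dot products
    scaled = raw / d_k ** 0.5      # the same products divided by sqrt(d_k)
    stats[d_k] = (raw.std().item(), scaled.std().item())
    print(f"d_k={d_k:5d}  raw std={stats[d_k][0]:6.2f}  scaled std={stats[d_k][1]:.2f}")
```

The raw standard deviation should track $\sqrt{d_k}$ while the scaled one stays near 1, which keeps the softmax inputs in a range where it does not saturate.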

In [None]:
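def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., len_q, d_k), K: (..., len_k, d_k), V: (..., len_k, d_v).
    mask: optional, broadcastable to (..., len_q, len_k); 0 marks blocked positions.
    """
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)

    if mask is not None:
        # A large negative score receives ~0 weight after the softmax
        scores = scores.masked_fill(mask == 0, -1e9)

    # Normalize over the key dimension so each query's weights sum to 1
    attention_weights = torch.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output, attention_weights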

# Example usage
seq_len, d_model = 10, 64
Q = torch.randn(1, seq_len, d_model)
K = torch.randn(1, seq_len, d_model)
V = torch.randn(1, seq_len, d_model)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"Attention weights sum per query: {weights.sum(dim=-1)}")  # Should be all 1s
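
The example above leaves the `mask` argument unused. As a minimal sketch of how it might be applied, the cell below builds a causal (lower-triangular) mask so that each position attends only to itself and earlier positions. The helper `attention` is a compact restatement of the function above, included only so the cell runs on its own; `causal_mask`, `x`, and the small sizes are illustrative assumptions.

```python
import torch

def attention(Q, K, V, mask=None):
    # Compact restatement of scaled_dot_product_attention (hypothetical helper name)
    scores = Q @ K.transpose(-2, -1) / Q.size(-1) ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, weights

torch.manual_seed(0)
seq_len, d_model = 5, 8
x = torch.randn(1, seq_len, d_model)  # self-attention: Q, K, V all come from x

# 1 = may attend, 0 = blocked, matching the `mask == 0` test in the function
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

out, w = attention(x, x, x, mask=causal_mask)
print(w[0])  # entries above the diagonal are zero
```

The `(seq_len, seq_len)` mask broadcasts over the batch dimension, so one mask serves every sequence in the batch. Each row of `w` still sums to 1 because the softmax renormalizes over the unmasked keys.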