# Reading Notes: "Attention Is All You Need"
Author: Muyang Han
Date: 2025-12-21
Status: Step 1 completed (rough read, grasped motivation and core problem)
Date: 2025-12-21
Status: Step 2 completed (detailed read, understood architecture and mechanisms)

## Overview
This notebook records my notes and understanding of the Transformer paper. 

**Progress:**
- ✅ Step 1: Skimmed the paper, understood motivation and core problem
- ✅ Step 2: Completed detailed reading, analyzed architecture components
- 🔄 Step 3: Code implementation (planned)

## Why Transformer
Before the Transformer, sequence modeling relied mainly on RNNs and CNNs, both of which have clear downsides. In particular, an RNN computes its hidden states one step at a time, and this inherently sequential computation hinders parallelization. The Transformer instead abandons recurrence and relies only on the attention mechanism, which allows significantly more parallelization, more efficient training, and better performance.

## Transformer Architecture

### Encoder Block
1. Split the input into tokens and map each token to a vector with an embedding layer.
2. Add positional encoding to the embeddings to get the input matrix X.
3. Multiply X by W_Q, W_K, and W_V to get the Q, K, and V matrices. W_Q, W_K, and W_V are not single matrices: each head has its own independent set, so we get one Q, K, and V per head.
4. Each head applies Scaled Dot-Product Attention to produce a context-aware representation. The outputs of all heads are concatenated and passed through a linear layer.
5. Apply a residual connection and layer normalization (Add & Norm) to improve training stability.
6. Pass the result through a Feed Forward Network.
7. Apply a second Add & Norm.
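Steps 3 to 5 above can be sketched in plain Python so that the arithmetic stays visible. This is only an illustration under assumptions of my own: `random_matrix` is a hypothetical stand-in for the learned weights W_Q, W_K, W_V and the output projection, the layer norm has no learned scale or shift, and the sizes (3 tokens, d_model = 8, 2 heads) are arbitrary.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for a learned weight matrix."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, rng):
    """Step 3 and 4: per-head projections, attention, concat, output projection."""
    d_model = len(x[0])
    head_dim = d_model // heads
    head_outputs = []
    for _ in range(heads):
        # Each head gets its own independent W_Q, W_K, W_V.
        w_q = random_matrix(d_model, head_dim, rng)
        w_k = random_matrix(d_model, head_dim, rng)
        w_v = random_matrix(d_model, head_dim, rng)
        head_outputs.append(
            scaled_dot_product_attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        )
    # Concatenate the heads along the feature dimension, token by token.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    w_o = random_matrix(heads * head_dim, d_model, rng)
    return matmul(concat, w_o)

def add_and_norm(x, sublayer_out, eps=1e-5):
    """Step 5: residual connection, then layer norm without learned scale/shift."""
    result = []
    for x_row, s_row in zip(x, sublayer_out):
        summed = [a + b for a, b in zip(x_row, s_row)]
        mean = sum(summed) / len(summed)
        var = sum((v - mean) ** 2 for v in summed) / len(summed)
        result.append([(v - mean) / math.sqrt(var + eps) for v in summed])
    return result

rng = random.Random(0)
x = random_matrix(3, 8, rng)  # 3 tokens, d_model = 8 (arbitrary sizes)
y = add_and_norm(x, multi_head_attention(x, heads=2, rng=rng))
print(len(y), len(y[0]))  # 3 8
```

The output has the same shape as X, which is what lets encoder blocks be stacked on top of each other.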

### Decoder Block
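1. The decoder follows the same steps as the Encoder Block, with the differences below.
2. Instead of plain Multi-Head Attention, the first attention sublayer is a masked Multi-Head Attention, so each position can attend only to itself and earlier positions.
3. After Add & Norm, the result provides Q for the second Multi-Head Attention block, whose K and V come from the encoder output.
4. Apply Add & Norm, then a Feed Forward Network followed by another Add & Norm.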
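To make the masking in step 2 concrete, here is a small plain-Python sketch. The helper names `causal_mask` and `masked_softmax` are my own, and the all-zero score matrix is a made-up input chosen so the effect of the mask is easy to read off.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax where disallowed positions are set to -inf first."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s if keep else float("-inf") for s, keep in zip(score_row, mask_row)]
        peak = max(masked)  # the diagonal is always allowed, so peak is finite
        exps = [math.exp(v - peak) for v in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 3 for _ in range(3)]  # equal raw scores for every pair
for row in masked_softmax(scores, causal_mask(3)):
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

Future positions receive exactly zero weight, so the prediction for position i cannot depend on tokens after i.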

Finally, the decoder output is sent to a linear layer and then converted into probabilities with the softmax function.
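As a rough sketch of this last step, the snippet below projects one decoder output vector onto a tiny vocabulary and applies softmax. The weights in `w_vocab`, the sizes (d_model = 2, vocabulary of 3), and the greedy argmax at the end are assumptions for illustration only.

```python
import math

def output_probabilities(hidden, w_vocab):
    """Linear projection to vocabulary logits, followed by softmax."""
    logits = [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w_vocab)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [1.0, 0.0]                            # one decoder output vector
w_vocab = [[2.0, 0.0, -2.0], [0.0, 1.0, 0.0]]  # made-up d_model x vocab weights
probs = output_probabilities(hidden, w_vocab)

# Greedy choice of the next token (an assumption; other decoding strategies exist).
next_token = max(range(len(probs)), key=probs.__getitem__)
print(next_token)  # 0
```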

In [None]:
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert self.head_dim * heads == embed_size, "embed_size needs to be divisible by heads"