# Building Blocks of Language Models
In this notebook, we will explore:
- <b>Tokenization</b>: Breaking text into tokens
- <b>Embeddings</b>: Converting tokens to numerical vectors
- <b>Positional Encoding</b>: Adding position information
- <b>Self-Attention</b>: Using all of the above to understand relationships
- <b>Prediction</b>: Generating the next token using the model's output

## Tokenization
- The model can’t read words directly; it only understands numbers. So we need to convert text → tokens → numbers.
- tokenize() splits into sub-word pieces.
- encode() converts them into numeric IDs.
- decode() reverses the process.
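
As a rough illustration of the text → tokens → numbers round trip, the sketch below uses a tiny hand-made vocabulary. The `toy_vocab` mapping and the `toy_*` helper names are hypothetical; a real tokenizer such as GPT-2's learns its vocabulary from data and works on sub-word pieces.

```python
# Toy vocabulary (hypothetical, for illustration only): token string -> integer ID
toy_vocab = {"<unk>": 0, "small": 1, "language": 2, "models": 3, "are": 4, "efficient": 5}
id_to_token = {i: t for t, i in toy_vocab.items()}

def toy_tokenize(text):
    # Lowercase and split on whitespace; real tokenizers use sub-word pieces
    return text.lower().split()

def toy_encode(text):
    # Map each token to its ID, falling back to <unk> for unknown tokens
    return [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in toy_tokenize(text)]

def toy_decode(ids):
    # Map IDs back to tokens and join them with spaces
    return " ".join(id_to_token[i] for i in ids)

ids = toy_encode("Small language models are fast")
print(ids)              # [1, 2, 3, 4, 0]
print(toy_decode(ids))  # small language models are <unk>
```

The word "fast" is not in the toy vocabulary, so it collapses to `<unk>` and cannot be recovered on decoding. Sub-word tokenizers avoid this by splitting unknown words into smaller known pieces.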

In [None]:
from transformers import AutoTokenizer

# Load a small open model's tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Small Language Models are powerful and efficient!"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print("Original text:", text)
print("Tokens:", tokens)
print("Token IDs:", ids)
print("Decoded back:", tokenizer.decode(ids))

### A Simple Tokenizer Implementation (Toy Tokenizer)

In [None]:
import re

def basic_tokenizer(text):
    # Lowercase + split on spaces & punctuation
    text = text.lower()
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return tokens

sample_text = "Small Language Models are powerful and efficient!"
tokens = basic_tokenizer(sample_text)
print(tokens)
    

### Real GPT-2 BPE tokenizer
1. GPT-2 uses subword tokenization (Byte Pair Encoding, BPE) to handle rare words efficiently.
2. Tokens are not always whole words; BPE splits or merges based on frequency.
3. The tokenizer’s vocabulary is fixed, built before model training.
   When the corpus is extremely large, tokenizer training is often done on a subset of it (e.g., 1–10% of the data) to save time.
   The subset must be representative of all text the model will encounter.
   The vocabulary generated is then fixed and used throughout model training.
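
The merge idea behind BPE can be sketched in a few lines. This is a simplified, character-level toy, not GPT-2's actual tokenizer (which works on bytes and ships a fixed, pre-trained merge table); the word counts and function names are made up for illustration.

```python
from collections import Counter

def count_pairs(corpus):
    # corpus maps a tuple of symbols (one word) to its frequency
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical word frequencies, each word split into characters
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

for step in range(3):
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    corpus = merge_pair(corpus, best)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")

print(list(corpus))
```

Each iteration adds one new symbol to the vocabulary. Frequent endings such as "est" become single tokens after a couple of merges, while rare words stay split into smaller pieces.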

### Analogy
- Think of the vocabulary as a dictionary for the model.
- You want the dictionary to cover all common words in the language you’re going to teach the model.
- Making a dictionary from Shakespeare when you’re training on Wikipedia would work poorly, because it won’t reflect the frequency of words in Wikipedia.

In [None]:
from transformers import AutoTokenizer

# Load GPT-2 tokenizer (uses real BPE)
hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Small Language Models are powerful and efficient!"
tokens_hf = hf_tokenizer.tokenize(text)
ids_hf = hf_tokenizer.encode(text)

print("Original text:", text)
print("\nHugging Face BPE Tokens:", tokens_hf)
print("Token IDs:", ids_hf)
print("Decoded back:", hf_tokenizer.decode(ids_hf))
# Decode a single, arbitrarily chosen token ID to see which vocabulary entry it maps to
print("Random Token:", hf_tokenizer.decode(18710))

## Embeddings
### Definition of embedding:
* Numerical vector representing an object (word, token, or sentence) in high-dimensional space.
* Similar objects → vectors close together; dissimilar → far apart.
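
"Close together" is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors (GPT-2 small uses 768 dimensions), so the numbers only illustrate the idea and are not taken from any model.

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, chosen by hand for illustration
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

sim_kq = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_ka = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print("king vs queen:", round(sim_kq, 3))
print("king vs apple:", round(sim_ka, 3))
```

A value near 1 means the vectors point in almost the same direction; a value near 0 means they are close to orthogonal.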

### Purpose in Transformers / LLMs:
* Converts symbolic text into numbers so the model can process it.
* Captures semantic meaning and, in last_hidden_state, contextual meaning.

#### Example:
- Text: "Small Language Models"
- Tokens: ["Small", "ĠLanguage", "ĠModels"]
- Each token → 768-dimensional vector (GPT-2 small).

### Why embeddings are important:
* Provide contextualization: combined with attention, the model “understands” meaning in context.
* Allow semantic similarity: similar words (e.g., “king” & “queen”) have similar vectors.
* Serve as the starting point for downstream tasks: next-token prediction, classification, etc.

### Analogy:
* Think of a map of cities
* Each city = token
* Coordinates = embedding vector
* Nearby cities = similar meaning
* Moving through the map = attention + feed-forward layers updating embeddings

### last_hidden_state:
* Tensor of shape [batch_size, seq_len, hidden_size]
* Each token has a contextualized embedding vector after passing through Transformer layers.
* Before passing through layers → embeddings are basic numeric representations of tokens.
* After layers → embeddings are contextualized, rich representations used for predictions.
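
The difference between the two stages can be illustrated with a toy stand-in for a Transformer layer. In the sketch below, the lookup table, the 2-dimensional vectors, and the `mix_with_context` averaging step are all hypothetical simplifications; real layers use learned attention and feed-forward weights rather than a plain average.

```python
# Hypothetical 2-d embedding table (real models learn these values)
embedding_table = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank":  [0.5, 0.5],
}

def lookup(tokens):
    # Before any layer: each token maps to the same vector in every sentence
    return [embedding_table[t] for t in tokens]

def mix_with_context(vectors):
    # Crude stand-in for a layer: average each vector with the sentence mean
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return [[(v[d] + mean[d]) / 2 for d in range(dim)] for v in vectors]

sent_a = ["river", "bank"]
sent_b = ["money", "bank"]

static_a, static_b = lookup(sent_a), lookup(sent_b)
context_a, context_b = mix_with_context(static_a), mix_with_context(static_b)

print("static 'bank':    ", static_a[1], static_b[1])    # identical in both sentences
print("contextual 'bank':", context_a[1], context_b[1])  # differs with the neighbours
```

The static vector for "bank" is the same in both sentences, while the mixed vector shifts toward whichever word it appears next to.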


In [None]:
# Extract Embeddings from GPT-2
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

text = "Small Language Models are powerful and efficient!"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state  # shape: [batch_size, seq_len, hidden_size]

print("Embedding shape:", embeddings.shape)
print("Number of tokens:", embeddings.shape[1])
print("Hidden size:", embeddings.shape[2])


In [None]:
# Visualizing Embeddings with PCA
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

tokens = tokenizer.tokenize(text)
emb_np = embeddings[0].numpy()  # convert from tensor to numpy

# Reduce 768-d to 2D
pca = PCA(n_components=2)
emb_2d = pca.fit_transform(emb_np)

plt.figure(figsize=(8,4))
plt.scatter(emb_2d[:,0], emb_2d[:,1])
for i, token in enumerate(tokens):
    plt.annotate(token, (emb_2d[i,0], emb_2d[i,1]))
plt.title("2D PCA of GPT-2 Token Embeddings")
plt.show()


## Positional Encoding (Sinusoidal + Learned)
Transformers don’t process tokens one at a time like RNNs; self-attention sees the whole sequence at once.
So they need an explicit signal for where each token is in the sentence.

They add a positional vector (e.g., sine/cosine pattern or learned embedding)
so “The cat” ≠ “cat The”.

Result = (embedding + position) for each token.
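
A small sketch of why the position term matters: without it, the input vectors for "The cat" and "cat The" are the same set, so an order-insensitive operation cannot tell the two apart. The 2-dimensional token and position vectors below are invented for illustration.

```python
# Hypothetical toy vectors (not taken from any real model)
token_vec = {"The": [1.0, 0.0], "cat": [0.0, 1.0]}
position_vec = [[0.1, 0.2], [0.3, 0.4]]  # one vector per position index

def embed(tokens, use_position):
    rows = []
    for pos, tok in enumerate(tokens):
        vec = list(token_vec[tok])
        if use_position:
            # Result = embedding + position, element by element
            vec = [e + p for e, p in zip(vec, position_vec[pos])]
        rows.append(vec)
    return rows

a, b = ["The", "cat"], ["cat", "The"]

# Without positions, both orders contain exactly the same vectors
print(sorted(embed(a, False)) == sorted(embed(b, False)))  # True
# With positions, the same token gets a different vector at each position
print(sorted(embed(a, True)) == sorted(embed(b, True)))    # False
```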

### 1. Sinusoidal Positional Encoding
#### Implementation
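
For reference, the sinusoidal encoding described in "Attention Is All You Need" can be written as follows, where $pos$ is the token position, $i$ indexes a pair of dimensions, and $d_{\text{model}}$ is the embedding size:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

Here $i$ runs over $0, 1, \dots, d_{\text{model}}/2 - 1$. Since $1/10000^{2i/d_{\text{model}}} = \exp\!\left(-2i \cdot \ln(10000)/d_{\text{model}}\right)$, an implementation can compute the scaling factors with a single exponential instead of a power.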

In [None]:
import torch
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """
    Create sinusoidal positional encodings of shape (seq_len, d_model)
    As described in 'Attention is All You Need'
    """
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)

    # inverse frequencies: 1 / 10000^(2i/d_model), computed via exp/log
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

    # apply sin to even indices, cos to odd
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)

    return pe


#### Visualize Sinusoidal Encoding Patterns

In [None]:
import matplotlib.pyplot as plt

seq_len = 1024
d_model = 768

pe = sinusoidal_positional_encoding(seq_len, d_model)

plt.figure(figsize=(12, 6))
plt.imshow(pe, aspect='auto', cmap='viridis')
plt.colorbar()
plt.title("Sinusoidal Positional Encoding (1024 positions × 768 dims)")
plt.xlabel("Embedding Dimension")
plt.ylabel("Position Index")
plt.show()

#### Extract GPT-2 Token Embeddings for Comparison

In [None]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

text = "The dog chased the ball."
inputs = tokenizer(text, return_tensors="pt")

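with torch.no_grad():
    # Look up the raw input embeddings (before any Transformer layer).
    # last_hidden_state would already contain GPT-2's own positional information.
    token_emb = model.get_input_embeddings()(inputs["input_ids"])[0]  # shape: (seq_len, d_model)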

print("Token embedding shape:", token_emb.shape)


#### Add Sinusoidal Positional Encoding to Token Embeddings

In [None]:
seq_len = token_emb.shape[0]
d_model = token_emb.shape[1]

pos_enc = sinusoidal_positional_encoding(seq_len, d_model)

# Add position encoding to token embeddings
final_emb = token_emb + pos_enc

print("Final embedding shape:", final_emb.shape)

### 2. Learned Positional Embeddings (GPT-style)
#### Implementation

In [None]:
import torch.nn as nn

class LearnedPositionEmbedding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, seq_len):
        positions = torch.arange(0, seq_len, dtype=torch.long)
        return self.pos_emb(positions)  # (seq_len, d_model)

max_len = 1024
d_model = 768

learned_pe = LearnedPositionEmbedding(max_len, d_model)
pos_learned = learned_pe(seq_len)

print("Learned positional embedding shape:", pos_learned.shape)


#### Combine Learned Positional Embeddings with Token Embeddings

In [None]:
final_emb_learned = token_emb + pos_learned
print("Final embedding (learned PE) shape:", final_emb_learned.shape)

## Attention
### Understanding & Visualizing Attention (GPT-2)

#### 📘 Model Context
- **Model:** `GPT-2` (decoder-only Transformer)
- **Architecture:** 12 layers × 12 attention heads  
- **Attention type:** *Causal self-attention* → each token can attend only to itself and **previous** tokens (no future look-ahead)

---

#### 🔍 What the Heatmap Shows
- **Rows (Y-axis):** Query tokens, the ones *attending*  
- **Columns (X-axis):** Key tokens, the ones *being attended to*  
- **Bright cells:** High attention weight → strong focus/relationship  
- **Diagonal line:** Each token attends to itself (self-attention)  
- **Dark upper triangle:** Causal mask (future tokens are hidden)
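
The dark upper triangle follows directly from how causal attention weights are computed. The sketch below is a minimal pure-Python version with made-up scores for three tokens; it is not GPT-2's implementation, which works on batched tensors with learned query and key projections.

```python
import math

def causal_softmax(scores):
    # scores[q][k]: raw attention score of query token q for key token k
    weights = []
    for q, row in enumerate(scores):
        # Causal mask: a query may only see keys at positions <= its own
        visible = row[: q + 1]
        shift = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - shift) for s in visible]
        total = sum(exps)
        # Masked (future) positions get weight 0
        weights.append([e / total for e in exps] + [0.0] * (len(row) - q - 1))
    return weights

# Hypothetical raw scores for a 3-token sequence
scores = [
    [2.0, 1.0, 0.5],
    [0.3, 1.5, 2.2],
    [1.0, 1.0, 1.0],
]

weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row sums to 1, the first token can only attend to itself, and every entry above the diagonal is exactly zero, which is what shows up as the dark region in the heatmap.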

---

#### 🧩 Data Shapes
- `attentions[layer]` → shape `[batch, heads, seq_len, seq_len]`  
  - Each `[batch_index, head]` slice = one attention matrix for a single head  
- `last_hidden_state` → contextualized token embeddings  
- `attentions` → relationships between tokens (who “looks” at whom)

---

#### 🎨 Visualization Settings
- Color map: `"magma"` for strong contrast  
- Log scale: `np.log1p(attention)` to highlight smaller differences  
- Color range: `vmin=0.0`, `vmax≈0.15–0.25` to make bright regions pop  
- Convert tensor before plotting:  
  ```python
  attention = attention.cpu().numpy()
  ```


In [None]:
# Understanding & Visualizing Attention (GPT-2)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)
model.eval()

sentence = "The dog chased the ball because it was excited."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

attentions = outputs.attentions  # Tuple of length L (num layers); each element is a tensor of shape (batch_size, num_heads, seq_len, seq_len)
# attentions[0] → first layer, all heads in that layer.
# attentions[1] → second layer, all heads in that layer, and so on.
# attentions[0].shape[1] is the second dimension, i.e. the number of heads (num_heads).
print(f"Layers: {len(attentions)} | Heads per layer: {attentions[0].shape[1]}")

# Visualize one attention head from the last layer
layer = -1  # last layer
head = 0   # first attention head

attention = attentions[layer][0, head].cpu().numpy()  # convert to a NumPy array before plotting
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

plt.figure(figsize=(8,6))
sns.heatmap(np.log1p(attention), xticklabels=tokens, yticklabels=tokens, cmap="magma", square=True, vmin=0.0, vmax=0.15)
plt.title(f"GPT-2 Attention Map (Layer {layer}, Head {head})")
plt.xlabel("Key Tokens")
plt.ylabel("Query Tokens")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()