In [None]:
# @title
from IPython.display import display, HTML

display(HTML("""
<script>
const firstCell = document.querySelector('.cell.code_cell');
if (firstCell) {
  firstCell.querySelector('.input').style.pointerEvents = 'none';
  firstCell.querySelector('.input').style.opacity = '0.5';
}
</script>
"""))

html = """
<div style="display:flex; flex-direction:column; align-items:center; text-align:center; gap:12px; padding:8px;">
  <h1 style="margin:0;">👋 Welcome to <span style="color:#1E88E5;">Algopath Coding Academy</span>!</h1>

  <img src="https://raw.githubusercontent.com/sshariqali/mnist_pretrained_model/main/algopath_logo.jpg"
       alt="Algopath Coding Academy Logo"
       width="400"
       style="border-radius:15px; box-shadow:0 4px 12px rgba(0,0,0,0.2); max-width:100%; height:auto;" />

  <p style="font-size:16px; margin:0;">
    <em>Empowering young minds to think creatively, code intelligently, and build the future with AI.</em>
  </p>
</div>
"""

display(HTML(html))

## Day 7 - Part 2: Complete Bigram Language Model

---

### 🔗 **Continuing from Part 1**

In Part 1, we built our first language model - the **Bigram Model**! We learned:
- ✅ How to tokenize text (characters → numbers → characters)
- ✅ How to chunk and batch data for efficient training
- ✅ How to build a simple neural network for text generation

**But there were some limitations:**
- ❌ No proper loss tracking (train vs validation)
- ❌ No GPU support for faster training
- ❌ Training loop was basic

---

### 🎯 **Agenda for this Notebook**

| Section | Topic | Description |
|:-------:|-------|-------------|
| 1 | **Importing Libraries** | Import PyTorch |
| 2 | **Defining Hyperparameters** | Configure all training settings in one place |
| 3 | **Loading & Tokenizing the Dataset** | Reuse our tokenization code from Part 1 |
| 4 | **Splitting the Data** | 90/10 train/validation split |
| 5 | **Data Loading** | Batching, now with GPU support |
| 6 | **Loss Estimation** | Add proper evaluation on train AND validation sets |
| 7 | **Model Definition** | Same Bigram model, now with GPU support |
| 8 | **Training Loop** | Better training with periodic evaluation |
| 9 | **Text Generation** | Generate Shakespeare-like text! |

---

### 🎓 **What's New in Part 2?**

```python
import torch

# New: GPU support
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# New: Proper loss estimation
@torch.no_grad()
def estimate_loss():
    # Average loss over multiple batches for stability
    ...  # full definition in Section 6
```

Let's make our Bigram model production-ready! 🚀

---
## Section 1: Importing Libraries

Same as before - we only need PyTorch!

In [13]:
import torch

---
## Section 2: Defining Hyperparameters

**Best Practice:** Define ALL hyperparameters at the top of your notebook!

This makes it easy to:
- 🔧 Experiment with different settings
- 📊 Track what parameters you used
- 🔄 Reproduce your results

| Parameter | Description |
|-----------|-------------|
| `batch_size` | Number of sequences processed in parallel |
| `block_size` | Maximum context length (sequence length) |
| `max_iters` | Total training iterations |
| `eval_interval` | How often to evaluate loss |
| `learning_rate` | Step size for optimization |
| `device` | CPU or GPU |
| `eval_iters` | Batches to average for loss estimation |

In [14]:
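batch_size = 32 # how many independent sequences to process in parallel
block_size = 8 # what is the maximum context length for predictions?
max_iters = 10000 # total number of training iterations
eval_interval = 300 # evaluate train/val loss every this many iterations
learning_rate = 1e-3 # step size for the optimizer
device = 'cuda' if torch.cuda.is_available() else 'cpu' # use the GPU when one is available
eval_iters = 200 # number of batches averaged per loss estimate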

---
## Section 3: Loading & Tokenizing the Dataset

This is the same as Part 1 - we load Shakespeare and extract our vocabulary:

In [15]:
# Load the tiny shakespeare dataset
dataset = "tiny_shakespeare.txt"

# Load the dataset into a string
with open(dataset, "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(list(set(text)))
vocab_size = len(chars)

Build our character-level encoder and decoder (same as Part 1):

In [16]:
stoi = { ch: i for i, ch in enumerate(chars)}
itos = { i: ch for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
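
A quick way to convince yourself the encoder and decoder are inverses is a round-trip check. The sketch below uses a toy string instead of the Shakespeare file, and all the `sample_*` names are made up for this illustration:

```python
# Toy text standing in for the Shakespeare corpus (an assumption for illustration)
sample_text = "hello world"
sample_chars = sorted(set(sample_text))  # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']

sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}
sample_itos = {i: ch for i, ch in enumerate(sample_chars)}

def sample_encode(s):
    # string -> list of integer ids
    return [sample_stoi[c] for c in s]

def sample_decode(ids):
    # list of integer ids -> string
    return "".join(sample_itos[i] for i in ids)

ids = sample_encode("hello")
print(ids)                 # [3, 2, 4, 4, 5]
print(sample_decode(ids))  # hello

# Encoding and then decoding should give back the original text
roundtrip_ok = sample_decode(sample_encode(sample_text)) == sample_text
print(roundtrip_ok)        # True
```

Note that `encode` only knows characters that appear in the training text; any other character would raise a `KeyError`.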

---
## Section 4: Splitting the Data

Same 90/10 split for train and validation:

In [17]:
data = torch.tensor(encode(text), dtype=torch.long)
n = int(len(data) * 0.9)
train_data = data[:n]
val_data = data[n:]

---
## Section 5: Data Loading (with GPU Support!)

**🆕 What's New:** Data is now moved to the `device` (GPU if available)!

```python
x, y = x.to(device), y.to(device)  # Move to the selected device (GPU if available, otherwise CPU)
```

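The model and its input batches must live on the same device, otherwise PyTorch raises a device-mismatch error. Keeping both on the GPU is what makes training larger models efficient.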

In [18]:
torch.manual_seed(1337)

def get_batch(split):
    # generate a small batch of input-target pairs
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i+block_size] for i in ix])
    y = torch.stack([data[i+1 : i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y
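
To see what a batch looks like, here is a minimal sketch of the same slicing logic on toy data. The `toy_*` names and sizes are assumptions chosen so the numbers are easy to read; they are not the notebook's real settings:

```python
import torch

torch.manual_seed(0)
toy_data = torch.arange(20)  # stand-in for the encoded text: 0, 1, ..., 19
toy_block_size = 4
toy_batch_size = 2

# Pick random starting positions, leaving room for a full block plus one target
ix = torch.randint(len(toy_data) - toy_block_size, (toy_batch_size,))
xb = torch.stack([toy_data[i : i + toy_block_size] for i in ix])
yb = torch.stack([toy_data[i + 1 : i + toy_block_size + 1] for i in ix])

print(xb)  # inputs, shape (2, 4)
print(yb)  # targets, shape (2, 4)

# Each target row is the input row shifted by one position
shifted_ok = torch.equal(xb[:, 1:], yb[:, :-1])
print(shifted_ok)  # True
```

Because the toy data is just `0, 1, 2, ...`, every target is the input plus one, which makes the "predict the next token" setup easy to spot.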

---
## Section 6: Loss Estimation Function

**🆕 This is NEW!** In Part 1, we only looked at the loss of the current batch. But loss is noisy - it varies a lot from batch to batch.

**Solution:** Average the loss over MANY batches (`eval_iters = 200`) to get a stable estimate.

**Key Points:**
- `@torch.no_grad()`: Disables gradient computation (we're only evaluating, not training)
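- `model.eval()`: Puts the model in evaluation mode. It changes nothing for our Bigram model, but it matters as soon as layers like dropout or batchnorm are added, so it's a good habit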
- We evaluate on BOTH train and validation sets

In [19]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
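
Why does averaging help? The sketch below is a simulation, not a measurement from our model: it assumes a hypothetical "true" loss of 2.5 with random batch-to-batch noise, and compares how much a single-batch reading wobbles versus an average over `eval_iters = 200` batches:

```python
import torch

torch.manual_seed(0)
true_loss = 2.5   # hypothetical underlying loss (assumption)
noise_std = 0.1   # hypothetical batch-to-batch noise (assumption)
eval_iters = 200

# 1,000 repeated experiments, each drawing eval_iters noisy per-batch losses
batch_losses = true_loss + noise_std * torch.randn(1000, eval_iters)

# Spread of the estimate if we trust a single batch
single_batch_spread = batch_losses[:, 0].std()
# Spread of the estimate after averaging over all eval_iters batches
averaged_spread = batch_losses.mean(dim=1).std()

print(f"single batch std: {single_batch_spread:.4f}")
print(f"averaged std:     {averaged_spread:.4f}")
```

The averaged estimate should come out much steadier, which is why the train and validation numbers printed during training are easier to compare.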

---
## Section 7: Bigram Model Definition

Same model as Part 1! The architecture hasn't changed:
- **Embedding table**: Maps each character to prediction scores
- **Forward pass**: Get logits and optionally compute loss
- **Generate**: Sample new characters one at a time

In [20]:
torch.manual_seed(1337)

class BigramLanguageModel(torch.nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # Each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = torch.nn.Embedding(vocab_size, vocab_size)

    def forward(self, x, y = None):
        # x is (B, T) tensor of indices.
        logits = self.token_embedding_table(x) # (B, T, C), e.g. (32, 8, 65) with batch_size = 32 and block_size = 8
        
        # Loss
        if y is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            y = y.view(B*T)
            loss = torch.nn.functional.cross_entropy(logits, y)

        return logits, loss

    def generate(self, x, max_new_tokens):
        # x is (B, T) tensor of indices in the current context
        for _ in range(max_new_tokens):
            # get predictions
            logits, loss = self(x)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = torch.nn.functional.softmax(logits, dim = -1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples = 1) # (B, 1)
            # append sampled index to the running sequence
            x = torch.cat((x, idx_next), dim = 1) # (B, T+1)
        return x
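
The phrase "reads off the logits from a lookup table" can be checked directly. This is a small sketch with a hypothetical 5-token vocabulary (`toy_vocab_size` and `table` are illustration names, not part of the model above):

```python
import torch

torch.manual_seed(0)
toy_vocab_size = 5  # hypothetical tiny vocabulary
table = torch.nn.Embedding(toy_vocab_size, toy_vocab_size)

x = torch.tensor([[0, 3, 1]])  # (B=1, T=3) token indices
logits = table(x)              # one row of scores per input token

print(logits.shape)            # torch.Size([1, 3, 5])

# The logits for token 3 are exactly row 3 of the embedding weight matrix
same_row = torch.equal(logits[0, 1], table.weight[3])
print(same_row)                # True
```

So the whole Bigram model is a `vocab_size x vocab_size` table: row `i` holds the scores for "which character comes after character `i`".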

**🆕 Move model to device!** This is crucial for GPU training:

In [21]:
model = BigramLanguageModel(vocab_size)
model = model.to(device)

---
## Section 8: Training the Model


**🆕 Improved Training Loop!**

Now we:
- ✅ Print loss every `eval_interval` steps (every 300 steps)
- ✅ Show BOTH train and validation loss
- ✅ Use `estimate_loss()` for stable measurements

**Why track validation loss?**
- If train loss goes down but val loss goes up → **Overfitting!**
- We want both to decrease together

In [22]:
optimizer = torch.optim.AdamW(model.parameters(), lr = learning_rate)

In [23]:
for i in range(max_iters):
    
    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {i}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)

    # perform backpropagation
    loss.backward()

    # update the weights
    optimizer.step()

    # zero the gradients
    optimizer.zero_grad(set_to_none = True)

step 0: train loss 4.7305, val loss 4.7241
step 300: train loss 4.3818, val loss 4.3896
step 600: train loss 4.0801, val loss 4.0784
step 900: train loss 3.8066, val loss 3.8117
step 1200: train loss 3.5844, val loss 3.5850
step 1500: train loss 3.3757, val loss 3.3829
step 1800: train loss 3.2182, val loss 3.2218
step 2100: train loss 3.0817, val loss 3.0810
step 2400: train loss 2.9663, val loss 2.9739
step 2700: train loss 2.8809, val loss 2.8800
step 3000: train loss 2.7984, val loss 2.8055
step 3300: train loss 2.7461, val loss 2.7386
step 3600: train loss 2.6850, val loss 2.7032
step 3900: train loss 2.6580, val loss 2.6647
step 4200: train loss 2.6236, val loss 2.6301
step 4500: train loss 2.5917, val loss 2.5941
step 4800: train loss 2.5686, val loss 2.5781
step 5100: train loss 2.5564, val loss 2.5685
step 5400: train loss 2.5441, val loss 2.5564
step 5700: train loss 2.5388, val loss 2.5335
step 6000: train loss 2.5245, val loss 2.5162
step 6300: train loss 2.5109, val loss 2
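
A useful reference point for reading this log: a model that gives every character the same probability would score a loss of `ln(vocab_size)`. The sketch below assumes `vocab_size = 65`, the value shown in the model's shape comment:

```python
import math
import torch

vocab_size = 65  # assumed, taken from the shape comment in the model's forward pass
uniform_loss = math.log(vocab_size)

# Cross-entropy of all-zero logits (a perfectly uniform guess) gives the same number
logits = torch.zeros(1, vocab_size)
target = torch.tensor([0])
ce = torch.nn.functional.cross_entropy(logits, target).item()

print(f"ln(vocab_size):       {uniform_loss:.4f}")
print(f"uniform-guess CE loss: {ce:.4f}")
```

Both come out to about 4.17. The step 0 loss of 4.73 is a bit worse than that because the randomly initialized table makes confident but wrong guesses, and the later values around 2.5 show the model has learned real character-pair statistics.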

---
## Section 9: Text Generation

Time to see our model in action! Let's generate some Shakespeare-like text:

In [24]:
context = torch.zeros((1, 1), dtype = torch.long, device = device) # start from idx 0 (the new line character), on the same device as the model
out = model.generate(context, max_new_tokens = 200)
print(decode(out[0].tolist()))


Wh. te t beche? no Bu IR:


Sar tor, knfr hequs y' t wnin mant nscehesa thaspot nd

IRES thewisssttene ftKIOLERAUS tiey hanentherve s anerat w.
Ane s al ifre t, nd doororounond pugCO:
gh ng t?
DUTh, I
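
The output differs from run to run (unless the seed is fixed) because `generate` samples from the predicted probabilities instead of always taking the most likely character. Here is a minimal sketch of that step with made-up scores for a 3-token vocabulary:

```python
import torch

torch.manual_seed(0)
logits = torch.tensor([[2.0, 0.5, -1.0]])            # hypothetical scores, shape (B=1, C=3)
probs = torch.nn.functional.softmax(logits, dim=-1)  # turn scores into probabilities

# Greedy choice: always the highest-probability token
greedy = probs.argmax(dim=-1)

# Sampling: draw 1000 tokens according to their probabilities
samples = torch.multinomial(probs, num_samples=1000, replacement=True)  # (1, 1000)
counts = torch.bincount(samples[0], minlength=3)

print(probs)   # probabilities for tokens 0, 1, 2
print(greedy)  # index of the most likely token
print(counts)  # how often each token was sampled
```

Token 0 is drawn most often, but the less likely tokens still show up sometimes. That randomness is what gives generated text its variety.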


---
## 📝 Summary & What's Next?

**What We Accomplished in Part 2:**
- ✅ Organized hyperparameters at the top
- ✅ Added GPU support for faster training
- ✅ Implemented proper loss estimation (averaging over many batches)
- ✅ Created a cleaner training loop with periodic evaluation
- ✅ Tracked both training AND validation loss

**The Problem with Bigram:**

Even with perfect training, the Bigram model has a fundamental limitation:

> **It only looks at ONE character to predict the next!**

Consider: `"The cat sat on the _"`

- A Bigram sees: `"e"` → predicts next character
- It has NO IDEA about "The cat sat on the"!
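
This limitation is easy to demonstrate. In the sketch below, a plain embedding table stands in for the Bigram model (the names and the 6-token vocabulary are assumptions for illustration). Two contexts with completely different histories but the same final token get exactly the same prediction:

```python
import torch

torch.manual_seed(0)
toy_vocab_size = 6  # hypothetical tiny vocabulary
table = torch.nn.Embedding(toy_vocab_size, toy_vocab_size)  # stand-in for the Bigram model

context_a = torch.tensor([[1, 2, 3, 4]])  # two different histories ...
context_b = torch.tensor([[5, 0, 0, 4]])  # ... that end in the same token

# Logits used to predict the next token (last time step only)
last_a = table(context_a)[:, -1, :]
last_b = table(context_b)[:, -1, :]

same_prediction = torch.equal(last_a, last_b)
print(same_prediction)  # True
```

Everything before the last token is ignored, no matter how long the context is.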

**The Solution: ATTENTION!**

What if we could let each position "look at" all the previous positions and decide which ones are important?

```
"The cat sat on the _"
     ↑↑↑
   Maybe "cat" is important for predicting what sits!
```

---

### ➡️ Next Up: Part 3 - The Attention Mechanism

In Part 3, we'll learn:
- 🎯 How attention allows tokens to "communicate"
- 🔑 The Query-Key-Value mechanism
- 🔒 Causal masking (preventing looking at the future)

Attention is the core idea of the Transformer architecture that powers **ChatGPT, GPT-4, and most modern LLMs!**

Let's go! 🚀