[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shang-vikas/series1-coding-exercises/blob/main/exercises/blog-05/exercise-00.ipynb)

# 🧪 Exercise 1: Why Bag of Words Fails


**Goal:** Watch word order get destroyed, mechanically.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["dog bites man", "man bites dog"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print("Vocabulary:", vectorizer.vocabulary_)
print("Vectors:\n", X.toarray())

Vocabulary: {'dog': 1, 'bites': 0, 'man': 2}
Vectors:
 [[1 1 1]
 [1 1 1]]


**Expected output:**

```
Vocabulary: {'dog': 1, 'bites': 0, 'man': 2}
Vectors:
 [[1 1 1]
 [1 1 1]]
```

Same vector.

Different meaning.

This is the mechanical reason RNNs exist.
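
The same collapse can be reproduced without scikit-learn. Here is a minimal sketch using only the standard library; the helper names `bag_of_words` and `bigrams` are illustrative and not part of the exercise. It also shows that the smallest order-aware feature, adjacent word pairs, is already enough to separate the two sentences.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count each word, discarding its position."""
    return Counter(sentence.split())

def bigrams(sentence):
    """Count adjacent word pairs, which keep local order."""
    words = sentence.split()
    return Counter(zip(words, words[1:]))

a, b = "dog bites man", "man bites dog"

print(bag_of_words(a) == bag_of_words(b))  # True: the counts are identical
print(bigrams(a) == bigrams(b))            # False: the word pairs differ
```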

# 🧪 Exercise 2: Manual Tiny RNN Forward Pass

**Goal:** See recurrence happen.

In [2]:
import numpy as np

np.random.seed(42)

# Input size = 3, hidden size = 2
Wx = np.random.randn(3, 2)
Wh = np.random.randn(2, 2)
b = np.zeros((1, 2))

def rnn_step(x, h_prev):
    return np.tanh(x @ Wx + h_prev @ Wh + b)

# Sequence of 3 one-hot inputs
x_seq = [
    np.array([[1,0,0]]),
    np.array([[0,1,0]]),
    np.array([[0,0,1]])
]

h = np.zeros((1,2))

for t, x in enumerate(x_seq):
    h = rnn_step(x, h)
    print(f"Step {t+1} hidden:", h)

Step 1 hidden: [[ 0.45952909 -0.13738992]]
Step 2 hidden: [[0.89327092 0.94692458]]
Step 3 hidden: [[0.62425975 0.74656684]]


**Observe:**

- Same Wx, Wh reused
- Hidden state evolves
- State depends on previous state

This is recurrence in its purest form.
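
Written out, the loop applies one update rule three times. This is a sketch in the notation of the code above (row vectors, zero initial state):

$$
h_t = \tanh\left(x_t W_x + h_{t-1} W_h + b\right), \qquad h_0 = 0
$$

$$
h_3 = \tanh\Big(x_3 W_x + \tanh\big(x_2 W_x + \tanh(x_1 W_x + b)\, W_h + b\big)\, W_h + b\Big)
$$

Because $x_1$ is one-hot and $h_0 = 0$, step 1 is simply $\tanh$ applied to the first row of $W_x$ (plus $b$, which is zero here).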

# 🧪 Exercise 3: Show Hidden State Overwrite

Now prepend noise:

In [3]:
noise = [np.random.randn(1,3) for _ in range(5)]
new_seq = noise + x_seq

h = np.zeros((1,2))

for t, x in enumerate(new_seq):
    h = rnn_step(x, h)

print("Final hidden state after noise:", h)

Final hidden state after noise: [[-0.14714332  0.23473421]]


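Now compare the final hidden state:

- Without noise: `[[0.62425975 0.74656684]]` (Exercise 2, step 3)
- With noise: `[[-0.14714332  0.23473421]]`

The last three inputs are identical in both runs, yet the final state is different. Everything the RNN has seen must fit into one small vector, and every step rewrites that whole vector.

Early signal gets overwritten.

Memory fragility becomes obvious.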
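
The comparison can also be run side by side. The sketch below is self-contained, so it draws its own weights with an assumed seed and will not reproduce the numbers printed above; `final_state` is a helper introduced here for illustration. It adds a third case that is not in the exercise, noise appended after the signal, which is the most direct way to watch the signal get overwritten.

```python
import numpy as np

rng = np.random.default_rng(0)     # assumed seed, not the notebook's
Wx = rng.standard_normal((3, 2))
Wh = rng.standard_normal((2, 2))
b = np.zeros((1, 2))

def rnn_step(x, h_prev):
    return np.tanh(x @ Wx + h_prev @ Wh + b)

def final_state(seq):
    """Run the whole sequence from a zero state; return the last hidden state."""
    h = np.zeros((1, 2))
    for x in seq:
        h = rnn_step(x, h)
    return h

signal = [np.eye(3)[[i]] for i in range(3)]              # three one-hot inputs
noise = [rng.standard_normal((1, 3)) for _ in range(5)]  # five random inputs

h_clean = final_state(signal)
h_prepended = final_state(noise + signal)   # the exercise's setup
h_appended = final_state(signal + noise)    # extra case: noise arrives last

print("signal only:      ", h_clean)
print("noise then signal:", h_prepended)
print("signal then noise:", h_appended)
print("distance (prepended):", np.linalg.norm(h_clean - h_prepended))
print("distance (appended): ", np.linalg.norm(h_clean - h_appended))
```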

# 🧪 Exercise 4: Train a Real RNN (Tiny Shakespeare)

We'll do character-level modeling.

## Step 1: Load tiny dataset

In [4]:
import torch
import torch.nn as nn
import torch.optim as optim
import requests

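url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail loudly if the download did not succeed
text = response.text[:50000]  # small subset

chars = sorted(set(text))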
vocab_size = len(chars)

char_to_idx = {ch:i for i,ch in enumerate(chars)}
idx_to_char = {i:ch for ch,i in char_to_idx.items()}

## Step 2: Create sequences

In [5]:
sequence_length = 50

data = []
for i in range(len(text) - sequence_length):
    seq = text[i:i+sequence_length]
    target = text[i+1:i+sequence_length+1]
    data.append((seq, target))

def encode(seq):
    return torch.tensor([char_to_idx[c] for c in seq])
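
A toy string makes the one-character shift between input and target easy to see. This sketch assumes a made-up `toy_text` and a window of 4 instead of 50; the loop itself is the same as above.

```python
toy_text = "hello world"   # hypothetical stand-in for the Shakespeare text
window = 4                 # plays the role of sequence_length

pairs = []
for i in range(len(toy_text) - window):
    seq = toy_text[i:i + window]
    target = toy_text[i + 1:i + window + 1]
    pairs.append((seq, target))

for seq, target in pairs[:3]:
    print(repr(seq), "->", repr(target))
# 'hell' -> 'ello'
# 'ello' -> 'llo '
# 'llo ' -> 'lo w'
```

At every position the target is the character that follows the input character, so a sequence of length 50 yields 50 next-character predictions per training example.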

## Step 3: Model

In [6]:
class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, vocab_size)
        self.rnn = nn.RNN(vocab_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        out, _ = self.rnn(x)
        out = self.fc(out)
        return out

model = SimpleRNN(vocab_size, hidden_size=128)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)
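
A quick count shows how small this model is, and that nothing in it depends on sequence length. The arithmetic below follows the layer shapes in `SimpleRNN`; `vocab_size = 65` is an assumed value for illustration, so substitute your own `len(chars)`.

```python
vocab_size = 65    # assumed for illustration; the real value is len(chars)
hidden_size = 128

embedding_params = vocab_size * vocab_size
rnn_params = (hidden_size * vocab_size      # weight_ih_l0
              + hidden_size * hidden_size   # weight_hh_l0
              + 2 * hidden_size)            # bias_ih_l0 and bias_hh_l0
fc_params = hidden_size * vocab_size + vocab_size

total = embedding_params + rnn_params + fc_params
print(total)  # 37570
```

In the notebook, `sum(p.numel() for p in model.parameters())` reports the actual count for your vocabulary.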

## Step 4: Training loop

In [7]:
def train_epoch():
    model.train()
    total_loss = 0

    for i in range(0, 1000):  # small subset for speed
        seq, target = data[i]
        x = encode(seq).unsqueeze(0)
        y = encode(target)

        out = model(x)
        loss = criterion(out.view(-1, vocab_size), y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    return total_loss / 1000

for epoch in range(5):
    loss = train_epoch()
    print(f"Epoch {epoch+1} | Loss: {loss:.4f}")

Epoch 1 | Loss: 0.6322
Epoch 2 | Loss: 0.5484
Epoch 3 | Loss: 0.5348
Epoch 4 | Loss: 0.5000
Epoch 5 | Loss: 0.4635


Now observe that:

- Same weights reused per step
- Sequential processing
- Loss decreasing
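
To put the loss values in context, it helps to know what a uniform guess would score and how much text the 1000 training windows actually cover. A minimal sketch, again assuming `vocab_size = 65` for illustration:

```python
import math

vocab_size = 65         # assumed for illustration; use len(chars)
sequence_length = 50
num_windows = 1000      # train_epoch uses data[0] .. data[999]

# Cross-entropy of a uniform guess over the vocabulary, in nats.
uniform_loss = math.log(vocab_size)

# Window i spans text[i : i + sequence_length + 1] (input plus shifted target),
# so consecutive windows differ by a single character.
chars_covered = (num_windows - 1) + sequence_length + 1

print(f"uniform baseline: {uniform_loss:.2f}")
print(f"text positions seen per epoch: {chars_covered}")
```

Each epoch revisits roughly the same thousand characters about fifty times, which likely explains why the loss is already far below the uniform baseline after one epoch. Treat these numbers as a sign that the loop works, not as evidence of generalization.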

# 🧪 Exercise 5: Demonstrate Vanishing Gradient

Increase sequence length.

In [8]:
sequence_length = 200  # try 20, 50, 200

# Recreate data with new sequence length
data = []
for i in range(len(text) - sequence_length):
    seq = text[i:i+sequence_length]
    target = text[i+1:i+sequence_length+1]
    data.append((seq, target))

# Reinitialize model
model = SimpleRNN(vocab_size, hidden_size=128)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

Then inspect gradients:

In [9]:
# Run one training step to get gradients
seq, target = data[0]
x = encode(seq).unsqueeze(0)
y = encode(target)

out = model(x)
loss = criterion(out.view(-1, vocab_size), y)

optimizer.zero_grad()
loss.backward()

for name, param in model.named_parameters():
    if param.grad is not None:
        print(name, param.grad.norm().item())

embedding.weight 0.023717757314443588
rnn.weight_ih_l0 0.4803493320941925
rnn.weight_hh_l0 0.2789147198200226
rnn.bias_ih_l0 0.10563002526760101
rnn.bias_hh_l0 0.10563002526760101
fc.weight 0.5119559168815613
fc.bias 0.20093689858913422


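As sequence length increases:

- Gradients that reach early time steps shrink
- Learning long-range dependencies slows, or stalls

The norms printed above are per parameter and sum the contributions of every time step, so recent steps dominate them. The shrinkage happens in the contribution from distant steps.

This makes vanishing gradient concrete.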
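
One way to see where along the sequence the gradient dies, sketched here under assumptions: take the loss at the last position only and measure its gradient with respect to the embedded input at every earlier position. The sizes, the seed, and the random token ids below are placeholders, not the Shakespeare setup, and the layers are built standalone so the block runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, hidden_size, seq_len = 20, 32, 200   # assumed toy sizes

embedding = nn.Embedding(vocab_size, vocab_size)
rnn = nn.RNN(vocab_size, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)

x = torch.randint(0, vocab_size, (1, seq_len))   # random token ids
y_last = torch.randint(0, vocab_size, (1,))      # target for the last step only

emb = embedding(x)                # (1, seq_len, vocab_size)
emb.retain_grad()                 # keep the gradient of this non-leaf tensor
out, _ = rnn(emb)                 # (1, seq_len, hidden_size)
logits_last = fc(out[:, -1, :])   # (1, vocab_size)
loss = nn.functional.cross_entropy(logits_last, y_last)
loss.backward()

# Gradient magnitude of the last-step loss w.r.t. the input at each time step.
grad_norms = emb.grad.norm(dim=2).squeeze(0)     # (seq_len,)
for t in (0, 50, 100, 150, 190, 199):
    print(f"t={t:3d}  grad norm={grad_norms[t].item():.3e}")
```

With PyTorch's default initialization, the norms should fall off sharply as `t` moves away from the final step. That drop is the vanishing gradient: the last prediction sends almost no learning signal back to the earliest inputs.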

# 🧪 Exercise 6: Serial Bottleneck Demonstration

Measure time per epoch.

In [10]:
import time

start = time.time()
train_epoch()
end = time.time()

print("Time per epoch:", end - start)

Time per epoch: 26.345571517944336


Now test:

- `sequence_length = 20`
- `sequence_length = 200`

Longer sequence → slower epoch.

Why?

Because an RNN must compute its steps in order:

t1 → t2 → t3 → ... → t200

No parallel shortcut.

This is the structural bottleneck.
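
The dependency is visible even in a scalar toy recurrence. The sketch below uses made-up weights `w_x` and `w_h` and plain Python so the loop is explicit. Absolute timings will vary by machine, but the time should grow roughly in proportion to the sequence length, because step t cannot start before step t-1 has produced its hidden state.

```python
import math
import time

def run_recurrence(seq, w_x=0.5, w_h=0.9):
    """Scalar RNN: each step needs the hidden state from the step before."""
    h = 0.0
    for x in seq:
        h = math.tanh(w_x * x + w_h * h)
    return h

for length in (20, 200, 2000):
    seq = [1.0] * length
    start = time.perf_counter()
    for _ in range(200):          # repeat to get a measurable duration
        run_recurrence(seq)
    elapsed = time.perf_counter() - start
    print(f"length={length:5d}  time={elapsed:.4f}s")
```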

## What You Just Saw

- Bag of words destroys order.
- RNN carries evolving hidden state.
- Memory gets overwritten.
- Gradients weaken over long sequences.
- Computation is serial.
- Training time scales with sequence length.

No mythology.

No metaphors.

Just mechanics.