# RNN from Scratch

This notebook builds a character-level RNN from scratch using only NumPy. We'll implement the forward pass, backpropagation through time (BPTT), and train it to generate text.

**Goal:** Understand how RNNs process sequences and why they struggle with long-range dependencies.

**Prerequisites:** [rnns.md](../architectures/rnns.md), [backpropagation.md](../neural-networks/backpropagation.md)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

## 1. The RNN Cell

The core RNN update rule:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

$$y_t = W_{hy} h_t + b_y$$

Where:
- $h_t$ is the hidden state at time $t$
- $x_t$ is the input at time $t$
- $y_t$ is the output at time $t$
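
The equations above use the column-vector convention ($W h$), while the code in this notebook multiplies row vectors on the right (`x @ W`). The sketch below, using made-up shapes and random values chosen purely for illustration, checks that the two forms give the same hidden state when the stored matrices are transposes of each other.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4  # illustrative sizes, not the notebook's

# Column-vector convention, as in the equations: W has shape (out, in).
W_xh_col = rng.standard_normal((hidden_dim, input_dim))
W_hh_col = rng.standard_normal((hidden_dim, hidden_dim))
b_h = rng.standard_normal(hidden_dim)
x_t = rng.standard_normal(input_dim)
h_prev = rng.standard_normal(hidden_dim)

h_col = np.tanh(W_hh_col @ h_prev + W_xh_col @ x_t + b_h)

# Row-vector convention, as in the code: store the transposes, shape (in, out),
# and multiply the vector on the left of the matrix.
W_xh_row, W_hh_row = W_xh_col.T, W_hh_col.T
h_row = np.tanh(x_t @ W_xh_row + h_prev @ W_hh_row + b_h)

print("Same hidden state:", np.allclose(h_col, h_row))
```

This is why `W_xh` is created with shape `(input_dim, hidden_dim)` in the cells that follow.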

In [None]:
class RNNCell:
    """
    Single RNN cell: processes one timestep.
    """
    
    def __init__(self, input_dim, hidden_dim):
        """
        Args:
            input_dim: Size of input vectors
            hidden_dim: Size of hidden state
        """
        self.hidden_dim = hidden_dim
        
        # Scaled Gaussian initialization with std = 1/sqrt(fan_in) (a Xavier-style scheme)
        scale_xh = np.sqrt(1.0 / input_dim)
        scale_hh = np.sqrt(1.0 / hidden_dim)
        
        self.W_xh = np.random.randn(input_dim, hidden_dim) * scale_xh
        self.W_hh = np.random.randn(hidden_dim, hidden_dim) * scale_hh
        self.b_h = np.zeros(hidden_dim)
    
    def forward(self, x, h_prev):
        """
        Forward pass for single timestep.
        
        Args:
            x: Input at current timestep [batch, input_dim]
            h_prev: Hidden state from previous timestep [batch, hidden_dim]
            
        Returns:
            h: New hidden state [batch, hidden_dim]
        """
        # Store for backprop
        self.x = x
        self.h_prev = h_prev
        
        # h = tanh(x @ W_xh + h_prev @ W_hh + b_h)  (row-vector form of the update rule)
        self.z = x @ self.W_xh + h_prev @ self.W_hh + self.b_h
        h = np.tanh(self.z)
        
        return h
    
    def backward(self, dh):
        """
        Backward pass for single timestep.
        
        Args:
            dh: Gradient w.r.t. hidden state [batch, hidden_dim]
            
        Returns:
            dx: Gradient w.r.t. input
            dh_prev: Gradient w.r.t. previous hidden state
            grads: Dictionary of parameter gradients
        """
        # Gradient through tanh: d/dz tanh(z) = 1 - tanh(z)^2
        dz = dh * (1 - np.tanh(self.z)**2)
        
        # Parameter gradients
        dW_xh = self.x.T @ dz
        dW_hh = self.h_prev.T @ dz
        db_h = dz.sum(axis=0)
        
        # Input gradients
        dx = dz @ self.W_xh.T
        dh_prev = dz @ self.W_hh.T
        
        grads = {'W_xh': dW_xh, 'W_hh': dW_hh, 'b_h': db_h}
        return dx, dh_prev, grads

In [None]:
# Test the cell
cell = RNNCell(input_dim=10, hidden_dim=20)

batch_size = 4
x = np.random.randn(batch_size, 10)
h_prev = np.zeros((batch_size, 20))

h = cell.forward(x, h_prev)
print(f"Input shape: {x.shape}")
print(f"Hidden shape: {h.shape}")
print(f"Hidden range: [{h.min():.3f}, {h.max():.3f}] (should be in [-1, 1] due to tanh)")
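
A shape test does not tell us whether the backward formulas are right. One common way to gain confidence is a finite-difference gradient check. The sketch below is standalone: it re-creates the cell's forward and backward formulas with small random arrays (the sizes, the `forward` helper, and the test loss `sum(h * R)` are all assumptions made for illustration) and compares the analytic `dW_hh` with a numerical estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, input_dim, hidden_dim = 4, 5, 6  # illustrative sizes

W_xh = rng.standard_normal((input_dim, hidden_dim)) * 0.5
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.5
b_h = np.zeros(hidden_dim)
x = rng.standard_normal((batch, input_dim))
h_prev = rng.standard_normal((batch, hidden_dim))

def forward(W_hh_):
    """Same update rule as RNNCell.forward, with W_hh passed in explicitly."""
    return np.tanh(x @ W_xh + h_prev @ W_hh_ + b_h)

# Hypothetical scalar loss L = sum(h * R) for a fixed random R, so dL/dh = R.
R = rng.standard_normal((batch, hidden_dim))

# Analytic gradient, using the same formulas as RNNCell.backward.
h = forward(W_hh)
dz = R * (1 - h ** 2)
dW_hh_analytic = h_prev.T @ dz

# Numerical gradient by central finite differences, one entry at a time.
eps = 1e-5
dW_hh_numeric = np.zeros_like(W_hh)
for i in range(hidden_dim):
    for j in range(hidden_dim):
        W_plus, W_minus = W_hh.copy(), W_hh.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        loss_plus = np.sum(forward(W_plus) * R)
        loss_minus = np.sum(forward(W_minus) * R)
        dW_hh_numeric[i, j] = (loss_plus - loss_minus) / (2 * eps)

max_err = np.max(np.abs(dW_hh_analytic - dW_hh_numeric))
print(f"Max abs difference: {max_err:.2e}")
```

A difference many orders of magnitude below the gradient values suggests the backward pass matches the forward pass; the same check can be repeated for `W_xh` and `b_h`.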

## 2. Character-Level Language Model

Now let's build a full RNN that:
1. Takes characters as input (one-hot encoded)
2. Predicts the next character
3. Trains on text data
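
A useful detail about one-hot inputs: multiplying a one-hot row vector by `W_xh` simply selects one row of the matrix, so each character effectively gets its own learned vector. A minimal sketch with toy sizes (the sizes and `char_index` are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 5, 3  # toy sizes for illustration
W_xh = rng.standard_normal((vocab_size, hidden_dim))

char_index = 2  # hypothetical character index
x = np.zeros(vocab_size)
x[char_index] = 1.0

via_matmul = x @ W_xh          # what the model code computes
via_lookup = W_xh[char_index]  # equivalent row lookup

print("One-hot matmul equals row lookup:", np.allclose(via_matmul, via_lookup))
```

The notebook keeps the explicit one-hot product because it makes the gradient formulas easier to read; the row lookup would be a cheaper equivalent.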

In [None]:
class CharRNN:
    """
    Character-level RNN for text generation.
    """
    
    def __init__(self, vocab_size, hidden_dim):
        """
        Args:
            vocab_size: Number of unique characters
            hidden_dim: Size of hidden state
        """
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        
        # Initialize weights
        scale_xh = np.sqrt(1.0 / vocab_size)
        scale_hh = np.sqrt(1.0 / hidden_dim)
        scale_hy = np.sqrt(1.0 / hidden_dim)
        
        self.W_xh = np.random.randn(vocab_size, hidden_dim) * scale_xh
        self.W_hh = np.random.randn(hidden_dim, hidden_dim) * scale_hh
        self.W_hy = np.random.randn(hidden_dim, vocab_size) * scale_hy
        self.b_h = np.zeros(hidden_dim)
        self.b_y = np.zeros(vocab_size)
        
    def forward(self, inputs, h_init=None):
        """
        Forward pass through sequence.
        
        Args:
            inputs: List of character indices [seq_len]
            h_init: Initial hidden state [hidden_dim]
            
        Returns:
            probs: Softmax probabilities at each timestep (dict keyed by t)
            hidden_states: Hidden state at each timestep (dict; key -1 holds h_init)
        """
        seq_len = len(inputs)
        
        if h_init is None:
            h_init = np.zeros(self.hidden_dim)
        
        # Storage for backprop
        self.inputs = inputs
        self.xs = {}  # One-hot inputs
        self.hs = {-1: h_init}  # Hidden states
        self.os = {}  # Output logits
        self.ps = {}  # Softmax probabilities
        
        for t in range(seq_len):
            # One-hot encode input
            x = np.zeros(self.vocab_size)
            x[inputs[t]] = 1
            self.xs[t] = x
            
            # RNN step: h_t = tanh(x_t @ W_xh + h_{t-1} @ W_hh + b_h)
            self.hs[t] = np.tanh(
                x @ self.W_xh + self.hs[t-1] @ self.W_hh + self.b_h
            )
            
            # Output logits: y_t = h_t @ W_hy + b_y
            self.os[t] = self.hs[t] @ self.W_hy + self.b_y
            
            # Softmax for probabilities
            exp_o = np.exp(self.os[t] - self.os[t].max())
            self.ps[t] = exp_o / exp_o.sum()
        
        return self.ps, self.hs
    
    def loss(self, targets):
        """
        Compute cross-entropy loss.
        
        Args:
            targets: List of target character indices [seq_len]
            
        Returns:
            loss: Average cross-entropy loss
        """
        loss = 0
        for t in range(len(targets)):
            loss -= np.log(self.ps[t][targets[t]] + 1e-10)
        return loss / len(targets)
    
    def backward(self, targets):
        """
        Backpropagation through time (BPTT).
        
        Args:
            targets: List of target character indices [seq_len]
            
        Returns:
            grads: Dictionary of parameter gradients of the summed (not
                averaged) cross-entropy, clipped element-wise to [-5, 5]
        """
        seq_len = len(targets)
        
        # Initialize gradients
        dW_xh = np.zeros_like(self.W_xh)
        dW_hh = np.zeros_like(self.W_hh)
        dW_hy = np.zeros_like(self.W_hy)
        db_h = np.zeros_like(self.b_h)
        db_y = np.zeros_like(self.b_y)
        
        # Gradient flowing back through hidden states
        dh_next = np.zeros(self.hidden_dim)
        
        # Go backwards through time
        for t in reversed(range(seq_len)):
            # Output gradient: softmax + cross-entropy
            do = self.ps[t].copy()
            do[targets[t]] -= 1  # d(loss)/d(output)
            
            # Output layer gradients
            dW_hy += np.outer(self.hs[t], do)
            db_y += do
            
            # Hidden state gradient (from output AND from next timestep)
            dh = do @ self.W_hy.T + dh_next
            
            # Gradient through tanh
            dh_raw = dh * (1 - self.hs[t]**2)
            
            # RNN parameter gradients
            dW_xh += np.outer(self.xs[t], dh_raw)
            dW_hh += np.outer(self.hs[t-1], dh_raw)
            db_h += dh_raw
            
            # Pass gradient to previous timestep
            dh_next = dh_raw @ self.W_hh.T
        
        # Clip each gradient element to [-5, 5] to limit exploding gradients
        for grad in [dW_xh, dW_hh, dW_hy, db_h, db_y]:
            np.clip(grad, -5, 5, out=grad)
        
        return {
            'W_xh': dW_xh, 'W_hh': dW_hh, 'W_hy': dW_hy,
            'b_h': db_h, 'b_y': db_y
        }
    
    def update(self, grads, lr):
        """Update parameters with gradient descent."""
        self.W_xh -= lr * grads['W_xh']
        self.W_hh -= lr * grads['W_hh']
        self.W_hy -= lr * grads['W_hy']
        self.b_h -= lr * grads['b_h']
        self.b_y -= lr * grads['b_y']
    
    def sample(self, seed_char, length, temperature=1.0):
        """
        Generate text by sampling from the model.
        
        Args:
            seed_char: Starting character index
            length: Number of characters to generate
            temperature: Higher = more random, lower = more deterministic
            
        Returns:
            List of generated character indices
        """
        h = np.zeros(self.hidden_dim)
        x = seed_char
        generated = [x]
        
        for _ in range(length):
            # One-hot encode
            x_vec = np.zeros(self.vocab_size)
            x_vec[x] = 1
            
            # Forward step
            h = np.tanh(x_vec @ self.W_xh + h @ self.W_hh + self.b_h)
            o = h @ self.W_hy + self.b_y
            
            # Sample with temperature
            o = o / temperature
            exp_o = np.exp(o - o.max())
            probs = exp_o / exp_o.sum()
            
            x = np.random.choice(self.vocab_size, p=probs)
            generated.append(x)
        
        return generated
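
The `backward` method relies on the fact that, for softmax followed by cross-entropy, the gradient with respect to the logits is the probability vector minus the one-hot target (`do = p; do[target] -= 1`). The standalone sketch below checks that identity numerically on random logits; the helper names `softmax` and `cross_entropy` and the chosen `target` are illustrative assumptions, not part of the class above.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(o - o.max())
    return e / e.sum()

def cross_entropy(o, target):
    """Cross-entropy of the softmax of logits `o` against a target index."""
    return -np.log(softmax(o)[target])

rng = np.random.default_rng(0)
logits = rng.standard_normal(6)
target = 3  # arbitrary target index for illustration

# Analytic gradient: probabilities minus one-hot target.
analytic = softmax(logits)
analytic[target] -= 1.0

# Numerical gradient by central finite differences.
eps = 1e-6
numeric = np.zeros_like(logits)
for k in range(len(logits)):
    step = np.zeros_like(logits)
    step[k] = eps
    numeric[k] = (cross_entropy(logits + step, target)
                  - cross_entropy(logits - step, target)) / (2 * eps)

max_err = np.max(np.abs(analytic - numeric))
print(f"Max abs difference: {max_err:.2e}")
```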

## 3. Prepare Training Data

We'll train on a small text dataset: a short passage from Hamlet's "To be, or not to be" soliloquy.

In [None]:
# Sample text (you can use any text)
text = """To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them. To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to. 'Tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
When we have shuffled off this mortal coil,
Must give us pause."""

# Create character vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

print(f"Text length: {len(text)} characters")
print(f"Vocabulary size: {vocab_size}")
print(f"Vocabulary: {''.join(chars)}")

In [None]:
# Convert text to indices
data = [char_to_idx[ch] for ch in text]

print(f"First 50 characters: {text[:50]}")
print(f"As indices: {data[:50]}")

## 4. Training

Train the RNN with truncated BPTT (process sequences in chunks).
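
To make the chunking concrete, here is a small sketch that applies the same slicing the training loop uses to a toy sequence of ten integers (the toy `data` and `seq_length=3` are illustrative stand-ins for the character indices and the real chunk length). Each target sequence is the input sequence shifted by one position.

```python
# Toy stand-in for the list of character indices.
data = list(range(10))
seq_length = 3

chunks = []
# Same range and slices as the training loop.
for i in range(0, len(data) - seq_length - 1, seq_length):
    inputs = data[i:i + seq_length]
    targets = data[i + 1:i + seq_length + 1]  # inputs shifted by one step
    chunks.append((inputs, targets))

for inputs, targets in chunks:
    print(inputs, "->", targets)
```

With these toy values the loop yields two chunks, and the trailing elements that do not fill a whole chunk are never used as inputs. During training, gradients are only propagated within a chunk, while the hidden state is carried across chunk boundaries.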

In [None]:
def train_rnn(model, data, seq_length=25, epochs=100, lr=0.1, print_every=10):
    """
    Train the RNN on character data.
    
    Args:
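        model: CharRNN model
        data: List of character indices
        seq_length: Number of timesteps per truncated-BPTT chunk
        epochs: Number of passes through data
        lr: Learning rate
        print_every: Print the loss and a text sample every this many epochs
            (the sample is decoded with the global idx_to_char mapping)
        
    Returns:
        losses: Average loss per epoch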
    """
    losses = []
    smooth_loss = -np.log(1.0 / model.vocab_size) * seq_length
    
    for epoch in range(epochs):
        h_prev = np.zeros(model.hidden_dim)
        epoch_loss = 0
        n_batches = 0
        
        # Go through data in chunks
        for i in range(0, len(data) - seq_length - 1, seq_length):
            inputs = data[i:i+seq_length]
            targets = data[i+1:i+seq_length+1]
            
            # Forward pass
            probs, hidden_states = model.forward(inputs, h_prev)
            loss = model.loss(targets)
            
            # Backward pass
            grads = model.backward(targets)
            
            # Update
            model.update(grads, lr)
            
            # Carry hidden state forward (detached)
            h_prev = hidden_states[seq_length - 1].copy()
            
            epoch_loss += loss
            n_batches += 1
            
            # Exponentially smoothed per-chunk loss (tracked for reference; not returned)
            smooth_loss = 0.999 * smooth_loss + 0.001 * loss * seq_length
        
        avg_loss = epoch_loss / n_batches
        losses.append(avg_loss)
        
        if epoch % print_every == 0:
            print(f"Epoch {epoch:3d}: Loss = {avg_loss:.4f}")
            
            # Generate sample
            sample_idx = model.sample(data[0], 100, temperature=0.8)
            sample_text = ''.join([idx_to_char[i] for i in sample_idx])
            print(f"  Sample: {sample_text[:60]}...\n")
    
    return losses

In [None]:
# Create and train model
model = CharRNN(vocab_size=vocab_size, hidden_dim=100)

print("Training RNN...\n")
losses = train_rnn(model, data, seq_length=25, epochs=200, lr=0.1, print_every=20)

In [None]:
# Plot training loss
plt.figure(figsize=(10, 5))
plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss')
plt.grid(True)
plt.show()

## 5. Generate Text

In [None]:
# Generate with different temperatures
print("Generated text at different temperatures:\n")

for temp in [0.5, 0.8, 1.0, 1.5]:
    seed = char_to_idx['T']
    sample_idx = model.sample(seed, 150, temperature=temp)
    sample_text = ''.join([idx_to_char[i] for i in sample_idx])
    print(f"Temperature {temp}:")
    print(f"  {sample_text}")
    print()

**Temperature effects:**
- **Low (0.5):** More conservative, repeats common patterns
- **Medium (0.8-1.0):** Balanced creativity and coherence
- **High (1.5):** More random, might produce nonsense
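
The effect of temperature can be seen directly on a fixed set of logits, without a trained model. The sketch below uses made-up logits `[2.0, 1.0, 0.0]` and two hypothetical helpers, `softmax_with_temperature` and `entropy`, to show how the distribution sharpens at low temperature and flattens at high temperature.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = np.asarray(logits, dtype=float) / temperature
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -np.sum(p * np.log(p))

logits = [2.0, 1.0, 0.0]  # made-up logits for illustration
results = {}
for temp in [0.5, 0.8, 1.0, 1.5]:
    p = softmax_with_temperature(logits, temp)
    results[temp] = p
    print(f"T={temp}: probs={np.round(p, 3)}, entropy={entropy(p):.3f}")
```

As the temperature rises, the largest probability shrinks and the entropy grows, which is why high-temperature samples look more random.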

## 6. Visualizing Hidden States

Let's see what the RNN "remembers" as it processes text.

In [None]:
# Process a sentence and visualize hidden states
test_text = "To be, or not to be"
test_data = [char_to_idx[ch] for ch in test_text]

# Forward pass
probs, hidden_states = model.forward(test_data)

# Extract hidden states into array
H = np.array([hidden_states[t] for t in range(len(test_data))])

# Plot
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Heatmap of hidden states
ax = axes[0]
im = ax.imshow(H.T, aspect='auto', cmap='RdBu_r', vmin=-1, vmax=1)
ax.set_xlabel('Time step')
ax.set_ylabel('Hidden unit')
ax.set_title('Hidden State Evolution')
ax.set_xticks(range(len(test_text)))
ax.set_xticklabels(list(test_text))
plt.colorbar(im, ax=ax)

# Plot a few hidden units over time
ax = axes[1]
for i in [0, 10, 20, 30, 40]:
    ax.plot(H[:, i], label=f'Unit {i}')
ax.set_xlabel('Time step')
ax.set_ylabel('Activation')
ax.set_title('Selected Hidden Units Over Time')
ax.set_xticks(range(len(test_text)))
ax.set_xticklabels(list(test_text))
ax.legend()
ax.grid(True)

plt.tight_layout()
plt.show()

## 7. The Vanishing Gradient Problem

Let's visualize how gradients decay over long sequences.

In [None]:
def measure_gradient_flow(model, data, seq_lengths):
    """
    Measure how gradients flow back through different sequence lengths.
    """
    gradient_norms = []
    
    for seq_len in seq_lengths:
        if seq_len >= len(data):
            break
            
        inputs = data[:seq_len]
        targets = data[1:seq_len+1]
        
        # Forward
        model.forward(inputs)
        
        # Backward pass, recording the gradient magnitude at every timestep
        seq_len_actual = len(targets)
        
        # Note: each step also adds that step's own loss gradient (do @ W_hy.T)
        dh_next = np.zeros(model.hidden_dim)
        grad_magnitudes = []
        
        for t in reversed(range(seq_len_actual)):
            do = model.ps[t].copy()
            do[targets[t]] -= 1
            
            dh = do @ model.W_hy.T + dh_next
            dh_raw = dh * (1 - model.hs[t]**2)
            
            grad_magnitudes.append(np.linalg.norm(dh_raw))
            
            dh_next = dh_raw @ model.W_hh.T
        
        # Reverse to get chronological order
        grad_magnitudes = grad_magnitudes[::-1]
        gradient_norms.append(grad_magnitudes)
    
    return gradient_norms

In [None]:
# Measure gradient flow for different sequence lengths
seq_lengths = [10, 25, 50, 100]
gradient_flows = measure_gradient_flow(model, data, seq_lengths)

# Plot
fig, ax = plt.subplots(figsize=(12, 6))

for seq_len, grads in zip(seq_lengths[:len(gradient_flows)], gradient_flows):
    # Normalize to show relative decay
    grads = np.array(grads)
    if grads[-1] > 0:
        grads = grads / grads[-1]  # Normalize to final (most recent) gradient
    ax.plot(range(len(grads)), grads, label=f'Seq length {seq_len}')

ax.set_xlabel('Position in sequence (from start)')
ax.set_ylabel('Relative gradient magnitude')
ax.set_title('Gradient Flow Through Time (Vanishing Gradient Problem)')
ax.legend()
ax.set_yscale('log')
ax.grid(True)
plt.show()

print("\nGradient at position 0 vs position -1 (final):")
for seq_len, grads in zip(seq_lengths[:len(gradient_flows)], gradient_flows):
    if len(grads) > 0 and grads[-1] > 0:
        ratio = grads[0] / grads[-1]
        print(f"  Seq {seq_len}: {ratio:.2e}")

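**Key insight:** Each step back through time multiplies the gradient by the recurrent weights $W_{hh}$ and by the tanh derivative $1 - h_t^2 \le 1$, so the gradient typically shrinks roughly exponentially with distance. For long sequences, the gradient reaching early timesteps becomes vanishingly small, making it hard to learn long-range dependencies.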
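
The measurement above adds a fresh loss gradient at every timestep, which can mask the decay. To isolate the effect, the sketch below injects a gradient only at the final hidden state and tracks its norm as it flows backwards. It is a standalone illustration under stated assumptions: random inputs, no output layer, and random recurrent weights rescaled to spectral norm 0.9 (not the trained model), so that every backward step is guaranteed to shrink the gradient by at least that factor.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, seq_len = 32, 50  # illustrative sizes

# Assumption: random weights, with W_hh rescaled so that its largest
# singular value is 0.9.
W_hh = rng.standard_normal((hidden_dim, hidden_dim))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)
W_xh = rng.standard_normal((hidden_dim, hidden_dim)) / np.sqrt(hidden_dim)

# Forward pass on random inputs, storing every hidden state.
h = np.zeros(hidden_dim)
hs = []
for _ in range(seq_len):
    x = rng.standard_normal(hidden_dim)
    h = np.tanh(x @ W_xh + h @ W_hh)
    hs.append(h)

# Backward pass: a gradient enters only at the final hidden state.
dh = rng.standard_normal(hidden_dim)
norms = []
for t in reversed(range(seq_len)):
    dz = dh * (1 - hs[t] ** 2)   # through tanh (factor <= 1)
    norms.append(np.linalg.norm(dz))
    dh = dz @ W_hh.T             # back to the previous hidden state
norms = norms[::-1]              # chronological order

print(f"Gradient norm at last step:  {norms[-1]:.3e}")
print(f"Gradient norm at first step: {norms[0]:.3e}")
print(f"Ratio first/last:            {norms[0] / norms[-1]:.3e}")
```

If the largest singular value of `W_hh` were well above 1 instead, the same product could grow rather than shrink, which is the exploding-gradient case that clipping addresses.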

This limitation motivated LSTMs (next notebook), which use gating to preserve gradient flow, and later transformers (later notebooks), which replace recurrence with attention.

## 8. Summary

| Component | Description |
|-----------|-------------|
| **RNN Cell** | h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h) |
| **Output** | y_t = W_hy @ h_t + b_y |
| **BPTT** | Unroll through time, sum gradients |
| **Gradient clipping** | Prevent exploding gradients |

**Key takeaways:**

1. RNNs process sequences one step at a time, maintaining a hidden state
2. BPTT computes gradients by unrolling the network through time
3. **Vanishing gradients** make it hard to learn long-range dependencies
4. **Gradient clipping** prevents exploding gradients but doesn't solve vanishing
5. Character-level models learn structure (words, punctuation) from raw characters
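
The clipping in `CharRNN.backward` is element-wise: each gradient entry is clamped to [-5, 5], which can change the direction of the update. A commonly used alternative rescales all gradients together so that their combined norm stays below a threshold. The sketch below shows that variant; `clip_by_global_norm` is a hypothetical helper, not something the notebook's model uses, and the gradient values are random placeholders.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so their combined L2 norm is at most max_norm.

    Hypothetical helper for illustration. Returns the rescaled gradients
    and the norm before clipping.
    """
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return {name: g * scale for name, g in grads.items()}, total_norm

rng = np.random.default_rng(0)
# Placeholder gradients, deliberately large so that clipping triggers.
grads = {
    'W_hh': rng.standard_normal((4, 4)) * 10,
    'b_h': rng.standard_normal(4) * 10,
}

clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped.values()))

print(f"Norm before: {norm_before:.3f}")
print(f"Norm after:  {norm_after:.3f}")
```

Because every entry is multiplied by the same factor, the update direction is preserved. Either form only caps large gradients; neither restores gradients that have already vanished.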

**Next:** [06-lstm-from-scratch.ipynb](06-lstm-from-scratch.ipynb) mitigates the vanishing gradient problem with gating.

## 9. Exercises

1. **Deeper RNN:** Add a second hidden layer. Does it help?

2. **Different text:** Train on different text (song lyrics, code, etc.). What patterns does it learn?

3. **Bidirectional:** Implement a bidirectional RNN that processes the sequence both forwards and backwards.

4. **Gradient analysis:** Track which positions in the input have the most influence on the output.

In [None]:
# Exercise 2 starter: Train on different text
new_text = """def hello_world():
    print("Hello, World!")
    return True

def add(a, b):
    return a + b

def multiply(a, b):
    return a * b
"""

# Your code here to train on this Python code