# Module 07: GRU (Gated Recurrent Unit)

**A Simplified Alternative to LSTM**

---

## 1. Objectives

- ✅ Understand GRU architecture
- ✅ Compare GRU vs LSTM
- ✅ Implement GRU from scratch
- ✅ Know when to use which

## 2. Prerequisites

- [Module 06: LSTM](../06_lstm/06_lstm.ipynb)

## 3. Intuition & Motivation

### GRU: LSTM Simplified

| Aspect | LSTM | GRU |
|--------|------|-----|
| Gates | 3 (forget, input, output) | 2 (update, reset) |
| States | 2 (hidden, cell) | 1 (hidden only) |
| Parameters | More (4 weight blocks) | ~25% fewer (3 weight blocks) |
| Performance | Often similar | Often similar |

**Key insight**: GRU merges the LSTM's forget and input gates into a single "update" gate and folds the cell state into the hidden state.
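
The "~25% fewer" figure in the table comes from counting weight blocks: a GRU learns three (update, reset, candidate) where an LSTM learns four (forget, input, output, candidate). Below is a minimal sketch of that arithmetic, assuming one weight matrix over the concatenated `[h, x]` and one bias per block, as in this module's equations. The helper name `gated_rnn_params` is hypothetical.

```python
def gated_rnn_params(input_size: int, hidden_size: int, n_blocks: int) -> int:
    """Parameters for n_blocks gate/candidate blocks, each with a
    (hidden, input + hidden) weight matrix and a (hidden,) bias."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return n_blocks * per_block

gru_params = gated_rnn_params(10, 20, n_blocks=3)   # update, reset, candidate
lstm_params = gated_rnn_params(10, 20, n_blocks=4)  # forget, input, output, candidate
ratio = gru_params / lstm_params
print(f"GRU: {gru_params:,}  LSTM: {lstm_params:,}  ratio: {ratio:.2f}")
```

Library implementations may keep separate input and hidden biases, which changes the absolute counts but not the 3:4 ratio.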

In [None]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

print("Setup complete!")

## 4. Mathematical Foundation

### GRU Equations

**1. Update Gate** - How much of new state to use:
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$

**2. Reset Gate** - How much of the past state feeds into the candidate:
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$

**3. Candidate State** - Proposed new state:
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$$

**4. Final State** - Interpolate old and new:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

### Key Insight

- When $z_t \approx 0$: Keep old state ($h_t \approx h_{t-1}$)
- When $z_t \approx 1$: Use new state ($h_t \approx \tilde{h}_t$)
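
The two limiting cases above can be checked numerically. This is a small sketch of the final-state equation only, with made-up values for the old state and the candidate; the helper name `interpolate` is illustrative and not part of any library.

```python
import numpy as np

h_prev = np.array([1.0, -1.0])   # old state (made-up values)
h_tilde = np.array([0.5, 0.5])   # candidate state (made-up values)

def interpolate(z, h_prev, h_tilde):
    """Final-state equation: h_t = (1 - z) * h_prev + z * h_tilde."""
    return (1 - z) * h_prev + z * h_tilde

h_keep = interpolate(np.array([0.01, 0.01]), h_prev, h_tilde)  # z near 0: stays close to h_prev
h_new = interpolate(np.array([0.99, 0.99]), h_prev, h_tilde)   # z near 1: moves close to h_tilde
h_mixed = interpolate(np.array([0.0, 1.0]), h_prev, h_tilde)   # the gate acts per hidden unit

print(h_keep, h_new, h_mixed)
```

Because $z_t$ is a vector, each hidden unit can make its own keep-or-overwrite decision, as the last call shows.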

## 5. GRU from Scratch

In [None]:
class GRUCell:
    """GRU cell from scratch (NumPy)."""
    
    def __init__(self, input_size: int, hidden_size: int):
        self.hidden_size = hidden_size
        combined_size = input_size + hidden_size
        scale = np.sqrt(2.0 / combined_size)
        
        # Update gate
        self.W_z = np.random.randn(hidden_size, combined_size) * scale
        self.b_z = np.zeros((hidden_size, 1))
        
        # Reset gate
        self.W_r = np.random.randn(hidden_size, combined_size) * scale
        self.b_r = np.zeros((hidden_size, 1))
        
        # Candidate
        self.W_h = np.random.randn(hidden_size, combined_size) * scale
        self.b_h = np.zeros((hidden_size, 1))
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def forward(self, x, h_prev):
        """
        Single GRU step.
        
        Args:
            x: (input_size, 1)
            h_prev: (hidden_size, 1)
        """
        combined = np.vstack([h_prev, x])
        
        # Update gate
        z = self.sigmoid(self.W_z @ combined + self.b_z)
        
        # Reset gate
        r = self.sigmoid(self.W_r @ combined + self.b_r)
        
        # Candidate (reset applied to h_prev)
        combined_reset = np.vstack([r * h_prev, x])
        h_tilde = np.tanh(self.W_h @ combined_reset + self.b_h)
        
        # Final state: interpolate
        h = (1 - z) * h_prev + z * h_tilde
        
        return h, {'z': z, 'r': r, 'h_tilde': h_tilde}

# Test
cell = GRUCell(input_size=10, hidden_size=20)
x = np.random.randn(10, 1)
h = np.zeros((20, 1))

h_new, gates = cell.forward(x, h)
print(f"Input: {x.shape}")
print(f"Hidden: {h_new.shape}")
print(f"Gates: z={gates['z'].mean():.3f}, r={gates['r'].mean():.3f}")

In [None]:
class GRU:
    """Full GRU layer from scratch."""
    
    def __init__(self, input_size: int, hidden_size: int):
        self.cell = GRUCell(input_size, hidden_size)
        self.hidden_size = hidden_size
    
    def forward(self, inputs, h0=None):
        if h0 is None:
            h0 = np.zeros((self.hidden_size, 1))
        
        h = h0
        outputs = []
        
        for x in inputs:
            h, _ = self.cell.forward(x, h)
            outputs.append(h)
        
        return outputs, h

# Test
gru = GRU(input_size=10, hidden_size=20)
seq = [np.random.randn(10, 1) for _ in range(15)]
outputs, h_n = gru.forward(seq)
print(f"Outputs: {len(outputs)}, Final hidden: {h_n.shape}")

## 6. PyTorch Implementation

In [None]:
# PyTorch GRU
gru_pt = nn.GRU(
    input_size=10,
    hidden_size=20,
    num_layers=2,
    batch_first=True,
    dropout=0.1
)

x = torch.randn(32, 15, 10)  # (batch, seq, features)
h0 = torch.zeros(2, 32, 20)  # (layers, batch, hidden)

output, h_n = gru_pt(x, h0)

print(f"Input: {x.shape}")
print(f"Output: {output.shape}")
print(f"h_n: {h_n.shape}")

# Compare parameter counts
lstm = nn.LSTM(10, 20, 2, batch_first=True)
print(f"\nLSTM params: {sum(p.numel() for p in lstm.parameters()):,}")
print(f"GRU params: {sum(p.numel() for p in gru_pt.parameters()):,}")
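
One caveat when comparing the from-scratch cell with PyTorch: PyTorch's documented GRU equations use a slightly different convention. The update gate's roles are swapped ($h_t = (1 - z_t) \odot n_t + z_t \odot h_{t-1}$, so $z_t \approx 1$ keeps the old state), and the reset gate is applied after the hidden-to-hidden product instead of before it. The sketch below recomputes one `nn.GRUCell` step by hand under that convention as a sanity check; the variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.GRUCell(input_size=4, hidden_size=3)
x = torch.randn(2, 4)        # (batch, input_size)
h_prev = torch.randn(2, 3)   # (batch, hidden_size)

# Weights and biases are stacked in the order reset, update, new (candidate).
W_ir, W_iz, W_in = cell.weight_ih.chunk(3, dim=0)
W_hr, W_hz, W_hn = cell.weight_hh.chunk(3, dim=0)
b_ir, b_iz, b_in = cell.bias_ih.chunk(3)
b_hr, b_hz, b_hn = cell.bias_hh.chunk(3)

with torch.no_grad():
    r = torch.sigmoid(x @ W_ir.T + b_ir + h_prev @ W_hr.T + b_hr)
    z = torch.sigmoid(x @ W_iz.T + b_iz + h_prev @ W_hz.T + b_hz)
    # Reset gate scales the already-projected hidden state.
    n = torch.tanh(x @ W_in.T + b_in + r * (h_prev @ W_hn.T + b_hn))
    # Here z weights the OLD state, the reverse of this module's equations.
    h_manual = (1 - z) * n + z * h_prev
    h_torch = cell(x, h_prev)

max_diff = (h_manual - h_torch).abs().max().item()
print(f"Max abs difference: {max_diff:.2e}")
```

Both conventions describe the same family of models; only the meaning of a "high" update gate differs, which matters when reading gate activations.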

## 7. GRU vs LSTM Comparison

In [None]:
# Speed comparison (forward pass only)
import time

def benchmark(model, x, h0, n_runs=100):
    """Return the mean forward-pass time in milliseconds."""
    model.eval()  # disable dropout so runs are comparable
    with torch.no_grad():  # inference only: skip autograd bookkeeping
        # Warmup
        for _ in range(10):
            model(x, h0)

        # Wait for pending GPU work so it is not counted in the timing
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x, h0)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs * 1000

x = torch.randn(32, 100, 128)
lstm = nn.LSTM(128, 256, 2, batch_first=True)
gru = nn.GRU(128, 256, 2, batch_first=True)

h0_lstm = (torch.zeros(2, 32, 256), torch.zeros(2, 32, 256))
h0_gru = torch.zeros(2, 32, 256)

lstm_time = benchmark(lstm, x, h0_lstm)
gru_time = benchmark(gru, x, h0_gru)

print(f"LSTM: {lstm_time:.2f} ms")
print(f"GRU: {gru_time:.2f} ms")
print(f"LSTM/GRU time ratio: {lstm_time/gru_time:.2f}x (above 1 means GRU was faster)")

## 8. 🔥 Real-World Usage

### When to Use GRU vs LSTM

| Factor | Choose GRU | Choose LSTM |
|--------|-----------|-------------|
| Model size matters | ✅ Fewer params | |
| Training speed | ✅ Faster | |
| Very long sequences | | ✅ Better memory |
| Default choice | Try both! | Try both! |

### In Practice

- Performance is usually **similar**
- LSTM is **slightly more common** (historical reasons)
- **Try both**, pick what works for your task

## 9. Interview Questions

**Q1: What's the difference between GRU and LSTM?**
<details><summary>Answer</summary>

- GRU has 2 gates (update, reset), LSTM has 3 (forget, input, output)
- GRU has 1 state, LSTM has 2 (hidden + cell)
- GRU has ~25% fewer parameters
- Performance is usually similar
</details>

**Q2: How does GRU handle long-term dependencies?**
<details><summary>Answer</summary>

The update gate z can be close to 0, making h_t ≈ h_{t-1}. This allows information to flow unchanged through time, similar to LSTM's cell state.
</details>
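
To put rough numbers on the answer to Q2: unrolling $h_t = (1 - z) \odot h_{t-1} + z \odot \tilde{h}_t$ with a constant gate value gives the initial state a direct weight of $(1 - z)^T$ after $T$ steps. The sketch below evaluates that weight under the simplifying assumption of a constant scalar gate (a trained GRU varies $z_t$ per step and per unit); `retained_fraction` is a hypothetical helper name.

```python
def retained_fraction(z: float, steps: int) -> float:
    """Direct weight of the initial state after `steps` updates with a
    constant update gate z, from h_t = (1 - z) * h_{t-1} + z * h_tilde."""
    return (1.0 - z) ** steps

for z in (0.01, 0.1, 0.5):
    frac = retained_fraction(z, steps=100)
    print(f"z={z:<4} -> {frac:.3g} of h_0 remains after 100 steps")
```

With a small gate value a meaningful share of the original state survives a hundred steps, while at $z = 0.5$ it is effectively gone, which is why learning to keep $z_t$ near 0 matters for long-range information.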

## 10. Summary

- **GRU**: Simplified LSTM with 2 gates (update, reset)
- **Update gate (z)**: Interpolate old and new state
- **Reset gate (r)**: Control how much past to use
- **Equation**: $h_t = (1-z) \odot h_{t-1} + z \odot \tilde{h}_t$
- **Practice**: ~25% fewer params, often similar performance

## 11. Exercises

1. Compare GRU vs LSTM on sentiment classification
2. Visualize gate activations for both models
3. Implement backward pass for GRU

## 12. References

- [GRU Paper (2014)](https://arxiv.org/abs/1406.1078)
- [Empirical Evaluation of Gated RNNs](https://arxiv.org/abs/1412.3555)
- [PyTorch GRU Docs](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html)

---
**Next:** [Module 08: Bidirectional & Deep RNNs](../08_bidirectional_deep_rnns/08_bidirectional_deep_rnns.ipynb)