# Day 14: LSTM and GRU Networks

## Phase 2: NLP Basics (Days 11-20)

**Estimated Time: 3-4 hours**

### Learning Objectives
- Understand gating mechanisms and their role in mitigating vanishing gradients
- Implement LSTM (Long Short-Term Memory) from scratch
- Master the LSTM cell equations and information flow
- Implement GRU (Gated Recurrent Unit) architecture
- Compare LSTM, GRU, and vanilla RNN performance
- Build sequence models for real-world tasks
- Apply advanced RNN techniques: stacking, dropout, attention

### Prerequisites
- Day 13: Recurrent Neural Networks
- Understanding of vanishing gradient problem
- Neural network fundamentals
- Linear algebra and calculus

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
torch.manual_seed(42)
plt.style.use('seaborn-v0_8-darkgrid')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print("Libraries loaded successfully!")

## 1. The Gating Mechanism

### 1.1 Motivation: Why Gates?

Recall from Day 13:
- Vanilla RNNs suffer from vanishing/exploding gradients
- As a result, they struggle to learn long-range dependencies
- The influence of early inputs tends to decay exponentially over time

**Key Insight**: What if we could *control* information flow?
- Selectively **remember** important information
- Selectively **forget** irrelevant information
- Allow gradients to flow unimpeded through time

### 1.2 Gates as Learned Switches

A **gate** is a sigmoid-activated vector that controls information flow:

$$g = \sigma(W_g \cdot [h_{t-1}, x_t] + b_g) \in (0, 1)^d$$

- Values near 0: Block information
- Values near 1: Pass information through
- Element-wise multiplication controls each dimension independently

**Analogy**: Like a water pipe with adjustable valves at each position.
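
As a minimal sketch of the formula above, a gate is a sigmoid over an affine map of the concatenated state and input, applied element-wise to whatever it filters. The sizes, random weights, and variable names here are illustrative assumptions, not values used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; leaves the global seed untouched

hidden_dim, input_dim = 4, 3            # illustrative sizes (assumed)
h_prev = rng.standard_normal((hidden_dim, 1))
x_t = rng.standard_normal((input_dim, 1))

# Gate parameters: one row per gated dimension
W_g = rng.standard_normal((hidden_dim, hidden_dim + input_dim))
b_g = np.zeros((hidden_dim, 1))

concat = np.vstack([h_prev, x_t])                  # [h_{t-1}, x_t]
g = 1.0 / (1.0 + np.exp(-(W_g @ concat + b_g)))    # each entry lies in (0, 1)

candidate = rng.standard_normal((hidden_dim, 1))   # information to be filtered
gated = g * candidate                              # element-wise product

print("gate     :", g.ravel().round(3))
print("candidate:", candidate.ravel().round(3))
print("gated    :", gated.ravel().round(3))
```

Because every gate entry is strictly between 0 and 1, each gated value is a shrunken copy of the corresponding candidate value and never changes sign.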

In [None]:
# Visualize gating mechanism

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Sigmoid function (gate activation)
ax = axes[0, 0]
x = np.linspace(-6, 6, 100)
sigmoid = 1 / (1 + np.exp(-x))
ax.plot(x, sigmoid, 'b-', linewidth=2)
ax.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
ax.fill_between(x, 0, sigmoid, alpha=0.3)
ax.set_xlabel('Input')
ax.set_ylabel('Gate Value')
ax.set_title('Sigmoid Gate Function\n$g = \\sigma(z)$', fontsize=12)
ax.grid(True, alpha=0.3)
ax.text(-4, 0.8, 'BLOCK', fontsize=12, ha='center')
ax.text(4, 0.8, 'PASS', fontsize=12, ha='center')

# 2. Element-wise gating
ax = axes[0, 1]
information = np.array([0.8, -0.5, 0.3, 0.9, -0.2])
gate_values = np.array([0.9, 0.1, 0.8, 0.2, 0.7])
gated_info = information * gate_values

x_pos = np.arange(len(information))
width = 0.25

ax.bar(x_pos - width, information, width, label='Information', color='blue', alpha=0.7)
ax.bar(x_pos, gate_values, width, label='Gate', color='orange', alpha=0.7)
ax.bar(x_pos + width, gated_info, width, label='Gated Output', color='green', alpha=0.7)

ax.set_xticks(x_pos)
ax.set_xticklabels([f'Dim {i+1}' for i in range(len(information))])
ax.set_ylabel('Value')
ax.set_title('Element-wise Gating\n$output = information \\odot gate$', fontsize=12)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3, axis='y')

# 3. Gate controlling information flow
ax = axes[1, 0]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)

# Draw information flow diagram
# Input
rect = plt.Rectangle((1, 4), 2, 2, color='lightblue', alpha=0.8)
ax.add_patch(rect)
ax.text(2, 5, 'Info\n$x$', ha='center', va='center', fontsize=11)

# Gate
circle = plt.Circle((5, 5), 0.8, color='orange', alpha=0.8)
ax.add_patch(circle)
ax.text(5, 5, '$g$', ha='center', va='center', fontsize=14, fontweight='bold')
ax.text(5, 3.5, 'Gate\n$\\sigma(Wx+b)$', ha='center', va='center', fontsize=10)

# Output
rect = plt.Rectangle((7, 4), 2, 2, color='lightgreen', alpha=0.8)
ax.add_patch(rect)
ax.text(8, 5, 'Output\n$g \\odot x$', ha='center', va='center', fontsize=11)

# Arrows
ax.annotate('', xy=(4.2, 5), xytext=(3, 5),
            arrowprops=dict(arrowstyle='->', lw=2))
ax.annotate('', xy=(7, 5), xytext=(5.8, 5),
            arrowprops=dict(arrowstyle='->', lw=2))

ax.set_title('Gate as Information Controller', fontsize=12)
ax.axis('off')

# 4. Why gates help with gradients
ax = axes[1, 1]
time_steps = 50

# Vanilla RNN gradient decay (illustrative: constant factor of 0.9 per step)
vanilla_grad = 0.9 ** np.arange(time_steps)

# Gated RNN (additive interaction allows gradient to flow)
gated_grad = np.ones(time_steps)
for t in range(1, time_steps):
    # Gate value (learned to be close to 1 for important info)
    forget_gate = 0.95 + 0.05 * np.random.rand()
    gated_grad[t] = gated_grad[t-1] * forget_gate

ax.plot(vanilla_grad, 'r-', linewidth=2, label='Vanilla RNN')
ax.plot(gated_grad, 'g-', linewidth=2, label='Gated RNN (LSTM)')
ax.set_xlabel('Time Steps Back')
ax.set_ylabel('Gradient Magnitude (relative)')
ax.set_title('Gradient Flow: Vanilla vs Gated', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_yscale('log')

plt.suptitle('The Gating Mechanism', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 2. LSTM: Long Short-Term Memory

### 2.1 Architecture Overview

LSTM (Hochreiter & Schmidhuber, 1997) introduces:
- **Cell state** $C_t$: Long-term memory (the "conveyor belt")
- **Hidden state** $h_t$: Short-term memory (output)
- **Three gates**: Forget, Input, Output (the forget gate was added to the original design later, by Gers et al., 2000)

### 2.2 LSTM Equations

**Forget Gate**: What to forget from cell state
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

**Input Gate**: What new information to store
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

**Candidate Cell State**: New information to potentially add
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

**Cell State Update**: Combine old and new
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

**Output Gate**: What to output from cell state
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

**Hidden State**: Filtered cell state
$$h_t = o_t \odot \tanh(C_t)$$
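
Implementations commonly compute the four affine maps with a single matrix multiply by stacking the weight matrices. The sketch below checks that the stacked form gives the same pre-activations as four separate products. Names such as `W_all` are hypothetical, the gate ordering is an arbitrary choice, and biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4            # illustrative sizes (assumed)
concat_dim = hidden_dim + input_dim

# Separate weights for forget, input, candidate, and output (ordering assumed)
W_f, W_i, W_c, W_o = (rng.standard_normal((hidden_dim, concat_dim)) for _ in range(4))
concat = rng.standard_normal((concat_dim, 1))   # stands in for [h_{t-1}, x_t]

# Stack into one [4 * hidden_dim, concat_dim] matrix and multiply once
W_all = np.vstack([W_f, W_i, W_c, W_o])
z_f, z_i, z_c, z_o = np.split(W_all @ concat, 4, axis=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f_t, i_t, o_t = sigmoid(z_f), sigmoid(z_i), sigmoid(z_o)
c_tilde = np.tanh(z_c)

# Same update equations as above
c_prev = rng.standard_normal((hidden_dim, 1))
c_t = f_t * c_prev + i_t * c_tilde
h_t = o_t * np.tanh(c_t)
print(h_t.ravel().round(3))
```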

### 2.3 Key Insight: Additive Updates

The cell state update:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

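This update is **additive** (like ResNet skip connections): the old cell state is scaled element-wise by $f_t$ and new content is added, rather than the state being pushed through a weight matrix and nonlinearity at every step.
- Along the direct cell-state path, $\partial C_t / \partial C_{t-1} = f_t$ (element-wise)
- When $f_t \approx 1$, the gradient flows through nearly unchanged
- This greatly mitigates the vanishing gradient problem, although it does not eliminate it: $f_t < 1$ still causes decay, and gradients can still explode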
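
To make the gradient claim concrete, here is a small numerical check. It is a sketch under a simplifying assumption: the gate values are held fixed, so only the direct cell-state path is measured and the indirect paths through $h_{t-1}$ are ignored. Under that assumption, $\partial C_T / \partial C_0 = \prod_t f_t$ element-wise. The helper name `run_cell_path` and the chosen gate ranges are hypothetical.

```python
import numpy as np

def run_cell_path(c0, f_gates, i_gates, candidates):
    """Apply C_t = f_t * C_{t-1} + i_t * C~_t with fixed (given) gate values."""
    c = c0
    for f_t, i_t, c_tilde in zip(f_gates, i_gates, candidates):
        c = f_t * c + i_t * c_tilde
    return c

rng = np.random.default_rng(0)
T, d = 50, 3
f_gates = rng.uniform(0.95, 1.0, size=(T, d))   # forget gates close to 1
i_gates = rng.uniform(0.0, 1.0, size=(T, d))
candidates = np.tanh(rng.standard_normal((T, d)))

c0 = np.zeros(d)
eps = 1e-6
# Finite-difference derivative of C_T with respect to C_0.
# Dimensions do not interact, so one shared shift gives the per-dimension derivative.
numeric = (run_cell_path(c0 + eps, f_gates, i_gates, candidates)
           - run_cell_path(c0, f_gates, i_gates, candidates)) / eps
analytic = f_gates.prod(axis=0)                 # product of forget gates

print("numeric  dC_T/dC_0:", numeric.round(4))
print("analytic prod(f_t):", analytic.round(4))
print("0.9 ** 50 for comparison:", 0.9 ** T)
```

With forget gates in $[0.95, 1)$ the product over 50 steps stays well above the $0.9^{50}$ factor used as the vanilla-RNN illustration.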

In [None]:
class LSTMCell:
    """
    LSTM Cell implementation from scratch.
    
    Implements the full LSTM equations.
    """
    
    def __init__(self, input_dim, hidden_dim):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        
        # Concatenated input size
        concat_dim = input_dim + hidden_dim
        
        # Initialize weights (Xavier)
        scale = np.sqrt(2.0 / (concat_dim + hidden_dim))
        
        # Forget gate
        # Bias is initialized to 1 (important!): this encourages
        # remembering at the start of training
        self.W_f = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_f = np.ones((hidden_dim, 1))
        
        # Input gate
        self.W_i = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_i = np.zeros((hidden_dim, 1))
        
        # Candidate cell state
        self.W_c = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_c = np.zeros((hidden_dim, 1))
        
        # Output gate
        self.W_o = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_o = np.zeros((hidden_dim, 1))
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def forward(self, x_t, h_prev, c_prev):
        """
        Forward pass through LSTM cell.
        
        x_t: [input_dim, 1]
        h_prev: [hidden_dim, 1]
        c_prev: [hidden_dim, 1]
        
        Returns: h_t, c_t, cache
        """
        # Concatenate input and previous hidden state
        concat = np.vstack([h_prev, x_t])  # [hidden_dim + input_dim, 1]
        
        # Forget gate
        f_t = self.sigmoid(self.W_f @ concat + self.b_f)
        
        # Input gate
        i_t = self.sigmoid(self.W_i @ concat + self.b_i)
        
        # Candidate cell state
        c_tilde = np.tanh(self.W_c @ concat + self.b_c)
        
        # New cell state
        c_t = f_t * c_prev + i_t * c_tilde
        
        # Output gate
        o_t = self.sigmoid(self.W_o @ concat + self.b_o)
        
        # New hidden state
        h_t = o_t * np.tanh(c_t)
        
        # Cache for backprop
        cache = {
            'x_t': x_t,
            'h_prev': h_prev,
            'c_prev': c_prev,
            'concat': concat,
            'f_t': f_t,
            'i_t': i_t,
            'c_tilde': c_tilde,
            'c_t': c_t,
            'o_t': o_t,
            'h_t': h_t
        }
        
        return h_t, c_t, cache

# Test LSTM cell
print("LSTM Cell Implementation")
print("="*50)

input_dim = 10
hidden_dim = 20

lstm_cell = LSTMCell(input_dim, hidden_dim)

# Initialize states
h_prev = np.zeros((hidden_dim, 1))
c_prev = np.zeros((hidden_dim, 1))
x_t = np.random.randn(input_dim, 1)

# Forward pass
h_t, c_t, cache = lstm_cell.forward(x_t, h_prev, c_prev)

print(f"Input dimension: {input_dim}")
print(f"Hidden dimension: {hidden_dim}")
print(f"\nInput x_t shape: {x_t.shape}")
print(f"Previous h shape: {h_prev.shape}")
print(f"Previous c shape: {c_prev.shape}")
print(f"\nNew h_t shape: {h_t.shape}")
print(f"New c_t shape: {c_t.shape}")

print(f"\nGate values (mean):")
print(f"  Forget gate: {cache['f_t'].mean():.4f}")
print(f"  Input gate: {cache['i_t'].mean():.4f}")
print(f"  Output gate: {cache['o_t'].mean():.4f}")

In [None]:
# Visualize LSTM cell architecture

fig, ax = plt.subplots(figsize=(14, 10))

# Main cell outline
cell_rect = plt.Rectangle((2, 2), 10, 8, fill=False, edgecolor='black', linewidth=2)
ax.add_patch(cell_rect)
ax.text(7, 10.5, 'LSTM Cell', ha='center', va='center', fontsize=16, fontweight='bold')

# Cell state line (conveyor belt)
ax.plot([0, 14], [8.5, 8.5], 'b-', linewidth=3)
ax.text(0.5, 9, '$C_{t-1}$', fontsize=12, fontweight='bold')
ax.text(13.5, 9, '$C_t$', fontsize=12, fontweight='bold')

# Hidden state input/output
ax.plot([0, 2], [4, 4], 'g-', linewidth=2)
ax.plot([12, 14], [4, 4], 'g-', linewidth=2)
ax.text(0.5, 4.5, '$h_{t-1}$', fontsize=12, fontweight='bold')
ax.text(13, 4.5, '$h_t$', fontsize=12, fontweight='bold')

# Input
ax.plot([7, 7], [0, 2], 'orange', linewidth=2)
ax.text(7, 0.5, '$x_t$', ha='center', fontsize=12, fontweight='bold')

# Forget gate (×)
forget_gate = plt.Circle((4, 8.5), 0.5, color='red', alpha=0.8)
ax.add_patch(forget_gate)
ax.text(4, 8.5, '×', ha='center', va='center', fontsize=16, color='white', fontweight='bold')
ax.text(4, 7.3, '$f_t$', ha='center', fontsize=11)
ax.text(4, 6.8, 'Forget', ha='center', fontsize=10)

# Input gate (×)
input_gate = plt.Circle((7, 8.5), 0.5, color='green', alpha=0.8)
ax.add_patch(input_gate)
ax.text(7, 8.5, '×', ha='center', va='center', fontsize=16, color='white', fontweight='bold')
ax.text(7, 7.3, '$i_t$', ha='center', fontsize=11)
ax.text(7, 6.8, 'Input', ha='center', fontsize=10)

# Addition
add_circle = plt.Circle((5.5, 8.5), 0.4, color='purple', alpha=0.8)
ax.add_patch(add_circle)
ax.text(5.5, 8.5, '+', ha='center', va='center', fontsize=16, color='white', fontweight='bold')

# Output gate (×)
output_gate = plt.Circle((10, 4), 0.5, color='blue', alpha=0.8)
ax.add_patch(output_gate)
ax.text(10, 4, '×', ha='center', va='center', fontsize=16, color='white', fontweight='bold')
ax.text(10, 2.8, '$o_t$', ha='center', fontsize=11)
ax.text(10, 2.3, 'Output', ha='center', fontsize=10)

# Tanh for candidate
tanh1 = plt.Rectangle((6.3, 5.5), 1.4, 0.8, color='yellow', alpha=0.8)
ax.add_patch(tanh1)
ax.text(7, 5.9, 'tanh', ha='center', va='center', fontsize=10)
ax.text(7, 4.8, '$\\tilde{C}_t$', ha='center', fontsize=11)

# Tanh for output
tanh2 = plt.Rectangle((9.3, 6.5), 1.4, 0.8, color='yellow', alpha=0.8)
ax.add_patch(tanh2)
ax.text(10, 6.9, 'tanh', ha='center', va='center', fontsize=10)

# Sigma boxes for gates
for x_pos, name in [(4, 'σ'), (7, 'σ'), (10, 'σ')]:
    sigma_box = plt.Rectangle((x_pos-0.4, 3.6), 0.8, 0.8, color='lightgray', alpha=0.8)
    ax.add_patch(sigma_box)
    ax.text(x_pos, 4, name, ha='center', va='center', fontsize=12)

# Connections (simplified)
# From concat to gates
ax.plot([3, 4], [3, 3.6], 'k-', linewidth=1)
ax.plot([3, 7], [3, 3.6], 'k-', linewidth=1)
ax.plot([3, 7], [3, 5.5], 'k-', linewidth=1)
ax.plot([3, 10], [3, 3.6], 'k-', linewidth=1)

# Gates to operations
ax.plot([4, 4], [4.4, 8], 'k-', linewidth=1)
ax.plot([7, 7], [4.4, 8], 'k-', linewidth=1)
ax.plot([7, 7], [6.3, 8], 'k-', linewidth=1)
ax.plot([10, 10], [4.5, 6.5], 'k-', linewidth=1)
ax.plot([10, 12], [4.5, 4], 'k-', linewidth=1)

# Cell state connections
ax.plot([10, 10], [7.3, 8.5], 'b-', linewidth=2)

# Equations on the side
equations = [
    r'$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$',
    r'$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$',
    r'$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$',
    r'$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$',
    r'$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$',
    r'$h_t = o_t \odot \tanh(C_t)$'
]

for i, eq in enumerate(equations):
    ax.text(15, 9 - i*1.2, eq, fontsize=11, va='center')

ax.set_xlim(-1, 25)
ax.set_ylim(-0.5, 11)
ax.axis('off')

plt.tight_layout()
plt.show()

## 3. Complete LSTM Network

### 3.1 Stacking LSTM Cells Over Time

In [None]:
class LSTMNetwork:
    """
    Complete LSTM network for sequence processing.
    """
    
    def __init__(self, input_dim, hidden_dim, output_dim):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        
        # LSTM cell
        self.lstm_cell = LSTMCell(input_dim, hidden_dim)
        
        # Output layer
        scale = np.sqrt(2.0 / (hidden_dim + output_dim))
        self.W_y = np.random.randn(output_dim, hidden_dim) * scale
        self.b_y = np.zeros((output_dim, 1))
        
        # Cache for backprop
        self.caches = []
    
    def forward(self, inputs, h_prev=None, c_prev=None):
        """
        Forward pass through entire sequence.
        
        inputs: list of input vectors
        """
        if h_prev is None:
            h_prev = np.zeros((self.hidden_dim, 1))
        if c_prev is None:
            c_prev = np.zeros((self.hidden_dim, 1))
        
        self.caches = []
        outputs = []
        hidden_states = []
        cell_states = []
        
        h_t, c_t = h_prev, c_prev
        
        for t, x_t in enumerate(inputs):
            h_t, c_t, cache = self.lstm_cell.forward(x_t, h_t, c_t)
            
            # Output
            y_t = self.W_y @ h_t + self.b_y
            
            self.caches.append(cache)
            outputs.append(y_t)
            hidden_states.append(h_t)
            cell_states.append(c_t)
        
        return outputs, hidden_states, cell_states
    
    def softmax(self, x):
        exp_x = np.exp(x - np.max(x))
        return exp_x / np.sum(exp_x)
    
    def compute_loss(self, outputs, targets):
        """Compute cross-entropy loss."""
        loss = 0
        self.probs = []
        
        for y_t, target in zip(outputs, targets):
            probs = self.softmax(y_t)
            self.probs.append(probs)
            loss += -np.log(probs[target, 0] + 1e-10)
        
        return loss / len(outputs)

# Test LSTM network
print("LSTM Network Test")
print("="*50)

input_dim = 10
hidden_dim = 32
output_dim = 10
seq_length = 20

lstm_net = LSTMNetwork(input_dim, hidden_dim, output_dim)

# Create sequence
inputs = [np.random.randn(input_dim, 1) for _ in range(seq_length)]
targets = [np.random.randint(0, output_dim) for _ in range(seq_length)]

# Forward pass
outputs, hidden_states, cell_states = lstm_net.forward(inputs)
loss = lstm_net.compute_loss(outputs, targets)

print(f"Sequence length: {seq_length}")
print(f"Loss: {loss:.4f}")
print(f"Number of outputs: {len(outputs)}")
print(f"Output shape: {outputs[0].shape}")

# Analyze gate behavior
forget_gates = [cache['f_t'].mean() for cache in lstm_net.caches]
input_gates = [cache['i_t'].mean() for cache in lstm_net.caches]
output_gates = [cache['o_t'].mean() for cache in lstm_net.caches]

plt.figure(figsize=(12, 4))
plt.plot(forget_gates, 'r-', label='Forget Gate', linewidth=2)
plt.plot(input_gates, 'g-', label='Input Gate', linewidth=2)
plt.plot(output_gates, 'b-', label='Output Gate', linewidth=2)
plt.xlabel('Time Step')
plt.ylabel('Mean Gate Value')
plt.title('LSTM Gate Activations Over Time')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim([0, 1])
plt.show()

## 4. GRU: Gated Recurrent Unit

### 4.1 Simplified Gating

GRU (Cho et al., 2014) simplifies LSTM:
- Combines forget and input gates into **update gate**
- Merges cell state and hidden state
- Fewer parameters (three weight blocks instead of four), often with comparable performance

### 4.2 GRU Equations

**Reset Gate**: How much of the previous state to use when forming the candidate
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$

**Update Gate**: How much to update state
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$

**Candidate Hidden State**: New state proposal
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$$

**Final Hidden State**: Interpolation between old and new
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

**Key Insight**: The update gate creates a direct connection to the past:
- When $z_t \approx 0$: $h_t \approx h_{t-1}$ (copy forward)
- When $z_t \approx 1$: $h_t \approx \tilde{h}_t$ (use new info)
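
The two limiting cases can be checked directly. The sketch below uses hand-picked values (assumptions for illustration) rather than a trained cell, and the helper name `gru_interpolate` is hypothetical.

```python
import numpy as np

h_prev = np.array([[0.5], [-0.3], [0.8]])     # previous hidden state
h_tilde = np.array([[-0.9], [0.4], [0.1]])    # candidate hidden state

def gru_interpolate(z_t, h_prev, h_tilde):
    """h_t = (1 - z_t) * h_{t-1} + z_t * h~_t, applied element-wise."""
    return (1.0 - z_t) * h_prev + z_t * h_tilde

z_keep = np.zeros((3, 1))                     # z_t = 0: copy the old state forward
z_replace = np.ones((3, 1))                   # z_t = 1: overwrite with the candidate
z_mixed = np.array([[0.0], [1.0], [0.5]])     # independent decision per dimension

print(gru_interpolate(z_keep, h_prev, h_tilde).ravel())
print(gru_interpolate(z_replace, h_prev, h_tilde).ravel())
print(gru_interpolate(z_mixed, h_prev, h_tilde).ravel())
```

Note that conventions differ between sources: PyTorch's `nn.GRU` documentation writes the update with $z_t$ weighting the old state instead. The two forms are equivalent up to relabeling $z_t \leftrightarrow 1 - z_t$.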

In [None]:
class GRUCell:
    """
    GRU Cell implementation from scratch.
    """
    
    def __init__(self, input_dim, hidden_dim):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        
        concat_dim = input_dim + hidden_dim
        scale = np.sqrt(2.0 / (concat_dim + hidden_dim))
        
        # Reset gate
        self.W_r = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_r = np.zeros((hidden_dim, 1))
        
        # Update gate
        self.W_z = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_z = np.zeros((hidden_dim, 1))
        
        # Candidate hidden state
        self.W_h = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_h = np.zeros((hidden_dim, 1))
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def forward(self, x_t, h_prev):
        """
        Forward pass through GRU cell.
        """
        concat = np.vstack([h_prev, x_t])
        
        # Reset gate
        r_t = self.sigmoid(self.W_r @ concat + self.b_r)
        
        # Update gate
        z_t = self.sigmoid(self.W_z @ concat + self.b_z)
        
        # Candidate hidden state
        reset_concat = np.vstack([r_t * h_prev, x_t])
        h_tilde = np.tanh(self.W_h @ reset_concat + self.b_h)
        
        # Final hidden state
        h_t = (1 - z_t) * h_prev + z_t * h_tilde
        
        cache = {
            'x_t': x_t,
            'h_prev': h_prev,
            'r_t': r_t,
            'z_t': z_t,
            'h_tilde': h_tilde,
            'h_t': h_t
        }
        
        return h_t, cache

# Test GRU cell
print("GRU Cell Implementation")
print("="*50)

gru_cell = GRUCell(input_dim, hidden_dim)

h_prev = np.zeros((hidden_dim, 1))
x_t = np.random.randn(input_dim, 1)

h_t, cache = gru_cell.forward(x_t, h_prev)

print(f"Input dimension: {input_dim}")
print(f"Hidden dimension: {hidden_dim}")
print(f"\nGate values (mean):")
print(f"  Reset gate: {cache['r_t'].mean():.4f}")
print(f"  Update gate: {cache['z_t'].mean():.4f}")

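# Compare parameter counts for the from-scratch cells above
# (PyTorch's nn.LSTM / nn.GRU keep two bias vectors per gate, so their counts are slightly higher)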
lstm_params = 4 * hidden_dim * (input_dim + hidden_dim) + 4 * hidden_dim
gru_params = 3 * hidden_dim * (input_dim + hidden_dim) + 3 * hidden_dim

print(f"\nParameter comparison:")
print(f"  LSTM parameters: {lstm_params:,}")
print(f"  GRU parameters: {gru_params:,}")
print(f"  GRU reduction: {(1 - gru_params/lstm_params)*100:.1f}%")

In [None]:
# Visualize GRU architecture

fig, ax = plt.subplots(figsize=(12, 8))

# Main cell
cell_rect = plt.Rectangle((2, 2), 8, 6, fill=False, edgecolor='black', linewidth=2)
ax.add_patch(cell_rect)
ax.text(6, 8.5, 'GRU Cell', ha='center', va='center', fontsize=16, fontweight='bold')

# Hidden state line
ax.plot([0, 12], [6, 6], 'g-', linewidth=3)
ax.text(0.5, 6.5, '$h_{t-1}$', fontsize=12, fontweight='bold')
ax.text(11.5, 6.5, '$h_t$', fontsize=12, fontweight='bold')

# Input
ax.plot([6, 6], [0, 2], 'orange', linewidth=2)
ax.text(6, 0.5, '$x_t$', ha='center', fontsize=12, fontweight='bold')

# Update gate multiply
update_mult1 = plt.Circle((4, 6), 0.4, color='purple', alpha=0.8)
ax.add_patch(update_mult1)
ax.text(4, 6, '×', ha='center', va='center', fontsize=14, color='white', fontweight='bold')
ax.text(4, 4.8, '$1-z_t$', ha='center', fontsize=10)

# Update gate multiply 2
update_mult2 = plt.Circle((8, 6), 0.4, color='green', alpha=0.8)
ax.add_patch(update_mult2)
ax.text(8, 6, '×', ha='center', va='center', fontsize=14, color='white', fontweight='bold')
ax.text(8, 4.8, '$z_t$', ha='center', fontsize=10)

# Addition
add_circle = plt.Circle((10, 6), 0.3, color='blue', alpha=0.8)
ax.add_patch(add_circle)
ax.text(10, 6, '+', ha='center', va='center', fontsize=12, color='white', fontweight='bold')

# Candidate hidden
tanh_box = plt.Rectangle((7.3, 3.5), 1.4, 0.8, color='yellow', alpha=0.8)
ax.add_patch(tanh_box)
ax.text(8, 3.9, 'tanh', ha='center', va='center', fontsize=10)
ax.text(8, 2.8, '$\\tilde{h}_t$', ha='center', fontsize=11)

# Reset gate effect
reset_mult = plt.Circle((3, 4), 0.4, color='red', alpha=0.8)
ax.add_patch(reset_mult)
ax.text(3, 4, '×', ha='center', va='center', fontsize=14, color='white', fontweight='bold')
ax.text(3, 2.8, '$r_t$', ha='center', fontsize=11)
ax.text(3, 2.3, 'Reset', ha='center', fontsize=10)

# Sigma boxes
sigma1 = plt.Rectangle((2.6, 5), 0.8, 0.6, color='lightgray', alpha=0.8)
ax.add_patch(sigma1)
ax.text(3, 5.3, 'σ', ha='center', va='center', fontsize=10)

sigma2 = plt.Rectangle((7.6, 5), 0.8, 0.6, color='lightgray', alpha=0.8)
ax.add_patch(sigma2)
ax.text(8, 5.3, 'σ', ha='center', va='center', fontsize=10)

# Equations
equations = [
    r'$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$',
    r'$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$',
    r'$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$',
    r'$h_t = (1-z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$'
]

for i, eq in enumerate(equations):
    ax.text(13, 7 - i*1.2, eq, fontsize=11, va='center')

ax.set_xlim(-1, 22)
ax.set_ylim(-0.5, 9)
ax.axis('off')
ax.set_title('GRU: Simplified Gating (vs LSTM)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

## 5. PyTorch LSTM and GRU

### 5.1 Built-in Modules
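
Before wrapping the built-in modules in a classifier, it helps to see what `nn.LSTM` returns. The sketch below uses arbitrary sizes (assumed for illustration). With `batch_first=True`, `output` holds the top layer's hidden state at every time step, while `h_n` and `c_n` hold the final states for each layer and direction.

```python
import torch
import torch.nn as nn

batch, seq_len, embed_dim, hidden = 4, 7, 8, 16   # illustrative sizes (assumed)
num_layers = 2

lstm = nn.LSTM(embed_dim, hidden, num_layers=num_layers,
               batch_first=True, bidirectional=True)
x = torch.randn(batch, seq_len, embed_dim)

output, (h_n, c_n) = lstm(x)

print(output.shape)  # [batch, seq_len, 2 * hidden]
print(h_n.shape)     # [num_layers * 2, batch, hidden]
print(c_n.shape)     # same shape as h_n

# For the top layer, the forward direction's final state equals output at the
# last time step, and the backward direction's final state equals output at
# the first time step.
fwd_matches = torch.allclose(h_n[-2], output[:, -1, :hidden], atol=1e-6)
bwd_matches = torch.allclose(h_n[-1], output[:, 0, hidden:], atol=1e-6)
print(fwd_matches, bwd_matches)
```

This is why a bidirectional classifier can concatenate `h_n[-2]` and `h_n[-1]`: together they summarize the whole sequence read in both directions.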

In [None]:
# PyTorch LSTM example

class SequenceClassifier(nn.Module):
    """
    Sequence classifier using LSTM/GRU.
    """
    
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes,
                 num_layers=1, rnn_type='LSTM', bidirectional=False, dropout=0.0):
        super().__init__()
        
        self.rnn_type = rnn_type
        self.bidirectional = bidirectional
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        
        # Embedding
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # RNN
        if rnn_type == 'LSTM':
            self.rnn = nn.LSTM(
                embedding_dim, hidden_dim,
                num_layers=num_layers,
                batch_first=True,
                bidirectional=bidirectional,
                dropout=dropout if num_layers > 1 else 0
            )
        elif rnn_type == 'GRU':
            self.rnn = nn.GRU(
                embedding_dim, hidden_dim,
                num_layers=num_layers,
                batch_first=True,
                bidirectional=bidirectional,
                dropout=dropout if num_layers > 1 else 0
            )
        else:
            self.rnn = nn.RNN(
                embedding_dim, hidden_dim,
                num_layers=num_layers,
                batch_first=True,
                bidirectional=bidirectional,
                dropout=dropout if num_layers > 1 else 0
            )
        
        # Output
        fc_input_dim = hidden_dim * 2 if bidirectional else hidden_dim
        self.fc = nn.Linear(fc_input_dim, num_classes)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        # x: [batch, seq_len]
        embedded = self.embedding(x)  # [batch, seq, embed]
        embedded = self.dropout(embedded)
        
        # RNN
        if self.rnn_type == 'LSTM':
            output, (hidden, cell) = self.rnn(embedded)
        else:
            output, hidden = self.rnn(embedded)
        
        # Use last hidden state(s)
        if self.bidirectional:
            # Concatenate forward and backward
            hidden_fwd = hidden[-2]  # Last layer, forward
            hidden_bwd = hidden[-1]  # Last layer, backward
            hidden_final = torch.cat([hidden_fwd, hidden_bwd], dim=1)
        else:
            hidden_final = hidden[-1]  # Last layer
        
        hidden_final = self.dropout(hidden_final)
        logits = self.fc(hidden_final)
        
        return logits

# Compare architectures
vocab_size = 10000
embedding_dim = 128
hidden_dim = 256
num_classes = 2

models = {
    'RNN': SequenceClassifier(vocab_size, embedding_dim, hidden_dim, num_classes, rnn_type='RNN'),
    'LSTM': SequenceClassifier(vocab_size, embedding_dim, hidden_dim, num_classes, rnn_type='LSTM'),
    'GRU': SequenceClassifier(vocab_size, embedding_dim, hidden_dim, num_classes, rnn_type='GRU'),
    'BiLSTM': SequenceClassifier(vocab_size, embedding_dim, hidden_dim, num_classes, 
                                  rnn_type='LSTM', bidirectional=True),
}

print("Model Comparison:")
print("="*60)

for name, model in models.items():
    params = sum(p.numel() for p in model.parameters())
    print(f"{name:10s}: {params:,} parameters")

# Test forward pass
batch_size = 4
seq_len = 50
x = torch.randint(0, vocab_size, (batch_size, seq_len))

print(f"\nInput shape: {x.shape}")
for name, model in models.items():
    output = model(x)
    print(f"{name:10s} output shape: {output.shape}")

## 6. Long-Range Dependency Test

### 6.1 Copying Task

Test: Can the model remember information over long sequences?
- Input: a first token to remember, followed by noise tokens; the model must predict the first token after reading the whole sequence
- Vanilla RNNs typically struggle as the sequence gets longer
- LSTM/GRU should succeed

In [None]:
# Long-range dependency task: Copy first element to last position

def create_copy_task_data(num_samples, seq_length, num_classes=10):
    """
    Create data for copy task.
    Input: [first_char, noise, noise, ..., noise]
    Target: first_char (predict at last position)
    """
    # First character to remember
    first_chars = torch.randint(0, num_classes, (num_samples,))
    
    # Fill with noise (token ids >= num_classes, so they never collide with the targets)
    inputs = torch.randint(num_classes, num_classes + 5, (num_samples, seq_length))
    
    # Set first position to the character to remember
    inputs[:, 0] = first_chars
    
    # Target is the first character
    targets = first_chars
    
    return inputs, targets

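def train_copy_task(model_class, seq_length, num_epochs=100):
    """
    Train a model on the copy task and return its test accuracy.

    model_class: callable accepting vocab_size, embedding_dim, hidden_dim,
    and num_classes as keyword arguments and returning an nn.Module.
    """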
    num_classes = 10
    vocab_size = num_classes + 5  # Extra for noise tokens
    
    model = model_class(
        vocab_size=vocab_size,
        embedding_dim=32,
        hidden_dim=64,
        num_classes=num_classes
    ).to(device)
    
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    
    # Training data
    train_inputs, train_targets = create_copy_task_data(500, seq_length, num_classes)
    train_inputs = train_inputs.to(device)
    train_targets = train_targets.to(device)
    
    # Train
    for epoch in range(num_epochs):
        model.train()
        optimizer.zero_grad()
        
        output = model(train_inputs)
        loss = criterion(output, train_targets)
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5)
        optimizer.step()
    
    # Evaluate
    model.eval()
    test_inputs, test_targets = create_copy_task_data(200, seq_length, num_classes)
    test_inputs = test_inputs.to(device)
    test_targets = test_targets.to(device)
    
    with torch.no_grad():
        output = model(test_inputs)
        predictions = output.argmax(dim=1)
        accuracy = (predictions == test_targets).float().mean().item()
    
    return accuracy

# Test different sequence lengths
print("Long-Range Dependency Test: Copying Task")
print("="*60)
print("Task: Remember first element and predict it at the end")
print()

seq_lengths = [10, 25, 50, 100]
results = {}

for rnn_type in ['RNN', 'GRU', 'LSTM']:
    results[rnn_type] = []
    print(f"Testing {rnn_type}...")
    
    for seq_len in seq_lengths:
        # Create model factory
        def create_model(vocab_size, embedding_dim, hidden_dim, num_classes):
            return SequenceClassifier(
                vocab_size, embedding_dim, hidden_dim, num_classes,
                rnn_type=rnn_type
            )
        
        acc = train_copy_task(create_model, seq_len, num_epochs=150)
        results[rnn_type].append(acc)
        print(f"  Seq length {seq_len}: {acc:.1%}")
    print()

# Plot results
plt.figure(figsize=(10, 6))

for rnn_type, accs in results.items():
    plt.plot(seq_lengths, accs, 'o-', linewidth=2, markersize=10, label=rnn_type)

plt.axhline(y=0.1, color='gray', linestyle='--', alpha=0.5, label='Random guess')
plt.xlabel('Sequence Length (Distance to Remember)', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Long-Range Dependency Test: Copy First Element', fontsize=14, fontweight='bold')
plt.legend(loc='lower left')
plt.grid(True, alpha=0.3)
plt.ylim([0, 1.1])

for rnn_type, accs in results.items():
    for i, (sl, acc) in enumerate(zip(seq_lengths, accs)):
        plt.annotate(f'{acc:.0%}', (sl, acc), textcoords="offset points",
                    xytext=(0, 10), ha='center', fontsize=9)

plt.tight_layout()
plt.show()

print("\nExpected pattern (compare with the accuracies printed above):")
print("- LSTM and GRU tend to keep high accuracy as sequences get longer")
print("- Vanilla RNN tends to degrade as sequence length increases")
print("- Exact numbers depend on the seed, learning rate, and number of epochs")

## 7. Sentiment Analysis with LSTM

### 7.1 Real-World Application

In [None]:
# Simple sentiment analysis example

# Create synthetic sentiment dataset
positive_words = ['good', 'great', 'excellent', 'amazing', 'wonderful', 'fantastic', 'love', 'best', 'happy', 'perfect']
negative_words = ['bad', 'terrible', 'awful', 'horrible', 'worst', 'hate', 'poor', 'disappointing', 'sad', 'wrong']
neutral_words = ['the', 'a', 'is', 'was', 'it', 'this', 'that', 'movie', 'film', 'story']

all_words = positive_words + negative_words + neutral_words
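# Reserve index 0 for padding: pad_sequences fills with 0 and the model uses padding_idx=0
word2idx = {word: i + 1 for i, word in enumerate(all_words)}
vocab_size = len(all_words) + 1  # +1 for the padding index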

def generate_sentiment_data(num_samples=500):
    """Generate synthetic sentiment data."""
    sentences = []
    labels = []
    
    for _ in range(num_samples):
        # Randomly choose sentiment
        if np.random.rand() > 0.5:
            # Positive
            sentiment_words = np.random.choice(positive_words, size=np.random.randint(2, 4))
            label = 1
        else:
            # Negative
            sentiment_words = np.random.choice(negative_words, size=np.random.randint(2, 4))
            label = 0
        
        # Add neutral words
        neutral = np.random.choice(neutral_words, size=np.random.randint(3, 7))
        
        # Combine and shuffle
        sentence = list(neutral) + list(sentiment_words)
        np.random.shuffle(sentence)
        
        # Convert to indices
        indices = [word2idx[w] for w in sentence]
        
        sentences.append(indices)
        labels.append(label)
    
    return sentences, labels

def pad_sequences(sequences, max_len=None):
    """Pad sequences to same length."""
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)
    
    padded = np.zeros((len(sequences), max_len), dtype=np.int64)
    for i, seq in enumerate(sequences):
        padded[i, :len(seq)] = seq
    
    return padded

# Generate data
train_sentences, train_labels = generate_sentiment_data(800)
test_sentences, test_labels = generate_sentiment_data(200)

# Pad
max_len = 15
X_train = torch.tensor(pad_sequences(train_sentences, max_len))
y_train = torch.tensor(train_labels)
X_test = torch.tensor(pad_sequences(test_sentences, max_len))
y_test = torch.tensor(test_labels)

print("Sentiment Analysis Dataset")
print(f"Vocabulary size: {vocab_size}")
print(f"Max sequence length: {max_len}")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

# Example sentences
idx2word = {i: w for w, i in word2idx.items()}
print("\nExample sentences:")
for i in range(3):
    words = [idx2word.get(idx, 'PAD') for idx in train_sentences[i]]
    label = 'Positive' if train_labels[i] == 1 else 'Negative'
    print(f"  {' '.join(words)} -> {label}")

In [None]:
# Train sentiment classifier

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)
        self.dropout = nn.Dropout(0.3)
    
    def forward(self, x):
        embedded = self.dropout(self.embedding(x))
        output, (hidden, cell) = self.lstm(embedded)
        # Concatenate final forward and backward hidden states
        hidden_cat = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return self.fc(self.dropout(hidden_cat))

# Create model
model = SentimentLSTM(vocab_size, embedding_dim=32, hidden_dim=64).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

print("Training Sentiment LSTM...")
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

# Create data loaders
train_dataset = TensorDataset(X_train.to(device), y_train.to(device))
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Training loop
num_epochs = 50
train_losses = []
train_accs = []
test_accs = []

for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    correct = 0
    total = 0
    
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        output = model(batch_x)
        loss = criterion(output, batch_y)
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        predictions = output.argmax(dim=1)
        correct += (predictions == batch_y).sum().item()
        total += batch_y.size(0)
    
    train_losses.append(epoch_loss / len(train_loader))
    train_accs.append(correct / total)
    
    # Test accuracy
    model.eval()
    with torch.no_grad():
        test_output = model(X_test.to(device))
        test_preds = test_output.argmax(dim=1)
        test_acc = (test_preds == y_test.to(device)).float().mean().item()
        test_accs.append(test_acc)
    
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {train_losses[-1]:.4f}, "
              f"Train Acc: {train_accs[-1]:.1%}, Test Acc: {test_acc:.1%}")

# Plot results
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(train_losses)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss')
axes[0].grid(True, alpha=0.3)

axes[1].plot(train_accs, label='Train')
axes[1].plot(test_accs, label='Test')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.suptitle('Sentiment Analysis Training', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print(f"\nFinal Test Accuracy: {test_accs[-1]:.1%}")
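
One caveat about the classifier above: because every sentence is padded to `max_len`, the forward direction's final hidden state is read after the LSTM has stepped through the padding tokens. A minimal sketch of one common remedy, packing the batch so the LSTM stops at each sentence's true length, is shown below. The class name `PackedSentimentLSTM`, the toy batch, and the sizes are hypothetical, and the sketch assumes index 0 is reserved for padding.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class PackedSentimentLSTM(nn.Module):
    """Hypothetical variant of SentimentLSTM that skips padded positions."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x, lengths):
        embedded = self.embedding(x)
        # lengths must live on the CPU; enforce_sorted=False lets PyTorch
        # sort the batch internally and restore the original order afterwards
        packed = pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, (hidden, _) = self.lstm(packed)
        # hidden[-2] / hidden[-1]: final forward / backward states, now taken
        # at the last real token of each sequence rather than after padding
        hidden_cat = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return self.fc(hidden_cat)


# Toy batch (assumption: index 0 is the padding token)
x = torch.tensor([[3, 5, 7, 0, 0],
                  [2, 4, 6, 8, 9]])
lengths = torch.tensor([3, 5])

model = PackedSentimentLSTM(vocab_size=31, embedding_dim=8, hidden_dim=16)
model.eval()
with torch.no_grad():
    logits = model(x, lengths)
print(logits.shape)
```

In this notebook the lengths could be computed as `torch.tensor([len(s) for s in train_sentences])` and carried through the `TensorDataset` alongside the inputs and labels.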

## 8. Advanced Techniques

### 8.1 Stacked (Deep) LSTMs

Multiple LSTM layers for hierarchical feature learning:

```python
nn.LSTM(input_size, hidden_size, num_layers=3)
```

- First layer: Low-level patterns
- Higher layers: Abstract representations
- Dropout between layers (the `dropout` argument of `nn.LSTM`) helps reduce overfitting

### 8.2 Attention Mechanisms (Preview)

Instead of using only the final hidden state, attend to all positions:

$$\text{context} = \sum_t \alpha_t h_t$$

Where attention weights $\alpha_t$ are learned.
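
One simple way to obtain these weights, and the one assumed in this sketch, is to score each hidden state with a learned vector $w$ and bias $b$ (symbols introduced here for illustration) and normalize the scores with a softmax over the $T$ time steps:

$$e_t = w^\top h_t + b, \qquad \alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}$$

This guarantees $\alpha_t \geq 0$ and $\sum_t \alpha_t = 1$. As a small worked case, with $T = 3$ and scores $e = (0, 0, \ln 2)$ the exponentials are $(1, 1, 2)$, so $\alpha = (0.25, 0.25, 0.5)$ and the context is $0.25\,h_1 + 0.25\,h_2 + 0.5\,h_3$.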

This will be covered in detail on Day 15!

In [None]:
# Stacked LSTM example

class DeepLSTM(nn.Module):
    """
    Multi-layer LSTM with dropout.
    """
    
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes, num_layers=3, dropout=0.3):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(
            embedding_dim, hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout,
            bidirectional=True
        )
        self.layer_norm = nn.LayerNorm(hidden_dim * 2)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        embedded = self.dropout(self.embedding(x))
        output, (hidden, cell) = self.lstm(embedded)
        
        # Use the final forward and backward states of the top layer only
        # hidden shape: [num_layers * 2, batch, hidden]
        hidden_cat = torch.cat([hidden[-2], hidden[-1]], dim=1)
        normalized = self.layer_norm(hidden_cat)
        
        return self.fc(self.dropout(normalized))

# Compare single vs multi-layer
print("Deep LSTM Architecture:")
print("="*60)

single_layer = SentimentLSTM(vocab_size, 32, 64)
multi_layer = DeepLSTM(vocab_size, 32, 64, 2, num_layers=3)

print(f"Single layer LSTM params: {sum(p.numel() for p in single_layer.parameters()):,}")
print(f"3-layer LSTM params: {sum(p.numel() for p in multi_layer.parameters()):,}")

print("\n3-Layer LSTM Structure:")
print(multi_layer)
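
To make the indexing in `DeepLSTM.forward` concrete, the sketch below checks the shapes a stacked bidirectional `nn.LSTM` returns. The sizes are hypothetical and chosen only for illustration. The point is that `output` holds the top layer at every time step, while `h_n` holds the final state of every layer and direction, so `h_n[-2]` and `h_n[-1]` belong to the top layer.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only for illustration
batch, seq_len, input_size, hidden_size, num_layers = 4, 10, 8, 16, 3

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers,
               batch_first=True, bidirectional=True)
x = torch.randn(batch, seq_len, input_size)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # [batch, seq_len, 2 * hidden_size]: top layer, all time steps
print(h_n.shape)     # [num_layers * 2, batch, hidden_size]: all layers, last step

# Separate the layer axis from the direction axis to select the top layer
h_n_by_layer = h_n.view(num_layers, 2, batch, hidden_size)
top_forward = h_n_by_layer[-1, 0]   # same tensor values as h_n[-2]
top_backward = h_n_by_layer[-1, 1]  # same tensor values as h_n[-1]

# The forward state ends at the last time step, the backward state at the first
print(torch.allclose(top_forward, output[:, -1, :hidden_size], atol=1e-6))
print(torch.allclose(top_backward, output[:, 0, hidden_size:], atol=1e-6))
```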

In [None]:
# Simple attention mechanism preview

class AttentionLSTM(nn.Module):
    """
    LSTM with simple attention mechanism.
    """
    
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        
        # Attention layer
        self.attention = nn.Linear(hidden_dim * 2, 1)
        
        self.fc = nn.Linear(hidden_dim * 2, num_classes)
    
    def forward(self, x):
        embedded = self.embedding(x)  # [batch, seq, embed]
        
        # LSTM output at each position
        lstm_out, _ = self.lstm(embedded)  # [batch, seq, hidden*2]
        
        # Compute attention scores
        attention_scores = self.attention(lstm_out).squeeze(-1)  # [batch, seq]
        attention_weights = torch.softmax(attention_scores, dim=1)  # [batch, seq]
        
        # Weighted sum (context vector)
        context = torch.bmm(attention_weights.unsqueeze(1), lstm_out)  # [batch, 1, hidden*2]
        context = context.squeeze(1)  # [batch, hidden*2]
        
        # Classify
        return self.fc(context), attention_weights

# Demo attention
print("LSTM with Attention:")
attention_model = AttentionLSTM(vocab_size, 32, 64, 2).to(device)

# Test with sample
sample_x = X_test[:5].to(device)
with torch.no_grad():
    output, attention_weights = attention_model(sample_x)

# Visualize attention for one sample
sample_idx = 0
weights = attention_weights[sample_idx].cpu().numpy()
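words = [idx2word.get(idx, 'PAD') for idx in test_sentences[sample_idx]]
# The weights cover all max_len positions, so label the padded positions too
words += ['PAD'] * (len(weights) - len(words))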

plt.figure(figsize=(12, 3))
plt.bar(range(len(weights)), weights)
plt.xticks(range(len(words)), words, rotation=45, ha='right')
plt.ylabel('Attention Weight')
plt.title(f'Attention Weights\nLabel: {"Positive" if test_labels[sample_idx] == 1 else "Negative"}')
plt.tight_layout()
plt.show()

print("\nAttention allows the model to focus on relevant words!")
print("Note: this model is untrained, so the weights shown are not yet meaningful.")
print("After training, sentiment words should receive higher attention weights.")
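
The attention layer above also assigns weight to padding positions, since the softmax runs over all `max_len` steps. A minimal sketch of one way to avoid that is to set the scores at padded positions to negative infinity before the softmax. The helper name `masked_attention` and the toy tensors are hypothetical, and the sketch assumes index 0 marks padding.

```python
import torch


def masked_attention(lstm_out, scores, pad_mask):
    """
    Hypothetical helper: attention that ignores padded positions.

    lstm_out: [batch, seq, features] LSTM outputs
    scores:   [batch, seq] unnormalized attention scores
    pad_mask: [batch, seq] bool tensor, True where the token is padding
    """
    # exp(-inf) = 0, so masked positions receive exactly zero weight
    scores = scores.masked_fill(pad_mask, float('-inf'))
    weights = torch.softmax(scores, dim=1)
    context = torch.bmm(weights.unsqueeze(1), lstm_out).squeeze(1)
    return context, weights


# Toy batch (assumption: index 0 is the padding token)
x = torch.tensor([[4, 7, 2, 0, 0],
                  [5, 1, 3, 9, 6]])
pad_mask = (x == 0)
lstm_out = torch.randn(2, 5, 8)  # stand-in for real LSTM outputs
scores = torch.randn(2, 5)       # stand-in for self.attention(lstm_out).squeeze(-1)

context, weights = masked_attention(lstm_out, scores, pad_mask)
print(weights[0])  # the last two entries are zero
```

A row consisting entirely of padding would produce NaN weights under this scheme, so the sketch assumes every sequence has at least one real token.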

## 9. LSTM vs GRU: When to Use What

### 9.1 Comparison Table

| Aspect | LSTM | GRU |
|--------|------|-----|
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| Parameters | More (~33% more) | Fewer |
| Training speed | Slower | Faster |
| Memory | More | Less |
| Performance | Often slightly better | Often comparable |
| Long sequences | Better | Good |
| Small datasets | May overfit | Less prone |
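
The parameter row can be checked directly. The sketch below counts the parameters of single-layer PyTorch modules with hypothetical sizes; an LSTM holds four weight blocks (three gates plus the candidate) and a GRU three, against one for a vanilla RNN, which is where the roughly 33% gap between LSTM and GRU comes from.

```python
import torch.nn as nn


def count_params(module):
    """Total number of trainable and non-trainable parameters."""
    return sum(p.numel() for p in module.parameters())


input_size, hidden_size = 32, 64  # hypothetical sizes for illustration

counts = {
    'RNN': count_params(nn.RNN(input_size, hidden_size)),
    'GRU': count_params(nn.GRU(input_size, hidden_size)),
    'LSTM': count_params(nn.LSTM(input_size, hidden_size)),
}

for name, n in counts.items():
    print(f"{name}: {n:,} parameters ({n / counts['RNN']:.0f}x vanilla RNN)")

print(f"LSTM / GRU ratio: {counts['LSTM'] / counts['GRU']:.2f}")
```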

### 9.2 Practical Guidelines

**Choose LSTM when:**
- Very long sequences (>200 tokens)
- Complex temporal patterns
- Plenty of training data
- Need fine-grained control over memory

**Choose GRU when:**
- Moderate sequence lengths
- Limited computational resources
- Smaller datasets
- Need faster training/inference

**In practice:**
- Try both and compare!
- Performance difference often minimal
- GRU is a common first choice when simplicity and speed matter
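
Trying both is cheap if the recurrent layer is a constructor argument. The sketch below is one possible way to set that up; the class name `RecurrentClassifier` and the `rnn_type` parameter are hypothetical. The only difference to handle is that `nn.LSTM` returns a `(hidden, cell)` tuple while `nn.GRU` returns the hidden state alone.

```python
import torch
import torch.nn as nn


class RecurrentClassifier(nn.Module):
    """Hypothetical classifier whose recurrent layer can be swapped."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim,
                 num_classes=2, rnn_type='lstm'):
        super().__init__()
        rnn_cls = {'lstm': nn.LSTM, 'gru': nn.GRU}[rnn_type]
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.rnn = rnn_cls(embedding_dim, hidden_dim,
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        _, state = self.rnn(self.embedding(x))
        # LSTM returns (hidden, cell); GRU returns hidden only
        hidden = state[0] if isinstance(state, tuple) else state
        return self.fc(torch.cat([hidden[-2], hidden[-1]], dim=1))


x = torch.randint(1, 31, (4, 15))  # toy batch of token indices
for rnn_type in ('lstm', 'gru'):
    model = RecurrentClassifier(vocab_size=31, embedding_dim=32,
                                hidden_dim=64, rnn_type=rnn_type)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{rnn_type.upper()}: logits {tuple(model(x).shape)}, {n_params:,} parameters")
```

Both variants can then be trained with the same loop and compared on held-out accuracy and wall-clock time.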

In [None]:
# Final comparison visualization

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Architecture complexity
ax = axes[0, 0]
models_compare = ['Vanilla RNN', 'GRU', 'LSTM']
gates = [0, 2, 3]
params_mult = [1, 3, 4]  # Relative parameter count

x = np.arange(len(models_compare))
width = 0.35

ax.bar(x - width/2, gates, width, label='Number of Gates', color='steelblue', alpha=0.7)
ax.bar(x + width/2, params_mult, width, label='Relative Parameters', color='coral', alpha=0.7)
ax.set_xticks(x)
ax.set_xticklabels(models_compare)
ax.set_ylabel('Count')
ax.set_title('Architecture Complexity')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# 2. Memory capability comparison (from our tests)
ax = axes[0, 1]
for rnn_type, accs in results.items():
    ax.plot(seq_lengths, accs, 'o-', linewidth=2, markersize=8, label=rnn_type)
ax.set_xlabel('Sequence Length')
ax.set_ylabel('Accuracy on Copy Task')
ax.set_title('Long-Range Dependency (Copy Task)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim([0, 1.1])

# 3. Training time comparison (illustrative numbers, not measured)
ax = axes[1, 0]
seq_len_train = [10, 50, 100, 200]
rnn_time = [1, 5, 10, 20]
gru_time = [1.5, 7.5, 15, 30]
lstm_time = [2, 10, 20, 40]

ax.plot(seq_len_train, rnn_time, 's-', linewidth=2, label='RNN', markersize=8)
ax.plot(seq_len_train, gru_time, 'o-', linewidth=2, label='GRU', markersize=8)
ax.plot(seq_len_train, lstm_time, '^-', linewidth=2, label='LSTM', markersize=8)
ax.set_xlabel('Sequence Length')
ax.set_ylabel('Relative Training Time')
ax.set_title('Training Time Comparison (Illustrative)')
ax.legend()
ax.grid(True, alpha=0.3)

# 4. Use case recommendations
ax = axes[1, 1]
use_cases = [
    ('Short sequences\n(<50 tokens)', 'GRU or RNN'),
    ('Medium sequences\n(50-200 tokens)', 'GRU or LSTM'),
    ('Long sequences\n(>200 tokens)', 'LSTM'),
    ('Limited resources', 'GRU'),
    ('Maximum accuracy', 'LSTM (try both)')
]

y_pos = np.arange(len(use_cases))
ax.barh(y_pos, [1]*len(use_cases), color='lightgray', alpha=0.5)
for i, (case, recommendation) in enumerate(use_cases):
    ax.text(0.05, i, f"{case}: {recommendation}", va='center', fontsize=11)

ax.set_yticks([])
ax.set_xticks([])
ax.set_title('Use Case Recommendations')
ax.set_xlim([0, 1])

plt.suptitle('LSTM vs GRU: Comprehensive Comparison', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

## 10. Summary and Key Takeaways

### What We Learned

1. **Gating mechanisms** mitigate the vanishing gradient problem through:
   - Sigmoid gates controlling information flow
   - Additive (not multiplicative) state updates
   - Direct gradient pathways through time

2. **LSTM Architecture:**
   - Separate cell state (long-term) and hidden state (short-term)
   - Three gates: Forget, Input, Output
   - Often the stronger choice for very long sequences

3. **GRU Architecture:**
   - Simplified LSTM with two gates: Reset, Update
   - Fewer parameters, faster training
   - Often comparable performance

4. **Practical Applications:**
   - Sentiment analysis
   - Language modeling
   - Sequence classification
   - Time series prediction

5. **Best Practices:**
   - Initialize forget gate bias to 1
   - Use gradient clipping
   - Apply dropout between layers
   - Consider bidirectional processing
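
The first two best practices can be sketched with PyTorch's built-in `nn.LSTM` as follows. The sizes and the stand-in loss are hypothetical. The sketch relies on two facts about PyTorch: gate parameters are stored in the order (input, forget, cell, output), and the two bias vectors `bias_ih` and `bias_hh` are added together, so one is set to 1 and the other to 0 to get an effective forget-gate bias of 1.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2,
               batch_first=True, dropout=0.3)  # dropout applies between layers
hidden_size = lstm.hidden_size

# Forget gate occupies the second quarter of each bias vector
with torch.no_grad():
    for name, param in lstm.named_parameters():
        if name.startswith('bias_ih'):
            param[hidden_size:2 * hidden_size].fill_(1.0)
        elif name.startswith('bias_hh'):
            param[hidden_size:2 * hidden_size].fill_(0.0)

optimizer = torch.optim.Adam(lstm.parameters(), lr=1e-3)

x = torch.randn(4, 10, 8)        # toy batch: [batch, seq, features]
output, _ = lstm(x)
loss = output.pow(2).mean()      # stand-in loss for illustration only

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global L2 norm is at most max_norm
total_norm = torch.nn.utils.clip_grad_norm_(lstm.parameters(), max_norm=1.0)
optimizer.step()

print(f"Gradient norm before clipping: {float(total_norm):.4f}")
```

In a training loop such as the one in Section 7, the clipping call would sit between `loss.backward()` and `optimizer.step()`.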

### Next Steps (Day 15)

- **Attention mechanisms**: Beyond simple recurrence
- **Transformers**: Self-attention and parallel processing
- This leads to modern NLP architectures (BERT, GPT)!

## Exercises

### Exercise 1: LSTM Backpropagation
Implement the backward pass for the LSTM cell, computing gradients for all gates.

### Exercise 2: Language Model
Train an LSTM language model on a small corpus and generate text. Compare with vanilla RNN.

### Exercise 3: Time Series Prediction
Use an LSTM to predict stock prices or weather patterns from historical data.

### Exercise 4: Named Entity Recognition
Implement sequence labeling with a BiLSTM for the NER task.

### Exercise 5: Variational Dropout
Implement variational dropout (same mask across time steps) and compare with standard dropout.

### Exercise 6: Peephole Connections
Add peephole connections to the LSTM so that the gates also see the cell state.

### Exercise 7: Seq2Seq Model
Implement encoder-decoder LSTM for simple translation or text summarization.

In [None]:
# Starter code for Exercise 3: Time Series Prediction

def create_sine_wave_data(num_samples=1000, seq_length=50, pred_length=10):
    """
    Create sine wave prediction task.
    Input: seq_length points of sine wave
    Target: next pred_length points
    """
    X = []
    y = []
    
    for _ in range(num_samples):
        # Random starting point
        start = np.random.uniform(0, 2*np.pi)
        # Generate sequence
        t = np.linspace(start, start + 4*np.pi, seq_length + pred_length)
        wave = np.sin(t)
        
        X.append(wave[:seq_length])
        y.append(wave[seq_length:])
    
    return np.array(X), np.array(y)

# Generate data
X_sine, y_sine = create_sine_wave_data()
print(f"Input shape: {X_sine.shape}")
print(f"Target shape: {y_sine.shape}")

# Visualize
plt.figure(figsize=(12, 4))
sample_idx = 0
plt.plot(range(50), X_sine[sample_idx], 'b-', linewidth=2, label='Input')
plt.plot(range(50, 60), y_sine[sample_idx], 'r-', linewidth=2, label='Target')
plt.axvline(x=49.5, color='gray', linestyle='--')
plt.xlabel('Time')
plt.ylabel('Value')
plt.title('Sine Wave Prediction Task')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("\nExercise: Build LSTM model to predict future sine wave values!")

## References

1. Hochreiter, S., & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation.
2. Cho, K., et al. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." EMNLP.
3. Gers, F. A., et al. (2000). "Learning to Forget: Continual Prediction with LSTM." Neural Computation.
4. Greff, K., et al. (2017). "LSTM: A Search Space Odyssey." IEEE Transactions on Neural Networks and Learning Systems.
5. Jozefowicz, R., et al. (2015). "An Empirical Exploration of Recurrent Network Architectures." ICML.
6. Chung, J., et al. (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." NeurIPS Workshop.
7. Olah, C. (2015). "Understanding LSTM Networks." Blog post.