## Vanilla RNN Cell Explained

A Recurrent Neural Network (RNN) processes sequences by maintaining a **hidden state** that is updated at each time step based on the current input and the previous hidden state.

### Update Rule

At each time step $t$, the hidden state $h_t$ is updated using:

$$
h_t = \tanh(W_{ih} \cdot x_t + W_{hh} \cdot h_{t-1} + b)
$$

Where:
- $x_t \in \mathbb{R}^{n}$: input at time step $t$
- $h_{t-1} \in \mathbb{R}^{m}$: previous hidden state
- $W_{ih} \in \mathbb{R}^{m \times n}$: input-to-hidden weight matrix
- $W_{hh} \in \mathbb{R}^{m \times m}$: hidden-to-hidden weight matrix
- $b \in \mathbb{R}^{m}$: bias vector
- $\tanh(\cdot)$: element-wise hyperbolic tangent activation

### Intuition

- The hidden state $h_t$ is a function of **every input seen so far**, $x_0, x_1, ..., x_t$, compressed into a fixed-size vector of $m$ values.
- The weights $W_{ih}$ and $W_{hh}$ are **shared** across all time steps.
- This allows the RNN to process variable-length sequences using a fixed number of parameters.
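
To make the weight-sharing point concrete, here is a minimal sketch that applies one set of parameters to sequences of different lengths. The sizes (`input_size = 3`, `hidden_size = 4`) and the helper name `run_rnn` are assumptions chosen for illustration, not part of the cells below.

```python
import torch

torch.manual_seed(0)

input_size, hidden_size = 3, 4  # illustrative sizes (assumed)

# One set of parameters, reused at every time step
W_ih = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b = torch.zeros(hidden_size)

def run_rnn(x):
    """Apply the vanilla RNN update to x of shape (seq_len, input_size)."""
    h = torch.zeros(hidden_size)
    for x_t in x:
        h = torch.tanh(W_ih @ x_t + W_hh @ h + b)
    return h

short_seq = torch.randn(2, input_size)   # 2 time steps
long_seq = torch.randn(50, input_size)   # 50 time steps

h_short = run_rnn(short_seq)
h_long = run_rnn(long_seq)
n_params = W_ih.numel() + W_hh.numel() + b.numel()

print(h_short.shape, h_long.shape)  # both are vectors of length hidden_size
print("parameters:", n_params)      # 4*3 + 4*4 + 4 = 32
```

The parameter count depends only on `input_size` and `hidden_size`, never on the sequence length.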

### Limitation

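Vanilla RNNs are prone to the **vanishing gradient problem**, especially on long sequences. For a loss that depends on the final state $h_T$, the gradient with respect to an earlier hidden state $h_t$ is a product of $T - t$ Jacobians, each of the form $\partial h_k / \partial h_{k-1} = \mathrm{diag}(1 - h_k^2)\, W_{hh}$. When the norms of these Jacobians are below 1, the product shrinks exponentially towards zero (when they are above 1, it can explode instead):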

$$
\frac{\partial \mathcal{L}}{\partial h_t} = \frac{\partial \mathcal{L}}{\partial h_T} \cdot \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}
$$

As a result, it becomes difficult to learn long-term dependencies.
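
The shrinking product can be observed directly with autograd. The sketch below is an illustration under assumed settings: `W_hh` is rescaled so that its largest singular value is 0.5, which guarantees that every Jacobian has norm at most 0.5, and the loss is a toy function of the final state only.

```python
import torch

torch.manual_seed(0)
hidden_size, seq_len = 8, 50  # illustrative sizes (assumed)

W_ih = torch.randn(hidden_size, 1) * 0.1
W_hh = torch.randn(hidden_size, hidden_size)
# Rescale so the largest singular value of W_hh is 0.5 (assumed setting)
W_hh = 0.5 * W_hh / torch.linalg.matrix_norm(W_hh, ord=2)
x = torch.randn(seq_len, 1)

h = torch.zeros(hidden_size, requires_grad=True)
states = [h]
for t in range(seq_len):
    h = torch.tanh(W_ih @ x[t] + W_hh @ h)
    h.retain_grad()  # keep the gradient of this intermediate state
    states.append(h)

loss = states[-1].sum()  # toy loss that depends only on the final state
loss.backward()

# Norm of dL/dh_t for every t, from the initial state to the final one
grad_norms = [s.grad.norm().item() for s in states]
for t in (50, 40, 25, 10, 0):
    print(f"t={t:2d}  ||dL/dh_t|| = {grad_norms[t]:.3e}")
```

Each step back in time multiplies the gradient norm by at most 0.5 here, so the signal reaching the earliest states is many orders of magnitude smaller than at the final step.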

The following cell implements this update rule with raw PyTorch tensor operations, so you can observe how the hidden state evolves over time.


In [2]:
import torch

# Seed for reproducibility
torch.manual_seed(0)

# Input sequence: shape (seq_len, input_size)
x = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0]])  # (5, 1)

# Dimensions
seq_len, input_size = x.shape
hidden_size = 1

# Initialize weights and bias
W_ih = torch.randn(hidden_size, input_size) * 0.1  # input → hidden
W_hh = torch.randn(hidden_size, hidden_size) * 0.1  # hidden → hidden
b = torch.zeros(hidden_size)                       # bias

# Output list
outputs = []

# Initial hidden state
h = torch.zeros(hidden_size)
# RNN forward loop
for t in range(seq_len):
    x_t = x[t]
    h = torch.tanh(W_ih @ x_t + W_hh @ h + b)
    outputs.append(h.item())

# Print output hidden states
print("RNN outputs:", outputs)

RNN outputs: [0.15289130806922913, 0.294706791639328, 0.4248957335948944, 0.5398406982421875, 0.6379193663597107]


## Vanilla RNN Using PyTorch

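PyTorch provides built-in vanilla RNNs through `nn.RNN` (a whole sequence, optionally with several layers) and `nn.RNNCell` (a single time step). Both handle weight initialization. `nn.RNN` also runs the recurrence over the sequence, whereas with `nn.RNNCell` you write the time loop yourself. In either case, autograd performs backpropagation through time (BPTT) when you call `backward()`.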

### PyTorch Structure

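- **Input shape** (`nn.RNN`, default `batch_first=False`): `(seq_len, batch_size, input_size)`
- **Output** (`nn.RNN`):
  - `output`: the hidden state at every time step, $(h_1, ..., h_T)$ → shape `(seq_len, batch_size, hidden_size)`
  - `h_n`: final hidden state $h_T$ → shape `(num_layers, batch_size, hidden_size)` for a unidirectional RNN
- **`nn.RNNCell`**: takes one time step of shape `(batch_size, input_size)` and a hidden state of shape `(batch_size, hidden_size)`, and returns the next hidden state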
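
As a quick check of the `nn.RNN` shapes, here is a minimal sketch that feeds a single sequence of five scalars through a one-layer `nn.RNN`. The batch size of 1 and the seed are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, batch_size, input_size, hidden_size = 5, 1, 1, 4

# batch_first=False by default, so the input is (seq_len, batch_size, input_size)
rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size)

x = torch.arange(1.0, 6.0).reshape(seq_len, batch_size, input_size)
output, h_n = rnn(x)  # the initial hidden state defaults to zeros

print(output.shape)  # torch.Size([5, 1, 4])
print(h_n.shape)     # torch.Size([1, 1, 4])
# For a single-layer, unidirectional RNN the last output step is the final state
print(torch.allclose(output[-1], h_n[0]))  # True
```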

In [19]:
import torch
import torch.nn as nn

# Define input parameters
input_size = 1      # size of each input vector
hidden_size = 4     # number of hidden units

# nn.RNNCell expects input of shape (batch_size, input_size), so these five
# values are treated as a batch of 5 independent inputs, not as one sequence
x = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0]])  # shape (5, 1)

# Define a single RNN cell (one time step per call)
rnn = nn.RNNCell(input_size=input_size, hidden_size=hidden_size, nonlinearity='tanh')

# Initial hidden state, one row per batch element (optional; defaults to zeros)
h0 = torch.zeros(x.shape[0], hidden_size)

# Apply one time step to every batch element
output = rnn(x, h0)

# Print the hidden state of each batch element after that single step
print("Hidden states after one step:\n", output)

Hidden states after one step:
 tensor([[-0.7895, -0.0327, -0.3268, -0.0262],
        [-0.9114, -0.2199, -0.6464, -0.3298],
        [-0.9641, -0.3921, -0.8333, -0.5777],
        [-0.9857, -0.5407, -0.9259, -0.7511],
        [-0.9943, -0.6618, -0.9679, -0.8596]], grad_fn=<TanhBackward0>)
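
Because `nn.RNNCell` computes only one time step per call, processing the five values as a single sequence requires an explicit loop that feeds each hidden state back in. The sketch below does that and, as a sanity check, copies the cell's parameters into an `nn.RNN` to confirm that both produce the same states. The seed and variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 1, 4
x = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0]])  # (seq_len, input_size)

cell = nn.RNNCell(input_size, hidden_size)

# Manual recurrence: batch of 1, hidden state carried from step to step
h = torch.zeros(1, hidden_size)
states = []
for x_t in x:
    h = cell(x_t.unsqueeze(0), h)  # (1, input_size) -> (1, hidden_size)
    states.append(h)
states = torch.cat(states, dim=0)  # (seq_len, hidden_size)

# Same parameters inside nn.RNN, which runs the loop internally
rnn = nn.RNN(input_size, hidden_size)
with torch.no_grad():
    rnn.weight_ih_l0.copy_(cell.weight_ih)
    rnn.weight_hh_l0.copy_(cell.weight_hh)
    rnn.bias_ih_l0.copy_(cell.bias_ih)
    rnn.bias_hh_l0.copy_(cell.bias_hh)

output, h_n = rnn(x.unsqueeze(1))  # input shape (seq_len, 1, input_size)

print(states.shape)
print(torch.allclose(states, output.squeeze(1), atol=1e-6))
```

Unlike the batched call, each row of `states` here depends on all earlier inputs through the carried hidden state.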


## Gated Recurrent Unit (GRU) Explained

A **GRU** is a type of recurrent neural network that introduces **gates** to control the flow of information, helping it retain memory over longer sequences and mitigate the vanishing gradient problem.

### GRU Equations

At each time step $t$, given the input $x$ (short for $x_t$) and the previous hidden state $h$ (short for $h_{t-1}$), the GRU computes:

#### 1. **Reset Gate** (controls how much past to forget)

$$
\begin{align*}
t_{hr} &= W_{hr} h + b_{hr} \\
t_{xr} &= W_{ir} x + b_{ir} \\
r &= \sigma(t_{hr} + t_{xr})
\end{align*}
$$

#### 2. **Update Gate** (controls how much new info to use)

$$
\begin{align*}
t_{hz} &= W_{hz} h + b_{hz} \\
t_{xz} &= W_{iz} x + b_{iz} \\
z &= \sigma(t_{hz} + t_{xz})
\end{align*}
$$

#### 3. **Candidate Activation** (proposed new hidden state)

$$
\begin{align*}
t_{hn} &= W_{hn} h + b_{hn} \\
t_{xn} &= W_{in} x + b_{in} \\
\tilde{h} &= \tanh(t_{xn} + r \odot t_{hn})
\end{align*}
$$

#### 4. **Final Hidden State Update**

$$
h_t = (1 - z) \odot h + z \odot \tilde{h}
$$

Where:
- $\sigma(\cdot)$ is the sigmoid function
- $\tanh(\cdot)$ is the hyperbolic tangent
- $\odot$ denotes element-wise multiplication

### Intuition

- The **reset gate** $r$ decides how much of the past state to forget when computing the candidate $\tilde{h}$.
- The **update gate** $z$ decides how much of the new candidate to use vs. keeping the old state.
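- This gating mechanism helps the GRU **learn long-range dependencies**: when $z$ is close to 0, $h_t \approx h$, so the state (and its gradient) passes through the step almost unchanged. This mitigates vanishing gradients, although it does not eliminate them.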
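
The role of the update gate is easiest to see at its extremes. This small sketch evaluates the final update equation for hand-picked gate values. The vectors and the helper name `gru_blend` are made up for illustration.

```python
import torch

h_prev = torch.tensor([0.5, -0.2])  # previous hidden state h (made-up values)
h_cand = torch.tensor([-0.9, 0.8])  # candidate state h~ (made-up values)

def gru_blend(z, h_prev, h_cand):
    """Final GRU update: h_t = (1 - z) * h + z * h~ (element-wise)."""
    return (1 - z) * h_prev + z * h_cand

keep = gru_blend(torch.zeros(2), h_prev, h_cand)             # z = 0: old state is kept
replace = gru_blend(torch.ones(2), h_prev, h_cand)           # z = 1: candidate is used
mixed = gru_blend(torch.tensor([0.0, 1.0]), h_prev, h_cand)  # gates act per unit

print(keep)     # equals h_prev
print(replace)  # equals h_cand
print(mixed)    # first unit from h_prev, second unit from h_cand
```

Because the gate is applied element-wise, each hidden unit can decide independently whether to remember or overwrite.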

In [5]:
import torch

# Seed for reproducibility
torch.manual_seed(42)

# Input sequence: shape (seq_len, input_size)
sequence = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0]])  # shape (5, 1)
seq_len, input_size = sequence.shape
hidden_size = 1  # scalar GRU cell

# Initialize GRU weights and biases
W_ir = torch.randn(hidden_size, input_size) * 0.1
W_hr = torch.randn(hidden_size, hidden_size) * 0.1
b_ir = torch.zeros(hidden_size)
b_hr = torch.zeros(hidden_size)

W_iz = torch.randn(hidden_size, input_size) * 0.1
W_hz = torch.randn(hidden_size, hidden_size) * 0.1
b_iz = torch.zeros(hidden_size)
b_hz = torch.zeros(hidden_size)

W_in = torch.randn(hidden_size, input_size) * 0.1
W_hn = torch.randn(hidden_size, hidden_size) * 0.1
b_in = torch.zeros(hidden_size)
b_hn = torch.zeros(hidden_size)

# Initial hidden state
h = torch.zeros(hidden_size)

# GRU forward pass
outputs = []
for t in range(seq_len):
    x_t = sequence[t]

    # Reset gate
    r = torch.sigmoid(W_ir @ x_t + b_ir + W_hr @ h + b_hr)

    # Update gate
    z = torch.sigmoid(W_iz @ x_t + b_iz + W_hz @ h + b_hz)

    # Candidate hidden state
    n = torch.tanh(W_in @ x_t + b_in + r * (W_hn @ h + b_hn))

    # Final hidden state
    h = (1 - z) * h + z * n
    outputs.append(h.item())

print("GRU outputs:", outputs)


GRU outputs: [-0.05656343698501587, -0.14032450318336487, -0.2349534034729004, -0.3311828374862671, -0.42367157340049744]


## GRU (Gated Recurrent Unit) Using PyTorch

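PyTorch provides built-in GRU modules: `nn.GRU` runs a whole sequence (optionally with several layers), while `nn.GRUCell` computes a single time step. Note that PyTorch defines the update gate with the opposite convention to the equations above: it computes $h_t = (1 - z) \odot \tilde{h} + z \odot h$, so there $z$ weights the old state rather than the candidate.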

### PyTorch GRU Structure

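- **Input shape** (`nn.GRU`, default `batch_first=False`): `(seq_len, batch_size, input_size)`
- **Output** (`nn.GRU`):
  - `output`: the hidden state at every time step, $(h_1, ..., h_T)$ → shape `(seq_len, batch_size, hidden_size)`
  - `h_n`: final hidden state $h_T$ → shape `(num_layers, batch_size, hidden_size)` for a unidirectional GRU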
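
The following sketch checks these shapes on a two-layer `nn.GRU`. The batch size, layer count, and random input are assumptions chosen so that the `h_n` dimensions are easy to tell apart.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, batch_size, input_size, hidden_size, num_layers = 5, 2, 1, 4, 2

gru = nn.GRU(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)

x = torch.randn(seq_len, batch_size, input_size)
h0 = torch.zeros(num_layers, batch_size, hidden_size)  # optional; defaults to zeros

output, h_n = gru(x, h0)

print(output.shape)  # torch.Size([5, 2, 4])
print(h_n.shape)     # torch.Size([2, 2, 4])
# `output` holds the top layer's states, so its last step equals h_n[-1]
print(torch.allclose(output[-1], h_n[-1]))  # True
```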

In [18]:
import torch
import torch.nn as nn

# Parameters
input_size = 1
hidden_size = 4

# nn.GRUCell expects input of shape (batch_size, input_size), so these five
# values are treated as a batch of 5 independent inputs, not as one sequence
x = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0]])  # shape (5, 1)

# Define a single GRU cell (one time step per call)
gru = nn.GRUCell(input_size=input_size, hidden_size=hidden_size)

# Initial hidden state, one row per batch element (optional; defaults to zeros)
h0 = torch.zeros(x.shape[0], hidden_size)

# Apply one time step to every batch element
output = gru(x, h0)

print("Hidden states after one step:", output)

Hidden states after one step: tensor([[-0.4216, -0.0231, -0.2176, -0.2058],
        [-0.5988,  0.0482, -0.4087, -0.3461],
        [-0.7258,  0.1265, -0.5571, -0.4507],
        [-0.8130,  0.2086, -0.6629, -0.5267],
        [-0.8723,  0.2918, -0.7372, -0.5849]], grad_fn=<AddBackward0>)
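
To connect `nn.GRUCell` back to the gate equations, the sketch below recomputes one step by hand from the cell's own parameters. It relies on two details of PyTorch's layout: the weights are stacked in the order (reset, update, new), and the final update is $h_t = (1 - z) \odot n + z \odot h$, where $z$ weights the old state. The input value and seed are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 1, 4
cell = nn.GRUCell(input_size, hidden_size)

# Parameters are stacked along dim 0 in the order (reset, update, new)
W_ir, W_iz, W_in = cell.weight_ih.chunk(3, dim=0)
W_hr, W_hz, W_hn = cell.weight_hh.chunk(3, dim=0)
b_ir, b_iz, b_in = cell.bias_ih.chunk(3)
b_hr, b_hz, b_hn = cell.bias_hh.chunk(3)

x = torch.tensor([[2.0]])         # (batch_size=1, input_size)
h = torch.randn(1, hidden_size)   # arbitrary previous hidden state

r = torch.sigmoid(x @ W_ir.T + b_ir + h @ W_hr.T + b_hr)     # reset gate
z = torch.sigmoid(x @ W_iz.T + b_iz + h @ W_hz.T + b_hz)     # update gate
n = torch.tanh(x @ W_in.T + b_in + r * (h @ W_hn.T + b_hn))  # candidate

# PyTorch's convention: z weights the OLD state
h_manual = (1 - z) * n + z * h
h_builtin = cell(x, h)

print(torch.allclose(h_manual, h_builtin, atol=1e-6))
```

Swapping the roles of `z` and `1 - z` gives the convention used in the hand-written GRU earlier; the two are equivalent up to relabeling the gate.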


## Long Short-Term Memory (LSTM) in PyTorch

An **LSTM** is a gated recurrent neural network designed to capture **long-range dependencies** and mitigate the **vanishing gradient problem** by introducing a **cell state** that is updated additively.

### LSTM Equations

At each time step $t$, the LSTM takes input $x_t$ and previous states $(h_{t-1}, c_{t-1})$ and computes:

$$
\begin{align*}
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \quad \text{(forget gate)} \\
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \quad \text{(input gate)} \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \quad \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_{ic} x_t + b_{ic} + W_{hc} h_{t-1} + b_{hc}) \quad \text{(cell candidate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad \text{(new cell state)} \\
h_t &= o_t \odot \tanh(c_t) \quad \text{(new hidden state)}
\end{align*}
$$

- $\sigma(\cdot)$ is the sigmoid function
- $\tanh(\cdot)$ is the hyperbolic tangent
- $\odot$ is element-wise multiplication
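
As a check on these equations, here is a sketch that evaluates one LSTM step by hand using the parameters of an `nn.LSTMCell` and compares the result with the module's output. It relies on PyTorch stacking the gate parameters in the order (input, forget, cell candidate, output). The input value, seed, and variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 1, 4
cell = nn.LSTMCell(input_size, hidden_size)

# Parameters are stacked along dim 0 in the order (input, forget, candidate, output)
W_ii, W_if, W_ic, W_io = cell.weight_ih.chunk(4, dim=0)
W_hi, W_hf, W_hc, W_ho = cell.weight_hh.chunk(4, dim=0)
# The two bias vectors only ever appear as a sum, so add them up front
b_i, b_f, b_c, b_o = (cell.bias_ih + cell.bias_hh).chunk(4)

x = torch.tensor([[3.0]])             # (batch_size=1, input_size)
h_prev = torch.randn(1, hidden_size)  # arbitrary previous hidden state
c_prev = torch.randn(1, hidden_size)  # arbitrary previous cell state

f = torch.sigmoid(x @ W_if.T + h_prev @ W_hf.T + b_f)     # forget gate
i = torch.sigmoid(x @ W_ii.T + h_prev @ W_hi.T + b_i)     # input gate
o = torch.sigmoid(x @ W_io.T + h_prev @ W_ho.T + b_o)     # output gate
c_tilde = torch.tanh(x @ W_ic.T + h_prev @ W_hc.T + b_c)  # cell candidate

c = f * c_prev + i * c_tilde  # additive cell-state update
h = o * torch.tanh(c)         # new hidden state

h_ref, c_ref = cell(x, (h_prev, c_prev))
print(torch.allclose(h, h_ref, atol=1e-6), torch.allclose(c, c_ref, atol=1e-6))
```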


In [20]:
import torch
import torch.nn as nn

# Parameters
input_size = 1
hidden_size = 4
seq_len = 5

# Input sequence: shape (seq_len, input_size)
x = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0]])  # shape (5, 1)

# LSTM cell
lstm = nn.LSTMCell(input_size=input_size, hidden_size=hidden_size)

# Initial hidden and cell states
h_t = torch.zeros(1, hidden_size)
c_t = torch.zeros(1, hidden_size)

# Forward loop
outputs = []
for t in range(seq_len):
    h_t, c_t = lstm(x[t].unsqueeze(0), (h_t, c_t))  # (1, hidden_size)
    outputs.append(h_t)

# Stack outputs to a tensor: shape (seq_len, hidden_size)
outputs = torch.cat(outputs, dim=0)

print("All hidden states (output):", outputs)

All hidden states (output): tensor([[ 0.0153,  0.0893, -0.1494, -0.1791],
        [-0.0168,  0.0562, -0.3511, -0.3371],
        [-0.0600, -0.0069, -0.5769, -0.4640],
        [-0.0847, -0.0451, -0.7566, -0.5572],
        [-0.0849, -0.0594, -0.8628, -0.6209]], grad_fn=<CatBackward0>)
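
The explicit loop over `nn.LSTMCell` can also be delegated to `nn.LSTM`, which processes the whole sequence in one call and returns the final hidden and cell states. The sketch below copies the cell's parameters into an `nn.LSTM` and confirms that both routes agree. The seed and variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 1, 4
x = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0]])  # (seq_len, input_size)

cell = nn.LSTMCell(input_size, hidden_size)
lstm = nn.LSTM(input_size, hidden_size)

# Give both modules identical parameters
with torch.no_grad():
    lstm.weight_ih_l0.copy_(cell.weight_ih)
    lstm.weight_hh_l0.copy_(cell.weight_hh)
    lstm.bias_ih_l0.copy_(cell.bias_ih)
    lstm.bias_hh_l0.copy_(cell.bias_hh)

# Route 1: explicit loop over the cell
h_t = torch.zeros(1, hidden_size)
c_t = torch.zeros(1, hidden_size)
steps = []
for x_t in x:
    h_t, c_t = cell(x_t.unsqueeze(0), (h_t, c_t))
    steps.append(h_t)
steps = torch.cat(steps, dim=0)  # (seq_len, hidden_size)

# Route 2: nn.LSTM over the whole sequence (states default to zeros)
output, (h_n, c_n) = lstm(x.unsqueeze(1))  # input (seq_len, 1, input_size)

print(output.shape, h_n.shape, c_n.shape)
print(torch.allclose(steps, output.squeeze(1), atol=1e-6))
print(torch.allclose(c_t, c_n[0], atol=1e-6))
```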
