# I. Introduction

## 1.1. Overview of Neural Networks

Traditional neural networks (NNs), such as **Multilayer Perceptrons (MLPs)**, are primarily designed to handle **fixed-length data**. For instance, linear regression, logistic regression, and MLPs all assume that each feature vector `x_i` consists of a fixed number of components. Such datasets are often called **tabular** because they can be arranged in tables where each example forms a row and each attribute a column, without assuming any particular structure over the columns. Similarly, **image data**, while not strictly tabular, is also typically of fixed length: Fashion-MNIST images, for example, are 28 × 28 grids of pixel values. In these traditional settings, the goal is often to produce a single prediction from one fixed-length input. Crucially, a traditional Artificial Neural Network (ANN) **processes each input independently**. Such networks therefore struggle with sequential data: they require fixed-size inputs, and they discard the context carried by the order of the elements.

## 1.2. Need for Sequential Data Processing

Many real-world learning tasks necessitate handling **sequential data**, where the **order of elements is paramount** and the data can have **varying lengths**. For example:
*   **Image captioning, speech synthesis, and music generation** require models to produce outputs consisting of sequences.
*   **Time series prediction, video analysis, and musical information retrieval** demand that a model learns from inputs that are sequences.
*   Tasks like **translating text from one natural language to another, engaging in dialogue, or controlling a robot** require models to both ingest and output sequentially structured data.

A key insight is that while inputs and targets for many fundamental machine learning tasks cannot easily be represented as fixed-length vectors, they can often be represented as **varying-length sequences of fixed-length vectors**. For instance, documents can be viewed as sequences of words, medical records as sequences of events (like encounters or diagnoses), and videos as sequences of still images.

Unlike individual inputs which are often assumed to be sampled independently, **data arriving at each time step in a sequence cannot be assumed to be independent of each other**. For example, words appearing later in a document heavily depend on earlier words, and a patient's medicine on the 10th day of a hospital visit depends on what transpired in the previous nine days. Traditional neural networks are not suitable for these tasks because their fixed-size input requirement prevents them from processing variable-length sequences effectively, and they cannot maintain context or capture dependencies from previous inputs.

## 1.3. Introduction to Recurrent Neural Networks (RNNs)

**Recurrent Neural Networks (RNNs)** are a class of deep learning models specifically designed to **process sequential data** by capturing the dynamics of sequences via recurrent connections. These recurrent connections can be visualized as cycles in the network of nodes. The defining characteristic of RNNs is their **"memory"**, implemented through a **hidden state** (denoted as $s_t$ or $h_t$), which allows information from previous inputs or computations to influence the processing of the current input.

At each time step $t$, the **hidden state** $\mathbf{h}_t$ is computed from both the current input $\mathbf{x}_t$ and the hidden state $\mathbf{h}_{t-1}$ from the previous time step. This computation is recurrent: the same function, with the same parameters, is applied at every step. A significant advantage of RNNs is therefore that they **share the same model parameters (weights and biases) across all time steps**, so the number of parameters does not grow with the length of the sequence.
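As a minimal sketch of this recurrence and of parameter sharing, the NumPy snippet below applies one set of weights to two sequences of different lengths. The function name `rnn_forward`, the tanh activation, the layer sizes, and the random initialization are illustrative assumptions rather than anything fixed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # illustrative sizes (assumption)

# One set of parameters, reused at every time step.
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_forward(xs):
    """Return the final hidden state for a sequence xs of shape (T, input_dim)."""
    h = np.zeros(hidden_dim)  # initial hidden state
    for x_t in xs:
        # The same W_xh, W_hh and b_h are applied at each step.
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
    return h

short_seq = rng.normal(size=(3, input_dim))
long_seq = rng.normal(size=(50, input_dim))

h_short = rnn_forward(short_seq)
h_long = rnn_forward(long_seq)

# The parameter count depends only on the layer sizes, not on the sequence length.
num_params = W_xh.size + W_hh.size + b_h.size
print(h_short.shape, h_long.shape, num_params)
```

Both calls return a hidden state of the same size, and `num_params` is the same whether the sequence has 3 steps or 50.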

RNNs are highly effective for tasks involving sequences, including:
*   **Language modeling and text generation**: Predicting the next token in a sequence or generating new text character by character.
*   **Machine translation**: Translating sequences of words from one language to another.
*   **Speech recognition**: Processing acoustic signals to predict sequences of phonetic segments.
*   **Image captioning**: Generating descriptions for images by processing visual features as sequences.
*   **Sentiment analysis**: Classifying the overall sentiment of a text sequence.

## 1.4. Introduction to Backpropagation Through Time (BPTT)

To **train Recurrent Neural Networks**, a specialized variant of the traditional backpropagation algorithm is employed, known as **Backpropagation Through Time (BPTT)**. This method extends standard backpropagation to account for the temporal dependencies inherent in sequential data.

The core idea of BPTT involves **unrolling (or unfolding) the RNN over time**. By unrolling, the recurrent network is effectively transformed into a feedforward neural network, where each time step is treated as a distinct layer. A crucial aspect of this unrolling is that **every copy of the network in the unfolded view shares the same parameters**.

During BPTT, the **total loss for the network is calculated as the sum of the errors at each individual time step** across the entire sequence. To compute the gradients with respect to the shared parameters (weights and biases), the **chain rule** of differentiation is applied, propagating gradients backward through each time step. Since a parameter (like a weight matrix) is used across multiple time steps in the unrolled network, its **gradient must be summed up from its contributions at all relevant time steps**. This process enables RNNs to learn complex temporal patterns and capture long-term relationships within sequential data.

Backpropagation Through Time (BPTT) is a fundamental training algorithm for a specialized class of neural networks known as Recurrent Neural Networks (RNNs). These networks are designed to tackle problems involving sequential data, which traditional neural network architectures struggle with.

## 1.5. Problem Statement: Limitations of Traditional Models for Sequential Data

Traditional Artificial Neural Networks (ANNs), such as Multilayer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs), are primarily designed for **fixed-length data**. They assume that each input is independent, meaning they do not inherently process or remember past information in a sequence. This presents a significant challenge for tasks where the **order and relationships between elements within a sequence are crucial**. For instance, predicting the next word in a sentence requires knowledge of the preceding words, and traditional ANNs lack this contextual memory. While techniques like zero-padding can force variable-length sequences into fixed-size inputs, this can increase computational burden and lead to a loss of essential sequential information.

RNNs address this by introducing **recurrent connections**, which allow information to persist across time steps through a **hidden state** or "memory". This hidden state is updated at each time step based on the current input and the previous hidden state, enabling the network to learn from historical information. A key feature of RNNs is that they **share the same parameters (weights and biases) across all time steps**, which makes them efficient for handling sequences of varying lengths.

Despite their advantages, training RNNs presents significant challenges, notably the **vanishing gradient problem** and the **exploding gradient problem**. When gradients are backpropagated through many time steps, they can either shrink exponentially (vanishing gradients), making it difficult for the network to learn long-range dependencies and effectively "forget" early information, or grow uncontrollably large (exploding gradients), leading to numerical instability.

## 1.6. Applications of Recurrent Neural Networks

RNNs are particularly well-suited for processing and generating sequential data, leading to their widespread application across various fields of Computer Science.

Key applications include:

*   **Natural Language Processing (NLP)**:
    *   **Language Modeling and Text Generation**: Predicting the next word or character in a sequence. This is used for generating human-like text, such as Shakespearean prose, or even code.
    *   **Machine Translation**: Translating text from one language to another, often involving unaligned input and output sequence lengths.
    *   **Sentiment Classification**: Analyzing a sequence of words (e.g., a review) to determine its overall sentiment.
    *   **Named Entity Recognition (NER)**: Identifying and classifying specific entities (e.g., persons, organizations, locations) within a sentence.
    *   **Dialogue Systems**.
*   **Speech Processing**:
    *   **Speech Recognition**: Transcribing acoustic signals into words.
    *   **Speech Synthesis**: Generating human-like speech.
*   **Time Series Analysis**:
    *   **Time Series Prediction**: Forecasting future values based on historical data, such as stock prices or sensor readings.
    *   **Anomaly Detection** in time series data.
*   **Computer Vision**:
    *   **Image Captioning**: Generating descriptions for images, often combined with Convolutional Neural Networks (CNNs).
    *   **Video Analysis/Captioning** and **Human Action Recognition**.
*   **Other Domains**:
    *   **Robot Control**: Controlling robotic movements that involve sequences of actions.
    *   **Music Generation/Composition**: Generating sequences of musical notes.
    *   **Medical Applications**: Predicting clinical events in medical care pathways.
    *   **Drug Design**.

By enabling effective RNN training, BPTT has unlocked advances across a broad spectrum of computer science problems, particularly those involving sequential or time-dependent data. Notable applications include:

* Natural Language Processing: Language modeling, machine translation, and text generation, where context from previous words significantly influences predictions.

* Speech Recognition: Interpreting spoken language, which requires analyzing sound frames sequentially to capture context and phonetic dependencies.

* Time Series Forecasting: Predicting future stock prices, weather, or sensor readings using past observations.

* Control Systems: In reinforcement learning, RNNs trained with BPTT help model partially observable environments and design controllers that act based on historical sensor data.

## 1.7. How Computer Science Problems Have Been Solved Using BPTT

The primary method for training RNNs is **Backpropagation Through Time (BPTT)**.

1.  **The Core Mechanism of BPTT**:
    *   BPTT works by **unrolling the recurrent neural network over time**, transforming it into a deep feedforward network where each time step is treated as a distinct layer.
    *   Since RNNs **share the same parameters** across all time steps, BPTT calculates gradients by **summing the contributions from all relevant time steps** (from the current output backward to previous states that influence it). This allows the network to learn how changes in parameters at earlier time steps affect the final output across the sequence.
    *   The process involves a **forward pass** to compute hidden states and outputs at each time step, followed by a **backward pass** to calculate and accumulate gradients for weight updates.

2.  **Addressing Gradient Problems with BPTT and RNN Architectures**:
    *   **Exploding Gradients**: As gradients propagate backward through many time steps, they can become excessively large. A common and effective solution is **gradient clipping**, which limits the maximum magnitude of the gradients to a predefined threshold before applying weight updates, ensuring numerical stability.
    *   **Vanishing Gradients**: Gradients can shrink to near zero over long sequences, preventing effective learning of long-term dependencies. This is a more challenging problem as it's not always obvious when it occurs.
        *   **Long Short-Term Memory (LSTM) Networks**: Developed by Hochreiter and Schmidhuber in 1997, LSTMs are a specialized type of RNN explicitly designed to **mitigate the vanishing gradient problem** and learn long-term dependencies. They achieve this through a more complex internal structure that includes **memory cells** and "gates" (input, forget, and output gates). These gates selectively control the flow of information into and out of the memory cell, allowing the network to retain or discard information over long periods, thereby enabling gradients to flow with less attenuation.
        *   **Gated Recurrent Units (GRUs)**: Introduced by Cho et al. in 2014, GRUs are a simplified variant of LSTMs. They combine the forget and input gates into a single **update gate** and merge the cell state and hidden state, resulting in a more streamlined architecture with fewer parameters. GRUs often perform comparably to LSTMs with reduced computational burden.
        *   **Truncated BPTT (TBPTT)**: For very long sequences, a practical approach is to truncate the backpropagation process to a fixed number of steps. This manages computational expense and numerical instability by limiting the length of the gradient path, though it may make the model less sensitive to very long-term dependencies.
        *   Other methods to combat vanishing gradients include using **ReLU activation functions** (instead of tanh or sigmoid) and proper initialization of weight matrices.
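To make truncated BPTT concrete, here is a minimal sketch for a scalar RNN with $h_t = \tanh(w_{hh} h_{t-1} + w_{xh} x_t)$, a linear output $o_t = w_{hy} h_t$, and a squared-error loss. The model, the function names `chunk_grad` and `truncated_bptt_grad`, and the toy data are all assumptions chosen for illustration. The sequence is split into chunks of `k` steps; the hidden state is carried forward across chunk boundaries, but gradients are not propagated back through them.

```python
import numpy as np

def chunk_grad(xs, ys, w_xh, w_hh, w_hy, h0):
    """Gradient of 0.5 * sum((o_t - y_t)^2) w.r.t. w_hh within one chunk.

    h0 is treated as a constant, so no gradient flows into earlier chunks.
    Returns the gradient and the final hidden state of the chunk.
    """
    hs = [h0]
    for x in xs:
        hs.append(np.tanh(w_hh * hs[-1] + w_xh * x))
    grad, dh_next = 0.0, 0.0
    for t in range(len(xs), 0, -1):
        dh = (w_hy * hs[t] - ys[t - 1]) * w_hy + dh_next  # dL/dh_t
        dz = dh * (1.0 - hs[t] ** 2)                      # through tanh
        grad += dz * hs[t - 1]
        dh_next = dz * w_hh                               # passed to step t-1
    return grad, hs[-1]

def truncated_bptt_grad(xs, ys, w_xh, w_hh, w_hy, k):
    """Sum the per-chunk gradients, backpropagating at most k steps."""
    grad, h = 0.0, 0.0
    for start in range(0, len(xs), k):
        g, h = chunk_grad(xs[start:start + k], ys[start:start + k],
                          w_xh, w_hh, w_hy, h)
        grad += g
    return grad

xs, ys = [0.5, 0.1, 0.4], [0.4, 0.2, 0.1]   # toy data (assumption)
full = truncated_bptt_grad(xs, ys, 0.6, 0.9, 0.8, k=3)   # k = T: full BPTT
short = truncated_bptt_grad(xs, ys, 0.6, 0.9, 0.8, k=1)  # one-step truncation
print(full, short)
```

With `k` equal to the sequence length the result is the full BPTT gradient; with `k=1` the contributions that would have flowed through earlier hidden states are dropped, which is the loss of long-range sensitivity mentioned above.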

In essence, BPTT, when combined with architectural advancements like LSTMs and GRUs, has provided a powerful framework for tackling a wide array of complex Computer Science problems that involve sequential data, enabling models to learn and leverage temporal dependencies effectively.



# III. Backpropagation Through Time (BPTT)

Backpropagation Through Time (BPTT) is a **gradient-based technique specifically designed for training Recurrent Neural Networks (RNNs)**. It extends the traditional backpropagation algorithm to effectively handle sequential data.

## 3.1. Core Idea of BPTT
    
*   BPTT operates by **unrolling the RNN across time steps (or sequence steps)**. This process converts the cyclic structure of an RNN into what can be thought of as a **feedforward neural network** where each layer corresponds to a specific time step in the sequence.
*   A crucial characteristic of this unrolled network is that it **shares the same underlying parameters (weights and biases) across all time steps**. This parameter sharing means that the size of the model does not increase with the length of the input sequence.
*   When training RNNs, instead of solely updating weights based on the current time step's error, BPTT requires considering how errors at the current time step depend on **all previous time steps**. This is because the hidden state at any given time step depends on the input at that step and the hidden state from the previous step, creating a chain of dependencies stretching back to the beginning of the sequence.
*   Therefore, to calculate the gradients for the network's parameters (like weights $\mathbf{W}_{xh}$, $\mathbf{W}_{hh}$, and $\mathbf{W}_{hq}$), BPTT sums up the contributions of the gradients from **each relevant time step**. This ensures that the learning process accounts for the temporal dependencies embedded in the sequential data.

## 3.2. How BPTT Works

Backpropagation Through Time (BPTT) is the specific algorithm used to train Recurrent Neural Networks (RNNs) for sequential data. It extends the traditional backpropagation algorithm by **unrolling the network over time** to enable gradient calculation across all time steps.

The process of BPTT can be broken down into three main phases:

***Forward Pass***

- During the forward pass, input data is processed **sequentially, one time step at a time**.

- At each time step $t$, the RNN calculates a **hidden state** $\mathbf{h}_t$ based on both the **current input** $\mathbf{x}_t$ and the hidden state from the **previous time step**, $\mathbf{h}_{t-1}$. (Other texts write the hidden state as $\mathbf{s}_t$ or $a^{<t>}$.)

  A common formula for this is:

  $$
  \mathbf{h}_t = \phi(\mathbf{x}_t \mathbf{W}_{xh} + \mathbf{h}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h)
  $$

  The hidden state acts as the network's memory, summarizing all information seen up to that point.

- Simultaneously, an **output** $\mathbf{o}_t$ for the current time step is generated, typically based on the current hidden state:

  $$
  \mathbf{o}_t = \mathbf{h}_t \mathbf{W}_{hq} + \mathbf{b}_q
  $$

- This process continues until the entire input sequence is processed.

***Error Calculation***

- After the forward pass, the discrepancy between the network's predicted outputs $\mathbf{o}_t$ and the desired target outputs $\mathbf{y}_t$ is evaluated.

- The **total error** $L$ for the entire sequence is computed by **combining the errors at each individual time step**, either as a sum or, as here, as an average. For instance, with a per-step loss function $l(\mathbf{y}_t, \mathbf{o}_t)$, the objective is $L = \frac{1}{T}\sum_{t=1}^T l(\mathbf{y}_t, \mathbf{o}_t)$.
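The forward pass and error calculation described above can be sketched in NumPy as follows. The sketch keeps the row-vector convention of the formulas ($\mathbf{x}_t \mathbf{W}_{xh}$ rather than $\mathbf{W}_{xh} \mathbf{x}_t$). The choice of tanh for $\phi$, the squared-error per-step loss, the function name `forward_and_loss`, and all sizes are assumptions made for illustration.

```python
import numpy as np

def forward_and_loss(X, Y, W_xh, W_hh, b_h, W_hq, b_q):
    """Run the forward pass over a sequence and average the per-step losses.

    X has shape (T, n, d): T time steps, n examples, d input features.
    Y has shape (T, n, q): one target vector per example and time step.
    """
    T, n, _ = X.shape
    h = np.zeros((n, W_hh.shape[0]))  # initial hidden state h_0 = 0
    hidden_states, outputs, step_losses = [], [], []
    for t in range(T):
        h = np.tanh(X[t] @ W_xh + h @ W_hh + b_h)  # hidden state update
        o = h @ W_hq + b_q                         # output at step t
        hidden_states.append(h)                    # cached for the backward pass
        outputs.append(o)
        # Assumed per-step loss: squared error, averaged over the minibatch.
        step_losses.append(0.5 * np.mean(np.sum((o - Y[t]) ** 2, axis=1)))
    L = sum(step_losses) / T
    return np.stack(hidden_states), np.stack(outputs), L

rng = np.random.default_rng(0)
T, n, d, hdim, q = 5, 2, 3, 4, 1  # illustrative sizes
X = rng.normal(size=(T, n, d))
Y = rng.normal(size=(T, n, q))
W_xh = rng.normal(scale=0.1, size=(d, hdim))
W_hh = rng.normal(scale=0.1, size=(hdim, hdim))
W_hq = rng.normal(scale=0.1, size=(hdim, q))
b_h, b_q = np.zeros(hdim), np.zeros(q)

H, O, L = forward_and_loss(X, Y, W_xh, W_hh, b_h, W_hq, b_q)
print(H.shape, O.shape, float(L))
```

Storing every hidden state is what later allows the backward pass to reuse them instead of recomputing the sequence.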

***Backward Pass***

- BPTT then computes the **gradients of the objective (loss) function with respect to all the network's parameters** (weights and biases), such as $\mathbf{W}_{xh}$, $\mathbf{W}_{hh}$, $\mathbf{W}_{hq}$, $\mathbf{b}_h$, and $\mathbf{b}_q$. This is achieved by **applying the chain rule** to the unrolled network.

- The core challenge arises because the hidden state at time step $t$, $\mathbf{h}_t$, depends on the hidden state of the previous time step, $\mathbf{h}_{t-1}$. This creates a **chain of dependencies** that stretches back through time.

- Consequently, the gradient for weights that influence the hidden state (e.g., $\mathbf{W}_{xh}$, $\mathbf{W}_{hh}$) must account for their impact across **all relevant previous time steps**. This means **summing up the contributions of the gradients from each time step** where the parameter occurs in the unrolled network.

- This process effectively "goes backward in time" to determine how each earlier step contributed to the current error. For example, the contribution of the loss at step $t$ to $\frac{\partial L}{\partial \mathbf{W}_{xh}}$ is a sum over the earlier steps $k = 1, \dots, t$, and the term for step $k$ contains the product of hidden-state derivatives $\prod_{j=k+1}^{t}\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}$ (an empty product, equal to the identity, when $k = t$).

- During backpropagation, **intermediate values** from the forward pass, such as the hidden states, **are cached and reused** to avoid redundant computations.

- Finally, once the gradients for all parameters are computed, the **weights and biases are updated** using an optimization algorithm, such as **Stochastic Gradient Descent (SGD)**, to minimize the total loss. This involves nudging each weight by a small amount in the negative gradient direction.

- Due to the "depth" introduced by long sequences in RNNs, issues like **vanishing and exploding gradients** can occur during BPTT. Exploding gradients, where gradients become uncontrollably large, are often handled by **gradient clipping**, a heuristic that limits their magnitude. Vanishing gradients, where gradients become too small to effectively update weights for long-term dependencies, are a more challenging problem addressed by architectures like LSTMs and GRUs.
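Under the notation above, and written schematically (ignoring the exact ordering and transposes of the matrix products), the summation over time steps can be made explicit. The symbol $\partial^{+}$ is an assumption introduced here for illustration: it denotes the immediate derivative of $\mathbf{h}_k$ with respect to $\mathbf{W}_{hh}$, treating $\mathbf{h}_{k-1}$ as a constant.

$$
\frac{\partial L}{\partial \mathbf{W}_{hh}} = \frac{1}{T}\sum_{t=1}^{T}\sum_{k=1}^{t} \frac{\partial l(\mathbf{y}_t, \mathbf{o}_t)}{\partial \mathbf{o}_t}\, \frac{\partial \mathbf{o}_t}{\partial \mathbf{h}_t} \left(\prod_{j=k+1}^{t}\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}\right) \frac{\partial^{+} \mathbf{h}_k}{\partial \mathbf{W}_{hh}}
$$

The same expression holds for $\mathbf{W}_{xh}$ with $\partial^{+}\mathbf{h}_k / \partial \mathbf{W}_{xh}$ as the last factor. The product of Jacobians in parentheses is the term that can shrink or grow exponentially as the gap $t - k$ increases.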

## 3.3. Worked Example: Step-by-Step BPTT for a Simple Vanilla RNN on a Short Sequence

### Setup and Notation

- Sequence length $T=3$ (3 time steps)
- Input dimension = 1 (scalar per step), hidden dimension = 1 (single neuron for simplicity), output dimension = 1
- Inputs: $x_1 = 0.5$, $x_2 = 0.1$, $x_3 = 0.4$
- Targets: $y_1 = 0.4$, $y_2 = 0.2$, $y_3 = 0.1$
- Activation functions: hidden state $h_t = \tanh(z_t)$, where $z_t = W_{hh} h_{t-1} + W_{xh} x_t$; output $o_t = W_{hy} h_t$ (linear output)
- Loss: half the sum of squared errors over the time steps, $L = \frac{1}{2} \sum_{t=1}^T (o_t - y_t)^2$
- Initial hidden state $h_0 = 0$

Assume initial weights:

- $W_{xh} = 0.6$
- $W_{hh} = 0.9$
- $W_{hy} = 0.8$


### Step 1: Forward Pass (Unroll through time)

We compute the hidden states $h_t$ and outputs $o_t$ for each time step.

- $t=1$:

$$
z_1 = W_{hh} h_0 + W_{xh} x_1 = 0.9 \times 0 + 0.6 \times 0.5 = 0.3
$$

$$
h_1 = \tanh(0.3) \approx 0.2913
$$

$$
o_1 = W_{hy} h_1 = 0.8 \times 0.2913 = 0.2330
$$
- $t=2$:

$$
z_2 = W_{hh} h_1 + W_{xh} x_2 = 0.9 \times 0.2913 + 0.6 \times 0.1 = 0.2622 + 0.06 = 0.3222
$$

$$
h_2 = \tanh(0.3222) \approx 0.3115
$$

$$
o_2 = 0.8 \times 0.3115 = 0.2492
$$
- $t=3$:

$$
z_3 = W_{hh} h_2 + W_{xh} x_3 = 0.9 \times 0.3115 + 0.6 \times 0.4 = 0.2803 + 0.24 = 0.5203
$$

$$
h_3 = \tanh(0.5203) \approx 0.4775
$$

$$
o_3 = 0.8 \times 0.4775 = 0.3820
$$


### Step 2: Compute Loss

$$
L = \frac{1}{2} [(0.2330 - 0.4)^2 + (0.2492 - 0.2)^2 + (0.3820 - 0.1)^2]
$$

$$
= \frac{1}{2}[0.0279 + 0.0024 + 0.0795] = \frac{1}{2} \times 0.1098 = 0.0549
$$
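Steps 1 and 2 can be reproduced with a few lines of Python, which is a convenient way to check the arithmetic. This is only a sketch of the setup stated above; the variable names are illustrative. Because the hand calculation rounds every intermediate value to four decimals, the printed numbers may differ from it in the last digits.

```python
import math

xs = [0.5, 0.1, 0.4]              # inputs x_1..x_3
ys = [0.4, 0.2, 0.1]              # targets y_1..y_3
w_xh, w_hh, w_hy = 0.6, 0.9, 0.8  # initial weights

h = 0.0                           # initial hidden state h_0
hidden, outputs = [], []
for x in xs:
    h = math.tanh(w_hh * h + w_xh * x)  # z_t, then h_t = tanh(z_t)
    hidden.append(h)
    outputs.append(w_hy * h)            # linear output o_t

# Half the sum of squared errors over the three time steps.
loss = 0.5 * sum((o - y) ** 2 for o, y in zip(outputs, ys))

for t, (h_t, o_t) in enumerate(zip(hidden, outputs), start=1):
    print(f"t={t}: h={h_t:.4f}, o={o_t:.4f}")
print(f"loss={loss:.4f}")
```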

### Step 3: Backward Pass — Gradients of Output Layer

For each $t$,

$$
\frac{\partial L}{\partial o_t} = o_t - y_t
$$

- $t=3$: $\delta_{o_3} = 0.3820 - 0.1 = 0.2820$
- $t=2$: $\delta_{o_2} = 0.2492 - 0.2 = 0.0492$
- $t=1$: $\delta_{o_1} = 0.2330 - 0.4 = -0.1670$

Gradient w.r.t $W_{hy}$:

$$
\frac{\partial L}{\partial W_{hy}} = \sum_{t=1}^3 \delta_{o_t} h_t
$$

Calculate each term:

- $t=1: -0.1670 \times 0.2913 = -0.0486$
- $t=2: 0.0492 \times 0.3115 = 0.0153$
- $t=3: 0.2820 \times 0.4775 = 0.1347$

Summing:

$$
\frac{\partial L}{\partial W_{hy}} = -0.0486 + 0.0153 + 0.1347 = 0.1014
$$

### Step 4: Backpropagate through Time to Hidden Layers

We now find $\frac{\partial L}{\partial h_t}$ for all $t$, then backpropagate the error through $h_t$ to weights $W_{hh}$ and $W_{xh}$.

In general, the gradient at $h_t$ has two sources, the output at step $t$ and the next hidden state:

$$
\delta h_t = \frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial o_t} \frac{\partial o_t}{\partial h_t} + \text{contribution from } h_{t+1}
$$

For output:

$$
\frac{\partial o_t}{\partial h_t} = W_{hy} = 0.8
$$

We also backpropagate through the recurrent connection. Since $h_{t+1} = \tanh(z_{t+1})$ with $z_{t+1} = W_{hh} h_t + W_{xh} x_{t+1}$, we have $\frac{\partial h_{t+1}}{\partial h_t} = (1 - h_{t+1}^2) \cdot W_{hh}$. The key recursive relation for error flowing backward is therefore:

$$
\delta h_t = \delta_{o_t} \cdot W_{hy} + \delta h_{t+1} \cdot (1 - h_{t+1}^2) \cdot W_{hh}
$$

Where $(1 - h_{t+1}^2)$ is the derivative of $\tanh$ evaluated at $z_{t+1}$. Note that it uses the hidden state of the *later* step, $h_{t+1}$, not $h_t$.

Calculate starting from $t=3$, where there is no later step and the recurrent term is zero:

- $t=3$:

$$
\delta h_3 = \delta_{o_3} \times 0.8 = 0.2820 \times 0.8 = 0.2256
$$
- $t=2$:

$$
\delta h_2 = \delta_{o_2} \times 0.8 + \delta h_3 \times (1 - 0.4775^2) \times 0.9
$$

$$
= 0.0492 \times 0.8 + 0.2256 \times (1 - 0.2280) \times 0.9 = 0.0394 + 0.2256 \times 0.7720 \times 0.9
$$

$$
= 0.0394 + 0.1567 = 0.1961
$$
- $t=1$:

$$
\delta h_1 = \delta_{o_1} \times 0.8 + \delta h_2 \times (1 - 0.3115^2) \times 0.9
$$

$$
= (-0.1670) \times 0.8 + 0.1961 \times (1 - 0.0970) \times 0.9 = -0.1336 + 0.1961 \times 0.9030 \times 0.9
$$

$$
= -0.1336 + 0.1594 = 0.0258
$$


### Step 5: Gradients for $W_{hh}$ and $W_{xh}$

Recall:

$$
z_t = W_{hh} h_{t-1} + W_{xh} x_t
$$

$$
h_t = \tanh(z_t)
$$

Gradients wrt weights come from chain rule:

$$
\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^3 \delta h_t \cdot (1 - h_t^2) \cdot h_{t-1}
$$

$$
\frac{\partial L}{\partial W_{xh}} = \sum_{t=1}^3 \delta h_t \cdot (1 - h_t^2) \cdot x_t
$$

Let $\delta_t = \delta h_t \cdot (1 - h_t^2)$ denote the gradient with respect to the pre-activation $z_t$. Calculate for each $t$:

- $t=1$:

$$
\delta_1 = \delta h_1 (1 - h_1^2) = 0.0258 \times (1 - 0.2913^2) = 0.0258 \times 0.9151 = 0.0236
$$
- $t=2$:

$$
\delta_2 = 0.1961 \times (1 - 0.3115^2) = 0.1961 \times 0.9030 = 0.1771
$$
- $t=3$:

$$
\delta_3 = 0.2256 \times (1 - 0.4775^2) = 0.2256 \times (1 - 0.2280) = 0.2256 \times 0.7720 = 0.1742
$$

Now sum for each weight:

- $W_{hh}$:

$$
\sum \delta_t \times h_{t-1}
$$

Recall $h_0=0$, so terms:

$$
t=1: 0.0236 \times 0 = 0
$$

$$
t=2: 0.1771 \times 0.2913 = 0.0516
$$

$$
t=3: 0.1742 \times 0.3115 = 0.0543
$$

$$
\frac{\partial L}{\partial W_{hh}} = 0 + 0.0516 + 0.0543 = 0.1059
$$
- $W_{xh}$:

$$
\sum \delta_t \times x_t
$$

$$
t=1: 0.0236 \times 0.5 = 0.0118
$$

$$
t=2: 0.1771 \times 0.1 = 0.0177
$$

$$
t=3: 0.1742 \times 0.4 = 0.0697
$$

$$
\frac{\partial L}{\partial W_{xh}} = 0.0118 + 0.0177 + 0.0697 = 0.0992
$$
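A hand-derived backward pass is easy to get subtly wrong, so it is worth checking against numerical differentiation. The sketch below implements BPTT for the scalar RNN of this example and compares the result with central finite differences of the loss. The function names `loss_fn` and `bptt` and the step size `eps` are illustrative assumptions.

```python
import numpy as np

xs = [0.5, 0.1, 0.4]
ys = [0.4, 0.2, 0.1]

def loss_fn(w_xh, w_hh, w_hy):
    """Half the sum of squared errors for the 3-step scalar RNN."""
    h, total = 0.0, 0.0
    for x, y in zip(xs, ys):
        h = np.tanh(w_hh * h + w_xh * x)
        total += 0.5 * (w_hy * h - y) ** 2
    return total

def bptt(w_xh, w_hh, w_hy):
    """Return (dL/dW_xh, dL/dW_hh, dL/dW_hy) via backpropagation through time."""
    hs = [0.0]  # hs[t] holds h_t, with h_0 = 0
    for x in xs:
        hs.append(np.tanh(w_hh * hs[-1] + w_xh * x))
    g_xh = g_hh = g_hy = 0.0
    dh_next = 0.0  # gradient arriving from step t+1 (zero at the last step)
    for t in range(len(xs), 0, -1):
        d_o = w_hy * hs[t] - ys[t - 1]  # dL/do_t
        g_hy += d_o * hs[t]
        dh = d_o * w_hy + dh_next       # dL/dh_t
        dz = dh * (1.0 - hs[t] ** 2)    # dL/dz_t, through tanh
        g_hh += dz * hs[t - 1]
        g_xh += dz * xs[t - 1]
        dh_next = dz * w_hh             # flows back to h_{t-1}
    return g_xh, g_hh, g_hy

w = (0.6, 0.9, 0.8)
analytic = bptt(*w)

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for i in range(3):
    hi, lo = list(w), list(w)
    hi[i] += eps
    lo[i] -= eps
    numeric.append((loss_fn(*hi) - loss_fn(*lo)) / (2 * eps))

print("analytic:", [round(float(g), 4) for g in analytic])
print("numeric: ", [round(float(g), 4) for g in numeric])
```

If the two lists agree, the backward recursion is consistent with the loss. Small differences from a hand calculation in the last decimal places are expected, since the hand calculation rounds intermediate values.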

# IV. Challenges in Training RNNs with BPTT

## 4.1. Vanishing Gradient Problem

The Vanishing Gradient Problem is a **significant obstacle** encountered during the training of Recurrent Neural Networks (RNNs), particularly for "vanilla" (basic) RNN architectures. It makes the network **difficult to train**.

*   **Description**: This problem occurs when **gradients**, which are the signals used to update the network's weights during the Backpropagation Through Time (BPTT) process, **become extremely small or "vanish"** as they are propagated backward through the network across many time steps. When this happens, the error signal effectively becomes negligible.

*   **Mechanism**: The core reason for vanishing gradients lies in the **multiplicative nature of gradient calculations** across time steps during BPTT.
    *   In an RNN, the hidden state at any given time step depends on the hidden state of the previous time step. Consequently, when calculating gradients with respect to weights that influence the hidden state (like $W_{hh}$), the chain rule requires **multiplying partial derivatives of hidden states across multiple time steps**.
    *   Many common activation functions used in RNNs, such as **hyperbolic tangent (tanh) and sigmoid**, have derivatives that are bounded and often small. For example, the derivative of the sigmoid function never exceeds 0.25 and the derivative of tanh never exceeds 1; both approach zero at the extremes of their input range, where neurons "saturate".
    *   If the absolute values of the weights (or the 2-norm of the Jacobian matrices involved in the chain rule) are **less than 1**, multiplying these small values repeatedly over many time steps (especially in long sequences) causes the overall gradient to **shrink exponentially towards zero**. This is analogous to raising a number less than 1 to a large power, resulting in a number very close to zero.
    *   The problem is particularly prevalent in RNNs because they often represent "deep" networks, where the depth is determined by the sequence length, making the issue more common than in traditional feedforward networks.

*   **Consequences on Learning**: The vanishing gradient problem makes it **difficult for the network to learn long-term dependencies**. By the time the error signal from a late time step has been propagated back to much earlier time steps, it is too small to influence how the weights respond to those early inputs. As a result, the network effectively **"forgets"** information from the distant past and fails to capture relationships between elements that are far apart in the sequence, which is crucial for many real-world sequential tasks such as understanding the full context of a long sentence. The weights and biases are not updated properly, which degrades the network's accuracy.
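The multiplicative shrinkage can be observed directly in a scalar tanh RNN, where each factor in the chain rule is $\frac{\partial h_t}{\partial h_{t-1}} = W_{hh}(1 - h_t^2)$. The following sketch tracks the running product of these factors over 50 steps. The weights, the random inputs, and the function name `jacobian_products` are assumptions chosen for illustration.

```python
import numpy as np

def jacobian_products(w_hh, w_xh, xs):
    """Return |prod_j dh_j/dh_{j-1}| after each time step of a scalar tanh RNN."""
    h, prod, history = 0.0, 1.0, []
    for x in xs:
        h = np.tanh(w_hh * h + w_xh * x)
        prod *= w_hh * (1.0 - h ** 2)  # one chain-rule factor dh_t/dh_{t-1}
        history.append(abs(prod))
    return history

rng = np.random.default_rng(0)
xs = rng.normal(size=50)               # toy input sequence (assumption)
history = jacobian_products(w_hh=0.9, w_xh=0.6, xs=xs)

for t in (1, 10, 25, 50):
    print(f"after {t:2d} steps: {history[t - 1]:.3e}")
```

Because $|W_{hh}| < 1$ and the tanh derivative is at most 1, every factor has magnitude below 0.9, so the product after 50 steps is bounded by $0.9^{50} \approx 0.005$ and is typically much smaller.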

## 4.2. Exploding Gradient Problem

Conversely to the vanishing gradient problem, the **Exploding Gradient Problem** occurs when gradients grow uncontrollably large during the backpropagation process. This numerical instability makes training difficult and can cause the model to diverge.

*   **Mechanism**:
    *   This problem arises when the recurrent weights are large; more precisely, when the Jacobian matrices involved in the chain rule have eigenvalues (or singular values) with magnitude **greater than 1**.
    *   During Backpropagation Through Time (BPTT), gradients are calculated by applying the chain rule, which involves **multiplying partial derivatives across multiple time steps**. If these derivatives or the weights are large (greater than 1), their repeated multiplication over long sequences causes the gradient values to **grow exponentially**. For example, if a weight is set to 2, and the network is unrolled 50 times (like with 50 days of stock market data), the initial input value can be amplified by 2 raised to the 50th power, resulting in a huge number.
    *   The "depth" introduced by long sequences in RNNs (where depth is the sequence length) exacerbates this issue, as inputs from early time steps pass through many matrix products before reaching the output, and similar matrix products are required for gradient computation.

*   **Consequences**:
    *   When gradients become excessively large, training often becomes **unstable**, and the model parameters may fail to converge.
    *   A single gradient step with a huge gradient can undo progress made over thousands of training iterations. This means that instead of taking small, optimal steps towards minimizing the loss function, the optimization algorithm takes large, erratic steps, causing the parameters to "bounce around a lot" instead of finding the optimal values.
    *   In extreme cases, exploding gradients can cause the gradients to become "NaN" (Not a Number), leading to program crashes.

*   **Numerical Instability**: Both the vanishing and exploding gradient problems lead to **numerical instability** during training. This means that the numerical computations become unreliable, hindering the network's ability to learn effectively.
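The growth described above is easy to reproduce numerically. The sketch below first evaluates the scalar case from the text (a weight of 2 over 50 steps) and then repeatedly multiplies a gradient vector by the transpose of a small recurrent matrix. It assumes a linear recurrence with no squashing activation, and the matrix `W_hh` is an arbitrary illustrative choice with eigenvalues 1.2 and 1.1.

```python
import numpy as np

T = 50  # number of unrolled time steps

# Scalar case: the amplification factor is simply w ** T.
growth = {w: w ** T for w in (0.5, 1.0, 2.0)}
for w, factor in growth.items():
    print(f"w={w}: w**{T} = {factor:.3e}")

# Matrix case (linear recurrence assumed): backpropagating through one step
# multiplies the gradient by W_hh transposed.
W_hh = np.array([[1.2, 0.3],
                 [0.0, 1.1]])
g = np.ones(2)
norms = []
for _ in range(T):
    g = W_hh.T @ g
    norms.append(np.linalg.norm(g))

print(f"gradient norm after 1 step:  {norms[0]:.3e}")
print(f"gradient norm after {T} steps: {norms[-1]:.3e}")
```

With a weight below 1 the same factor collapses toward zero, which is the vanishing case; only a weight of exactly 1 keeps the magnitude unchanged.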



# V. Advanced RNN Solutions and Architectures

## 5.1. Gradient Clipping

**Gradient clipping** is a common, though not perfect, solution specifically designed to prevent the **exploding gradient problem** in Recurrent Neural Networks (RNNs) during training. It is considered an inelegant but ubiquitous solution.

*   **Problem it Addresses**: When gradients become **uncontrollably large** (exploding gradients), they can lead to **unstable weight updates**, causing the model to **diverge** and fail to converge, or even result in **"NaN" (Not a Number)** values, causing program crashes. Large gradients can undo significant training progress in a single step, making the optimization process erratic instead of taking small, optimal steps towards minimizing the loss.

*   **How it Works**:
    *   Gradient clipping operates by **limiting the magnitude (or "norm") of the gradients** to a predefined threshold $\theta$.
    *   If the calculated norm of the gradient vector, $\|\mathbf{g}\|$, exceeds this threshold, the gradient is **scaled down proportionally** so that its norm becomes equal to the threshold. The direction of the gradient remains the same, but its size is constrained.
    *   Formally, the gradient is updated as $\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}$. This ensures that $\|\mathbf{g}\|$ never exceeds $\theta$.
    *   In practice, when computing the gradient norm, all model parameters' gradients are concatenated and treated as a single large vector.

*   **Benefits and Nature**:
    *   This technique is highly **effective and simple** in preventing exploding gradients.
    *   It helps to **stabilize training** by ensuring that weight updates remain within a reasonable range.
    *   It also limits the influence any single minibatch or sample can have on the parameter vector, adding a degree of **robustness** to the model.
    *   However, it is considered a "hack" because it means the training process is **not always following the true gradient**, and its analytical side effects are hard to reason about.

*   **Limitations (Does NOT Address)**: Importantly, **gradient clipping only mitigates exploding gradients; it does not address the problem of vanishing gradients**. Vanishing gradients are a more challenging issue that requires architectural solutions like Long Short-Term Memory (LSTM) networks or Gated Recurrent Units (GRUs).
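A minimal NumPy sketch of norm-based clipping, following the rule $\mathbf{g} \leftarrow \min(1, \theta / \|\mathbf{g}\|)\,\mathbf{g}$ with the norm taken over all parameter gradients together. The function name `clip_gradients` and the example gradients are hypothetical; deep learning frameworks ship their own clipping utilities, which are not shown here.

```python
import numpy as np

def clip_gradients(grads, theta):
    """Rescale a list of gradient arrays so their joint L2 norm is at most theta.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    # Treat all gradients as one long vector when computing the norm.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Equivalent to min(1, theta / ||g||), without dividing by zero.
    scale = theta / total_norm if total_norm > theta else 1.0
    return [g * scale for g in grads], total_norm

# Example: two parameter gradients whose joint norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_gradients(grads, theta=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)

# Gradients already inside the threshold are left untouched.
small = [np.array([0.3, 0.4])]
unchanged, _ = clip_gradients(small, theta=1.0)
```

Because every array is multiplied by the same scalar, the direction of the overall gradient is preserved and only its length changes.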

## 5.2. Long Short-Term Memory (LSTM) Networks

**Long Short-Term Memory (LSTM) networks** are a specialized type of Recurrent Neural Network (RNN) that were introduced to specifically **address the vanishing gradient problem** commonly encountered by traditional RNNs, and to **enable them to capture long-term dependencies** in sequential data. LSTMs were first proposed by Hochreiter and Schmidhuber in 1997.

**Mechanism of Operation**:
Each LSTM unit incorporates a more complex internal structure designed to maintain information over longer sequences, making them relatively insensitive to the length of the time lag between important events. The fundamental innovation lies in the introduction of a **memory cell** (also called a cell state, \(c_t\)) and **three distinct "gates"**: the forget gate, the input gate, and the output gate. These gates regulate the flow of information into and out of the cell.

*   **Forget Gate**: This gate decides which information from the previous **cell state** (\(c_{t-1}\)) should be **retained or discarded**. It operates by mapping the previous hidden state and the current input to a value between 0 and 1 (typically using a sigmoid activation function). A value close to 1 indicates that the information should be kept, while a value close to 0 means it should be forgotten. This selective forgetting lets the model **"forget" irrelevant past information**. When the forget gate stays close to 1, gradients flow along the cell state with little attenuation, which helps mitigate the vanishing gradient problem.

*   **Input Gate**: This gate controls **how much of the new information from the current input** (\(x_t\)) and the previous hidden state (\(h_{t-1}\)) **should be added to the current cell state** (\(c_t\)). It has two parts: an input gate layer (sigmoid) that decides which values to update, and a candidate cell state layer (tanh) that creates a vector of new candidate values. The outputs of these two parts are then combined element-wise to update the cell state.

*   **Output Gate**: This gate determines **how much of the information from the current cell state** (\(c_t\)) **should be exposed as the current hidden state** (\(h_t\)). A sigmoid layer produces the gate values \(o_t\), which are multiplied element-wise with the cell state after it has been passed through a tanh function, i.e. \(h_t = o_t \odot \tanh(c_t)\). This allows the LSTM network to selectively output relevant information, which is vital for making predictions in current and future time-steps.

These gates work in a coordinated fashion, allowing the LSTM to **selectively remember and forget information in a controlled manner**. This mechanism enables LSTMs to **maintain a more stable gradient flow across many time steps**, effectively mitigating the vanishing gradient problem that plagues vanilla RNNs when dealing with long sequences. By turning the gradient flow from a multiplication chain into a more additive process (especially within the cell state), LSTMs provide an easier way for the model to learn long-distance dependencies.
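
The interaction of the three gates can be made concrete with a single forward step. The sketch below assumes one common parameterization in which a single matrix `W` maps the concatenation of the previous hidden state and the current input to four stacked pre-activations; the names `lstm_step`, `W`, and `b`, the sizes, and the random weights are illustrative assumptions rather than part of the text above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W has shape (4 * hidden, hidden + input)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])        # forget gate: what to keep from c_prev
    i = sigmoid(z[1 * hidden:2 * hidden])        # input gate: how much new content to write
    o = sigmoid(z[2 * hidden:3 * hidden])        # output gate: how much of the cell to expose
    c_tilde = np.tanh(z[3 * hidden:4 * hidden])  # candidate cell values
    c_t = f * c_prev + i * c_tilde               # additive cell-state update
    h_t = o * np.tanh(c_t)                       # hidden state exposed to the next step
    return h_t, c_t

# Run five steps on random inputs with small random weights (toy sizes).
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

The line `c_t = f * c_prev + i * c_tilde` is the additive path: when `f` is close to 1 and `i` close to 0, the cell state is copied forward almost unchanged.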

## **5.3. Gated Recurrent Units (GRUs)**

**Gated Recurrent Units (GRUs)** are a more recent and simplified variation of Recurrent Neural Networks, introduced by Cho et al. in 2014. They were developed to address the vanishing gradient problem, similar to LSTMs, and have demonstrated comparable performance to LSTMs on various tasks, often with the added benefit of reduced computational cost due to their simpler architecture.

**Mechanism of Operation**:
GRUs simplify the LSTM architecture by combining the forget and input gates into a **single "update gate"** and merging the cell state and hidden state into one. This results in a more streamlined structure with fewer parameters. GRUs primarily rely on two gates:

*   **Update Gate**: This gate plays a similar role to the combined effect of the input and forget gates in LSTMs. It determines **how much of the previous hidden state is carried forward and how much is replaced by new information** from the current input. Under the convention used here, a value closer to 1 indicates that more of the past information should be carried forward, while a value closer to 0 means the state is mostly overwritten by the new candidate state (some formulations swap the roles of 0 and 1).
*   **Reset Gate**: This gate controls **how much of the previous hidden state is used when computing the new candidate state**. If the reset gate outputs a value close to 0, the previous hidden state is effectively ignored in that computation, allowing the model to "reset" its memory when irrelevant information is encountered.

By selectively retaining and discarding information, GRUs, like LSTMs, enable a more stable flow of gradients over time, which helps in **mitigating the vanishing gradient problem** and allows them to capture long-term dependencies in sequential data.
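
A single GRU step can be sketched as follows. The sketch assumes the convention in which an update gate close to 1 keeps the previous hidden state; the function names, parameter ordering, sizes, and random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(input_size, hidden_size, rng):
    """Return [W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h] with small random weights."""
    params = []
    for _ in range(3):
        params += [rng.normal(scale=0.1, size=(hidden_size, input_size)),
                   rng.normal(scale=0.1, size=(hidden_size, hidden_size)),
                   np.zeros(hidden_size)]
    return params

def gru_step(x_t, h_prev, params):
    W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h = params
    z = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)              # update gate
    r = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)              # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev) + b_h)  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                  # blend old state and candidate

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
params = init_params(input_size, hidden_size, rng)
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h = gru_step(x_t, h, params)
print(h.shape)
```

Compared with the LSTM, there is no separate cell state: the hidden state itself is the memory, and the single gate `z` decides how much of it survives each step.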

## **5.4. Other Strategies**

Beyond advanced architectures like LSTMs and GRUs, several other strategies are employed to tackle the challenges, particularly the exploding and vanishing gradient problems, during RNN training:

*   **Truncated Backpropagation Through Time (BPTT)**:
    *   **Purpose**: Full Backpropagation Through Time (BPTT) can be computationally very expensive, especially when dealing with long sequences. It requires computing gradients by summing contributions across all time steps, potentially leading to very long chains of multiplications.
    *   **Mechanism**: To address this, **Truncated BPTT** approximates the true gradient by **limiting the backpropagation to a fixed number of recent time steps** (\(\tau\) steps) instead of the entire sequence history. This means that while hidden states are carried forward indefinitely, gradients are only propagated backward for a computationally manageable segment of the sequence.
    *   **Benefits**: This truncation improves **computational convenience and numerical stability**. It biases the model towards simpler and more stable learning, as it primarily focuses on short-term influence, which can be desirable in practice and has a slight regularizing effect. It helps prevent exploding gradients by limiting the length of the multiplicative chain.

*   **Using ReLU Activation Function (instead of tanh or sigmoid)**:
    *   **Problem Addressed**: Vanishing gradients occur when the derivatives of activation functions, especially sigmoid and hyperbolic tangent (tanh), become very small (approaching 0) in their saturated regions. When these small derivatives are multiplied across many time steps during backpropagation, the gradients for earlier layers or time steps exponentially shrink, making it difficult for the network to learn long-term dependencies.
    *   **Solution**: Replacing `tanh` or `sigmoid` with the **Rectified Linear Unit (ReLU)** activation function is a commonly suggested remedy. ReLU's derivative is 1 for positive inputs and 0 for negative inputs, so it does not shrink towards zero over large ranges of inputs and is therefore **less prone to vanishing gradients** than `tanh` or `sigmoid`. Because ReLU is unbounded, it is usually paired with careful weight initialization to keep activations from growing without limit.

*   **Appropriate Weight Initialization ($\mathbf{W}$ matrix)**:
    *   **Problem Addressed**: Improper weight initialization can exacerbate both vanishing and exploding gradient problems. Large initial weights can lead to exploding gradients, while very small weights can contribute to vanishing gradients.
    *   **Solution**: **Properly initializing the weight matrix $\mathbf{W}$** (especially the recurrent weight matrix $\mathbf{W}_{hh}$) can help mitigate the effect of vanishing gradients. For instance, initializing $\mathbf{W}$ with an identity matrix has been observed to help in addressing the vanishing gradient problem. This careful initialization ensures that gradients can flow more effectively through the network during early training stages, contributing to better learning of long-term dependencies.
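
To see why the scale of the recurrent matrix and the length of the backward chain both matter, the following sketch tracks the norm of a gradient as it is pushed backward through a deliberately simplified recurrence. It assumes a linear recurrence (activation derivative equal to 1, as for ReLU on positive inputs) and scaled identity matrices for $\mathbf{W}_{hh}$; the function name and the chosen scales are illustrative assumptions.

```python
import numpy as np

def backprop_norms(W_hh, steps, rng):
    """Norm of a unit-length gradient after each backward step through h_t = W_hh @ h_{t-1}."""
    g = rng.normal(size=W_hh.shape[0])
    g /= np.linalg.norm(g)
    norms = []
    for _ in range(steps):
        g = W_hh.T @ g          # one step of backpropagation through time
        norms.append(np.linalg.norm(g))
    return norms

rng = np.random.default_rng(0)
n, steps = 8, 30
results = {}
for name, scale in [("identity", 1.0), ("0.5 * identity", 0.5), ("1.5 * identity", 1.5)]:
    results[name] = backprop_norms(scale * np.eye(n), steps, rng)
    print(f"{name:>15}: gradient norm after {steps} steps = {results[name][-1]:.3e}")
```

With identity initialization the norm stays at 1, with the smaller scale it shrinks by a factor of 0.5 per step, and with the larger scale it grows by a factor of 1.5 per step. Truncating backpropagation to $\tau = 5$ steps would cap the growth in the last case at $1.5^5 \approx 7.6$ instead of $1.5^{30}$.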


# Step-by-Step Solutions to All Exercises

## 1. Assume that we have a symmetric matrix $\mathbf{M} \in \mathbb{R}^{n \times n}$ with eigenvalues $\lambda_i$ whose corresponding eigenvectors are $\mathbf{v}_i$ ($i = 1, \ldots, n$). Without loss of generality, assume that they are ordered in the order $|\lambda_i| \geq |\lambda_{i+1}|$. 
   1. Show that $\mathbf{M}^k$ has eigenvalues $\lambda_i^k$.
   1. Prove that for a random vector $\mathbf{x} \in \mathbb{R}^n$, with high probability $\mathbf{M}^k \mathbf{x}$ will be very much aligned with the eigenvector $\mathbf{v}_1$ of $\mathbf{M}$. Formalize this statement.
   1. What does the above result mean for gradients in RNNs?

### Problem Statement

Given:

- A symmetric matrix $\mathbf{M} \in \mathbb{R}^{n \times n}$,
- Eigenvalues $\lambda_i$ and corresponding eigenvectors $\mathbf{v}_i$, ordered such that $|\lambda_1| \geq |\lambda_2| \geq \cdots \geq |\lambda_n|$.


### 1a) Show that $\mathbf{M}^k$ has eigenvalues $\lambda_i^k$.

**Proof:**

Since $\mathbf{M}$ is symmetric, it is diagonalizable with an orthonormal eigenbasis:

$$
\mathbf{M} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^\top,
$$

where $\mathbf{V} = [\mathbf{v}_1, \cdots, \mathbf{v}_n]$ and $\mathbf{\Lambda} = \operatorname{diag}(\lambda_1, \cdots, \lambda_n)$.

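Then, because $\mathbf{V}^\top \mathbf{V} = \mathbf{I}$, the inner factors cancel when the product is expanded: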

$$
\mathbf{M}^k = (\mathbf{V} \mathbf{\Lambda} \mathbf{V}^\top)^k = \mathbf{V} \mathbf{\Lambda}^k \mathbf{V}^\top,
$$

and

$$
\mathbf{\Lambda}^k = \operatorname{diag}(\lambda_1^k, \cdots, \lambda_n^k).
$$

Therefore, $\mathbf{M}^k$ has eigenvalues $\lambda_i^k$ with the same eigenvectors $\mathbf{v}_i$.

### 1b) Prove that for a random vector $\mathbf{x} \in \mathbb{R}^n$, with high probability, $\mathbf{M}^k \mathbf{x}$ will be very aligned with $\mathbf{v}_1$.

**Proof:**

Express $\mathbf{x}$ in the eigenbasis of $\mathbf{M}$:

$$
\mathbf{x} = \sum_{i=1}^n a_i \mathbf{v}_i, \quad \text{where } a_i = \mathbf{v}_i^\top \mathbf{x}.
$$

Then,

$$
\mathbf{M}^k \mathbf{x} = \sum_{i=1}^n a_i \mathbf{M}^k \mathbf{v}_i = \sum_{i=1}^n a_i \lambda_i^k \mathbf{v}_i.
$$

Because $|\lambda_1| \ge |\lambda_2| \ge \cdots$, and assuming $|\lambda_1| > |\lambda_2|$, we have

$$
\mathbf{M}^k \mathbf{x} = \lambda_1^k \left(a_1 \mathbf{v}_1 + \sum_{i=2}^n a_i \left(\frac{\lambda_i}{\lambda_1}\right)^k \mathbf{v}_i\right).
$$

As $k \to \infty$,

$$
\left|\frac{\lambda_i}{\lambda_1}\right|^k \to 0, \quad \text{for } i > 1.
$$

Thus,

$$
\mathbf{M}^k \mathbf{x} \approx \lambda_1^k a_1 \mathbf{v}_1,
$$

and normalized,

$$
\frac{\mathbf{M}^k \mathbf{x}}{\|\mathbf{M}^k \mathbf{x}\|} \to \pm \mathbf{v}_1,
$$

where the sign depends on $a_1$ and $\lambda_1^k$.

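Since $\mathbf{x}$ is drawn from a continuous distribution (for example, a standard Gaussian), the event $a_1 = 0$ has probability zero. Hence, with probability 1, the direction of $\mathbf{M}^k \mathbf{x}$ converges to $\pm\mathbf{v}_1$ as $k$ grows, provided $|\lambda_1| > |\lambda_2|$.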
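
One way to formalize the statement is sketched here under the same assumptions ($|\lambda_1| > |\lambda_2|$ and $a_1 \neq 0$). The symbols $\rho = |\lambda_2 / \lambda_1| < 1$ and $\phi_k$, the angle between $\mathbf{M}^k \mathbf{x}$ and $\mathbf{v}_1$, are introduced only for this illustration. Using the orthonormality of the $\mathbf{v}_i$ and $|\lambda_i / \lambda_1| \leq \rho$ for $i \geq 2$:

$$
|\cos \phi_k| = \frac{|\mathbf{v}_1^\top \mathbf{M}^k \mathbf{x}|}{\|\mathbf{M}^k \mathbf{x}\|} = \frac{|a_1|}{\sqrt{a_1^2 + \sum_{i=2}^n a_i^2 \left(\frac{\lambda_i}{\lambda_1}\right)^{2k}}} \geq \left(1 + \rho^{2k} \, \frac{\|\mathbf{x}\|^2 - a_1^2}{a_1^2}\right)^{-1/2}.
$$

The right-hand side tends to 1 as $k \to \infty$, and the gap to 1 shrinks geometrically at a rate governed by $\rho^{2k}$. The closer $|\lambda_2|$ is to $|\lambda_1|$, the more powers of $\mathbf{M}$ are needed before the alignment becomes visible.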

### 1c) What does this mean for gradients in RNNs?

**Explanation:**

- In a Recurrent Neural Network (RNN), backpropagation through time obtains the gradient at time $t$ by multiplying the gradient at time $t+1$ by the transposed recurrent weight matrix $\mathbf{W}_{hh}^\top$ (and by the activation derivative). A gradient propagated back over $k$ steps therefore involves roughly the $k$-th power of $\mathbf{W}_{hh}$.
- If $\mathbf{W}_{hh}$ has eigenvalues $\lambda_i$, then after $k$ steps the gradient component along $\mathbf{v}_i$ is scaled approximately by $\lambda_i^k$. Since $\mathbf{W}_{hh}$ is generally not symmetric, the symmetric case analyzed above is a simplified picture.
- If $|\lambda_1| < 1$, all gradient components shrink exponentially (vanishing gradient problem).
- If $|\lambda_1| > 1$, the component along $\mathbf{v}_1$ grows exponentially (exploding gradient problem).
- Additionally, gradients tend to align with the eigenvector corresponding to $\lambda_1$, potentially losing useful directionality.
- Therefore, this result explains why RNNs have difficulties learning long-term dependencies and motivates architectures such as LSTM/GRU or methods like gradient clipping and orthogonal weight constraints.




```python
import numpy as np

# Fix seed
np.random.seed(42)
n = 5  # matrix dimension

# Create a random symmetric matrix M
A = np.random.randn(n, n)
M = (A + A.T) / 2

# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eigh(M)

# Sort by absolute eigenvalues descending
idx = np.argsort(np.abs(eigenvalues))[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

print("Eigenvalues (sorted by absolute value):", eigenvalues)

# Generate a random vector x (normalized)
x = np.random.randn(n)
x /= np.linalg.norm(x)
print("\nRandom vector x (normalized):", x)

# Function for alignment
def alignment(u, v):
    return np.abs(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

for k in [1, 5, 10, 20]:
    Mk_x = np.linalg.matrix_power(M, k).dot(x)
    Mk_x /= np.linalg.norm(Mk_x)
    align = alignment(Mk_x, eigenvectors[:, 0])
    print(f"\nAfter multiplying by M^{k}:")
    print("Result vector (normalized):", Mk_x)
    print(f"Alignment with top eigenvector v1: {align:.6f}")
```

Output:

```
Eigenvalues (sorted by absolute value): [-2.81046932  1.99111341  0.97150943  0.48952788  0.22380105]

Random vector x (normalized): [ 0.07996633 -0.8297745   0.27084828 -0.43301255 -0.21028791]

After multiplying by M^1:
Result vector (normalized): [-0.07229559 -0.60910758  0.28609581  0.6894858   0.25791437]
Alignment with top eigenvector v1: 0.695460

After multiplying by M^5:
Result vector (normalized): [-0.17541917 -0.14597706  0.29776944  0.75241773  0.54140533]
Alignment with top eigenvector v1: 0.969567

After multiplying by M^10:
Result vector (normalized): [ 0.22346416 -0.11136051 -0.35208554 -0.66161548 -0.6131585 ]
Alignment with top eigenvector v1: 0.998986

After multiplying by M^20:
Result vector (normalized): [ 0.21724148 -0.07306004 -0.34561514 -0.67893253 -0.60586233]
Alignment with top eigenvector v1: 0.999999
```


## Goal of the Code

- Given a symmetric matrix $M \in \mathbb{R}^{n \times n}$ with eigenvectors $v_i$ and eigenvalues $\lambda_i$, ordered such that

$$
|\lambda_1| \geq |\lambda_2| \geq \cdots
$$

- Given a random vector $x$ in $\mathbb{R}^n$.
- We apply powers of $M$ to $x$, i.e., compute $M^k x$.
- Then we observe whether $M^k x$ becomes increasingly close (aligned) to the principal eigenvector $v_1$, which corresponds to the eigenvalue $\lambda_1$ of largest absolute value.
- We calculate and display the *alignment*, which is the absolute value of the dot product between the normalized vector $\frac{M^k x}{\| M^k x \|}$ and $v_1$.


## Detailed Explanation of the Main Code Snippet

```python
# Initialize a random vector x and normalize its norm
x = np.random.randn(n)
x /= np.linalg.norm(x)  # Normalize to have norm = 1

# Loop over powers k = 1, 5, 10, 20
for k in [1, 5, 10, 20]:
    Mk_x = np.linalg.matrix_power(M, k).dot(x)  # Compute M^k * x
    Mk_x /= np.linalg.norm(Mk_x)                 # Normalize the resulting vector
    align = np.abs(np.dot(Mk_x, eigenvectors[:, 0]))  # Compute alignment w.r.t. v1
    print(f"\nAfter multiplying by M^{k}:")
    print("Vector (normalized):", Mk_x)
    print("Alignment with principal eigenvector v1: {:.6f}".format(align))
```

- `np.linalg.matrix_power(M, k)` computes the matrix power $M^k$.
- `.dot(x)` multiplies the matrix $M^k$ by the vector $x$.
- The resulting vector is normalized to focus on its *direction* rather than magnitude.
- The alignment is the absolute value of the dot product between the normalized vector $M^k x$ and the principal eigenvector $v_1$. An alignment closer to 1 means the vectors are nearly parallel.



| Step $k$ | Normalized vector $\frac{M^k x}{\lVert M^k x \rVert}$ | Alignment with $v_1$ |
| :-- | :-- | :-- |
| 1 | $[-0.0723,\ -0.6091,\ 0.2861,\ 0.6895,\ 0.2579]$ | 0.695460 |
| 5 | $[-0.1754,\ -0.1460,\ 0.2978,\ 0.7524,\ 0.5414]$ | 0.969567 |
| 10 | $[0.2235,\ -0.1114,\ -0.3521,\ -0.6616,\ -0.6132]$ | 0.998986 |
| 20 | $[0.2172,\ -0.0731,\ -0.3456,\ -0.6789,\ -0.6059]$ | 0.999999 |

*Notes:*

- The values are taken from the recorded output of the run above ($n = 5$, seed 42), with vector entries rounded to four decimals.
- The normalized vectors show the direction of $M^k x$ at each step $k$.
- The sign of the normalized vector flips between odd and even $k$ because the dominant eigenvalue is negative ($\lambda_1 \approx -2.81$), which is why the alignment uses the absolute value of the dot product.
- The alignment increases as $k$ grows, approaching 1, indicating $M^k x$ aligns more closely with the principal eigenvector $v_1$.


## 2. Besides gradient clipping, can you think of any other methods to cope with gradient explosion in recurrent neural networks?

**Other methods include:**

1. **Orthogonal or unitary weight initialization or constraints:**
Keep recurrent weight matrices orthogonal/unitary to maintain norm and avoid explosion or vanishing.
2. **Gated architectures (LSTM, GRU):**
Gates control information flow to stabilize gradients over long sequences.
3. **Truncated Backpropagation Through Time (TBPTT):**
Limit gradient backpropagation to manageable sequence lengths, avoiding very large powers.
4. **Normalization techniques:**
Apply LayerNorm or BatchNorm in RNNs to stabilize training.
5. **Regularization:**
Spectral norm regularization or weight decay to control weight matrix norms.
6. **Skip connections or residual connections across time or layers:**
Help gradients flow better across many time steps.
7. **Adaptive optimizers (Adam, RMSProp):**
Help mitigate poor step sizes due to gradient scaling.
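
As an illustration of the first item, an orthogonal recurrent matrix can be obtained from the QR decomposition of a random Gaussian matrix. This is a minimal sketch under the simplifying assumption of a linear recurrence; the function name `orthogonal_init` and the sizes are hypothetical, and the sign correction is an optional step that only makes the factorization unique.

```python
import numpy as np

def orthogonal_init(n, rng):
    """Return a random n x n orthogonal matrix via QR decomposition."""
    A = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(A)
    # Flip column signs so that the diagonal of R is positive (Q stays orthogonal).
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
n = 16
W_hh = orthogonal_init(n, rng)

# Push a gradient backward through 100 linear steps; its norm should be preserved.
g = rng.normal(size=n)
norm_start = np.linalg.norm(g)
for _ in range(100):
    g = W_hh.T @ g
norm_end = np.linalg.norm(g)
print(f"norm at start: {norm_start:.6f}, norm after 100 steps: {norm_end:.6f}")
```

All eigenvalues of an orthogonal matrix have absolute value 1, so repeated multiplication neither shrinks nor inflates the gradient in this linear setting. During training the matrix drifts away from orthogonality unless a constraint or regularizer keeps it there.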


