# L12a: Introduction to Recurrent Neural Networks (RNNs)
In this lecture, we will explore the fundamentals of Recurrent Neural Networks (RNNs), a class of neural networks designed to handle sequential data. RNNs are particularly useful for tasks such as language modeling, time series prediction, and sequence classification.

> __Learning Objectives:__
> 
> By the end of this lecture, you should be able to:
> 
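> * __Describe the sequence modeling problem__: explain why sequential data calls for a model with memory, and why the Markov assumption behind Hidden Markov Models can be too restrictive.
> * __Formulate simple RNNs__: write the update equations for the Elman and Jordan networks and count their parameters.
> * __Explain the training challenges__: describe backpropagation through time (BPTT), the vanishing and exploding gradients problems, and how gating mechanisms help mitigate them.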

Let's get started!
___

## Example
Today, we will use the following example to illustrate key concepts:
 
> [▶ Can we estimate the similarity of different firms?](CHEME-5820-L4a-Example-MeasureFirmSimilarityScores-Spring-2026.ipynb). In this example, let's explore how to measure the similarity between different firms based upon the similarity of their daily growth rates over 10-year periods. Does this similarity correlate with other firm metrics, e.g., business sector, market capitalization, etc.?
___

## General problem: Modeling a Sequence
Suppose we have a _sequence of data_ $(x_1, x_2, \ldots, x_T)$ where $T$ is the sequence length, and $x_i$ is the $i$-th element (token) of the sequence. What are some examples of sequences?

> __Examples__
> 
> In natural language processing, $x_{i}$ could be a word or a character in a text. In time series analysis, on the other hand, $x_{i}$ could be a measurement, e.g., temperature, pressure, or price, at time step $i$.

To model a sequence, i.e., predict the next token given past tokens, we _could try_ to use tools such as [Hidden Markov Models (HMMs)](https://en.wikipedia.org/wiki/Hidden_Markov_model). However, HMMs are limited in their ability to capture _long-range dependencies_ and complex relationships between elements in the sequence. 

> __Why not HMMs?__
>
> Hidden Markov Models (HMMs) use the [Markov property](https://en.wikipedia.org/wiki/Markov_property), which says that the future state of a system depends only on its current state and not on its past states. This assumption is often too restrictive for many real-world applications, where the relationships between elements in a sequence can be more complex and require a more flexible modeling approach.

This is where RNNs come in. RNNs are designed to handle data sequences by maintaining a hidden state that captures information about previous inputs. This allows them to model long-range dependencies and contextual relationships between elements in the sequence.

<div>
    <center>
      <img
        src="figs/Recurrent_neural_network_unfold.svg"
        alt="Unfolded recurrent neural network"
        height="400"
        width="800"
      />
    </center>
</div>

## What are Recurrent Neural Networks (RNNs)?
Recurrent Neural Networks (RNNs) are artificial neural networks designed to process sequential data by retaining information about previous inputs through their internal memory. 

> __How are RNNs different from feedforward neural networks?__
>
> * __Do feedforward neural networks have memory?__ No, feedforward neural networks (FNNs) do not retain information about previous inputs; each input is processed independently of the ones that came before it. Once training is over, the parameters (weights and bias values) are fixed, so when we feed in values, an FNN simply applies the operations that make up the network using the parameter values it has learned. Its output depends only on the current input.
> * __How are RNNs different from feedforward neural networks?__ RNNs have connections that loop back on themselves, allowing them to maintain a _hidden state_ that captures information about previous inputs. This makes RNNs particularly effective for tasks such as language modeling, time-series prediction, and speech recognition, where context and dependencies between data points are crucial. 

Let's look at two types of _simple_ RNNs: the Elman and Jordan networks.

### Elman Network: Mathematical Formulation
The Elman network is a simple RNN type consisting of an input layer, a hidden layer, and an output layer. The hidden layer has recurrent connections that allow it to maintain a hidden state over time: [Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211.](https://onlinelibrary.wiley.com/doi/10.1207/s15516709cog1402_1)

__At each time step__: an Elman RNN takes an _input_ and the previous hidden state (memory) and computes the output entry at time $t$.  Let the input vector at time $t$ be denoted as $\mathbf{x}_t\in\mathbb{R}^{d_{in}}$, the hidden state at time $t$ as $\mathbf{h}_t\in\mathbb{R}^{h}$, and the output at time $t$ as $\mathbf{y}_t\in\mathbb{R}^{d_{out}}$. 

> __Elman RNN Architecture__
>
> The following equations can describe the Elman RNN:
> $$
\boxed{
\begin{align*}
\mathbf{h}_t &= \sigma_{h}(\mathbf{U}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b}_h) \\
\mathbf{y}_t &= \sigma_{y}(\mathbf{W}_y \mathbf{h}_t + \mathbf{b}_y)
\end{align*}}
> $$
> where the parameters are:
> * __Network weights__: the term $\mathbf{U}_h\in\mathbb{R}^{h\times{h}}$ is the weight matrix for the hidden state, $\mathbf{W}_x\in\mathbb{R}^{h\times{d_{in}}}$ is the weight matrix for the input, and $\mathbf{W}_y\in\mathbb{R}^{d_{out}\times{h}}$ is the weight matrix for the output.
> * __Network bias__: the term $\mathbf{b}_h\in\mathbb{R}^{h}$ denotes the bias vector for the hidden state, and $\mathbf{b}_y\in\mathbb{R}^{d_{out}}$ is the bias vector for the output.
> * __Activation function__: the $\sigma_{h}$ function is a _hidden layer activation function_, such as the sigmoid or hyperbolic tangent (tanh) function, which introduces non-linearity into the RNN. The activation function $\sigma_{y}$ is an _output activation function_ that can be a softmax function for classification tasks or a linear function for regression tasks.
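
To make the Elman equations concrete, here is a minimal sketch of the forward pass in plain Python. The helper names (`matvec`, `elman_step`, `elman_forward`), the toy weight values, and the choice of a tanh hidden activation with a linear output are all assumptions made for illustration; they are not prescribed by the lecture.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def elman_step(x_t, h_prev, U_h, W_x, b_h, W_y, b_y):
    """One Elman update. Assumed choices: sigma_h = tanh, sigma_y = identity."""
    pre = [a + b + c for a, b, c in zip(matvec(U_h, h_prev), matvec(W_x, x_t), b_h)]
    h_t = [math.tanh(p) for p in pre]                      # h_t = tanh(U_h h_{t-1} + W_x x_t + b_h)
    y_t = [a + b for a, b in zip(matvec(W_y, h_t), b_y)]   # y_t = W_y h_t + b_y
    return h_t, y_t

def elman_forward(xs, h0, params):
    """Run the network over a whole sequence, reusing the same parameters at every step."""
    h, ys = h0, []
    for x_t in xs:
        h, y = elman_step(x_t, h, *params)
        ys.append(y)
    return ys, h

# Hypothetical toy parameters with d_in = 1, h = 2, d_out = 1
U_h = [[0.5, 0.0], [0.0, 0.5]]   # h x h
W_x = [[1.0], [-1.0]]            # h x d_in
b_h = [0.0, 0.0]
W_y = [[1.0, -1.0]]              # d_out x h
b_y = [0.0]

xs = [[1.0], [0.0], [0.0]]       # a single pulse followed by zeros
ys, h_final = elman_forward(xs, [0.0, 0.0], (U_h, W_x, b_h, W_y, b_y))
print(ys)
```

The second and third inputs are zero, yet the corresponding outputs remain nonzero (and decay), because the hidden state carries information about the first input forward in time.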

How many parameters are there in the Elman network? The number of parameters in an Elman RNN can be calculated as follows:
* _Hidden state_: The number of parameters for the hidden state is $h^2 + d_{in}h + h = h(h + d_{in} + 1)$
* _Output_: The number of parameters for the output is $d_{out}h + d_{out} = d_{out}(h + 1)$
* _Total_: The total number of parameters in the Elman RNN is $h(h + d_{in} + 1) + d_{out}(h + 1)$
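
The count above can be checked with a few lines of Python. This is a small sketch; the function name `elman_parameter_count` and the example dimensions are hypothetical.

```python
def elman_parameter_count(d_in, h, d_out):
    """Count the trainable parameters of an Elman RNN."""
    hidden = h * h + d_in * h + h      # U_h, W_x, and b_h
    output = d_out * h + d_out         # W_y and b_y
    return hidden + output

# Example dimensions (assumed): d_in = 3, h = 4, d_out = 2
n_params = elman_parameter_count(d_in=3, h=4, d_out=2)
print(n_params)  # 4*(4 + 3 + 1) + 2*(4 + 1) = 32 + 10 = 42
```

Note that the count does not depend on the sequence length $T$, because the same parameters are reused at every time step.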

### Jordan Network: Mathematical Formulation
The Jordan network is another type of RNN similar to the Elman network but with a different architecture. In a Jordan network, the output layer is connected back to the hidden layer, allowing the network to maintain a hidden state based on the output at the previous time step.
* [Jordan, Michael I. (1997-01-01). "Serial Order: A Parallel Distributed Processing Approach". Neural-Network Models of Cognition — Biobehavioral Foundations. Advances in Psychology. Vol. 121. pp. 471–495. doi:10.1016/s0166-4115(97)80111-2. ISBN 978-0-444-81931-4. S2CID 15375627.](https://www.sciencedirect.com/science/article/pii/S0166411597801112?via%3Dihub)

__At each time step__: a Jordan RNN takes an _input_, the previous state vector (memory), and the previous output and computes the output entry at time $t$. Thus, the Jordan network has a similar structure to the Elman network, but its memory is carried by a state vector that is driven by the previous output rather than by the previous hidden state.

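Let the input vector at time $t$ be denoted as $\mathbf{x}_t\in\mathbb{R}^{d_{in}}$, the hidden state at time $t$ as $\mathbf{h}_t\in\mathbb{R}^{h}$, the state vector at time $t$ as $\mathbf{s}_t\in\mathbb{R}^{s}$, and the output at time $t$ as $\mathbf{y}_t\in\mathbb{R}^{d_{out}}$.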

> __Jordan RNN Architecture__
> 
> The Jordan RNN can be described by the following equations:
> $$
\boxed{
\begin{align*}
\mathbf{h}_t &= \sigma_{h}(\mathbf{U}_h \mathbf{s}_{t} + \mathbf{W}_h \mathbf{x}_t + \mathbf{b}_h) \\
\mathbf{y}_t &= \sigma_{y}(\mathbf{W}_y \mathbf{h}_t + \mathbf{b}_y) \\
\mathbf{s}_t &= \sigma_{s}(\mathbf{W}_{ss} \mathbf{s}_{t-1} + \mathbf{W}_{sy} \mathbf{y}_{t-1} + \mathbf{b}_s)
\end{align*}}
> $$
> where the parameters are:
> * __Network weights__: the term $\mathbf{U}_h\in\mathbb{R}^{h\times{s}}$ is the weight matrix for the hidden state with respect to the state $\mathbf{s}$, $\mathbf{W}_h\in\mathbb{R}^{h\times{d_{in}}}$ is the weight matrix for the input, and $\mathbf{W}_y\in\mathbb{R}^{d_{out}\times{h}}$ is the weight matrix for the output. In addition, a Jordan network has parameters associated with the state $\mathbf{s}$: the matrix $\mathbf{W}_{ss}\in\mathbb{R}^{s\times{s}}$ is the weight matrix for the state with respect to the previous state $\mathbf{s}_{t-1}$, and $\mathbf{W}_{sy}\in\mathbb{R}^{s\times{d_{out}}}$ is the weight matrix for the state with respect to the previous output $\mathbf{y}_{t-1}$.
> * __Network bias__: the term $\mathbf{b}_h\in\mathbb{R}^{h}$ denotes the bias vector for the hidden state, $\mathbf{b}_y\in\mathbb{R}^{d_{out}}$ is the bias vector for the output, and $\mathbf{b}_s\in\mathbb{R}^{s}$ is the bias vector for the state.
> * __Activation function__: the $\sigma_{h}$ function is a _hidden layer activation function_, such as the sigmoid or hyperbolic tangent (tanh) function, which introduces non-linearity into the RNN. The activation function $\sigma_{y}$ is an _output activation function_ that can be a softmax function for classification tasks or a linear function for regression tasks, and $\sigma_{s}$ is a _state activation function_ that can be a sigmoid or tanh function.
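
The Jordan update can be sketched in the same way. In the plain-Python example below, the function names, the dictionary of toy parameters, and the activation choices (tanh for the state and hidden layers, a linear output) are assumptions for illustration only.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def jordan_step(x_t, s_prev, y_prev, p):
    """One Jordan update. Assumed choices: sigma_s = sigma_h = tanh, sigma_y = identity."""
    # s_t = tanh(W_ss s_{t-1} + W_sy y_{t-1} + b_s)
    s_t = [math.tanh(a + b + c) for a, b, c in
           zip(matvec(p["W_ss"], s_prev), matvec(p["W_sy"], y_prev), p["b_s"])]
    # h_t = tanh(U_h s_t + W_h x_t + b_h)
    h_t = [math.tanh(a + b + c) for a, b, c in
           zip(matvec(p["U_h"], s_t), matvec(p["W_h"], x_t), p["b_h"])]
    # y_t = W_y h_t + b_y
    y_t = [a + b for a, b in zip(matvec(p["W_y"], h_t), p["b_y"])]
    return s_t, y_t

# Hypothetical toy parameters with d_in = 1, h = 2, s = 2, d_out = 1
params = {
    "U_h": [[1.0, 0.0], [0.0, 1.0]],    # h x s
    "W_h": [[1.0], [0.5]],              # h x d_in
    "b_h": [0.0, 0.0],
    "W_y": [[1.0, -1.0]],               # d_out x h
    "b_y": [0.0],
    "W_ss": [[0.5, 0.0], [0.0, 0.5]],   # s x s
    "W_sy": [[1.0], [-1.0]],            # s x d_out
    "b_s": [0.0, 0.0],
}

s, y = [0.0, 0.0], [0.0]                # initial state and initial "previous output"
states, outputs = [], []
for x_t in [[1.0], [0.0]]:
    s, y = jordan_step(x_t, s, y, params)
    states.append(s)
    outputs.append(y)
print(outputs)
```

At the first step the state is still zero, so the output depends only on the input. At the second step the input is zero, but the output is nonzero because the previous output was fed back through the state vector.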

How many parameters are there in the Jordan network? 

> __Parameter Count in Jordan RNN__
>
> The number of parameters in a Jordan RNN can be calculated as follows:
> * _Hidden state_: The number of parameters for the hidden state is $N_{hidden} = sh + d_{in}h + h = h(s + d_{in} + 1)$
> * _Output_: The number of parameters for the output is $N_{output} = d_{out}h + d_{out} = d_{out}(h + 1)$
> * _State_: The number of parameters for the state is $N_{state} = s^2 + sd_{out} + s = s(s + d_{out} + 1)$
>
> The total number of parameters in the Jordan RNN is given by: 
> $$
\begin{align*}
N_{p} &= N_{hidden} + N_{output} + N_{state} \\
N_{p} & = h(s + d_{in} + 1) + d_{out}(h + 1) + s(s + d_{out} + 1)
\end{align*}
$$
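
As a quick check of the formula for $N_{p}$, the sketch below counts the Jordan parameters term by term. The function name `jordan_parameter_count` and the example dimensions are hypothetical.

```python
def jordan_parameter_count(d_in, h, s, d_out):
    """Count the trainable parameters of a Jordan RNN."""
    n_hidden = s * h + d_in * h + h     # U_h, W_h, and b_h
    n_output = d_out * h + d_out        # W_y and b_y
    n_state = s * s + s * d_out + s     # W_ss, W_sy, and b_s
    return n_hidden + n_output + n_state

# Example dimensions (assumed): d_in = 3, h = 4, s = 5, d_out = 2
n_params = jordan_parameter_count(d_in=3, h=4, s=5, d_out=2)
print(n_params)  # 4*(5 + 3 + 1) + 2*(4 + 1) + 5*(5 + 2 + 1) = 36 + 10 + 40 = 86
```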

## Training challenges with RNNs
The training process for RNNs is similar to that of feedforward neural networks but with a few key differences. The main difference is that RNNs are trained using _backpropagation through time_ (BPTT), which _unrolls the network_ across sequential steps to compute gradients and update shared weights. 

> __Backpropagation through time (BPTT)__
>
> * __What is BPTT?__ Backpropagation through time (BPTT) is a variant of the backpropagation algorithm that trains recurrent neural networks (RNNs). It involves _unrolling_ the RNN across time steps, treating it as a feedforward network, and then applying the standard backpropagation algorithm to compute gradients and update weights. BPTT allows RNNs to learn from data sequences by capturing temporal dependencies and adjusting weights based on the entire sequence.
> * __Issues__: However, BPTT is prone to the __vanishing gradients problem__, where gradients shrink exponentially with the number of time steps they are propagated back through, hindering the learning of long-term dependencies, and the __exploding gradients problem__, where gradients instead grow exponentially and destabilize training.
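
A scalar toy model gives some intuition for both problems. The sketch below assumes a one-dimensional recurrence $h_t = \sigma(u\,h_{t-1} + w\,x_t)$, which is a deliberate simplification of the Elman network, and accumulates $\partial h_T/\partial h_0$ with the chain rule. The function name, the parameter values, and the use of a linear (identity) activation to illustrate the exploding case are assumptions for illustration.

```python
import math

def gradient_through_time(u, w, xs, h0=0.0, linear=False):
    """Return d h_T / d h_0 for the scalar recurrence h_t = act(u*h_{t-1} + w*x_t).

    act is tanh by default, or the identity when linear=True.
    """
    h, grad = h0, 1.0
    for x_t in xs:
        a = u * h + w * x_t
        h = a if linear else math.tanh(a)
        local = 1.0 if linear else 1.0 - h * h   # derivative of the activation at a
        grad *= u * local                        # chain rule: d h_t / d h_{t-1}
    return grad

xs = [0.1] * 50                                  # a sequence of T = 50 inputs

g_vanish = gradient_through_time(u=0.5, w=1.0, xs=xs)
g_explode = gradient_through_time(u=1.5, w=1.0, xs=xs, linear=True)
print(g_vanish, g_explode)
```

With $|u| < 1$ and a tanh activation, each factor has magnitude below one, so the product over 50 steps is vanishingly small. With $|u| > 1$ and a linear activation, the product equals $u^{T}$ and grows very large. In both cases the effect is exponential in the sequence length.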

__Hmmm__: Suppose we didn't use gradient descent but instead used a different optimization algorithm, such as genetic algorithms, simulated annealing, or particle swarm optimization. Would that help with the vanishing gradients problem?

For more information (and intuition) about BPTT and the vanishing and exploding gradients problems, see [Chapter 10 of Goodfellow et al.](http://www.deeplearningbook.org/). These training challenges (and other factors) led to advanced architectures like [Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs)](https://arxiv.org/pdf/1412.3555), which use gating mechanisms to better regulate information flow and mitigate gradient issues.

> __What is gating?__ Gating mechanisms are components in neural networks, particularly in recurrent neural networks (RNNs), that control the flow of information by selectively allowing or blocking specific inputs or activations. They help manage the network's memory and learning process, enabling it to retain relevant information over time and discard irrelevant data. 
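
To illustrate the idea, here is a minimal sketch of a single scalar gate, loosely modeled on the update gate of a GRU. It is not a full LSTM or GRU cell: the function names, the parameter dictionary, and the specific update rule $h_t = (1 - z_t)\,h_{t-1} + z_t\,\tilde{h}_t$ are simplifying assumptions, and sign conventions for the gate differ between references.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gated_step(x_t, h_prev, p):
    """One step of a simplified, scalar gated recurrence (illustrative only)."""
    z = sigmoid(p["w_z"] * x_t + p["u_z"] * h_prev + p["b_z"])          # gate value in (0, 1)
    h_cand = math.tanh(p["w_h"] * x_t + p["u_h"] * h_prev + p["b_h"])   # candidate state
    return (1.0 - z) * h_prev + z * h_cand                               # blend old and new

def run(h0, xs, p):
    h = h0
    for x_t in xs:
        h = gated_step(x_t, h, p)
    return h

# Hypothetical parameters; only the gate bias b_z differs between the two runs
base = {"w_z": 0.0, "u_z": 0.0, "w_h": 1.0, "u_h": 0.5, "b_h": 0.0}
xs = [0.0] * 20

h_closed = run(0.8, xs, {**base, "b_z": -10.0})   # gate nearly closed: memory is kept
h_open = run(0.8, xs, {**base, "b_z": 10.0})      # gate nearly open: memory is overwritten
print(h_closed, h_open)
```

When the gate is nearly closed, the initial memory of 0.8 survives 20 steps almost unchanged. When the gate is nearly open, the state is overwritten at every step and the memory decays toward zero, much like in the ungated recurrence.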

Let's watch [a video from the IBM Technology channel about LSTMs](https://www.yout-ube.com/watch?v=b61DPVFX03I).

## Summary
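Recurrent Neural Networks (RNNs) model sequential data by maintaining a hidden state that summarizes previous inputs, which lets them capture dependencies that feedforward networks and Markov models cannot.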

> __Key Takeaways:__
> 
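> * __RNNs have memory__: recurrent connections let the network carry information about previous inputs forward in time through a hidden state.
> * __Elman and Jordan networks are simple RNNs__: an Elman network feeds the previous hidden state back into the hidden layer, while a Jordan network feeds the previous output back through a state vector. In both cases, the parameter count does not depend on the sequence length.
> * __Training is hard__: RNNs are trained with backpropagation through time (BPTT), which is prone to vanishing and exploding gradients. Gated architectures such as LSTMs and GRUs were developed to mitigate these issues.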

Simple RNNs give us the vocabulary (hidden state, recurrence, and BPTT) needed to understand gated architectures such as LSTMs and GRUs.
___