# L12c: Recurrent Neural Networks

___

In this lecture, we continue our discussion of artificial neural networks by introducing recurrent neural networks (RNNs). RNNs are a class of neural networks well suited to processing sequential data, such as time series or natural language. We will cover the following topics:

* __What are RNNs?__: Recurrent Neural Networks (RNNs) are a type of artificial neural network designed to process sequential data by retaining information about previous inputs through their _internal memory_. This makes them particularly effective for tasks such as language modeling, time-series prediction, and speech recognition, where context and dependencies between data points are crucial.
* __How do RNNs work?__: RNNs maintain a hidden state updated at each time step based on the current input and the previous hidden state. This allows them to capture temporal dependencies in the data. The basic building block of an RNN is a recurrent layer, which processes the input sequence one element at a time while updating its hidden state.
* __Training RNNs__: RNNs are trained using backpropagation through time (BPTT), which _unrolls the network_ across sequential steps to compute gradients and update shared weights. However, this process is prone to the __vanishing gradients problem__, where gradients shrink exponentially during backpropagation, hindering the learning of long-term dependencies, and the __exploding gradients problem__, where unchecked gradient growth destabilizes training. These challenges led to advanced architectures like Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), which use gating mechanisms to better regulate information flow and mitigate gradient issues.

Sources for this lecture include:
* [Goodfellow et al., Deep Learning Book, 2016 MIT Press](http://www.deeplearningbook.org/)

To get a general overview of RNNs, check out the following [video from the IBM technology channel](https://www.yout-ube.com/watch?v=Gafjk7_w1i8) on YouTube. It provides a good introduction to the topic and covers some key concepts discussed in this lecture.
___

## Setup, Data, and Prerequisites
Let's set up the computational environment (e.g., import the necessary libraries and codes) by including the `Include.jl` file.

In [3]:
include("Include.jl");

### Data
We'll use a weather dataset for this lecture. The dataset contains daily weather data for Cornell from January 2025 until last week, including low and high temperatures for each day. The data is available in the repository's `data` folder. 
* _Data_: The data is in CSV format; we load it using [the `CSV.jl` package](https://github.com/JuliaData/CSV.jl) and store the data [using the `DataFrame` type exported from the `DataFrames.jl` package](https://dataframes.juliadata.org/stable/). 

We store the `TMIN` and `TMAX` columns, converted to `Float32`, in the `X` variable (a `DataFrame`, so the columns are available as `X.TMIN` and `X.TMAX`), and the `rawdata::DataFrame` variable contains the entire dataset.

In [40]:
X, rawdata = let

    # raw data -
    rawdata = CSV.read(joinpath(_PATH_TO_DATA, "Temp-ITH-YTD-NOAA-2025.csv"), DataFrame); # load the data from a CSV file into a DataFrame
    X = @select rawdata :TMIN :TMAX; # Wow! Grab the Tmax and Tmin using the @select macro from the DataFramesMeta.jl package.
    X̂ = X .|> Float32 # convert to Float32

    # return -
    X̂,rawdata;
end;

Let's set some constants that we'll use throughout the lecture. Please take a look at the comment next to each constant for its purpose, permissible values, default value, etc.

In [7]:
number_of_inputs = 1; # dimension of the input
number_of_outputs = 1; # dimension of the output
number_of_hidden_states = 2; # number of hidden neurons
σ₁ = NNlib.tanh_fast; # activation function for the hidden (recurrent) layer
σ₂ = NNlib.tanh_fast; # activation function for the output layer
number_of_epochs = 250; # how many epochs do we want to train for?
number_of_training_samples = 10; # how many training samples do we want to use?

__Training data__: We need to convert the weather data into the form $\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i$ is the input sequence and $y_i$ is the target output. In this case, we can use the `TMIN` values as the input sequence and the `TMAX` value for the same day as the target output. 

We store the training data in the `training_dataset::Array{Tuple{Float32,Float32},1}` variable.

In [48]:
training_dataset, x = let

    # initialize - 
    training_tuple_array = Array{Tuple{Float32,Float32}}(undef, number_of_training_samples); # create an empty array of tuples to store the training data
    y = X.TMAX; # extract the TMAX column from the DataFrame
    x = X.TMIN; # extract the TMIN column from the DataFrame

    # build training tuples -
    for i ∈ 1:number_of_training_samples
        xᵢ = x[i];
        yᵢ = y[i];
        training_tuple_array[i] = (xᵢ,yᵢ); # store the (TMIN, TMAX) pair for day i
    end

    # return -
    training_tuple_array, x;
end;

## General problem: Modeling a Sequence
Suppose we have a _sequence of data_ $(x_1, x_2, \ldots, x_T)$ where $T$ is the sequence length, and $x_i$ is the $i$-th element (token) of the sequence. 
* _Example sequences_: in natural language processing, $x_i$ could be a word in a sentence or a character in a word. In time series analysis, $x_i$ could be a measurement, e.g., temperature, pressure, price, etc., at time $i$.

To model this sequence, i.e., predict the next token given past tokens, we _could try_ to use tools such as [Hidden Markov Models (HMMs)](https://en.wikipedia.org/wiki/Hidden_Markov_model). However, HMMs are limited in their ability to capture _long-range dependencies_ and complex relationships between elements in the sequence. 
* _Why is this true?_ HMMs use the [Markov property](https://en.wikipedia.org/wiki/Markov_property), which says that the future state of a system depends only on its current state and not on its past states. This assumption is often too restrictive for many real-world applications, where the relationships between elements in a sequence can be more complex and require a more flexible modeling approach.
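
In symbols, writing $s_t$ for the (hidden) state of the chain at time $t$ (notation introduced here only for illustration), the first-order Markov property reads:
$$
P(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_1) = P(s_{t+1} \mid s_t)
$$
so everything the model knows about the past must be carried by the single current state $s_t$.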

This is where RNNs come in. RNNs are designed to handle data sequences by maintaining a hidden state that captures information about previous inputs. This allows them to model long-range dependencies and contextual relationships between elements in the sequence.

## What are RNNs?
Recurrent Neural Networks (RNNs) are artificial neural networks designed to process sequential data by retaining information about previous inputs through their internal memory. 

* _Do feedforward neural networks have memory?_ No. A feedforward neural network (FNN) does not retain information about previous inputs. Once training is over, the parameters (weights and bias values) are fixed, and each input is processed independently of the inputs that came before it. When we feed in values, an FNN simply applies the operations that make up the network, using the parameter values it has learned, to the current input.
* _How are RNNs different from feedforward neural networks?_ RNNs have connections that loop back on themselves, allowing them to maintain a _hidden state_ that captures information about previous inputs. This makes RNNs particularly effective for tasks such as language modeling, time-series prediction, and speech recognition, where context and dependencies between data points are crucial. 

Let's look at two types of _simple_ RNNs: the Elman and Jordan networks.

<img
  src="figs/recurrent_neural_network_unfold.svg"
  alt="Recurrent neural network unfolded across time steps"
  height="400"
  width="800" />

### Elman Network: Mathematical Formulation
The Elman network is a simple type of RNN consisting of an input layer, a hidden layer, and an output layer. The hidden layer has recurrent connections that allow it to maintain a hidden state over time:

* [Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211.](https://onlinelibrary.wiley.com/doi/10.1207/s15516709cog1402_1)

__At each time step__: an Elman RNN takes an _input_ and the previous hidden state (memory) and computes the output entry at time $t$. 

Let the input vector at time $t$ be denoted as $\mathbf{x}_t\in\mathbb{R}^{d_{in}}$, the hidden state at time $t$ as $\mathbf{h}_t\in\mathbb{R}^{h}$, and the output at time $t$ as $\mathbf{y}_t\in\mathbb{R}^{d_{out}}$. The Elman RNN is then described by the following equations:
$$
\begin{align*}
\mathbf{h}_t &= \sigma_{h}(\mathbf{U}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b}_h) \\
\mathbf{y}_t &= \sigma_{y}(\mathbf{W}_y \mathbf{h}_t + \mathbf{b}_y)
\end{align*}
$$
where the parameters are:
* _Network weights_: the term $\mathbf{U}_h\in\mathbb{R}^{h\times{h}}$ is the weight matrix for the hidden state, $\mathbf{W}_x\in\mathbb{R}^{h\times{d_{in}}}$ is the weight matrix for the input, and $\mathbf{W}_y\in\mathbb{R}^{d_{out}\times{h}}$ is the weight matrix for the output
* _Network bias_: the term $\mathbf{b}_h\in\mathbb{R}^{h}$ denotes the bias vector for the hidden state, and $\mathbf{b}_y\in\mathbb{R}^{d_{out}}$ is the bias vector for the output.
* _Activation function_: the $\sigma_{h}$ function is a _hidden layer activation function_, such as the sigmoid or hyperbolic tangent (tanh) function, which introduces non-linearity into the RNN. The activation function $\sigma_{y}$ is an _output activation function_ that can be a softmax function for classification tasks or a linear function for regression tasks.
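
To make the two update equations concrete, here is a minimal sketch of a single Elman step and a loop over a short sequence, written in plain Julia with no `Flux.jl` dependency. The `ElmanParameters` type, the `elman_step` and `elman_forward` functions, and all numerical values are hypothetical and chosen only for illustration; this is not how `Flux.RNN` is implemented internally.

```julia
# Hypothetical container for the Elman parameters; field names mirror the equations above.
struct ElmanParameters
    Uh::Matrix{Float64} # h × h, hidden-to-hidden weights
    Wx::Matrix{Float64} # h × d_in, input-to-hidden weights
    bh::Vector{Float64} # h, hidden bias
    Wy::Matrix{Float64} # d_out × h, hidden-to-output weights
    by::Vector{Float64} # d_out, output bias
end

# One time step: compute the new hidden state hₜ, then the output yₜ.
function elman_step(p::ElmanParameters, h_prev::Vector{Float64}, x::Vector{Float64};
    σh = tanh, σy = identity)
    h = σh.(p.Uh * h_prev .+ p.Wx * x .+ p.bh)
    y = σy.(p.Wy * h .+ p.by)
    return h, y
end

# Apply the step along a sequence, carrying the hidden state forward (zero initial state assumed).
function elman_forward(p::ElmanParameters, xs::Vector{Vector{Float64}})
    h = zeros(size(p.Uh, 1))
    ys = Vector{Vector{Float64}}()
    for x in xs
        h, y = elman_step(p, h, x)
        push!(ys, y)
    end
    return ys, h
end

# Illustrative values with d_in = 1, h = 2, d_out = 1
p = ElmanParameters([0.5 0.0; 0.0 0.5], reshape([1.0, -1.0], 2, 1), zeros(2), [1.0 0.0], [0.0])
ys, h_final = elman_forward(p, [[0.1], [0.1]]) # the same input at both time steps
```

Although the input is identical at both time steps, the two outputs differ because the second step also sees the hidden state left behind by the first; this carried-over state is the network's memory.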

How many parameters are there in the Elman network? The number of parameters in an Elman RNN can be calculated as follows:
* _Hidden state_: The number of parameters for the hidden state is $h^2 + d_{in}h + h = h(h + d_{in} + 1)$
* _Output_: The number of parameters for the output is $d_{out}h + d_{out} = d_{out}(h + 1)$
* _Total_: The total number of parameters in the Elman RNN is $h(h + d_{in} + 1) + d_{out}(h + 1)$
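
As a quick check of the formula, the sketch below evaluates it and compares it against a direct count of the entries in each weight matrix and bias vector. Both function names are hypothetical helpers introduced here for illustration.

```julia
# Parameter count from the closed-form expression above
elman_parameter_count(d_in::Int, h::Int, d_out::Int) = h * (h + d_in + 1) + d_out * (h + 1)

# Independent check: add up the number of entries in each array
function elman_parameter_count_by_shape(d_in::Int, h::Int, d_out::Int)
    shapes = [(h, h), (h, d_in), (h,), (d_out, h), (d_out,)] # U_h, W_x, b_h, W_y, b_y
    return sum(prod, shapes)
end

n = elman_parameter_count(1, 2, 1) # d_in = 1, h = 2, d_out = 1
```

With `number_of_inputs = 1`, `number_of_hidden_states = 2`, and `number_of_outputs = 1` (the constants set earlier), both functions give 11 parameters.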

Let's build a simple Elman RNN to understand better how it works.

In [13]:
Flux.@layer MyFluxElmanRecurrentNeuralNetworkModel; # register the custom model type with Flux (a "namespace" of sorts)
MyElmanRNNModel() = MyFluxElmanRecurrentNeuralNetworkModel( # convenience constructor that wraps a Flux.Chain
    Flux.Chain(
        hidden = Flux.RNN(number_of_inputs => number_of_hidden_states, σ₁),  # hidden layer
        output = Flux.Dense(number_of_hidden_states => number_of_outputs, σ₂) # output layer
    )
);
elmanmodel = MyElmanRNNModel().chain;

Let's explore what is happening in each component of the Elman RNN.

In [15]:
# rnn = RNN(number_of_inputs => number_of_hidden_states, σ₁);
rnn = elmanmodel; # use the full chain (hidden RNN layer followed by the dense output layer)
x = rand(Float32, (number_of_inputs, 1)); # create a random input of size number_of_inputs × 1
h = zeros(Float32, (number_of_hidden_states, 1)); # zero initial hidden state (not used in the call below)
rnn(x) # pass the input through the chain; the result has size number_of_outputs × 1

1×1 Matrix{Float32}:
 0.22428186

### Jordan Network: Mathematical Formulation
The Jordan network is another type of RNN similar to the Elman network but with a different architecture. In a Jordan network, the output layer is connected back to the hidden layer, allowing the network to maintain a hidden state based on the output at the previous time step.
* [Jordan, Michael I. (1997-01-01). "Serial Order: A Parallel Distributed Processing Approach". Neural-Network Models of Cognition — Biobehavioral Foundations. Advances in Psychology. Vol. 121. pp. 471–495. doi:10.1016/s0166-4115(97)80111-2. ISBN 978-0-444-81931-4. S2CID 15375627.](https://www.sciencedirect.com/science/article/pii/S0166411597801112?via%3Dihub)

__At each time step__: a Jordan RNN takes an _input_, the previous state $\mathbf{s}_{t-1}$ (memory), and the previous output $\mathbf{y}_{t-1}$, and computes the output entry at time $t$. Thus, the Jordan network has a similar structure to the Elman network but maintains its memory differently: the output layer, rather than the hidden layer, feeds back into the network.

Let the input vector at time $t$ be denoted as $\mathbf{x}_t\in\mathbb{R}^{d_{in}}$, the hidden state at time $t$ as $\mathbf{h}_t\in\mathbb{R}^{h}$, 
the state vector at time $t$ as $\mathbf{s}_t\in\mathbb{R}^{s}$, and the output at time $t$ as $\mathbf{y}_t\in\mathbb{R}^{d_{out}}$. Then, the Jordan RNN can be described by the following equations:
$$
\begin{align*}
\mathbf{h}_t &= \sigma_{h}(\mathbf{U}_h \mathbf{s}_{t} + \mathbf{W}_h \mathbf{x}_t + \mathbf{b}_h) \\
\mathbf{y}_t &= \sigma_{y}(\mathbf{W}_y \mathbf{h}_t + \mathbf{b}_y) \\
\mathbf{s}_t &= \sigma_{s}(\mathbf{W}_{ss} \mathbf{s}_{t-1} + \mathbf{W}_{sy} \mathbf{y}_{t-1} + \mathbf{b}_s) \\
\end{align*}
$$
where the parameters are:
* _Network weights_: the term $\mathbf{U}_h\in\mathbb{R}^{h\times{s}}$ is the weight matrix for the hidden state with respect to $s$, $\mathbf{W}_h\in\mathbb{R}^{h\times{d_{in}}}$ is the weight matrix for the input, and $\mathbf{W}_y\in\mathbb{R}^{d_{out}\times{h}}$ is the weight matrix for the output. In addition, a Jordan network has parameters associated with the state $\mathbf{s}$: the $\mathbf{W}_{ss}\in\mathbb{R}^{s\times{s}}$ matrix is the weight matrix for the state with respect to the previous $s$, and $\mathbf{W}_{sy}\in\mathbb{R}^{s\times{d_{out}}}$ is the weight matrix for the state with respect to the previous $y$.
* _Network bias_: the term $\mathbf{b}_h\in\mathbb{R}^{h}$ denotes the bias vector for the hidden state, $\mathbf{b}_y\in\mathbb{R}^{d_{out}}$ is the bias vector for the output, and $\mathbf{b}_s\in\mathbb{R}^{s}$ is the bias vector for the state.
* _Activation function_: the $\sigma_{h}$ function is a _hidden layer activation function_, such as the sigmoid or hyperbolic tangent (tanh) function, which introduces non-linearity into the RNN. The activation function $\sigma_{y}$ is an _output activation function_ that can be a softmax function for classification tasks or a linear function for regression tasks, and $\sigma_{s}$ is a _state activation function_ that can be a sigmoid or tanh function.
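
The three Jordan equations can be sketched in plain Julia as follows. The state $\mathbf{s}_t$ is computed first, because the hidden-layer equation uses it. The `JordanParameters` type, the `jordan_step` function, and the numerical values are hypothetical, for illustration only; the array sizes are the ones required for each matrix product in the equations above to be well-defined, given $\mathbf{s}_t\in\mathbb{R}^{s}$.

```julia
# Hypothetical container for the Jordan parameters; field names mirror the equations above.
struct JordanParameters
    Uh::Matrix{Float64}  # h × s, state-to-hidden weights
    Wh::Matrix{Float64}  # h × d_in, input-to-hidden weights
    bh::Vector{Float64}  # h, hidden bias
    Wy::Matrix{Float64}  # d_out × h, hidden-to-output weights
    by::Vector{Float64}  # d_out, output bias
    Wss::Matrix{Float64} # s × s, state-to-state weights
    Wsy::Matrix{Float64} # s × d_out, output-to-state weights
    bs::Vector{Float64}  # s, state bias
end

# One time step: update the state from the previous state and previous output,
# then compute the hidden layer and the output.
function jordan_step(p::JordanParameters, s_prev::Vector{Float64}, y_prev::Vector{Float64},
    x::Vector{Float64}; σh = tanh, σy = identity, σs = tanh)
    s = σs.(p.Wss * s_prev .+ p.Wsy * y_prev .+ p.bs)
    h = σh.(p.Uh * s .+ p.Wh * x .+ p.bh)
    y = σy.(p.Wy * h .+ p.by)
    return s, h, y
end

# Illustrative values with d_in = 1, h = 2, s = 3, d_out = 1
p = JordanParameters(fill(0.1, 2, 3), fill(0.5, 2, 1), zeros(2), fill(1.0, 1, 2), zeros(1),
    fill(0.2, 3, 3), fill(0.3, 3, 1), zeros(3))
s1, h1, y1 = jordan_step(p, zeros(3), zeros(1), [0.1]) # start from zero state and zero output
s2, h2, y2 = jordan_step(p, s1, y1, [0.1])             # the previous output now feeds the state
```

Starting from a zero state and zero previous output, the first step behaves like a feedforward pass; from the second step on, the state carries information from the previous output back into the hidden layer.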

How many parameters are there in the Jordan network? The number of parameters in a Jordan RNN can be calculated as follows:
* _Hidden state_: The number of parameters for the hidden state is $sh + d_{in}h + h = h(s + d_{in} + 1)$
* _Output_: The number of parameters for the output is $d_{out}h + d_{out} = d_{out}(h + 1)$
* _State_: The number of parameters for the state is $s^2 + sd_{out} + s = s(s + d_{out} + 1)$
* _Total_: The total number of parameters in the Jordan RNN is $h(s + d_{in} + 1) + d_{out}(h + 1) + s(s + d_{out} + 1)$
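
The same kind of check works for the Jordan count. The two helper functions below are hypothetical, and the state dimension $s = 2$ in the example is an assumption made only for illustration.

```julia
# Parameter count from the closed-form expression above
jordan_parameter_count(d_in::Int, h::Int, s::Int, d_out::Int) =
    h * (s + d_in + 1) + d_out * (h + 1) + s * (s + d_out + 1)

# Independent check: add up the number of entries in each array
function jordan_parameter_count_by_shape(d_in::Int, h::Int, s::Int, d_out::Int)
    # U_h, W_h, b_h, W_y, b_y, W_ss, W_sy, b_s
    shapes = [(h, s), (h, d_in), (h,), (d_out, h), (d_out,), (s, s), (s, d_out), (s,)]
    return sum(prod, shapes)
end

n = jordan_parameter_count(1, 2, 2, 1) # d_in = 1, h = 2, s = 2, d_out = 1
```

For these sizes the Jordan network has 19 parameters.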

___

## Training challenges with RNNs
The training process for RNNs is similar to that of feedforward neural networks but with a few key differences. The main difference is that RNNs are trained using _backpropagation through time_ (BPTT), which _unrolls the network_ across sequential steps to compute gradients and update shared weights. 
* _What is BPTT?_ Backpropagation through time (BPTT) is a variant of the backpropagation algorithm used to train recurrent neural networks (RNNs). It involves _unrolling_ the RNN across time steps, treating the unrolled network as a feedforward network with shared weights, and then applying the standard backpropagation algorithm to compute gradients and update weights. BPTT allows RNNs to learn from data sequences by capturing temporal dependencies and adjusting weights based on the entire sequence.
* _Issues_: However, BPTT is prone to the __vanishing gradients problem__, where gradients shrink exponentially during backpropagation, hindering the learning of long-term dependencies, and the __exploding gradients problem__, where unchecked gradient growth destabilizes training. 
* __Hmmm__: Suppose we didn't use gradient descent but instead used a different optimization algorithm, such as genetic algorithms, simulated annealing, or particle swarm optimization. Would that help with the vanishing gradients problem?
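
The shrinking and growing of gradients can be seen in a scalar recurrence. For $h_t = \tanh(u\,h_{t-1} + w\,x_t)$, the chain rule gives $\partial h_T/\partial h_0 = \prod_{t=1}^{T} u\,(1 - h_t^2)$, a product of $T$ one-step factors. The sketch below accumulates that product; the function name `state_sensitivity` and all numerical values are hypothetical, chosen only for illustration.

```julia
# Sensitivity ∂h_T/∂h₀ of the final state to the initial state for the scalar
# recurrence hₜ = tanh(u*hₜ₋₁ + w*xₜ), accumulated as a product of one-step derivatives.
function state_sensitivity(u::Float64, w::Float64, xs::Vector{Float64}; h₀::Float64 = 0.0)
    h = h₀
    g = 1.0 # running product of ∂hₜ/∂hₜ₋₁ = u * (1 - hₜ^2)
    for x in xs
        h = tanh(u * h + w * x)
        g *= u * (1 - h^2)
    end
    return g
end

T = 50 # sequence length
g_small = state_sensitivity(0.5, 1.0, fill(0.1, T)) # |u| < 1: every factor has magnitude below 0.5
g_large = state_sensitivity(2.0, 1.0, zeros(T))     # u = 2, zero inputs: every factor equals 2
```

With $|u| < 1$, every factor has magnitude below 1, so the product decays geometrically with $T$ (vanishing). With $u = 2$ and zero inputs, the state stays at zero, each factor equals 2, and the product is $2^{50}$ (exploding); this second case is deliberately contrived to isolate the effect.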

For more information (and intuition) about BPTT and the vanishing and exploding gradients problem, see [Chapter 10 of Goodfellow et al.](http://www.deeplearningbook.org/).

These training challenges (and other factors) led to advanced architectures like [Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs)](https://arxiv.org/pdf/1412.3555), which use gating mechanisms to better regulate information flow and mitigate gradient issues.
* _What is gating?_ Gating mechanisms are components in neural networks, particularly in recurrent neural networks (RNNs), that control the flow of information by selectively allowing or blocking specific inputs or activations. They help manage the network's memory and learning process, enabling it to retain relevant information over time and discard irrelevant data. 
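
As a rough illustration of the idea, the sketch below implements one simple form of gate: an elementwise interpolation between the previous hidden state and a candidate state, similar in spirit to the update gate of a GRU. It is a simplified, hypothetical example (the `gated_update` function and its arguments are introduced here for illustration), not a full LSTM or GRU cell.

```julia
# Logistic sigmoid: squashes any real number into (0, 1)
sigmoid(a) = 1 / (1 + exp(-a))

# Elementwise gate: z ≈ 0 keeps the previous state, z ≈ 1 overwrites it with the candidate.
function gated_update(h_prev::Vector{Float64}, candidate::Vector{Float64},
    gate_preactivation::Vector{Float64})
    z = sigmoid.(gate_preactivation)
    return (1 .- z) .* h_prev .+ z .* candidate
end

# First entry: gate almost closed (keeps 1.0); second entry: gate almost open (takes 0.0)
h_new = gated_update([1.0, -1.0], [0.0, 0.0], [-50.0, 50.0])
```

In a trained network, the gate pre-activation would itself be computed from the current input and the previous hidden state using learned weights, so the network learns when to keep and when to overwrite its memory.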

Let's watch [a video from the IBM technology channel about LSTMs](https://www.yout-ube.com/watch?v=b61DPVFX03I).

## Lab
In Lab `L12d`, we will implement (and _hopefully_ train) a Long Short-Term Memory (LSTM) network constructed using [the `Flux.jl` package](https://github.com/FluxML/Flux.jl).

## Today?
That's a wrap! What are some of the interesting things we discussed today?