<a href="https://colab.research.google.com/github/victorviro/Deep_learning_python/blob/master/Introduction_to_RNNs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recurrent neural networks

# Table of contents

1. [Introduction](#1)
2. [Input and output sequences](#2)
3. [Recurrent neurons](#3)
    1. [Memory cells](#3.1)
    2. [Basic RNNs in TensorFlow](#3.2)
    3. [Comparison with FNN](#3.3)
4. [Training RNNs](#4)
    1. [Backpropagation through time (BPTT)](#4.1)
    2. [Truncated backpropagation through time (TBPTT)](#4.2)
        1. [Keras implementation of TBPTT](#4.2.1)
    3. [Stateless vs stateful model](#4.3)
    4. [Data preparation for BPTT or TBPTT in Keras](#4.4)
5. [References](#5)

# Introduction <a name="1"></a>

In this notebook, we will look at the fundamental concepts underlying Recurrent Neural Networks (RNN).

Up to now, we have studied **feedforward neural networks**, where the **activations flow only in one direction, from the input layer to the output layer**. Despite their power, these neural networks have **limitations**. Most notably, they rely on the assumption of **independence among the training examples** (and test examples). After each example (data point) is processed, the entire state of the network is lost. If each example is generated independently, this presents no problem. But if data points are related in time or space, this is unacceptable. Frames from video, snippets of audio, and **words pulled from sentences**, represent data where the **independence assumption fails**. Additionally, **standard networks** generally **rely on input vectors of fixed length**. Thus it is desirable to **extend** these powerful learning tools **to model data with temporal or sequential structure and varying length inputs and outputs**.

What differentiates Recurrent Neural Networks from Feedforward Neural Networks (also known as Multi-Layer Perceptrons, MLPs) is how information gets passed through the network. While Feedforward Networks pass information through the network without cycles, the **RNN has cycles and transmits information back into itself**. This enables them to extend the functionality of Feedforward Networks to also **take into account previous inputs and not only the current input**.

In [1]:
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np

# Input and output sequences <a name="2"></a>

Standard feed-forward neural networks propagate the data in one direction, from input to output. This type of network cannot handle sequential data.

![](https://i.ibb.co/fNkfK3g/vanila-neural-networks.png)

Recurrent neural networks are particularly well-suited for handling cases where we have a sequence of inputs rather than a single input.

- An RNN can simultaneously **take a sequence of inputs and produce a sequence of outputs** (see the top-left network in the figure below). This type of network, called a ***sequence-to-sequence network*** or ***many to many*** model, is useful for predicting **time series such as stock prices**: we feed it the prices over the last $N$ days, and it must output the prices shifted by one day into the future (i.e., from $N-1$ days ago to tomorrow).

- Alternatively, we could feed the network **a sequence of inputs, and ignore all outputs except for the last one** (see the top-right network). In other words, this is a ***sequence-to-vector network***, also called a ***many to one*** model. For example, we could feed the network an **input of variable size like a sequence of words** corresponding to a movie review, and the network would **output a sentiment score** (e.g., from -1 (hate) to +1 (love)). Another example is video classification: the input is a video with a variable number of frames, and the output is the kind of activity or action taking place in that video.

- Conversely, we could feed the network **a single input at the first time step** (and zeros for all other time steps), and let it **output a sequence** (see the bottom-left network). This is a ***vector-to-sequence network*** or also called ***one to many*** model. For example, the **input** could be some object **of fixed length like an image**, and the **output** could be **a sequence of variable length such as a caption for that image** (image captioning), where different captions might have a different number of words (the output needs to be variable in length).

- Lastly, we could have a **sequence-to-vector** network, called an ***encoder***, **followed by a vector-to-sequence** network, called a ***decoder*** (see the bottom-right network). For example, this can be used **for language translation**. We would feed the network a sentence in one language (English), which could have a variable length; the encoder would convert this sentence into a single vector representation, and then the decoder would decode this vector into a sentence in another language (French), which could also have a variable length. Note that the length of the English sentence might differ from that of the French sentence. This two-step model, called an ***Encoder-Decoder***, **works much better than** trying to translate on the fly with **a single sequence-to-sequence** RNN (like the one represented on the top left), since the last words of a sentence can affect the first words of the translation, so we need to wait until we have heard the whole sentence before translating it.

![texto alternativo](https://i.ibb.co/b5NpXXV/inputs-outputs-rnn.png)

Recurrent neural networks can handle variable size sequence data that allow us to capture all of these different types of setups in our models. Furthermore, they can also be useful for some problems that have a fixed size input and a fixed size output.
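As a small illustration of the first two setups, the sketch below uses the `return_sequences` flag of `keras.layers.SimpleRNN` to switch between a sequence-to-sequence and a sequence-to-vector layer. The batch size, number of time steps, and layer width are arbitrary values chosen for the example, not taken from the text above.

```python
import numpy as np
import tensorflow as tf

# Toy batch: 4 sequences, 10 time steps, 3 features per step (arbitrary sizes).
X = np.random.rand(4, 10, 3).astype(np.float32)

# Sequence-to-sequence (many to many): one output vector per time step.
seq_to_seq = tf.keras.layers.SimpleRNN(5, return_sequences=True)
# Sequence-to-vector (many to one): only the output of the last time step.
seq_to_vec = tf.keras.layers.SimpleRNN(5, return_sequences=False)

seq_outputs = seq_to_seq(X)
vec_output = seq_to_vec(X)
print(seq_outputs.shape)  # (4, 10, 5)
print(vec_output.shape)   # (4, 5)
```

The only difference between the two layers is whether the outputs of the intermediate time steps are kept or discarded.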

# Recurrent neurons <a name="3"></a>

A recurrent neural network looks like a feedforward neural network, except it also has **connections pointing backward**. Let’s look at the simplest possible RNN, composed of just **one neuron** receiving inputs, producing an output, and sending that output back to itself, as shown in the figure below (left). At each time step $t$ (also called a *frame*), this recurrent neuron receives the inputs $x_{(t)}$ as well as its own output from the previous time step, $y_{(t-1)}$. Since there is no previous output at the first time step, it is generally set to 0. We can **represent** this tiny **network against the time axis**, as shown in the figure below (right). This is called **unrolling or unfolding the network through time** (it’s the same recurrent neuron represented once per time step).

![texto alternativo](https://i.ibb.co/g7ByFbG/recurrent-neuron.png)

We can easily create a layer of recurrent neurons. At each time step $t$, every neuron receives both the input vector $\boldsymbol{x}_{(t)}$ and the output vector from the previous time step $\boldsymbol{y}_{(t-1)}$, as shown in the figure below. Note that both the inputs and outputs are vectors now (when there was just a single neuron, the output was a scalar).

![texto alternativo](https://i.ibb.co/BNC0jRK/recurrent-layer.png)

## Memory cells <a name="3.1"></a>



Since the **output of a recurrent neuron at time step $t$ is a function of all the inputs from previous time steps**, we could say it has a form of ***memory***. A part of a neural network that preserves some state across time steps is called a *memory cell* (or simply a *cell*). A single recurrent neuron, or a layer of recurrent neurons, is a very basic cell, capable of learning only short patterns (typically about 10 steps long, but this varies depending on the task). Later, we will look at some more complex and powerful types of cells capable of learning longer patterns (roughly 10 times longer, but again, this depends on the task).

In general, a **cell’s state** at time step $t$, denoted $\boldsymbol{h}_{(t)}$ (the “h” stands for “hidden”), **is a function** parametrized by $\boldsymbol{W}$ **of some inputs at that time step**, $\boldsymbol{x}_{(t)}$, **and its state at the previous time step**, $\boldsymbol{h}_{(t-1)}$. That is, $\boldsymbol{h}_{(t)}=f_{\boldsymbol{W}}(\boldsymbol{h}_{(t-1)},\boldsymbol{x}_{(t)})$. **Its output** at time step $t$, denoted $\boldsymbol{y}_{(t)}$, is **also** a function of the previous state and the current inputs. **In** the case of the **basic cells** we have discussed so far, the **output is simply equal to the state**, but in more complex cells this is not always the case, as shown in the figure below.

**Note**: The **same function and the same set of parameters are used at every time step**.

![texto alternativo](https://i.ibb.co/WgpyWBm/hidden-cells.png)

The network passes the information about its hidden state from one time step of the network to the next. We call these networks with loops in them recurrent because the information is being passed from one time step to the next internally within the network.

Let's consider the standard recurrent neural network with one hidden layer shown in the previous figure. **Each recurrent neuron has two sets of weights**: one **for the inputs** $\boldsymbol{x}_{(t)}$ **and** the other **for the hidden state of the previous time step**, $\boldsymbol{h}_{(t-1)}$. Let’s call these weight vectors $\boldsymbol{w}_{xh}$ and $\boldsymbol{w}_{hh}$. If we **consider** the whole recurrent **layer** instead of just one recurrent neuron, we can place all the **weight vectors in two weight matrices**, $\boldsymbol{W}_{xh}$ and $\boldsymbol{W}_{hh}$. The **hidden state** vector **of the recurrent layer at time step $t$ is updated** as shown in the next equation.

$$\boldsymbol{h}_{(t)} = \phi_{h}(\boldsymbol{W}_{xh}^T\boldsymbol{x}_{(t)}+\boldsymbol{W}_{hh}^T\boldsymbol{h}_{(t-1)}+\boldsymbol{b}_{h})$$

where $\boldsymbol{b}_{h}$ is the bias parameter vector and $\phi_{h}()$ is the activation function used in the hidden layer.

Just like for feedforward neural networks, we can compute a recurrent layer’s hidden state in one shot for a whole mini-batch by placing all the inputs at time step $t$ in an input matrix $\boldsymbol{X}_{(t)}$ and the hidden states at the previous time step in a hidden state matrix $\boldsymbol{H}_{(t-1)}$.

$$\boldsymbol{H}_{(t)} = \phi_{h}(\boldsymbol{X}_{(t)}\boldsymbol{W}_{xh}+\boldsymbol{H}_{(t-1)}\boldsymbol{W}_{hh}+\boldsymbol{b}_h)
$$

- $\boldsymbol{H}_{(t)}$ is an $m$ × $n$ matrix containing the layer’s hidden states at time step $t$ for each instance in the mini-batch ($m$ is the number of instances in the mini-batch and $n$ is the number of neurons or hidden units).

- $\boldsymbol{X}_{(t)}$  is an $m$ × $p$ matrix containing the inputs for all instances ($p$ is the number of input features).

- $\boldsymbol{W}_{xh}$ is a $p$ × $n$ matrix containing the connection weights for the inputs of the current time step.

- $\boldsymbol{W}_{hh}$ is an $n$ × $n$ matrix containing the connection weights for the hidden states of the previous time step.

- $\boldsymbol{b}_h$ is a vector of size $n$ containing each neuron’s bias term.
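The shapes listed above can be checked with a short NumPy sketch. The sizes ($m=4$, $p=3$, $n=5$) and the random values are illustrative assumptions, and tanh stands in for $\phi_h$. The last lines also compare the mini-batch form with the single-instance form (the one with transposed weight matrices) on the first instance.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, n = 4, 3, 5  # batch size, input features, hidden units

X_t = rng.normal(size=(m, p))      # inputs at time step t
H_prev = rng.normal(size=(m, n))   # hidden states at time step t-1
W_xh = rng.normal(size=(p, n))     # input-to-hidden weights
W_hh = rng.normal(size=(n, n))     # hidden-to-hidden weights
b_h = np.zeros(n)                  # one bias term per neuron

# Mini-batch form: one matrix product per weight matrix.
H_t = np.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)
print(H_t.shape)  # (4, 5)

# Single-instance form with transposed weights, applied to the first instance.
h_first = np.tanh(W_xh.T @ X_t[0] + W_hh.T @ H_prev[0] + b_h)
print(np.allclose(h_first, H_t[0]))
```

Both forms describe the same computation; the mini-batch form simply stacks the instances as rows.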

Notice that $\boldsymbol{H}_{(t)}$ is a function of $\boldsymbol{X}_{(t)}$ and $\boldsymbol{H}_{(t-1)}$, which is a function of $\boldsymbol{X}_{(t-1)}$ and $\boldsymbol{H}_{(t-2)}$, which is a function of $\boldsymbol{X}_{(t-2)}$ and $\boldsymbol{H}_{(t-3)}$, and so on. This makes $\boldsymbol{H}_{(t)}$ a function of all the inputs since time $t = 0$ (that is, $\boldsymbol{X}_{(0)},\boldsymbol{X}_{(1)},...,\boldsymbol{X}_{(t)}$). The RNN includes traces of all hidden states that preceded $\boldsymbol{H}_{(t-1)}$ as well as $\boldsymbol{H}_{(t-1)}$ itself. At the first time step, $t = 0$, there are no previous hidden states, so they are typically assumed to be all zeros.

**Note**: As we mentioned previously, the same function and the same set of parameters are used at every time step.
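To make the recursion concrete, here is a minimal NumPy sketch that applies the same update, with the same parameters, at every time step. The helper name `rnn_forward`, the sizes, and the random data are assumptions made for this example. It then perturbs only the input at $t=0$ and checks that the hidden state at the last time step changes as well.

```python
import numpy as np

def rnn_forward(X, W_xh, W_hh, b_h):
    """Run a basic tanh RNN over X of shape [m, T, p]; return all hidden states [m, T, n]."""
    m, T, _ = X.shape
    n = W_hh.shape[0]
    H = np.zeros((m, n))  # initial hidden state: all zeros
    states = []
    for t in range(T):
        # The same W_xh, W_hh and b_h are reused at every time step.
        H = np.tanh(X[:, t, :] @ W_xh + H @ W_hh + b_h)
        states.append(H)
    return np.stack(states, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6, 3))
W_xh = rng.normal(size=(3, 5)) * 0.5
W_hh = rng.normal(size=(5, 5)) * 0.5
b_h = np.zeros(5)

states = rnn_forward(X, W_xh, W_hh, b_h)

# Perturb only the input at t = 0 and rerun the network.
X_perturbed = X.copy()
X_perturbed[:, 0, :] += 1.0
states_perturbed = rnn_forward(X_perturbed, W_xh, W_hh, b_h)

# Largest change in the hidden state at the final time step.
last_step_change = np.abs(states[:, -1] - states_perturbed[:, -1]).max()
print(states.shape, last_step_change)
```

A nonzero change at the last step shows that the final hidden state still carries a trace of the very first input.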

Additionally, **if we require an output** $\boldsymbol{y}_{(t)}$ **at the end of each time step**, we can take the **hidden state** and **multiply** it **by another weight matrix**, denoted by $\boldsymbol{W}_{hy}$, which contains the connection weights from the hidden state to the outputs, and possibly apply an activation function $\phi_{o}()$ to obtain the desired form of the result (for example, a softmax to produce class probabilities).

$$\boldsymbol{y}_{(t)} = \phi_{o}(\boldsymbol{W}_{hy}^T\boldsymbol{h}_{(t)}+\boldsymbol{b}_o)$$

or for a whole mini-batch

$$\boldsymbol{Y}_{(t)} = \phi_{o}(\boldsymbol{H}_{(t)}\boldsymbol{W}_{hy}+\boldsymbol{b}_o)$$
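As a sketch of the mini-batch output equation, the snippet below applies a softmax as $\phi_o$. The number of outputs (called `q` here, a name not used elsewhere in this notebook), the other sizes, and the stand-in hidden states are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
m, n, q = 4, 5, 2  # batch size, hidden units, number of outputs (q is a hypothetical name)

H_t = np.tanh(rng.normal(size=(m, n)))  # stand-in hidden states at time step t
W_hy = rng.normal(size=(n, q))          # hidden-to-output weights
b_o = np.zeros(q)

Y_t = softmax(H_t @ W_hy + b_o)
print(Y_t.shape)        # (4, 2)
print(Y_t.sum(axis=1))  # each row sums to 1 (up to floating-point error)
```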

## Basic RNNs in TensorFlow <a name="3.2"></a>

Let’s implement a very simple RNN model, without using Keras operations or layers, to better understand what goes on under the hood. We will create an RNN composed of a layer of five recurrent neurons, $n=5$ (like the layer of recurrent neurons pictured earlier), using the tanh activation function. We will assume that the RNN runs over only two time steps ($t=0,1$), taking input vectors of size 3 at each time step, $p=3$. We use a batch size of $m=4$. The following code builds this RNN, unrolled through two time steps:

In [2]:
number_inputs, hidden_units = 3, 5
# Initialize the weight matrices with random values from a normal distribution
W_xh = tf.Variable(tf.random.normal(shape=[number_inputs, hidden_units], dtype=tf.float32))
W_hh = tf.Variable(tf.random.normal(shape=[hidden_units, hidden_units], dtype=tf.float32))
b = tf.Variable(tf.zeros([1, hidden_units], dtype=tf.float32))

# Each mini-batch contains four instances, each with three input features
X0_batch = tf.constant([[2,0,1], [3,4,5], [6,7,8], [3,4,6]], dtype=tf.float32) # t = 0
X1_batch = tf.constant([[5,6,9], [0,0,0], [3,4,6], [2,0,1]], dtype=tf.float32) # t = 1

# The outputs of the network at both time steps
H_0 = tf.tanh(X0_batch @ W_xh + b)
H_1 = tf.tanh(X1_batch @ W_xh + H_0 @ W_hh + b)
print(H_0)
print(H_1)

tf.Tensor(
[[-0.97083503  0.8867572  -0.9498689  -0.65231925  0.9999443 ]
 [-0.08644805  0.9336822  -1.         -0.9997256   1.        ]
 [-0.7637378   0.9982905  -1.         -1.          1.        ]
 [-0.03022995  0.9273263  -1.         -0.9986557   1.        ]], shape=(4, 5), dtype=float32)
tf.Tensor(
[[-0.6307184   0.77310055 -1.         -0.99999464  1.        ]
 [-0.88780326 -0.98216236 -0.8184843   0.50630784  0.99621964]
 [-0.67107487 -0.5753925  -1.         -0.9984835   1.        ]
 [-0.99841154 -0.7405708  -0.99497765 -0.17910978  1.        ]], shape=(4, 5), dtype=float32)


Of course, if we want to be able to run an RNN over 100 time steps, the graph is going to be pretty big. Now let’s look at how to create the same model using Keras RNN layers. In Keras, the RNN layers expect input data to have the dimensions `[batch size, time steps, features]`.

In [3]:
X_batch = np.array([[[2,0,1], [5,6,9]], [[3,4,5], [0,0,0]], 
                    [[6,7,8], [3,4,6]], [[3,4,6], [2,0,1]]]
).astype(np.float32)
print(f'Input dim: (batch size, time steps, features) : {X_batch.shape}')
simple_rnn = keras.layers.SimpleRNN(
    hidden_units, return_sequences=False, return_state=True
)
# With return_sequences=False, the first returned tensor is the last output, shape `[4, 5]`.
# final_state has shape `[4, 5]`.
last_output, final_state = simple_rnn(X_batch)
print(final_state)

Input dim: (batch size, time steps, features) : (4, 2, 3)
tf.Tensor(
[[-0.9999964   0.0960996   0.9948537  -1.         -0.995393  ]
 [ 0.88473886 -0.22579265  0.42553097 -0.3813316   0.8267129 ]
 [-0.9988111  -0.3068507   0.95532393 -1.         -0.81841385]
 [ 0.5443056   0.7043666   0.9043952  -0.9833065   0.26924437]], shape=(4, 5), dtype=float32)


The `simple_rnn` layer returns two objects when called with `return_state=True`. In this case, the first is a tensor containing the last output of the network. The second is a tensor containing the final states of the network. As we are using basic cells, the final state is simply equal to the last output. If we want the network to return the outputs of the intermediate time steps as well, we can set `return_sequences=True`. More details are available in the [TensorFlow documentation](https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN).
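The Keras outputs differ from the manual ones only because the layer draws its own random initial weights. As a sanity check, the sketch below reads the weights back from a `SimpleRNN` layer with `get_weights()` (input kernel, recurrent kernel, bias) and recomputes both time steps by hand with NumPy. The tolerance is an arbitrary choice to absorb float32 rounding.

```python
import numpy as np
import tensorflow as tf

X_batch = np.array([[[2, 0, 1], [5, 6, 9]], [[3, 4, 5], [0, 0, 0]],
                    [[6, 7, 8], [3, 4, 6]], [[3, 4, 6], [2, 0, 1]]], dtype=np.float32)

layer = tf.keras.layers.SimpleRNN(5, return_sequences=True, return_state=True)
all_outputs, final_state = layer(X_batch)  # shapes [4, 2, 5] and [4, 5]
all_outputs, final_state = all_outputs.numpy(), final_state.numpy()

# Weights in the order the layer stores them: input kernel, recurrent kernel, bias.
W_xh, W_hh, b = layer.get_weights()

# Manual unrolling over the two time steps (the initial state is all zeros).
H_0 = np.tanh(X_batch[:, 0] @ W_xh + b)
H_1 = np.tanh(X_batch[:, 1] @ W_xh + H_0 @ W_hh + b)

print(np.allclose(all_outputs[:, 0], H_0, atol=1e-4))
print(np.allclose(all_outputs[:, 1], H_1, atol=1e-4))
print(np.allclose(final_state, H_1, atol=1e-4))
```

If the three comparisons agree, the layer is performing exactly the update written in the equations above, and the final state coincides with the last output.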

## Comparison with FNN <a name="3.3"></a>

If we compare that notation for RNNs with similar notation for Feedforward Neural Networks we can clearly see the difference we described earlier. In the **next equations**, we can see the **computation** for the hidden variable and the output variable **in a feed-forward neural network with one hidden layer**.

$$
\begin{cases}
\boldsymbol{H} = \phi_{h}(\boldsymbol{X}\boldsymbol{W}_{xh}+\boldsymbol{b}_h)\\\\
\boldsymbol{Y} = \phi_{o}(\boldsymbol{H}\boldsymbol{W}_{hy}+\boldsymbol{b}_o)
\end{cases}
$$

![](https://i.ibb.co/wYdtkZW/FFNN-vs-RNN.png)
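One way to see the relationship in code: the recurrent update is the feedforward hidden layer plus one extra term, $\boldsymbol{H}_{(t-1)}\boldsymbol{W}_{hh}$. With an all-zero previous state that term vanishes and both computations coincide. The sizes and random values below are illustrative assumptions, and `rnn_step` is a helper defined only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, n = 4, 3, 5  # batch size, input features, hidden units
X = rng.normal(size=(m, p))
W_xh = rng.normal(size=(p, n))
W_hh = rng.normal(size=(n, n))
b_h = rng.normal(size=n)

# Feedforward hidden layer: depends on the current input only.
H_fnn = np.tanh(X @ W_xh + b_h)

# Recurrent layer: adds a term that depends on the previous hidden state.
def rnn_step(X_t, H_prev):
    return np.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)

H_rnn_zero_state = rnn_step(X, np.zeros((m, n)))
H_rnn_with_state = rnn_step(X, H_rnn_zero_state)

same_with_zero_state = np.allclose(H_fnn, H_rnn_zero_state)
same_with_state = np.allclose(H_fnn, H_rnn_with_state)
print(same_with_zero_state)  # True: a zero state removes the recurrent term
print(same_with_state)       # the nonzero previous state changes the result
```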

# Training RNNs <a name="4"></a>

Regular **feedforward neural networks are trained using the backpropagation algorithm to update the weights in order to minimize the error between the expected output and the predicted output** for a given input. First, an input is propagated through the network to compute the output. This is called the forward pass. The output is then compared to a ground truth label using a differentiable loss function. In the backward pass, the gradients of the loss with respect to all the parameters (weights) in the network are computed by application of the chain rule (see notebook [Introduction to neural networks for more details](https://nbviewer.jupyter.org/github/victorviro/Deep_learning_python/blob/master/Introduction_artificial_neural_networks.ipynb)). Finally, all parameters are updated using a gradient-based optimization procedure such as gradient descent (see notebook [Gradient Descent](https://nbviewer.jupyter.org/github/victorviro/ML_algorithms_python/blob/master/Introduction_gradient_descent_algorithm.ipynb)).

### Backpropagation through time (BPTT) <a name="4.1"></a>

As we saw, **in RNNs** a new input is applied at every time step, and the **output at a certain time step depends on all previous inputs**. This means that the **loss at time step $N$ needs to be backpropagated all the way back to the inputs applied at time step 0**. To train these networks, the trick is to **unroll them through time** (like we did previously) and then simply **use regular backpropagation** (see the first figure below). This strategy is called ***backpropagation through time (BPTT)***.

Just like in regular backpropagation, there is a **first forward pass through the unrolled network** (represented by the dashed arrows in the figure). Then the **output sequence is evaluated using a cost function** $C(\boldsymbol{Y}_{(0)}, \boldsymbol{Y}_{(1)},...,\boldsymbol{Y}_{(T)})$ (where $T$ is the max time step). Typically, this cost is the sum of the loss terms computed at the individual time steps. The loss term can have different definitions based on the specific problem (e.g. Mean Squared Error, Cross-Entropy Loss, etc.). The **cost function may ignore some outputs**, as shown in the figure (for example, in a sequence-to-vector RNN, all outputs are ignored except for the very last one). The **gradients of that cost function are then propagated backward through the unrolled network** (represented by the solid arrows), hence the name backpropagation through time. Finally, the model parameters are updated using the gradients computed during BPTT. The **gradients flow backward through all the outputs used by the cost function**, not just through the final output (for example, in the figure the cost function is computed using the last three outputs of the network, $\boldsymbol{Y}_{(2)}, \boldsymbol{Y}_{(3)}$ and $\boldsymbol{Y}_{(4)}$, so gradients flow through these three outputs, but not through $\boldsymbol{Y}_{(0)}$ and $\boldsymbol{Y}_{(1)}$).

![texto alternativo](https://i.ibb.co/kKM7VpZ/BPTT.png)

The next figure shows the fact that the same weight matrices are shared across time steps.

![](https://i.ibb.co/gwJmKqb/BPTT-2.png)

Fortunately, `tf.keras` takes care of all of this complexity for us.
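Still, it can be instructive to reproduce the procedure by hand once. The sketch below unrolls a basic tanh RNN over five time steps inside a `tf.GradientTape`, builds a cost from the last three outputs only (mirroring the example in the text), and lets automatic differentiation propagate the gradients back through the unrolled steps. The sizes, the random targets, and the mean squared error loss are assumptions made for this illustration.

```python
import tensorflow as tf

tf.random.set_seed(0)
m, T, p, n = 4, 5, 3, 5  # batch size, time steps, features, hidden units

X = tf.random.normal([m, T, p])
targets = tf.random.normal([m, T, n])  # made-up targets, one per time step
W_xh = tf.Variable(tf.random.normal([p, n], stddev=0.5))
W_hh = tf.Variable(tf.random.normal([n, n], stddev=0.5))
b_h = tf.Variable(tf.zeros([n]))

with tf.GradientTape() as tape:
    H = tf.zeros([m, n])
    step_losses = []
    for t in range(T):  # forward pass through the unrolled network
        H = tf.tanh(X[:, t, :] @ W_xh + H @ W_hh + b_h)
        step_losses.append(tf.reduce_mean((H - targets[:, t, :]) ** 2))
    # The cost uses only the last three outputs; the first two are ignored.
    cost = tf.add_n(step_losses[2:])

# Backward pass: the shared weights receive contributions from every time step
# that the cost depends on, directly or through the hidden state.
grads = tape.gradient(cost, [W_xh, W_hh, b_h])
for name, g in zip(["W_xh", "W_hh", "b_h"], grads):
    print(name, g.shape)
```

Because the same variables are reused at every step, each gradient has the shape of its weight matrix, not one copy per time step.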

### Truncated backpropagation through time <a name="4.2"></a>

**BPTT can sometimes be inefficient**. Imagine we want to train a **language model**, which tries to predict the next word in a sentence, and we use very **long sequences** to train the model. In this case, BPTT requires processing the full sequence both forward and backward. This requires maintaining the full unfolded network, or equivalently storing the full history of inputs and activations. This is impractical when very long sequences are processed with large networks: processing the whole sequence at every gradient step **slows down learning**. In addition to speed, the **accumulation of gradients over so many timesteps** can cause the gradients to vanish or explode (shrink toward zero or overflow).

Practically, this is **alleviated by limiting gradient flows after a fixed number of timesteps**, or equivalently, splitting the input sequence into subsequences of fixed length, and only backpropagating through those subsequences. This algorithm is referred to as ***truncated backpropagation through time (TBPTT)***.

TBPTT processes the sequence one timestep at a time, and **every $k_1$ timesteps, it runs BPTT for $k_2$ timesteps**, so a parameter update can be cheap if $k_2$ is small. That is, the TBPTT training algorithm has two parameters:

- $k_1$: The **number of forward-pass timesteps between updates**. Generally, this influences how slow or fast training will be, given how often weight updates are performed.
- $k_2$: Defines the **number of timesteps to look at when estimating the gradient on the backward pass**. Generally, it should be large enough to capture the temporal patterns. Too large values result in vanishing gradients.

As such, we can use the notation $\text{TBPTT}(k_1, k_2)$ when considering how to configure the training algorithm. 

- For $k_1 = k_2 = T$, where $T$ is the length of the original input sequence, it is the classical non-truncated BPTT. Updates are performed at the end of the sequence across all timesteps in the sequence.

- $k_1 = k_2 < T$, is a common configuration where a fixed number of timesteps are used for both forward and backward-pass timesteps (e.g. tens to hundreds of timesteps).

The choice of TBPTT parameters influences how the network estimates the error gradient used to update the weights. To avoid skipping some data points during training, $k_1$ should preferably be less than or equal to $k_2$.
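A minimal sketch of the common $\text{TBPTT}(k, k)$ configuration is shown below, written as a hand-rolled loop rather than with Keras. The sequence is processed in chunks of `k` timesteps; after each chunk the weights are updated with a plain gradient descent step, and the hidden state is carried over as a value only. The sizes, targets, loss, and learning rate are all assumptions chosen for illustration.

```python
import tensorflow as tf

tf.random.set_seed(0)
m, T, p, n, k = 2, 12, 3, 4, 4  # batch, sequence length, features, units, chunk length
learning_rate = 0.01

X = tf.random.normal([m, T, p])
targets = tf.random.normal([m, T, n])  # made-up targets, one per time step
W_xh = tf.Variable(tf.random.normal([p, n], stddev=0.5))
W_hh = tf.Variable(tf.random.normal([n, n], stddev=0.5))
b_h = tf.Variable(tf.zeros([n]))
variables = [W_xh, W_hh, b_h]

H = tf.zeros([m, n])
num_updates = 0
for start in range(0, T, k):
    with tf.GradientTape() as tape:
        loss = 0.0
        for t in range(start, start + k):  # forward pass over one chunk
            H = tf.tanh(X[:, t, :] @ W_xh + H @ W_hh + b_h)
            loss += tf.reduce_mean((H - targets[:, t, :]) ** 2)
    # Backward pass covers only the k steps recorded on this tape.
    grads = tape.gradient(loss, variables)
    for var, grad in zip(variables, grads):
        var.assign_sub(learning_rate * grad)  # plain gradient descent update
    num_updates += 1
    # Carry the state's value forward. Each chunk uses a fresh tape, so gradients
    # cannot reach earlier chunks anyway; stop_gradient makes the truncation explicit.
    H = tf.stop_gradient(H)

print(num_updates)  # 3 updates for T = 12 and k = 4
```

Note that the forward state is never reset inside the loop, so information can still travel across chunk boundaries even though gradients cannot.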

Modern RNNs (like LSTMs, which we will see in a following notebook) can use their internal state to remember over very long input sequences (over thousands of timesteps). This means that the configuration of TBPTT does not necessarily define the memory of the network that we are optimizing with the choice of the number of time steps. We can choose when the internal state of the network is reset separately from the strategy used to update network weights.

#### Keras Implementation of TBPTT <a name="4.2.1"></a>

Keras provides an implementation of TBPTT for training recurrent neural networks. The **implementation** is more **restricted** than the general version. 

Specifically, the **$k_1$ and $k_2$ values are equal to each other and fixed**. This means that when we train the model, we will step forward for some number of steps (like 100), compute the loss only over this subsequence of the data, backpropagate through this subsequence and then make a gradient step. This scheme is repeated for each mini-batch of data during the training process.

In Keras, this is realized by the fixed-sized three-dimensional input required to train recurrent neural networks. The RNN expects input data to have the dimensions `[samples, timesteps, features]`. It is the **second dimension of this input format** that defines the **number of timesteps used for forward and backward passes** on our sequence prediction problem.

Therefore, when preparing our input data for sequence prediction problems in Keras, the choice of timesteps will influence both the internal state accumulated during the forward pass and the gradient estimate used to update weights on the backward pass. We will see how we can prepare the data later.

### Stateless and stateful modes <a name="4.3"></a>

In the implementation of RNNs **in Keras, by default, the internal state of the network is reset after each batch** (usually to all zeros). This is known as the ***stateless mode***. This means that, at each batch, the model starts with a hidden state full of zeros, then it updates this state at each time step, and after the last time step, it throws it away, as it is not needed anymore. In the ***stateful mode***, **the model preserves this final state after processing one training batch and uses it as the initial state for the next training batch**. This mode gives us more explicit **control over when the internal state is reset**, since we call the reset operation manually. This way the model can **learn long-term patterns despite only backpropagating through short sequences**.

Note that a **stateful RNN only makes sense if each input sequence in a batch starts exactly where the corresponding sequence in the previous batch left off**. So **batching** is usually much **harder when preparing a dataset for a stateful RNN** than it is for a stateless RNN. Moreover, we must obviously **not shuffle the sequences**.
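One possible way to satisfy this alignment requirement is sketched below with NumPy: a long sequence is cut into `batch_size` contiguous streams, and batch number $j$ takes the $j$-th window of every stream, so each row continues where the same row of the previous batch stopped. The helper name `stateful_batches` and the toy sizes are assumptions made for this example.

```python
import numpy as np

def stateful_batches(sequence, batch_size, timesteps):
    """Arrange a 1-D sequence as [n_batches, batch_size, timesteps] for stateful training."""
    n_batches = len(sequence) // (batch_size * timesteps)
    usable = sequence[: n_batches * batch_size * timesteps]  # drop the leftover tail
    # One contiguous stream per batch row, each cut into consecutive windows.
    streams = usable.reshape(batch_size, n_batches, timesteps)
    # Batch j holds window j of every stream.
    return streams.transpose(1, 0, 2)

batches = stateful_batches(np.arange(24), batch_size=2, timesteps=3)
print(batches.shape)  # (4, 2, 3)
print(batches[0])     # [[ 0  1  2] [12 13 14]]
print(batches[1])     # [[ 3  4  5] [15 16 17]]
```

Row 0 of the second batch starts at 3, right after row 0 of the first batch ended at 2, and likewise for row 1. The batches must then be fed in this order, without shuffling.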

Most of the problems can be solved with the stateless model so **we must choose the stateful mode only when we really need it**. For example, suppose we have a big sequence (e.g. all text of Wikipedia) and we split it into smaller subsequences to construct our dataset. Then, the model may find dependencies between the subsequences only if we choose the stateful model. In a later notebook, we will prepare a dataset and train a language model where we will use the stateful mode.
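The difference between the two modes can be observed on a tiny example. The sketch below builds a `SimpleRNN` layer with `stateful=True` and a fixed batch size, feeds it the same batch twice, and then resets the state with the layer's `reset_states()` method. The sizes and the random input are arbitrary, and the exact way of fixing the batch size may differ between Keras versions.

```python
import numpy as np
import tensorflow as tf

batch_size, timesteps, features = 4, 2, 3

# A stateful layer needs to know the batch size in advance.
inputs = tf.keras.Input(shape=(timesteps, features), batch_size=batch_size)
rnn = tf.keras.layers.SimpleRNN(5, stateful=True)
model = tf.keras.Model(inputs, rnn(inputs))

x = np.random.rand(batch_size, timesteps, features).astype(np.float32)

first = model(x).numpy()   # starts from an all-zero state
second = model(x).numpy()  # starts from the state left by the first call
rnn.reset_states()         # manual reset back to zeros
third = model(x).numpy()   # starts from zeros again

print(np.allclose(first, second))  # the carried-over state changes the output
print(np.allclose(first, third))   # after the reset, the first result is reproduced
```

With a stateless layer (the default), all three calls would start from zeros and return the same values.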

### Prepare sequence data for BPTT or TBPTT in Keras <a name="4.4"></a>

The way that we break up our sequence data will define the number of time steps used in the forward and backward passes of BPTT. As such, we must put careful thought into how we will prepare our training data.

This section lists some techniques we may consider.

- **Use data as-is**: We may **use our input sequences as-is if the number of timesteps in each sequence is modest**, such as **tens or a few hundred timesteps** (practical limits have been suggested for TBPTT of about 200-to-400 timesteps). If our sequence data is less than or equal to this range, we may reshape the sequence observations as timesteps for the input data and use classical BPTT. For example, if we had a collection of 100 univariate sequences of 25 timesteps, this could be reshaped as 100 samples, 25 timesteps, and 1 feature or `[100, 25, 1]`.

- **Naive Data Split**: If we have **long input sequences**, such as thousands of timesteps, we may need to **break** the long input sequences **into multiple contiguous subsequences**. This will require the use of a **stateful model in Keras so that the internal state is preserved across the input of the sub-sequences** and only reset at the end of a true fuller input sequence. A split that divides the full sequence into fixed-sized subsequences is preferred. The choice of the subsequence length is arbitrary, hence the name “naive data split”. For example, if we had 100 input sequences of 50,000 timesteps, then each input sequence could be divided into 100 subsequences of 500 timesteps. One input sequence would become 100 samples, therefore the 100 original samples would become 10,000. The dimensionality of the input for Keras would be 10,000 samples, 500 timesteps, and 1 feature or `[10000, 500, 1]`. If we use a stateful model, care would be needed to preserve the state across every 100 subsequences and reset the internal state after every 100 samples either explicitly or by using a batch size of 100.

- **Domain-Specific Data Split**: It can be hard to know the correct number of timesteps required to provide a useful estimate of the error gradient. We can use the naive approach (above) to get a model quickly, but the model may be far from optimized. Alternatively, we can **use domain-specific information to estimate the number of timesteps that will be relevant** to the model while learning the problem. For example, if the sequence problem is a regression time series, perhaps a review of the autocorrelation and partial autocorrelation plots can inform the choice of the number of timesteps. If the sequence problem is an NLP problem, perhaps the input sequence can be divided by sentence and then padded to a fixed length or split according to the average sentence length in the domain. The key idea is to consider knowledge specific to our domain that we can use to split up the sequence into meaningful chunks.

- **Systematic Data Split (grid search)**: Rather than guessing at an efficient number of timesteps, we can systematically **evaluate different subsequence lengths** for our sequence prediction problem. We could perform a **grid search** over each sub-sequence length **and adopt** the configuration that results in **the best performing model** on average. There are some considerations for this approach. We can start with subsequence lengths that are a factor of the full sequence length, or use padding and perhaps masking if exploring subsequence lengths that are not a factor of the full sequence length. We may take the average performance over multiple runs (e.g. 30) of each different configuration. If computation resources are not a limitation, then a systematic investigation of different numbers of timesteps is recommended.
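The two numeric examples from the list above ("use data as-is" and "naive data split") can be reproduced with NumPy reshapes. The arrays below are placeholders filled with zeros or consecutive numbers, used only to make the resulting shapes and the ordering of the samples visible.

```python
import numpy as np

# "Use data as-is": 100 univariate sequences of 25 timesteps.
short_sequences = np.zeros((100, 25), dtype=np.float32)
X_short = short_sequences.reshape(100, 25, 1)
print(X_short.shape)  # (100, 25, 1)

# "Naive data split": 100 sequences of 50,000 timesteps, cut into subsequences of 500.
long_sequences = np.arange(100 * 50_000, dtype=np.float32).reshape(100, 50_000)
X_long = long_sequences.reshape(-1, 500, 1)
print(X_long.shape)   # (10000, 500, 1)

# Samples 0..99 come from the first sequence, in order; sample 100 starts the second one.
print(X_long[1, 0, 0], X_long[100, 0, 0])  # 500.0 50000.0
```

Because the reshape keeps the subsequences of each original sequence contiguous and in order, the boundaries where a stateful model should reset its state fall after every 100 samples.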

# References <a name="5"></a>

- [A Critical Review of Recurrent Neural Networks for Sequence Learning](https://arxiv.org/abs/1506.00019)

- [Recurrent Neural Networks (RNNs): A gentle Introduction and Overview](https://arxiv.org/abs/1912.05911)



- [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)

- [Handson-ml2 Github](https://github.com/ageron/handson-ml2)

- [RNNs Stanford lecture](https://www.youtube.com/watch?v=6niqTuYFZLQ)

- [RNNs MIT lecture](https://www.youtube.com/watch?v=SEnXr6v2ifU)

- [Unbiasing Truncated Backpropagation Through Time](https://arxiv.org/abs/1705.08209)

- [Stateful and Stateless LSTM for Time Series Forecasting with Python](https://machinelearningmastery.com/stateful-stateless-lstm-time-series-forecasting-python/)

- [Understanding Stateful LSTM Recurrent Neural Networks in Python with Keras](https://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/)

- [Stateful LSTM in Keras](http://philipperemy.github.io/keras-stateful-lstm/)

- [Introduction to Backpropagation Through Time](https://machinelearningmastery.com/gentle-introduction-backpropagation-time/)

- [How to Prepare Sequence Prediction for Truncated BPTT in Keras](https://machinelearningmastery.com/truncated-backpropagation-through-time-in-keras/)

- [Training recurrent neural networks](https://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf)


