# Modelling sequences

To go from a sequence of word embeddings to a document-level vector that can be classified, we have to summarize a sequence of vectors into a single vector. So far, we have been doing the following:

1. Get one embedding per word (a $t \times n$ matrix, where $t$ is the sequence length and $n$ is the embedding dimension),
1. Calculate the timewise mean of the word embeddings (a $1 \times n$ vector),
1. Proceed to classification with our Residual MLP modules.

The pipeline looks like this:

<pl-figure file-name="classifier.png" directory="clientFilesQuestion"></pl-figure>


The problem with this idea is that the mean totally disregards the order of the words. Essentially, we are doing a glorified bag-of-words model, which seems non-ideal. Instead, we could find some other way to summarize our sequence of words so that the summary accounts for word order.
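The order-blindness of the mean can be checked directly. The sketch below uses random vectors as a stand-in for real word embeddings (the sizes `t` and `n` are arbitrary assumptions) and compares the summary of a sequence with the summary of a shuffled copy:

```python
import torch

torch.manual_seed(0)

t, n = 5, 8                      # sequence length and embedding dimension (arbitrary)
embeddings = torch.randn(t, n)   # stand-in for one document's word embeddings

# Timewise mean: collapse the time axis, leaving one n-dimensional vector.
summary = embeddings.mean(dim=0)

# Shuffle the word order and summarize again.
shuffled = embeddings[torch.randperm(t)]
summary_shuffled = shuffled.mean(dim=0)

# The two summaries agree up to floating-point rounding.
same = torch.allclose(summary, summary_shuffled, atol=1e-6)
print(summary.shape, same)
```

Any permutation of the words gives the same vector, so a classifier built on top of it cannot tell "dog bites man" from "man bites dog".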

## Recurrent Neural Networks

Recurrent Neural Networks (RNNs) emerged gradually, through many small contributions between the 1950s and the 1970s. The underlying idea is to begin with a simple MLP network that processes each time step of the input separately:

<pl-figure file-name="mlp.png" directory="clientFilesQuestion"></pl-figure>

Then, we store the output at each time step and use it as part of the calculation of the output at the next step. To do so, we concatenate the current input ($x_t$) with the previous output ($y_{t-1}$) and feed the result to an MLP network as usual:

<pl-figure file-name="rnnmlp.png" directory="clientFilesQuestion"></pl-figure>

The output $y$ is often called the "hidden state" and written as $h$. The name reflects the fact that this output is usually part of a larger classifier, and is hidden because it is an intermediate result of the network (similar to the intermediate results in a deep MLP).
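The recurrence described above can be sketched by hand. This is an illustration only, not how PyTorch implements it: `TinyRNN` is a hypothetical name, the "MLP" is reduced to a single linear layer with a $\tanh$ activation, and the initial output $y_0$ is assumed to be all zeros.

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    """Hypothetical hand-rolled recurrent layer, for illustration only."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # One linear layer acting on the concatenation [x_t, y_{t-1}].
        self.linear = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x):
        # x has shape (batch, time, input_size).
        batch, time, _ = x.shape
        y = x.new_zeros(batch, self.hidden_size)  # y_0, assumed to be zeros
        outputs = []
        for t in range(time):
            # Concatenate the current input with the previous output.
            y = torch.tanh(self.linear(torch.cat([x[:, t], y], dim=-1)))
            outputs.append(y)
        # Stack the per-step outputs: (batch, time, hidden_size).
        return torch.stack(outputs, dim=1)

layer = TinyRNN(input_size=8, hidden_size=16)
ys = layer(torch.randn(4, 10, 8))
print(ys.shape)
```

The same weights are reused at every time step; only the stored output `y` changes as the loop advances.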

The [PyTorch implementation of RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) is simply a layer like:

    rnnlayer = nn.RNN(input_size, hidden_size)

Typically, RNNs are used to summarize sequences by propagating the last output; that is, our summary of a sequence of $T$ elements is simply $y_{T}$. You can also stack several RNNs by passing the whole sequence of outputs from one layer to the next. Check the documentation for some interesting options!
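As a minimal sketch of this summarization, the snippet below runs `nn.RNN` on a random batch and takes the last output as the summary. The sizes are arbitrary assumptions, and `batch_first=True` is one possible choice that makes the input shape `(batch, time, features)`:

```python
import torch
import torch.nn as nn

batch, T, input_size, hidden_size = 4, 10, 8, 16   # arbitrary sizes
x = torch.randn(batch, T, input_size)

rnn = nn.RNN(input_size, hidden_size, batch_first=True)
outputs, h_n = rnn(x)
# outputs holds y_t for every time step: (batch, T, hidden_size)
# h_n holds the final state of each layer: (num_layers, batch, hidden_size)

summary = h_n[-1]                                   # y_T of the last layer
same = torch.allclose(summary, outputs[:, -1])      # same as the last time step
print(outputs.shape, h_n.shape, same)
```

Note that if your sequences are padded to a common length, the last time step may correspond to padding rather than to the last real word, so this deserves some care in practice.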

Now, change the summarization process from the mean to an RNN. Train and test your network again!



RNNs are easy to implement. However, there is a problem. In RNNs, the gradient typically propagates back through many non-linearities (one per time step), which leads to the vanishing gradient problem.
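The effect can be seen in a toy setting. The sketch below assumes a scalar recurrence $h_t = \tanh(w\,h_{t-1})$ with an arbitrary weight $w = 0.5$ and no input, which is far simpler than a real RNN. Each step multiplies the gradient by $w\,(1 - h_t^2) \le 0.5$, so the gradient of $h_T$ with respect to $h_0$ shrinks at least geometrically with $T$:

```python
import torch

w = 0.5                                        # recurrent weight (arbitrary, |w| < 1)
h0 = torch.tensor(0.8, requires_grad=True)     # initial state (arbitrary)

def grad_after(steps):
    """Return |d h_steps / d h_0| for the toy recurrence h_t = tanh(w * h_{t-1})."""
    h = h0
    for _ in range(steps):
        h = torch.tanh(w * h)
    (g,) = torch.autograd.grad(h, h0)
    return g.abs().item()

g5, g50 = grad_after(5), grad_after(50)
print(g5, g50)   # the second value is many orders of magnitude smaller
```

In a real RNN the scalar `w` becomes a weight matrix, but the same repeated multiplication is at work.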

Another problem is that, although the RNN can theoretically hold information from the past, in practice it only holds information from the recent past.

A common solution to these problems is the Long Short-Term Memory (LSTM) layer.

# Long Short-Term Memory (LSTM)

LSTMs are commonly classified as a type of RNN, even though the term RNN usually refers to the vanilla version of recurrent neural networks. Like vanilla RNNs, LSTMs save the current state to help compute future outputs; that is, they use recurrence as well.

LSTMs add several controllers (gates) to the vanilla RNN. You might have seen an LSTM diagram in the past, but I found them a bit confusing, so I made my own. Let's go through it very slowly.

The diagram is just below. The first thing to note is the orange loop at the bottom, which looks exactly like an RNN. The second thing to note is the black loop at the top, which looks like a residual connection:

<pl-figure file-name="lstm.png" directory="clientFilesQuestion"></pl-figure>

Note that there are two "states" that are being stored: the hidden state (bottom, orange), which corresponds to the output of the RNN, and the cell state (top, black).

The top loop (in black) is a so-called "constant error carousel". Note that the cell state $c$ doesn't go through any non-linearities as it moves through time, which is the same idea used in ResNet to avoid vanishing gradients. As a result, $c$ can encode information from the long-term past, whereas $y$ (or $h$, depending on your implementation), like the output of an RNN, encodes information from the short-term past. Hence, we have two outputs: $c$ for the long-term encoding and $y$ for the short-term encoding.

In our figure, there are also three colored gates.

The first one, in blue, is the Forget Gate. The forget gate takes the short-term information and runs it through an MLP network with a sigmoid activation (essentially, a logistic regression). This operation yields numbers between $0$ and $1$, which then multiply the current cell state element-wise. This allows "erasing" the cell state, which metaphorically means "forgetting" the past.

The second one, in green, is the Input Gate. Note that it calculates some transformation of the short-term input with an MLP with $\tanh(.)$ activation, and modulates it with another logistic regression. The result is then added to the cell state. In effect, this operation adds short-term information to what should be "remembered" in the future.

The last one, in red, is the Output Gate. This is the only gate in which the long-term memory impacts the short-term memory. The operation performed here applies an element-wise $\tanh(.)$ to the current cell state, and the result multiplies (element-wise) the output of the RNN before yielding the output $y$ at the current time step.
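The three gates can be written out as a single time step. The sketch below follows the standard LSTM formulation used by PyTorch's `nn.LSTMCell`, which may differ in small details from the figure. The helper `lstm_step` is a hypothetical name; it borrows the weights of an `nn.LSTMCell` so that the hand-written step can be compared with the library's result.

```python
import torch
import torch.nn as nn

def lstm_step(x_t, h_prev, c_prev, cell):
    """One LSTM time step written out by hand, using the weights of `cell`."""
    # Linear part of all four MLPs at once, driven by x_t and h_{t-1}.
    pre = (x_t @ cell.weight_ih.T + cell.bias_ih
           + h_prev @ cell.weight_hh.T + cell.bias_hh)
    # PyTorch stores the four blocks in the order: input, forget, candidate, output.
    i, f, g, o = pre.chunk(4, dim=-1)

    f = torch.sigmoid(f)            # forget gate: how much of c_{t-1} to keep
    i = torch.sigmoid(i)            # input gate: how much new information to add
    g = torch.tanh(g)               # candidate values computed from the short-term input
    o = torch.sigmoid(o)            # output gate: modulates what leaves the cell

    c_t = f * c_prev + i * g        # cell state: only a product and a sum over time
    h_t = o * torch.tanh(c_t)       # hidden state draws on the cell state
    return h_t, c_t

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=8, hidden_size=16)
x_t, h_prev, c_prev = torch.randn(4, 8), torch.randn(4, 16), torch.randn(4, 16)

h_manual, c_manual = lstm_step(x_t, h_prev, c_prev, cell)
h_ref, c_ref = cell(x_t, (h_prev, c_prev))
print(torch.allclose(h_manual, h_ref, atol=1e-6),
      torch.allclose(c_manual, c_ref, atol=1e-6))
```

Notice that `c_t` is obtained from `c_prev` without passing through any activation function, which is the "constant error carousel" in code form.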

# PyTorch implementation

The [PyTorch implementation](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) of LSTMs is as easy as:

    lstmlayer = nn.LSTM(input_size, hidden_size)

Check their documentation before using!

When LSTMs are used for sequence summarization, there is some debate about which vector to use. If we follow the same convention as for RNNs, we should summarize sequences using $y_T$, that is, the last short-term output. This indirectly carries information from $c$, because the output gate draws from $c$.

However, the long-term information is, in fact, in $c$. So we could just as well use $c_T$, at least as part of the summary vector. This increases the number of inputs to our classifier, but if some of them are irrelevant, the classifier can learn to ignore them.
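Both candidates are returned by `nn.LSTM`, so the options are easy to compare. A minimal sketch, assuming arbitrary sizes, `batch_first=True`, and a hypothetical two-class linear classifier standing in for the Residual MLP:

```python
import torch
import torch.nn as nn

batch, T, input_size, hidden_size = 4, 10, 8, 16   # arbitrary sizes
x = torch.randn(batch, T, input_size)

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
outputs, (h_n, c_n) = lstm(x)
# outputs: every h_t, shape (batch, T, hidden_size)
# h_n, c_n: final hidden and cell states, shape (num_layers, batch, hidden_size)

summary_h = h_n[-1]                                   # short-term summary
summary_c = c_n[-1]                                   # long-term summary
summary_hc = torch.cat([summary_h, summary_c], dim=-1)  # both, concatenated

# The classifier input doubles in size when both states are used.
classifier = nn.Linear(2 * hidden_size, 2)            # 2 classes is an assumption
logits = classifier(summary_hc)
print(summary_h.shape, summary_hc.shape, logits.shape)
```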

Also, note that the LSTM is unable to look into the future. When we read a text, word meanings impact each other in both directions and according to context (for example: the iron sword is made of iron, but the face towel is not made of faces!). For this reason, another technique is to reverse the sequence and run an auxiliary LSTM in parallel, yielding a Bidirectional LSTM (you may find it referred to as B-LSTM, BDLSTM or BLSTM).
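In PyTorch, one way to obtain this is the `bidirectional=True` argument of `nn.LSTM`. The sketch below (arbitrary sizes, `batch_first=True` assumed) shows how the shapes change and where the final state of each direction ends up:

```python
import torch
import torch.nn as nn

batch, T, input_size, hidden_size = 4, 10, 8, 16   # arbitrary sizes
x = torch.randn(batch, T, input_size)

bilstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
outputs, (h_n, c_n) = bilstm(x)
# outputs: (batch, T, 2 * hidden_size), forward and backward halves side by side
# h_n, c_n: (2, batch, hidden_size); index 0 is forward, index 1 is backward

# Concatenate the final states of both directions into one summary vector.
summary = torch.cat([h_n[0], h_n[1]], dim=-1)       # (batch, 2 * hidden_size)

# The forward direction finishes at the last step, the backward one at the first.
fwd_ok = torch.allclose(h_n[0], outputs[:, -1, :hidden_size])
bwd_ok = torch.allclose(h_n[1], outputs[:, 0, hidden_size:])
print(summary.shape, fwd_ok, bwd_ok)
```

Because the backward direction reads the sequence from the end, its final state sits at time step $0$ of `outputs`, not at the last step.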

Let's put these ideas to the test! Start with the base LSTM, and summarize your sequence using the $y$ output (in PyTorch, this is the `h` output). Then, try using $c$. Then, use both of them (you will need to concatenate them). Then, use a bidirectional variant of the LSTM (check the documentation to find out how to do this!). Finally, let's try one much simpler thing: get all $h_t$ and $c_t$ from each sequence, and summarize the sequence using the timewise average of $[c_t, h_t]$. Do you see any difference in performance?
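A hint for the last experiment: `nn.LSTM` returns every $h_t$ but only the final $c_T$, so collecting all $c_t$ needs an explicit loop. One possible approach, sketched below with arbitrary sizes and zero initial states as assumptions, is to step through the sequence with `nn.LSTMCell`:

```python
import torch
import torch.nn as nn

batch, T, input_size, hidden_size = 4, 10, 8, 16   # arbitrary sizes
x = torch.randn(batch, T, input_size)

cell = nn.LSTMCell(input_size, hidden_size)
h = x.new_zeros(batch, hidden_size)                 # initial hidden state (assumed zeros)
c = x.new_zeros(batch, hidden_size)                 # initial cell state (assumed zeros)

states = []
for t in range(T):
    h, c = cell(x[:, t], (h, c))                    # one time step
    states.append(torch.cat([c, h], dim=-1))        # keep [c_t, h_t]

# Timewise average of [c_t, h_t]: (batch, 2 * hidden_size)
summary = torch.stack(states, dim=1).mean(dim=1)
print(summary.shape)
```

The explicit loop is usually slower than `nn.LSTM`, so it is worth using only when the per-step cell states are actually needed.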