# Recurrent neural networks

In the previous module, we used rich semantic representations of text with a simple linear classifier on top of the embeddings. This architecture captures the aggregated meaning of the words in a sentence, but it does not take into account the **order** of words, because the aggregation operation on top of the embeddings removes this information from the original text. Because these models cannot represent word order, they cannot solve more complex or ambiguous tasks such as text generation or question answering.

To capture the meaning of a text sequence, we need another neural network architecture, called a **recurrent neural network**, or RNN. In an RNN, we pass our sentence through the network one token at a time, and the network produces some **state**, which we then pass to the network again together with the next token.

![Image showing an example recurrent neural network generation.](./images/sample-rnn-model-generation.png)

Given the input sequence of tokens $X_0,\dots,X_n$, the RNN creates a sequence of neural network blocks, one per token. Each network block takes a pair $(X_i,S_i)$ as input and produces $S_{i+1}$ as a result. The final state, or the output of the last block, goes into a linear classifier to produce the result. All network blocks share the same weights, and the whole sequence is trained end-to-end in one backpropagation pass.

Because the state vectors $S_0,\dots,S_n$ are passed through the network, it is able to learn sequential dependencies between words. For example, when the word *not* appears somewhere in the sequence, the network can learn to negate certain elements within the state vector, which results in negation of the meaning.
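The recurrence described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the `tanh` nonlinearity, the weight names `W_x`, `W_s`, `b` and all sizes are assumptions chosen for the example, not something fixed by the text.

```python
import numpy as np

def rnn_forward(xs, W_x, W_s, b, s0=None):
    """Run a simple RNN over a sequence and return every state.

    xs  : array of shape (T, input_dim), one row per token vector
    W_x : (input_dim, state_dim) input weights, shared by all steps
    W_s : (state_dim, state_dim) state weights, shared by all steps
    b   : (state_dim,) bias
    """
    state = np.zeros(W_s.shape[0]) if s0 is None else s0
    states = [state]
    for x in xs:
        # Each block takes the pair (X_i, S_i) and produces S_{i+1}
        state = np.tanh(x @ W_x + state @ W_s + b)
        states.append(state)
    return np.stack(states)

rng = np.random.default_rng(0)
T, input_dim, state_dim = 5, 8, 4          # illustrative sizes
xs = rng.normal(size=(T, input_dim))
W_x = rng.normal(scale=0.1, size=(input_dim, state_dim))
W_s = rng.normal(scale=0.1, size=(state_dim, state_dim))
b = np.zeros(state_dim)

states = rnn_forward(xs, W_x, W_s, b)
final_state = states[-1]   # what a classifier head would consume
print(states.shape)        # initial state plus one state per token
```

Because the same weights are reused at every step and each state depends on the previous one, feeding the tokens in a different order generally produces a different final state, which is exactly the order sensitivity that the bag-of-words models lacked.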

Let's see how recurrent neural networks can help us classify our news dataset.

In [1]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import numpy as np

# Request the splits explicitly instead of relying on dictionary ordering
ds_train, ds_test = tfds.load('ag_news_subset', split=['train', 'test'])

## Simple RNN classifier

In a simple RNN, each recurrent unit is a single fully-connected layer that takes the concatenated input vector and state vector and produces a new state vector. In Keras, this is represented by the `SimpleRNN` layer, which uses `tanh` activation by default.

While we can pass one-hot encoded tokens to the RNN layer directly, this is not a good idea because of their high dimensionality. Therefore, we will use an embedding layer to lower the dimensionality of the word vectors, and then put an RNN layer on top of it, followed by a `Dense` classifier.

> **Note**: In some cases, for example when using character-level tokenization, it might make sense to pass one-hot encoded tokens directly into the RNN cell.
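To see why an embedding layer is a cheaper replacement for one-hot input, note that multiplying a one-hot vector by a weight matrix simply selects one row of that matrix. The small sketch below checks this with NumPy; the tiny vocabulary size, the embedding size and the token ids are hypothetical values picked for illustration.

```python
import numpy as np

vocab_size, embed_dim = 10, 3                   # tiny illustrative sizes
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))    # embedding matrix

token_ids = np.array([2, 7, 7, 0])              # hypothetical token ids

# One-hot route: a (4, vocab_size) matrix that is almost entirely zeros
one_hot = np.eye(vocab_size)[token_ids]
via_matmul = one_hot @ E

# Embedding route: a plain row lookup, no large sparse input needed
via_lookup = E[token_ids]

print(np.allclose(via_matmul, via_lookup))
```

With a vocabulary of 20000 tokens, the one-hot route would feed 20000-dimensional vectors into the recurrent layer at every step, while the lookup yields the same result with 128-dimensional vectors.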

In [2]:
vocab_size = 20000

model = keras.models.Sequential([
    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,input_shape=(1,)),
    keras.layers.Embedding(vocab_size, 128),
    keras.layers.SimpleRNN(64),
    keras.layers.Dense(4,activation='softmax')
])
vectorizer = model.layers[0]

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, None)              0         
_________________________________________________________________
embedding (Embedding)        (None, None, 128)         2560000   
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 64)                12352     
_________________________________________________________________
dense (Dense)                (None, 4)                 260       
Total params: 2,572,612
Trainable params: 2,572,612
Non-trainable params: 0
_________________________________________________________________


> **Note:** We use an untrained embedding layer here for simplicity, but for better results we could use a pre-trained embedding layer with Word2Vec or GloVe embeddings, as described in the previous unit. As an exercise, you might want to adapt this code to work with pre-trained embeddings.
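The parameter counts printed by `model.summary()` above can be reproduced by hand, which is a useful check that the layer is doing what we described. The helper functions below are hypothetical names introduced for this sketch; the formulas assume a bias term in the recurrent and dense layers.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def simple_rnn_params(input_dim, units):
    # Input weights + recurrent (state) weights + bias
    return input_dim * units + units * units + units

def dense_params(input_dim, units):
    # Weights + bias
    return input_dim * units + units

counts = {
    "embedding": embedding_params(20000, 128),
    "simple_rnn": simple_rnn_params(128, 64),
    "dense": dense_params(64, 4),
}
total = sum(counts.values())
print(counts, total)
```

Note that the recurrent layer is tiny compared with the embedding: almost all of the model's parameters sit in the embedding matrix.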

Now let's train our RNN. RNNs are in general quite difficult to train, because once the RNN cells are unrolled along the sequence length, the number of layers involved in backpropagation becomes quite large. Thus we need to select a smaller learning rate and train the network on a larger dataset to produce good results. Training can take quite a long time, so using a GPU is preferred.

To speed things up, we will train the RNN model only on news titles, omitting the description. You can try training with the description as well and see whether you can get the model to train.

In [3]:
def extract_title(x):
    return x['title']

def tupelize_title(x):
    return (extract_title(x),x['label'])

print('Training vectorizer')
model.layers[0].adapt(ds_train.map(extract_title))

Training vectorizer


In [4]:
model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize_title).batch(128),validation_data=ds_test.map(tupelize_title).batch(128),epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7fc0bc49c6a0>

> **Task**: Does this training show overfitting? If so, try to do something to minimize overfitting.

## Revisiting Variable Sequences 

Remember that the `TextVectorization` layer automatically pads sequences of variable length in a minibatch with pad tokens. During RNN training, those tokens also take part in training, and they can complicate convergence of the model.

There are several approaches to minimizing the amount of padding. One of them is to reorder the dataset by sequence length, or to group sequences of similar length together. This can be done using the `tf.data.experimental.bucket_by_sequence_length` function (see [documentation](https://www.tensorflow.org/api_docs/python/tf/data/experimental/bucket_by_sequence_length)).
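The effect of grouping by length can be illustrated without TensorFlow. The sketch below is plain Python and does not use the TensorFlow function; `padding_needed` is a hypothetical helper and the sequence lengths are made-up numbers. It counts how many pad tokens are required when each batch is padded to its longest sequence.

```python
def padding_needed(lengths, batch_size):
    """Count pad tokens when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

# Hypothetical sequence lengths (in tokens), in their original order
lengths = [3, 40, 5, 38, 4, 41, 6, 39]

unsorted_pad = padding_needed(lengths, batch_size=2)
sorted_pad = padding_needed(sorted(lengths), batch_size=2)
print(unsorted_pad, sorted_pad)
```

Fully sorting the dataset removes the randomness of minibatches, so in practice bucketing is a compromise: sequences are grouped into a few length ranges and batches are drawn within each range.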

Another approach is to use **masking**. In Keras, some layers support an additional input that indicates which tokens should be taken into account during training. To incorporate masking into our model, we can either include a separate `Masking` layer ([docs](https://keras.io/api/layers/core_layers/masking/)), or specify the `mask_zero=True` parameter for our `Embedding` layer:

In [5]:
vocab_size = 20000

def extract_text(x):
    return x['title']+' '+x['description']

def tupelize(x):
    return (extract_text(x),x['label'])

model = keras.models.Sequential([
    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,input_shape=(1,)),
    keras.layers.Embedding(vocab_size, 128,mask_zero=True),
    keras.layers.SimpleRNN(64),
    keras.layers.Dense(4,activation='softmax')
])

print('Training vectorizer')
model.layers[0].adapt(ds_train.map(extract_text))

model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(128),validation_data=ds_test.map(tupelize).batch(128))

Training vectorizer


<tensorflow.python.keras.callbacks.History at 0x7fc0bc3c68b0>

With masking, we can now train the model on the whole dataset of titles and descriptions, something that would be hard to do without one of the techniques for dealing with padding.
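To make the idea of masking concrete, here is a minimal NumPy sketch of a recurrent loop that skips masked steps and carries the previous state forward unchanged. It illustrates the principle under simplifying assumptions (a `tanh` cell without bias, hypothetical weight names and sizes) and is not the Keras implementation.

```python
import numpy as np

def rnn_final_state(xs, mask, W_x, W_s):
    """Simple RNN that ignores the steps where mask is False."""
    state = np.zeros(W_s.shape[0])
    for x, keep in zip(xs, mask):
        if keep:
            state = np.tanh(x @ W_x + state @ W_s)
        # Otherwise this is a padded step: keep the previous state as is
    return state

rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.5, size=(3, 4))
W_s = rng.normal(scale=0.5, size=(4, 4))

real = rng.normal(size=(2, 3))          # two real token vectors
pad_vec = rng.normal(size=3)            # the vector the pad token maps to
padded = np.vstack([real, np.tile(pad_vec, (3, 1))])   # two real + three pad steps

reference = rnn_final_state(real, [True, True], W_x, W_s)
masked = rnn_final_state(padded, [True, True, False, False, False], W_x, W_s)
unmasked = rnn_final_state(padded, [True] * 5, W_x, W_s)

print(np.allclose(masked, reference), np.allclose(unmasked, reference))
```

With the mask, the padded sequence yields the same final state as the unpadded one; without it, the three pad steps keep modifying the state after the real text has ended.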

## LSTM: Long Short-Term Memory

One of the main problems of classical RNNs is the so-called **vanishing gradients** problem. Because RNNs are trained end-to-end in one backpropagation pass, the error has a hard time propagating back to the first layers of the unrolled network, and thus the network cannot learn relationships between distant tokens. One way to avoid this problem is to introduce **explicit state management** by using so-called **gates**. The two best-known architectures of this kind are **Long Short-Term Memory** (LSTM) and **Gated Recurrent Unit** (GRU).
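The vanishing gradient effect can be observed numerically. The sketch below accumulates the Jacobian of the state after $t$ steps with respect to the initial state for a `tanh` RNN and tracks its norm. The setup is an assumption made for illustration: random recurrent weights rescaled so that their largest singular value is 0.9, and random vectors standing in for the input contribution.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, steps = 16, 50

# Recurrent weights rescaled so that their largest singular value is 0.9
W = rng.normal(size=(state_dim, state_dim))
W *= 0.9 / np.linalg.norm(W, 2)

state = np.zeros(state_dim)
jacobian = np.eye(state_dim)      # d S_t / d S_0, accumulated step by step
norms = []
for _ in range(steps):
    x = rng.normal(size=state_dim)                # stand-in for the input term
    state = np.tanh(W @ state + x)
    step_jac = (1.0 - state ** 2)[:, None] * W    # d S_{t+1} / d S_t
    jacobian = step_jac @ jacobian
    norms.append(np.linalg.norm(jacobian, 2))

print(norms[0], norms[-1])
```

Each step multiplies the gradient by a matrix whose norm is at most 0.9 here, so after 50 steps the influence of the first state on the last one is bounded by $0.9^{50}$, which is about 0.005. This is why errors from the end of a long sequence barely reach the early tokens.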

![Image showing an example long short term memory cell](./images/long-short-term-memory-cell.svg)

An LSTM network is organized in a manner similar to an RNN, but there are two states that are passed from layer to layer: the actual state $c$, and the hidden vector $h$. At each unit, the hidden vector $h_i$ is concatenated with the input $x_i$, and together they control what happens to the state $c$ via **gates**. Each gate is a neural network with sigmoid activation (output in the range $[0,1]$), which can be thought of as a soft element-wise mask when multiplied by the state vector. There are the following gates (from left to right on the picture above):
* **forget gate** takes the hidden vector and the input, and determines which components of the vector $c$ we need to forget, and which to pass through.
* **input gate** takes some information from the input and hidden vector, and inserts it into the state.
* **output gate** passes the updated state through a $\tanh$ activation, then selects some of its components using a mask computed from the hidden vector $h_i$ and the input, producing the new hidden vector $h_{i+1}$.
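The gates listed above can be sketched as a single LSTM step in NumPy. This follows the standard LSTM formulation; the packing of all four weight blocks into one matrix `W`, the order of the blocks, and the sizes are arbitrary choices made for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step.

    x : (input_dim,) current input       h : (units,) hidden vector
    c : (units,) cell state              W : (input_dim + units, 4 * units)
    b : (4 * units,) bias
    """
    units = h.shape[0]
    z = np.concatenate([x, h]) @ W + b     # all four blocks in one matmul
    f = sigmoid(z[0 * units:1 * units])    # forget gate: what to keep from c
    i = sigmoid(z[1 * units:2 * units])    # input gate: how much new info to write
    g = np.tanh(z[2 * units:3 * units])    # candidate values to write into c
    o = sigmoid(z[3 * units:4 * units])    # output gate: what to expose as h
    c_next = f * c + i * g                 # updated state
    h_next = o * np.tanh(c_next)           # updated hidden vector
    return h_next, c_next

rng = np.random.default_rng(0)
input_dim, units = 3, 2                    # illustrative sizes
W = rng.normal(scale=0.5, size=(input_dim + units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
for x in rng.normal(size=(5, input_dim)):  # a sequence of five input vectors
    h, c = lstm_step(x, h, c, W, b)
print(h, c)
```

Because the state is updated additively (`f * c + i * g`) rather than being pushed through a weight matrix and nonlinearity at every step, a forget gate close to 1 lets information, and gradients, survive across many steps.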

Components of the state $c$ can be thought of as flags that can be switched on and off. For example, when we encounter the name *Alice* in the sequence, we may guess that it refers to a female character, and raise the flag in the state that says we have a female noun in the sentence. When we later encounter the words *and Tom*, we raise the flag that says we have a plural noun. Thus, by manipulating the state, the network can in principle keep track of the grammatical properties of sentence parts.

> **Note**: A great resource for understanding the internals of LSTM is the article [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) by Christopher Olah.

While the internal structure of an LSTM cell may look complex, Keras hides this implementation inside the `LSTM` layer, so the only thing we need to do in the example above is to replace the recurrent layer:

In [6]:
vocab_size = 20000

model = keras.models.Sequential([
    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,input_shape=(1,)),
    keras.layers.Embedding(vocab_size, 128, mask_zero=True),
    keras.layers.LSTM(32),
    keras.layers.Dense(4,activation='softmax')
])

print('Training vectorizer')
model.layers[0].adapt(ds_train.map(extract_text))

model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(64),validation_data=ds_test.map(tupelize).batch(64))

Training vectorizer


<tensorflow.python.keras.callbacks.History at 0x7fbfdb60a0a0>

> **Note**: Training an LSTM is also quite slow, and you may not see much of a rise in accuracy at the beginning of training. You may need to continue training for some time to achieve good accuracy.

## Bidirectional and multilayer RNNs

In our examples, all recurrent networks operated in one direction, from the beginning of a sequence to the end. This looks natural, because it resembles the way we read and listen to speech. However, since in many practical cases we have random access to the input sequence, it might make sense to run the recurrent computation in both directions. Such networks are called **bidirectional** RNNs, and they can be created by wrapping a recurrent layer with the special `Bidirectional` layer.

> **Note**: The `Bidirectional` layer makes two copies of the layer inside it, and sets the `go_backwards` property of one of those copies to `True`, making it go in the opposite direction along the sequence axis.
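The following NumPy sketch mimics that idea by hand: the same kind of recurrent computation is run once over the sequence and once over the reversed sequence, each with its own weights, and the two final states are concatenated. The `tanh` cell, the weight names and the sizes are assumptions made for illustration.

```python
import numpy as np

def rnn_final_state(xs, W_x, W_s):
    """Final state of a simple tanh RNN run over xs."""
    state = np.zeros(W_s.shape[0])
    for x in xs:
        state = np.tanh(x @ W_x + state @ W_s)
    return state

rng = np.random.default_rng(0)
T, input_dim, units = 6, 5, 4              # illustrative sizes
xs = rng.normal(size=(T, input_dim))

# Two independent sets of weights, one per direction
fwd_W_x = rng.normal(scale=0.3, size=(input_dim, units))
fwd_W_s = rng.normal(scale=0.3, size=(units, units))
bwd_W_x = rng.normal(scale=0.3, size=(input_dim, units))
bwd_W_s = rng.normal(scale=0.3, size=(units, units))

h_fwd = rnn_final_state(xs, fwd_W_x, fwd_W_s)          # reads left to right
h_bwd = rnn_final_state(xs[::-1], bwd_W_x, bwd_W_s)    # reads right to left
combined = np.concatenate([h_fwd, h_bwd])

print(combined.shape)
```

The combined vector has twice as many components as a single direction, which is why a bidirectional layer with 64 units feeds 128 features into the next layer when the two directions are concatenated.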

A recurrent network, one-directional or bidirectional, captures certain patterns within a sequence, and can store them in the state vector or pass them to the output. As with convolutional networks, we can build another recurrent layer on top of the first one to capture higher-level patterns, built from the low-level patterns extracted by the first layer. This leads us to the notion of a **multi-layer RNN**, which consists of two or more recurrent networks, where the output of the previous layer is passed to the next layer as input.

![Image showing a Multilayer long-short-term-memory- RNN](./images/multi-layer-lstm.jpg)

*Picture from [this wonderful post](https://towardsdatascience.com/from-a-lstm-cell-to-a-multilayer-lstm-network-with-pytorch-2899eb5696f3) by Fernando López*

Keras makes constructing such networks easy, because you just need to stack recurrent layers on top of each other. For all layers except the last one, we need to specify the `return_sequences=True` parameter, because we need the layer to return all intermediate states, and not just the final state of the recurrent computation.
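The role of `return_sequences=True` can be seen in a small NumPy sketch of two stacked recurrent layers. The `rnn_layer` function is a hypothetical stand-in for a recurrent layer (a `tanh` cell without bias), with its own `return_sequences` argument that mirrors the Keras parameter in spirit only.

```python
import numpy as np

def rnn_layer(xs, W_x, W_s, return_sequences=False):
    """Simple tanh RNN; returns all states or only the final one."""
    state = np.zeros(W_s.shape[0])
    states = []
    for x in xs:
        state = np.tanh(x @ W_x + state @ W_s)
        states.append(state)
    return np.stack(states) if return_sequences else state

rng = np.random.default_rng(0)
T, input_dim, units1, units2 = 7, 5, 4, 3   # illustrative sizes
xs = rng.normal(size=(T, input_dim))

W1_x = rng.normal(scale=0.3, size=(input_dim, units1))
W1_s = rng.normal(scale=0.3, size=(units1, units1))
W2_x = rng.normal(scale=0.3, size=(units1, units2))
W2_s = rng.normal(scale=0.3, size=(units2, units2))

# The first layer must return one state per time step,
# so that the second layer has a sequence to iterate over
seq = rnn_layer(xs, W1_x, W1_s, return_sequences=True)
# The last layer returns only its final state, ready for the classifier
out = rnn_layer(seq, W2_x, W2_s)

print(seq.shape, out.shape)
```

If the first layer returned only its final state, the second layer would receive a single vector instead of a sequence and there would be nothing left to recur over.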

Let's build a two-layer bidirectional LSTM for our classification problem:

In [7]:
vocab_size = 50000

model = keras.models.Sequential([
    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,input_shape=(1,)),
    keras.layers.Embedding(vocab_size, 128, mask_zero=True),
    keras.layers.Bidirectional(keras.layers.LSTM(64,return_sequences=True)),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),    
    keras.layers.Dense(4,activation='softmax')
])

print('Training vectorizer')
model.layers[0].adapt(ds_train.map(extract_text))

model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(64),validation_data=ds_test.map(tupelize).batch(64))

Training vectorizer


<tensorflow.python.keras.callbacks.History at 0x7fbff8094a60>

## RNNs for other tasks

In this unit, we have seen that RNNs can be used for sequence classification, but in fact, they can handle many more tasks, such as text generation, machine translation, and more. We will consider those tasks in the next unit.