# Recurrent Neural Networks

In our [previous MLP experiment](./03-mlp.ipynb), we saw that a neural network could extract more from averaged word embeddings than a simple Logistic Regression. However, averaging throws away crucial information - the order of words in a sentence.

This time, we are going to dive deeper into **recurrent neural networks**. They are specifically designed to process sequences, remembering context and understanding how word order contributes to meaning. Let's see if harnessing this sequential power can push our accuracy even further.

## Data Preparation

In [1]:
from datasets import load_dataset
ds = load_dataset('stanfordnlp/imdb', split='train+test')
train, test = ds.train_test_split(test_size=0.2, seed=0).values()
display(train.to_pandas())

x_train = train['text']
y_train = train['label']
x_test = test['text']
y_test = test['label']



| | text | label |
|---|---|---|
| 0 | After the SuperFriends and Scooby Doo left the... | 1 |
| 1 | good job.that's how i would describe this anim... | 1 |
| 2 | Michael Cacoyannis has had a relatively long c... | 1 |
| 3 | I've just seen this film in a lovely air-condi... | 0 |
| 4 | My one-line summary hints that this is not a g... | 1 |
| ... | ... | ... |
| 39995 | \*\*\*SPOILERS\*\*\* \*\*\*SPOILERS\*\*\* After two so-so ... | 1 |
| 39996 | Way back in 1967, a certain director had no id... | 0 |
| 39997 | I saw this movie with my dad. I must have been... | 1 |
| 39998 | During my teens or should I say prime time I w... | 1 |


## Data Encoding

In our previous notebook, we simply tokenized the text and then immediately squashed it into averaged word vectors. Our new recurrent pipeline needs to be a bit more complicated.

This time, we will only **tokenize** our sentences first, without trying to vectorize them. We also limit the vocabulary size: filtering out rare or noisy terms leaves a more robust representation built from frequently occurring words.

In [2]:
from keras.preprocessing.text import Tokenizer

max_vocab = 45000
tokenizer = Tokenizer(num_words=max_vocab, oov_token='<oov>')

tokenizer.fit_on_texts(x_train)
display(tokenizer.texts_to_sequences(['Hello World']))

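
To make the index mapping concrete, here is a minimal pure-Python sketch of frequency-ranked tokenization with an out-of-vocabulary token. It is a simplified stand-in for the idea, not the Keras implementation: `build_word_index`, `encode`, and the toy corpus are hypothetical names made up for illustration. Index 0 is left unused so it can later serve as the padding value.

```python
from collections import Counter

def build_word_index(texts, oov_token='<oov>'):
    """Map words to integer ids: 1 is reserved for OOV, then most frequent words first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(text, word_index, max_vocab, oov_token='<oov>'):
    """Replace each word by its id; unknown or too-rare words become the OOV id."""
    oov_id = word_index[oov_token]
    ids = []
    for word in text.lower().split():
        index = word_index.get(word, oov_id)
        # Only ids below max_vocab are kept, everything else collapses to OOV.
        ids.append(index if index < max_vocab else oov_id)
    return ids

toy_corpus = ['a great movie', 'a boring movie', 'a great cast']
toy_index = build_word_index(toy_corpus)
encoded = encode('a great unseen plot', toy_index, max_vocab=4)
print(encoded)
```

Lowering `max_vocab` pushes more of the rare words into the single `<oov>` bucket, which is the trade-off behind the vocabulary limit above.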

Next, we need to **pad** our sequences. Most neural networks require input sequences to be of a uniform length. Since our sentences naturally vary, we need to bring them to the same length. 

This involves either truncating longer sequences or adding special tokens (usually zeros) to shorter sequences until they all reach a predetermined maximum length. Let's run a quick analysis to see how long our sequences usually are.

In [None]:
import matplotlib.pyplot as plt
text_lengths = train.to_pandas()['text'].apply(lambda x: len(x.split()))
display(text_lengths.describe(percentiles=[0.25, 0.5, 0.75, 0.95, 0.99]))

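Taking the 75th percentile (the upper-quartile boundary) as the maximum length seems a reasonable choice here: most reviews fit entirely, and only the longest quarter gets truncated.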

In [None]:
from keras.utils import pad_sequences

max_seq = 280
x_train = pad_sequences(tokenizer.texts_to_sequences(x_train), maxlen=max_seq)
x_test = pad_sequences(tokenizer.texts_to_sequences(x_test), maxlen=max_seq)

display(x_train)
display(x_train.shape)
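
What padding and truncation do to individual sequences can be sketched in plain NumPy. The helper below pads and truncates at the front, which, to our understanding, matches the default `'pre'` behaviour of `pad_sequences`; `pad_or_truncate` and the toy sequences are hypothetical and only illustrate the idea.

```python
import numpy as np

def pad_or_truncate(sequences, maxlen, pad_value=0):
    """Left-pad short sequences and keep only the last `maxlen` tokens of long ones."""
    padded = np.full((len(sequences), maxlen), pad_value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]              # truncate from the front
        padded[row, -len(kept):] = kept   # right-align, so padding sits at the front
    return padded

toy_sequences = [[5, 8], [3, 9, 4, 7, 2, 6]]
toy_padded = pad_or_truncate(toy_sequences, maxlen=4)
print(toy_padded)
```

With front padding, the real tokens end up closest to the final time step of the recurrent layer, so they are the last thing it reads.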

## Embedding Layer

Finally, we need to transform these padded sequences into a meaningful representation that captures the semantic relationships between words. That's where the **embedding layer** comes in. To build it, we can reuse the pre-trained word2vec model.

In [None]:
from os import path
from huggingface_hub import snapshot_download
from gensim.models import KeyedVectors

model_path = path.join(snapshot_download('fse/word2vec-google-news-300'), 'word2vec-google-news-300.model')
wv = KeyedVectors.load(model_path)

Think of the embedding layer as a sophisticated, trainable lookup table: for each token in our sequence, it looks up the corresponding dense vector. This lets our neural network work with the *semantic* meaning of words rather than with arbitrary indices.
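
The lookup itself amounts to row indexing into a matrix. Here is a minimal NumPy sketch with a made-up 6-token vocabulary and 4-dimensional vectors (all `toy_` names and values are illustrative assumptions, not the real word2vec data):

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_dim = 6, 4
toy_embeddings = rng.normal(size=(toy_vocab_size, toy_dim))  # one row per token id
toy_embeddings[0] = 0.0                                       # row 0 belongs to the padding token

token_ids = np.array([[0, 0, 3, 5],
                      [2, 3, 1, 4]])      # shape: (batch, sequence length)

# Integer-array indexing replaces every id with its row of the matrix.
looked_up = toy_embeddings[token_ids]     # shape: (batch, sequence length, toy_dim)
print(looked_up.shape)
```

Making the layer trainable simply means that the rows of this matrix are updated by backpropagation like any other weights.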

In [None]:
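import numpy as np

# Row i holds the word2vec vector of the token with index i.
# Rows stay all-zero for the padding index 0 and for words missing from word2vec.
embedding_matrix_shape = (max_vocab, wv.vector_size)
embedding_matrix = np.zeros(shape=embedding_matrix_shape)
for word, index in tokenizer.word_index.items():
    # Indices at or above max_vocab are never emitted by the tokenizer, so skip them.
    if index < max_vocab and word in wv:
        embedding_matrix[index] = wv.get_vector(word)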

## Label Encoding

In [None]:
from keras.utils import to_categorical
y_train_encoded = to_categorical(y_train, num_classes=len(ds.features['label'].names))
y_test_encoded = to_categorical(y_test, num_classes=len(ds.features['label'].names))
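
One-hot encoding turns each integer label into a vector with a single 1 at the label's position, which is the target format that categorical cross-entropy expects. A minimal NumPy sketch of the same idea on made-up labels (the `toy_` names are illustrative):

```python
import numpy as np

toy_labels = np.array([1, 0, 1])
toy_num_classes = 2

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(toy_num_classes)[toy_labels]
print(one_hot)
```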

## Building and Training the Model

Here comes the most interesting part. With our text now represented as sequences of dense semantic vectors (thanks to the embedding layer), we can introduce the star of this experiment - the **Long Short-Term Memory (LSTM)** layer.

Unlike a simple dense layer that processes all its inputs at once, an LSTM processes each vector in our sequence one at a time. Internally, each LSTM unit contains a sophisticated set of **gates** – an input gate, a forget gate, and an output gate – along with a memory cell.

These gates learn to control the flow of information: what to remember from previous steps, what to discard, and what new information from the current word's vector is important enough to update the memory. This allows the LSTM to capture *context* and *dependencies* across the entire sequence.
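
The gate mechanics can be sketched as a single time step in NumPy. This follows the standard textbook LSTM formulation; the weight layout, the sizes, and the random weights are assumptions chosen for illustration and do not mirror the internals of any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (input_dim, 4 * units), U: (units, 4 * units), b: (4 * units,)
    """
    units = h_prev.shape[-1]
    z = x_t @ W + h_prev @ U + b              # pre-activations of all gates at once
    i = sigmoid(z[0 * units:1 * units])       # input gate: how much new information to write
    f = sigmoid(z[1 * units:2 * units])       # forget gate: how much old memory to keep
    g = np.tanh(z[2 * units:3 * units])       # candidate values for the memory cell
    o = sigmoid(z[3 * units:4 * units])       # output gate: how much memory to expose
    c_t = f * c_prev + i * g                  # updated memory cell
    h_t = o * np.tanh(c_t)                    # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units, steps = 3, 2, 5
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

sequence = rng.normal(size=(steps, input_dim))   # stand-in for five word vectors
h, c = np.zeros(units), np.zeros(units)
for x_t in sequence:                             # one vector at a time, carrying state forward
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

The hidden state `h` and the memory cell `c` are the only things passed from one step to the next, so everything the layer knows about earlier words has to live in them.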

We can also make the recurrent layer **bidirectional** by wrapping it in the `Bidirectional` layer wrapper. The network then reads the sequence in both directions, so the representation at each time step draws on both past and future context, leading to a more comprehensive understanding of the sequence.
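
The bidirectional idea can be sketched with a plain tanh recurrence standing in for the LSTM: one pass reads the sequence left to right, a second pass with its own weights reads it right to left, and the per-step states are concatenated (which we assume here as the merge strategy). `run_rnn` and all sizes are hypothetical.

```python
import numpy as np

def run_rnn(inputs, W, U):
    """Run a plain tanh recurrence over `inputs` (steps, input_dim); return every hidden state."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(x_t @ W + h @ U)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
steps, input_dim, units = 5, 3, 2
inputs = rng.normal(size=(steps, input_dim))
W_fwd, U_fwd = rng.normal(size=(input_dim, units)), rng.normal(size=(units, units))
W_bwd, U_bwd = rng.normal(size=(input_dim, units)), rng.normal(size=(units, units))

forward = run_rnn(inputs, W_fwd, U_fwd)
# Feed the reversed sequence, then flip the states back so step t lines up with step t.
backward = run_rnn(inputs[::-1], W_bwd, U_bwd)[::-1]

bidirectional = np.concatenate([forward, backward], axis=-1)
print(bidirectional.shape)  # (5, 4): twice the units per time step
```

Note that the output width doubles, so a bidirectional layer with 256 units produces 512 features per time step.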

In [None]:
from keras.utils import set_random_seed
from keras import layers, Sequential

num_classes = len(ds.features['label'].names)
set_random_seed(0)

model = Sequential([
    layers.Input(shape=(max_seq,)),
    layers.Embedding(
        weights=[embedding_matrix],
        input_dim=embedding_matrix_shape[0],
        output_dim=embedding_matrix_shape[1],
        trainable=True,
    ),
    layers.SpatialDropout1D(0.35),
    layers.Bidirectional(layers.LSTM(256, dropout=0.3, return_sequences=True)),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.6),
    layers.Dense(num_classes, activation='softmax'),
])

model.summary()  # prints the summary itself and returns None, so no display() is needed

Before we start the training, we might tweak a few more things. Changing the optimizer learning rate may be a good idea - the default one might be too high when fine-tuning pre-trained embeddings.

In [None]:
from keras.optimizers import AdamW
optimizer = AdamW(learning_rate=0.0001, weight_decay=0.0005)

We can also introduce a **learning rate scheduler** called `ReduceLROnPlateau`. It automatically reduces the optimizer's learning rate when a monitored metric (like validation loss) stops improving for a specified number of epochs (the patience limit). Smaller steps help the model settle into a minimum instead of overshooting it, fine-tuning its weights more precisely.

In [None]:
from keras.callbacks import ReduceLROnPlateau
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2)
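
The plateau logic can be sketched in a few lines of plain Python. This is a simplified model of the idea (no cooldown and no lower bound on the learning rate), not the Keras implementation; `plateau_schedule`, `min_delta`, and the validation losses are made up for illustration.

```python
def plateau_schedule(val_losses, initial_lr, factor=0.5, patience=2, min_delta=1e-4):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float('inf')
    wait = 0                      # epochs since the last real improvement
    history = []
    for loss in val_losses:
        if loss < best - min_delta:
            best = loss           # meaningful improvement: reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:  # plateau detected: shrink the step size
                lr *= factor
                wait = 0
        history.append(lr)
    return history

toy_val_losses = [0.50, 0.40, 0.41, 0.42, 0.39, 0.39, 0.39]
toy_lrs = plateau_schedule(toy_val_losses, initial_lr=1e-4)
print(toy_lrs)
```

In this toy run the rate is halved twice: once after two epochs without improvement over 0.40, and again after the loss stalls at 0.39.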

Now, let's compile and train our final model. This time we might start using the GPU - our model is finally complex enough to leverage the parallelism it offers to speed up the training process.

In [None]:
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, y_train_encoded, epochs=20, batch_size=64, callbacks=[reduce_lr], validation_split=0.2) 

## Result

In [None]:
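from sklearn.metrics import classification_report

# Convert predicted probabilities and one-hot targets back to integer class labels.
y_pred_probs = model.predict(x_test, verbose=False)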
y_pred_labels = np.argmax(y_pred_probs, axis=1)
y_true_labels = np.argmax(y_test_encoded, axis=1)
print(classification_report(y_true_labels, y_pred_labels, target_names=ds.features['label'].names))

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(4.5, 3))
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(loc='lower right')
plt.show()

## Conclusion

We achieved a final accuracy of **91%**, ultimately outperforming the linear model. This was made possible by combining LSTM units with pre-trained word embeddings, enabling the model to leverage sequential information.

This demonstrates the value of sequence-aware models over simple embedding averaging for this type of task, and the result is competitive with the heavily optimized count vectorizer approach. Further improvements would likely require exploring more advanced architectures like transformers.