# Natural Language Processing with Recurrent Neural Networks

## The structure of the lecture:
- Why RNNs & NLP matter
- RNNs: An introduction
- RNNs Under the Hood (and architectural variations)
- NLP: Why we need RNNs
- Classifying sentiment: A coded example
- Further Reading

## 1️⃣ Why RNNs & NLP matter

📈 Recurrent Neural Networks are Neural Networks specifically designed to deal with **sequences** as input data, i.e. **observations repeated throughout time**.

### Example 1: Prediction of future stock market values and trends

![ex_1](pics/example_1.png)

### Example 2: Video prediction

🎥 Videos = sequences of images/frames

👉 Why not predict the next image(s)?

![ex_2](pics/example_2.png)

### Example 3: Predicting the next word - Natural Language Processing

Recurrent Networks are widely used for text!

![ex_3](pics/example_3.png)


### NLP: Text classification, such as sentiment analysis
Classification depending on a word, a sentence, a paragraph, ...

![ex_4](pics/example_4.png)

The typical setting is sentiment analysis: classifying sentences as positive or negative (or by finer-grained emotions such as happiness, sadness, joy, anger, ...).

### Sequence to Sequence Models

Given an input sequence, produce a corresponding output sequence. A typical application is language translation.

![ex_5](pics/example_5.png)

## 2️⃣ RNNs: An introduction

Traditional (Statistical) Time Series forecasting can be represented in the following way:

![ts_1](pics/ts_1.png)

The algorithm learns from past values to predict future values, using temporal features such as trend and seasonality.

👉 Example: [**ARIMA**](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average) models **recursively** predict the next data points one after the other
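
To make "recursively" concrete, here is a minimal sketch of recursive forecasting. It is not a full ARIMA model: it assumes a toy autoregressive rule with a single hypothetical coefficient `phi`, and both the function name `recursive_forecast` and the numbers are made up for illustration. The point is only that each prediction is fed back in as if it had been observed.

```python
def recursive_forecast(history, phi, n_steps):
    """Forecast n_steps ahead with a toy rule: next value = phi * previous value."""
    window = list(history)
    forecasts = []
    for _ in range(n_steps):
        next_value = phi * window[-1]   # predict one step ahead
        forecasts.append(next_value)
        window.append(next_value)       # feed the prediction back in as an observation
    return forecasts


forecasts = recursive_forecast(history=[32.0, 16.0, 8.0], phi=0.5, n_steps=3)
print(forecasts)  # [4.0, 2.0, 1.0]
```

Because every step builds on the previous prediction, errors can compound the further ahead we forecast.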

What follows is a slightly different approach...

![dl_framework](pics/dl_frame.png)

![rnn_framework](pics/rnn_frame.png)

<img src="pics/pred_1.png" width="750"/>

<img src="pics/pred_2.png" width="820"/>

<img src="pics/pred_3.png" width="850"/>

<img src="pics/pred_4.png" width="910"/>

<img src="pics/pred_5.png" width="930"/>

<img src="pics/pred_6.png" width="990"/>

<img src="pics/pred_7.png" width="960"/>

<img src="pics/pred_8.png" width="920"/>


<img src="pics/pred_recap.png" width="850"/>

<img src="pics/pred_recap_2.png" width="910"/>


## 3️⃣ RNNs Under the Hood

<img src="pics/rnn_intro.png" width="580"/>

![rnn_diagramm.png](pics/rnn_diagram.png)

**Let's zoom inside this function $f_W$ for one time step**. 

![rnn_diagramm_2.png](pics/rnn_diagram_2.png)

![rnn_intro_2.png](pics/rnn_intro_2.png)

![rnn_weight.png](pics/rnn_weight.png)

![rnn_note.png](pics/rnn_note.png)

![rnn_number_units.png](pics/rnn_number_units.png)
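
The slides above are images, so here is a hedged NumPy sketch of what one recurrent layer might compute. It assumes the standard "vanilla" formulation $h_t = \tanh(x_t W_x + h_{t-1} W_h + b)$; the names `W_x`, `W_h`, `b` and `simple_rnn_forward` are illustrative choices, not necessarily the notation used in the pictures.

```python
import numpy as np


def simple_rnn_forward(x_seq, W_x, W_h, b):
    """Run a vanilla RNN over one sequence and return every hidden state.

    x_seq: (n_observations, n_features)
    W_x:   (n_features, n_units)  input-to-hidden weights
    W_h:   (n_units, n_units)     hidden-to-hidden weights
    b:     (n_units,)             bias
    """
    h = np.zeros(W_h.shape[0])        # initial hidden state
    states = []
    for x_t in x_seq:                 # the same weights are reused at every time step
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
n_observations, n_features, n_units = 5, 4, 3
x_seq = rng.normal(size=(n_observations, n_features))
W_x = rng.normal(size=(n_features, n_units))
W_h = rng.normal(size=(n_units, n_units))
b = np.zeros(n_units)

states = simple_rnn_forward(x_seq, W_x, W_h, b)
n_params = W_x.size + W_h.size + b.size

print(states.shape)  # (5, 3): one hidden state of size n_units per time step
print(n_params)      # 24 = 4*3 + 3*3 + 3
```

Note that the number of parameters depends on the number of features and units, not on the length of the sequence.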

![rnn_layer.png](pics/rnn_layer.png)

![rnn_stack.png](pics/rnn_stack.png)

![rnn_diagramm.png](pics/rnn_diagram.png)

<img src="pics/rnn_diagram_stack.png" width="1050"/>

![rnn_type.png](pics/rnn_type.png)

<img src="pics/rnn_lstm_gru.png" width="1050"/>

## 4️⃣ NLP: Why we need RNNs

### Because order matters!

An obvious example:

- "The teacher inspired the students with a passionate speech about deep learning"
- "The students inspired the teacher with a passionate speech about deep learning"

A more subtle example:

- "The company hired talented employees and achieved great success."
- "The company achieved great success and hired talented employees."
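
One way to see why order matters: a representation that only counts words (a "bag of words") cannot tell the first two sentences apart. The sketch below uses only the standard library; `bag_of_words` is a hypothetical helper written for this illustration.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count words while discarding their order."""
    return Counter(sentence.lower().split())


s1 = "The teacher inspired the students with a passionate speech about deep learning"
s2 = "The students inspired the teacher with a passionate speech about deep learning"

same_bag = bag_of_words(s1) == bag_of_words(s2)
same_sequence = s1.lower().split() == s2.lower().split()

print(same_bag)       # True: identical word counts
print(same_sequence)  # False: the order differs
```

A model that reads the words one after the other, like an RNN, has access to the information that the bag of words throws away.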

Let's give ourselves a task to solve by the end of the lecture!

Our `X` will be sentences and our `y` will be whether each sentence is positive or negative!

How do we go about this?

- "This movie is the best thing I've seen in my life" -> ✅
- "I had to leave this film because I was so bored" -> ❌
- "Anything with Nicolas Cage in it is a masterpiece: 5/5" -> ✅

### How to feed Recurrent Neural Networks with words?

![rnn_example.png](pics/rnn_example.png)

### We have to convert words into numbers somehow!

**Text** is also a form of **recurrent data**, where a sentence is a sequence of words, each word being one observation. For now, let's imagine we can express each **word** as four **numbers** (or a point/vector in 4-D space).

![rnn_example_2.png](pics/rnn_example_2.png)

### What will our `X` look like?

![rnn_example_3.png](pics/rnn_example_3.png)

`X` is just a sequence of observations! Univariate, multivariate, an image, ... The point is that the observation is repeated through time!

`X.shape = (N_SEQUENCES, N_OBSERVATIONS, N_FEATURES)`

**Warning**: The number of observations can vary from one sequence to another! (This is when you `pad` - we'll get back to that)
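
As a rough illustration of that shape, the sketch below builds `X` from two made-up "sentences" of different lengths, where each word is a 4-D vector. The values are placeholders, and the zeros are placed at the start of the shorter sequence (one of the padding options discussed later).

```python
import numpy as np

n_features = 4
# Two hypothetical sentences of different lengths; each word is a 4-D vector
seq_a = np.ones((3, n_features))        # 3 words
seq_b = np.ones((5, n_features)) * 2    # 5 words

max_len = max(len(seq_a), len(seq_b))
X = np.zeros((2, max_len, n_features))
for i, seq in enumerate([seq_a, seq_b]):
    X[i, max_len - len(seq):] = seq     # zeros first, real observations at the end

print(X.shape)  # (2, 5, 4) = (N_SEQUENCES, N_OBSERVATIONS, N_FEATURES)
```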


### First, we'll need tokens

Neural networks can't work directly on words, so you still need to convert the text into tokens first.

![rnn_example_4.png](pics/rnn_example_4.png)



In [1]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

### Let's create some mock data
def get_mock_up_data():
    sentence_1 = 'This movie was awful'
    sentence_2 = 'I loved every moment of this movie!'
    sentence_3 = 'I want a refund; I was so bored!'

    X = [sentence_1, sentence_2, sentence_3]
    y = np.array([0., 1., 0.])

    ### Let's tokenize the vocabulary
    tk = Tokenizer()
    tk.fit_on_texts(X)
    vocab_size = len(tk.word_index)
    print(f'There are {vocab_size} different words in your corpus')
    X_token = tk.texts_to_sequences(X)

    ### Pad the inputs
    X_pad = pad_sequences(X_token, dtype='float32', padding='pre')

    return X_pad, y, vocab_size

X_pad, y, vocab_size = get_mock_up_data()
print("X_pad.shape", X_pad.shape)
X_pad



There are 14 different words in your corpus
X_pad.shape (3, 8)


array([[ 0.,  0.,  0.,  0.,  2.,  3.,  4.,  5.],
       [ 0.,  1.,  6.,  7.,  8.,  9.,  2.,  3.],
       [ 1., 10., 11., 12.,  1.,  4., 13., 14.]], dtype=float32)

![rnn_padding.png](pics/rnn_padding.png)

🤔 To "post" or to "pre"? The eternal question: there's no definitive answer and it varies by architecture, but results seem to lean towards using ["pre"](https://pmc.ncbi.nlm.nih.gov/articles/PMC7471694/) (Another resource: [Medium](https://saadsohail5104.medium.com/understanding-padding-in-nlp-types-and-when-to-use-them-bacae6cae401#:~:text=Pre%2DPadding%20(Default)%3A,the%20end%20of%20a%20sequence.))
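
To see the difference between the two options without any library, here is a small pure-Python sketch. The `pad` helper is hypothetical and only mimics the padding behaviour (it does not truncate); the token ids are those of 'This movie was awful' from the output above.

```python
def pad(tokens, max_len, padding="pre", value=0):
    """Pad a list of token ids with `value` up to `max_len` (no truncation)."""
    filler = [value] * max(0, max_len - len(tokens))
    return filler + tokens if padding == "pre" else tokens + filler


tokens = [2, 3, 4, 5]  # 'This movie was awful'
pre = pad(tokens, 8, padding="pre")
post = pad(tokens, 8, padding="post")

print(pre)   # [0, 0, 0, 0, 2, 3, 4, 5]
print(post)  # [2, 3, 4, 5, 0, 0, 0, 0]
```

One plausible intuition for "pre": the RNN reads the real words last, so its final state is not diluted by a long run of padding values.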



### Now we need to make the leap from tokens to vectors

We want each word to be represented by a vector `↗` of chosen length. How?

### Consider a 2D embedding

To further specify the meaning of each word, we add a dimension: we rank them according to their relative abstraction. 

<img src="pics/rnn_2d_embed.png" width="850"/>

We can't differentiate much between "horse" and "killer whale" :)

### Adding a 3rd dimension

<img src="pics/rnn_3d_embed.png" width="850"/>

### Arithmetic on words?

If we embed correctly, we could dream of performing mathematical operations on the embeddings, which would have meaning in terms of natural language.

E.g. we could express the analogy "King is to queen as man is to..." as $V(Queen) - V(King) + V(Man) = ?$

<img src="pics/word2vec.png" width="300"/>

Here, as the red vectors are similar, we could expect $V(Queen) - V (King) = V(Woman) - V(Man)$

So the answer we're looking for might be $Woman$.
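
The idea can be tried on a toy example before using real embeddings. The 2-D vectors below are invented for illustration (axis 0 loosely stands for "royalty", axis 1 for "femininity"); real embeddings are learned, not hand-written. The nearest word is found with cosine similarity.

```python
import numpy as np

# Hypothetical hand-made embeddings: [royalty, femininity]
vectors = {
    "king":  np.array([0.9, 0.1]),
    "queen": np.array([0.9, 0.9]),
    "man":   np.array([0.1, 0.1]),
    "woman": np.array([0.1, 0.9]),
}


def cosine(u, v):
    """Cosine similarity: 1.0 means the two vectors point in the same direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


query = vectors["queen"] - vectors["king"] + vectors["man"]
ranked = sorted(vectors, key=lambda word: cosine(query, vectors[word]), reverse=True)

print(ranked[0])  # woman
```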


### What makes a good embedding?

Semantically close words are mathematically close in this space

❗️ Usually, word embedding spaces are from 30 up to 300 dimensions ❗️

🤔 But wait ... how do you choose these numbers? Surely we're not scoring all of our words on n different axes?

<img src="pics/embeddings_cloud.png" width="500"/>

### Two options: learn with `layers.Embedding` or with `Word2Vec`


<img src="pics/embedding_vs_word2vec_2.png" width="800"/>


👈 The left option allows you to have a representation that is perfectly suited to your task! However, it increases the number of parameters to learn, and thus:

- the time of each epoch (more parameters to optimize during back-propagation)
- the time to converge (because more parameters to find overall)

👉 On the other hand, Word2Vec is an unsupervised learning method that is not specifically designed for your task (so it may be sub-optimal), but training it is very fast! You will also be able to optimize your RNN faster, as you'll have fewer parameters to learn.

- ❗ Prefer Word2Vec on a small corpus (esp. with transfer learning)
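
Whichever option you pick, the embedding itself can be thought of as a lookup table with one row per token. The NumPy sketch below illustrates this under simplifying assumptions: the matrix is filled with random values (in practice it would be learned by the layer or taken from Word2Vec), the embedding size of 2 is arbitrary, and row 0 is assumed to be reserved for the padding token.

```python
import numpy as np

vocab_size, embedding_dim = 14, 2
rng = np.random.default_rng(42)

# One row per token id; row 0 is assumed to be reserved for padding
embedding_matrix = rng.normal(size=(vocab_size + 1, embedding_dim))

# Padded token ids of the first two mock sentences from the tokenization cell
X_tokens = np.array([
    [0, 0, 0, 0, 2, 3, 4, 5],
    [0, 1, 6, 7, 8, 9, 2, 3],
])

X_embedded = embedding_matrix[X_tokens]  # integer indexing acts as a table lookup

print(X_embedded.shape)  # (2, 8, 2) = (N_SEQUENCES, N_OBSERVATIONS, N_FEATURES)
```

With a trainable embedding, the rows of this table are extra parameters updated during back-propagation; with a pretrained one, they are fixed in advance.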

<img src="pics/first_option.png" width="700"/>

<img src="pics/second_option.png" width="1000"/>

💡 We use unsupervised learning to look at the words around each chosen word, and assume that these will be relevant to the word we're interested in!
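
"Looking at the words around each chosen word" can be sketched as building (center, context) pairs with a sliding window. This is only the data-preparation idea behind one flavour of Word2Vec (often called skip-gram), not its actual training procedure; `context_pairs` and the window size are illustrative choices.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every word and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:                       # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = context_pairs("i loved every moment".split(), window=1)
print(pairs)
# [('i', 'loved'), ('loved', 'i'), ('loved', 'every'),
#  ('every', 'loved'), ('every', 'moment'), ('moment', 'every')]
```

A model trained to predict the context word from the center word ends up giving similar vectors to words that appear in similar contexts.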

### Instead of training it on your training set (especially if it is very small), you can directly load a pretrained embedding

In [2]:
import gensim.downloader

print(list(gensim.downloader.info()['models'].keys()))

model_wiki = gensim.downloader.load('glove-wiki-gigaword-50')

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


We can even try out our aforementioned example with code!

In [3]:
# N.B. Looking up a word that is not in the glove-wiki-gigaword-50 vocabulary raises a KeyError
example_1 = model_wiki["queen"] - model_wiki["king"] + model_wiki["man"]

In [4]:
model_wiki.most_similar(example_1)

[('woman', 0.8903914093971252),
 ('girl', 0.8453726768493652),
 ('man', 0.8301756381988525),
 ('her', 0.7845831513404846),
 ('boy', 0.7763065695762634),
 ('she', 0.7619765400886536),
 ('herself', 0.7597628235816956),
 ('blind', 0.7296755313873291),
 ('mother', 0.7230339646339417),
 ('blonde', 0.713614284992218)]

What about "evil is to good as cold is to..."?

In [5]:
example_2 = model_wiki["good"] - model_wiki["evil"] + model_wiki["cold"]

In [6]:
model_wiki.most_similar(example_2)

[('warm', 0.7870427966117859),
 ('dry', 0.7643216848373413),
 ('hot', 0.750431478023529),
 ('cool', 0.7491166591644287),
 ('weather', 0.7476345896720886),
 ('getting', 0.7447267770767212),
 ('good', 0.7442173361778259),
 ('cold', 0.7425146102905273),
 ('low', 0.7231549620628357),
 ('little', 0.7223491668701172)]

## 5️⃣ Coded example: Sentiment analysis

First let's load up our data

In [7]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Set parameters
max_features = 10000  # Keep only the 10,000 most frequent words of the IMDb vocabulary
max_len = 500  # Maximum sequence length
embedding_dim = 50  # Dimensionality of word embeddings

# Load the IMDb dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)



Next we need to pad 🗒️

In [8]:
# Pad sequences to a fixed length
x_train = sequence.pad_sequences(x_train, maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)

Now we'll construct and compile our model 🔨

In [9]:
# Build the model
model = Sequential()
# `input_length` is omitted: recent Keras versions deprecate it and infer the length from the data
model.add(Embedding(max_features, embedding_dim))
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
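
To make the cost of a trainable embedding concrete, the parameter counts of this model can be worked out by hand. The sketch below uses the usual formulas for these layer types (an LSTM has 4 gates, each with input weights, recurrent weights and a bias); treat it as a back-of-the-envelope check rather than a replacement for `model.summary()`.

```python
max_features, embedding_dim, lstm_units = 10000, 50, 64

# Embedding: one vector of size embedding_dim per word in the vocabulary
embedding_params = max_features * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense(1): one weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 500000 29440 65 529505
```

Under these assumptions, the embedding accounts for the vast majority of the parameters, which is why a pretrained embedding can speed up training so much.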



Finally we fit and evaluate 💯

In [11]:
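# Train the model
# N.B. For brevity the test set doubles as validation data here; in practice, hold out a separate validation split
model.fit(x_train, y_train, batch_size=128, epochs=1, validation_data=(x_test, y_test))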

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print(f'Test loss: {loss:.4f}')
print(f'Test accuracy: {accuracy:.4f}')

196/196 ━━━━━━━━━━━━━━━━━━━━ 180s 920ms/step - accuracy: 0.8664 - loss: 0.3285 - val_accuracy: 0.8756 - val_loss: 0.2979
782/782 ━━━━━━━━━━━━━━━━━━━━ 180s 230ms/step - accuracy: 0.8742 - loss: 0.3009
Test loss: 0.2979
Test accuracy: 0.8756


Done! With minimal pre-processing and a single epoch of training, the model reaches about 88% accuracy on the test set.

## 6️⃣ Further reading
- 📚 [Stanford Cheat Sheet](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks)
- 📚 [Medium - Illustrated Guide to RNN](https://medium.com/data-science/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9)
- 📚 [Medium - Illustrated Guide to LSTM and GRUs](https://medium.com/data-science/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21)
- 📺 [RNN explained with 3*3 matrices - 21min](https://www.youtube.com/watch?v=UNmqTiOnRfg)
- 📺 [RNN & LSTM explained without math - 24min](https://www.youtube.com/watch?v=WCUNPb-5EYI)

### Deep Learning courses - videos
- [CS224N](https://www.youtube.com/playlist?list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z): Natural Language Processing with Deep Learning by Stanford University (recommended)
- [92 step-by-step videos](https://www.youtube.com/watch?v=SGZ6BttHMPw&list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH) by Hugo Larochelle.
- [Deep Learning course](https://www.coursera.org/specializations/deep-learning), by Andrew Ng.
- [MIT Introduction to Deep Learning](https://www.youtube.com/playlist?list=PLtBw6njQRU-rwp5__7C0oIVt26ZgjG9NI), from MIT.

### Deep Learning courses - books
- [Deep Learning book](https://www.deeplearningbook.org/), by Ian Goodfellow, Yoshua Bengio, Aaron Courville
- Deep Learning with Python, by François Chollet
- 📺 [231 Lecture on RNN - 1h20](https://www.youtube.com/watch?v=6niqTuYFZLQ)