# Recurrent Neural Networks (RNNs) for Natural Language Processing (NLP) with Keras

## Rationale

Why do we need to consider an RNN architecture at all?
1. CNNs processing windows of words are good at using some of the word proximities in sentences but fail to use the relationships that are present in longer sentences.
2. Language builds meaning as a sequence of words unfolds: each word is interpreted together with the words that came before it. So **word order** matters.
3. So we need to consider how to handle the construction of meaning **over time**.

Consider the following two sentences -



> The stolen car sped into the arena. \
The clown car sped into the arena.



These are two almost identical sentences. However, by the time you the reader get to the end of each sentence, you will have formed a very different sense of what the different sentences mean.

In order to carry the contribution of the adjectives "stolen" or "clown" at the beginning of each sentence through to the end of its sentence, you need to incorporate some notion of **memory**, so that their contributions persist and shape the meaning at the end of the sentence.

This notion of **memory** is what RNNs introduce.
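
To make the idea of memory concrete, here is a minimal sketch of a single recurrent neuron. It is not the Keras implementation: the weights `w_in` and `w_rec` and the numeric stand-ins for the words are made-up values, chosen only to show that an input at the start of a sequence still influences the state at the end.

```python
import math

def run_toy_rnn(inputs, w_in=0.5, w_rec=0.9):
    """Run one recurrent neuron over a sequence and return its final state.

    w_in and w_rec are illustrative values, not trained weights.
    """
    h = 0.0  # the "memory" starts empty
    for x in inputs:
        # the new state mixes the current input with the previous state
        h = math.tanh(w_in * x + w_rec * h)
    return h

# Hypothetical numeric stand-ins: only the first "word" differs.
stolen_car = [1.0, 0.2, 0.3, 0.1]
clown_car = [-1.0, 0.2, 0.3, 0.1]

h_stolen = run_toy_rnn(stolen_car)
h_clown = run_toy_rnn(clown_car)
print(h_stolen, h_clown)  # the final states differ although the endings match
```

Because each state is computed from the previous one, the difference introduced by the first input is carried forward through every later step.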

To extend your practical experience, this code is intended for running on your local machine and not particularly on Colab. While the *free* version of Colab provides a good environment to learn, it can be limited in the degree to which it facilitates experiments without running into resource constraints.

## Import relevant libraries
Import the relevant libraries and set up your **working directory** in the code below

In [None]:
%pip install nltk
%pip install gensim
import glob # find files whose paths match a wildcard pattern
import nltk # bring in the Natural Language Tool Kit
import os # handle Operating System file tasks
from random import shuffle # facility to generate random selections
from nltk.tokenize import TreebankWordTokenizer # Tokenize the strings
from gensim import models

# Set your working directory in the code here
os.chdir('C:/Users/patrick.denny/OneDrive - University of Limerick/Documents/AdvancedNLP/Module Material/3. and 4. CNNs and RNNs with Sentiment Analysis/Example Code')
print(os.getcwd())

If not already installed, install **TensorFlow** and **Keras** using **pip** or **pip3**

In [None]:
!pip3 install tensorflow
!pip3 install keras
# if you are still having trouble, then they might just need an update and that is straightforward
# pip3 install tensorflow --upgrade
# pip3 install keras --upgrade

## Get word vectors and training data
This is the same data as you used in the previous CNN code. If you have already downloaded these data to your working directory, then you don't need to do so again. The data consist of
- the aclImdb database of reviews and their scored sentiments, for training
- word vectors trained on a Google News corpus, for converting tokens into word vectors

In [None]:
# download the IMDb database
!wget 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
# the Google News word vectors are downloaded separately (see the Kaggle link below)


'wget' is not recognized as an internal or external command,
operable program or batch file.
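
The error output above shows that `wget` is not available in a default Windows shell. One portable alternative, sketched here using only the Python standard library, is to download and unpack the archive from Python itself. The helper name `fetch_and_extract` is an assumption made for this illustration; the URL is the one used in the cell above.

```python
import os
import tarfile
import urllib.request

IMDB_URL = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

def fetch_and_extract(url, archive_path, extract_dir='.'):
    """Download url to archive_path (only if it is missing) and unpack the tar.gz."""
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)  # skipped when already downloaded
    os.makedirs(extract_dir, exist_ok=True)
    with tarfile.open(archive_path, 'r:gz') as tar:
        tar.extractall(extract_dir)
    return extract_dir
```

Calling `fetch_and_extract(IMDB_URL, 'aclImdb_v1.tar.gz')` from your working directory should leave an `aclImdb` folder there, which is the layout the loading code later in this notebook expects.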


You will also need to download the rather large set of Google News word vectors to your **working directory**; it can be downloaded using this Kaggle page - https://www.kaggle.com/datasets/leadbest/googlenewsvectorsnegative300

The code below loads the first 20,000 word vectors from the Google News file (set by the `limit` argument). We can choose more or fewer words for the vocabulary, at the cost of more or less memory.

In [None]:
w = models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', limit = 20000, binary=True)

## Data preprocessing
As ever, there is a little bit of preprocessing to get the data into a shape we can use.

### Load in samples and shuffle them together
The idea here is that we
- Read in the positive and negative sentiments
- Associate their ground truth labels with them, i.e., whether a specific sentiment is actually positive or negative
- Shuffle them to avoid bias

In [None]:
def pre_process_data(filepath):
  """
  Load pos and neg examples from separate dirs then shuffle them together.
  """
  positive_path = os.path.join(filepath, 'pos')
  negative_path = os.path.join(filepath, 'neg')
  pos_label = 1
  neg_label = 0
  dataset = []
  for filename in glob.glob(os.path.join(positive_path, '*.txt')):
    with open(filename, 'r', encoding = "utf-8") as f:
      dataset.append((pos_label, f.read()))
  for filename in glob.glob(os.path.join(negative_path, '*.txt')):
    with open(filename, 'r', encoding = "utf-8") as f:
      dataset.append((neg_label, f.read()))
  shuffle(dataset)
  return dataset

### Data tokenizer + vectorizer
The recurrent neural network operates on vectors of numbers, so each text sample in the dataset has to be converted in essentially two steps
- **Tokenize** the words in the input data set, in this case using the Treebank Word Tokenizer
- Determine the corresponding **word vector** for the token.

These are then collected and returned by the helper routine below.


In [None]:
def tokenize_and_vectorize(dataset):
    tokenizer = TreebankWordTokenizer() # The Treebank Tokenizer from the Natural Language Toolkit
    vectorized_data = []
    for sample in dataset:
      tokens = tokenizer.tokenize(sample[1])
      sample_vecs = []
      for token in tokens:
        try:
          sample_vecs.append(w[token])
        except KeyError:
          pass # this is just if there is no matching token in the vocabulary
      vectorized_data.append(sample_vecs)
    return vectorized_data
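
One consequence of the `except KeyError: pass` branch is that a review can end up with fewer vectors than it has tokens. The sketch below illustrates this with assumed stand-ins: a tiny hypothetical dictionary `toy_vectors` replaces the Google News vectors `w`, and a whitespace split replaces the Treebank tokenizer.

```python
# Hypothetical 2-dimensional vectors standing in for the 300-dimensional ones.
toy_vectors = {'good': [0.9, 0.1], 'movie': [0.2, 0.7]}

def vectorize_tokens(tokens, vectors):
    """Keep vectors for known tokens and silently skip unknown ones."""
    return [vectors[token] for token in tokens if token in vectors]

tokens = 'a good movie'.split()  # whitespace split stands in for the tokenizer
sample_vecs = vectorize_tokens(tokens, toy_vectors)
print(len(tokens), len(sample_vecs))  # 3 tokens, but only 2 vectors
```

With a vocabulary limited to 20,000 words, it is worth keeping in mind that some of each review is discarded in this way before the network ever sees it.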


### Target unzipper
Peel off the target values (the **ground truth** of **positive** or **negative sentiment** for a given sample) from a dataset.

In [None]:
def collect_expected(dataset):
    """ Peel off the target values from the dataset """
    expected = []
    for sample in dataset:
        expected.append(sample[0])
    return expected

### Load and prepare your data
This is where we split the data into **training** and **testing** components with corresponding labels. We will, iteratively
- **train** the network with training data and its corresponding labels
- **test** the network performance with our test data and its corresponding labels

In [None]:
print(os.getcwd())
dataset = pre_process_data('aclImdb/train')
vectorized_data = tokenize_and_vectorize(dataset)
expected = collect_expected(dataset)
split_point = int(len(vectorized_data) * .8) # Split the train and tests into an 80/20 unshuffled split
x_train = vectorized_data[:split_point]
y_train = expected[:split_point]
x_test = vectorized_data[split_point:]
y_test = expected[split_point:]

# Clear unneeded data
del(vectorized_data)


C:\Users\patrick.denny\OneDrive - University of Limerick\Documents\AdvancedNLP\Module Material\3. and 4. CNNs and RNNs with Sentiment Analysis\Example Code


### What do the data look like?
The data are essentially a **sentiment of a text** and its **corresponding text** and there are many of them...

In [None]:
print(dataset[10])
print('\n The number of entries in the dataset is ', len(dataset))

(0, '...from this awful movie! There are so many things wrong with this film, acting, writing, direction, editing, etc. that it\'s amazing that something rises to the top and proves itself to be the absolute worst. The music! I noted that the film has two composers listed. This must be the reason why every single frame has music, of the absolute worst "D" movie style drivel. They have never heard of the expression "less is more". It got so painful to listen to, I muted the sound every time there was no dialogue, not that the dialogue was that good. You have to feel sorry for Robert Wagner and Tom Bosley, I\'m sure they didn\'t see roles like this in the twilight of their careers. See it at your own risk.')

 The number of entries in the dataset is  25000


## Building and Training our Recurrent Neural Network
### Initialize the network parameters
Now we set **hyperparameters** for the network prior to training.

In [None]:
maxlen =  400   # The maximum length of an input sequence for training
batch_size = 16 # The number of samples processed before each weight update (backpropagation)
embedding_dims = 300 # The number of word vector embedding dimensions used
epochs = 5 # The number of full passes over the training data
num_neurons = 50 # Number of neurons in the hidden layer

### Tidying the input data
Next we have to pad or truncate the sequence of tokens in each review so that we have a fixed input size for our RNN training input.

In [None]:
def pad_trunc(data, maximumlen):
    """
    Pad or truncate each review to the size set by the hyperparameter maxlen
    because we need each input sequence to have a consistent length.
    """
    new_data = []

    # A zero vector with the same dimensionality as the word vectors
    zero_vector = [0.0] * len(data[0][0])

    for sample in data:
        if len(sample) > maximumlen: # if the input is too large, truncate it
            temp = sample[:maximumlen]
        elif len(sample) < maximumlen: # if the input is too small, pad it
            temp = list(sample) # copy, so that the caller's list is not modified
            # Append the appropriate number of zero vectors to the list
            additional_elems = maximumlen - len(sample)
            for _ in range(additional_elems):
                temp.append(zero_vector)
        else:
            temp = sample
        new_data.append(temp)
    return new_data
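
To see padding and truncation on something small enough to inspect, here is a compact sketch of the same idea. The name `pad_trunc_copy` and the explicit `dims` argument are assumptions for this illustration; passing `dims` avoids having to read the vector size from the first sample.

```python
def pad_trunc_copy(data, maximumlen, dims):
    """Return new lists of exactly maximumlen vectors, leaving the input untouched."""
    zero_vector = [0.0] * dims
    padded = []
    for sample in data:
        clipped = list(sample[:maximumlen])                      # truncate (and copy)
        clipped += [zero_vector] * (maximumlen - len(clipped))   # pad if too short
        padded.append(clipped)
    return padded

# One short and one long toy "review", each made of 2-dimensional vectors.
toy = [[[1.0, 1.0]], [[2.0, 2.0]] * 5]
result = pad_trunc_copy(toy, maximumlen=3, dims=2)
print([len(sample) for sample in result])  # [3, 3]
```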

### Load your test and training data
We go through a similar exercise as before for the CNN, by ensuring consistently sized sequences and by reshaping the training and test data into **NumPy array structures** as these are much easier for the system to manipulate.

In [None]:
import numpy as np

# Pad or truncate the inputs to the length maxlen
x_train = pad_trunc(x_train, maxlen)
x_test = pad_trunc(x_test, maxlen)

x_train = np.reshape(x_train, (len(x_train), maxlen, embedding_dims))
y_train = np.array(y_train)
x_test = np.reshape(x_test, (len(x_test), maxlen, embedding_dims))
y_test = np.array(y_test)

### Initialize an empty Keras network
We start to build a sequential neural network using Keras. Firstly, we set up a Sequential model.

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, SimpleRNN
model = Sequential()

### Add a Recurrent layer
As it is a recurrent neural network, we need to add a simple recurrent neural network layer which will set up the appropriate "plumbing" :-)

In [None]:
model.add(SimpleRNN(
    num_neurons, return_sequences=True,
    input_shape=(maxlen, embedding_dims)
))

### Add a Dropout layer
We want the network to generalise rather than memorise its training data, so we add a dropout layer. Dropout is a countermeasure against **overfitting**, i.e. the network taking on too many fine details of the training data, which adversely affects its **generalisability**. The dropout layer produces an output the same size as that of the previous layer, with **one-to-one** connections (i.e. each neuron in the dropout layer is connected to exactly one in the previous layer), but during training it randomly "drops out" (sets to zero) 20% of the values passing through it.

In [None]:
model.add(Dropout(0.2)) # randomly zero 20% of the values during training
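
As a rough picture of what happens inside a dropout layer during training, the following NumPy sketch zeroes a random fraction of the values and rescales the survivors by `1 / (1 - rate)` so that the expected total stays the same. The function name `dropout_sketch` and the toy array are assumptions for illustration; this is not the Keras source code.

```python
import numpy as np

def dropout_sketch(activations, rate, rng):
    """Zero a random fraction `rate` of the values and rescale the survivors."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
activations = np.ones((4, 5))    # stand-in for a slice of the RNN output
dropped = dropout_sketch(activations, rate=0.2, rng=rng)
print(dropped)                   # each entry is either 0.0 or (about) 1.25
```

At prediction time no values are dropped, which is why dropout only affects training.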

### Add a Flatten layer
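This layer "flattens" the output of the previous layer: the **400** x **50** grid of values (one row of **num_neurons** outputs for each of the **maxlen** time steps) is laid out as a single vector of **20,000** values, so that the next layer sees one input per value.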

In [None]:
model.add(Flatten())
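
The effect of flattening on the shape of the data can be checked with a small NumPy sketch. The array of zeros is only a stand-in for the real SimpleRNN output; the sizes are the hyperparameters set earlier, and the batch of 2 is an arbitrary choice.

```python
import numpy as np

batch, maxlen, num_neurons = 2, 400, 50              # batch of 2 is arbitrary
rnn_output = np.zeros((batch, maxlen, num_neurons))  # stand-in for the SimpleRNN output
flattened = rnn_output.reshape(batch, -1)            # keep the batch axis, merge the rest
print(flattened.shape)                               # (2, 20000)
```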

### Add a Dense layer
The dense layer serves a very useful purpose. Thus far, the layers of the network have been getting more intricate and quite large, but the dense layer "condenses" the information so far. It does this by taking a weighted combination of the neurons in the previous Flatten layer and applying a sigmoid activation function to the result. A **sigmoid activation function** compresses any number into a value between 0 and 1, and for this sentiment analysis application, where we represent our ground truth negative and positive labels as 0 and 1, this is ideal. Ultimately, it puts our **known labels** (sentiment **ground truth**) and our **predicted values** (our **predictions** from the network) on the same scale.

In [None]:
model.add(Dense(1, activation='sigmoid'))
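
The calculation performed by this single Dense unit can be sketched in a few lines of plain Python. The input values and weights below are hypothetical and chosen only to show the mechanics: a weighted sum plus a bias, squashed by the sigmoid.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_sigmoid(inputs, weights, bias):
    """One Dense unit: weighted sum of the inputs plus a bias, then sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Hypothetical flattened activations and weights, for illustration only.
prediction = dense_sigmoid([0.5, -0.2, 0.8], [1.0, 2.0, -1.5], bias=0.1)
print(prediction)     # a value strictly between 0 and 1
print(sigmoid(0.0))   # 0.5, the midpoint between the two labels
```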

### Compile your recurrent network
Let's stick the parts of the network together and see how many parameters we have.

In [None]:
model.compile('rmsprop','binary_crossentropy', metrics=['accuracy'])
model.summary()

You will note that the scale of the parameters in this RNN network is **an order of magnitude smaller** than its CNN counterpart.

Now, let's take a look at what the size of each of these layers is.

The **SimpleRNN** *shape* is based on
* weights for each embedding dimension for each token that is used; the embedding dimension of the word vectors is **300**
* a **bias term**: at each recurrence a neuron performs the usual node calculation, which combines its **weighted** inputs with a **bias**. So the number of bias terms per neuron is just **1**.
* the weights corresponding to the hidden layer that are passed as **output** from the network at time **t** into the network as an **input** at time **t+1**, in this case **50**

That gives us **300** + **1** + **50** = **351** parameters to consider at each time step.
But we have in turn **50** neurons in the hidden layer, so each of these has to be considered too.

This gives a grand total of **351** x **50** = **17,550** for the "simple" RNN setup.

Taking the subsequent layers in turn :

* The **Dropout layer** requires no parameters as it just creates a layer the same size as the previous layer with **one-to-one** connections, but randomly "drops out" 20% of the values during training. The **"20%"** is not a **parameter** of the network that is updated during processing, but a **hyperparameter** that is used to set up the network architecture.
* Similarly, the **Flatten layer** has no associated parameters, as it is just spreading out all of the previous neurons (**400** x **50** = **20,000**)
* The **Dense layer** has a weight for each of the neurons in the Flatten layer and a single bias term (**20,000** + **1** = **20,001**) but outputs a single value so the shape of the output is **1**.

That gives a total of **17,550** + **20,001** = **37,551** parameters in the network.
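
The arithmetic above can be checked with a short sketch. The helper names `simple_rnn_params` and `dense_params` are made up for this illustration; they simply encode the counting rules described in the text.

```python
def simple_rnn_params(embedding_dims, num_neurons):
    """Input weights + recurrent weights + one bias, for every neuron."""
    return (embedding_dims + num_neurons + 1) * num_neurons

def dense_params(maxlen, num_neurons):
    """One weight per flattened RNN output plus a single bias."""
    return maxlen * num_neurons + 1

rnn = simple_rnn_params(embedding_dims=300, num_neurons=50)
dense = dense_params(maxlen=400, num_neurons=50)
print(rnn, dense, rnn + dense)  # 17550 20001 37551
```

Re-running the two helpers with other values of `maxlen` or `num_neurons` is a quick way to predict what `model.summary()` should report before you build a variant of the network.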

### Train and save your model
I recommend you do this on a PC, workstation or some other form of local computer instead of Colab; be aware that it will take time. I've tried this on Colab and sometimes it crashes the session because I am using the free version of Colab. Ideally, you will have a GPU on your computer; if not, the code will still run but will take longer. This is not much of an issue if you are running a network for educational purposes, but becomes very noticeable very quickly if you are doing serious experiments with large datasets.

From a module learning perspective, it is more important that you have code that runs so that you can learn it rather than having high precision outcomes that require large amounts of time, data and processing overhead.

If you can build and appreciate the smaller models, then graduating to the larger ones is really a matter of
* exploiting that extra time, data and processing
* re-running and challenging your data
* trying out different selections of hyperparameter values
* rinsing and repeating...

In [None]:
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(x_test, y_test))

model_structure = model.to_json()
with open("simplernn1.json", "w") as json_file:
  json_file.write(model_structure)
model.save_weights("simplernn1.weights.h5")

You can **reload** a **saved** model as follows

In [None]:
# Import the module to read in a JSON format model

from keras.models import model_from_json

# Now instantiate a model

with open("simplernn1.json", "r") as json_file: # the same filename that was used when saving
  json_string = json_file.read()
model2 = model_from_json(json_string)

# Once the model structure exists, set its characteristic weights

model2.load_weights('simplernn1.weights.h5')
model2.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_1 (SimpleRNN)    (None, 100, 50)           17550     
                                                                 
 dropout_1 (Dropout)         (None, 100, 50)           0         
                                                                 
 flatten_1 (Flatten)         (None, 5000)              0         
                                                                 
 dense_1 (Dense)             (None, 1)                 5001      
                                                                 
Total params: 22,551
Trainable params: 22,551
Non-trainable params: 0
_________________________________________________________________

Note that the summary above was produced by a run with **maxlen = 100**, so its Flatten layer has **100** x **50** = **5,000** outputs and its Dense layer has **5,001** parameters. With **maxlen = 400**, as set in the hyperparameters earlier, the same layers give the **20,000** and **20,001** figures derived above, for a total of **37,551**.


### Run a simple test on your model
Now, the fun bit, run the code and see what it does.
Recall
- the closer the predicted sentiment is to **1**, the more positive the sentiment
- the closer the predicted sentiment is to **0**, the more negative the sentiment

In [None]:
#
# Create a couple of samples for prediction and predict
#
sample_1 ="It is a beautiful day outside, the weather is wonderful and I am so happy to be alive!"
sample_2 ="This is one of the worst movies I have ever seen. It is rubbish."
sample_3 ="I am exhausted and I just want to die."
sample_4 ="This is terrible.Terrible, terrible, terrible, terrible, terrible, terrible. How can anything be as bad as this?"
sample_5 ="I have great class and this makes me very happy. They are learning very interesting technology."
sample_6 = "Wow!"

samples = [sample_1, sample_2, sample_3, sample_4, sample_5, sample_6]

for sample in samples :
  vec_list = tokenize_and_vectorize([(1, sample)])
  test_vec_list = pad_trunc(vec_list, maxlen)
  test_vec = np.reshape(test_vec_list, (len(test_vec_list), maxlen, embedding_dims))
  print(sample, model.predict(test_vec))


# vec_list = tokenize_and_vectorize([(1, sample_4)])
# test_vec_list = pad_trunc(vec_list, maxlen)
# test_vec = np.reshape(test_vec_list, (len(test_vec_list), maxlen, embedding_dims))
# model.predict(test_vec)



It is a beautiful day outside, the weather is wonderful and I am so happy to be alive! [[0.93817055]]
This is one of the worst movies I have ever seen. It is rubbish. [[0.3374831]]
I am exhausted and I just want to die. [[0.37458467]]
This is terrible.Terrible, terrible, terrible, terrible, terrible, terrible. How can anything be as bad as this? [[0.21551856]]
I have great class and this makes me very happy. They are learning very interesting technology. [[0.8899865]]
Wow! [[0.75701046]]
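
If you want a hard positive/negative decision rather than a score, a common convention is to cut at 0.5. The sketch below applies that rule to the six scores printed above; the helper name `to_label` and the choice of 0.5 as the threshold are assumptions for illustration.

```python
# Scores copied from the output above, in the order of sample_1 to sample_6.
scores = [0.93817055, 0.3374831, 0.37458467, 0.21551856, 0.8899865, 0.75701046]

def to_label(score, threshold=0.5):
    """Map a sigmoid score to a sentiment label using a fixed threshold."""
    return 'positive' if score >= threshold else 'negative'

labels = [to_label(score) for score in scores]
print(labels)
# ['positive', 'negative', 'negative', 'negative', 'positive', 'positive']
```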


## Build a larger network

The performance of a network depends on a combination of choices of **architecture**, **training data** and **hyperparameters**. So, again, let's experiment with the number of neurons by setting **num_neurons = 100** and see what happens to the network performance.

In [None]:
num_neurons = 100
model_bigger = Sequential()
model_bigger.add(SimpleRNN(
    num_neurons, return_sequences=True, input_shape=(maxlen, embedding_dims)
))
model_bigger.add(Dropout(.2))
model_bigger.add(Flatten())
model_bigger.add(Dense(1, activation='sigmoid'))
model_bigger.compile('rmsprop', 'binary_crossentropy', metrics=['accuracy'])
model_bigger.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_2 (SimpleRNN)    (None, 400, 100)          40100     
                                                                 
 dropout_2 (Dropout)         (None, 400, 100)          0         
                                                                 
 flatten_2 (Flatten)         (None, 40000)             0         
                                                                 
 dense_2 (Dense)             (None, 1)                 40001     
                                                                 
Total params: 80,101
Trainable params: 80,101
Non-trainable params: 0
_________________________________________________________________


### Train the larger network

In [None]:
model_bigger.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test,y_test))

# Save the trained network
model_structure = model_bigger.to_json()
with open("simplernn_model_bigger.json", "w") as json_file_bigger:
  json_file_bigger.write(model_structure)
model_bigger.save_weights("simplernn_bigger.weights.h5")

## Predicting sentiments
Let's make sentiment predictions. Note that **the network has never seen these sentences before**.

In [None]:
#
# Create a couple of samples for prediction and predict
#
sample_1 ="It is a beautiful day outside, the weather is wonderful and I am so happy to be alive!"
sample_2 ="This is one of the worst movies I have ever seen. It is rubbish."
sample_3 ="I am exhausted and I just want to die."
sample_4 ="This is terrible.Terrible, terrible, terrible, terrible, terrible, terrible. How can anything be as bad as this?"
sample_5 ="I have a great class and this makes me very happy. They are learning very interesting technology."
sample_6 = "Wow!"

samples = [sample_1, sample_2, sample_3, sample_4, sample_5, sample_6]

# If you HAVE created the model already, then read it back in to use it

from keras.models import model_from_json
with open("simplernn_model_bigger.json", "r") as json_file_bigger:
  json_string = json_file_bigger.read()
model_bigger = model_from_json(json_string)
model_bigger.load_weights('simplernn_bigger.weights.h5')

# Here you are passing a dummy value into the first element of the tuple because
# your helper function expects it from the way it processed the initial data.
# That value won't ever see the network, so it doesn't matter what it is
# Could this and should this be syntactically better for production code? Yes.

for sample in samples :
  vec_list = tokenize_and_vectorize([(1, sample)])
  test_vec_list = pad_trunc(vec_list, maxlen)
  test_vec = np.reshape(test_vec_list, (len(test_vec_list), maxlen, embedding_dims))
  print(sample, model_bigger.predict(test_vec))

It is a beautiful day outside, the weather is wonderful and I am so happy to be alive! [[0.7948919]]
This is one of the worst movies I have ever seen. It is rubbish. [[0.4005727]]
I am exhausted and I just want to die. [[0.2393439]]
This is terrible.Terrible, terrible, terrible, terrible, terrible, terrible. How can anything be as bad as this? [[0.06073603]]
I have a great class and this makes me very happy. They are learning very interesting technology. [[0.8801557]]
Wow! [[0.43194225]]


What do you think of the scores? Why might they be what they are?

Also, for experimentation
- Consider different training corpora - https://www.kaggle.com/datasets?search=corpus
- Compare the sizes of the networks, the runtimes of the predictions of the networks and the accuracy of the network

## Two-way street - bidirectional RNNs for NLP
Sometimes we get information about a key word in a sentence later in the sentence and as humans we can make sense of it. For example, consider the sentence

> He taught the class of M.Sc. students

As you read this sentence, you will see that someone taught a class and then you will see that the class was of M.Sc. students; you could associate **teaching** with **a class** and then discover that the class was **M.Sc. students**. In principle you could read the words backwards and see that there are **students**, that the students were doing an **M.Sc.**, that the M.Sc. students were in a **class** and that the class was being **taught**. So, you can make richer sequences of inferences and relationships by also going backwards through the sentence. It is possible to do this with RNNs too.

### Build a **Bidirectional** RNN
This is just some sample code for building a bidirectional RNN using Keras

In [None]:
"""
from keras.models import Sequential
from keras.layers import Bidirectional, SimpleRNN

num_neurons = 10
maxlen = 100
embedding_dims = 300

model = Sequential()
model.add(Bidirectional(
    SimpleRNN(num_neurons, return_sequences=True), input_shape=(maxlen, embedding_dims))
)
"""

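
To show what "both directions" means mechanically, here is a toy sketch with a one-neuron RNN in plain Python. It makes simplifying assumptions that are not true of the Keras layer: the weights `w_in` and `w_rec` are made-up values, and the same weights are reused for both directions, whereas a real bidirectional layer learns a separate set for each direction and, by default, concatenates the two outputs.

```python
import math

def rnn_states(inputs, w_in=0.5, w_rec=0.9):
    """Return the state of a toy one-neuron RNN after every time step."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_states(inputs):
    """Pair each step's forward state with the state of a right-to-left pass."""
    forward = rnn_states(inputs)
    backward = rnn_states(inputs[::-1])[::-1]  # re-align to the original order
    return list(zip(forward, backward))

sequence = [1.0, 0.2, -0.3]  # hypothetical numeric stand-ins for tokens
pairs = bidirectional_states(sequence)
for step, pair in enumerate(pairs):
    print(step, pair)
```

At every position the pair combines what has been read so far from the left with what has been read so far from the right, so each token is represented with context from both sides.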

## Summary
Just before the **Dense** layer, a vector of shape **number of neurons x 1** comes out of the last time step of the **Recurrent** layer; it is a sort of encoding of the sequence of input tokens. It is like the notion of a **thought vector** that we discussed for CNNs, and it is a powerful notion that we take into the last section of this module so that we can amplify its power.

So, what have we found in summary?
- In NLP sequences, and in sentences in general, the meaning of words is affected by previous words
- Splitting a natural language statement into a time sequence of tokens can help get a deeper meaning from the sentences
- You can backpropagate learning errors "in time" as well as in the normal way that we did with CNNs
- RNNs are effectively very deep when unrolled over time, so their gradients can vanish or explode and this needs to be considered
- Efficient modelling of natural language character sequences was not practical until RNNs were applied to the task
- Weights in an RNN are adjusted across time for a given sample, thereby capturing the meaning that is hidden in the sequencing of the words
- There are different metrics, such as accuracy, that can be used to examine the output of RNNs
- You can exploit the sequencing of tokens in an RNN in both directions with bidirectional RNNs
- The notion of a **thought vector** that somehow captures an underlying **meaning** of an NLP sequence appears again and is worth considering in the next section
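
The point about gradients that disappear or explode can be illustrated with a toy one-neuron RNN. This is a sketch under simplifying assumptions (scalar weights, a constant input `x`, made-up values throughout): it multiplies together the per-step derivatives that backpropagation through time would chain, and shows how quickly the product shrinks.

```python
import math

def gradient_through_time(w_rec, steps, x=0.1, w_in=0.5):
    """d h_T / d h_0 for a toy one-neuron tanh RNN, via the chain rule."""
    h, grad = 0.0, 1.0
    for _ in range(steps):
        h = math.tanh(w_in * x + w_rec * h)
        grad *= w_rec * (1.0 - h * h)  # derivative of this step w.r.t. the previous state
    return grad

for steps in (1, 10, 50):
    print(steps, gradient_through_time(w_rec=0.5, steps=steps))

# Without the tanh, a recurrent weight above 1 compounds into a huge factor instead.
print(1.5 ** 50)
```

With a recurrent weight of 0.5, the influence of the first state on the fiftieth is vanishingly small, which is why long-range dependencies are hard for a simple RNN to learn.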