### NLP with RNNs

How computers understand and process human language

RNN = Recurrent neural network

##### Bag of words 
A method for converting text into integers

Only really works for simple tasks

Every word in the vocabulary is assigned a number that represents it

We keep track of which words are present in the text and how often they appear: each time a word appears, we place that word's number in a 'bag'

This loses the order of the words (hence it is only useful for simple tasks), so the context and meaning of words in a sentence can be lost
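
A minimal sketch of the idea in plain Python. The helper names (build_vocab, bag_of_words) and the two example sentences are made up for illustration; real pipelines usually rely on a library tokenizer instead.

```python
from collections import Counter

def build_vocab(texts):
    """Assign each distinct word an integer ID, in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # IDs start at 1
    return vocab

def bag_of_words(text, vocab):
    """Map a text to {word ID: count}, ignoring words missing from vocab."""
    ids = [vocab[word] for word in text.lower().split() if word in vocab]
    return Counter(ids)

vocab = build_vocab(["the movie was good", "the movie was not good"])
bag_a = bag_of_words("good was the movie not", vocab)
bag_b = bag_of_words("the movie was not good", vocab)
print(bag_a == bag_b)  # True: same words in a different order give the same bag
```

The two sentences end up with identical bags even though the word order differs, which is exactly the information that bag of words throws away.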

##### Word embeddings

Represent words that are similar with numbers that are similar

Each word gets a vector of length n, where each element encodes some information about the word or how similar it is to other words

If you plotted the vector for 'good', the aim is that it would be close to the vector for a word like 'happy'

You can use pretrained word embedding layers in the NN
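
One common way to measure how close two word vectors are is cosine similarity. The sketch below uses hand-picked, hypothetical 3-dimensional vectors purely for illustration; learned embeddings would have many more dimensions and come from training.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for this example only
embeddings = {
    "good":  [0.9, 0.8, 0.1],
    "happy": [0.8, 0.9, 0.2],
    "bad":   [-0.9, -0.7, 0.1],
}

sim_good_happy = cosine_similarity(embeddings["good"], embeddings["happy"])
sim_good_bad = cosine_similarity(embeddings["good"], embeddings["bad"])
print(sim_good_happy > sim_good_bad)  # True for these made-up vectors
```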

##### SimpleRNN

It has a loop. An RNN processes one word at a time whilst keeping an internal memory of what it has already seen

At each step it receives new info along with info from the previous step

As the sequence gets longer it becomes harder for it to remember the start, so it may not work well with longer texts
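
To make the loop concrete, here is a toy recurrent step with a single number as the hidden state. The weights (0.5 each) are arbitrary values chosen for illustration, not anything learned; a real SimpleRNN layer uses weight matrices and learns them during training.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.5, b=0.0):
    """One recurrent step: mix the new input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs):
    h = 0.0  # the internal memory starts empty
    for x in inputs:
        h = rnn_step(x, h)
    return h

early = run_rnn([1.0] + [0.0] * 20)  # signal at the very start of the sequence
late = run_rnn([0.0] * 20 + [1.0])   # same signal at the very end
print(early, late)  # early has decayed to nearly 0, late is still about 0.46
```

With these weights the trace of the first input shrinks at every step, which is a small-scale picture of why a simple RNN struggles to remember the start of a long text.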

##### LSTM

Long short-term memory

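Similar to a simple RNN, but it adds a separate cell state controlled by gates, so information from much earlier timesteps can be carried forward instead of fading away

It keeps track of this long-term internal memory alongside the output from the previous step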
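
A rough sketch of a single LSTM step, again with single numbers instead of vectors. The gate weights below are hypothetical and hand-picked so that the cell writes when it sees a large input and otherwise holds on to what it has; in a trained network these values are learned.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p maps each gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # the new information itself
    c = f * c_prev + i * g            # updated long-term memory (cell state)
    h = o * math.tanh(c)              # updated output / short-term state
    return h, c

# Hypothetical weights: the input gate opens only for large x,
# and the forget gate stays close to 1 so the cell state is preserved
params = {
    "forget": (0.0, 0.0, 6.0),
    "input": (10.0, 0.0, -5.0),
    "output": (0.0, 0.0, 6.0),
    "candidate": (2.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0] + [0.0] * 20:  # one signal followed by 20 empty steps
    h, c = lstm_step(x, h, c, params)
print(c)  # still well above 0 after 20 steps with these weights
```

Because the forget gate multiplies the cell state by a value close to 1, the signal from the first step survives many steps almost unchanged.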

### Sentiment Analysis

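We will use a movie review dataset that ships with TensorFlow/Keras

The data is from IMDB and has 25k training reviews (plus 25k for testing), each labelled as positive or negative

Each review has been encoded as a list of integers based on how common each word is in the dataset, i.e. in the word index 1 is the most common word and 88584 is the least common. Note that by default imdb.load_data reserves 0, 1 and 2 for padding, start-of-review and unknown words, and shifts every word rank up by 3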

In [1]:
from keras.datasets import imdb
from keras.preprocessing import sequence
import tensorflow as tf
import os
import numpy as np

In [2]:
vocab_size = 88584  #the number of words in the reviews
maxlen = 250
batch_size = 64

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words= vocab_size)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 2s 0us/step


In [3]:
train_data[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 22665,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 21631,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 31050,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5

#### Some preprocessing

In NLP each entry will generally be a different length, which is a problem because we don't know how many input neurons the network needs

Keras has a function that will trim or pad the data to a length you pick, i.e. in this example any review longer than 250 words will be shortened to just 250, and anything shorter than 250 will be padded with 0's

In [4]:
train_data = sequence.pad_sequences(train_data, maxlen)
test_data = sequence.pad_sequences(test_data, maxlen)
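
To see what the padding step does to a single review, here is a plain-Python sketch that mimics the default behaviour of pad_sequences (zeros added at the front, and over-long sequences cut from the front). The function name pad_or_truncate is made up for illustration and it assumes maxlen is greater than 0.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(seq) >= maxlen:
        return list(seq[-maxlen:])  # drop items from the front
    return [value] * (maxlen - len(seq)) + list(seq)

print(pad_or_truncate([5, 8, 2], 5))           # [0, 0, 5, 8, 2]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```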

#### Creating the model

We create the model in a similar way to a standard NN in Keras

We first need an embedding layer

Followed by an LSTM layer

Then a dense layer with 1 output since we only have 2 classes (positive or negative)

The 32 is the output dimension of the vectors generated by the embedding layer; it can be tuned to optimise the network

In [7]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 32),  #even though the data is already encoded as integers, we still need an embedding layer to learn a meaningful vector for each word
    tf.keras.layers.LSTM(32),  #32 units; this does not have to match the embedding dimension, which is also 32
    tf.keras.layers.Dense(1, activation = 'sigmoid')  # number > 0.5 is positive and < 0.5 is negative
])
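
As a sanity check on the size of this network, the number of trainable parameters can be worked out by hand. This is a sketch that assumes the usual Keras layout (an LSTM with four weight blocks, each with input weights, recurrent weights and a bias); model.summary() is the authoritative source.

```python
vocab_size = 88584
embedding_dim = 32
lstm_units = 32

# One 32-dimensional vector per word in the vocabulary
embedding_params = vocab_size * embedding_dim
# Four blocks (three gates plus the candidate), each with
# input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
# One weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 2834688 8320 33 2843041
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size and embedding dimension dominate the model size.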

#### Training the model

In [10]:
model.compile(loss= 'binary_crossentropy', optimizer= 'rmsprop', metrics=['acc'])
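
For intuition about the loss being minimised, here is a small hand-rolled version of binary cross-entropy. It is a sketch of the formula only (the eps clipping value is an assumption added to avoid log(0)), not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(t*log(p) + (1-t)*log(1-p)) over all examples."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # keep p away from exactly 0 or 1
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right < confident_wrong)  # True
```

Confident wrong predictions are punished much more heavily than confident correct ones are rewarded, which pushes the sigmoid output towards the right side of 0.5.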

In [11]:
history = model.fit(train_data, train_labels, epochs= 10, validation_split=0.2)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 30s 44ms/step - acc: 0.6875 - loss: 0.5591 - val_acc: 0.7870 - val_loss: 0.4687
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s 44ms/step - acc: 0.8949 - loss: 0.2772 - val_acc: 0.8734 - val_loss: 0.3130
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s 44ms/step - acc: 0.9146 - loss: 0.2189 - val_acc: 0.8610 - val_loss: 0.3299
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s 43ms/step - acc: 0.9336 - loss: 0.1728 - val_acc: 0.8872 - val_loss: 0.2845
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s 43ms/step - acc: 0.9466 - loss: 0.1467 - val_acc: 0.8576 - val_loss: 0.4038
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s 43ms/step - acc: 0.9527 - loss: 0.1352 - val_acc: 0.8874 - val_loss: 0.3509
Epoch 7/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s

Some overfitting appears to have occurred: by epoch 6 the training accuracy has reached about 95%, while the validation accuracy fluctuates between roughly 86% and 89%
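
One common response to this is early stopping: keep the epoch with the lowest validation loss and stop once it has not improved for a few epochs. The sketch below applies that rule by hand to the val_loss values from the training log above; the function name best_epoch and the patience of 2 are assumptions for illustration. Keras also provides an EarlyStopping callback (with monitor, patience and restore_best_weights options) that can do this during training.

```python
def best_epoch(val_losses, patience=2):
    """Return (epoch to keep, epoch at which training would stop), 1-indexed."""
    best_idx = 0
    for idx, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = idx                   # new best validation loss
        elif idx - best_idx >= patience:
            return best_idx + 1, idx + 1     # no improvement for `patience` epochs
    return best_idx + 1, len(val_losses)

# val_loss for epochs 1-6, copied from the training log above
val_losses = [0.4687, 0.3130, 0.3299, 0.2845, 0.4038, 0.3509]
print(best_epoch(val_losses))  # (4, 6)
```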

In [12]:
results = model.evaluate(test_data, test_labels)
print(results)

782/782 ━━━━━━━━━━━━━━━━━━━━ 11s 14ms/step - acc: 0.8683 - loss: 0.4368
[0.4372289478778839, 0.8676400184631348]


We get an accuracy of 86.8% on the test data, which is not great but a good start

#### Making predictions

We will have to encode any new reviews we want predictions for, so that they are in the same integer format as the training data

In [14]:
word_index = imdb.get_word_index()

def encode_text(text):
    tokens = tf.keras.preprocessing.text.text_to_word_sequence(text)
    tokens = [word_index[word] if word in word_index else 0 for word in tokens]  #replace each word in the new review with its number from the word index, else put 0
    return sequence.pad_sequences([tokens], maxlen)[0]



#test on a sample review
text = 'I really loved that movie, one of the best I have seen this year'
encoded = encode_text(text)
print(encoded)

[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0   

There are lots of 0's at the start of the encoded text because pad_sequences pads at the front by default; the actual word numbers sit at the end
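
One caveat worth knowing: by default imdb.load_data shifts every word rank up by 3 (index_from=3) and uses 1 and 2 for the start-of-review and unknown-word tokens, so the raw ranks from imdb.get_word_index() do not line up exactly with the numbers the model was trained on. A minimal sketch of an offset-aware encoder, assuming those defaults; the function name encode_with_offset and the miniature word index are hypothetical and only there to keep the example self-contained.

```python
INDEX_FROM = 3  # default index_from in imdb.load_data
OOV_ID = 2      # default oov_char in imdb.load_data

def encode_with_offset(tokens, word_index, vocab_size):
    """Map tokens to the IDs used by the loaded IMDB data."""
    ids = []
    for word in tokens:
        rank = word_index.get(word)
        if rank is None or rank + INDEX_FROM >= vocab_size:
            ids.append(OOV_ID)  # unknown word, or outside the vocabulary
        else:
            ids.append(rank + INDEX_FROM)
    return ids

# Hypothetical miniature word index, standing in for imdb.get_word_index()
toy_index = {"the": 1, "movie": 2, "great": 3}
print(encode_with_offset(["the", "movie", "rocked"], toy_index, vocab_size=10))
# [4, 5, 2]
```

Whether this changes the predictions in practice would need to be checked by re-running the notebook with the adjusted encoding.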

#### Predictions

In [21]:
def predict(text):
    encoded_text = encode_text(text)
    pred = np.zeros((1, maxlen))  #the network expects a batch of sequences of length maxlen (250)
    pred[0] = encoded_text
    result = model.predict(pred)  #shape (1, 1): one probability for our single review

    if result[0][0] >= 0.5:
        print('This review is positive')
    else:
        print('This review is negative')


positive_revw = 'That movie was so amazing I will definiely watch it again'

predict(positive_revw)

negative_revw = 'Was not a fan of that, it felt dry'

predict(negative_revw)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step
This review is positive
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step
This review is negative


It has predicted the classes correctly!