## References
1. https://keras.io/examples/nlp/pretrained_word_embeddings/
2. https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

## Setup

In [20]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM

import math
from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt

import pickle

## Introduction

In this project, we show how to train a text classification model that uses pre-trained
word embeddings.

We'll work with the AclImdb dataset, using a total of 25,000 labeled movie reviews: 12,500 positive and 12,500 negative.

For the pre-trained word embeddings, we'll use
[GloVe embeddings](http://nlp.stanford.edu/projects/glove/).

In [21]:
# Run the 2 lines below only if you are using Google Drive (e.g. in Colab); otherwise comment them out.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Download the AclImdb data

### Download the data from 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz' and place it in the root of your project folder

### Uncomment and run the command in the cell below once to extract the archive into the root of your project folder, then comment it out again. Update the paths to match your project

In [22]:
#!tar -xzvf '/content/drive/MyDrive/DS/nlp_movie_ratings/aclImdb_v1.tar.gz' -C '/content/drive/MyDrive/DS/nlp_movie_ratings/root'     #[run this cell to extract tar.gz files]

### Set the `main_path` variable to the folder where the data was extracted in your project folder.

In [23]:
#main_path = "/content/drive/MyDrive/DS/nlp_movie_ratings/root/aclImdb/" # wenoff
main_path = "/content/drive/MyDrive/DS/nlp_movie_ratings/root/aclImdb/" # wazu
#main_path = "/content/drive/MyDrive/DSs/nlp_movie_ratings/root/aclImdb/" # dowa and jale
#main_path = "/content/drive/MyDrive/datasets/nlp_movie_ratings/root/aclImdb/" # wado

In [54]:
#path_to_glove_file = "/content/drive/MyDrive/DS/21a/glove6B/glove.6B.100d.txt" # wadon
path_to_glove_file = "/content/drive/MyDrive/DS/21a/glove6B/glove.6B.100d.txt" # wazul

## Let's take a look at the data

In [55]:
import os
import pathlib

data_dir = pathlib.Path(main_path).parent / "aclImdb/train"
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", sorted(dirnames))

Number of directories: 8
Directory names: ['labeledBow.feat', 'neg', 'pos', 'unsup', 'unsupBow.feat', 'urls_neg.txt', 'urls_pos.txt', 'urls_unsup.txt']


Here's an example of what one file contains:

In [56]:
print(open(data_dir / "pos" / "0_9.txt").read())

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!


As you can see, each file contains a single plain-text review with no header lines.
The category is given by the directory the file lives in (`pos` or `neg`), so we can
read the files as they are.

Optionally, limit the number of reviews read per class (out of 25,000 in total):

In [57]:
#no_of_rows = 2500 # for both class - total 5000

In [58]:
samples = []
labels = []
# Labels follow the order of this list: 'pos' -> 0, 'neg' -> 1
class_names = dirnames = ['pos', 'neg']
class_index = 0
for dirname in dirnames:
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath) # [:no_of_rows]
    print("Processing %s, %d files found" % (dirname, len(fnames)))
    for fname in fnames:
        fpath = dirpath / fname
        # Use a context manager so each file handle is closed after reading
        with open(fpath, encoding="latin-1") as f:
            content = f.read()
        samples.append(content)
        labels.append(class_index)
    class_index += 1

print("Classes:", class_names)
print("Number of samples:", len(samples))

Processing pos, 12500 files found
Processing neg, 12500 files found
Classes: ['pos', 'neg']
Number of samples: 25000


## Shuffle and split the data into training & validation sets

In [59]:
# Shuffle the data
seed = 9
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)

# Extract a training & validation split
validation_split = 0.05
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]
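
The double shuffle above relies on two `RandomState` objects with the same seed applying the same permutation to two lists of equal length, which keeps each review next to its label. Here is a minimal sketch on toy data; the `toy_*` names and the 0.25 split are illustrative assumptions, not values from this notebook:

```python
import numpy as np

# Toy stand-ins for the reviews and their labels (hypothetical data)
toy_samples = ["review %d" % i for i in range(20)]
toy_labels = list(range(20))

# Shuffle both lists with identically seeded generators
seed = 9
np.random.RandomState(seed).shuffle(toy_samples)
np.random.RandomState(seed).shuffle(toy_labels)

# Every sample should still sit next to its original label
still_aligned = all(
    s == "review %d" % l for s, l in zip(toy_samples, toy_labels)
)

# Hold out the tail of the shuffled lists for validation
toy_split = 0.25
num_val = int(toy_split * len(toy_samples))
toy_train, toy_val = toy_samples[:-num_val], toy_samples[-num_val:]

print(still_aligned, len(toy_train), len(toy_val))
```

Shuffling the samples and labels together as pairs would avoid depending on the seed trick, but the approach above keeps the two lists separate, as the rest of the notebook expects.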

## Create a vocabulary index by using imdb vocabulary given with the dataset

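Let's use the `TextVectorization` layer to index the vocabulary shipped with the dataset in `imdb.vocab`.
Later, we'll use the same layer instance to vectorize the samples.

The layer is created with `max_tokens=5000`, but because the vocabulary is supplied from the file rather than learned with `adapt()`, the outputs below show the full vocabulary in use (for example, "mat" maps to index 12332). Sequences are truncated or padded to
be exactly 200 tokens long.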

In [60]:
from tensorflow.keras.layers import TextVectorization

vectorizer = TextVectorization(max_tokens=5000, output_sequence_length=200, vocabulary = main_path + "imdb.vocab")
#text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
#vectorizer.adapt(text_ds)

You can retrieve the computed vocabulary used via `vectorizer.get_vocabulary()`. Let's
print the top 5 words:

In [61]:
vectorizer.get_vocabulary()[:5]

['', '[UNK]', 'the', 'and', 'a']

Let's vectorize a test sentence:

In [62]:
#output = vectorizer([["the cat sat on the mat"]])
#output.numpy()[0, :6]

In this vocabulary, "the" gets represented as "2". Why not 0, given that "the" is the first
word in the vocabulary? That's because index 0 is reserved for padding and index 1 is
reserved for "out of vocabulary" tokens.
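
To make the reserved indices concrete, here is a pure-Python sketch that mimics the mapping on a toy vocabulary. It is not the `TextVectorization` implementation: `build_index` and `encode` are hypothetical helpers, and the whitespace tokenization is a simplifying assumption.

```python
def build_index(vocab):
    # Index 0 is reserved for padding, index 1 for out-of-vocabulary tokens
    tokens = ["", "[UNK]"] + list(vocab)
    return {tok: i for i, tok in enumerate(tokens)}


def encode(text, index, length):
    # Map each lowercase, whitespace-separated word to its index (1 if unknown)
    ids = [index.get(word, 1) for word in text.lower().split()]
    # Truncate to the target length, then right-pad with zeros
    ids = ids[:length]
    return ids + [0] * (length - len(ids))


toy_index = build_index(["the", "and", "a", "cat"])
encoded = encode("the cat sat", toy_index, 6)
print(encoded)  # [2, 5, 1, 0, 0, 0]
```

Here "the" maps to 2 because two slots come before it, "sat" is unknown and maps to 1, and the three trailing zeros are padding.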

Here's a dict mapping words to their indices:

In [63]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

Using this dict, we can look up a single word or encode our test sentence by hand:

In [64]:
word_index["word"]

685

In [65]:
test = ["the", "cat", "sat", "on", "the", "mat"]
[word_index[w] for w in test]

[2, 1080, 1777, 20, 2, 12332]

In [66]:
# Saving vectorizer for deployment

# Reference https://stackoverflow.com/questions/65103526/how-to-save-textvectorization-to-disk-in-tensorflow

# Pickle the config and weights; the context manager closes the file afterwards
with open(main_path + "vectorizer.pkl", "wb") as f:
    pickle.dump({'config': vectorizer.get_config(),
                 'weights': vectorizer.get_weights()}, f)

# Later you can unpickle and use 
# `config` to create object and 
# `weights` to load the trained weights. 

# from_disk = pickle.load(open(main_path + "vectorizer.pkl", "rb"))
# vectorizer = TextVectorization.from_config(from_disk['config'])
# vectorizer.set_weights(from_disk['weights'])

## Load pre-trained word embeddings

Let's download pre-trained GloVe embeddings (an 822 MB zip file).

You'll need to run the following commands:

```
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip
```

In [67]:
# one time only
#!unzip "/content/drive/MyDrive/DSs/21a/glove.6B.zip" -d "/content/drive/MyDrive/DSs/21a/glove6B/"

The archive contains text-encoded vectors of various sizes: 50-dimensional,
100-dimensional, 200-dimensional, 300-dimensional. We'll use the 100D ones.

Let's make a dict mapping words (strings) to their NumPy vector representation:

In [68]:
embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

Found 400000 word vectors.
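
Each line of the GloVe text file holds a word followed by its space-separated coefficients. The sketch below parses two made-up lines in that layout; the 3-dimensional vectors and the `toy_*` names are assumptions for illustration, and `np.array(..., dtype="float32")` is swapped in for `np.fromstring` to keep the example short.

```python
import numpy as np

# Two hypothetical lines in the "word c1 c2 c3" layout
toy_lines = [
    "the 0.1 0.2 0.3\n",
    "cat -0.5 0.0 0.25\n",
]

toy_embeddings = {}
for line in toy_lines:
    # Split off the word; the rest of the line holds the coefficients
    word, coefs = line.split(maxsplit=1)
    toy_embeddings[word] = np.array(coefs.split(), dtype="float32")

print(toy_embeddings["cat"].shape)
```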


Now, let's prepare a corresponding embedding matrix that we can use in a Keras
`Embedding` layer. It's a simple NumPy matrix where entry at index `i` is the pre-trained
vector for the word of index `i` in our `vectorizer`'s vocabulary.

In [69]:
num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 62596 words (26933 misses)
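
The same loop on toy data shows how hits fill a row and misses leave it at zero. The vocabulary, the 2-dimensional vectors, and the `toy_*` names below are assumptions for illustration only:

```python
import numpy as np

# Hypothetical vocabulary index, including the two reserved entries
toy_word_index = {"": 0, "[UNK]": 1, "the": 2, "cat": 3, "zzyzx": 4}

# Hypothetical pre-trained vectors; "zzyzx" has none
toy_embeddings = {
    "the": np.array([0.1, 0.2], dtype="float32"),
    "cat": np.array([0.3, 0.4], dtype="float32"),
}

toy_matrix = np.zeros((len(toy_word_index), 2))
toy_hits = 0
toy_misses = 0
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        # Row i holds the pre-trained vector for the word with index i
        toy_matrix[i] = vector
        toy_hits += 1
    else:
        # Padding, OOV and unknown words keep their all-zero rows
        toy_misses += 1

print(toy_hits, toy_misses)  # 2 3
```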


Next, we load the pre-trained word embeddings matrix into an `Embedding` layer.

Note that we set `trainable=False` so as to keep the embeddings fixed (we don't want to
update them during training).

In [70]:
num_tokens, embedding_dim

(89531, 100)

In [71]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

## Build the model

A stack of three LSTM layers on top of the frozen embedding layer, with a sigmoid classifier at the end.

In [72]:
def create_rnn(cell_units = 100, dropout = 0.2):
    # Initialise the sequential classifier
    model = Sequential()

    # Frozen GloVe embedding layer defined above
    model.add(embedding_layer)

    # 1st LSTM layer with input and recurrent dropout; returns the full sequence
    model.add(LSTM(units = cell_units, return_sequences = True, dropout=dropout, recurrent_dropout=dropout))

    # 2nd LSTM layer with input and recurrent dropout; returns the full sequence
    model.add(LSTM(units = cell_units, return_sequences = True, dropout=dropout, recurrent_dropout=dropout))

    # 3rd LSTM layer; returns only the last output for the classifier
    model.add(LSTM(units = cell_units, dropout=dropout, recurrent_dropout=dropout))

    # Output layer: a single sigmoid unit for binary classification
    model.add(Dense(units = 1, activation='sigmoid'))

    return model

## Train the model

First, convert our list-of-strings data to NumPy arrays of integer indices. The arrays
are right-padded.

In [73]:
x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

y_train = np.array(train_labels)
y_val = np.array(val_labels)

In [74]:
x_train.shape, y_train.shape, x_val.shape, y_val.shape

((23750, 200), (23750,), (1250, 200), (1250,))

In [75]:
# fix random seeds for reproducibility (NumPy and TensorFlow)
np.random.seed(9)
tf.random.set_seed(9)

units = 100
drop_out = 0.5
model = create_rnn(units, drop_out)
#model.summary()

model.compile(
    loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']
)

callbacks = [
    keras.callbacks.ModelCheckpoint(main_path + "model_checkpoints/model_8.h5", save_best_only=True)
]

model.fit(x_train, y_train, batch_size=2000, epochs=400, validation_data=(x_val, y_val), use_multiprocessing=True, callbacks=callbacks)

Epoch 1/400
...
Epoch 78

<keras.callbacks.History at 0x7f2713f87990>

## Evaluate and save the model

In [76]:
#model = keras.models.load_model(main_path + "model_checkpoints/model_8.h5") 

### Evaluating the model

In [77]:
# Final evaluation of the model
scores = model.evaluate(x_val, y_val, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 88.08%
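
As a rough illustration of what this number measures, the sketch below computes accuracy by hand from sigmoid outputs, assuming the `accuracy` metric thresholds predictions at 0.5. The probabilities and labels are made up; note that in this notebook label 0 means `pos` and label 1 means `neg`, following the order of `class_names`.

```python
import numpy as np

# Hypothetical sigmoid outputs and true labels (0 = pos, 1 = neg)
toy_probs = np.array([0.9, 0.2, 0.6, 0.4])
toy_true = np.array([1, 0, 0, 0])

# Threshold the probabilities to get hard class predictions
toy_preds = (toy_probs > 0.5).astype(int)

# Accuracy is the fraction of predictions that match the labels
toy_accuracy = float((toy_preds == toy_true).mean())
print("Accuracy: %.2f%%" % (toy_accuracy * 100))  # Accuracy: 75.00%
```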


### Saving the model

In [78]:
model.save(main_path + 'LSTM_h5_model_8.h5')