**The goal of this notebook is to get familiar with NLP using Simple RNN. To this end, we are going to do sentiment analysis on a dataset of IMDb reviews to determine if a review is positive or negative.**

Information about the dataset and api - https://keras.io/api/datasets/imdb/

In [4]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN, Embedding
from tensorflow.keras.callbacks import EarlyStopping

In [5]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  0


## Importing the IMDb Dataset

In [6]:
vocab_size = 10000

# Loading the data
(X_train,y_train),(X_test,y_test) = imdb.load_data(num_words=vocab_size)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
[1m17464789/17464789[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 0us/step


In [7]:
X_train.shape

(25000,)

In [8]:
y_train.shape

(25000,)

25000 reviews and their labels(positive/negative) have been loaded

Each element in X_train represents a review, where the words are represented by their respective indices. The indices are directly proportional to the frequency of occurence of the word in the dataset. 

**Indices 1 and 2 represent start character and out-of-vocabulary(OOV) characters respectively** (see https://keras.io/api/datasets/imdb/). Some words will be OOV since we imported only the 10,000 most frequent words (vocab_size).

In [9]:
# First review
print(X_train[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


In [10]:
# get_words returns a dictioanary of words: index number
word_index = imdb.get_word_index()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
[1m1641221/1641221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1us/step


In [11]:
list(word_index.items())[:5]

[('fawn', 34701),
 ('tsukino', 52006),
 ('nunnery', 52007),
 ('sonja', 16816),
 ('vani', 63951)]

In [12]:
# creating a reverse word_index for decoding words from their indices
index_word = {value + 3: key for key, value in word_index.items()}

# Adding "START" and "OOV" characters
index_word[1] = "[START]"
index_word[2] = "[OOV]"

In [13]:
list(index_word.items())[:5]

[(34704, 'fawn'),
 (52009, 'tsukino'),
 (52010, 'nunnery'),
 (16819, 'sonja'),
 (63954, 'vani')]

In [14]:
# Let's see what the most frequent word is
index_word[4]

'the'

Index 1 represents start character and 2 represents OOV words.

In [15]:
# Function to decode review from the index representation
def decode(indices):
    review = ' '.join([index_word.get(index, '?') for index in indices])
    
    return review

In [16]:
# Let's decode the first review
decode(X_train[0])

"[START] this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert [OOV] is an amazing actor and now the same being director [OOV] father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for [OOV] and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also [OOV] to the two little boy's that played the [OOV] of norman and paul they were just brilliant children are often left out of the [OOV] list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for wh

## Preprocessing reviews

### Standardizing input size`

In [17]:
# Setting max_length
max_length = 500

In [18]:
# Padding
X_train = sequence.pad_sequences(sequences=X_train, maxlen=max_length, padding='pre')
X_test = sequence.pad_sequences(sequences=X_test, maxlen=max_length, padding='pre')

In [19]:
X_train[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

All reviews have been padded with 0s at the end to standardize input size

## Building out Model

In [20]:
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=128, input_length=max_length))
model.add(SimpleRNN(units=128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))



In [21]:
model.summary()

In [22]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [23]:
# Setting up Earlystopping
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

## Training the Model

In [None]:
history = model.fit(
    X_train, y_train,epochs=10, batch_size=32,
    validation_split=0.2,
    callbacks=[early_stopping])

Executing op RangeDataset in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op MapDataset in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op PrefetchDataset in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op FlatMapDataset in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op TensorDataset in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op RepeatDataset in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op ZipDataset in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op ParallelMapDatasetV2 in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op OptionsDataset in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op OptionsDataset in device /job:localhost/replica:0/task:0/device:CPU:0
ob:localhost/re

In [69]:
model.summary()

In [70]:
# Access training history
print("Training Loss and Accuracy:")
print(history.history['loss'])
print(history.history['accuracy'])

print("Validation Loss and Accuracy:")
print(history.history['val_loss'])
print(history.history['val_accuracy'])

Training Loss and Accuracy:
[1846856.875, 0.5897020101547241, 0.38693511486053467, 0.31793004274368286, 1095425.375, 36482.01953125, 0.5031412839889526, 0.4807693362236023]
[0.602150022983551, 0.7024499773979187, 0.8420000076293945, 0.874750018119812, 0.8204500079154968, 0.8009999990463257, 0.7706999778747559, 0.7859500050544739]
Validation Loss and Accuracy:
[0.6318543553352356, 0.4741601049900055, 0.3777773082256317, 0.3901137411594391, 0.5123240351676941, 0.6003900170326233, 0.5975767374038696, 0.5962917804718018]
[0.6233999729156494, 0.7799999713897705, 0.8410000205039978, 0.829800009727478, 0.758400022983551, 0.6722000241279602, 0.675599992275238, 0.678600013256073]


In [71]:
# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test)

Executing op RangeDataset in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op MapDataset in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op PrefetchDataset in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op FlatMapDataset in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op TensorDataset in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op RepeatDataset in device /job:localhost/replica:0/task:0/device:CPU:0
/device:CPU:0
handle_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:CPU:0
args_0: (_Arg): /job:localhost/replica:0/task:0/device:CPU:0
range: (Range): /job:localhost/replica:0/task:0/device:CPU:0
Identity: (Identity): /job:localhost/replica:0/task:0/device:CPU:0
identity_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:CPU:0
ra

In [72]:
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy}")

Test Loss: 0.38786259293556213
Test Accuracy: 0.835319995880127


In [73]:
model.save("model_500.h5")

Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op ReadVariableOp in device /j

## Prediction for new data (review)

For every new review:

1. Convert to lower-case
2. Convert each word in the review to its respective index
3. Apply padding
4. Pass it through the model

In [None]:
# Loading the pre-saved model
model = tf.keras.models.load_model("/kaggle/working/model_500.h5")

In [48]:
# Function to pre-process the movie
def pre_process(review):
    # Lower case
    review = review.lower()
    
    # Word to index
    encoded_review = [word_index.get(word, 2) + 3 for word in review]
    
    # Padding
    padded_review = sequence.pad_sequences([encoded_review], maxlen=max_length, padding='pre')
    
    return padded_review

In [49]:
# Function to predict sentiment
def predict(review):

    print("Positive") if model.predict(pre_process(review)) > 0.5 else print("Negative")

In [50]:
predict("Good movie")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 45ms/step
Positive


In [51]:
predict("Bad movie")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 44ms/step
Negative


In [56]:
predict("movie was a disaster")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
Negative


In [58]:
predict("Excellent performances all over from all actors and crew members")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step
Positive


In [59]:
predict("Complete waste of time")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step
Negative


In [62]:
predict("The prequel was better, this one is a disaster")
# Wrong Output

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 37ms/step
Positive
