# Sentiment Analysis on IMDB Movie Reviews with an RNN

We will build an LSTM recurrent neural network to perform sentiment analysis on IMDB movie reviews and determine whether a given review is positive or negative. We will use the IMDB dataset that ships with TensorFlow (Keras), and we start by importing the necessary libraries and dependencies.

In [1]:
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras.datasets import imdb

This dataset contains reviews that are already encoded numerically. Each review is split into words, and each word is replaced by an integer based on its frequency rank in the dataset: the more frequent the word, the smaller the integer. VOCAB_SIZE sets the size of the vocabulary we keep, which is 80000 here. With the default settings of imdb.load_data, the codes 0, 1 and 2 are reserved for padding, the start of a review and out-of-vocabulary words, so the most frequent word is encoded as 4, and any word whose code would be 80000 or higher is replaced by 2.
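To make the encoding concrete, here is a minimal sketch of how an encoded review could be mapped back to words. It assumes the default settings of `imdb.load_data` (0 for padding, 1 for the start marker, 2 for out-of-vocabulary words, and every word rank shifted by 3). The tiny `toy_word_index` is a made-up stand-in for the dictionary returned by `imdb.get_word_index()`, and `decode_review` is a hypothetical helper, not part of Keras.

```python
# Hypothetical miniature vocabulary: word -> frequency rank (1 = most frequent).
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

INDEX_FROM = 3  # offset that imdb.load_data adds to every word rank by default
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<oov>"}

def decode_review(encoded_review, word_index):
    """Map a list of integer codes back to a readable string."""
    rank_to_word = {rank: word for word, rank in word_index.items()}
    words = []
    for code in encoded_review:
        if code in SPECIAL_TOKENS:
            words.append(SPECIAL_TOKENS[code])
        else:
            # Undo the offset; codes without a known word are shown as <oov>
            words.append(rank_to_word.get(code - INDEX_FROM, "<oov>"))
    return " ".join(words)

decoded = decode_review([1, 4, 5, 6, 7, 2], toy_word_index)
print(decoded)
```

With the real dataset, the same idea would apply to `train_data[0]` and `imdb.get_word_index()`.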

In [3]:
VOCAB_SIZE = 80000
MAXLEN = 250
BATCH_SIZE = 64
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=VOCAB_SIZE)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz




train_data and test_data are arrays of 25,000 reviews each. Each entry is a list of integers, one per word in the review, so the entries have different lengths because the reviews vary in their number of words. train_labels and test_labels are arrays of 1s and 0s, where 1 marks a positive review and 0 a negative one. Each label corresponds to the review at the same position in train_data or test_data.

Since the reviews in train_data and test_data have varying lengths, we must bring every review to the same length before feeding the data to the model.

In [85]:
def padding_data(data, size):
  # One row per review; zeros act as padding
  padded_data = np.zeros((len(data), size), dtype=int)
  for i, review in enumerate(data):
    # Keep at most the first `size` words
    trimmed = review[:size]
    # Right-align the words so the padding zeros come first
    if len(trimmed) > 0:
      padded_data[i, size - len(trimmed):] = trimmed
  return padded_data

The function padding_data brings every review in train_data and test_data to length MAXLEN. If a review has more than MAXLEN words, it is trimmed to its first MAXLEN words. If a review has fewer than MAXLEN words, (MAXLEN - len(review)) zeros are added in front of it so that every entry has the same size. The zeros do not represent any word.
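As a small worked example of this rule, the sketch below pads one short toy review and trims one long toy review to a length of 5. The helper `pad_or_truncate` and the sample values are made up for illustration; it handles a single review rather than the whole dataset.

```python
import numpy as np

def pad_or_truncate(review, size):
    """Left-pad a single review with zeros, or keep only its first `size` codes."""
    trimmed = list(review)[:size]
    padded = np.zeros(size, dtype=int)
    if len(trimmed) > 0:
        # Right-align the codes so the zeros end up in front
        padded[size - len(trimmed):] = trimmed
    return padded

short_review = [11, 12, 13]
long_review = [21, 22, 23, 24, 25, 26, 27]

print(pad_or_truncate(short_review, 5))  # zeros in front, then the 3 codes
print(pad_or_truncate(long_review, 5))   # only the first 5 codes survive
```

Keras offers `pad_sequences` for the same job. Calling it with `padding='pre'` and `truncating='post'` should match the behaviour described here, whereas its default `truncating='pre'` keeps the last words of a long review instead of the first.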

In [94]:
train_data = padding_data(train_data, MAXLEN)
test_data = padding_data(test_data, MAXLEN)

Now we will build our Sequential model. The first layer is an embedding layer, which takes our vocabulary size as its input dimension and outputs a 64-dimensional vector for each word. The embedding layer learns a dense vector representation of each word, so that words used in similar contexts can end up with similar vectors. The next layer is our LSTM (Long Short-Term Memory) layer, which reads the word vectors in order and summarizes the review. We pass in 64 as its number of units, which is the size of its output vector and is chosen here to equal the output dimension of the embedding layer. Finally, we add a dense layer. It uses a sigmoid activation, since the output must lie between 0 (negative review) and 1 (positive review), and it has 1 unit, since we expect one output per review.
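To get a feel for the size of this architecture, the trainable parameters can be counted by hand. This is a back-of-the-envelope sketch that assumes the standard Keras layer definitions with biases enabled; the constant names `EMBEDDING_DIM` and `LSTM_UNITS` are introduced here for illustration only.

```python
VOCAB_SIZE = 80000
EMBEDDING_DIM = 64
LSTM_UNITS = 64

# Embedding: one EMBEDDING_DIM-sized vector per vocabulary entry
embedding_params = VOCAB_SIZE * EMBEDDING_DIM

# LSTM: four weight sets (input, forget and output gates plus the candidate
# cell state), each with input weights, recurrent weights and a bias vector
lstm_params = 4 * (EMBEDDING_DIM * LSTM_UNITS + LSTM_UNITS * LSTM_UNITS + LSTM_UNITS)

# Dense: one weight per LSTM unit plus a single bias
dense_params = LSTM_UNITS * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding layer, which is a direct consequence of the large vocabulary. Calling `model.summary()` on a built model should report comparable figures.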

In [100]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

Next, we will compile our model. We use binary cross-entropy as our loss function because the target is binary (1 or 0). We use Adam as our optimizer and accuracy as our metric.
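For intuition about what this loss measures, here is a small NumPy sketch of binary cross-entropy on made-up labels and predictions. It is an illustration of the formula, not the Keras implementation, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

labels = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]  # probabilities close to the labels
mostly_wrong = [0.1, 0.9, 0.2, 0.8]  # probabilities far from the labels

loss_right = binary_cross_entropy(labels, mostly_right)
loss_wrong = binary_cross_entropy(labels, mostly_wrong)
print(loss_right, loss_wrong)
```

The loss grows quickly when the model is confident and wrong, which is why it can be high even when the accuracy looks reasonable.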

In [101]:
model.compile(loss="binary_crossentropy",optimizer="adam",metrics=['accuracy'])

Now we will train our model, passing train_data and train_labels as x and y. epochs is the number of passes over the training data, and validation_split is the fraction of the training data held out for validating the model.
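The arithmetic behind these settings can be sketched as follows. It assumes the 25,000 training reviews of the IMDB dataset and the Keras default batch size of 32, which applies when no batch_size is passed to fit (the BATCH_SIZE constant defined earlier is not passed to the call). Keras takes the validation samples from the end of the arrays, before any shuffling.

```python
import math

NUM_TRAIN_REVIEWS = 25000   # size of the IMDB training set
VALIDATION_SPLIT = 0.2
DEFAULT_BATCH_SIZE = 32     # Keras default when fit() gets no batch_size
BATCH_SIZE = 64             # defined in the notebook, shown for comparison

# The last 20% of the reviews are held out for validation
num_validation = round(NUM_TRAIN_REVIEWS * VALIDATION_SPLIT)
num_fit = NUM_TRAIN_REVIEWS - num_validation

# Number of weight updates per epoch for each batch size
steps_default = math.ceil(num_fit / DEFAULT_BATCH_SIZE)
steps_with_batch_size = math.ceil(num_fit / BATCH_SIZE)

print(num_fit, num_validation, steps_default, steps_with_batch_size)
```

Passing `batch_size=BATCH_SIZE` to fit would roughly halve the number of steps per epoch, at the cost of fewer weight updates per pass over the data.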

In [104]:
history = model.fit(train_data, train_labels, epochs = 10, validation_split = 0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Now we will evaluate our model by checking the loss and accuracy achieved on test_data.

In [106]:
test_error, test_accuracy = model.evaluate(test_data, test_labels)
print(test_error)
print(test_accuracy)

0.9610259532928467
0.8302800059318542


We will now predict the sentiment of a movie review that the user supplies. To do so, we must encode the user input so that it resembles the training data. We import sequence to pad the input, since padding_data expects an array of reviews rather than a single one.

In [None]:
from tensorflow.keras.preprocessing import sequence

In [149]:
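# Get the dictionary that maps each word to its frequency rank (1 = most frequent)
word_idx = imdb.get_word_index()

# imdb.load_data reserves 0 (padding), 1 (start) and 2 (unknown word)
# and shifts every word rank by 3, so the same offset is applied here
INDEX_FROM = 3
START_CHAR = 1
OOV_CHAR = 2

def encode_sentence(sentence):
  words = tf.keras.preprocessing.text.text_to_word_sequence(sentence)
  tokens = [START_CHAR]
  for word in words:
    idx = word_idx.get(word, -1) + INDEX_FROM
    # Unknown words and words outside the vocabulary become OOV_CHAR
    tokens.append(idx if INDEX_FROM < idx < VOCAB_SIZE else OOV_CHAR)
  # Pad and truncate the same way as padding_data
  return sequence.pad_sequences([tokens], maxlen=MAXLEN, padding='pre', truncating='post')[0]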

The function encode_sentence takes a sentence and turns each word into its numerical representation, based on the word's index in the word_idx dictionary.
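The following sketch shows, on a toy scale and without Keras, what an encoding consistent with the defaults of `imdb.load_data` could look like: a start marker first, every word rank shifted by 3, and unknown or too-rare words mapped to the out-of-vocabulary code. The vocabulary, `TOY_VOCAB_SIZE` and the simple regular-expression tokenizer are all assumptions made for illustration.

```python
import re

# Hypothetical miniature vocabulary: word -> frequency rank
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4, "boring": 5}
TOY_VOCAB_SIZE = 8   # codes at or above this value count as out-of-vocabulary
INDEX_FROM = 3       # offset used by imdb.load_data by default
START_CODE, OOV_CODE = 1, 2

def toy_encode(sentence, word_index, vocab_size):
    """Lowercase, split into words, and map each word to its integer code."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    codes = [START_CODE]
    for token in tokens:
        rank = word_index.get(token)
        code = rank + INDEX_FROM if rank is not None else OOV_CODE
        # Words that are too rare for the chosen vocabulary size become OOV
        codes.append(code if code < vocab_size else OOV_CODE)
    return codes

encoded = toy_encode("The movie was GREAT, not boring!", toy_word_index, TOY_VOCAB_SIZE)
print(encoded)
```

Here "not" is missing from the toy vocabulary and "boring" falls outside the toy vocabulary size, so both are encoded as the out-of-vocabulary code.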

In [160]:
def predict_sentiment(sentence):
  encoded_sentence = encode_sentence(sentence)
  # The model expects a batch of reviews, so reshape to (1, MAXLEN)
  batch = np.expand_dims(encoded_sentence, axis=0)
  # predict returns an array of shape (1, 1); take the single probability
  score = model.predict(batch)[0][0]
  if score > 0.5:
    print('Positive Sentiment')
  elif score < 0.5:
    print('Negative Sentiment')
  else:
    print('Neutral Sentiment')

In [161]:
sentence = 'This was a great movie I absolutely loved it because it exceeded all of my expectations I would definitely recommend it'
predict_sentiment(sentence)

Positive Sentiment
