<a href="https://colab.research.google.com/github/shivangsingh26/sentimentAnalysis/blob/main/sentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We use the IMDB movie-review dataset that ships with Keras.

# Import necessary libraries

In [1]:
from keras.datasets import imdb
from keras.preprocessing import sequence
import numpy as np
import tensorflow as tf
import os

# Setting Parameters, Loading the Data, PreProcessing

In [2]:
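VOCAB_SIZE = 88584 # number of words kept in the vocabulary
MAXLEN = 256 # maximum length of a review, in words
BATCH_SIZE = 64

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=VOCAB_SIZE)
# train_data and test_data are NumPy arrays; each element is a list of word indices for one review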

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [4]:
train_data[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 22665,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 21631,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 31050,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5

Each integer is the index of a word in the dataset's vocabulary.

The model needs inputs of a fixed length, but reviews vary in length, so we preprocess the data to make every review the same length.

In [5]:
# If a review has more than MAXLEN (256) words ---> trim it (by default the earliest words are dropped)
# If a review has fewer than MAXLEN words ---> pad it with 0s (by default at the front) up to MAXLEN

train_data = sequence.pad_sequences(train_data, maxlen=MAXLEN)
test_data = sequence.pad_sequences(test_data, maxlen=MAXLEN)
# After this call every review is exactly MAXLEN integers long
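To make the default behaviour of `pad_sequences` concrete, here is a minimal pure-Python sketch of front-padding and front-truncation for a single sequence. The helper name `pad_pre` is hypothetical and used only for illustration; it is not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic pad_sequences' defaults (padding='pre', truncating='pre') for one list."""
    # Keep only the last `maxlen` items when the sequence is too long
    trimmed = list(seq)[-maxlen:]
    # Prepend the padding value until the length reaches `maxlen`
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([5, 6, 7], 5))           # short review: zeros are added at the front
print(pad_pre([1, 2, 3, 4, 5, 6], 4))  # long review: the earliest items are dropped
```

Padding at the front keeps the actual words next to the end of the sequence, which is the last thing the LSTM reads.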

# Creating the Model

In [6]:
model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, 32), # Embedding layer: maps each word index to a 32-dimensional vector
        tf.keras.layers.LSTM(32), # LSTM layer with 32 units: reads the word vectors in order and outputs one 32-dimensional vector per review
        tf.keras.layers.Dense(1, activation='sigmoid') # Dense sigmoid layer: outputs a score between 0 (negative) and 1 (positive)
    ])

In [7]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 32)          2834688   
                                                                 
 lstm (LSTM)                 (None, 32)                8320      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 2843041 (10.85 MB)
Trainable params: 2843041 (10.85 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
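The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the constant names `EMBED_DIM` and `LSTM_UNITS` are introduced here for illustration.

```python
VOCAB_SIZE = 88584
EMBED_DIM = 32   # output size of the Embedding layer
LSTM_UNITS = 32  # number of units in the LSTM layer

# Embedding: one EMBED_DIM-sized vector per vocabulary entry
embedding_params = VOCAB_SIZE * EMBED_DIM
# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (EMBED_DIM + LSTM_UNITS + 1) * LSTM_UNITS
# Dense: one weight per LSTM output plus one bias
dense_params = LSTM_UNITS * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# -> 2834688 8320 33 2843041
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the model's memory footprint.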


# Model Training

In [9]:
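model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['acc'])

# Hold out the last 20% of the training data for validation.
# BATCH_SIZE is not passed here, so Keras falls back to its default batch size of 32.
history = model.fit(train_data, train_labels, epochs=10, validation_split=0.2)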

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
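The `binary_crossentropy` loss named in `model.compile` can be written out by hand for a single prediction. This is a sketch of the formula only, not of Keras's implementation; the clipping constant `eps` is an assumption made here to avoid taking the log of zero.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one example: y_true is 0 or 1, y_pred is the sigmoid output."""
    # Clip to avoid log(0); the eps value is an assumption for this sketch
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

print(binary_crossentropy(1, 0.9))  # confident and correct: small loss
print(binary_crossentropy(1, 0.1))  # confident and wrong: large loss
```

Because the penalty grows quickly for confident mistakes, a model can have reasonable accuracy and still report a fairly high loss.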


In [10]:
results = model.evaluate(test_data,test_labels)
print(results)

[0.6461187601089478, 0.8406400084495544]
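`model.evaluate` returns the loss first, followed by the metrics passed to `compile`. A small sketch that labels the two numbers printed above (the values are copied from that output, not recomputed):

```python
# Values copied from the evaluation output above
results = [0.6461187601089478, 0.8406400084495544]

# The compile call used one loss and one metric, so the list has two entries
loss, accuracy = results
print(f"test loss: {loss:.3f}, test accuracy: {accuracy:.1%}")
# -> test loss: 0.646, test accuracy: 84.1%
```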


# Making Predictions

The training reviews are integer-encoded, so any new review must be encoded the same way before the model can make a prediction on it.

In [13]:
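word_index = imdb.get_word_index()  # maps word -> frequency rank (1 = most frequent)

def encode_text(text):
  # Lowercase the text, strip punctuation, and split it into words
  tokens = tf.keras.preprocessing.text.text_to_word_sequence(text)
  # Look up each word's index; unknown words map to 0.
  # Caveat: imdb.load_data shifts indices by index_from=3 by default, while
  # get_word_index() returns unshifted ranks, so these ids differ from the training encoding.
  tokens = [word_index.get(word, 0) for word in tokens]
  return sequence.pad_sequences([tokens], maxlen=MAXLEN)[0]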

text = "that movie was very good"
encoded = encode_text(text)
print(encoded)

[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0 12 17 13 52 49]
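With its default arguments, `imdb.load_data` reserves 0 for padding, 1 for the start of a review, and 2 for out-of-vocabulary words, and shifts every word rank up by 3. The ids printed above are raw ranks from `get_word_index()`, so they do not line up with that scheme. The sketch below shows one way an offset-aware encoder could look. `encode_with_offset` and `toy_word_index` are hypothetical names, the regex tokenizer is only a rough stand-in for `text_to_word_sequence`, and the five ranks are taken from the output above.

```python
import re

# Toy subset of the IMDB word index; the five ranks come from the encoded output above
toy_word_index = {"that": 12, "movie": 17, "was": 13, "very": 52, "good": 49}

START_CHAR, OOV_CHAR, INDEX_FROM = 1, 2, 3  # defaults of imdb.load_data

def encode_with_offset(text, word_index, num_words=88584):
    # Rough tokenizer: lowercase and keep runs of letters, digits, and apostrophes
    words = re.findall(r"[a-z0-9']+", text.lower())
    ids = [START_CHAR]
    for word in words:
        rank = word_index.get(word)
        shifted = rank + INDEX_FROM if rank is not None else None
        # Unknown words, and words shifted past the vocabulary size, become OOV
        ids.append(shifted if shifted is not None and shifted < num_words else OOV_CHAR)
    return ids

print(encode_with_offset("That movie was very good!", toy_word_index))
# -> [1, 15, 20, 16, 55, 52]
```

The resulting list would still need to be padded to `MAXLEN` before being passed to the model.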


Now let's write a decode function.

In [14]:
reverse_word_index = {value: key for (key,value) in word_index.items()}

def decode_integers(integers):
  PAD = 0
  # Skip padding and join the remaining words with single spaces
  return " ".join(reverse_word_index[num] for num in integers if num != PAD)

print(decode_integers(encoded))

that movie was very good
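`decode_integers` inverts `encode_text`, but sequences that come straight from `imdb.load_data` (such as `train_data[0]`) use ids shifted by 3, with 1 and 2 reserved for the start and out-of-vocabulary markers. A sketch of a decoder for that scheme follows; `decode_with_offset`, `toy_reverse_index`, and the `<START>` and `<OOV>` labels are illustrative choices, not Keras names.

```python
# Toy reverse index built from the five ranks seen in the encoded output
toy_reverse_index = {12: "that", 17: "movie", 13: "was", 52: "very", 49: "good"}

INDEX_FROM = 3                        # default shift applied by imdb.load_data
SPECIAL = {1: "<START>", 2: "<OOV>"}  # labels chosen for this sketch

def decode_with_offset(ids, reverse_index):
    words = []
    for i in ids:
        if i == 0:
            continue  # skip padding
        if i in SPECIAL:
            words.append(SPECIAL[i])
        else:
            # Undo the shift before looking the word up
            words.append(reverse_index.get(i - INDEX_FROM, "<OOV>"))
    return " ".join(words)

print(decode_with_offset([0, 0, 1, 15, 20, 16, 55, 52], toy_reverse_index))
# -> <START> that movie was very good
```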


Now let's make a prediction.

In [24]:
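def predict(text):
  encoded_text = encode_text(text)
  # The model expects a batch, so wrap the single review in an array of shape (1, MAXLEN)
  pred = np.zeros((1, MAXLEN))
  pred[0] = encoded_text
  result = model.predict(pred)
  # result[0] is the sigmoid output: close to 1 means positive, close to 0 means negative
  print(result[0])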

positive_review = "That movie was brilliant! I really loved it! Amazingly great movie"
negative_review = "That movie sucked. I hated it."
predict(positive_review)
predict(negative_review)

[0.51078844]
[0.69858474]
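The model outputs a score between 0 and 1 rather than a label. A minimal sketch of turning the two scores above into labels, assuming the usual 0.5 cutoff; `to_label` is a hypothetical helper and the scores are copied from the output, not recomputed.

```python
def to_label(score, threshold=0.5):
    """Map a sigmoid score to a sentiment label; the 0.5 cutoff is an assumption."""
    return "positive" if score >= threshold else "negative"

# Scores copied from the output above
for name, score in [("positive_review", 0.51078844), ("negative_review", 0.69858474)]:
    print(f"{name}: {score:.3f} -> {to_label(score)}")
```

Both scores land above 0.5, so under this cutoff the negative review would also be labelled positive. The predictions on these two hand-written inputs are therefore not reliable as they stand.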
