<a href="https://colab.research.google.com/github/safal25/ml_basic_codes/blob/main/IMDB_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment analysis of IMDB reviews
We start by importing the necessary libraries.

In [6]:

import tensorflow as tf

In [7]:
import tensorflow.keras as keras
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


Next, we download the dataset. Keras returns it already split into training and test data.

The features are IMDB reviews and the labels are 0 or 1: 1 means a positive review and 0 means a negative review.

In [8]:
imdb = keras.datasets.imdb

In [9]:
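#downloading the dataset, which is returned already split into train and test data
#only the vocab_size most frequent words are kept, rarer words are encoded as 2 (<UNK>)
vocab_size=10000
(train_data,train_labels),(test_data,test_labels)=imdb.load_data(num_words=vocab_size)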



The downloaded dataset is already integer encoded, so we also download the mapping from words to integers for this dataset as a Python dictionary. We shift every index by 3 because the encoded reviews reserve the first indices for special tokens.

In [10]:

word_index=imdb.get_word_index()

word_index={k:(v+3) for k,v in word_index.items()}

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
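
The shift by 3 can be illustrated on a toy vocabulary. This is only a minimal sketch: `toy_index` and its ranks are made up for illustration, while the real mapping comes from `imdb.get_word_index()`. The special tokens mirror the ones this notebook adds in the next cell.

```python
# Hypothetical stand-in for imdb.get_word_index(): word -> frequency rank (starting at 1).
toy_index = {"the": 1, "and": 2, "movie": 17}

# Shift every rank by 3 so that indices 0..3 become free.
shifted = {word: rank + 3 for word, rank in toy_index.items()}

# Reserve the freed indices for the special tokens.
shifted["<PAD>"] = 0
shifted["<START>"] = 1
shifted["<UNK>"] = 2
shifted["<UNUSED>"] = 3

print(shifted["the"], shifted["movie"])  # 4 20
```

Without the shift, the most frequent words would collide with the indices used for padding, the start marker and unknown words.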


In [11]:
word_index["<PAD>"]=0
word_index["<START>"]=1
word_index["<UNK>"]=2
word_index["<UNUSED>"]=3

In [12]:
#checking the mapping
s=["the","movie","was","beautiful"]
arr=[word_index[k] for k in s]

In [13]:
arr

[4, 20, 16, 307]

We also create a second Python dictionary that holds the reverse mapping, from integers to words. We use it whenever we want to convert integer encoded data back into text.

In [14]:
reverse_word_index={value:key for key,value in word_index.items()}

The decode_review function decodes an integer encoded review back into text. Integers without an entry in the dictionary are shown as '?'.

In [15]:
def decode_review(text):
  return " ".join([reverse_word_index.get(i,'?') for i in text])

review=decode_review(train_data[0])

In [16]:
#printing the first review in training dataset
print(review)

<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what
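
The reverse lookup can be tried in isolation on a small example. The following is a sketch under assumptions: `toy_index`, `toy_reverse` and `decode` are hypothetical names, and the vocabulary is a hand-picked subset rather than the real IMDB mapping.

```python
# Hypothetical toy vocabulary, already shifted and extended with special tokens.
toy_index = {"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3,
             "the": 4, "was": 16, "movie": 20}

# Invert the mapping: integer -> word.
toy_reverse = {index: word for word, index in toy_index.items()}

def decode(ids, reverse_index):
    """Join the words for a list of integer ids; ids without an entry become '?'."""
    return " ".join(reverse_index.get(i, "?") for i in ids)

encoded = [1, 4, 20, 16, 2, 999]
print(decode(encoded, toy_reverse))  # <START> the movie was <UNK> ?
```

The id 999 is not in the toy vocabulary, so it falls back to '?', just as `reverse_word_index.get(i,'?')` does in the notebook.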

Before giving the reviews as input to the model we need to perform the following preprocessing steps:

*   Every review must have the same length so that the reviews can be stacked into a single input array.
*   We have chosen a length of 500 for each review.
*   If a review is longer than 500 words, we truncate it to 500 words. By default, pad_sequences removes the extra words from the beginning of the review.
*   If a review contains fewer than 500 words, we pad it with zeros to increase its length to 500.

In [17]:
train_data=keras.preprocessing.sequence.pad_sequences(train_data,value=word_index["<PAD>"],padding='post',maxlen=500)
test_data=keras.preprocessing.sequence.pad_sequences(test_data,value=word_index["<PAD>"],padding='post',maxlen=500)
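
To make the padding and truncation rules concrete, here is a pure-Python sketch of what the calls above do to a single review. The helper name `pad_post` is hypothetical. It assumes `padding='post'` together with the default `truncating='pre'` of `pad_sequences`, so over-long sequences lose their first items.

```python
def pad_post(seq, maxlen, value=0):
    """Sketch of padding='post' combined with the default truncating='pre'."""
    seq = list(seq)
    if len(seq) > maxlen:
        # Too long: drop items from the front and keep the last maxlen items.
        seq = seq[len(seq) - maxlen:]
    # Too short: append the pad value until the length is maxlen.
    return seq + [value] * (maxlen - len(seq))

print(pad_post([5, 6, 7], 5))           # [5, 6, 7, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

If the beginning of a long review should be kept instead, `pad_sequences` accepts `truncating='post'`.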

# Building the model
Our model is a neural network that consists of the following layers:

1.   A word embedding layer, which maps each integer of the encoded review to an embedding vector of length 16.
2.   A global average pooling layer, which averages the embedding vectors over all 500 positions of a review.
3.   A dense layer with 16 hidden units that uses ReLU as its activation function.
4.   An output layer with a single unit that uses sigmoid as its activation function.

In [18]:
model=keras.Sequential([keras.layers.Embedding(vocab_size,16,input_length=500),
                        keras.layers.GlobalAveragePooling1D(),
                        keras.layers.Dense(16,activation='relu'),
                        keras.layers.Dense(1,activation='sigmoid')])
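
As a rough check on the size of this architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes the usual layouts (one embedding vector per vocabulary entry, a weight matrix plus a bias vector per dense layer). The variable names are illustrative.

```python
vocab_size = 10000    # same value as in the notebook
embedding_dim = 16    # length of each word embedding
hidden_units = 16     # units in the hidden dense layer

embedding_params = vocab_size * embedding_dim                 # one vector per word index
pooling_params = 0                                            # averaging has no weights
dense_params = embedding_dim * hidden_units + hidden_units    # weights + biases
output_params = hidden_units * 1 + 1                          # weights + bias of the sigmoid unit

total = embedding_params + pooling_params + dense_params + output_params
print(total)  # 160289
```

Almost all parameters sit in the embedding layer. Calling `model.summary()` prints the per-layer counts for comparison.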

# Compiling the model

1.   Adam is used as the optimizer for our model.
2.   Binary cross-entropy is used as the loss function.
3.   Accuracy is used as the metric for evaluating the model.

In [19]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
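
For intuition about the loss, binary cross-entropy can be computed by hand for a few predictions. This is a minimal pure-Python sketch, not the Keras implementation. The function name and the clipping constant `eps` are assumptions made for illustration.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

# Confident and correct predictions give a small loss.
print(round(binary_cross_entropy([1, 0], [0.9, 0.1]), 4))  # 0.1054
```

The loss grows as the predicted probability moves away from the true label, which is what the optimizer tries to reduce during training.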

In the next step we train the model on the IMDB training data. The test data is passed as validation data, so the loss and accuracy on it are reported after every epoch.

In [20]:
#training the model
history=model.fit(train_data,train_labels,epochs=30,batch_size=512,validation_data=(test_data,test_labels))

Epoch 1/30
...
Epoch 30/30


Now we evaluate the loss and accuracy of our model on the test data.

In [21]:
loss,accuracy=model.evaluate(test_data,test_labels)



Our model reaches an accuracy of 88.58% on the test data.
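
The accuracy figure is the fraction of reviews whose predicted class matches the label. A minimal sketch of that computation, assuming the sigmoid output is thresholded at 0.5 (the function name and the sample values are hypothetical):

```python
def accuracy(labels, probs, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in probs]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct / len(labels)

# Three of the four predictions land on the right side of the threshold.
print(accuracy([1, 0, 1, 0], [0.8, 0.3, 0.4, 0.1]))  # 0.75
```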

Now that our model is trained, we test it on a review copied from the IMDB website.

In [22]:
#copying a random review from imdb to test our model
string="Scam 1992 The Harshad Mehta story is a brilliant web series directed by Hansal Mehta I have been a Hansal Mehta fan since Bose and Omerta His direction is mindblowing Performance by Pratik Gandhi Shreya Dhanwanthary and others are good. In short a definite watch"


In [23]:
#converting the string into a list of strings 
arr=string.split()

Since this review is in text format, we need to convert it to the integer encoded format before giving it as input to the model. The review_encoder function does this.

In [24]:
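def review_encoder(text):
  #the keys of word_index are lower case, so lower-case each word and drop trailing punctuation
  #words that are not in the vocabulary are encoded as <UNK>, as in the training data
  arr=[word_index.get(word.lower().strip(".,!?"),word_index["<UNK>"]) for word in text]
  return arr

scam_review=review_encoder(arr)
#the embedding layer only accepts indices 0..vocab_size-1, so rarer words also become <UNK>
scam_review=[i if i<vocab_size else word_index["<UNK>"] for i in scam_review]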

In [25]:
#converting the list to a numpy array
scam_review=np.array([scam_review])

In [26]:
#padding the review
scam_review=keras.preprocessing.sequence.pad_sequences(scam_review,value=word_index["<PAD>"],padding='post',maxlen=500)

Checking the prediction of our model on the random review.

In [27]:
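#model.predict returns an array of shape (1,1) holding the predicted probability of a positive review
prediction=model.predict(scam_review)[0][0]
if prediction > 0.5: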
  print("positive review")
else:
  print("negative review")

positive review
