# IMDB Movie Review Prediction using LSTM (Sentiment Analysis)

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/data

Problem Statement: Predict the sentiment (positive or negative) of a movie review based on its text.

**Steps performed:**

1. Loading the dataset.
2. Dataset cleaning and preprocessing (if required)
3. Separating the dependent (y) and independent (X) features.
4. Text pre-processing
5. One hot encoding representation and applying padding
6. Train Test split
7. Create the ANN model with Embedding layer and LSTM layer.
8. Train the model
9. Check the performance score

## Loading Dataset

In [4]:
import pandas as pd

## Reading CSV

In [5]:
df = pd.read_csv('IMDB Dataset.csv')

In [6]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [7]:
df.shape

(50000, 2)

## Checking if dataset has null values

In [8]:
df.isna().sum()

review       0
sentiment    0


Since there are no missing or null values, we can proceed to text preprocessing.

## Separating the dependent (y) feature

Since this is a binary classification problem, the target must contain only the values 0 and 1.

Hence, we use `pd.get_dummies` to convert the sentiment labels. With `drop_first=True`, only the `positive` column is kept (1 = positive, 0 = negative).

In [40]:
y = pd.get_dummies(df['sentiment'], drop_first=True).astype(int)

In [41]:
print(y)

       positive
0             1
1             1
2             1
3             0
4             1
...         ...
49995         1
49996         0
49997         0
49998         0
49999         0

[50000 rows x 1 columns]
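
As a cross-check on the encoding above, the same 0/1 labels can be produced with an explicit mapping. This is a minimal sketch rather than the notebook's method: `label_map` and `sample_labels` are hypothetical names, and the sample values simply mirror the first five rows shown by `df.head()`.

```python
# Hypothetical explicit mapping; mirrors what drop_first=True yields
# (the alphabetically first category, 'negative', becomes 0).
label_map = {"negative": 0, "positive": 1}

# First five sentiment labels as shown in df.head() above.
sample_labels = ["positive", "positive", "positive", "negative", "positive"]

encoded = [label_map[label] for label in sample_labels]
print(encoded)  # [1, 1, 1, 0, 1]
```

With pandas, the same mapping could be applied via `df['sentiment'].map(label_map)`, which returns a one-dimensional Series instead of a one-column DataFrame.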


## Separating the independent (X) feature

In [10]:
X = df['review']

## Text preprocessing

Let us remove special characters and stop words, lowercase the text, and, since this is sentiment analysis, apply stemming to reduce each word to its root form.

## Importing libraries

In [13]:
import nltk # nltk library for stemming, stopwords
import re # for regular expressions
from nltk.corpus import stopwords

from nltk.stem.porter import PorterStemmer # used for stemming

## Downloading stopwords

In [2]:
# Downloading stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Stemming, lowercasing, and removing special characters

Before moving forward, let's create a copy of the 'review' column to avoid modifying the original data.

In [11]:
reviews = X.copy()

## Iterating over the reviews and preprocessing the text

In [15]:
stemmer = PorterStemmer() # creating an object of Porter Stemmer class
stop_words = set(stopwords.words('english')) # building the stopword set once, outside the loop

pre_processed_reviews = []

for review in reviews:

    review = re.sub('[^a-zA-Z]', ' ', review) # replacing every non-letter character with a space
    review = review.lower() # lowercasing the review

    words = review.split() # splitting the review into words
    words = [stemmer.stem(word) for word in words if word not in stop_words] # removing stopwords and stemming the rest
    review = ' '.join(words) # joining the words back into a single string

    # adding to the preprocessed reviews list
    pre_processed_reviews.append(review)


## Defining Vocabulary size

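The vocabulary size sets the range of integer indices that `one_hot` can assign to words. Because `one_hot` hashes each word into this range, two different words may occasionally receive the same index.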

In [16]:
voc_size = 10000
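
To make the role of `voc_size` concrete, here is a minimal pure-Python sketch of the hashing idea behind Keras's `one_hot`: each word is hashed and reduced to an integer between 1 and `voc_size - 1`, so 0 is never assigned and can serve as the padding value. The helper `hash_to_index` is hypothetical, and using MD5 is an assumption made so the output is repeatable; the sketch does not claim to reproduce the exact indices Keras returns.

```python
import hashlib

voc_size = 10000  # same value as in the notebook


def hash_to_index(word: str, n: int) -> int:
    """Map a word to an integer in [1, n - 1] using a stable hash (hypothetical helper)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1


sentence = "wonder littl product film techniqu"
indices = [hash_to_index(word, voc_size) for word in sentence.split()]
print(indices)

# The same word always maps to the same index; different words may collide
# because only voc_size - 1 distinct indices are available.
```

Because this is hashing rather than a learned vocabulary, a larger `voc_size` makes collisions between different words less frequent, at the cost of a larger embedding table.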

## Importing library to do one hot encoding

In [17]:
from tensorflow.keras.preprocessing.text import one_hot

## Deriving OHE representation

In [18]:
one_hot_representation = [one_hot(pre_processed_review, voc_size) for pre_processed_review in pre_processed_reviews]

In [19]:
pre_processed_reviews[1]

'wonder littl product br br film techniqu unassum old time bbc fashion give comfort sometim discomfort sens realism entir piec br br actor extrem well chosen michael sheen got polari voic pat truli see seamless edit guid refer william diari entri well worth watch terrificli written perform piec master product one great master comedi life br br realism realli come home littl thing fantasi guard rather use tradit dream techniqu remain solid disappear play knowledg sens particularli scene concern orton halliwel set particularli flat halliwel mural decor everi surfac terribl well done'
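
The sample above still contains `br` tokens, which are remnants of `<br />` HTML tags in the raw reviews. One option, sketched here under the assumption that such tags carry no sentiment, is to strip tags before removing non-letters. `clean_review`, the tiny `STOP_WORDS` set, and the `raw` string are all hypothetical stand-ins; the notebook uses NLTK's stopword list and a Porter stemmer, both omitted here to keep the sketch dependency-free.

```python
import re

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"a", "the", "is", "of", "and"}


def clean_review(text: str) -> str:
    """Strip HTML tags, keep letters only, lowercase, and drop stopwords (no stemming)."""
    text = re.sub(r"<[^>]+>", " ", text)    # remove tags such as <br />
    text = re.sub(r"[^a-zA-Z]", " ", text)  # replace non-letters with spaces
    words = text.lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)


# Hypothetical raw review, not taken from the dataset.
raw = "A wonderful film. <br /><br />The acting is superb!"
print(clean_review(raw))  # wonderful film acting superb
```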

In [20]:
one_hot_representation[1]

[3506,
 3571,
 7665,
 6313,
 6313,
 6138,
 2824,
 5242,
 8717,
 2101,
 9657,
 6095,
 3868,
 8031,
 8802,
 9691,
 5679,
 2237,
 369,
 6027,
 6313,
 6313,
 710,
 4571,
 574,
 2926,
 6856,
 6356,
 5288,
 7796,
 6956,
 5359,
 5211,
 6184,
 3421,
 4698,
 2211,
 3698,
 4943,
 4567,
 5774,
 574,
 5087,
 694,
 7213,
 9860,
 8874,
 6027,
 5707,
 7665,
 2412,
 7960,
 5707,
 2714,
 2414,
 6313,
 6313,
 2237,
 6621,
 1351,
 5646,
 3571,
 4689,
 3819,
 8607,
 9157,
 8903,
 9322,
 2035,
 2824,
 8633,
 1571,
 2319,
 9225,
 219,
 5679,
 2937,
 8109,
 5591,
 4333,
 4215,
 649,
 2937,
 2787,
 4215,
 811,
 5345,
 6702,
 900,
 8437,
 574,
 1274]

## Applying padding

## Importing library

In [22]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [23]:
sent_length = 500

padded_one_hot_representation = pad_sequences(one_hot_representation, padding='pre', maxlen=sent_length)

print(padded_one_hot_representation)

[[   0    0    0 ... 9609 4580 4891]
 [   0    0    0 ... 8437  574 1274]
 [   0    0    0 ... 2134 6184 7791]
 ...
 [   0    0    0 ... 3728  838 7553]
 [   0    0    0 ...  995  170 3377]
 [   0    0    0 ... 9656 7195 9310]]
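
The leading zeros in the output come from `padding='pre'`. The following pure-Python sketch illustrates that behaviour on toy data, assuming the default `truncating='pre'` of `pad_sequences` (sequences longer than `maxlen` lose items from the front). `pad_pre` and `toy` are hypothetical names used only for illustration.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)   # pad at the front
    return padded


toy = [[5, 8], [3, 1, 4, 1, 5, 9]]
print(pad_pre(toy, maxlen=4))  # [[0, 0, 5, 8], [4, 1, 5, 9]]
```

Pre-padding keeps the actual words at the end of each sequence, so they are the last inputs the LSTM sees before producing its output.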


## Train Test Split

Converting the padded sequences and the labels to NumPy arrays before splitting.

In [42]:
import numpy as np

X_final = np.array(padded_one_hot_representation)
y_final = np.array(y)

In [45]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.33, random_state=42)

In [46]:
X_train.shape, y_train.shape

((33500, 500), (33500, 1))
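
The shapes above follow from the split fraction. As a quick arithmetic check, assuming scikit-learn rounds the test count up and assigns the remainder to training:

```python
import math

n_samples = 50000   # rows in the dataset (df.shape[0])
test_size = 0.33    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # test count is rounded up
n_train = n_samples - n_test               # the remainder goes to training

print(n_train, n_test)  # 33500 16500
```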

## Creating model

### Importing required libraries

In [33]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense

In [34]:
embeding_vector_features = 100 # dimension of the embedding vector learned for each word index

model = Sequential() # creating object for Sequential class

# Embedding layer
model.add(Embedding(voc_size, # adding vocabulary size
                         embeding_vector_features, # feature representation
                         input_length = sent_length))

# LSTM layer
model.add(LSTM(300))

# Dense layer
model.add(Dense(1, activation='sigmoid')) # Using Sigmoid as its binary classification problem

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary() # summary() prints the table itself; wrapping it in print() only adds a trailing "None"



## Training the model

In [47]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=64)

Epoch 1/10
524/524 ━━━━━━━━━━━━━━━━━━━━ 39s 69ms/step - accuracy: 0.7495 - loss: 0.5019 - val_accuracy: 0.8590 - val_loss: 0.3369
Epoch 2/10
524/524 ━━━━━━━━━━━━━━━━━━━━ 41s 78ms/step - accuracy: 0.8906 - loss: 0.2775 - val_accuracy: 0.8576 - val_loss: 0.3293
Epoch 3/10
524/524 ━━━━━━━━━━━━━━━━━━━━ 40s 76ms/step - accuracy: 0.9204 - loss: 0.2089 - val_accuracy: 0.8659 - val_loss: 0.3230
Epoch 4/10
524/524 ━━━━━━━━━━━━━━━━━━━━ 40s 76ms/step - accuracy: 0.9463 - loss: 0.1549 - val_accuracy: 0.8628 - val_loss: 0.3647
Epoch 5/10
524/524 ━━━━━━━━━━━━━━━━━━━━ 41s 76ms/step - accuracy: 0.9592 - loss: 0.1178 - val_accuracy: 0.8533 - val_loss: 0.4140
Epoch 6/10
524/524 ━━━━━━━━━━━━━━━━━━━━ 35s 67ms/step - accuracy: 0.9619 - loss: 0.1107 - val_accuracy: 0.8580 - val_loss: 0.4525
Epoch 7/10
...

<keras.src.callbacks.history.History at 0x7c0a70ad1450>
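
In the log above, training accuracy keeps rising while `val_loss` reaches its minimum at epoch 3 and climbs afterwards, which suggests overfitting. The sketch below replays the logged `val_loss` values through a simple early-stopping rule to show where training could have been halted. The `patience` value is a hypothetical choice, and this is an illustration of the idea rather than what the notebook ran.

```python
# val_loss values for epochs 1-6, copied from the training log above.
val_losses = [0.3369, 0.3293, 0.3230, 0.3647, 0.4140, 0.4525]

patience = 2  # hypothetical: epochs to wait for an improvement before stopping

best_loss = float("inf")
best_epoch = 0
stopped_at = None
epochs_without_improvement = 0

for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            stopped_at = epoch
            break

print(best_epoch, best_loss, stopped_at)  # 3 0.323 5
```

In Keras, a comparable rule is available as the `EarlyStopping` callback (for example with `monitor='val_loss'` and `restore_best_weights=True`), passed to `model.fit` through the `callbacks` argument.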

## Prediction

In [48]:
y_pred = model.predict(X_test)

[1m516/516[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 14ms/step


Since the model outputs a probability between 0 and 1, we convert each prediction to a class label using a cutoff of 0.6:

if pred_val > 0.6, the output is 1 (positive);
else the output is 0 (negative).

In [49]:
y_pred = np.where(y_pred > 0.6, 1, 0) # 1 = positive, 0 = negative
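
To see what the cutoff does to borderline scores, here is a small pure-Python sketch. The probabilities are hypothetical values chosen for illustration, not model outputs from this notebook, and `to_labels` is a hypothetical helper.

```python
# Hypothetical sigmoid outputs for five reviews (not taken from the notebook).
probabilities = [0.12, 0.55, 0.60, 0.61, 0.97]


def to_labels(probs, cutoff):
    """Return 1 where the probability is strictly above the cutoff, else 0."""
    return [1 if p > cutoff else 0 for p in probs]


print(to_labels(probabilities, cutoff=0.6))  # [0, 0, 0, 1, 1]
print(to_labels(probabilities, cutoff=0.5))  # [0, 1, 1, 1, 1]
```

Raising the cutoff from the conventional 0.5 to 0.6 moves borderline reviews into the negative class, which tends to trade some recall on positives for precision.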

## Accuracy

In [51]:
from sklearn.metrics import accuracy_score

In [52]:
print(accuracy_score(y_test, y_pred))

0.8501818181818181


In [53]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred, labels=[1,0]))

[[7212 1080]
 [1392 6816]]


In [54]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.83      0.85      8208
           1       0.84      0.87      0.85      8292

    accuracy                           0.85     16500
   macro avg       0.85      0.85      0.85     16500
weighted avg       0.85      0.85      0.85     16500
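
The figures in the report for class 1 can be re-derived by hand from the confusion matrix printed earlier. This sketch assumes the scikit-learn convention that rows are true classes and columns are predicted classes, in the order given by `labels=[1, 0]`.

```python
# Confusion matrix from above with labels=[1, 0]:
# rows are true classes (positive, negative), columns are predicted classes.
tp, fn = 7212, 1080
fp, tn = 1392, 6816

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.8502 0.84 0.87 0.85
```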

