# IMDb Movie Review Sentiment Prediction using LSTM

**Name:** Rohan Sahu  
**Registration Number:** 22BAI10166  
**Date:** 05-04-2025

## Problem Statement

When people watch a movie, they often head to platforms like **IMDb** to share what they thought about it. These reviews are valuable not just for other viewers but also for film-industry companies trying to gauge public opinion. With thousands of reviews posted every day, reading them manually isn't feasible.

This project builds a model that **predicts the sentiment** of a movie review (**positive** or **negative**) based purely on its text. We use a deep learning model known as **LSTM (Long Short-Term Memory)**, a recurrent architecture designed for sequential data such as sentences and paragraphs. Because an LSTM carries information forward across a sequence, it can capture context and relationships between words, which makes it a solid choice for this task.

The dataset is the **IMDb movie review dataset**, which contains **50,000 labeled reviews** split evenly between positive and negative sentiment. The LSTM is trained on this data to learn how sentiment is typically expressed in text, then used to predict the sentiment of new reviews.

### Challenges:
- Dealing with informal language, sarcasm, or mixed emotions in reviews.
- Making sure the model captures not just individual words, but how they're used in context.
- Handling reviews of different lengths without losing important information.

### Goal:
Our goal is to develop a model that can reliably **predict the sentiment of reviews with high accuracy**, ideally **above 85% on the test data**. We measure performance with the following metrics:
- **Accuracy**
- **Precision**
- **Recall**
- **F1-score**

We also **visualize results** to better understand how the model is learning over time.


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [5]:
data = pd.read_csv("C:\\Users\\Rohan Sahu\\Downloads\\IMDB\\IMDB Dataset.csv")

In [6]:
data.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [7]:
data["sentiment"].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [8]:
# Encode labels as integers (positive -> 1, negative -> 0)
data["sentiment"] = data["sentiment"].map({"positive": 1, "negative": 0})



In [14]:
# split data into training data and test data
train_data, test_data = train_test_split(data, test_size=0.2, random_state=17)

In [15]:
print(train_data.shape)
print(test_data.shape)

(40000, 2)
(10000, 2)


In [16]:
# Tokenize text data
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_data["review"])
X_train = pad_sequences(tokenizer.texts_to_sequences(train_data["review"]), maxlen=200)
X_test = pad_sequences(tokenizer.texts_to_sequences(test_data["review"]), maxlen=200)

In [17]:
print(X_train)

[[   0    0    0 ...    7   21 1128]
 [   0    0    0 ...    5   99  872]
 [   0    0    0 ... 1014   32 1742]
 ...
 [   0    0    0 ...    1    4 3144]
 [   0    0    0 ...  693  725  155]
 [ 363 1219   76 ...    1   95  141]]


In [18]:
print(X_test)

[[ 460    4  249 ... 1094    6 4951]
 [ 137   61  140 ...  454  140   26]
 [   0    0    0 ...  273    3  163]
 ...
 [   0    0    0 ...  345    2 2466]
 [   0    0    0 ...   75   65   39]
 [   0    0    0 ...   21  246  342]]
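The leading zeros in the arrays above are consistent with the defaults of `pad_sequences` (`padding="pre"`, `truncating="pre"`): short reviews are padded at the front, and long reviews keep only their last `maxlen` tokens. A minimal pure-Python sketch of that behavior follows; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic pad_sequences defaults: truncate from the front, pad at the front."""
    truncated = sequence[-maxlen:]                  # keep the last `maxlen` tokens
    padding = [value] * (maxlen - len(truncated))   # zeros needed to reach maxlen
    return padding + truncated


short_review = [7, 21, 1128]        # fewer tokens than maxlen
long_review = list(range(1, 11))    # ten tokens, more than maxlen

print(pad_pre(short_review, maxlen=5))  # [0, 0, 7, 21, 1128]
print(pad_pre(long_review, maxlen=5))   # [6, 7, 8, 9, 10]
```

With `maxlen=200`, any review longer than 200 tokens loses its opening words, so the model sees only the end of long reviews.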


In [19]:
Y_train = train_data["sentiment"]
Y_test = test_data["sentiment"]

In [20]:
print(Y_train)

2380     0
3385     1
41779    0
39302    0
20619    1
        ..
42297    1
33174    1
46470    1
34959    1
10863    1
Name: sentiment, Length: 40000, dtype: int64


In [21]:
# build the model

model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=128, input_length=200))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation="sigmoid"))



In [22]:
model.summary()
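The output of `model.summary()` is not shown here, but the parameter counts can be worked out by hand. The sketch below assumes the standard Keras layouts: an LSTM with four gates, each holding an input kernel, a recurrent kernel, and a bias, and a Dense layer with one bias per unit. The helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


counts = {
    "embedding": embedding_params(5000, 128),
    "lstm": lstm_params(128, 128),
    "dense": dense_params(128, 1),
}
total = sum(counts.values())

print(counts)  # {'embedding': 640000, 'lstm': 131584, 'dense': 129}
print(total)   # 771713
```

Under these assumptions, most of the trainable parameters sit in the embedding table rather than in the LSTM itself.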

In [23]:
# compile the model
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

In [24]:
model.fit(X_train, Y_train, epochs=5, batch_size=64, validation_split=0.2)

Epoch 1/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 477s 928ms/step - accuracy: 0.7353 - loss: 0.5193 - val_accuracy: 0.8349 - val_loss: 0.3919
Epoch 2/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 785s 2s/step - accuracy: 0.8559 - loss: 0.3512 - val_accuracy: 0.8485 - val_loss: 0.3508
Epoch 3/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 464s 926ms/step - accuracy: 0.8806 - loss: 0.2937 - val_accuracy: 0.8654 - val_loss: 0.3272
Epoch 4/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 376s 752ms/step - accuracy: 0.9014 - loss: 0.2488 - val_accuracy: 0.8704 - val_loss: 0.3257
Epoch 5/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 386s 772ms/step - accuracy: 0.9115 - loss: 0.2215 - val_accuracy: 0.8730 - val_loss: 0.3239


<keras.src.callbacks.history.History at 0x1b94d4b70b0>
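To see how the model is learning over time without a plot, the per-epoch numbers from the training log can be compared directly. The sketch below copies those values by hand (the variable names are illustrative) and computes the gap between training and validation accuracy, along with the epoch that has the lowest validation loss.

```python
# Per-epoch values copied from the training log above.
train_acc = [0.7353, 0.8559, 0.8806, 0.9014, 0.9115]
val_acc = [0.8349, 0.8485, 0.8654, 0.8704, 0.8730]
val_loss = [0.3919, 0.3508, 0.3272, 0.3257, 0.3239]

# Gap between training and validation accuracy for each epoch.
gaps = [round(t - v, 4) for t, v in zip(train_acc, val_acc)]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

for epoch, gap in enumerate(gaps, start=1):
    print(f"epoch {epoch}: train - val accuracy = {gap:+.4f}")
print(f"lowest val_loss at epoch {best_epoch}")
```

The gap grows every epoch while the validation loss barely moves between epochs 4 and 5 (0.3257 to 0.3239), which suggests that training much longer would likely bring diminishing returns.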

In [25]:
loss, accuracy = model.evaluate(X_test, Y_test)
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy}")

313/313 ━━━━━━━━━━━━━━━━━━━━ 28s 83ms/step - accuracy: 0.8782 - loss: 0.3113
Test Loss: 0.32001858949661255
Test Accuracy: 0.8755999803543091


In [26]:
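def predict_sentiment(review):
    """Return "positive" or "negative" for a single raw review string."""
    # Tokenize and pad the review exactly as the training data was prepared
    sequence = tokenizer.texts_to_sequences([review])
    padded_sequence = pad_sequences(sequence, maxlen=200)
    # The sigmoid output is the predicted probability of the positive class
    prediction = model.predict(padded_sequence)
    sentiment = "positive" if prediction[0][0] > 0.5 else "negative"
    return sentiment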
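One detail worth keeping in mind for `predict_sentiment`: a Keras `Tokenizer` created without an `oov_token` drops any word it did not see during fitting, as well as any word whose index is `num_words` or higher. The sketch below mimics that filtering in plain Python. `texts_to_ids` and `toy_word_index` are hypothetical names used only for illustration, and the real tokenizer also strips punctuation.

```python
def texts_to_ids(text, word_index, num_words):
    """Map words to integer ids, dropping unknown or out-of-range words."""
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word)
        # Keep only known words whose rank falls inside the vocabulary cap.
        if idx is not None and idx < num_words:
            ids.append(idx)
    return ids


# Toy vocabulary ranked by frequency (1 = most frequent); not the real one.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4, "cringeworthy": 5}

print(texts_to_ids("The movie was great", toy_word_index, num_words=5))
# [1, 2, 3, 4]
print(texts_to_ids("The movie was cringeworthy", toy_word_index, num_words=5))
# [1, 2, 3]  ("cringeworthy" has index 5, which is not below the cap)
print(texts_to_ids("Utterly forgettable", toy_word_index, num_words=5))
# []  (no known words, so the model would see only padding)
```

A review made up mostly of rare words therefore reaches the model as mostly padding, so its prediction deserves extra caution.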

In [27]:
# example usage
new_review = "An absolute masterpiece. The storytelling, cinematography, and acting were all top-notch. Easily one of the best films of the decade."
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 1s/step
The sentiment of the review is: positive


In [28]:
new_review = "The acting felt wooden, and the dialogue was cringeworthy at best. I struggled to finish it."
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 204ms/step
The sentiment of the review is: negative


### Classification Report

In [31]:
from sklearn.metrics import confusion_matrix, classification_report

Y_pred_probs = model.predict(X_test)
Y_pred = (Y_pred_probs > 0.5).astype("int32")
cm = confusion_matrix(Y_test, Y_pred)

print("Confusion Matrix:\n", cm)
print(classification_report(Y_test, Y_pred, target_names=['Negative', 'Positive']))

313/313 ━━━━━━━━━━━━━━━━━━━━ 25s 73ms/step
Confusion Matrix:
 [[4349  590]
 [ 654 4407]]
              precision    recall  f1-score   support

    Negative       0.87      0.88      0.87      4939
    Positive       0.88      0.87      0.88      5061

    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000
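The figures in the report can be reproduced by hand from the confusion matrix, which helps clarify what each metric measures. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, with class 0 = Negative first); `prf` is a hypothetical helper name.

```python
# Confusion matrix copied from the output above: rows = true, columns = predicted.
cm = [[4349, 590],
      [654, 4407]]
tn, fp = cm[0]   # true Negative reviews: predicted Negative / predicted Positive
fn, tp = cm[1]   # true Positive reviews: predicted Negative / predicted Positive


def prf(hits, false_alarms, misses):
    """Precision, recall, and F1 for one class."""
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = (tp + tn) / (tp + tn + fp + fn)
positive = prf(tp, fp, fn)   # Positive treated as the class of interest
negative = prf(tn, fn, fp)   # roles swap when Negative is the class of interest

print(f"accuracy: {accuracy:.4f}")                       # accuracy: 0.8756
print("positive:", [round(x, 2) for x in positive])      # positive: [0.88, 0.87, 0.88]
print("negative:", [round(x, 2) for x in negative])      # negative: [0.87, 0.88, 0.87]
```

The accuracy of 0.8756 matches the test accuracy reported by `model.evaluate`, and the two classes have nearly symmetric errors (590 false positives against 654 false negatives).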

