# 🎬 IMDb Movie Review Sentiment Analysis Using LSTM

## 🧠 What is LSTM?

Long Short-Term Memory (LSTM) is a special type of Recurrent Neural Network (RNN) capable of learning long-term dependencies in sequential data. It’s particularly useful in natural language processing tasks where understanding the context over time is essential.
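
To make the gating idea concrete, here is a minimal NumPy sketch of a single LSTM time step. It follows the standard textbook LSTM equations rather than Keras internals; the helper name `lstm_step`, the gate ordering, and the toy sizes are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (hypothetical helper, standard equations).

    W: (input_dim, 4 * units), U: (units, 4 * units), b: (4 * units,)
    Gate order assumed here: input, forget, candidate, output.
    """
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[0 * units:1 * units])  # input gate: how much new information to write
    f = sigmoid(z[1 * units:2 * units])  # forget gate: how much old memory to keep
    g = np.tanh(z[2 * units:3 * units])  # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])  # output gate: how much memory to expose
    c_t = f * c_prev + i * g             # updated long-term memory (cell state)
    h_t = o * np.tanh(c_t)               # updated hidden state
    return h_t, c_t


rng = np.random.default_rng(0)
input_dim, units = 3, 4                  # toy sizes, not the notebook's
W = rng.normal(size=(input_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):  # a sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)
```

The cell state `c` is carried from step to step and only changed through the forget and input gates, which is what lets information survive across many time steps.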

### 📌 Use Cases of LSTM:
- **Sentiment Analysis**: Understanding emotion in text reviews.
- **Language Translation**: Translating sentences from one language to another.
- **Speech Recognition**: Interpreting spoken language in real-time.

---

## 🛠️ Project Workflow

### 1️⃣ Setting Up & Importing Dataset from Kaggle
- Configure Kaggle API.
- Download the IMDb dataset (50,000 movie reviews).

### 2️⃣ 🧹 Preprocessing the Data
- Tokenize and pad the text sequences.
- Convert sentiment labels:
  - **Positive = 1**
  - **Negative = 0**
- Prepare data for binary classification.
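
The preprocessing steps above can be sketched in plain Python. This is a simplified stand-in for the Keras `Tokenizer`, not its actual implementation: the helpers `build_word_index` and `texts_to_sequences`, the regular expression, and the two toy reviews are assumptions made for illustration.

```python
import re
from collections import Counter

WORD_PATTERN = r"[a-z0-9']+"  # assumed tokenization rule for this sketch


def build_word_index(texts):
    """Rank words by frequency: 1 = most frequent (0 is left free for padding)."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(WORD_PATTERN, text.lower()))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their rank; drop unknown words and ranks >= num_words."""
    sequences = []
    for text in texts:
        words = re.findall(WORD_PATTERN, text.lower())
        sequences.append(
            [word_index[w] for w in words if w in word_index and word_index[w] < num_words]
        )
    return sequences


reviews = ["A great movie, a great cast", "A boring movie"]
labels = ["positive", "negative"]

word_index = build_word_index(reviews)
sequences = texts_to_sequences(reviews, word_index, num_words=4)
encoded_labels = [1 if label == "positive" else 0 for label in labels]

print(word_index)      # {'a': 1, 'great': 2, 'movie': 3, 'cast': 4, 'boring': 5}
print(sequences)       # [[1, 2, 3, 1, 2], [1, 3]]
print(encoded_labels)  # [1, 0]
```

With `num_words=4` only ranks 1 to 3 survive, so the rarer words `cast` and `boring` are dropped from the sequences.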

### 3️⃣ ✂️ Splitting the Dataset
- Use `train_test_split` to divide data into training and testing sets (e.g., 80-20 split).

### 4️⃣ 🏗️ Building the LSTM Model
- Construct model using:
  - `Embedding` layer (for word vectors)
  - `LSTM` layer (for sequence learning)
  - `Dense` layer with sigmoid activation (for binary classification)
- Use TensorFlow Keras.

### 5️⃣ 📊 Evaluating the Model
- Compute **Loss** and **Accuracy** on the test set.


---

✅ This simple LSTM model classifies the sentiment of IMDb reviews with about 88% accuracy on the held-out test set.




In [19]:
!pip install kaggle



In [3]:
import os
import json
from zipfile import ZipFile
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [20]:
# read the Kaggle API credentials; the context manager closes the file afterwards
with open('kaggle.json') as f:
  kaggle_dic = json.load(f)

In [22]:
#setting up the env variables
os.environ['KAGGLE_USERNAME'] = kaggle_dic['username']
os.environ['KAGGLE_KEY'] = kaggle_dic['key']

In [8]:
#download the dataset archive with the Kaggle CLI
!kaggle datasets download lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other


In [23]:
!ls

'IMDB Dataset.csv'			 kaggle.json
 imdb-dataset-of-50k-movie-reviews.zip	 sample_data


In [11]:
with ZipFile("imdb-dataset-of-50k-movie-reviews.zip", "r") as zip_ref:
  zip_ref.extractall()

In [12]:
!ls

'IMDB Dataset.csv'			 kaggle.json
 imdb-dataset-of-50k-movie-reviews.zip	 sample_data


In [24]:
data = pd.read_csv("/content/IMDB Dataset.csv")

In [25]:
data.shape

(50000, 2)

In [26]:
data.head()

                                              review  sentiment
0  One of the other reviewers has mentioned that ...   positive
1  A wonderful little production. <br /><br />The...   positive
2  I thought this was a wonderful way to spend ti...   positive
3  Basically there's a family where a little boy ...   negative
4  Petter Mattei's "Love in the Time of Money" is...   positive


In [28]:
#encode labels as integers: positive -> 1, negative -> 0
data["sentiment"] = data["sentiment"].map({"positive": 1, "negative": 0})


In [29]:
data.head()

                                              review  sentiment
0  One of the other reviewers has mentioned that ...          1
1  A wonderful little production. <br /><br />The...          1
2  I thought this was a wonderful way to spend ti...          1
3  Basically there's a family where a little boy ...          0
4  Petter Mattei's "Love in the Time of Money" is...          1


In [30]:
data['sentiment'].value_counts()

sentiment  count
1          25000
0          25000


In [31]:
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

In [32]:
print(train_data.shape)
print(test_data.shape)

(40000, 2)
(10000, 2)
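
As a rough illustration of what the split does, the sketch below shuffles row indices with a fixed seed and cuts off 20% for testing. It uses only the standard library and does not reproduce the exact permutation that scikit-learn's `train_test_split` generates; `split_indices` is a hypothetical helper.

```python
import math
import random


def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices, then reserve the first `test_size` share for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # fixed seed makes the split repeatable
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(50_000)
print(len(train_idx), len(test_idx))  # 40000 10000
```

Every row lands in exactly one of the two parts, which is what keeps the test set unseen during training.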


**Data Preprocessing**

In [35]:
#Tokenizing the text data
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_data['review'])
X_train = pad_sequences(tokenizer.texts_to_sequences(train_data['review']), maxlen=200)
X_test = pad_sequences(tokenizer.texts_to_sequences(test_data['review']), maxlen=200)

In [38]:
print(X_train)
print(X_train.shape)

[[1935    1 1200 ...  205  351 3856]
 [   3 1651  595 ...   89  103    9]
 [   0    0    0 ...    2  710   62]
 ...
 [   0    0    0 ... 1641    2  603]
 [   0    0    0 ...  245  103  125]
 [   0    0    0 ...   70   73 2062]]
(40000, 200)
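
The leading zeros in the printed array come from the defaults of `pad_sequences`, which pads and truncates at the start of each sequence (`padding='pre'`, `truncating='pre'`). A minimal pure-Python sketch of that behaviour, where `pad_pre` is a hypothetical helper and not part of Keras:

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


print(pad_pre([7, 8, 9], maxlen=5))              # [0, 0, 7, 8, 9]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

For reviews longer than 200 tokens this means the model sees only the last 200 tokens.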


In [39]:
print(X_test)
print(X_test.shape)

[[   0    0    0 ...  995  719  155]
 [  12  162   59 ...  380    7    7]
 [   0    0    0 ...   50 1088   96]
 ...
 [   0    0    0 ...  125  200 3241]
 [   0    0    0 ... 1066    1 2305]
 [   0    0    0 ...    1  332   27]]
(10000, 200)


In [40]:
Y_train = train_data['sentiment']
Y_test = test_data['sentiment']

**LSTM - Long Short-Term Memory**

In [47]:
model = Sequential(
    [
        Embedding(input_dim=5000, output_dim=128, input_shape=(200,) ),
        LSTM(128, dropout=0.2, recurrent_dropout=0.2), #dropout on the inputs and on the recurrent state to reduce overfitting
        Dense(1, activation='sigmoid')
    ]
)


In [48]:
model.summary()
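
The summary output was not captured here, but the parameter counts it reports can be worked out by hand from the layer sizes used above. The sketch below assumes the default layer settings (bias terms enabled) and uses illustrative variable names.

```python
vocab_size, embed_dim, units = 5000, 128, 128

# Embedding: one 128-dimensional vector per vocabulary index
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (units * (embed_dim + units) + units)

# Dense(1): one weight per LSTM unit plus a bias
dense_params = units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 640000 131584 129 771713
```

Most of the parameters sit in the embedding table, so the vocabulary size is the main lever on model size.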

In [49]:
#compile the model
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
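
For intuition about the `binary_crossentropy` loss, here is a small standard-library sketch of the formula. The function below is a hypothetical helper, and the clipping constant `eps` is an assumption made to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    losses = []
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)


labels = [1, 0, 1]
confident = binary_crossentropy(labels, [0.9, 0.1, 0.8])
unsure = binary_crossentropy(labels, [0.6, 0.4, 0.5])
print(confident < unsure)  # True
```

The loss falls as the predicted probability moves toward the true label, and a prediction of 0.5 costs `log(2)`, roughly 0.693, per sample.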

In [50]:
model.fit(X_train, Y_train, epochs=5, batch_size=64, validation_split=0.2)

Epoch 1/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 227s 441ms/step - accuracy: 0.7307 - loss: 0.5274 - val_accuracy: 0.8280 - val_loss: 0.3974
Epoch 2/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 224s 448ms/step - accuracy: 0.8420 - loss: 0.3781 - val_accuracy: 0.8474 - val_loss: 0.3684
Epoch 3/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 246s 415ms/step - accuracy: 0.8634 - loss: 0.3315 - val_accuracy: 0.8550 - val_loss: 0.3433
Epoch 4/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 256s 403ms/step - accuracy: 0.8662 - loss: 0.3169 - val_accuracy: 0.8629 - val_loss: 0.3243
Epoch 5/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 214s 427ms/step - accuracy: 0.9099 - loss: 0.2325 - val_accuracy: 0.8763 - val_loss: 0.3179


<keras.src.callbacks.history.History at 0x791c8af12dd0>
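
The `500/500` step count in the training log follows from the arguments passed to `fit`. A quick arithmetic check, with variable names chosen for illustration:

```python
import math

n_train = 40_000          # rows in X_train
validation_split = 0.2    # share held out by fit() for validation
batch_size = 64

n_val = round(n_train * validation_split)   # samples used for val_loss / val_accuracy
n_fit = n_train - n_val                     # samples actually used for gradient updates
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_val, n_fit, steps_per_epoch)  # 8000 32000 500
```

So each epoch updates the weights on 32,000 reviews, while the remaining 8,000 training reviews are used only to report the validation metrics.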

**Model Evaluation**

In [51]:
#evaluating the model
loss, accuracy = model.evaluate(X_test, Y_test)
print(f"Loss: {loss}")
print(f"Accuracy: {accuracy}")

313/313 ━━━━━━━━━━━━━━━━━━━━ 36s 112ms/step - accuracy: 0.8835 - loss: 0.2935
Loss: 0.295772910118103
Accuracy: 0.8830999732017517
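
To put these numbers in context, the short calculation below converts the reported accuracy into counts and checks the `313/313` step count, which matches the default `batch_size=32` of `evaluate`. Variable names are illustrative.

```python
import math

n_test = 10_000
accuracy = 0.8830999732017517   # value printed above
eval_batch_size = 32            # Keras default when batch_size is not passed

eval_steps = math.ceil(n_test / eval_batch_size)
n_correct = round(accuracy * n_test)
n_wrong = n_test - n_correct

print(eval_steps, n_correct, n_wrong)  # 313 8831 1169
```

Since the dataset is balanced (25,000 reviews per class), always guessing one class would score about 50%, so roughly 88% is well above that baseline.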


**Building a Predictive System**
Putting it all together, let's build a predictive system that takes a review as an input and outputs the sentiment.

In [55]:
def predict_sentiment(review):
  #tokenize the review
  sequence = tokenizer.texts_to_sequences([review])
  pad_sequence = pad_sequences(sequence, maxlen=200)
  #make prediction
  prediction = model.predict(pad_sequence)
  sentiment = "positive" if prediction[0][0] > 0.5 else "negative"
  return sentiment
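
The thresholding step inside `predict_sentiment` can be isolated and extended to report how confident the model is. The helper below is a hypothetical sketch that works on a plain probability, so it runs without the trained model; the name `label_from_probability` and the returned confidence value are assumptions, not part of the notebook.

```python
def label_from_probability(probability, threshold=0.5):
    """Turn a sigmoid output into a label plus the probability of that label."""
    if probability > threshold:
        return "positive", probability
    return "negative", 1 - probability


print(label_from_probability(0.91))  # ('positive', 0.91)
label, confidence = label_from_probability(0.2)
print(label)                         # negative
```

Scores close to the threshold are the ones to treat with caution, since a small change in the input can flip the label.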

In [56]:
#test example usage
pos_review = "This movie is a masterpiece."
sentiment = predict_sentiment(pos_review)
print(f"The sentiment of the review is {sentiment}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 426ms/step
The sentiment of the review is positive


In [57]:
#test example usage - negative
neg_review = "Such a boring movie. Not Recommended"
sentiment = predict_sentiment(neg_review)
print(f"The sentiment of the review is {sentiment}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 125ms/step
The sentiment of the review is negative
