## LSTM Sentiment Analysis on Kaggle IMDB Movie Review Dataset
### Introduction

This project performs sentiment analysis on the IMDB movie review dataset using a Long Short-Term Memory (LSTM) network. The dataset, available on Kaggle, consists of 50,000 highly polar movie reviews, split evenly between positive and negative labels.

### Dataset
Download the data from [Kaggle IMDB Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). The CSV file contains the following columns:
- review: The text of the movie review.
- sentiment: The sentiment label (positive/negative).

### Data Preprocessing
- Loading the Data: Load the dataset into a Pandas DataFrame.
- Tokenization: Convert text into sequences of integers using Keras's Tokenizer.
- Padding: Pad sequences to ensure uniform input length.
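
To make the tokenization and padding steps concrete, here is a minimal pure-Python sketch of the idea. It is a simplification, not Keras's implementation: the helper names `fit_word_index`, `texts_to_sequences`, and `pad_pre` are hypothetical, and punctuation handling is omitted. It does mirror three Keras defaults: index 0 is reserved for padding, out-of-vocabulary words are dropped, and both padding and truncation happen at the front of the sequence.

```python
from collections import Counter


def fit_word_index(texts, num_words):
    """Assign ids 1..num_words-1 to the most frequent words (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(num_words - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word with its id; words outside the vocabulary are dropped."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]


def pad_pre(sequence, maxlen):
    """Keep the last maxlen ids and left-pad with zeros to a fixed length."""
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence


# Toy corpus (made up for illustration)
texts = ["great movie great cast great", "great movie", "bad"]
word_index = fit_word_index(texts, num_words=3)
sequences = texts_to_sequences(texts, word_index)
padded = [pad_pre(seq, maxlen=3) for seq in sequences]

print(word_index)  # {'great': 1, 'movie': 2}
print(sequences)   # [[1, 2, 1, 1], [1, 2], []]
print(padded)      # [[2, 1, 1], [0, 1, 2], [0, 0, 0]]
```

The first review is truncated from the front, the second is left-padded, and the third contains only an out-of-vocabulary word, so it becomes all zeros.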

In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,LSTM,Embedding

In [2]:
#reading data
data = pd.read_csv("IMDB Dataset.csv")

In [3]:
data.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
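
The preview shows that some reviews contain raw `<br />` tags. The notebook feeds the text to the tokenizer as is; one possible cleaning step, which is an assumption and not part of the original pipeline, is to replace those tags with spaces first. The helper name `strip_html_breaks` is hypothetical.

```python
import re


def strip_html_breaks(text):
    """Replace <br>, <br/>, and <br /> tags with spaces, then collapse repeated whitespace."""
    without_tags = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", without_tags).strip()


sample = "A wonderful little production. <br /><br />The filming technique"
cleaned = strip_html_breaks(sample)
print(cleaned)  # A wonderful little production. The filming technique
```

In the notebook this could be applied with `data["review"] = data["review"].apply(strip_html_breaks)` before fitting the tokenizer.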


In [106]:
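# Encode the labels as integers (positive -> 1, negative -> 0), touching only the sentiment column
data["sentiment"] = data["sentiment"].replace({"positive": 1, "negative": 0})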

In [5]:
data["sentiment"].value_counts()

sentiment
1    25000
0    25000
Name: count, dtype: int64

In [16]:
# Hold out 20% of the reviews as a test set; the fixed seed makes the split reproducible
train, test = train_test_split(data, random_state=42, test_size=0.2)

In [88]:
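# Limit the vocabulary to the most frequent words (ids below 5000)
tokenizer = Tokenizer(num_words=5000)
# Fit on the training reviews only, so no test-set vocabulary leaks into the model
tokenizer.fit_on_texts(train["review"])
# Convert each review to a list of word ids, then pad or truncate to 200 ids
train_x = pad_sequences(tokenizer.texts_to_sequences(train["review"]), maxlen=200)
test_x = pad_sequences(tokenizer.texts_to_sequences(test["review"]), maxlen=200)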

In [91]:
train_y = train["sentiment"]
test_y = test["sentiment"]

In [85]:
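# Build the model: embedding -> LSTM -> sigmoid output
model = Sequential()
# input_dim matches the tokenizer's num_words; input_length is omitted because it is
# deprecated in Keras 3, and the sequence length is declared in build() below
model.add(Embedding(input_dim=5000, output_dim=128))
model.add(LSTM(units=128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(units=1, activation="sigmoid"))

# Compile before training: Adam optimizer, binary cross-entropy loss, accuracy metric
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Create the weights for inputs of shape (batch, 200) so that summary() can report them
model.build(input_shape=(None, 200))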

In [86]:
model.summary()
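
The output of `model.summary()` is not shown here, but the parameter counts can be worked out by hand from the layer sizes. The sketch below assumes the standard Keras LSTM parameterization: four gates, each with an input kernel, a recurrent kernel, and one bias vector.

```python
vocab_size, embed_dim, lstm_units = 5000, 128, 128

# Embedding: one embed_dim-sized vector per vocabulary id
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates x (input kernel + recurrent kernel + bias)
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# prints: 640000 131584 129 771713
```

Most of the parameters sit in the embedding table, so the vocabulary size is the main lever on model size.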

In [83]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

In [116]:
model.fit(train_x,train_y,epochs=10,batch_size=64,validation_split=0.2)

Epoch 1/10
500/500 ━━━━━━━━━━━━━━━━━━━━ 44s 88ms/step - accuracy: 0.9664 - loss: 0.0932 - val_accuracy: 0.8739 - val_loss: 0.4534
Epoch 2/10
500/500 ━━━━━━━━━━━━━━━━━━━━ 45s 89ms/step - accuracy: 0.9726 - loss: 0.0811 - val_accuracy: 0.8714 - val_loss: 0.4763
Epoch 3/10
500/500 ━━━━━━━━━━━━━━━━━━━━ 45s 91ms/step - accuracy: 0.9776 - loss: 0.0671 - val_accuracy: 0.8668 - val_loss: 0.5550
Epoch 4/10
500/500 ━━━━━━━━━━━━━━━━━━━━ 44s 89ms/step - accuracy: 0.9743 - loss: 0.0719 - val_accuracy: 0.8691 - val_loss: 0.4893
Epoch 5/10
500/500 ━━━━━━━━━━━━━━━━━━━━ 45s 90ms/step - accuracy: 0.9795 - loss: 0.0615 - val_accuracy: 0.8689 - val_loss: 0.5633
Epoch 6/10
500/500 ━━━━━━━━━━━━━━━━━━━━ 46s 91ms/step - accuracy: 0.9874 - loss: 0.0408 - val_accuracy: 0.8639 - val_loss: 0.5955
Epoch 7/10
...

<keras.src.callbacks.history.History at 0x13d3ffd8d70>
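
In this log the training loss keeps falling while the validation loss trends upward, which is the usual sign of overfitting. A common response is patience-based early stopping: stop once the validation loss has failed to improve for a set number of epochs. The sketch below shows only the stopping logic in plain Python, applied to the six `val_loss` values visible in the log; the function name `early_stopping_epoch` and the patience of 2 are assumptions for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (epoch at which training would stop, epoch with the best val_loss)."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Patience never ran out: training runs through the last epoch
    return len(val_losses), best_epoch


# val_loss values for epochs 1-6, copied from the training log
val_losses = [0.4534, 0.4763, 0.5550, 0.4893, 0.5633, 0.5955]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)  # 3 1
```

In Keras the same behavior is available through the `EarlyStopping` callback, for example `EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)` passed to `model.fit` via `callbacks=[...]`.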

In [93]:
y_pred = model.predict(test_x)

313/313 ━━━━━━━━━━━━━━━━━━━━ 6s 17ms/step


In [119]:
model.evaluate(test_x, test_y)

313/313 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - accuracy: 0.8633 - loss: 0.6340


[0.6315243244171143, 0.8644999861717224]
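
`model.evaluate` returns the loss followed by the accuracy. The `y_pred` array computed earlier holds sigmoid probabilities and is not used further in the notebook. As a sketch of how it could be turned into class labels and a confusion matrix, the plain-Python example below thresholds at 0.5, the same cutoff used later in `predict_sentiment`. The helper names and the toy numbers are made up for illustration.

```python
def threshold_labels(probabilities, threshold=0.5):
    """Map each probability to 1 (positive) if it exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]


def confusion_counts(y_true, y_hat):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, guess in zip(y_true, y_hat):
        if truth == 1 and guess == 1:
            counts["tp"] += 1
        elif truth == 0 and guess == 0:
            counts["tn"] += 1
        elif truth == 0 and guess == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


# Toy data (hypothetical values, not model output)
toy_probs = [0.91, 0.12, 0.55, 0.40, 0.08]
toy_true = [1, 0, 0, 1, 0]

toy_pred = threshold_labels(toy_probs)
counts = confusion_counts(toy_true, toy_pred)
accuracy = (counts["tp"] + counts["tn"]) / len(toy_true)

print(toy_pred)  # [1, 0, 1, 0, 0]
print(counts)    # {'tp': 1, 'tn': 2, 'fp': 1, 'fn': 1}
print(accuracy)  # 0.6
```

With the notebook's variables, `y_pred.ravel()` flattens the `(n, 1)` prediction array so it can be passed in together with `test_y`.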

In [118]:
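def predict_sentiment(review):
    """Return "positive" or "negative" for a single raw review string."""
    # Tokenize and pad the review the same way the training data was prepared
    sequence = tokenizer.texts_to_sequences([review])
    padded_sequence = pad_sequences(sequence, maxlen=200)
    # model.predict returns an array of shape (1, 1) holding the sigmoid probability
    prediction = model.predict(padded_sequence)
    sentiment = "positive" if prediction[0][0] > 0.5 else "negative"
    return sentiment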

In [107]:
# example usage
new_review = "This movie was good"
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step
The sentiment of the review is: positive


In [108]:
new_review = "Very bad movie"
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step
The sentiment of the review is: negative


### Conclusion

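This project demonstrates an LSTM model for sentiment analysis on the IMDB movie review dataset. On the held-out test set of 10,000 reviews, the model reaches about 86.4% accuracy (0.8645 as returned by `model.evaluate`) in classifying reviews as positive or negative. In the training log, training accuracy climbs above 97% while validation accuracy stays near 87% and validation loss rises, which suggests that the model overfits as training continues.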