<span style="color:black; font-size:28px; font-weight:bold;">Movie Sentiment Analysis using NLP</span>

This notebook walks through a complete workflow for classifying movie reviews as positive or negative using Natural Language Processing (NLP). Reviews are converted to bag-of-words features with `CountVectorizer` and classified with two models, `RandomForestClassifier` and `MultinomialNB`. The Random Forest model is wrapped in a `Pipeline` so that vectorization and classification run as a single step.

In [1]:
# Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

<p style="font-size:16px; color:black;">
    <strong>Dataset Source:</strong> 
    <a href="https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews" style="color:purple;">
        https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
    </a>
</p>


In [2]:
# Load the dataset
df = pd.read_csv("IMDB_Dataset.csv")

In [3]:
# Display dataset details
print("Dataset Shape:", df.shape)
print("Top 5 Data Points:")
df.head()

Dataset Shape: (50000, 2)
Top 5 Data Points:


index,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
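
The second row in the preview contains `<br />` markup. With the default tokenizer, `CountVectorizer` would count `br` as a word, so one optional step is to strip these tags first. The helper below is a hypothetical sketch (the name `strip_br_tags` and the sample string are not part of the original notebook), and the models later in this notebook were trained without it.

```python
import re

def strip_br_tags(text: str) -> str:
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    without_tags = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", without_tags).strip()

# Made-up sample string that imitates the markup seen in the reviews
sample = "Great film.<br /><br />Loved it."
cleaned = strip_br_tags(sample)
print(cleaned)  # Great film. Loved it.
```

On the full dataset this could be applied with `df['review'].apply(strip_br_tags)` before the train/test split.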


In [4]:
# Create a binary 'Category' column for sentiment classification
df['Category'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
print("Top 5 Data Points with Category:")
df.head()

Top 5 Data Points with Category:


index,review,sentiment,Category
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. <br /><br />The...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1
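
The `lambda` above maps every value other than `'positive'` to 0, including typos or missing labels. A possible alternative, sketched here on a hypothetical three-row frame, is an explicit dictionary with `Series.map`, which turns unexpected labels into `NaN` so they can be spotted.

```python
import pandas as pd

# Hypothetical stand-in for the IMDB frame
toy = pd.DataFrame({"sentiment": ["positive", "negative", "positive"]})

# Explicit mapping: any label missing from the dict would become NaN
toy["Category"] = toy["sentiment"].map({"positive": 1, "negative": 0})

labels = toy["Category"].tolist()
missing = int(toy["Category"].isna().sum())
print(labels)   # [1, 0, 1]
print(missing)  # 0
```

For this dataset both approaches give the same column, since the only labels are `positive` and `negative`.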


In [5]:
# Analyze the distribution of the 'Category' column
print("Category Distribution:")
df['Category'].value_counts()

Category Distribution:


Category
1    25000
0    25000
Name: count, dtype: int64

In [6]:
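# Split the dataset into training (80%) and testing (20%) sets.
# No random_state is set, so the split, and the scores reported below, vary slightly between runs.
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['Category'], test_size=0.2)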
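
If reproducible results are wanted, the split can be seeded and stratified. The sketch below uses a hypothetical 20-row frame. `random_state=42` is an arbitrary choice, and the reports shown in this notebook were not produced with these settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy frame standing in for the 50,000 reviews
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "Category": [0, 1] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["review"],
    toy["Category"],
    test_size=0.2,
    random_state=42,           # fixed seed: same split on every run
    stratify=toy["Category"],  # keep the class ratio equal in both parts
)

test_labels = sorted(y_te.tolist())
print(len(X_tr), len(X_te))  # 16 4
print(test_labels)           # [0, 0, 1, 1]
```

With `stratify`, the test set keeps the 50/50 class balance instead of the 4958/5042 split seen in the reports below.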

In [7]:
# First method: Random Forest with Pipeline
clf = Pipeline([
    ('Vectorizer', CountVectorizer()),
    ('rf', RandomForestClassifier(n_estimators=50, criterion='entropy'))
])

In [8]:
# Train the pipeline
clf.fit(X_train, y_train)

In [9]:
# Evaluate the pipeline
print("Classification Report for Random Forest Pipeline:")
y_pred_rf = clf.predict(X_test)
print(classification_report(y_test, y_pred_rf))

Classification Report for Random Forest Pipeline:
              precision    recall  f1-score   support

           0       0.83      0.85      0.84      4958
           1       0.85      0.83      0.84      5042

    accuracy                           0.84     10000
   macro avg       0.84      0.84      0.84     10000
weighted avg       0.84      0.84      0.84     10000



In [10]:
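# Second method: Multinomial Naive Bayes, with vectorization done manually instead of in a Pipeline
# Fit the vocabulary on the training reviews only, then reuse it to transform the test reviews
cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)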

In [11]:
# Train the Multinomial Naive Bayes model
model = MultinomialNB()
model.fit(X_train_cv, y_train)

In [12]:
# Evaluate the Multinomial Naive Bayes model
print("Classification Report for Multinomial Naive Bayes:")
y_pred_nb = model.predict(X_test_cv)
print(classification_report(y_test, y_pred_nb))

Classification Report for Multinomial Naive Bayes:
              precision    recall  f1-score   support

           0       0.82      0.88      0.85      4958
           1       0.87      0.81      0.84      5042

    accuracy                           0.84     10000
   macro avg       0.85      0.84      0.84     10000
weighted avg       0.85      0.84      0.84     10000



In [13]:
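# Observations
# Both models reach about 0.84 accuracy on the 10,000 held-out reviews. Random Forest is roughly balanced across the two classes, while MultinomialNB has higher recall on negative reviews (0.88) than on positive reviews (0.81).
# KNN (imported above but not evaluated in this notebook) would likely perform worse than Random Forest and MultinomialNB: distances become less informative in the high-dimensional, sparse space produced by CountVectorizer, the method is sensitive to feature scaling, and prediction is slow on large datasets because every query is compared against the training set.
# Random Forest and MultinomialNB are better suited to high-dimensional bag-of-words features and tend to be less affected by noisy or irrelevant terms.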
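
The KNN observation is not backed by a run in this notebook. One way to check it would be to swap `KNeighborsClassifier` into the same kind of pipeline and compare its classification report with the two above. The sketch below only shows the mechanics on a made-up four-review corpus. `n_neighbors=1` is an arbitrary choice for the tiny example, and it says nothing about how KNN scores on the IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Invented toy reviews: 1 = positive, 0 = negative
train_texts = [
    "great film loved it",
    "wonderful acting great story",
    "terrible film hated it",
    "awful acting boring story",
]
train_labels = [1, 1, 0, 0]

knn_clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("knn", KNeighborsClassifier(n_neighbors=1)),  # label of the single closest review
])
knn_clf.fit(train_texts, train_labels)

knn_preds = knn_clf.predict(["great story", "awful boring"]).tolist()
print(knn_preds)  # [1, 0]
```

Running `knn_clf.fit(X_train, y_train)` followed by `classification_report(y_test, knn_clf.predict(X_test))` on the real split would give the numbers needed to confirm or revise the observation.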