### Movie Sentiment Analysis

- In this exercise, you will classify whether a given movie review is positive or negative.
- You will use a Bag of Words representation to pre-process the text and then apply different classification algorithms.
- scikit-learn's CountVectorizer provides a built-in implementation of Bag of Words.
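
To make the Bag of Words idea concrete, here is a minimal sketch of what CountVectorizer produces. The two toy reviews are invented for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy reviews (made up for illustration, not from the IMDB data)
reviews = ["good movie good plot", "bad movie"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(reviews)  # sparse matrix, one row per review

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
matrix = counts.toarray().tolist()

print(vocab)   # ['bad', 'good', 'movie', 'plot']
print(matrix)  # [[0, 2, 1, 1], [1, 0, 1, 0]]
```

Each column counts how often one vocabulary word occurs in a review; word order is discarded.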


In [2]:
#Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

### Dataset: IMDB Dataset

Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download

This data consists of two columns:
- review
- sentiment

The review column holds the statements written by users after watching the movie.
The sentiment column tells whether the given review is positive or negative.

In [3]:
#read the data provided in the same directory with name 'movies_sentiment_data.csv' and store it in df variable
df = pd.read_csv("movies_sentiment_data.csv")

In [4]:
#print the shape of the data

df.shape




(19000, 2)

In [5]:
#print top 5 datapoints
df.head(5)

Unnamed: 0,review,sentiment
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive
1,I enjoyed the movie and the story immensely! I...,positive
2,I had a hard time sitting through this. Every ...,negative
3,It's hard to imagine that anyone could find th...,negative
4,This is one military drama I like a lot! Tom B...,positive


In [6]:
#create a new column "Category" that is 1 if the sentiment is positive and 0 if it is negative
df['Category']= df['sentiment'].apply(lambda x: 1 if x=='positive' else 0)
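
As an alternative sketch, the same encoding can be written with `Series.map` and an explicit dictionary. The `toy` frame below is a hypothetical stand-in for `df`. One difference in behavior: a label that is missing from the dictionary becomes NaN instead of silently becoming 0.

```python
import pandas as pd

# Hypothetical stand-in for df, using the same 'sentiment' values
toy = pd.DataFrame({"sentiment": ["positive", "negative", "positive"]})

# Explicit mapping; any label not listed here would map to NaN
toy["Category"] = toy["sentiment"].map({"positive": 1, "negative": 0})

labels = toy["Category"].tolist()
print(labels)  # [1, 0, 1]
```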

In [7]:
df.head()

Unnamed: 0,review,sentiment,Category
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive,1
1,I enjoyed the movie and the story immensely! I...,positive,1
2,I had a hard time sitting through this. Every ...,negative,0
3,It's hard to imagine that anyone could find th...,negative,0
4,This is one military drama I like a lot! Tom B...,positive,1


In [8]:
#check the distribution of 'Category' and see whether the Target labels are balanced or not.
df['Category'].value_counts()

1    9500
0    9500
Name: Category, dtype: int64

### Dataset is balanced

In [9]:
#Do the 'train-test' split with a test size of 20%
X_train, X_test, y_train, y_test = train_test_split(df.review, df.Category, test_size=0.2)
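
The split above is random and unseeded, so the exact rows in each set change from run to run. A possible variation, shown here on a hypothetical toy frame rather than on `df`, fixes `random_state` for reproducibility and uses `stratify` to keep the class ratio identical in both sets. The seed value 42 is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 reviews per class
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "Category": [0, 1] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["review"],
    toy["Category"],
    test_size=0.2,
    random_state=42,           # fixed seed: same split on every run
    stratify=toy["Category"],  # preserve the 50/50 class ratio in both sets
)

test_counts = y_te.value_counts().to_dict()
print(len(X_tr), len(X_te))  # 16 4
print(test_counts)
```

With stratification, the 4 test rows contain 2 examples of each class.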

In [10]:
X_train.shape

(15200,)

In [12]:
X_test.shape

(3800,)

### Exercise-1

Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

#### Note:

Use CountVectorizer for pre-processing the text.

Use Random Forest as the classifier, with n_estimators set to 50 and criterion set to 'entropy'.

Print the classification report.

#### References:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [15]:
#creating pipeline object

rf = Pipeline([
    ('Vectorizer', CountVectorizer()),
    ('random_forest', RandomForestClassifier(n_estimators=50,criterion='entropy'))   
])

#Fit the model
rf.fit(X_train,y_train)


#Predict with X_test and store in y_pred
y_pred = rf.predict(X_test)

#Classification report 
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.84      0.83      0.83      1915
           1       0.83      0.84      0.83      1885

    accuracy                           0.83      3800
   macro avg       0.83      0.83      0.83      3800
weighted avg       0.83      0.83      0.83      3800



Using the Random Forest classifier, precision, recall, and F1-score all come out at about 83% for both classes.

### Exercise-2

Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

#### Note:

Use CountVectorizer for pre-processing the text.

Use KNN as the classifier, with n_neighbors set to 10 and metric set to 'euclidean'.

Print the classification report.

#### References:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [17]:
#creating pipeline object

knn = Pipeline([
    ('Vectorizer', CountVectorizer()),
    ('KNeighborsClassifier', KNeighborsClassifier(n_neighbors=10,metric='euclidean'))
])

#Fit the model
knn.fit(X_train,y_train)


#Predict with X_test and store in y_pred
y_pred = knn.predict(X_test)

#Classification report 
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.65      0.66      0.66      1915
           1       0.65      0.64      0.64      1885

    accuracy                           0.65      3800
   macro avg       0.65      0.65      0.65      3800
weighted avg       0.65      0.65      0.65      3800




Using KNN, precision, recall, and F1-score average about 65%, which is considerably lower than the Random Forest result.

### Exercise-3

Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

#### Note:

Use CountVectorizer for pre-processing the text.

Use Multinomial Naive Bayes as the classifier.

Print the classification report.

#### References:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

In [18]:
#creating pipeline object

mnb = Pipeline([
    ('Vectorizer', CountVectorizer()),
    ('multiN_naive_bayes', MultinomialNB())   
])

#Fit the model
mnb.fit(X_train,y_train)


#Predict with X_test and store in y_pred
y_pred = mnb.predict(X_test)

#Classification report 
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.84      0.87      0.85      1915
           1       0.86      0.83      0.84      1885

    accuracy                           0.85      3800
   macro avg       0.85      0.85      0.85      3800
weighted avg       0.85      0.85      0.85      3800



Using Multinomial Naive Bayes, precision, recall, and F1-score are about 85% for both positive and negative reviews (between 0.83 and 0.87 per class). Of the three classifiers tried here, it gives the best results on this split.
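
The three reports above each come from a single random train-test split, so small differences between models can be noise. One way to make the comparison more stable is cross-validation. The sketch below runs it on a tiny invented corpus purely to show the mechanics; the texts, labels, and hyperparameters are assumptions for illustration, and the scores it prints say nothing about the IMDB data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Tiny invented corpus (illustration only)
texts = ["great film", "loved it", "wonderful acting", "great story",
         "terrible film", "hated it", "awful acting", "boring story"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

models = {
    "random_forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=3),
    "multinomial_nb": MultinomialNB(),
}

scores = {}
for name, clf in models.items():
    # Same structure as the exercises: vectorizer followed by a classifier
    pipe = Pipeline([("vectorizer", CountVectorizer()), ("clf", clf)])
    # Mean macro F1 over 2 stratified folds
    scores[name] = cross_val_score(pipe, texts, labels, cv=2,
                                   scoring="f1_macro").mean()

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

On the real data, `texts` and `labels` would be replaced by `df.review` and `df.Category`, with a larger `cv` value such as 5.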


### Can you write some observations on why a model like KNN fails to produce good results, unlike RandomForest and MultinomialNB?


There are several possible reasons why KNN produces weaker results for NLP text classification than RandomForest and MultinomialNB:

KNN's performance decreases with high-dimensional data: Bag of Words produces one feature per vocabulary word, so the feature space here has a very large number of dimensions, which leads to the "curse of dimensionality." KNN suffers from this because it relies on distances between points, and as the number of dimensions grows, the distances to the nearest and farthest neighbors tend to become similar, so the nearest neighbors carry less information.
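
The distance-concentration effect can be illustrated with synthetic data. This is a sketch that assumes uniformly random points, which is not how Bag of Words vectors are distributed, so it shows the general tendency rather than the behavior of this dataset. The helper `relative_contrast` is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=200):
    """(farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast_low = relative_contrast(2)      # 2 dimensions
contrast_high = relative_contrast(5000)  # 5000 dimensions

print(f"2 dims:    {contrast_low:.2f}")
print(f"5000 dims: {contrast_high:.2f}")
```

In low dimensions the farthest point is many times farther away than the nearest one; in high dimensions the two distances are nearly equal, so "nearest" becomes a weak signal.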

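KNN is sensitive to irrelevant features: In NLP, many features (for example, common words that appear in both positive and negative reviews) are not relevant to the classification task. KNN gives every feature equal weight in the distance calculation and has no mechanism to down-weight irrelevant ones, which can lead to poor performance. RandomForest and MultinomialNB, in contrast, learn from the training labels which words matter.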

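KNN relies on a distance metric: KNN calculates distances between data points to determine similarity, but not all distance metrics are appropriate for NLP. Euclidean distance, the metric used in Exercise-2, is often a poor fit for raw word counts, which are sparse and high-dimensional: it is strongly affected by document length, so two reviews with similar wording but different lengths can end up far apart.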
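
The document-length effect can be seen on three invented toy reviews. This sketch compares Euclidean distance with cosine similarity on raw counts; the helper functions `euclidean` and `cosine_sim` are defined here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "good movie",                                   # short positive review
    "good movie good movie good movie good movie",  # same words, 4x longer
    "bad plot",                                     # short negative review
]
X = CountVectorizer().fit_transform(docs).toarray().astype(float)

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Euclidean: the short positive review is closer to the negative one
print(euclidean(X[0], X[1]))  # about 4.24
print(euclidean(X[0], X[2]))  # 2.0

# Cosine: identical wording scores 1, no shared words scores 0
print(cosine_sim(X[0], X[1]))
print(cosine_sim(X[0], X[2]))
```

Under Euclidean distance the short positive review sits nearer to the unrelated negative review than to the longer review with identical wording. Cosine similarity ignores vector length; as a possible experiment, KNeighborsClassifier can be tried with metric='cosine' to see whether it helps on this data.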

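KNN requires labeled data: KNN is a supervised learning algorithm and requires labeled data for training. NLP datasets can be expensive and time-consuming to label, which may limit the amount of training data available. However, RandomForest and MultinomialNB are supervised as well, so this limitation is shared by all three models and does not by itself explain the gap seen here.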