# Bag of words (Practice part)

- In this exercise, you will classify whether a given movie review is **positive or negative**.
- You will use Bag of Words to pre-process the text and then apply different classification algorithms.
- scikit-learn's `CountVectorizer` provides a built-in implementation of Bag of Words.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

In [2]:
#1. read the file 'movies_sentiment_data.csv' from the current directory and store it in the df variable
df = pd.read_csv('movies_sentiment_data.csv')
df

Unnamed: 0,review,sentiment
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive
1,I enjoyed the movie and the story immensely! I...,positive
2,I had a hard time sitting through this. Every ...,negative
3,It's hard to imagine that anyone could find th...,negative
4,This is one military drama I like a lot! Tom B...,positive
...,...,...
18995,- Bad Stuff: This movie is real crap. Bad stun...,negative
18996,"If you've seen the trailer for this movie, you...",positive
18997,This has to be the all time best computer anim...,positive
18998,I've seen 'NSNA' just after I've seen all Roge...,positive


In [3]:
#2. print the shape of the data
df.shape

(19000, 2)

In [4]:
#3. print top 5 datapoints
df.head()

Unnamed: 0,review,sentiment
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive
1,I enjoyed the movie and the story immensely! I...,positive
2,I had a hard time sitting through this. Every ...,negative
3,It's hard to imagine that anyone could find th...,negative
4,This is one military drama I like a lot! Tom B...,positive


In [5]:
#create a new column "Category" that is 1 if the sentiment is positive and 0 if it is negative
df['Category'] = df['sentiment'].apply(lambda x: 1 if x=='positive' else 0)
df.head()

Unnamed: 0,review,sentiment,Category
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive,1
1,I enjoyed the movie and the story immensely! I...,positive,1
2,I had a hard time sitting through this. Every ...,negative,0
3,It's hard to imagine that anyone could find th...,negative,0
4,This is one military drama I like a lot! Tom B...,positive,1


In [6]:
#check the distribution of 'Category' and see whether the Target labels are balanced or not.
df['Category'].value_counts()

1    9500
0    9500
Name: Category, dtype: int64

In [7]:
#Do the 'train-test' splitting with test size of 20%
x_train, x_test, y_train, y_test = train_test_split(df.review, df.Category, test_size=0.2)
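
The split above is random on every run and is not stratified, which is why the `support` column in the reports further down shows 1945 negative and 1855 positive test reviews rather than an even 1900 each. As an optional variation (not what this notebook ran), the sketch below uses the `random_state` and `stratify` parameters of `train_test_split` to get a reproducible split that preserves the class ratio. The `toy_df` frame and the seed value 42 are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small stand-in for df (hypothetical data, balanced like the real labels)
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "Category": [1, 0] * 5,
})

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_df.review,
    toy_df.Category,
    test_size=0.2,
    random_state=42,           # fixed seed: same split on every run
    stratify=toy_df.Category,  # keep the 50/50 class ratio in both parts
)

print(len(x_tr), len(x_te))           # 8 2
print(sorted(y_te.tolist()))          # [0, 1]
```

Applying the same two arguments to the real call would change the exact numbers in the classification reports, so the outputs shown in this notebook would no longer match.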

**Exercise-1**

1. Using the sklearn `Pipeline` module, create a classification pipeline that classifies movie reviews as positive or negative.

**Note:**
- use CountVectorizer for pre-processing the text.

- use **Random Forest** as the classifier with `n_estimators=50` and `criterion='entropy'`.
- print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [8]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('random_forest', RandomForestClassifier(n_estimators=50, criterion='entropy'))
])

#2. fit with x_train and y_train
clf.fit(x_train, y_train)

In [9]:
#3. get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.83      0.83      1945
           1       0.82      0.83      0.82      1855

    accuracy                           0.83      3800
   macro avg       0.83      0.83      0.83      3800
weighted avg       0.83      0.83      0.83      3800



In the report above, both classes (positive and negative sentiment) reach more than 80% precision, recall, and F1-score. This seems to be an acceptable performance.

**Exercise-2**

1. Using the sklearn `Pipeline` module, create a classification pipeline that classifies movie reviews as positive or negative.

**Note:**
- use CountVectorizer for pre-processing the text.
- use **KNN** as the classifier with n_neighbors of 10 and metric as 'euclidean'.
- print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html


In [10]:
#1. create a pipeline object
clf2 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('KNN', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])

#2. fit with x_train and y_train
clf2.fit(x_train, y_train)

In [11]:
#3. get the predictions for x_test and store them in y_pred2
y_pred2 = clf2.predict(x_test)

#4. print the classification report
print(classification_report(y_test, y_pred2))

              precision    recall  f1-score   support

           0       0.65      0.64      0.65      1945
           1       0.63      0.64      0.64      1855

    accuracy                           0.64      3800
   macro avg       0.64      0.64      0.64      3800
weighted avg       0.64      0.64      0.64      3800



Here all the metrics (precision, recall, and F1-score) are noticeably lower, at roughly 64%. Let's try one more classifier and then discuss why the performance varies so much.
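
One plausible contributor to the weaker KNN result, offered here as an assumption rather than something this notebook verifies, is that Euclidean distance on raw word counts is sensitive to document length. The sketch below uses three made-up documents (hypothetical, not from the dataset): a short positive review ends up closer to a short negative review than to a longer positive review that uses exactly the same words.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical documents chosen to isolate the effect of length
docs = [
    "great movie",                                      # short, positive
    "great movie great movie great movie great movie",  # long, positive, same words
    "awful movie",                                      # short, negative
]

X = CountVectorizer().fit_transform(docs)  # raw counts over ['awful', 'great', 'movie']
d = euclidean_distances(X)                 # pairwise distance matrix, shape (3, 3)

short_pos_to_long_pos = d[0, 1]   # sqrt(3**2 + 3**2), about 4.24
short_pos_to_short_neg = d[0, 2]  # sqrt(1**2 + 1**2), about 1.41

print(round(short_pos_to_long_pos, 2), round(short_pos_to_short_neg, 2))
```

With a vocabulary of many thousands of words and reviews of very different lengths, nearest neighbours chosen this way may reflect length more than sentiment. Whether that fully explains the gap seen here would need to be checked on the actual data.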

**Exercise-3**

1. Using the sklearn `Pipeline` module, create a classification pipeline that classifies movie reviews as positive or negative.

**Note:**
- use CountVectorizer for pre-processing the text.
- use **Multinomial Naive Bayes** as the classifier.
- print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html


In [12]:
#1. create a pipeline object
clf3 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('Multi NB', MultinomialNB())
])

#2. fit with x_train and y_train
clf3.fit(x_train, y_train)

In [13]:
#3. get the predictions for x_test and store them in y_pred3
y_pred3 = clf3.predict(x_test)

#4. print the classification report
print(classification_report(y_test, y_pred3))

              precision    recall  f1-score   support

           0       0.83      0.88      0.86      1945
           1       0.87      0.81      0.84      1855

    accuracy                           0.85      3800
   macro avg       0.85      0.85      0.85      3800
weighted avg       0.85      0.85      0.85      3800



With the MultinomialNB model, both classes (positive and negative sentiment) again reach more than 80% precision, recall, and F1-score. Its overall accuracy (0.85) is on par with, and slightly above, that of Random Forest (0.83).
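
The three exercises repeat the same pipeline with a different final step, so the comparison can also be written as a loop. The following is a minimal sketch under stated assumptions: the tiny `train_texts` and `test_texts` lists are made up so the block runs on its own, `random_state=0` is added only to make the Random Forest repeatable, and the resulting scores say nothing about the real dataset. On the notebook's data, `x_train`, `y_train`, `x_test`, and `y_test` would take their place.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hypothetical toy data (12 training reviews so that n_neighbors=10 is valid)
train_texts = [
    "great fun film", "loved this movie", "wonderful acting",
    "brilliant story", "really enjoyable", "superb and moving",
    "terrible boring film", "hated this movie", "awful acting",
    "dull story", "really bad", "weak and tedious",
]
train_labels = [1] * 6 + [0] * 6
test_texts = ["great story", "awful film"]
test_labels = [1, 0]

# Same classifier settings as in Exercises 1 to 3
models = {
    "random_forest": RandomForestClassifier(
        n_estimators=50, criterion="entropy", random_state=0
    ),
    "knn": KNeighborsClassifier(n_neighbors=10, metric="euclidean"),
    "multinomial_nb": MultinomialNB(),
}

results = {}
for name, estimator in models.items():
    # Identical vectorizer step for every model; only the classifier changes
    pipe = Pipeline([("vectorizer", CountVectorizer()), (name, estimator)])
    pipe.fit(train_texts, train_labels)
    results[name] = accuracy_score(test_labels, pipe.predict(test_texts))

for name, acc in results.items():
    print(f"{name}: accuracy = {acc:.2f}")
```

Keeping the vectorizer identical across models means any difference in the scores comes from the classifier alone, which is the comparison the three exercises are making.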