Bag of words: Exercises
- In this exercise, you will classify whether a given movie review is positive or negative.
- You will use Bag of Words to pre-process the text and then apply different classification algorithms.
- Scikit-learn's CountVectorizer provides a built-in implementation of Bag of Words.

In [69]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

About Data: IMDB Dataset
Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download

- This data consists of two columns:
  - review
  - sentiment
- Reviews are the statements given by users after watching the movie.
- The sentiment column tells whether the given review is positive or negative.

In [70]:
df = pd.read_csv('movies_sentiment_data.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [71]:
df.sentiment.value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [72]:
# Encode the sentiment label as a number: positive -> 1, negative -> 0
df['Category'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
df.head()

Unnamed: 0,review,sentiment,Category
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. <br /><br />The...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1


In [73]:
# Hold out 20% of the reviews for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(df.review, df.Category, test_size=0.2, random_state=42)

In [74]:
X_train[:4]

39087    That's what I kept asking myself during the ma...
30893    I did not watch the entire movie. I could not ...
45278    A touching love story reminiscent of In the M...
16398    This latter-day Fulci schlocker is a totally a...
Name: review, dtype: object

In [75]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((40000,), (10000,), (40000,), (10000,))
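
The split above is random, so the two classes need not be exactly balanced in the test set. If an exact 50/50 ratio in both splits is wanted, `train_test_split` accepts a `stratify` argument. The sketch below is an optional variation, not part of the exercise, and runs on a small made-up frame (`toy`) standing in for `df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced frame standing in for df
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "Category": [0, 1] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.review, toy.Category,
    test_size=0.2, random_state=42,
    stratify=toy.Category,  # preserve the class ratio in both splits
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```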

Exercise-1

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.
Note:

- use CountVectorizer for pre-processing the text.

- use Random Forest as the classifier with n_estimators as 50 and criterion as 'entropy'.

- print the classification report.

In [76]:
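# Fit a standalone vectorizer only to inspect the document-term matrix;
# the pipeline in the next cell creates and fits its own CountVectorizer.
v = CountVectorizer()
X_train_cv = v.fit_transform(X_train.values)
X_train_cv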

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 5455841 stored elements and shape (40000, 93003)>
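
The summary above shows why the matrix is stored in sparse form. The short calculation below reuses the three numbers printed in that output (stored elements, rows, columns) to work out how full the matrix is and how many distinct vocabulary terms an average review contains. The variable names are only illustrative.

```python
# Figures copied from the sparse-matrix summary printed above
stored = 5_455_841                 # non-zero entries
n_docs, n_terms = 40_000, 93_003   # reviews x vocabulary size

density = stored / (n_docs * n_terms)  # fraction of cells that are non-zero
avg_terms = stored / n_docs            # mean number of distinct terms per review

print(f"density: {density:.4%}")
print(f"average distinct terms per review: {avg_terms:.1f}")
```

Roughly 0.15% of the cells are non-zero, which works out to about 136 distinct terms per review out of 93,003 possible columns.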

In [77]:
clf = Pipeline([('vectorizer', CountVectorizer()),
                ('classifier', RandomForestClassifier(n_estimators=50, criterion='entropy'))
])

In [78]:
clf.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()),
                ('classifier',
                 RandomForestClassifier(criterion='entropy', n_estimators=50))])

In [79]:
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.85      0.84      4961
           1       0.85      0.84      0.84      5039

    accuracy                           0.84     10000
   macro avg       0.84      0.84      0.84     10000
weighted avg       0.84      0.84      0.84     10000
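
To make the columns of the report concrete, the sketch below computes precision and recall for the positive class by hand from a confusion matrix and checks them against scikit-learn's own functions. The labels are hypothetical and chosen only to keep the arithmetic easy to follow; they are not predictions from the model above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, not taken from the notebook's results
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 0, 0, 1, 1]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_pos = tp / (tp + fp)  # share of predicted positives that are truly positive
recall_pos = tp / (tp + fn)     # share of true positives that were found

print(precision_pos, precision_score(y_true, y_hat))  # 0.6 0.6
print(recall_pos, recall_score(y_true, y_hat))        # 0.75 0.75
```

The `support` column in the report is simply the number of true samples of each class in `y_test`.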



Exercise-2

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.
Note:

- use CountVectorizer for pre-processing the text.
- use KNN as the classifier with n_neighbors of 10 and metric as 'euclidean'.
- print the classification report.

In [None]:
clf_knn = Pipeline([('vectorizer', CountVectorizer()),
                ('knn', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])

In [None]:
clf_knn.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()),
                ('knn',
                 KNeighborsClassifier(metric='euclidean', n_neighbors=10))])


In [None]:
y_pred = clf_knn.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.66      0.65      0.66      4961
           1       0.66      0.66      0.66      5039

    accuracy                           0.66     10000
   macro avg       0.66      0.66      0.66     10000
weighted avg       0.66      0.66      0.66     10000
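
One plausible contributor to the weaker KNN score, stated here as an assumption that the notebook does not test, is that Euclidean distance on raw word counts is sensitive to document length. The toy example below uses made-up sentences to show a short positive text ending up closer to a short negative text than to a longer text with exactly the same words.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical documents, not from the IMDB dataset
docs = [
    "great film",                        # short, positive wording
    "great film great film great film",  # same words, three times longer
    "awful film",                        # short, negative wording
]

X = CountVectorizer().fit_transform(docs)
d = euclidean_distances(X)  # pairwise distance matrix, shape (3, 3)

print(f"short positive vs long positive:  {d[0, 1]:.3f}")
print(f"short positive vs short negative: {d[0, 2]:.3f}")
```

Here the distance to the longer document is larger even though its wording is identical, because the count vectors differ in magnitude rather than direction.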



Exercise-3

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.
Note:

- use CountVectorizer for pre-processing the text.
- use Multinomial Naive Bayes as the classifier.
- print the classification report.

In [None]:
clf_mnb = Pipeline([('vectorizer', CountVectorizer()),
                ('mnb', MultinomialNB())])

In [None]:
clf_mnb.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()), ('mnb', MultinomialNB())])


In [None]:
y_pred = clf_mnb.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.88      0.85      4961
           1       0.87      0.82      0.85      5039

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000
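
Because the vectorizer lives inside the pipeline, a fitted pipeline can be given raw strings directly at prediction time. The sketch below shows this on a tiny made-up training set (`toy_reviews` and `toy_labels` are hypothetical stand-ins for `X_train` and `y_train`), so it runs without the IMDB file.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training data: 1 = positive, 0 = negative
toy_reviews = [
    "great wonderful film",
    "loved this wonderful movie",
    "terrible boring film",
    "hated this awful movie",
]
toy_labels = [1, 1, 0, 0]

toy_clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("mnb", MultinomialNB()),
])
toy_clf.fit(toy_reviews, toy_labels)

# Raw text goes in; the pipeline vectorizes it with the fitted vocabulary
preds = toy_clf.predict(["wonderful", "awful boring"])
print(preds)  # [1 0]
```

The same call pattern applies to the pipelines fitted in the exercises, for example `clf_mnb.predict(["some new review text"])`.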

