# Multinomial Naive Bayes Graded Questions

In this segment, you will use the IMDB movie reviews dataset to classify reviews as 'positive' or 'negative'. We have divided the data into training and test sets. The training set contains 800 positive and 800 negative movie reviews, whereas the test set contains 200 positive and 200 negative movie reviews.

 

This was one of the first widely available sentiment analysis datasets, compiled by Pang and Lee. The data was first collected in 2002; however, the text is similar to the movie reviews you find on IMDB today. The dataset is in CSV format. It has two categories: Pos (reviews that express a positive or favourable sentiment) and Neg (reviews that express a negative or unfavourable sentiment). For this exercise, we will assume that all reviews are either positive or negative; there are no neutral reviews.

 

You will need to build a Multinomial Naive Bayes classification model in Python for solving the questions.

Please find the imdb_train dataset here and imdb_test dataset here.

 

Note: Tag negative (Neg) as 0 and positive (Pos) as 1.

Please answer the following questions based on the model you build using the above datasets:


The notebook is divided into the following sections:
1. Importing and preprocessing data
2. Building the model: Multinomial Naive Bayes

### 1. Importing and Preprocessing Data

In [188]:
import numpy as np
import pandas as pd
import sklearn

# training and test data
train_docs = pd.read_csv('movie_review_train.csv') 
test_docs = pd.read_csv('movie_review_test.csv') 
train_docs

Unnamed: 0,class,text
0,Pos,a common complaint amongst film critics is ...
1,Pos,whew this film oozes energy the kind of b...
2,Pos,steven spielberg s amistad which is bas...
3,Pos,he has spent his entire life in an awful litt...
4,Pos,being that it is a foreign language film with...
...,...,...
1595,Neg,if anything stigmata should be taken as...
1596,Neg,john boorman s zardoz is a goofy cinemati...
1597,Neg,the kids in the hall are an acquired taste ...
1598,Neg,there was a time when john carpenter was a gr...


In [189]:
# convert label to a numerical variable
train_docs['class'] = train_docs["class"].map({'Pos':1, 'Neg':0})
test_docs['class'] = test_docs["class"].map({'Pos':1, 'Neg':0})
train_docs

Unnamed: 0,class,text
0,1,a common complaint amongst film critics is ...
1,1,whew this film oozes energy the kind of b...
2,1,steven spielberg s amistad which is bas...
3,1,he has spent his entire life in an awful litt...
4,1,being that it is a foreign language film with...
...,...,...
1595,0,if anything stigmata should be taken as...
1596,0,john boorman s zardoz is a goofy cinemati...
1597,0,the kids in the hall are an acquired taste ...
1598,0,there was a time when john carpenter was a gr...


Let's now split the dataframe into features (X) and labels (y).

In [190]:
# convert the df to a numpy array 
train_array = train_docs.values
test_array = test_docs.values
# split X and y
X_train = train_array[:,1]
y_train = train_array[:,0]
y_train = y_train.astype('int') # sklearn needs y as integers
X_test = test_array[:,1]
y_test = test_array[:,0]
y_test = y_test.astype('int') # sklearn needs y as integers

print("X_train")
print(X_train)
print("y_train")
print(y_train)

X_train
[' a common complaint amongst film critics is   why aren t there more literate scripts available      quiz show gives signs of hope that the art of writing isn t dead in hollywood and that we need not only look to independent films for thoughtful content    paul attanasio s script takes what could have been a tepid thriller   the quiz show scandals of the late 50s   and delivers a telling parable about the emptiness of the post war american dream and the golden bubble that surrounds and protects tv networks and their sponsors    the film is riddled with telling symbols   e   g    a  58 chrysler   a radio announcement of sputnik   but is never heavy handed    deft direction by robert redford and keen performances by ralph fiennes   john turturro and rob morrow dovetail perfectly with the carefully honed script    redford departs from the usually overlight     cable tv quality   sets and camera work so common in recent 20th century period pieces    quiz show perfectly captures th
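The positional slicing above depends on the column order of the CSV. As a hedged alternative, the sketch below selects columns by name, assuming the columns are called `class` and `text` as in the displayed frame. The toy frame `toy_df` is hypothetical and only stands in for `train_docs`.

```python
import pandas as pd

# Hypothetical stand-in for train_docs after the labels were mapped to 0/1
toy_df = pd.DataFrame({
    "class": [1, 0],
    "text": ["a fine film", "a dull film"],
})

# Select by column name so a reordered CSV cannot swap X and y
toy_X = toy_df["text"].to_numpy()
toy_y = toy_df["class"].to_numpy().astype(int)  # integer labels for sklearn

print(toy_X)
print(toy_y)
```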

### Creating the Bag of Words Representation

We now have to convert the data into a format which can be used for training the model. We'll use the **bag of words representation** for each sentence (document).

Imagine breaking X into individual words and putting them all in a bag. Then we pick all the unique words from the bag one by one and make a dictionary of unique words. 

This is called **vectorization of words**. We have the class ```CountVectorizer()``` in scikit-learn to vectorize the words. 
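To make the idea concrete, here is a minimal sketch on a hypothetical two-sentence corpus (`toy_docs` is not part of the IMDB data). It shows the dictionary that ```CountVectorizer()``` builds: each unique token is mapped to a column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the IMDB data
toy_docs = [
    "a great movie with a great cast",
    "a dull movie",
]

toy_vec = CountVectorizer()
toy_vec.fit(toy_docs)

# vocabulary_ maps each unique token to its column index;
# the indices follow the alphabetical order of the tokens
toy_vocab = dict(sorted(toy_vec.vocabulary_.items()))
print(toy_vocab)
```

Note that the default tokenizer only keeps tokens of two or more characters, so the single-letter word "a" does not appear in the vocabulary.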


In [191]:
# create an object of CountVectorizer() class 
from sklearn.feature_extraction.text import CountVectorizer 
# help(CountVectorizer)

In [192]:
vec = CountVectorizer()
vec1 = CountVectorizer()

In [193]:
# fit vec on the training data and vec1 on the test data
vec.fit(X_train)
vec1.fit(X_test)
# only the last expression of a cell is displayed, so this shows the test vocabulary
vec1.vocabulary_

{'films': 6509,
 'adapted': 383,
 'from': 6950,
 'comic': 3365,
 'books': 2022,
 'have': 7917,
 'had': 7729,
 'plenty': 12901,
 'of': 11899,
 'success': 16754,
 'whether': 18978,
 'they': 17384,
 're': 13841,
 'about': 235,
 'superheroes': 16840,
 'batman': 1551,
 'superman': 16845,
 'spawn': 16129,
 'or': 12021,
 'geared': 7140,
 'toward': 17680,
 'kids': 9570,
 'casper': 2643,
 'the': 17338,
 'arthouse': 1076,
 'crowd': 4087,
 'ghost': 7226,
 'world': 19223,
 'but': 2415,
 'there': 17369,
 'never': 11575,
 'really': 13877,
 'been': 1619,
 'book': 2018,
 'like': 10027,
 'hell': 8025,
 'before': 1625,
 'for': 6752,
 'starters': 16380,
 'it': 9177,
 'was': 18834,
 'created': 3991,
 'by': 2431,
 'alan': 598,
 'moore': 11183,
 'and': 790,
 'eddie': 5457,
 'campbell': 2504,
 'who': 19020,
 'brought': 2269,
 'medium': 10764,
 'to': 17558,
 'whole': 19024,
 'new': 11579,
 'level': 9964,
 'in': 8656,
 'mid': 10916,
 '80s': 183,
 'with': 19143,
 '12': 17,
 'part': 12407,
 'series': 15286,
 'ca

```CountVectorizer()``` has converted the documents into a vocabulary of unique words, sorted alphabetically and indexed.


**Stop Words**

We can see a few trivial words such as 'and', 'is', 'of', etc. These words don't really make any difference in classifying a document. These are called **stop words**, so we would like to get rid of them. 

We can remove them by passing the parameter stop_words='english' while instantiating ```CountVectorizer()``` as follows: 

In [194]:
# fitting the vectorizer on training data again
# removing the stop words this time
vec = CountVectorizer(stop_words='english')
vec.fit(X_train)
vec.vocabulary_

vec1 = CountVectorizer(stop_words='english')
vec1.fit(X_test)
vec1.vocabulary_

{'films': 6420,
 'adapted': 380,
 'comic': 3307,
 'books': 1972,
 'plenty': 12710,
 'success': 16536,
 'superheroes': 16621,
 'batman': 1518,
 'superman': 16626,
 'spawn': 15912,
 'geared': 7037,
 'kids': 9437,
 'casper': 2586,
 'arthouse': 1046,
 'crowd': 4027,
 'ghost': 7122,
 'world': 18934,
 'really': 13683,
 'book': 1968,
 'like': 9890,
 'hell': 7915,
 'starters': 16163,
 'created': 3931,
 'alan': 591,
 'moore': 11038,
 'eddie': 5387,
 'campbell': 2449,
 'brought': 2217,
 'medium': 10622,
 'new': 11420,
 'level': 9827,
 'mid': 10774,
 '80s': 183,
 '12': 17,
 'series': 15086,
 'called': 2423,
 'watchmen': 18586,
 'say': 14745,
 'thoroughly': 17180,
 'researched': 14084,
 'subject': 16490,
 'jack': 9071,
 'ripper': 14334,
 'saying': 14747,
 'michael': 10763,
 'jackson': 9076,
 'starting': 16164,
 'look': 10055,
 'little': 9973,
 'odd': 11721,
 'graphic': 7382,
 'novel': 11606,
 '500': 154,
 'pages': 12092,
 'long': 10048,
 'includes': 8567,
 'nearly': 11345,
 '30': 129,
 'consist': 

In [195]:
# printing feature names
print(vec.get_feature_names_out())
print(len(vec.get_feature_names_out()))

['00' '000' '007' ... 'zus' 'zwick' 'zwigoff']
35858


In [196]:
# printing feature names
print(vec1.get_feature_names_out())
print(len(vec1.get_feature_names_out()))

['00' '000' '0009f' ... 'zwigoff' 'zycie' 'zzzzzzz']
19150


Use CountVectorizer(stop_words='english', min_df=.03, max_df=.8) to create a new vocabulary 

In [197]:
# fitting the vectorizer on training data again
# removing the stop words and filtering by document frequency this time
vec = CountVectorizer(stop_words='english',min_df=.03, max_df=.8)
vec.fit(X_train)
vec.vocabulary_

{'common': 264,
 'critics': 323,
 'aren': 78,
 'available': 101,
 'gives': 618,
 'hope': 693,
 'art': 81,
 'writing': 1632,
 'isn': 753,
 'dead': 342,
 'hollywood': 690,
 'need': 970,
 'look': 853,
 'films': 549,
 'content': 287,
 'paul': 1037,
 'script': 1248,
 'takes': 1429,
 'thriller': 1471,
 'late': 805,
 'delivers': 357,
 'telling': 1449,
 'post': 1089,
 'war': 1571,
 'american': 59,
 'dream': 413,
 'tv': 1512,
 'radio': 1145,
 'heavy': 673,
 'direction': 386,
 'robert': 1205,
 'performances': 1043,
 'john': 766,
 'rob': 1204,
 'perfectly': 1041,
 'usually': 1537,
 'quality': 1136,
 'sets': 1272,
 'camera': 192,
 'work': 1618,
 'recent': 1165,
 'century': 213,
 'period': 1044,
 'pieces': 1056,
 'years': 1638,
 'old': 1001,
 'images': 716,
 'true': 1502,
 'era': 464,
 'generation': 605,
 'gone': 625,
 '15': 4,
 'world': 1623,
 'themes': 1461,
 'good': 626,
 'life': 831,
 'family': 508,
 'match': 894,
 'father': 521,
 'fame': 506,
 'audience': 99,
 'appear': 72,
 'familiar': 507,
 

In [198]:
# printing feature names
print(vec.get_feature_names_out())
print(len(vec.get_feature_names_out()))

['000' '10' '100' ... 'york' 'young' 'younger']
1643


So our final dictionary is made of 1643 words (after discarding the stop words and applying the document-frequency limits). Now, to do classification, we need to represent all the documents with these words (or tokens) as features. 

Every document will be converted into a *feature vector* holding the count of each of these words in that document. Let's convert each of our training documents into a feature vector.

In [199]:
# transform the training documents into a sparse matrix of word counts
X_transformed = vec.transform(X_train)
X_transformed

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 217396 stored elements and shape (1600, 1643)>

In [200]:
# transform the test documents using the vocabulary learned from training  # ANSWER Graded
X_transformed1 = vec.transform(X_test)
X_transformed1

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 51663 stored elements and shape (400, 1643)>

You can see X_transformed is a 1600 x 1643 **sparse matrix**. It has one row for each of our 1600 training documents and one column for each of the 1643 words in the dictionary which we just created. Likewise, X_transformed1 has 400 rows, one per test document, and the same 1643 columns.

Suppose we build the vocabulary from the training data using CountVectorizer(stop_words='english', min_df=.03, max_df=.8) and then transform the test data with that same fitted vectorizer. How many nonzero entries are there in the sparse matrix (corresponding to the test data)? 

Note: Test data is provided in a separate CSV file.
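The count of nonzero entries is reported as "stored elements" when the sparse matrix is displayed. It can also be read from the `nnz` attribute of the SciPy sparse matrix that `transform` returns. The sketch below shows this on hypothetical toy data rather than on the IMDB files.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["good film good cast", "bad film"]  # hypothetical training docs
toy_test = ["good film", "ugly sequel"]          # hypothetical test docs

# Vocabulary comes from the training documents only
toy_vec = CountVectorizer().fit(toy_train)
toy_test_matrix = toy_vec.transform(toy_test)

print(toy_test_matrix.shape)  # (number of test docs, vocabulary size)
print(toy_test_matrix.nnz)    # number of stored (nonzero) entries
```

The second test document contains no known words, so it contributes an all-zero row and adds nothing to `nnz`.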


Graded Question:

Train a Bernoulli Naive Bayes model on the training set and predict the classes of the test set. Each movie review in the test set has been labelled as 'Pos' or 'Neg'. What is the accuracy of the model?

Note - Dictionary should be prepared using CountVectorizer(stop_words='english', min_df=.03, max_df=.8)
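Both models can be trained on the same count matrix. Multinomial Naive Bayes uses how often each word occurs, whereas Bernoulli Naive Bayes only uses whether it occurs: `BernoulliNB` binarizes its input at the threshold `binarize`, which defaults to 0.0. The sketch below checks this on a hypothetical count matrix `toy_X`, not on the IMDB data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical count matrix: 4 documents, 3 vocabulary words
toy_X = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 0],
])
toy_y = np.array([1, 1, 0, 0])

toy_mnb = MultinomialNB().fit(toy_X, toy_y)
toy_bnb = BernoulliNB().fit(toy_X, toy_y)  # counts > 0 are treated as 1

# Fitting on an explicit 0/1 matrix gives the same Bernoulli model
toy_X_binary = (toy_X > 0).astype(int)
toy_bnb_binary = BernoulliNB().fit(toy_X_binary, toy_y)

new_doc = np.array([[5, 0, 0]])
print(toy_mnb.predict(new_doc), toy_bnb.predict(new_doc))
```

Because of the built-in binarization, the count matrix from ```CountVectorizer()``` can be passed to `BernoulliNB` without converting it first.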

In [202]:
# building a multinomial NB model
from sklearn.naive_bayes import MultinomialNB

# instantiate NB class
mnb=MultinomialNB()

# fitting the model on training data
mnb.fit(X_transformed, y_train)

# note that we are using the sparse matrix X_transformed, 
# though you can also use the non-sparse version
# mnb.fit(X_transformed.toarray(), y_train) 

# predict class
y_pred_class = mnb.predict(X_transformed1)

# predict probabilities
y_pred_proba = mnb.predict_proba(X_transformed1)


In [203]:
# printing the overall accuracy
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.8275

In [204]:
from sklearn.naive_bayes import BernoulliNB

# instantiating bernoulli NB class
bnb=BernoulliNB()

# fitting the model
bnb.fit(X_transformed, y_train)

# also works
# bnb.fit(X_transformed.toarray(), y_train)

# predicting classes of test data
y_pred_class_bnb = bnb.predict(X_transformed1)

# predicting probabilities of test data
prob_bnb = bnb.predict_proba(X_transformed1)
prob_bnb

array([[9.48171309e-03, 9.90518287e-01],
       [6.49769619e-03, 9.93502304e-01],
       [3.94902967e-01, 6.05097033e-01],
       [2.04992401e-12, 1.00000000e+00],
       [9.98827241e-01, 1.17275875e-03],
       [1.78941734e-12, 1.00000000e+00],
       [5.94773559e-07, 9.99999405e-01],
       [2.02838492e-03, 9.97971615e-01],
       [9.17950086e-01, 8.20499136e-02],
       [9.44944057e-02, 9.05505594e-01],
       [9.97389573e-04, 9.99002610e-01],
       [8.55906408e-04, 9.99144094e-01],
       [3.30288186e-03, 9.96697118e-01],
       [2.50950625e-02, 9.74904938e-01],
       [7.32820406e-05, 9.99926718e-01],
       [5.14487298e-01, 4.85512702e-01],
       [9.77854644e-01, 2.21453565e-02],
       [2.42691437e-09, 9.99999998e-01],
       [9.44024181e-03, 9.90559758e-01],
       [9.91992498e-01, 8.00750156e-03],
       [9.66449102e-01, 3.35508975e-02],
       [4.86743623e-03, 9.95132564e-01],
       [9.99484407e-01, 5.15592846e-04],
       [2.34114640e-08, 9.99999977e-01],
       [1.307296

### Model Evaluation

In [205]:
# printing the overall accuracy
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class_bnb)

0.79

In [206]:
# confusion matrix
metrics.confusion_matrix(y_test, y_pred_class_bnb)
# help(metrics.confusion_matrix)

array([[177,  23],
       [ 61, 139]], dtype=int64)

In [207]:
confusion = metrics.confusion_matrix(y_test, y_pred_class_bnb)
print(confusion)
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
TP = confusion[1, 1]

[[177  23]
 [ 61 139]]


In [208]:
sensitivity = TP / float(FN + TP)
print("sensitivity",sensitivity)

sensitivity 0.695


In [209]:
specificity = TN / float(TN + FP)
print("specificity",specificity)

specificity 0.885


In [210]:
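precision = TP / float(TP + FP)
print("precision",precision)
# note: y_pred_class holds the Multinomial NB predictions, so the score below
# is for that model; the manual value above is for Bernoulli NB
print(metrics.precision_score(y_test, y_pred_class))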

precision 0.8580246913580247
0.8502673796791443


In [211]:
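# the first line is the Bernoulli NB precision computed from the confusion matrix;
# the sklearn scores use y_pred_class, i.e. the Multinomial NB predictions
print("precision",precision)
print("PRECISION SCORE :",metrics.precision_score(y_test, y_pred_class))
print("RECALL SCORE :", metrics.recall_score(y_test, y_pred_class))
print("F1 SCORE :",metrics.f1_score(y_test, y_pred_class))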

precision 0.8580246913580247
PRECISION SCORE : 0.8502673796791443
RECALL SCORE : 0.795
F1 SCORE : 0.8217054263565892
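For reference, all of the Bernoulli NB metrics can be derived from the four confusion matrix counts alone. The sketch below copies the counts from the confusion matrix printed earlier and recomputes each metric by hand; the F1 line is an addition that the notebook does not compute for this model.

```python
import numpy as np

# Counts copied from the Bernoulli NB confusion matrix printed earlier;
# rows are true classes (0 = Neg, 1 = Pos), columns are predicted classes
confusion = np.array([[177, 23],
                      [61, 139]])
tn, fp, fn, tp = confusion.ravel()

accuracy = (tp + tn) / confusion.sum()
sensitivity = tp / (tp + fn)  # also called recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print("accuracy", accuracy)
print("sensitivity", sensitivity)
print("specificity", specificity)
print("precision", precision)
print("f1", f1)
```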
