# Tutorial - build MNB with sklearn

This tutorial demonstrates how to use the Sci-kit Learn (sklearn) package to build Multinomial Naive Bayes model, rank features, and use the model for prediction. 

The data from the Kaggle Sentiment Analysis on Movie Review Competition are used in this tutorial. Check out the details of the data and the competition on Kaggle.
https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews

The tutorial also includes sample code to prepare your prediction result for submission to Kaggle. Although the competition is over, you can still submit your prediction to get an evaluation score.

# Step 1: Read in data

In [18]:
# read in the training data

# the data set includes four columns: PhraseId, SentenceId, Phrase, Sentiment
# In this data set a sentence is further split into phrases 
# in order to build a sentiment classification model
# that can not only predict sentiment of sentences but also shorter phrases

# A data example:
# PhraseId SentenceId Phrase Sentiment
# 1 1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .1

# the Phrase column includes the training examples
# the Sentiment column includes the training labels
# "0" for very negative
# "1" for negative
# "2" for neutral
# "3" for positive
# "4" for very positive

import numpy as np
import pandas as p
import re

train=p.read_csv("C:\\Users\\rkrishnan\\Documents\\01 Personal\\MS\\IST 736\\Week6\\kaggle-sentiment\\kaggle-sentiment\\train.tsv", delimiter='\t')

y=train['Sentiment'].values
X=p.DataFrame()
X['Phrase']=train['Phrase'].apply(lambda x: " ".join(re.sub('n\'t',' not',x) for x in x.split()))
X=X['Phrase'].values

# Step 2: Split train/test data for hold-out test

In [19]:
# check the sklearn documentation for train_test_split
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
# "test_size" : float, int, None, optional
# If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. 
# If int, represents the absolute number of test samples. 
# If None, the value is set to the complement of the train size. 
# By default, the value is set to 0.25. The default will change in version 0.21. It will remain 0.25 only if train_size is unspecified, otherwise it will complement the specified train_size.    

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
print(X_train[0])
print(y_train[0])
print(X_test[0])
print(y_test[0])

(93636,) (93636,) (62424,) (62424,)
almost in a class with that of Wilde
3
escape movie
2


Sample output from the code above:

(93636,) (93636,) (62424,) (62424,)
almost in a class with that of Wilde
3
escape movie
2

# Step 2.1 Data Checking

In [20]:
# Check how many training examples in each category
# this is important to see whether the data set is balanced or skewed

unique, counts = np.unique(y_train, return_counts=True)
print(np.asarray((unique, counts)))

[[    0     1     2     3     4]
 [ 4141 16449 47718 19859  5469]]


The sample output shows that the data set is skewed with 47718/93636=51% "neutral" examples. All other categories are smaller.

{0, 1, 2, 3, 4}
[[    0  4141]
 [    1 16449]
 [    2 47718]
 [    3 19859]
 [    4  5469]]

# Step 3: Vectorization

In [49]:
# sklearn contains two vectorizers

# CountVectorizer can give you Boolean or TF vectors
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

# TfidfVectorizer can give you TF or TFIDF vectors
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

# Read the sklearn documentation to understand all vectorization options

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# several commonly used vectorizer setting

#  unigram boolean vectorizer, set minimum document frequency to 5
gram13_bool_vectorizer = CountVectorizer(encoding='latin-1',ngram_range=(1,3), binary=True, min_df=3, stop_words='english')


#  unigram tfidf vectorizer, set minimum document frequency to 5
gram13_tf_vectorizer = TfidfVectorizer(encoding='latin-1', ngram_range=(1,3),use_idf=False, min_df=3, stop_words='english')


## Step 3.1: Vectorize the training data

In [50]:
# The vectorizer can do "fit" and "transform"
# fit is a process to collect unique tokens into the vocabulary
# transform is a process to convert each document to vector based on the vocabulary
# These two processes can be done together using fit_transform(), or used individually: fit() or transform()

# fit vocabulary in training documents and transform the training documents into vectors
X_train_vec = gram13_bool_vectorizer.fit_transform(X_train)
X2_train_vec = gram13_tf_vectorizer.fit_transform(X_train)

# check the content of a document vector
print(X_train_vec.shape)
print(X2_train_vec.shape)
print(X_train_vec[0].toarray())
print(X2_train_vec[0].toarray())

# check the size of the constructed vocabulary
print(len(gram13_bool_vectorizer.vocabulary_))
print(len(gram13_tf_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(gram13_bool_vectorizer.vocabulary_.items())[:10])
print(list(gram13_tf_vectorizer.vocabulary_.items())[:10])

# check word index in vocabulary
print(gram13_bool_vectorizer.vocabulary_.get('imaginative'))
print(gram13_tf_vectorizer.vocabulary_.get('imaginative'))

(93636, 14113)
(93636, 14113)
[[0 0 0 ... 0 0 0]]
[[0. 0. 0. ... 0. 0. 0.]]
14113
14113
[('class', 2207), ('wilde', 13849), ('derring', 3309), ('chilling', 2103), ('affecting', 363), ('meanspirited', 7770), ('personal', 9061), ('low', 7463), ('involved', 6654), ('worth', 13991)]
[('class', 2207), ('wilde', 13849), ('derring', 3309), ('chilling', 2103), ('affecting', 363), ('meanspirited', 7770), ('personal', 9061), ('low', 7463), ('involved', 6654), ('worth', 13991)]
6195
6195


Sample output:

(93636, 11967)
[[0 0 0 ..., 0 0 0]]
11967
[('imaginative', 5224), ('tom', 10809), ('smiling', 9708), ('easy', 3310), ('diversity', 3060), ('impossibly', 5279), ('buy', 1458), ('sentiments', 9305), ('households', 5095), ('deteriorates', 2843)]
5224

## Step 3.2: Vectorize the test data

In [51]:
# use the vocabulary constructed from the training data to vectorize the test data. 
# Therefore, use "transform" only, not "fit_transform", 
# otherwise "fit" would generate a new vocabulary from the test data

X_test_vec = gram13_bool_vectorizer.transform(X_test)
X2_test_vec = gram13_tf_vectorizer.transform(X_test)

# print out #examples and #features in the test set
print(X_test_vec.shape)
print(X2_test_vec.shape)

(62424, 14113)
(62424, 14113)


Sample output:

(62424, 14324)

# Step 4: Train a MNB classifier

In [52]:
# import the MNB module
from sklearn.naive_bayes import MultinomialNB

# initialize the MNB model
nb_clf= MultinomialNB()
nb_clf2= MultinomialNB()

# use the training data to train the MNB model
nb_clf.fit(X_train_vec,y_train)
nb_clf2.fit(X2_train_vec,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

# Step 4.1 Interpret a trained MNB model

In [53]:
## interpreting naive Bayes models
## by consulting the sklearn documentation you can also find out feature_log_prob_, 
## which are the conditional probabilities
## http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

# the code below will print out the conditional prob of the word "worthless" in each category
# sample output
# -8.98942647599 -> logP('worthless'|very negative')
# -11.1864401922 -> logP('worthless'|negative')
# -12.3637684625 -> logP('worthless'|neutral')
# -11.9886066961 -> logP('worthless'|positive')
# -11.0504454621 -> logP('worthless'|very positive')
# the above output means the word feature "worthless" is indicating "very negative" 
# because P('worthless'|very negative) is the greatest among all conditional probs

gram13_bool_vectorizer.vocabulary_.get('worthless')
for i in range(0,5):
  print(nb_clf.feature_log_prob_[i][gram13_bool_vectorizer.vocabulary_.get('worthless')])

gram13_tf_vectorizer.vocabulary_.get('worthless')
for i in range(0,5):
  print(nb_clf2.feature_log_prob_[i][gram13_tf_vectorizer.vocabulary_.get('worthless')])

-8.60402598103152
-10.682066738534797
-11.877908798454563
-11.510241867341026
-10.686315460863419
-8.697119005821374
-10.208564336094662
-11.362984109746867
-10.867265879535756
-10.168804206399354


Sample output:

-8.5389826392
-10.6436375867
-11.8419845779
-11.4778370023
-10.6297551464

In [54]:
# sort the conditional probability for category 0 "very negative"
# print the words with highest conditional probs
# these can be words popular in the "very negative" category alone, or words popular in all cateogires

feature_ranks = sorted(zip(nb_clf.feature_log_prob_[0], gram13_bool_vectorizer.get_feature_names()))
very_negative_features = feature_ranks[-10:]
print(very_negative_features)

feature_ranks2 = sorted(zip(nb_clf2.feature_log_prob_[0], gram13_tf_vectorizer.get_feature_names()))
very_negative_features2 = feature_ranks2[-10:]
print(very_negative_features2)

[(-6.017336636933576, 'time'), (-6.006641347816829, 'minutes'), (-5.996059238486292, 'characters'), (-5.996059238486292, 'story'), (-5.97522515158345, 'comedy'), (-5.770812636975303, 'just'), (-5.266732401348844, 'like'), (-5.142764358626714, 'bad'), (-4.918724348265467, 'film'), (-4.392957144501278, 'movie')]
[(-6.475600982412516, 'does'), (-6.442282568848196, 'comedy'), (-6.439464260205117, 'time'), (-6.399933511246903, 'story'), (-6.252171108650741, 'worst'), (-6.187991455812202, 'just'), (-5.720173171370445, 'like'), (-5.355650138858941, 'film'), (-5.344139370002233, 'bad'), (-4.783066126023934, 'movie')]


# Exercise C

In [55]:
def show_most_and_least_informative_features(vectorizer,clf,class_idx=0,n=10):
    feature_names=vectorizer.get_feature_names()
    coefs_with_fns=sorted(zip(clf.coef_[class_idx],feature_names))
    top=zip(coefs_with_fns[:n],coefs_with_fns[-n:])
    for (coef_1,fn_1),(coef_2,fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1,fn_1,coef_2,fn_2))


# calculate log ratio of conditional probs

# In this exercise you will calculate the log ratio 
# between conditional probs in the "very negative" category
# and conditional probs in the "very positive" category,
# and then sort and print out the top and bottom 10 words

# the conditional probs for the "very negative" category is stored in nb_clf.feature_log_prob_[0]
# the conditional probs for the "very positive" category is stored in nb_clf.feature_log_prob_[4]

# You can consult with similar code in week 4's sample script on feature weighting
# Note that in sklearn's MultinomialNB the conditional probs have been converted to log values.

# Your code starts here
# Top 10 Very Negative features
feature_ranks = sorted(zip(nb_clf.feature_log_prob_[0], gram13_bool_vectorizer.get_feature_names()))
very_negative_features = feature_ranks[-10:]
print(very_negative_features)

feature_ranks2 = sorted(zip(nb_clf2.feature_log_prob_[0], gram13_tf_vectorizer.get_feature_names()))
very_negative_features2 = feature_ranks2[-10:]
print(very_negative_features2)

# Bottom 10 Very Negative features
feature_ranks = sorted(zip(nb_clf.feature_log_prob_[0], gram13_bool_vectorizer.get_feature_names()))
least_negative_features = feature_ranks[1:10:]
print(least_negative_features)

feature_ranks2 = sorted(zip(nb_clf2.feature_log_prob_[0], gram13_tf_vectorizer.get_feature_names()))
least_negative_features2 = feature_ranks2[1:10:]
print(least_negative_features2)


# Top 10 Very Positive features
feature_ranks = sorted(zip(nb_clf.feature_log_prob_[4], gram13_bool_vectorizer.get_feature_names()))
very_positive_features = feature_ranks[-10:]
print(very_positive_features)

feature_ranks2 = sorted(zip(nb_clf2.feature_log_prob_[4], gram13_tf_vectorizer.get_feature_names()))
very_positive_features2 = feature_ranks2[1:10:]
print(very_positive_features2)

# Top 10 Very Positive features
feature_ranks = sorted(zip(nb_clf.feature_log_prob_[4], gram13_bool_vectorizer.get_feature_names()))
least_positive_features = feature_ranks[-10:]
print(least_positive_features)

feature_ranks2 = sorted(zip(nb_clf2.feature_log_prob_[4], gram13_tf_vectorizer.get_feature_names()))
least_positive_features2 = feature_ranks2[1:10:]
print(least_positive_features2)

show_most_and_least_informative_features(gram13_bool_vectorizer,nb_clf,class_idx=0,n=10)
show_most_and_least_informative_features(gram13_bool_vectorizer,nb_clf,class_idx=4,n=10)
show_most_and_least_informative_features(gram13_tf_vectorizer,nb_clf,class_idx=0,n=10)
show_most_and_least_informative_features(gram13_tf_vectorizer,nb_clf,class_idx=4,n=10)
# Your code ends here

[(-6.017336636933576, 'time'), (-6.006641347816829, 'minutes'), (-5.996059238486292, 'characters'), (-5.996059238486292, 'story'), (-5.97522515158345, 'comedy'), (-5.770812636975303, 'just'), (-5.266732401348844, 'like'), (-5.142764358626714, 'bad'), (-4.918724348265467, 'film'), (-4.392957144501278, 'movie')]
[(-6.475600982412516, 'does'), (-6.442282568848196, 'comedy'), (-6.439464260205117, 'time'), (-6.399933511246903, 'story'), (-6.252171108650741, 'worst'), (-6.187991455812202, 'just'), (-5.720173171370445, 'like'), (-5.355650138858941, 'film'), (-5.344139370002233, 'bad'), (-4.783066126023934, 'movie')]
[(-10.549936130086833, '10th'), (-10.549936130086833, '127'), (-10.549936130086833, '129'), (-10.549936130086833, '12th'), (-10.549936130086833, '13th'), (-10.549936130086833, '14'), (-10.549936130086833, '15th'), (-10.549936130086833, '16'), (-10.549936130086833, '163')]
[(-10.064921450665372, '10th'), (-10.064921450665372, '127'), (-10.064921450665372, '129'), (-10.0649214506653

Sample output for print(log_ratios[0])

-0.838009538739

# Step 5: Test the MNB classifier

In [56]:
# test the classifier on the test data set, print accuracy score

nb_clf.score(X_test_vec,y_test)


0.6049756503908753

In [57]:
nb_clf2.score(X2_test_vec,y_test)

0.5803537101114956

In [58]:
# print confusion matrix (row: ground truth; col: prediction)

from sklearn.metrics import confusion_matrix
y_pred = nb_clf.fit(X_train_vec, y_train).predict(X_test_vec)
cm=confusion_matrix(y_test, y_pred, labels=[0,1,2,3,4])
print(cm)



[[  728  1334   751   106    12]
 [  621  4347  5175   648    33]
 [  265  2534 25297  3516   252]
 [   22   473  5371  6430   772]
 [    1    50   691  2032   963]]


In [59]:
y_pred2 = nb_clf2.fit(X2_train_vec, y_train).predict(X2_test_vec)
cm2=confusion_matrix(y_test, y_pred2, labels=[0,1,2,3,4])
print(cm2)

[[   70  1105  1685    71     0]
 [   40  2435  7997   352     0]
 [   10  1107 28773  1958    16]
 [    0   138  8058  4814    58]
 [    0     9  1424  2168   136]]


In [60]:
# print classification report

from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
print(precision_score(y_test, y_pred, average=None))
print(recall_score(y_test, y_pred, average=None))
print(precision_score(y_test, y_pred2, average=None))
print(recall_score(y_test, y_pred2, average=None))

from sklearn.metrics import classification_report
target_names = ['0','1','2','3','4']
print(classification_report(y_test, y_pred, target_names=target_names))
print(classification_report(y_test, y_pred2, target_names=target_names))

[0.44471594 0.49748226 0.6784766  0.5050267  0.47391732]
[0.24837939 0.40160754 0.79390535 0.49204163 0.25769334]
[0.58333333 0.50792657 0.6002253  0.51415145 0.64761905]
[0.02388263 0.22496305 0.90299397 0.36838078 0.03639283]
              precision    recall  f1-score   support

           0       0.44      0.25      0.32      2931
           1       0.50      0.40      0.44     10824
           2       0.68      0.79      0.73     31864
           3       0.51      0.49      0.50     13068
           4       0.47      0.26      0.33      3737

    accuracy                           0.60     62424
   macro avg       0.52      0.44      0.47     62424
weighted avg       0.59      0.60      0.59     62424

              precision    recall  f1-score   support

           0       0.58      0.02      0.05      2931
           1       0.51      0.22      0.31     10824
           2       0.60      0.90      0.72     31864
           3       0.51      0.37      0.43     13068
           4

# Step 5.1 Interpret the prediction result

In [42]:
## find the calculated posterior probability
posterior_probs = nb_clf.predict_proba(X_test_vec)
posterior_probs2= nb_clf2.predict_proba(X2_test_vec)

## find the posterior probabilities for the first test example
print(posterior_probs[0])
print(posterior_probs2[0])

# find the category prediction for the first test example
y_pred = nb_clf.predict(X_test_vec)
print(y_pred[0])

y_pred2 = nb_clf2.predict(X2_test_vec)
print(y_pred2[0])

# check the actual label for the first test example
print(y_test[0])

[0.03065103 0.16491852 0.75441523 0.03424136 0.01577386]
[0.02425665 0.17822079 0.66685966 0.10818064 0.02248227]
2
2
2


sample output array([ 0.06434628  0.34275846  0.50433091  0.07276319  0.01580115]

Because the posterior probability for category 2 (neutral) is the greatest, 0.50, the prediction should be "2". Because the actual label is also "2", this is a correct prediction


# Step 5.2 Error Analysis

In [45]:
# print out specific type of error for further analysis

# print out the very positive examples that are mistakenly predicted as negative
# according to the confusion matrix, there should be 53 such examples
# note if you use a different vectorizer option, your result might be different

err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i]==4 and y_pred[i]==0):
        print(X_test[i])
        err_cnt = err_cnt+1
print("errors:", err_cnt)

there 's no way you wo  not be talking about the film once you exit the theater .
two fine actors , Morgan Freeman and Ashley Judd
What John does is heroic , but we do  not condone it , '' one of the film 's stars recently said , a tortuous comment that perfectly illustrates the picture 's moral schizophrenia .
of healthy eccentric inspiration and ambition
exceptionally good idea
distinguished actor
Spy Kids a surprising winner with both adults and younger audiences
errors: 7


# Exercise D

In [46]:
# Can you find linguistic patterns in the above errors? 
# What kind of very positive examples were mistakenly predicted as negative?

# Negations are being considered as negative 
# words like grief and loss , evil are negative keywords but it's is non-negative in this contextual
# terrifically entertaining and spectacularly outrageous are highly positive in this context

# Can you write code to print out the errors that very negative examples were mistakenly predicted as very positive?
# Your code starts here
err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i]==0 and y_pred[i]==4):
        print(X_test[i])
        err_cnt = err_cnt+1
print("errors:", err_cnt)
# Your code ends here
# Can you find lingustic patterns for this kind of errors?


# Again negation is not linked to the right bucket like "not the great american comedy", "never achieve the popularity" and "may not be a great piece of filmmaking"
# phrases like "cleverly crafted but ultimately hollow mockumentary"  , "beautifully shot but dull and ankle-deep ` epic . '" are very negative in nature but categorized as positive

# Based on the above error analysis, what suggestions would you give to improve the current model?

# One option is to use ngram to train the model which can assess the sentiment score of a phrases rather than just using individual keywords.
# consider negation while training the model, negation followed by positive keywords and negation followed by negative keywaords are opposite in nature.

rude and profane
this is the opposite of a truly magical movie .
it would be a shame if this was your introduction to one of the greatest plays of the last 100 years
The most horrific movie experience
emotionally belittle a cinema classic .
But if the essence of magic is its make-believe promise of life that soars above the material realm , this is the opposite of a truly magical movie .
is not Edward Burns ' best film
It may not be a great piece of filmmaking ,
is the opposite of a truly magical movie .
would be a shame if this was your introduction to one of the greatest plays of the last 100 years
to this shocking testament to anti-Semitism and neo-fascism
errors: 11


# Step 6: write the prediction output to file

In [60]:
y_pred=nb_clf.predict(X_test_vec)
output = open('C:\\Users\\rkrishnan\\Documents\\01 Personal\\MS\\IST 736\\Week6\\kaggle-sentiment\\kaggle-sentiment\\prediction_output.csv', 'w')
for x, value in enumerate(y_pred):
  output.write(str(value) + '\n') 
output.close()

# Step 6.1 Prepare submission to Kaggle sentiment classification competition

In [61]:
########## submit to Kaggle submission

# we are still using the model trained on 60% of the training data
# you can re-train the model on the entire data set 
#   and use the new model to predict the Kaggle test data
# below is sample code for using a trained model to predict Kaggle test data 
#    and format the prediction output for Kaggle submission

# read in the test data
kaggle_test=p.read_csv("C:\\Users\\rkrishnan\\Documents\\01 Personal\\MS\\IST 736\\Week6\\kaggle-sentiment\\kaggle-sentiment\\test.tsv", delimiter='\t') 

# preserve the id column of the test examples
kaggle_ids=kaggle_test['PhraseId'].values

# read in the text content of the examples
kaggle_X_test=kaggle_test['Phrase'].values

# vectorize the test examples using the vocabulary fitted from the 60% training data
kaggle_X_test_vec=gram13_bool_vectorizer.transform(kaggle_X_test)
kaggle_X2_test_vec=gram13_tf_vectorizer.transform(kaggle_X_test)

# predict using the NB classifier that we built
kaggle_pred=nb_clf.fit(X_train_vec, y_train).predict(kaggle_X_test_vec)
kaggle_pred2=nb_clf2.fit(X2_train_vec, y_train).predict(kaggle_X2_test_vec)

# combine the test example ids with their predictions
kaggle_submission=zip(kaggle_ids, kaggle_pred)
kaggle_submission2=zip(kaggle_ids, kaggle_pred2)

# prepare output file
outf=open('C:\\Users\\rkrishnan\\Documents\\01 Personal\\MS\\IST 736\\Week6\\kaggle-sentiment\\kaggle-sentiment\\kaggle_submission.csv', 'w')
outf2=open('C:\\Users\\rkrishnan\\Documents\\01 Personal\\MS\\IST 736\\Week6\\kaggle-sentiment\\kaggle-sentiment\\kaggle_submission2.csv', 'w')

# write header
outf.write('PhraseId,Sentiment\n')
outf2.write('PhraseId,Sentiment\n')

# write predictions with ids to the output file
for x, value in enumerate(kaggle_submission): outf.write(str(value[0]) + ',' + str(value[1]) + '\n')
for x, value in enumerate(kaggle_submission2): outf2.write(str(value[0]) + ',' + str(value[1]) + '\n')

# close the output file
outf.close()
outf2.close()

In [42]:
kaggle_X_test_vec



<66292x11967 sparse matrix of type '<class 'numpy.int64'>'
	with 194704 stored elements in Compressed Sparse Row format>

In [43]:
X_train_vec

<93636x34579 sparse matrix of type '<class 'numpy.int64'>'
	with 501197 stored elements in Compressed Sparse Row format>

In [45]:
len(y_train)

93636

# Exercise E

In [None]:
# generate your Kaggle submissions with boolean representation and TF representation
# submit to Kaggle
# report your scores here
# which model gave better performance in the hold-out test
# which model gave better performance in the Kaggle test

Sample output:

(93636, 9968)
[[0 0 0 ..., 0 0 0]]
9968
[('disloc', 2484), ('surgeon', 8554), ('camaraderi', 1341), ('sketchiest', 7943), ('dedic', 2244), ('impud', 4376), ('adopt', 245), ('worker', 9850), ('buy', 1298), ('systemat', 8623)]
245

# BernoulliNB

In [30]:
from sklearn.naive_bayes import BernoulliNB
X_train_vec_bool = unigram_bool_vectorizer.fit_transform(X_train)
bernoulliNB_clf = BernoulliNB(X_train_vec_bool, y_train)

# Cross Validation

In [31]:
# cross validation

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
nb_clf_pipe = Pipeline([('vect', CountVectorizer(encoding='latin-1', binary=False)),('nb', MultinomialNB())])
scores = cross_val_score(nb_clf_pipe, X, y, cv=3)
avg=sum(scores)/len(scores)
print(avg)

0.559547456968


# Exercise F

In [33]:
# run 3-fold cross validation to compare the performance of 
# (1) BernoulliNB (2) MultinomialNB with TF vectors (3) MultinomialNB with boolean vectors

# Your code starts here


# Your code ends here

0.55315243657
0.553844611375
0.552306763002
0.560136963721


# Optional: use external linguistic resources such as stemmer

In [204]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk.stem

english_stemmer = nltk.stem.SnowballStemmer('english')
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([english_stemmer.stem(w) for w in analyzer(doc)])

stem_vectorizer = StemmedCountVectorizer(min_df=3, analyzer="word")
X_train_stem_vec = stem_vectorizer.fit_transform(X_train)

In [194]:
# check the content of a document vector
print(X_train_stem_vec.shape)
print(X_train_stem_vec[0].toarray())

# check the size of the constructed vocabulary
print(len(stem_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(stem_vectorizer.vocabulary_.items())[:10])

# check word index in vocabulary
print(stem_vectorizer.vocabulary_.get('adopt'))

(93636, 9968)
[[0 0 0 ..., 0 0 0]]
9968
[('disloc', 2484), ('surgeon', 8554), ('camaraderi', 1341), ('sketchiest', 7943), ('dedic', 2244), ('impud', 4376), ('adopt', 245), ('worker', 9850), ('buy', 1298), ('systemat', 8623)]
245
