## Naive Bayes Classification

Concepts in this Notebook:
- Naive Bayes Classifiers
- Train Test Split
- Vectorizer (fit and transform)
- Accuracy of training and test data
- Sparse array vs normal array
- Comparing classifiers in sklearn


Some extra resources:

- [SKL Naive Bayes Documentation](http://scikit-learn.org/stable/modules/naive_bayes.html)

- [Stanford Naive Bayes Math](http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf)

In [90]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 30)

In [91]:
critics = pd.read_csv('https://raw.githubusercontent.com/misrab/SG_DAT1/master/data/rt_critics.csv')

In [92]:
critics.quote[2]

'A winning animated feature that has something for everyone on the age spectrum.'

In [141]:
critics.head()

| Unnamed: 0 | critic | fresh | imdb | publication | quote | review_date | rtid | title |
|---|---|---|---|---|---|---|---|---|
| 0 | Derek Adams | fresh | 114709 | Time Out | So ingenious in concept, design and execution ... | 2009-10-04 | 9559 | Toy story |
| 1 | Richard Corliss | fresh | 114709 | TIME Magazine | The year's most inventive comedy. | 2008-08-31 | 9559 | Toy story |
| 2 | David Ansen | fresh | 114709 | Newsweek | A winning animated feature that has something ... | 2008-08-18 | 9559 | Toy story |
| 3 | Leonard Klady | fresh | 114709 | Variety | The film sports a provocative and appealing st... | 2008-06-09 | 9559 | Toy story |
| 4 | Jonathan Rosenbaum | fresh | 114709 | Chicago Reader | An entertaining computer-generated, hyperreali... | 2008-03-10 | 9559 | Toy story |


### Multinomial vs Bernoulli Models

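- The **Multinomial model** counts how many times each word occurs in the documents of a class, relative to all word occurrences in that class. It tends to do better with larger feature sets.
- The **Bernoulli model** only records whether each word is present in a document, and counts the documents of a class that contain it. It tends to do better with fewer features.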
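To make the difference concrete, here is a minimal sketch on a made-up count matrix (the data and variable names are illustrative, not taken from the critics dataset). It relies on `BernoulliNB` binarizing its input by default (`binarize=0.0`), so it should give the same result on raw counts and on presence flags, while `MultinomialNB` should not:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy data: rows are documents, columns are word counts
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 4, 1],
                     [0, 1, 2]])
y = np.array([1, 1, 0, 0])

# Presence/absence version of the same features
X_binary = (X_counts > 0).astype(int)

# BernoulliNB thresholds its input at 0 by default, so counts and
# presence flags lead to the same fitted model
bern_counts = BernoulliNB().fit(X_counts, y)
bern_binary = BernoulliNB().fit(X_binary, y)
same_bernoulli = np.allclose(bern_counts.predict_proba(X_counts),
                             bern_binary.predict_proba(X_binary))

# MultinomialNB uses the counts themselves, so repeated words matter
multi_counts = MultinomialNB().fit(X_counts, y)
multi_binary = MultinomialNB().fit(X_binary, y)
same_multinomial = np.allclose(multi_counts.predict_proba(X_counts),
                               multi_binary.predict_proba(X_binary))

print("Bernoulli unchanged by binarizing:   %s" % same_bernoulli)
print("Multinomial unchanged by binarizing: %s" % same_multinomial)
```

In other words, the Bernoulli model throws away how often a word appears, which is part of why the two models can behave differently on the same text.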

In [94]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB

### How the Count Vectorizer Works

In [95]:
# Demonstrate CountVectorizer on three short example texts

from sklearn.feature_extraction.text import CountVectorizer

text = ['Math is great', 'Math is really great', 'Exciting exciting Math']
print("Original texts: \n\t" + "\n\t".join(text))

# Count single words (unigrams) and pairs of adjacent words (bigrams)
CountV = CountVectorizer(ngram_range=(1,2))

# call 'fit' to build the vocabulary
CountV.fit(text)

Original texts: 
	Math is great
	Math is really great
	Exciting exciting Math


CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
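`fit` only learns the vocabulary; `transform` then counts words against that fixed vocabulary, so words that were never seen during `fit` are ignored. A small sketch of this behaviour, using made-up sentences and the default unigram settings (the names `vec`, `train_docs` and `new_counts` are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training texts used only to build the vocabulary
train_docs = ['Math is great', 'Math is really great']

vec = CountVectorizer()
vec.fit(train_docs)

# Mapping from each learned word to its column index (alphabetical order)
vocab = vec.vocabulary_
print(vocab)

# 'physics' was not seen during fit, so it gets no column and is dropped
new_counts = vec.transform(['Physics is great']).toarray()
print(new_counts)
```

This is why the same fitted vectorizer has to be reused when new text is scored later: the column positions only make sense relative to the vocabulary learned in `fit`.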

### Q: What is an ngram? - Follow along

In [96]:
# Newer scikit-learn releases (1.0+) provide get_feature_names_out() instead
CountV.get_feature_names()

[u'exciting',
 u'exciting exciting',
 u'exciting math',
 u'great',
 u'is',
 u'is great',
 u'is really',
 u'math',
 u'math is',
 u'really',
 u'really great']
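An n-gram is a run of n adjacent tokens. With `ngram_range=(1,2)` the vectorizer keeps single words and adjacent word pairs, which is why the feature names include both `math` and `math is`. A rough sketch of the idea, assuming plain whitespace tokenization (CountVectorizer's own tokenizer uses a regular expression and drops one-character tokens); `word_ngrams` is a hypothetical helper, not part of scikit-learn:

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, in order of length."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the sentence
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

ngrams = word_ngrams("Math is really great")
print(ngrams)
```

For this sentence the helper yields the four single words followed by the three adjacent pairs, matching the entries that the second example text contributes to the vocabulary.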

In [97]:
# Call 'transform' to convert the texts to a bag-of-words matrix
x = CountV.transform(text)
print(x)

  (0, 3)	1
  (0, 4)	1
  (0, 5)	1
  (0, 7)	1
  (0, 8)	1
  (1, 3)	1
  (1, 4)	1
  (1, 6)	1
  (1, 7)	1
  (1, 8)	1
  (1, 9)	1
  (1, 10)	1
  (2, 0)	2
  (2, 1)	1
  (2, 2)	1
  (2, 7)	1


In [98]:
# CountVectorizer returns a SciPy sparse matrix to save memory;
# toarray() converts it back to a regular (dense) NumPy array
x_back = x.toarray()
x_back

array([[0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0],
       [0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1],
       [2, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0]], dtype=int64)
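The sparse form stores only the non-zero entries together with their positions, which is what the `(row, column)  value` printout shows. A short sketch that rebuilds the dense array above as a SciPy CSR matrix and compares the two (the variable names are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

# The dense bag-of-words array from the example texts
dense = np.array([[0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0],
                  [0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1],
                  [2, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0]])

sparse = csr_matrix(dense)

stored = sparse.nnz    # number of non-zero entries actually stored
total = dense.size     # number of cells in the dense array
density = stored / float(total)
print("stored %d of %d cells (density %0.2f)" % (stored, total, density))

# Converting back gives the original dense array
round_trip = sparse.toarray()
```

On three short sentences the saving is small, but for thousands of reviews and a vocabulary of well over a hundred thousand n-grams almost every cell is zero, so a dense array would be far larger than the sparse one.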

## Preparing our Features (X) and Target (Y) for Training

X is a (nreview, nwords) array. Each row is the bag-of-words representation of a single review. This is the input to the model.

Y is an nreview-element array of 1s and 0s that encodes whether each review is Fresh (1) or Rotten (0). This is the desired output.

Now try creating X and Y yourself.

In [111]:
# Build the feature matrix X from the review quotes

# Instantiate the vectorizer with n-grams of length one or two
CountV = CountVectorizer(ngram_range=(1,2))

# Create a sparse matrix where each row is the bag-of-words for a single quote
X = CountV.fit_transform(critics.quote)
print(X)

  (0, 126031)	1
  (0, 70236)	1
  (0, 68218)	1
  (0, 30919)	1
  (0, 36918)	1
  (0, 6446)	2
  (0, 46504)	1
  (0, 136476)	1
  (0, 162696)	1
  (0, 32582)	1
  (0, 155773)	1
  (0, 74183)	1
  (0, 100464)	1
  (0, 108976)	1
  (0, 129252)	1
  (0, 124784)	1
  (0, 120073)	1
  (0, 130059)	1
  (0, 16454)	1
  (0, 43885)	1
  (0, 23678)	1
  (0, 75443)	1
  (0, 27068)	1
  (0, 126219)	1
  (0, 70242)	1
  :	:
  (14071, 145418)	1
  (14071, 110536)	1
  (14071, 39514)	1
  (14071, 44081)	1
  (14071, 146659)	1
  (14071, 141490)	1
  (14071, 74474)	1
  (14071, 7715)	1
  (14071, 37185)	1
  (14071, 39569)	1
  (14071, 61942)	1
  (14071, 98137)	1
  (14071, 1730)	1
  (14071, 59727)	1
  (14071, 70595)	1
  (14071, 131226)	1
  (14071, 70597)	1
  (14071, 44125)	1
  (14071, 59730)	1
  (14071, 145554)	1
  (14071, 131227)	1
  (14071, 1734)	1
  (14071, 110549)	1
  (14071, 37204)	1
  (14071, 71185)	1


In [137]:
# Create an array where each element encodes whether the review is Fresh (1) or Rotten (0)

Y = (critics.fresh == 'fresh').values.astype(int)
print(Y)

[1 1 1 ..., 1 1 1]


In [136]:
# Use scikit-learn's train_test_split (by default 75% train, 25% test)
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X, Y)
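Called without extra arguments, `train_test_split` shuffles the rows and holds out 25% of them for testing, so the exact split (and therefore the accuracy figures) changes from run to run. A minimal sketch on made-up data showing the default proportions and how passing `random_state` makes the split repeatable (the toy arrays and the seed value are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples, 2 features each
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Default test_size is 0.25, so 6 rows go to training and 2 to testing
a_train, a_test, b_train, b_test = train_test_split(X_toy, y_toy, random_state=42)
print(a_train.shape)
print(a_test.shape)

# Using the same seed again reproduces the same split
a_train2, a_test2, b_train2, b_test2 = train_test_split(X_toy, y_toy, random_state=42)
print(np.array_equal(a_test, a_test2))
```

Fixing the seed is mainly useful when comparing classifiers, since every model is then scored on the same held-out rows.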

## Creating the Classifier

In [152]:
# Vectorizer fitted on all quotes (the same vocabulary that was used to build X)
rotten_CV = CountV.fit(critics.quote)

# a few helper functions
def accuracy_report(_clf):
    # Score the classifier on the training set and on the held-out test set
    training_accuracy = _clf.score(xtrain, ytrain)
    test_accuracy = _clf.score(xtest, ytest)

    print("Accuracy: %0.2f%%" % (100 * test_accuracy))
    print("Accuracy on training data: %0.2f" % (training_accuracy))

# a function to run some tests
def AnalyzeReview(testquote, _clf):
    print("\"" + testquote + "\" is judged by classifier to be...")
    features = rotten_CV.transform([testquote])

    # Predict once and reuse the result (1 = fresh, 0 = rotten)
    prediction = _clf.predict(features)[0]
    if prediction == 1:
        print("... a fresh review.")
    else:
        print("... a rotten review.")
    return prediction

In [129]:
# Run Multinomial NB and report accuracy

from sklearn.naive_bayes import MultinomialNB

print("MultinomialNB")
clf_mn = MultinomialNB().fit(xtrain, ytrain)
accuracy_report(clf_mn)

MultinomialNB
Accuracy: 75.95%
Accuracy on training data: 0.99


In [132]:
# Likewise for Bernoulli NB

from sklearn.naive_bayes import BernoulliNB
print("BernoulliNB")
clf_bern = BernoulliNB().fit(xtrain, ytrain)
accuracy_report(clf_bern)

BernoulliNB
Accuracy: 65.89%
Accuracy on training data: 0.87


In [134]:
# Run Logistic Regression for comparison
from sklearn.linear_model import LogisticRegression
print("Logistic Regression")
clf_lr = LogisticRegression().fit(xtrain, ytrain)
accuracy_report(clf_lr)

Logistic Regression
Accuracy: 76.95%
Accuracy on training data: 1.00
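All three classifiers score much higher on the training data than on the test data, which suggests they are fitting the training reviews more closely than they generalize. Because scikit-learn estimators share the same `fit` / `score` interface, this comparison can also be written as a loop. A sketch under stated assumptions: the tiny corpus below is invented for illustration and is not the critics data, and the seed and split size are arbitrary:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy corpus (1 = fresh, 0 = rotten)
quotes = [
    "a winning and inventive comedy", "clever funny and full of heart",
    "a smart and charming adventure", "great fun for the whole family",
    "inventive witty and touching", "a charming and funny delight",
    "a dull and lifeless mess", "boring plot and flat jokes",
    "tedious clumsy and far too long", "a lifeless and boring sequel",
    "flat dull and forgettable", "a clumsy mess with no charm",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

X_toy = CountVectorizer().fit_transform(quotes)

# Stratify so both classes appear in the training and the test rows
x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, labels, test_size=0.25, stratify=labels, random_state=0)

classifiers = {
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "LogisticRegression": LogisticRegression(),
}

# Fit each model on the same split and record (train, test) accuracy
results = {}
for name, clf in classifiers.items():
    clf.fit(x_tr, y_tr)
    results[name] = (clf.score(x_tr, y_tr), clf.score(x_te, y_te))
    print("%-20s train: %0.2f  test: %0.2f" % ((name,) + results[name]))
```

With so few documents the numbers themselves mean little; the point is the pattern of reporting both scores side by side for every model.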


In [154]:
AnalyzeReview("The year's most inventive comedy.", clf_mn)
AnalyzeReview("The year's most inventive comedy.", clf_bern)
AnalyzeReview("The year's most inventive comedy.", clf_lr)

"The year's most inventive comedy." is judged by classifier to be...
... a fresh review.
"The year's most inventive comedy." is judged by classifier to be...
... a fresh review.
"The year's most inventive comedy." is judged by classifier to be...
... a fresh review.


1