## Naive Bayes Classification

Concepts in this Notebook:
- Naive Bayes Classifiers
- Train Test Split
- Vectorizer (fit and transform)
- Accuracy of training and test data
- Sparse array vs normal array
- Comparing classifiers in sklearn


Some extra resources:

- [SKL Naive Bayes Documentation](http://scikit-learn.org/stable/modules/naive_bayes.html)

- [Stanford Naive Bayes Math](http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf)

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 30)

In [2]:
critics = pd.read_csv('https://raw.githubusercontent.com/gfleetwood/fall-2014-lessons/master/datasets/rt_critics.csv')

In [3]:
critics.quote[2]

'A winning animated feature that has something for everyone on the age spectrum.'

In [4]:
critics.head()

| Unnamed: 0 | critic | fresh | imdb | publication | quote | review_date | rtid | title |
|---|---|---|---|---|---|---|---|---|
| 0 | Derek Adams | fresh | 114709 | Time Out | So ingenious in concept, design and execution ... | 2009-10-04 | 9559 | Toy story |
| 1 | Richard Corliss | fresh | 114709 | TIME Magazine | The year's most inventive comedy. | 2008-08-31 | 9559 | Toy story |
| 2 | David Ansen | fresh | 114709 | Newsweek | A winning animated feature that has something ... | 2008-08-18 | 9559 | Toy story |
| 3 | Leonard Klady | fresh | 114709 | Variety | The film sports a provocative and appealing st... | 2008-06-09 | 9559 | Toy story |
| 4 | Jonathan Rosenbaum | fresh | 114709 | Chicago Reader | An entertaining computer-generated, hyperreali... | 2008-03-10 | 9559 | Toy story |


### Multinomial vs Bernoulli Models

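- The **Multinomial model** estimates a word's probability from how often the word occurs, relative to all word occurrences in the class, so repeated words count more than once. It tends to work better with a larger number of features.
- The **Bernoulli model** only records whether a word is present in each document, and estimates the fraction of documents in the class that contain it. It tends to work better with fewer features.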
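
The two estimates described above can be sketched in plain Python. This is a simplified illustration, not how sklearn implements them: the helper names `multinomial_likelihood` and `bernoulli_likelihood` are made up for this example, tokens are split on whitespace, and no smoothing is applied (sklearn's `MultinomialNB` and `BernoulliNB` apply additive smoothing by default).

```python
from collections import Counter

def multinomial_likelihood(docs, word):
    """Share of all word occurrences in `docs` that are `word`."""
    counts = Counter(token for doc in docs for token in doc.lower().split())
    total = sum(counts.values())
    return counts[word] / float(total)

def bernoulli_likelihood(docs, word):
    """Share of documents in `docs` that contain `word` at least once."""
    containing = sum(1 for doc in docs if word in doc.lower().split())
    return containing / float(len(docs))

# Toy "class" of three documents (illustrative only)
docs = ['Math is great', 'Math is really great', 'Exciting exciting Math']

# 'exciting' occurs twice among 10 tokens, but in only 1 of 3 documents
p_multinomial = multinomial_likelihood(docs, 'exciting')
p_bernoulli = bernoulli_likelihood(docs, 'exciting')
```

The repeated word pushes the multinomial estimate up, while the Bernoulli estimate only sees that one document contains it.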

In [5]:
# TODO - import both versions of naive bayes from sklearn

### How the Count Vectorizer Works

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

text = ['Math is great', 'Math is really great', 'Exciting exciting Math']
print "Original text:\n\t", '\n\t'.join(text)

# TODO - create the instance of CountVectorizer class. Specify the ngram_range argument (see docs)

# TODO - call `fit` on the text to build the vocabulary


Original text:
	Math is great
	Math is really great
	Exciting exciting Math


### Q: What is an ngram?

### A: 

In [8]:
# Display the names of the features (the n-grams in the vocabulary)
vectorizer.get_feature_names()

[u'exciting',
 u'exciting exciting',
 u'exciting math',
 u'great',
 u'is',
 u'is great',
 u'is really',
 u'math',
 u'math is',
 u'really',
 u'really great']
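
The feature names above can be reproduced by hand for this toy corpus. The sketch below is a rough approximation under stated assumptions: `ngrams` is a hypothetical helper, and it lowercases and splits on whitespace, whereas `CountVectorizer` uses its own tokenizer (which, by default, also drops single-character tokens).

```python
def ngrams(text, n_min=1, n_max=2):
    """Return every run of n_min..n_max consecutive words in `text`."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

text = ['Math is great', 'Math is really great', 'Exciting exciting Math']

# The vocabulary is the sorted set of all unigrams and bigrams in the corpus
vocabulary = sorted(set(gram for doc in text for gram in ngrams(doc)))
```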

In [3]:
# TODO call `transform` to convert text to a bag of words
# x = ...
print x

In [11]:
# CountVectorizer uses a sparse array to save memory, but it's easier in this assignment to 
# convert back to a "normal" numpy array
x_back = x.toarray()
x_back

array([[0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0],
       [0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1],
       [2, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0]])

In [12]:
print "Transformed text vector is \n", x

# `get_feature_names` tracks which word is associated with each column of the transformed x
print
print "Words for each feature:"
print vectorizer.get_feature_names()

# Notice that the bag-of-words treatment doesn't preserve information about the *order* of words
# (beyond the adjacent pairs captured by the bigrams), just their frequency

Transformed text vector is 
  (0, 3)	1
  (0, 4)	1
  (0, 5)	1
  (0, 7)	1
  (0, 8)	1
  (1, 3)	1
  (1, 4)	1
  (1, 6)	1
  (1, 7)	1
  (1, 8)	1
  (1, 9)	1
  (1, 10)	1
  (2, 0)	2
  (2, 1)	1
  (2, 2)	1
  (2, 7)	1

Words for each feature:
[u'exciting', u'exciting exciting', u'exciting math', u'great', u'is', u'is great', u'is really', u'math', u'math is', u'really', u'really great']
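
The two printouts of `x` hold the same information. As a rough illustration of the idea, the sketch below stores only the non-zero cells in a plain dictionary and rebuilds the dense rows from it. The helpers `to_sparse` and `to_dense` are made up for this example; scipy's sparse matrices use more compact layouts than a dictionary.

```python
# Dense rows copied from the `x.toarray()` output above
dense = [[0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0],
         [0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1],
         [2, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0]]

def to_sparse(rows):
    """Keep only the non-zero cells as {(row, column): count}."""
    return {(i, j): value
            for i, row in enumerate(rows)
            for j, value in enumerate(row)
            if value != 0}

def to_dense(cells, n_rows, n_cols):
    """Rebuild the full rows, filling every missing cell with 0."""
    rows = [[0] * n_cols for _ in range(n_rows)]
    for (i, j), value in cells.items():
        rows[i][j] = value
    return rows

sparse = to_sparse(dense)              # 16 stored cells instead of 33
restored = to_dense(sparse, 3, 11)     # identical to `dense`
```

With a real vocabulary of many thousands of n-grams, almost every cell is zero, which is why the sparse form saves so much memory.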


## Preparing our Features (X) and Target (Y) for Training

X is an (nreview, nwords) array. Each row is the bag-of-words representation of a single review. This will be the input to the model.

Y is an nreview-element array of 1s and 0s, encoding whether each review is Fresh (1) or Rotten (0). This is the desired output.

In [13]:
# Instantiate the vectorizer with n-grams of length one or two
vectorizer = CountVectorizer(ngram_range=(1,2))

# Create a sparse matrix where each row is the bag-of-words for a single quote
X = vectorizer.fit_transform(critics.quote)

In [4]:
# TODO - Create an array where each element encodes whether the review is Fresh or Rotten
# Y = ...
# hint: apply the == condition, then use .values.astype(int) on the result to get the right type


In [18]:
# Use SKLearn's train_test_split
# Important - we'll do this a thousand times
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X, Y)
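
To make the split less of a black box, here is a minimal plain-Python sketch of the same idea: shuffle the row positions, hold out a fraction for testing, and keep each row paired with its label. `simple_train_test_split` is a hypothetical helper written for this example, not sklearn's implementation; the 25% test fraction mirrors sklearn's default, and the rounding and seeding choices are assumptions.

```python
import random

def simple_train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle row indices, then hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    rows_train = [rows[i] for i in train_idx]
    rows_test = [rows[i] for i in test_idx]
    labels_train = [labels[i] for i in train_idx]
    labels_test = [labels[i] for i in test_idx]
    return rows_train, rows_test, labels_train, labels_test

# Made-up data: 8 "reviews", labelled 1 for even ids and 0 for odd ids
rows = ['review %d' % i for i in range(8)]
labels = [1 if i % 2 == 0 else 0 for i in range(8)]

rows_train, rows_test, labels_train, labels_test = simple_train_test_split(rows, labels)
```

The model only ever sees the training rows, so the held-out rows give an honest estimate of how it handles unseen reviews.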

## Creating the Classifier

In [25]:
# Fit the vectorizer on all quotes so new text can be mapped onto the same vocabulary
rotten_vectorizer = vectorizer.fit(critics.quote)

# a few helper functions
def accuracy_report(_clf):
    # Score the classifier once on the training data and once on the held-out test data
    training_accuracy = _clf.score(xtrain, ytrain)
    test_accuracy = _clf.score(xtest, ytest)

    print "Accuracy: %0.2f%%" % (100 * test_accuracy)
    print "Accuracy on training data: %0.2f" % (training_accuracy)

# a function to run some tests
def AnalyzeReview(testquote, _clf):
    print "\""  + testquote + "\" is judged by clasifier to be..."
    # Vectorize the quote with the vocabulary learned from the critics data
    testquote = rotten_vectorizer.transform([testquote])

    # Predict once with the classifier that was passed in (1 = fresh, 0 = rotten)
    prediction = _clf.predict(testquote)[0]
    if prediction == 1:
        print "... a fresh review."
    else:
        print "... a rotten review."
    return prediction

In [26]:
from sklearn.naive_bayes import MultinomialNB

print "MultinomialNB:"
clf_mn = MultinomialNB().fit(xtrain, ytrain)
accuracy_report(clf_mn)

MultinomialNB:
Accuracy: 76.24%
Accuracy on training data: 1.00
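
For a classifier, `score` returns the mean accuracy: the fraction of predictions that match the true labels. A minimal sketch of that calculation, using a made-up helper `accuracy` and made-up labels, looks like this:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / float(len(actual))

# Made-up labels and predictions, purely for illustration
actual = [1, 0, 0, 1]
predicted = [1, 0, 1, 1]

score = accuracy(predicted, actual)  # 3 of 4 correct
```

A training accuracy far above the test accuracy, as in the output above, usually suggests the model has fit the training reviews more closely than it generalizes to new ones.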


In [5]:
from sklearn.naive_bayes import BernoulliNB
print "BernoulliNB:"
# TODO - same as above with Bernoulli
# clf_b = ...

BernoulliNB:


In [6]:
from sklearn.linear_model import LogisticRegression
print "Logistic Regression:"
# TODO - same as above with LogReg
# clf_lr = ...

Logistic Regression:


In [30]:
AnalyzeReview("This movie was awesome", clf_mn)
AnalyzeReview("This movie was awesome", clf_b)
AnalyzeReview("This movie was awesome", clf_lr)

"This movie was awesome" is judged by clasifier to be...
... a fresh review.
"This movie was awesome" is judged by clasifier to be...
... a fresh review.
"This movie was awesome" is judged by clasifier to be...
... a rotten review.


0

In [31]:
# Save prediction and probability

# Predicted probability of class 0 (rotten) for every review:
# the first column of predict_proba
prob = clf_mn.predict_proba(X)[:, 0]

predict = clf_mn.predict(X)

In [32]:
Y == 0  # Boolean mask that is True where the actual review is rotten

array([False, False, False, ..., False, False, False], dtype=bool)

In [33]:
# argsort returns the indices that would sort the array in ascending order,
# so [:5] gives the positions of the five smallest values
np.argsort((prob[Y==0]))[:5]

array([4925,  249, 2369,  174, 2130])
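
What `argsort` returns can be mimicked in plain Python. The helper below is a hypothetical stand-in written for illustration (tie-breaking may differ from `np.argsort`), and the probabilities are made up.

```python
def argsort(values):
    """Indices that would put `values` in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Made-up probabilities, purely for illustration
probs = [0.9, 0.1, 0.5, 0.3]

order = argsort(probs)      # positions from smallest to largest value
lowest_two = order[:2]      # positions of the two smallest values
highest_two = order[-2:]    # positions of the two largest values
```

Slicing from the front picks the lowest values and slicing from the back picks the highest, which is how the least and most confident predictions are selected.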

In [35]:
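# Top 5 review classification errors in each class
# (the indices are positions within the filtered subsets, not within critics)
# Rotten reviews with the lowest predicted probability of being rotten
bad_rotten = np.argsort(prob[Y == 0])[:5]
# Fresh reviews with the highest predicted probability of being rotten
bad_fresh = np.argsort(prob[Y == 1])[-5:]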

In [36]:
print "Mis-predicted Rotten quotes"
print '---------------------------'
for row in bad_rotten:
    # iloc selects by position within the filtered subset
    print critics[Y == 0].quote.iloc[row]
    print

print "Mis-predicted Fresh quotes"
print '--------------------------'
for row in bad_fresh:
    print critics[Y == 1].quote.iloc[row]
    print

Mis-predicted Rotten quotes
---------------------------
If you loved Wolfe's book, you may very well hate the movie. If you simply liked the novel, you may be simultaneously entertained and disappointed by what De Palma and Cristofer have done to it.

There is absolutely nothing going on in Beautiful Girls that you haven't seen... [in] any other artistic endeavor in which untethered young men and women, bound by geography and fortified by beer, shamble their way toward overdue maturity.

Nava, who started his feature-film career with El Norte, is a good director who invariably finds a strong rapport with his actors. He's not much of a writer, though, and he should think twice about creating dialogue for his future projects.

Mr. Rodriguez demonstrates his talents more clearly than ever -- he's visually inventive, quick-witted and a fabulous editor -- while still hampering himself with sophomoric material.

By its midpoint, however, Thornton has begun forcing both the film's poetry and 