#### Classification Example:  Predict Review Score Based on Its Text

* Get a sample of reviews from our Amazon data set
* How accurately can we classify a new review using fields other than its score?

First re-cast the problem
* Original problem is a regression problem -- predict score value
* This often does not work in practice: the difference between 2 and 3, or between 3 and 4, often cannot be predicted from the review summary and body.

Quantize the problem
* Scores of 1 and 2 are "bad" reviews
* Scores of 4 and 5 are "good" reviews
* Reviews with a score of 3 are left out
* Problem becomes a classification problem

----------------------------------------------------

**First steps**

* Gather a set of training examples
* Reformat them to make them ready to use with a learning algorithm

The file "reviews.txt" is a sample drawn from the reviews collection.

The classes are not balanced: in the full set, positive reviews outnumber negative ones by about 5 to 1. Dealing with imbalanced classes is a real issue, but we will ignore it for this exercise.

In [1]:
import numpy as np
import pandas as pd

In [2]:
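import ast

def load_training(filename="reviews.txt"):
    # Each line holds one review as a dict literal.  literal_eval parses it
    # without running arbitrary code the way eval would, and the with-block
    # closes the file when we are done.
    with open(filename) as f:
        return [ast.literal_eval(line.strip()) for line in f]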

In [3]:
raw_training = load_training()


In [4]:
raw_training

[{'reviewerID': 'A007549913T3THC2E44K5',
  'asin': 'B0046ZHQAW',
  'reviewerName': 'Marce',
  'helpful': [3, 4],
  'reviewText': 'is a very good iron, your hair look beautiful and in just one pass!, does not hurt the hair! It is all I can say',
  'overall': 5.0,
  'summary': 'Really good one',
  'unixReviewTime': 1366329600,
  'reviewTime': '04 19, 2013'},
 {'reviewerID': 'A03666331WS5WUZMM0AUD',
  'asin': 'B00007M0CP',
  'reviewerName': 'Susanne Gonzalez Delgado',
  'helpful': [3, 3],
  'reviewText': 'Good quality for this Price. The curls gets quite tight if you have m&eacute;dium long hair. I have now longer hair and the results looks now better as the curls look more natural.The only thing I dislike is that my fingers get constantly burned, .. I guess it I am not very good at this but it would be gread if it came with a special glove to avoid this.Top product, I would definitely recommend.',
  'overall': 5.0,
  'summary': 'Conair Xtreme Instant Heat Multisized Hot Rollers, Pink',
 

In [5]:
reviews = pd.DataFrame.from_records(raw_training)

In [6]:
reviews.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A007549913T3THC2E44K5,B0046ZHQAW,Marce,"[3, 4]","is a very good iron, your hair look beautiful ...",5.0,Really good one,1366329600,"04 19, 2013"
1,A03666331WS5WUZMM0AUD,B00007M0CP,Susanne Gonzalez Delgado,"[3, 3]",Good quality for this Price. The curls gets qu...,5.0,Conair Xtreme Instant Heat Multisized Hot Roll...,1378080000,"09 2, 2013"
2,A03898972RKS9HXS4H1F3,B002610SB2,Nancy Denise Graves,"[4, 5]",The shirt fits perfectly--true to size. It is...,5.0,It was exactly what I was looking for!,1358726400,"01 21, 2013"
3,A04051149NH0IDZZ4FH7,B002610SB2,Donnie,"[4, 4]","Wife hates hot days, well we live in florida.....",5.0,Nice shirt for the wife,1394755200,"03 14, 2014"
4,A06263011FDVMGFVAALWO,B00007M0CP,Hannah Moss,"[0, 0]",So much faster for me than using a curling iro...,5.0,I love them!,1365984000,"04 15, 2013"


In [7]:
reviews.columns

Index(['reviewerID', 'asin', 'reviewerName', 'helpful', 'reviewText',
       'overall', 'summary', 'unixReviewTime', 'reviewTime'],
      dtype='object')

In [None]:
reviews.drop(['reviewerID', 'asin', 'reviewerName', 'unixReviewTime', 'reviewTime'], axis=1, inplace=True)

In [8]:
reviews.reviewerID.value_counts()

A14OJS0VWMOSWO    7
A15C3UT9FCQYTN    2
A13QTZ8CIMHHG4    2
A1W8EAPN3ZTI9     1
A1W670KK0YNM74    1
                 ..
A1FUTP0MAGGN8V    1
A1FV7W4SZD9BVO    1
A1FV82A6PT1LEL    1
A1FV9CE5LZMKHL    1
A2E64WQS9WFEM9    1
Name: reviewerID, Length: 2965, dtype: int64

In [9]:
reviews.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A007549913T3THC2E44K5,B0046ZHQAW,Marce,"[3, 4]","is a very good iron, your hair look beautiful ...",5.0,Really good one,1366329600,"04 19, 2013"
1,A03666331WS5WUZMM0AUD,B00007M0CP,Susanne Gonzalez Delgado,"[3, 3]",Good quality for this Price. The curls gets qu...,5.0,Conair Xtreme Instant Heat Multisized Hot Roll...,1378080000,"09 2, 2013"
2,A03898972RKS9HXS4H1F3,B002610SB2,Nancy Denise Graves,"[4, 5]",The shirt fits perfectly--true to size. It is...,5.0,It was exactly what I was looking for!,1358726400,"01 21, 2013"
3,A04051149NH0IDZZ4FH7,B002610SB2,Donnie,"[4, 4]","Wife hates hot days, well we live in florida.....",5.0,Nice shirt for the wife,1394755200,"03 14, 2014"
4,A06263011FDVMGFVAALWO,B00007M0CP,Hannah Moss,"[0, 0]",So much faster for me than using a curling iro...,5.0,I love them!,1365984000,"04 15, 2013"


In [10]:
reviews.shape

(2973, 9)

In [12]:
reviews.describe()

Unnamed: 0,overall,unixReviewTime
count,2973.0,2973.0
mean,4.15338,1331244000.0
std,1.237598,90932770.0
min,1.0,886464000.0
25%,4.0,1310256000.0
50%,5.0,1365984000.0
75%,5.0,1389485000.0
max,5.0,1405987000.0


In [13]:
reviews.overall.value_counts()

5.0    1712
4.0     604
3.0     286
1.0     228
2.0     143
Name: overall, dtype: int64

What fields should be our independent (X) variables?  
Convert data frame:  include only desired X variables, and code the y variable

In [14]:
reviews = reviews[reviews.overall != 3]

In [15]:
# Drop the score-3 records so raw_training stays aligned row-for-row with reviews
raw_training = [r for r in raw_training if r['overall'] != 3.0]

In [16]:
reviews.shape

(2687, 9)

In [17]:
len(raw_training)

2687

In [18]:
reviews.overall.value_counts()

5.0    1712
4.0     604
1.0     228
2.0     143
Name: overall, dtype: int64

In [19]:
y = list(map(lambda s: 0 if s < 3 else 1, reviews.overall))

In [20]:
reviews.drop(['overall'], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


In [21]:
reviews.shape

(2687, 8)

In [22]:
reviews.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,summary,unixReviewTime,reviewTime
0,A007549913T3THC2E44K5,B0046ZHQAW,Marce,"[3, 4]","is a very good iron, your hair look beautiful ...",Really good one,1366329600,"04 19, 2013"
1,A03666331WS5WUZMM0AUD,B00007M0CP,Susanne Gonzalez Delgado,"[3, 3]",Good quality for this Price. The curls gets qu...,Conair Xtreme Instant Heat Multisized Hot Roll...,1378080000,"09 2, 2013"
2,A03898972RKS9HXS4H1F3,B002610SB2,Nancy Denise Graves,"[4, 5]",The shirt fits perfectly--true to size. It is...,It was exactly what I was looking for!,1358726400,"01 21, 2013"
3,A04051149NH0IDZZ4FH7,B002610SB2,Donnie,"[4, 4]","Wife hates hot days, well we live in florida.....",Nice shirt for the wife,1394755200,"03 14, 2014"
4,A06263011FDVMGFVAALWO,B00007M0CP,Hannah Moss,"[0, 0]",So much faster for me than using a curling iro...,I love them!,1365984000,"04 15, 2013"


In [23]:
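# No separator is added, so body and summary fuse at the join ('say' + 'Really' -> 'sayreally' in the vocabulary below); insert ' ' between them to avoid it
reviews['text'] = reviews.reviewText + reviews.summary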

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  reviews['text'] = reviews.reviewText + reviews.summary


In [24]:
reviews.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,summary,unixReviewTime,reviewTime,text
0,A007549913T3THC2E44K5,B0046ZHQAW,Marce,"[3, 4]","is a very good iron, your hair look beautiful ...",Really good one,1366329600,"04 19, 2013","is a very good iron, your hair look beautiful ..."
1,A03666331WS5WUZMM0AUD,B00007M0CP,Susanne Gonzalez Delgado,"[3, 3]",Good quality for this Price. The curls gets qu...,Conair Xtreme Instant Heat Multisized Hot Roll...,1378080000,"09 2, 2013",Good quality for this Price. The curls gets qu...
2,A03898972RKS9HXS4H1F3,B002610SB2,Nancy Denise Graves,"[4, 5]",The shirt fits perfectly--true to size. It is...,It was exactly what I was looking for!,1358726400,"01 21, 2013",The shirt fits perfectly--true to size. It is...
3,A04051149NH0IDZZ4FH7,B002610SB2,Donnie,"[4, 4]","Wife hates hot days, well we live in florida.....",Nice shirt for the wife,1394755200,"03 14, 2014","Wife hates hot days, well we live in florida....."
4,A06263011FDVMGFVAALWO,B00007M0CP,Hannah Moss,"[0, 0]",So much faster for me than using a curling iro...,I love them!,1365984000,"04 15, 2013",So much faster for me than using a curling iro...


In [25]:
print(reviews.shape)
print(len(y))

(2687, 9)
2687


Now we have a text field as our X, and we have to convert it to a numeric feature vector.
First look at CountVectorizer:  https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html


In [55]:
from sklearn.feature_extraction.text import CountVectorizer
#v = CountVectorizer(min_df=2, max_df=.75)
#v = CountVectorizer(max_df=.25)
v = CountVectorizer()
X = v.fit_transform(reviews.text)

Side step:  look at excluding very rare terms (min_df) and very common terms (max_df)
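As a small illustration of what those two parameters do, here is a sketch on a made-up four-document corpus (the documents and variable names are invented for this example, not taken from the review data). `min_df=2` drops terms that appear in fewer than two documents, and `max_df=0.5` drops terms that appear in more than half of them.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not from reviews.txt)
docs = ["good iron good price", "good curls", "bad iron", "good glove"]

full = CountVectorizer().fit(docs)
pruned = CountVectorizer(min_df=2, max_df=0.5).fit(docs)

print(sorted(full.vocabulary_))    # every distinct token
print(sorted(pruned.vocabulary_))  # tokens that survive both cutoffs
```

Here "good" appears in three of the four documents and is cut by `max_df`, while "price", "curls", "bad" and "glove" each appear once and are cut by `min_df`, so only "iron" survives.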

In [42]:
len(v.vocabulary_.keys())

15368

In [28]:
v.vocabulary_

{'is': 7356,
 'very': 14699,
 'good': 6046,
 'iron': 7341,
 'your': 15320,
 'hair': 6284,
 'look': 8151,
 'beautiful': 1511,
 'and': 870,
 'in': 6971,
 'just': 7558,
 'one': 9457,
 'pass': 9807,
 'does': 4241,
 'not': 9253,
 'hurt': 6808,
 'the': 13730,
 'it': 7378,
 'all': 747,
 'can': 2236,
 'sayreally': 11831,
 'quality': 10775,
 'for': 5572,
 'this': 13793,
 'price': 10471,
 'curls': 3534,
 'gets': 5923,
 'quite': 10816,
 'tight': 13872,
 'if': 6871,
 'you': 15311,
 'have': 6413,
 'eacute': 4470,
 'dium': 4198,
 'long': 8141,
 'now': 9288,
 'longer': 8142,
 'results': 11395,
 'looks': 8156,
 'better': 1627,
 'as': 1110,
 'more': 8864,
 'natural': 9055,
 'only': 9472,
 'thing': 13777,
 'dislike': 4152,
 'that': 13722,
 'my': 8989,
 'fingers': 5379,
 'get': 5919,
 'constantly': 3140,
 'burned': 2125,
 'guess': 6226,
 'am': 802,
 'at': 1169,
 'but': 2148,
 'would': 15222,
 'be': 1481,
 'gread': 6145,
 'came': 2216,
 'with': 15136,
 'special': 12765,
 'glove': 6010,
 'to': 13922,
 'avo

In [49]:
v.stop_words_

{'and',
 'are',
 'as',
 'but',
 'for',
 'great',
 'have',
 'in',
 'is',
 'it',
 'my',
 'not',
 'of',
 'on',
 'so',
 'that',
 'the',
 'this',
 'to',
 'was',
 'with',
 'you'}

In [54]:
print(X.shape)
print(len(y))

(2687, 7228)
2687


**Our Input Data**
1. The X matrix
2. The vectorizer
3. The y vector

Simple NB classifier for binary response variable -- the [documentation](https://scikit-learn.org/stable/modules/naive_bayes.html) is good -- be sure to read it! 
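One difference between the Naive Bayes variants used below is worth seeing on a small case: `BernoulliNB` models only whether a term is present, so with its default `binarize=0.0` it thresholds the counts to 0/1 before fitting, while `MultinomialNB` uses the counts themselves. A minimal sketch on made-up count data (the arrays and names are hypothetical):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy term-count matrix: 4 documents x 3 terms (hypothetical data)
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 4, 1],
                     [0, 1, 2]])
y_toy = [1, 1, 0, 0]

on_counts = BernoulliNB().fit(X_counts, y_toy)
on_presence = BernoulliNB().fit((X_counts > 0).astype(int), y_toy)

# Same fitted parameters: the counts were reduced to presence/absence
same = np.allclose(on_counts.feature_log_prob_, on_presence.feature_log_prob_)
print(same)
```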

In [56]:
from sklearn.naive_bayes import BernoulliNB
nb = BernoulliNB()
nb.fit(X,y)

BernoulliNB()

In [57]:
# Use the model to predict the value of all records in the training set
ypred = nb.predict(X)
print(len(ypred))
print(ypred)

2687
[1 1 1 ... 1 1 1]


In [58]:
# How many "good" predictions are there, and how does that compare to actual y values?
from collections import Counter
print(Counter(ypred))
print(Counter(y))

Counter({1: 2491, 0: 196})
Counter({1: 2316, 0: 371})


Various tools to measure prediction accuracy

In [59]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y, nb.predict(X)))

0.8641607740975065


In [60]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Confusion matrix -- argument order matters -- y is 
# down the rows because it is first.  Predicted is across the 
# columns because it is second.

print(confusion_matrix(y, nb.predict(X)))
print(accuracy_score(y, nb.predict(X)))

# Remember, this is training accuracy.  Bad if it's low, meaningless if it's high

[[ 101  270]
 [  95 2221]]
0.8641607740975065
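
To make the row/column convention concrete, the sketch below recomputes accuracy from the matrix printed above and adds per-class recall. The variable names are ours; the four counts are copied from the output.

```python
import numpy as np

# Rows are actual labels (0 = bad, 1 = good); columns are predictions.
cm = np.array([[101, 270],
               [95, 2221]])

accuracy = np.trace(cm) / cm.sum()    # correct predictions / all predictions
recall_bad = cm[0, 0] / cm[0].sum()   # share of actual bad reviews caught
recall_good = cm[1, 1] / cm[1].sum()  # share of actual good reviews caught

print(accuracy, recall_bad, recall_good)
```

The accuracy matches the 0.864 reported above, but only about 27% of the bad reviews are recognized as bad, which the single accuracy number hides.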


In [62]:
# Quick way to split our X and y into training and test pieces for measuring
# test accuracy

# The random_state argument seeds the random number generator so you get the
# same split each time.  Not random, but good for debugging

from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.25, random_state = 0)
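
Because the classes are imbalanced, a plain random split can leave the test set with a somewhat different good/bad mix than the training set. `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both pieces. This is an optional variation, not what the cell above does, and the data here is made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% in class 0 (hypothetical, mimics an imbalanced set)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 20 + [1] * 80)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy)

# Both pieces keep the 20/80 mix
print(np.mean(y_tr == 0), np.mean(y_te == 0))
```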

In [63]:
# Just a method so we can evaluate a classifier easily

def eval_test_set(clf, xtrain, ytrain, xtest, ytest):
    clf.fit(xtrain, ytrain)
    print(confusion_matrix(ytest, clf.predict(xtest)))
    print(accuracy_score(ytest, clf.predict(xtest)))

In [64]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
eval_test_set(BernoulliNB(), xTrain, yTrain, xTest, yTest)
eval_test_set(MultinomialNB(), xTrain, yTrain, xTest, yTest)

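#  Too bad: for BernoulliNB, test accuracy (0.79) is well below its training accuracy (0.86),
#  and below the 576/672 = 0.857 we would get by always predicting "good"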

[[ 11  85]
 [ 53 523]]
0.7946428571428571
[[ 22  74]
 [ 15 561]]
0.8675595238095238


**Cross-Fold Validation**

In [65]:
# Cross-fold validation -- this is 10-fold, split the X and y into 
# ten pieces, loop ten times, one for each piece:  train on the other 9, 
# then test on the 10th.   Average is estimate for test accuracy

from sklearn.model_selection import cross_val_score
def cross_validate(clf, x, y):
    return cross_val_score(clf, x, y, cv=10).mean()
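
The loop described in the comment can be written out by hand. This sketch, on synthetic count data, assumes `StratifiedKFold` splitting, which is what `cross_val_score` uses for a classifier when `cv` is an integer. Names such as `manual_cv_scores` are illustrative only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

def manual_cv_scores(clf, X, y, n_splits=5):
    """Train on all folds but one, test on the held-out fold, repeat."""
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=n_splits).split(X, y):
        model = clone(clf).fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.array(scores)

# Synthetic term counts: 60 documents x 8 terms, two classes
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(60, 8))
y_toy = np.array([0] * 20 + [1] * 40)

by_hand = manual_cv_scores(MultinomialNB(), X_toy, y_toy)
built_in = cross_val_score(MultinomialNB(), X_toy, y_toy, cv=5)
print(by_hand.mean(), built_in.mean())
```

The two arrays of per-fold scores agree, so the mean returned by `cross_validate` is the average accuracy over the held-out folds.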

In [66]:
def eval_classifier(clf, X, y):
    clf.fit(X,y)
    print(f"Results for {clf}")
    print(f"Cross validation mean accuracy: {cross_validate(clf, X, y)}")
    print(f"Training accuracy: {accuracy_score(y, clf.predict(X))}")
    print(confusion_matrix(y, clf.predict(X)))


In [67]:
from sklearn.naive_bayes import ComplementNB
print(cross_validate(BernoulliNB(), X, y))
print(cross_validate(MultinomialNB(), X, y))
print(cross_validate(ComplementNB(), X, y))

# This should be a little better than the holdback test accuracy
# because each fold trains on 90% of the data (not 75%) and every record is used as test data once

# This is our baseline accuracy -- can we change preprocessing and/or get 
# rid of features to improve it?

0.8191116906175442
0.8797841646784663
0.8768074127503744


In [None]:
eval_classifier(BernoulliNB(), X, y)
print()
eval_classifier(MultinomialNB(), X, y)
print()
eval_classifier(ComplementNB(), X, y)


**Other Options for a Classifier**

How do these perform compared to our Naive Bayes baseline?

1. Rocchio (nearest centroid)
2. Nearest neighbor

In [None]:
from sklearn.neighbors import NearestCentroid
eval_classifier(NearestCentroid(), X,y)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# eval_test_set needs separate train and test pieces; eval_classifier takes the full X and y
eval_classifier(KNeighborsClassifier(n_neighbors=20), X, y)

In [None]:
# Here is apples-to-apples comparison of all models we have tried so far

print(f"Bernoulli Naive Bayes {cross_validate(BernoulliNB(), X, y)}")
print(f"Multinomial Naive Bayes {cross_validate(MultinomialNB(), X, y)}")
print(f"Complement Naive Bayes {cross_validate(ComplementNB(), X, y)}")
print(f"Nearest Centroid {cross_validate(NearestCentroid(), X,y)}")
print(f"Nearest neighbors (20) {cross_validate(KNeighborsClassifier(n_neighbors=20), X, y)}")
print(f"Nearest neighbors (5) {cross_validate(KNeighborsClassifier(n_neighbors=5), X, y)}")
print(f"Nearest neighbors (12) {cross_validate(KNeighborsClassifier(n_neighbors=12), X, y)}")

Tentative decision:  choose Multinomial Naive Bayes, reject Rocchio, and experiment to see how much we can improve KNN

**Hyperparameter Optimization**

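Nearest Neighbor is the only algorithm we tune here; its hyperparameter is the number of neighbors. (The Naive Bayes classifiers also have a smoothing parameter, alpha, which we leave at its default.)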

In [None]:
from sklearn.model_selection import GridSearchCV

# Loop over number of neighbors, and compute cross-validated accuracy for each setting.
# Remember the parameter setting with the best accuracy.
# Notice that the result of GridSearchCV is also a classifier -- it has fit and predict methods

clf = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': range(1,50,2)}, cv=10, scoring='accuracy')
clf.fit(X, y)

In [None]:
# Notice that the grid search is actually a classifier too!
clf.best_estimator_

In [None]:
clf.cv_results_

In [None]:
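# Just for completeness, this estimates the accuracy of KNN with a tuned n_neighbors.
# Cross-validating the grid search reruns the whole search inside each outer fold, so it is slow.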
print(f"Nearest neighbors (optimal) {cross_validate(clf, X, y)}")

In [None]:
#  There is an interesting curve relating number of neighbors to accuracy
#  This piece just collects (x,y) pairs where x is neighbors and y is accuracy,
#   so we can do a plot

neighbors = []
accuracies = []

for nn in range(1,100,2):
    neighbors.append(nn)
    accuracies.append(cross_validate(KNeighborsClassifier(n_neighbors=nn), X, y))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# (x,y) line plot of number of neighbors vs test accuracy
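# 'seaborn-whitegrid' was renamed 'seaborn-v0_8-whitegrid' in Matplotlib 3.6
plt.style.use('seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid')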
fig = plt.figure()
ax = plt.axes()
ax.plot(neighbors, accuracies);
ax.set_xlabel('# Neighbors')
ax.set_ylabel("Accuracy")

In [None]:
# What value is the accuracy converging to?
Counter(y)[1]/ (Counter(y)[0]+Counter(y)[1])
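
That fraction is the accuracy of a classifier that always predicts the majority class, which is a useful floor for every model above. scikit-learn's `DummyClassifier` makes the comparison explicit. The sketch uses the class counts printed by `Counter(y)` earlier in the notebook; the feature matrix is a placeholder because this baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from Counter(y): 371 bad (0), 2316 good (1)
y_counts = np.array([0] * 371 + [1] * 2316)
X_placeholder = np.zeros((len(y_counts), 1))  # ignored by the baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_counts)
majority_accuracy = baseline.score(X_placeholder, y_counts)
print(majority_accuracy)
```

The floor is about 0.862. By this yardstick the cross-validated 0.82 for Bernoulli Naive Bayes is below the floor, and the 0.88 for Multinomial Naive Bayes is only slightly above it.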

----------------------------------------------
### Feature Engineering

Improve the algorithm by modifying the vectorizer, limiting features

This all starts with the assumption that we will go with Naive Bayes as our classifier of choice

1.  See if we can improve the tokenizer -- TFIDF
  1. We could try n-grams and stop words too, because the Vectorizer can do those for us
2.  See if we can limit features -- do it in the Vectorizer, and also try some more informed greedy methods like mutual information
3.  Examine the misclassifications to see if there's anything obviously wrong

* Variations on the vectorizer -- try just Count vs Tfidf
* Other variants would be stemming, n-grams, stop words 
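
The stop-word and n-gram variants are both constructor arguments on the vectorizer. A minimal sketch on two made-up sentences, using scikit-learn's built-in English stop list (`stop_words='english'`) and unigrams plus bigrams; the sentences and the name `v_demo` are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented sentences, not from the review data
docs = ["the iron is very good", "the glove is not good"]

v_demo = CountVectorizer(stop_words="english", ngram_range=(1, 2))
v_demo.fit(docs)

vocab = sorted(v_demo.vocabulary_)
print(vocab)
```

Note that the built-in list also removes "not", so a negation signal that matters for sentiment disappears. That is one reason to measure the effect of stop-word removal rather than assume it helps.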

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

b = MultinomialNB()

v = CountVectorizer()
X = v.fit_transform(reviews.text)
print(f"Count Vectorizer {cross_validate(b, X, y)}")

v = TfidfVectorizer()
X = v.fit_transform(reviews.text)
print(f"Tfidf Vectorizer {cross_validate(b, X, y)}")


In [None]:
# Try restricting out uncommon and super common words
v = CountVectorizer(min_df=2, max_df=.95)
X = v.fit_transform(reviews.text)
print(f"Count Vectorizer {cross_validate(b, X, y)}")

In [None]:
# Try bigrams
v = CountVectorizer(min_df=2, max_df=.95, ngram_range=(1,2))
X = v.fit_transform(reviews.text)
print(f"Count Vectorizer {cross_validate(b, X, y)}")

In [None]:
X.shape

### Feature Selection by Mutual Information



In [None]:
from sklearn.feature_selection import mutual_info_classif
# https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
mutual_info_classif(X,y)

In [None]:
# Select K best
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
from sklearn.feature_selection import SelectKBest
newX = SelectKBest(mutual_info_classif, k=100).fit_transform(X,y)
newX.shape

In [None]:
def k_features_x(k, X, y, fs=mutual_info_classif):
    kbest = SelectKBest(fs, k=k)
    return kbest.fit_transform(X, y)

#  Assess accuracy of the classifier after choosing k best features
def k_features_accuracy(k, clf, x, y, fs=mutual_info_classif):
    X_new = k_features_x(k, x, y, fs)
    return cross_validate(clf, X_new, y)
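
One caveat with this helper: the features are chosen using all of `y` before cross-validation starts, so each held-out fold has already influenced which features the model sees, and the estimate can come out optimistic. A hedged alternative is to put the selector and the classifier in a pipeline, so that selection is refit inside every training fold. The sketch below uses synthetic counts and `discrete_features=True` (an assumption that suits count data); `k_features_accuracy_nested` is a hypothetical name.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def k_features_accuracy_nested(k, clf, X, y, cv=5):
    """Cross-validate with feature selection refit on each training fold."""
    score_fn = partial(mutual_info_classif, discrete_features=True)
    pipe = make_pipeline(SelectKBest(score_fn, k=k), clf)
    return cross_val_score(pipe, X, y, cv=cv).mean()

# Synthetic term counts: term 0 is more frequent in class 1
rng = np.random.default_rng(0)
y_toy = np.array([0] * 30 + [1] * 30)
X_toy = rng.integers(0, 3, size=(60, 10))
X_toy[:, 0] += 3 * y_toy

acc = k_features_accuracy_nested(3, MultinomialNB(), X_toy, y_toy)
print(acc)
```

Comparing this number with the one from `k_features_accuracy` on the same data gives a rough sense of how much the selection step was inflating the estimate.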


In [None]:
X = CountVectorizer(min_df=2, max_df=.95, ngram_range=(1,2)).fit_transform(reviews.text)
k_features_accuracy(5000, MultinomialNB(), X, y)

In [None]:
###########################
# Caution!!  This takes a very long time!

mnb = MultinomialNB()
v = CountVectorizer(min_df=2, max_df=.95, ngram_range=(1,2))
X = v.fit_transform(reviews.text)

#  Set up to plot accuracy as a function of k (number features)
xval = []
yval = []
for k in range(100, 4000, 400):
    print(k)
    xval.append(k)
    yval.append(k_features_accuracy(k, mnb, X, y))
for k in range(4000, X.shape[1], 500):
    print(k)
    xval.append(k)
    yval.append(k_features_accuracy(k, mnb, X, y))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
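# 'seaborn-whitegrid' was renamed 'seaborn-v0_8-whitegrid' in Matplotlib 3.6
plt.style.use('seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid')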
fig = plt.figure()
ax = plt.axes()
ax.plot(xval, yval);

# Nice plot with optimal number of features around 3000 (out of 14.8K!)

Sample plot of cross-val accuracy vs number of features

![Accuracy vs num features](accuracy_vs_num_features.png)

### Classifier-Specific Feature Impact

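Since we are using Naive Bayes, the classifier itself has a measure of feature impact: compare how likely a term is under each class, P(term|class=0) versus P(term|class=1).

The code in this section makes that comparison with the ratio of the two log-probabilities, log P(term|class=0) / log P(term|class=1), which is not the same quantity as the probability ratio. Both logs are negative, so a ratio above 1 means the term is more likely in good reviews ("most positive"), a ratio below 1 means it is more likely in bad reviews ("most negative"), and a ratio of 1 means the term is equally likely in both classes, i.e. irrelevant.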

This code shows the features that have the highest and lowest ratio.

The question is whether this measure is better than mutual information for limiting # features
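
For reference, the probability ratio P(term|class=0) / P(term|class=1) can be recovered from `feature_log_prob_` by subtracting the two rows (the log of a quotient is the difference of the logs) and exponentiating. A minimal sketch on made-up counts for three terms; the data and names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

terms = ["good", "bad", "iron"]
# Toy counts: rows are documents, columns follow `terms`
X_toy = np.array([[0, 2, 1],
                  [0, 3, 1],
                  [3, 0, 1],
                  [2, 0, 1]])
y_toy = [0, 0, 1, 1]  # 0 = bad review, 1 = good review

nb_toy = MultinomialNB().fit(X_toy, y_toy)

# log P(term|class=0) - log P(term|class=1) = log of the probability ratio
log_ratio = nb_toy.feature_log_prob_[0] - nb_toy.feature_log_prob_[1]
ratio = np.exp(log_ratio)

for term, r in zip(terms, ratio):
    print(term, round(float(r), 3))
```

With the default smoothing, "bad" comes out well above 1, "good" well below 1, and "iron", which appears equally often in both classes, at exactly 1.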

In [None]:
v = CountVectorizer(min_df=2, max_df=.95)
X = v.fit_transform(reviews.text)
b = MultinomialNB()
b.fit(X, y)

#v.get_feature_names_out()
b.feature_log_prob_

In [None]:
def log_prob_ratios(classifier):
    # Ratio of the per-class log-probabilities: > 1 favors class 1, < 1 favors class 0
    return classifier.feature_log_prob_[0] / classifier.feature_log_prob_[1]

In [None]:
sorted(list(log_prob_ratios(b)), reverse=True)[0:10]

In [None]:
def important_features(vectorizer, classifier, n=20):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    log_prob_frac = log_prob_ratios(classifier)
    # Largest ratios favor class 1 (good); smallest ratios favor class 0 (bad)
    frac_and_name_pos = sorted(zip(log_prob_frac, feature_names), reverse=True)[:n]
    frac_and_name_neg = sorted(zip(log_prob_frac, feature_names), reverse=False)[:n]
    print("Important words POSITIVE")
    for coef, feat in frac_and_name_pos:
        print(coef, feat)
    print("==========")
    print("Important words NEGATIVE")
    for coef, feat in frac_and_name_neg:
        print(coef, feat)


In [None]:
v = CountVectorizer(min_df=2, max_df=.95)
X = v.fit_transform(reviews.text)
b = MultinomialNB()
b.fit(X, y)
important_features(v, b, n=20)

The SelectKBest feature selector can take a ranking function as an input -- previously we used the built-in ranker mutual information, but we can use the probability ratio instead.  In this case a feature is more significant the farther away from 1 it is.

In [None]:
# Signature is dictated by SelectKBest -- 
#  "Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores"

# Returns a vector of scores for each feature based on the score ratio
def prob_ratio_score(X, y):
    b = MultinomialNB()
    b.fit(X, y)
    log_prob_frac = log_prob_ratios(b)
    return abs(log_prob_frac - 1.0)

In [None]:
#  Now we can select the K best features based on the ratio.  Just for a reality check:  
#  this transform should reduce the number of features from 14.8K to 2K
v = CountVectorizer(min_df=2, max_df=.95)
X = v.fit_transform(reviews.text)
kbest = SelectKBest(prob_ratio_score, k=2000)
XX = kbest.fit_transform(X, y)
print(X.shape)
print(XX.shape)

In [None]:
# And here is our head-to-head comparison.  Same classifier, same vectorizer, same 
#  number of features, but we just rank the features differently
v = CountVectorizer(min_df=2, max_df=.95)

print(k_features_accuracy(3000, MultinomialNB(), v.fit_transform(reviews.text), y))
print(k_features_accuracy(3000, MultinomialNB(), v.fit_transform(reviews.text), y, prob_ratio_score))


-------------------------------------------------------

### Examining Misclassified Instances

1. Find some misclassified instances
2. Visually, can we see anything that would make us worry or not worry?
3. If the reason for misclassification isn't obvious, look more closely at the words and their scores according to the classifier

In [None]:

# This is our baseline best classifier.
b = MultinomialNB()
v = CountVectorizer(min_df=2, max_df=.95)
X = v.fit_transform(reviews.text)
kbest = SelectKBest(prob_ratio_score, k=3000)
X = kbest.fit_transform(X, y)
# X now holds the reduced feature matrix; eval_classifier takes the full X and y
eval_classifier(b, X, y)


In [None]:
#  Get indices of some misclassified instances.  Remember that
#  the index (observation #) is the same in X, y
b.fit(X, y)
pred = b.predict(X)
mask = pred != np.array(y)
misclassed = np.array(list(zip(range(0,len(pred)), zip(y, pred))), dtype=object)
misclassed[mask]

# For example, [167, (1, 0)] means that for observation 167 the actual value was 1 but we predicted 0

In [None]:
# Actual label is good (1), we classify bad (0)
raw_training[45]

In [None]:
b.predict_proba(X)[45]

In [None]:
kbest.get_support()

In [None]:
#  Get the fitted classifier's per-term scores -- i.e. the ratio of log-probabilities
#  log P(term|class=0) / log P(term|class=1)
#  Remember that the classifier is working on a reduced set of features!
#  The vectorizer knows the term names, but it stores the larger set of features so we have to
#  filter the names with the selector's get_support() mask
def feature_log_prob(b, v, k):
    # Select the feature names actually used by kbest
    # (get_feature_names() was removed in scikit-learn 1.2)
    feature_names = np.array(v.get_feature_names_out())[k.get_support()]
    log_prob_frac = b.feature_log_prob_[0] / b.feature_log_prob_[1]
    return dict(zip(feature_names, log_prob_frac))

In [None]:
#  Show the log prob ratio for the instance with this index.
#  NOTE -- this is using values of b (classifier), v(vectorizer), and kbest(X with reduced features)

def show_scores_for(index, b, v, k):
    # Mapping of term to log prob ratio (selected features only)
    flp = feature_log_prob(b,v,k)
    # Feature names, only the k best
    feature_names = np.array(v.get_feature_names_out())[k.get_support()]
    # Term values (just a count), read from the global reduced matrix X
    termvals = X[index].toarray().ravel()
    scores = []
    for i in range(0, termvals.shape[0]):
        if termvals[i] > 0:
            scores.append((feature_names[i], termvals[i], flp[feature_names[i]]))
    return sorted(scores, key=lambda x: x[2])

In [None]:
show_scores_for(45, b, v, kbest)
# These are the term scores for the short review

In [None]:
# This is a misprediction where label is bad but we predicted good
b.predict_proba(X)[817]

In [None]:
raw_training[817]

In [None]:
show_scores_for(817, b, v, kbest)