# Challenge: Iterate and evaluate your classifier
It's time to revisit your classifier from the previous assignment. Using the evaluation techniques we've covered here, look at your classifier's performance in more detail. Then go back and iterate by engineering new features, removing poor features, or tuning parameters. Repeat this process until you have five different versions of your classifier. Once you've iterated, answer these questions to compare the performance of each:

Do any of your classifiers seem to overfit?

Which seem to perform the best? Why?

Which features seemed to be most impactful to performance?

Write up your iterations and answers to the above questions in a few pages. Submit a link below and go over it with your mentor to see if they have any other ideas on how you could improve your classifier's performance.

In [66]:
import pandas as pd
import numpy as np
import sklearn 
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

imdb = pd.read_csv('~/thinkful_mac/thinkful_large_files/imdb_labelled.csv', header = None)
imdb.columns = ['review', 'positive']
imdb.head(10)

Unnamed: 0,review,positive
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1
5,"The rest of the movie lacks art, charm, meanin...",0
6,Wasted two hours.,0
7,Saw the movie today and thought it was a good ...,1
8,A bit predictable.,0
9,Loved the casting of Jimmy Buffet as the scien...,1


# Model 1

Original list (already iterated based on examination of IMDB ratings file, trial & error)


In [61]:
keywords = ['terrible', 'awful', 'worst', 'bad', 'stupid', 'poor', 'worse', 'attempt', 'crap', 'fail', 'annoying', 'cheap',
           'painful', 'avoid', 'slow', 'pretentious', 'problem', 'embarrassing', 'bored', 'horrible', 'lousy', 'unfortunate', 
           'boring', 'sucks', 'sucked', 'waste', 'unbear', ' mess ', 'wasting', 'mediocre', 'sloppy',
           'disappoint', 'garbage', 'whine', 'whiny', 'plot', 'hate ', 'hated', 'negative', 'nobody', 'flaw',
           'script', 'insult', 'do not', 'torture', ' lack', 'lame', 'ridiculous', 'not', 'unbelievable', 'skip', 'shame', 
           'not even', 'miss', 'excellent', 'amazing', 'love', 'incredible', 'fantastic', 'terrific', 'best', 'great', 'fun',
           'beautiful', 'well done', 'enjoy', 'perfect', 'smart', 'highly', 'impress', 'well']

# Removed the required space before/after the keyword to improve model accuracy (many sentences in the IMDB dataset
# began with these words, so there is no space in front)
for key in keywords:
    imdb[str(key)] = imdb.review.str.contains(str(key), case = False)

imdb['positive'] = (imdb['positive'] == 1)
    
data = imdb[keywords]
target = imdb['positive']

# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

pred = bnb.predict(data)

print('Out of {} predictions, {} were misclassified'.format(data.shape[0], (pred != target).sum()))
print('Accuracy: {:.2f}%'.format(100 * (target == pred).sum() / len(pred)))

Out of 748 predictions, 150 were misclassified
Accuracy: 79.95%
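Because each keyword is matched as a case-insensitive substring, a short keyword such as 'not' also fires on longer words that contain it. The sketch below illustrates the effect with Python's `re` module standing in for `str.contains` (which by default treats the keyword as a regular expression). The example sentences are made up, and the word-boundary variant is only one possible alternative, not what the models in this notebook use.

```python
import re

# Made-up example sentences (not taken from the IMDB file).
reviews = [
    "Not worth the ticket price.",
    "Nothing about it felt forced.",
    "A noteworthy debut.",
]

def substring_hit(keyword, text):
    # Case-insensitive substring search, analogous to str.contains(keyword, case=False).
    return re.search(keyword, text, flags=re.IGNORECASE) is not None

def whole_word_hit(keyword, text):
    # Only match the keyword when it appears as a standalone word.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

substring_flags = [substring_hit("not", r) for r in reviews]
whole_word_flags = [whole_word_hit("not", r) for r in reviews]

print("substring: ", substring_flags)
print("whole word:", whole_word_flags)
```

Whether the looser substring match helps or hurts depends on the keyword: it lets 'disappoint' catch 'disappointing', but it also lets 'not' catch 'nothing'.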


In [50]:
#Test the accuracy, sensitivity, and specificity of MODEL 1 (original model)

from sklearn.metrics import confusion_matrix
c = confusion_matrix(target, pred)

print('Confusion Matrix: \n{}'.format(c))

#Accuracy
print('The accuracy of the model is: ', 1-((pred != target).sum()/data.shape[0]))

#Sensitivity
print('The sensitivity (Percentage of positives correctly predicted) of the model is: {}'.format((c[1][1])/(c[1][1] + c[1][0])))

#Specificity
print('The specificity (Percentage of negatives correctly predicted) of the model is: {}'.format((c[0][0])/(c[0][0] + c[0][1])))


Confusion Matrix: 
[[240 122]
 [ 28 358]]
The accuracy of the model is:  0.7994652406417112
The sensitivity (Percentage of positives correctly predicted) of the model is: 0.927461139896373
The specificity (Percentage of negatives correctly predicted) of the model is: 0.6629834254143646
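As a worked check of the numbers above, the short sketch below recomputes the three metrics directly from the printed confusion matrix. It assumes the layout that scikit-learn's `confusion_matrix` uses: rows are actual classes, columns are predicted classes, ordered negative then positive.

```python
# Confusion matrix printed above for Model 1.
# Rows: actual [negative, positive]; columns: predicted [negative, positive].
matrix = [[240, 122],
          [28, 358]]

(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # share of actual positives predicted positive
specificity = tn / (tn + fp)   # share of actual negatives predicted negative

print(f"accuracy={accuracy:.4f}")
print(f"sensitivity={sensitivity:.4f}")
print(f"specificity={specificity:.4f}")
```

Naming the four cells once (`tn`, `fp`, `fn`, `tp`) makes it harder to mix up the indices than writing `c[1][1]` and `c[0][1]` inline.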


# Run some cross validation on Model 1!

In [51]:
from sklearn.model_selection import cross_validate
bnb = BernoulliNB()
cross_validate(bnb, data, target, cv = 3, return_train_score=True)

{'fit_time': array([0.00199032, 0.00159693, 0.00150824]),
 'score_time': array([0.00050378, 0.00042605, 0.00039506]),
 'test_score': array([0.736     , 0.744     , 0.75403226]),
 'train_score': array([0.80120482, 0.79317269, 0.798     ])}

Given the results of the cross-validation, there is a small amount of overfitting in my initial model: with 3 folds, accuracy on the test sets ranged from 73.6% to 75.4%, compared with 79.3% to 80.1% on the training sets.
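One simple way to put a number on the overfitting is the gap between the mean training score and the mean test score. The sketch below uses the fold scores copied from the `cross_validate` output above; the variable names are illustrative.

```python
from statistics import mean

# Fold scores copied from the cross_validate output for Model 1.
train_scores = [0.80120482, 0.79317269, 0.798]
test_scores = [0.736, 0.744, 0.75403226]

mean_train = mean(train_scores)
mean_test = mean(test_scores)
gap = mean_train - mean_test   # larger gap = more overfitting

print(f"mean train accuracy: {mean_train:.3f}")
print(f"mean test accuracy:  {mean_test:.3f}")
print(f"train/test gap:      {gap:.3f}")
```

A gap of roughly five percentage points is consistent with calling the overfitting small.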

# Model 2

Try to minimize the number of reviews tagged as negative that are actually positive. With positive reviews as the positive class (as in the confusion matrices here), these are false negatives rather than false positives. In this instance, we care less about overall accuracy than about being right whenever a review is tagged as negative.



In [52]:
# So let's try predicting by only using a very negative list of words

keywords = ['awful', 'worst', 'trash', 'painful', 'sloppy', 'pretentious', 'embarrassing', 'hate', 'torture', 'skip']

for key in keywords:
    imdb[str(key)] = imdb.review.str.contains(str(key), case = False)

imdb['positive'] = (imdb['positive'] == 1)
    
data = imdb[keywords]
target = imdb['positive']

# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

pred = bnb.predict(data)
            
#Test the accuracy, sensitivity, and specificity

from sklearn.metrics import confusion_matrix
c = confusion_matrix(target, pred)

print('Confusion Matrix: \n{}'.format(c))

#Accuracy
print('The accuracy of the model is: ', 1-((pred != target).sum()/data.shape[0]))

#Sensitivity
print('The sensitivity (Percentage of positives correctly predicted) of the model is: {}'.format((c[1][1])/(c[1][1] + c[1][0])))

#Specificity
print('The specificity (Percentage of negatives correctly predicted) of the model is: {}'.format((c[0][0])/(c[0][0] + c[0][1])))


Confusion Matrix: 
[[ 47 315]
 [  0 386]]
The accuracy of the model is:  0.5788770053475936
The sensitivity (Percentage of positives correctly predicted) of the model is: 1.0
The specificity (Percentage of negatives correctly predicted) of the model is: 0.1298342541436464


This model was actually pretty terrible: it did not do what I thought it would, which is predict negatives correctly! No positive review was tagged as negative (sensitivity of 1.0), but only 47 of the 362 negative reviews were caught (specificity of 13%). In hindsight, the reason is clear: the keyword list simply does not capture enough of the negative descriptors. If a review contains one of these words, it will be flagged as negative, but most negative reviews contain none of them and so are marked as positive (the dominant class).
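The trade-off can be read off the confusion matrix directly. The sketch below computes, for the negative class, how often a "negative" tag was right versus how many negative reviews were found at all, and compares the model's accuracy with a baseline that always predicts the dominant class. The baseline is an illustrative comparison and was not run in this notebook.

```python
# Confusion matrix printed above for Model 2.
# Rows: actual [negative, positive]; columns: predicted [negative, positive].
matrix = [[47, 315],
          [0, 386]]

(tn, fp), (fn, tp) = matrix
total = tn + fp + fn + tp

# Of the reviews tagged negative, how many really were negative?
negative_precision = tn / (tn + fn)
# Of the truly negative reviews, how many were tagged negative (specificity)?
negative_recall = tn / (tn + fp)

model_accuracy = (tn + tp) / total
# Always predicting "positive" gets every positive review right and nothing else.
baseline_accuracy = (fn + tp) / total

print(f"negative precision: {negative_precision:.3f}")
print(f"negative recall:    {negative_recall:.3f}")
print(f"model accuracy:     {model_accuracy:.3f}")
print(f"baseline accuracy:  {baseline_accuracy:.3f}")
```

Every review tagged negative was indeed negative, but the model only beats the always-positive baseline by the 47 reviews it flagged.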

# Model 3
Try to maximize accuracy using a positive-sentiment word list from the internet (words from http://ptrckprry.com/course/ssd/data/positive-words.txt):

In [69]:
df = pd.read_csv('~/thinkful_mac/thinkful_large_files/positive_word_list_from_internet.csv', header=None)
df.columns = ['positive_sentiment_list']
pos_list = df['positive_sentiment_list'].tolist()

In [70]:
imdb = pd.read_csv('~/thinkful_mac/thinkful_large_files/imdb_labelled.csv', header = None)
# Renamed column from 'positive' to 'positive_review' due to the word being present in the keyword list
imdb.columns = ['review', 'positive_review']

keywords = pos_list

for key in keywords:
    imdb[str(key)] = imdb.review.str.contains(str(key), case = False)

imdb['positive_review'] = (imdb['positive_review'] == 1)
    
data = imdb[keywords]
target = imdb['positive_review']

# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

pred = bnb.predict(data)

from sklearn.metrics import confusion_matrix
c = confusion_matrix(target, pred)

print('Confusion Matrix: \n{}'.format(c))

#Accuracy
print('The accuracy of the model is: ', 1-((pred != target).sum()/data.shape[0]))

#Sensitivity
print('The sensitivity (Percentage of positives correctly predicted) of the model is: {}'.format((c[1][1])/(c[1][1] + c[1][0])))

#Specificity
print('The specificity (Percentage of negatives correctly predicted) of the model is: {}'.format((c[0][0])/(c[0][0] + c[0][1])))


Confusion Matrix: 
[[340  22]
 [142 244]]
The accuracy of the model is:  0.7807486631016043
The sensitivity (Percentage of positives correctly predicted) of the model is: 0.6321243523316062
The specificity (Percentage of negatives correctly predicted) of the model is: 0.9392265193370166


The positive keyword list was slightly less accurate than my model (Model 1): 78.1% versus 79.9%. Sensitivity was lower (63% versus 93%) and specificity was higher (94% versus 66%), so the scores were basically reversed from mine, which is very interesting. What about its cross-validation performance?

In [72]:
#Now let's see how well the model accuracy stands up to cross-validation. 

cross_validate(bnb, data, target, cv=3, return_train_score=True)

{'fit_time': array([0.01265931, 0.01078606, 0.00988102]),
 'score_time': array([0.00759578, 0.00451303, 0.00364494]),
 'test_score': array([0.668     , 0.72      , 0.65322581]),
 'train_score': array([0.80923695, 0.79919679, 0.846     ])}

Model 3, with only positive keywords, suffered under cross-validation. The model is overfitting: test scores ranged from 65% to 72% accuracy (versus 78% on the whole dataset). This is worse than my initial model (Model 1), BUT the comparison is not exactly fair: Model 1 does better in cross-validation because its keyword list was curated by looking through all of the samples, including those that end up in the test folds.
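One way to make such a comparison fairer, sketched here as an assumption rather than as what this notebook does, is to learn the vocabulary inside each training fold so that no feature is chosen using reviews from the test fold. The sketch swaps the hand-built keyword list for scikit-learn's `CountVectorizer(binary=True)` inside a pipeline. The toy reviews and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Invented toy data: 1 = positive review, 0 = negative review.
toy_reviews = [
    "great acting and a great story",
    "loved every minute of it",
    "a beautiful and funny film",
    "excellent cast, excellent script",
    "really enjoyed this one",
    "the best movie this year",
    "awful plot and bad acting",
    "a boring waste of time",
    "terrible script, terrible pacing",
    "the worst film this year",
    "bad jokes and a dull story",
    "painfully slow and boring",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer is re-fit on the training part of every fold, so the
# vocabulary never sees the reviews it is later scored on.
model = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
scores = cross_val_score(model, toy_reviews, toy_labels, cv=3)

print(scores)
```

With a fixed, hand-curated keyword list there is no equivalent step to re-fit, which is why curating it from the full dataset leaks information into every fold.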

# Model 4 
Try to maximize accuracy by using a negative-sentiment word list from the internet (words from http://ptrckprry.com/course/ssd/data/negative-words.txt):

In [101]:
df = pd.read_csv('~/thinkful_mac/thinkful_large_files/negative_word_list_from_internet.csv', header=None)
df.columns = ['negative_sentiment_list']

# The list contains several characters that we need to replace in order for the code to run, including "-" and "*"
df['negative_sentiment_list'] = df['negative_sentiment_list'].apply(lambda x: str(x).replace("-", ""))
df['negative_sentiment_list'] = df['negative_sentiment_list'].apply(lambda x: str(x).replace("*", ""))
neg_list = df['negative_sentiment_list'].tolist()
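An alternative to stripping characters, assuming the only goal is to keep the pattern matching from failing, is to escape each keyword so it is matched literally. The sketch below shows the idea with Python's `re` module; the two keywords and the sentence are hypothetical examples, not entries confirmed to be in the word list.

```python
import re

# Hypothetical word-list entries containing the characters discussed above.
raw_keywords = ["*sigh*", "2-faced"]
text = "Well, *sigh*, the hero turns out to be 2-faced again."

# Used directly as a pattern, a leading "*" is not a valid regular expression.
try:
    re.compile(raw_keywords[0])
    compiles_unescaped = True
except re.error:
    compiles_unescaped = False

# re.escape makes every special character match literally.
escaped_hits = [
    re.search(re.escape(keyword), text, flags=re.IGNORECASE) is not None
    for keyword in raw_keywords
]

print("compiles without escaping:", compiles_unescaped)
print("matches after escaping:   ", escaped_hits)
```

In pandas, passing `regex=False` to `str.contains` has a similar effect. Either approach keeps hyphenated entries intact, whereas removing the hyphen changes the keyword being searched for.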

In [102]:
imdb = pd.read_csv('~/thinkful_mac/thinkful_large_files/imdb_labelled.csv', header = None)
imdb.columns = ['review', 'positive']

keywords = neg_list

for key in keywords:
    imdb[str(key)] = imdb.review.str.contains(str(key), case = False)
    
imdb['positive'] = (imdb['positive'] == 1)

data = imdb[keywords]
target = imdb['positive']

# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

pred = bnb.predict(data)

from sklearn.metrics import confusion_matrix
c = confusion_matrix(target, pred)

print('Confusion Matrix: \n{}'.format(c))

#Accuracy
print('The accuracy of the model is: ', 1-((pred != target).sum()/data.shape[0]))

#Sensitivity
print('The sensitivity (Percentage of positives correctly predicted) of the model is: {}'.format((c[1][1])/(c[1][1] + c[1][0])))

#Specificity
print('The specificity (Percentage of negatives correctly predicted) of the model is: {}'.format((c[0][0])/(c[0][0] + c[0][1])))


Confusion Matrix: 
[[174 188]
 [  5 381]]
The accuracy of the model is:  0.7419786096256684
The sensitivity (Percentage of positives correctly predicted) of the model is: 0.9870466321243523
The specificity (Percentage of negatives correctly predicted) of the model is: 0.48066298342541436


In [103]:
#Now let's see how well the model accuracy stands up to cross-validation.

cross_validate(bnb, data, target, cv=3, return_train_score=True)

{'fit_time': array([0.03428984, 0.02639675, 0.0234828 ]),
 'score_time': array([0.02043509, 0.01096606, 0.01031399]),
 'test_score': array([0.612     , 0.56      , 0.54435484]),
 'train_score': array([0.74297189, 0.64457831, 0.672     ])}

The model with the negative keyword list also suffered on cross-validation, with test scores in the 54-61% accuracy range (compared to 74% for the entire dataset).

# Model 5
Positive and negative sentiment lists combined (both from the internet, not my own list)


In [104]:
imdb = pd.read_csv('~/thinkful_mac/thinkful_large_files/imdb_labelled.csv', header = None)
imdb.columns = ['review', 'positive_review']

posneg_list = pos_list + neg_list

keywords = posneg_list

for key in keywords:
    imdb[str(key)] = imdb.review.str.contains(str(key), case = False)

imdb['positive_review'] = (imdb['positive_review'] == 1)
    
data = imdb[keywords]
target = imdb['positive_review']

# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

pred = bnb.predict(data)

from sklearn.metrics import confusion_matrix
c = confusion_matrix(target, pred)

print('Confusion Matrix: \n{}'.format(c))

#Accuracy
print('The accuracy of the model is: ', 1-((pred != target).sum()/data.shape[0]))

#Sensitivity
print('The sensitivity (Percentage of positives correctly predicted) of the model is: {}'.format((c[1][1])/(c[1][1] + c[1][0])))

#Specificity
print('The specificity (Percentage of negatives correctly predicted) of the model is: {}'.format((c[0][0])/(c[0][0] + c[0][1])))

Confusion Matrix: 
[[230 132]
 [  6 380]]
The accuracy of the model is:  0.8155080213903744
The sensitivity (Percentage of positives correctly predicted) of the model is: 0.9844559585492227
The specificity (Percentage of negatives correctly predicted) of the model is: 0.6353591160220995


On the full dataset, the model performs best with both positive and negative keywords: at 81.6% accuracy, it outperformed my original model (79.9%). Let's see how it does on cross-validation.

In [116]:
#Now let's see how well the model accuracy stands up to cross-validation. 

cross_validate(bnb, data, target, cv=3, return_train_score=True)

{'fit_time': array([0.04736495, 0.03517675, 0.03415608]),
 'score_time': array([0.02315116, 0.01321006, 0.01299405]),
 'test_score': array([0.7       , 0.592     , 0.56854839]),
 'train_score': array([0.86144578, 0.69879518, 0.714     ])}

Model 5 performs better than Model 4 in cross-validation, with test scores ranging from 57% to 70%, but not as well as Model 3, which used only positive keywords and had cross-validation scores in the 65-72% range.
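To compare the models on a single number, the sketch below averages the cross-validation test scores reported above and ranks the models by that mean. Model 2 is left out because no cross-validation was run for it in this notebook.

```python
from statistics import mean

# Test-fold scores copied from the cross_validate outputs above.
cv_test_scores = {
    "Model 1": [0.736, 0.744, 0.75403226],
    "Model 3": [0.668, 0.72, 0.65322581],
    "Model 4": [0.612, 0.56, 0.54435484],
    "Model 5": [0.7, 0.592, 0.56854839],
}

mean_scores = {name: mean(scores) for name, scores in cv_test_scores.items()}

# Highest mean test accuracy first.
ranking = sorted(mean_scores, key=mean_scores.get, reverse=True)

for name in ranking:
    print(f"{name}: mean test accuracy {mean_scores[name]:.3f}")
```

The ranking on held-out folds differs from the ranking on the full dataset, where Model 5 came out on top.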