My challenge for this assignment is to evaluate the classifier I built in a previous assignment and then create five versions of it in an attempt to improve it.

In [3]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import sklearn

I will reload the code for the model I built previously so that I can evaluate it.

In [54]:
df = pd.read_csv('yelp_labelled.txt', delimiter = "\t", names=['Comment', 'Sentiment'])

keywords = ["not", "bad", "didn't", "worst", "poor", "slow"]

for key in keywords:
    df[str(key)] = df.Comment.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )
       
data = df[keywords]
target = df['Sentiment']

from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB()

# Fit our model to the data
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

I will now run a confusion matrix to assess what types of errors are occurring.

In [6]:
from sklearn.metrics import confusion_matrix
confusion_matrix(target, y_pred)

array([[119, 381],
       [ 21, 479]], dtype=int64)

My model was very good at identifying the positive reviews, catching 479 out of 500, a 95.8% rate. However, it did not identify negative reviews well, catching only 119 out of 500, a 23.8% rate. I therefore have high sensitivity, because the positives are identified well, but low specificity, because negatives are so often falsely identified as positive. The sample is balanced, with 500 negative reviews and 500 positive reviews, so the failure is not related to class imbalance.
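The rates quoted above can be recomputed directly from the confusion matrix. This is only a small sketch: the matrix values are copied from the output above, and the variable names are my own.

```python
# Confusion matrix from the first model, copied from the output above.
# Rows are true labels (0 = negative, 1 = positive); columns are predictions.
cm = [[119, 381],
      [21, 479]]

(tn, fp), (fn, tp) = cm

sensitivity = tp / (tp + fn)                 # share of positive reviews caught
specificity = tn / (tn + fp)                 # share of negative reviews caught
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all reviews classified correctly

print(f"sensitivity={sensitivity:.3f}  specificity={specificity:.3f}  accuracy={accuracy:.3f}")
```

The accuracy that falls out of this matches the 0.598 score reported when testing on the full sample.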

I will run a holdout group evaluation and then a cross-validation to see whether those evaluations give me different results.

In [7]:
from sklearn.model_selection import train_test_split
# Use train_test_split to create the necessary training and test groups
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

With 20% Holdout: 0.585
Testing on Sample: 0.598


In [10]:
from sklearn.model_selection import cross_val_score
cross_val_score(bnb, data, target, cv=10)

array([0.65, 0.58, 0.61, 0.52, 0.61, 0.62, 0.63, 0.62, 0.56, 0.58])

In the holdout group evaluation the score was essentially the same (0.585 versus 0.598 on the full sample), so there are no red flags there. In the cross-validation, the fold scores range from 0.52 to 0.65 and sit around the 59% mark that I had already hit.
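To put a number on how much the folds vary, the scores can be summarized with a mean and a standard deviation. A minimal sketch, using the fold scores copied from the output above and only the standard library:

```python
from statistics import mean, stdev

# Fold scores copied from the cross_val_score output above.
scores = [0.65, 0.58, 0.61, 0.52, 0.61, 0.62, 0.63, 0.62, 0.56, 0.58]

fold_mean = mean(scores)    # average accuracy across the 10 folds
fold_std = stdev(scores)    # sample standard deviation of the fold scores

print(f"mean={fold_mean:.3f}  std={fold_std:.3f}  min={min(scores)}  max={max(scores)}")
```

A mean close to the full-sample score with a small spread suggests the model is not overfitting; it is simply not very accurate.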

My goal now is to find better features that can help my model distinguish when a review is negative.

In [17]:
df_2 = pd.read_csv('yelp_labelled.txt', delimiter = "\t", names=['Comment', 'Sentiment'])
# I will keep the keywords from the first version, and try adding more
keywords_2 = ["not", "bad", "didn't", "worst", "poor", "slow", "disappointed",
            "gross", "bland", "stinks", "wrong", "flavorless", "stay away"]

for key in keywords_2:
    df_2[str(key)] = df_2.Comment.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )
       
data_2 = df_2[keywords_2]
target_2 = df_2['Sentiment']

from sklearn.naive_bayes import BernoulliNB

bnb_2 = BernoulliNB()

# Fit our model to the data
bnb_2.fit(data_2, target_2)

# Classify, storing the result in a new variable.
y_pred_2 = bnb_2.predict(data_2)

In [19]:
from sklearn.metrics import confusion_matrix
confusion_matrix(target_2, y_pred_2)

array([[134, 366],
       [ 21, 479]], dtype=int64)

Disappointingly, I got a similar result. My sensitivity is exactly the same at 95.8%. My specificity had a slight increase from 23.8% to 26.8%, classifying only 15 more negatives correctly. I will need to come up with either more keywords or a different approach.
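One thing that may be limiting the keyword features is the way they are matched. Requiring a space on both sides of the keyword misses it at the start or end of a comment and whenever it is followed by punctuation. The sketch below compares that approach with a regular-expression word boundary. The helper names and the sample sentences are made up for illustration and are not taken from the dataset.

```python
import re

def contains_spaced(text, keyword):
    """Mimic the notebook's matching: keyword must have a space on both sides."""
    return (" " + keyword.lower() + " ") in text.lower()

def contains_word(text, keyword):
    """Match the keyword as a whole word, wherever it sits in the text."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Invented example sentences (not from yelp_labelled.txt).
examples = [
    ("Bad service and a long wait.", "bad"),   # keyword at the start
    ("The food was bland.", "bland"),          # keyword before a period
    ("It was not good at all.", "not"),        # keyword surrounded by spaces
    ("A notable meal.", "not"),                # keyword only inside another word
]

for text, key in examples:
    print(f"{key!r:8} spaced={contains_spaced(text, key)!s:6} word={contains_word(text, key)}")
```

In pandas, the same word-boundary pattern could be passed to `str.contains`, which treats its pattern as a regular expression by default.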

I will try TF-IDF feature extraction. I will first see what features I can extract from the positive reviews.

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer

df_3 = pd.read_csv('yelp_labelled.txt', delimiter = "\t", names=['Comment', 'Sentiment'])

vectorizer = TfidfVectorizer(stop_words='english', max_features=13, ngram_range=(2,3))
X = vectorizer.fit_transform(df_3[df_3['Sentiment'] == 1]['Comment'])
keywords_3 = vectorizer.get_feature_names()

print(keywords_3)

['food delicious', 'food good', 'food great', 'food service', 'friendly staff', 'good food', 'good prices', 'good service', 'great food', 'great place', 'great service', 'really good', 'service good']


In [57]:
for key in keywords_3:
    df_3[str(key)] = df_3.Comment.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )
       
data_3 = df_3[keywords_3]
target_3 = df_3['Sentiment']

from sklearn.naive_bayes import BernoulliNB

bnb_3 = BernoulliNB()

bnb_3.fit(data_3, target_3)

y_pred_3 = bnb_3.predict(data_3)

In [59]:
from sklearn.metrics import confusion_matrix
confusion_matrix(target_3, y_pred_3)

array([[500,   0],
       [492,   8]], dtype=int64)

So this essentially called everything negative, which is great for my specificity problem but threw my sensitivity out the window: only 8 of the 500 positive reviews were identified. I will try adding more features from the positive reviews, and I will add features from the negative reviews as well.
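Something that may help explain why these bigram features flag so few reviews: with `stop_words='english'`, the vectorizer drops stop words before it builds n-grams, so a feature such as 'food delicious' or 'won disappointed' need not appear word for word in the raw comment. The sketch below imitates that behaviour in plain Python. The stop list is a tiny illustrative subset, the sentences are invented, and the function name is my own.

```python
import re

# Tiny illustrative stop list (an assumption; the real English list is much longer).
STOP_WORDS = {"you", "be", "the", "was", "is", "and"}

def bigrams_after_stop_words(text):
    """Tokenize, drop stop words, then join neighbouring tokens into bigrams."""
    # Tokens of two or more word characters, so "won't" yields "won".
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [" ".join(pair) for pair in zip(kept, kept[1:])]

# Invented example sentences (not from yelp_labelled.txt).
for review in ["You won't be disappointed.", "The food was delicious."]:
    grams = bigrams_after_stop_words(review)
    # Check whether any bigram appears verbatim, padded with spaces, in the raw text.
    found = any((" " + g + " ") in review.lower() for g in grams)
    print(review, "->", grams, "| found in raw text:", found)
```

If this is what is happening, matching the extracted bigrams against the raw comments with `str.contains` will miss most of the reviews that produced them.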

In [61]:
from sklearn.feature_extraction.text import TfidfVectorizer

df_4 = pd.read_csv('yelp_labelled.txt', delimiter = "\t", names=['Comment', 'Sentiment'])

vectorizer_4 = TfidfVectorizer(stop_words='english', max_features=20, ngram_range=(2,3))
X_4_pos = vectorizer_4.fit_transform(df_4[df_4['Sentiment'] == 1]['Comment'])
keywords_4_pos = vectorizer_4.get_feature_names()

print(keywords_4_pos)

['food delicious', 'food good', 'food great', 'food service', 'friendly staff', 'good food', 'good prices', 'good service', 'great food', 'great place', 'great service', 'love place', 'pretty good', 'really good', 'second time', 'service food', 'service good', 'super friendly', 'vegas buffet', 'won disappointed']


In [62]:
#Now add negative features
vectorizer_4 = TfidfVectorizer(stop_words='english', max_features=20, ngram_range=(2,3))
X_4_neg = vectorizer_4.fit_transform(df_4[df_4['Sentiment'] == 0]['Comment'])
keywords_4_neg = vectorizer_4.get_feature_names()

print(keywords_4_neg)

['10 minutes', '20 minutes', '30 minutes', 'anytime soon', 'bad food', 'customer service', 'don know', 'don think', 'feel like', 'going anytime', 'going anytime soon', 'good food', 'good way', 'mediocre food', 'minutes food', 'service slow', 'think ll', 'waste time', 'won going', 'zero stars']


In [65]:
for key in keywords_4_pos:
    df_4[str(key)] = df_4.Comment.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )

for key in keywords_4_neg:
    df_4[str(key)] = df_4.Comment.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )
              
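# Caution: `&` is an element-wise logical AND aligned on column labels;
# it does not place the positive and negative feature columns side by side.
data_4 = df_4[keywords_4_pos] & df_4[keywords_4_neg]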
target_4 = df_4['Sentiment']

from sklearn.naive_bayes import BernoulliNB

bnb_4 = BernoulliNB()

bnb_4.fit(data_4, target_4)

y_pred_4 = bnb_4.predict(data_4)   

In [66]:
from sklearn.metrics import confusion_matrix
confusion_matrix(target_4, y_pred_4)

array([[  2, 498],
       [  1, 499]], dtype=int64)

I am surprised at this development. My model is now calling virtually everything positive. This is the reverse of the problem I had with the previous iteration using tf-idf. For my fifth and final iteration, I will try to find a balance between the two by using fewer features. Perhaps I added too many. I will return to 13 features for positive, as in my first tf-idf version, and use only 10 features for negative. Hopefully this will balance it out.
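Before changing the number of features, it may also be worth checking how the two feature sets are combined. The `&` operator between two DataFrames is an element-wise logical AND aligned on row and column labels, not a way to put two sets of columns next to each other. A minimal sketch of selecting the union of both keyword lists instead, using a made-up three-row frame (the data and variable names are hypothetical; 'good food' appears in both lists, as it does in the extracted keywords):

```python
import pandas as pd

# Hypothetical boolean feature columns for three reviews.
df = pd.DataFrame({
    "great food": [True, False, False],
    "good food": [False, True, False],
    "bad food": [False, False, True],
})

pos = ["great food", "good food"]
neg = ["bad food", "good food"]   # "good food" is shared with the positive list

# Concatenate the lists and drop duplicates while keeping the original order.
feature_cols = list(dict.fromkeys(pos + neg))

# Select every feature column once; this is the frame to pass to the model.
features = df[feature_cols]

print(feature_cols)
print(features.shape)
```

Selecting the columns this way keeps every keyword as its own feature, so the model sees both the positive and the negative signals.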

In [67]:
from sklearn.feature_extraction.text import TfidfVectorizer

df_5 = pd.read_csv('yelp_labelled.txt', delimiter = "\t", names=['Comment', 'Sentiment'])

vectorizer_5 = TfidfVectorizer(stop_words='english', max_features=13, ngram_range=(2,3))
X_5_pos = vectorizer_5.fit_transform(df_5[df_5['Sentiment'] == 1]['Comment'])
keywords_5_pos = vectorizer_5.get_feature_names()

print(keywords_5_pos)

['food delicious', 'food good', 'food great', 'food service', 'friendly staff', 'good food', 'good prices', 'good service', 'great food', 'great place', 'great service', 'really good', 'service good']


In [76]:
#Now add negative features
vectorizer_5 = TfidfVectorizer(stop_words='english', max_features=10, ngram_range=(2,3))
X_5_neg = vectorizer_5.fit_transform(df_5[df_5['Sentiment'] == 0]['Comment'])
keywords_5_neg = vectorizer_5.get_feature_names()

print(keywords_5_neg)

['10 minutes', 'anytime soon', 'bad food', 'customer service', 'don think', 'going anytime', 'service slow', 'waste time', 'won going', 'zero stars']


In [77]:
for key in keywords_5_pos:
    df_5[str(key)] = df_5.Comment.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )

for key in keywords_5_neg:
    df_5[str(key)] = df_5.Comment.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )
              
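# Caution: `&` is an element-wise logical AND aligned on column labels;
# it does not place the positive and negative feature columns side by side.
data_5 = df_5[keywords_5_pos] & df_5[keywords_5_neg]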
target_5 = df_5['Sentiment']

from sklearn.naive_bayes import BernoulliNB

bnb_5 = BernoulliNB()

bnb_5.fit(data_5, target_5)

y_pred_5 = bnb_5.predict(data_5)   

In [78]:
from sklearn.metrics import confusion_matrix
confusion_matrix(target_5, y_pred_5)

array([[500,   0],
       [500,   0]], dtype=int64)

And just like that, I am back to the same problem: the model calls everything negative. I believe I have reached a point where I need to learn more. My experiment with tf-idf didn't turn out the way I wanted, but I expect to learn how to use it in more depth later in the course.