# Challenge: Iterate & Evaluate Classifier #

## Iterate & Evaluate Classifier on Yelp Feedback ##

## by Lorenz Madarang ##

## Data: https://archive.ics.uci.edu/ml/machine-learning-databases/00331/ ##

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import re
from string import punctuation
from collections import Counter
import operator

In [2]:
# Import .txt files and create two columns for the message and the sentiment
sentiment_raw = pd.read_csv('yelp_labelled.txt', delimiter= '\t', header=None)
sentiment_raw.columns = ['message', 'sentiment']

### 1.) Naive Bayes Classifier with Positive Keywords list length of 23 words ###

In [50]:
keywords = ['great', 'friendly', 'delicious', 'Great', 'nice', 'love', 'excellent', 'awesome', 'fantastic', 'tasty', 'stars', '5', 'Best', 'clean', 'perfect', 'loved', 'tender', 'attentive', 'wonderful', 'Good', 'recommend', 'enjoyed', 'deal']
print('Keywords list length is {} words long'.format(len(keywords)))

Keywords list length is 23 words long


In [27]:
for key in keywords:
    # Note that we add spaces around the key so that we're getting the word,
    # not just pattern matching.
    sentiment_raw[str(key)] = sentiment_raw.message.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )
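The space-padded pattern in the loop above only matches a keyword that has a space on both sides, so it misses a keyword at the start of a review or directly before punctuation. A minimal sketch of the difference, using the standard `re` module and made-up sample sentences (the helper names `contains_padded` and `contains_word` are hypothetical, not part of the notebook):

```python
import re

def contains_padded(message, word):
    # Mirrors the notebook's approach: the keyword must have a space on both sides.
    return (" " + word.lower() + " ") in message.lower()

def contains_word(message, word):
    # Word-boundary match: also catches the keyword next to punctuation
    # or at the start or end of the message, but not inside a longer word.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, message, flags=re.IGNORECASE) is not None

# Illustrative sentences, not taken from the Yelp data.
samples = ["Great food!", "The food was great.", "It was a great place.", "The greatest meal"]
for text in samples:
    print(text, contains_padded(text, "great"), contains_word(text, "great"))
```

Because `str.contains` treats its pattern as a regular expression by default, the same `\b...\b` pattern could be passed to it in place of the padded string. Doing so would change the feature columns, and therefore every score reported in this notebook.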

In [28]:
data = sentiment_raw[keywords]
target = sentiment_raw['sentiment']

In [29]:
# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 343


In [30]:
from sklearn.model_selection import train_test_split

# Use train_test_split to create the necessary training and test groups
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

With 20% Holdout: 0.67
Testing on Sample: 0.657


In [31]:
from sklearn.model_selection import cross_val_score
cross_val_score(bnb, data, target, cv=10)

array([ 0.63,  0.67,  0.65,  0.61,  0.65,  0.62,  0.71,  0.67,  0.64,  0.7 ])
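To judge how much the folds fluctuate, it can help to reduce the ten scores to a mean and a spread. The sketch below copies the fold scores printed above into a plain list and uses only the standard library; the variable names are illustrative.

```python
import statistics

# Fold accuracies copied from the cross_val_score output above (23-word list).
fold_scores = [0.63, 0.67, 0.65, 0.61, 0.65, 0.62, 0.71, 0.67, 0.64, 0.70]

mean_score = statistics.mean(fold_scores)
spread = statistics.stdev(fold_scores)              # sample standard deviation
score_range = max(fold_scores) - min(fold_scores)   # best fold minus worst fold

print("mean accuracy: {:.3f}".format(mean_score))
print("std deviation: {:.3f}".format(spread))
print("range:         {:.2f}".format(score_range))
```

A mean close to the single holdout score, together with a small spread, would suggest that the holdout result is not an accident of one particular split.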

In [32]:
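from sklearn.metrics import confusion_matrix

# Rows of the matrix are true labels; columns are predicted labels.
matrix = confusion_matrix(target, y_pred)
# Dividing by column totals gives the precision and the negative predictive value (NPV).
print('Precision is {}'.format(matrix[1,1]/(matrix[0,1]+matrix[1,1])))
print('Negative predictive value is {}'.format(matrix[0,0]/(matrix[0,0]+matrix[1,0])))

Precision is 0.8792270531400966
Negative predictive value is 0.5989911727616646


### Evaluation: ###

This classifier is consistent when I conduct a one-time holdout of 20% (0.67 on the holdout versus 0.657 on the full sample), but when we conduct a cross-validation with 10 folds the accuracy scores fluctuate between 0.61 and 0.71. The precision is pretty good at about 0.88, but the negative predictive value is not so good at about 0.60. Both ratios divide by the number of predictions in a class (the columns of the confusion matrix), so they are not the sensitivity and specificity, which divide by the number of true members of a class (the rows).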

_____________________________________________________________________________________________________________________

### 2.) Naive Bayes Classifier with Positive Keywords list length of 10 words ###

In [33]:
keywords = ['great', 'friendly', 'delicious', 'Great', 'nice', 'love', 'excellent', 'awesome', 'fantastic', 'tasty']

In [34]:
for key in keywords:
    # Note that we add spaces around the key so that we're getting the word,
    # not just pattern matching.
    sentiment_raw[str(key)] = sentiment_raw.message.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )

data = sentiment_raw[keywords]
target = sentiment_raw['sentiment']

# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 408


In [35]:
from sklearn.model_selection import train_test_split

# Use train_test_split to create the necessary training and test groups
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

With 20% Holdout: 0.595
Testing on Sample: 0.592


In [36]:
from sklearn.model_selection import cross_val_score
cross_val_score(bnb, data, target, cv=10)

array([ 0.59,  0.57,  0.57,  0.55,  0.6 ,  0.54,  0.64,  0.59,  0.6 ,  0.65])

In [37]:
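from sklearn.metrics import confusion_matrix

# Rows of the matrix are true labels; columns are predicted labels.
matrix = confusion_matrix(target, y_pred)
# Dividing by column totals gives the precision and the negative predictive value (NPV).
print('Precision is {}'.format(matrix[1,1]/(matrix[0,1]+matrix[1,1])))
print('Negative predictive value is {}'.format(matrix[0,0]/(matrix[0,0]+matrix[1,0])))

Precision is 0.9339622641509434
Negative predictive value is 0.5514541387024608


### Evaluation: ###

This classifier is also consistent when I conduct a one-time holdout of 20% (0.595 on the holdout versus 0.592 on the full sample), but when we conduct a cross-validation with 10 folds the accuracy scores fluctuate between 0.54 and 0.65. The precision is better than for the first classifier (about 0.93 versus 0.88), but the negative predictive value is a little worse (about 0.55 versus 0.60).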

_____________________________________________________________________________________________________________________

### 3.) Naive Bayes Classifier with Positive Keywords list length of 5 words ###

In [38]:
keywords = ['great', 'friendly', 'delicious', 'Great', 'nice']

In [39]:
for key in keywords:
    # Note that we add spaces around the key so that we're getting the word,
    # not just pattern matching.
    sentiment_raw[str(key)] = sentiment_raw.message.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )

data = sentiment_raw[keywords]
target = sentiment_raw['sentiment']

# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 435


In [40]:
from sklearn.model_selection import train_test_split

# Use train_test_split to create the necessary training and test groups
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

With 20% Holdout: 0.565
Testing on Sample: 0.565


In [41]:
from sklearn.model_selection import cross_val_score
cross_val_score(bnb, data, target, cv=10)

array([ 0.59,  0.54,  0.56,  0.51,  0.57,  0.52,  0.62,  0.56,  0.58,  0.6 ])

In [42]:
from sklearn.metrics import confusion_matrix

# Rows of the matrix are true labels; columns are predicted labels.
matrix = confusion_matrix(target, y_pred)
# Dividing by column totals gives the precision and the negative predictive value (NPV).
print('Precision is {}'.format(matrix[1,1]/(matrix[0,1]+matrix[1,1])))
print('Negative predictive value is {}'.format(matrix[0,0]/(matrix[0,0]+matrix[1,0])))

Precision is 0.9452054794520548
Negative predictive value is 0.535059331175836


### Evaluation: ###

This classifier is consistent when I conduct a one-time holdout of 20% (0.565 on both the holdout and the full sample), but when we conduct a cross-validation with 10 folds the accuracy scores fluctuate between 0.51 and 0.62. The precision is the best of the first three classifiers (about 0.95), but the negative predictive value is the worst of the three (about 0.54).
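One detail about the keyword lists: the feature loop calls `str.contains` with `case=False`, so `'great'` and `'Great'` produce identical columns, and the five-word list effectively supplies four distinct features. The sketch below shows one way to collapse such duplicates; the helper name `unique_case_insensitive` is hypothetical and not part of the notebook.

```python
def unique_case_insensitive(words):
    # Keep the first spelling of each word, ignoring case.
    seen = set()
    unique = []
    for word in words:
        key = word.lower()
        if key not in seen:
            seen.add(key)
            unique.append(word)
    return unique

# The five-word list used in this section.
keywords = ['great', 'friendly', 'delicious', 'Great', 'nice']
distinct = unique_case_insensitive(keywords)
print(len(keywords), len(distinct))
```

Dropping the duplicate spellings would change the fitted model, since Bernoulli Naive Bayes counts each duplicated column as a separate piece of evidence, so the scores reported here would not carry over unchanged.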

_____________________________________________________________________________________________________________________

### 4.) Naive Bayes Classifier with Positive Keywords list length of 215 words ###

In [52]:
keywords = ['and',
 'the',
 'was',
 'I',
 'a',
 'is',
 'The',
 'to',
 'good',
 'in',
 'place',
 'food',
 'it',
 'of',
 'great',
 'this',
 'with',
 'for',
 'very',
 'had',
 'are',
 'you',
 'service',
 'were',
 'have',
 'so',
 'on',
 'This',
 'here',
 'friendly',
 'my',
 'amazing',
 's',
 'be',
 'that',
 'back',
 'time',
 'really',
 'they',
 'delicious',
 'we',
 'but',
 'all',
 'Great',
 'nice',
 't',
 'like',
 'our',
 'also',
 'just',
 'not',
 'restaurant',
 'go',
 'staff',
 'We',
 'as',
 'Vegas',
 'at',
 'love',
 'an',
 'will',
 'They',
 'menu',
 'Service',
 'first',
 'their',
 'best',
 'experience',
 'It',
 'made',
 'can',
 'by',
 'My',
 'been',
 'fresh',
 'out',
 'steak',
 'excellent',
 'even',
 'always',
 'atmosphere',
 'awesome',
 'has',
 'which',
 'only',
 'definitely',
 'your',
 'fantastic',
 'ever',
 'pizza',
 'selection',
 'chicken',
 'could',
 'server',
 'came',
 'he',
 'again',
 'what',
 'well',
 'pretty',
 'say',
 'breakfast',
 'up',
 'one',
 'get',
 'spot',
 'm',
 'some',
 'or',
 'happy',
 'prices',
 'there',
 'beer',
 'did',
 'want',
 'tasty',
 'us',
 'when',
 'Their',
 'stars',
 '5',
 've',
 'Best',
 'Food',
 'town',
 'night',
 'come',
 'every',
 'clean',
 'eat',
 'Very',
 'about',
 'taste',
 'from',
 'me',
 'perfect',
 'loved',
 'right',
 'quality',
 'tender',
 'attentive',
 'If',
 'sandwich',
 'salad',
 'inside',
 'wonderful',
 'sauce',
 'd',
 'went',
 'still',
 'buffet',
 'All',
 'spicy',
 'Good',
 'sushi',
 'than',
 'while',
 'quite',
 'order',
 'would',
 'recommend',
 'family',
 'A',
 'tried',
 'So',
 'thing',
 'cooked',
 'any',
 'side',
 'ordered',
 'way',
 'Our',
 'meal',
 'next',
 'worth',
 'ambiance',
 'day',
 'since',
 'going',
 'dish',
 'bread',
 'them',
 'enough',
 'twice',
 'absolutely',
 'more',
 'disappointed',
 'reasonable',
 'little',
 'll',
 'enjoyed',
 'bar',
 'super',
 'fries',
 'too',
 'deal',
 'visit',
 'his',
 'table',
 'huge',
 'sweet',
 'potato',
 'lunch',
 'try',
 'Everything',
 're',
 'price',
 'dishes',
 'better',
 'Nice',
 'steaks',
 'hit',
 'other',
 'i',
 'feel',]
print('Keywords list length is {} words long'.format(len(keywords)))

Keywords list length is 215 words long
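The notebook does not show how this list was built. The imports of `re` and `Counter` at the top, together with fragments such as `'s'`, `'t'` and `'ve'` in the list, suggest that the positive reviews were split on non-word characters and the most frequent tokens kept. The sketch below is an assumption about that step, not the author's code; `top_words` and the sample reviews are made up for illustration.

```python
import re
from collections import Counter

def top_words(messages, n):
    # Count case-sensitive tokens, splitting on anything that is not a word character.
    counts = Counter()
    for message in messages:
        counts.update(re.findall(r"\w+", message))
    return [word for word, _ in counts.most_common(n)]

# Illustrative reviews, not taken from the Yelp data.
positive_reviews = [
    "Great food and great service.",
    "The food was delicious!",
    "Great place, friendly staff.",
]
print(top_words(positive_reviews, 5))
```

A list built this way is dominated by common function words such as "and" and "the", which matches the first entries of the 215-word list above.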


In [18]:
for key in keywords:
    # Note that we add spaces around the key so that we're getting the word,
    # not just pattern matching.
    sentiment_raw[str(key)] = sentiment_raw.message.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )

data = sentiment_raw[keywords]
target = sentiment_raw['sentiment']

# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 244


In [19]:
from sklearn.model_selection import train_test_split

# Use train_test_split to create the necessary training and test groups
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

With 20% Holdout: 0.675
Testing on Sample: 0.756


In [20]:
from sklearn.model_selection import cross_val_score
cross_val_score(bnb, data, target, cv=10)

array([ 0.67,  0.67,  0.69,  0.66,  0.68,  0.7 ,  0.71,  0.76,  0.68,  0.73])

In [21]:
from sklearn.metrics import confusion_matrix
confusion_matrix(target, y_pred)

array([[418,  82],
       [162, 338]])
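With scikit-learn's layout (rows are true labels, columns are predicted labels), four different ratios can be read off the matrix above. The sketch below computes all of them from the printed counts; the function name `binary_metrics` is illustrative.

```python
def binary_metrics(matrix):
    # Layout: [[TN, FP], [FN, TP]] with rows = true labels, columns = predictions.
    (tn, fp), (fn, tp) = matrix
    return {
        "sensitivity": tp / (tp + fn),  # share of true positives that are found
        "specificity": tn / (tn + fp),  # share of true negatives that are found
        "precision": tp / (tp + fp),    # share of predicted positives that are right
        "npv": tn / (tn + fn),          # share of predicted negatives that are right
    }

# Counts copied from the confusion matrix printed above.
matrix = [[418, 82], [162, 338]]
metrics = binary_metrics(matrix)
for name, value in metrics.items():
    print("{}: {:.4f}".format(name, value))
```

For these counts the row-wise ratios are 0.676 (sensitivity) and 0.836 (specificity). The column-wise ratios, precision and negative predictive value, correspond to the 0.8048 and 0.7207 reported for this classifier.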

In [25]:
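# Rows of the matrix are true labels; columns are predicted labels.
matrix = confusion_matrix(target, y_pred)
# Dividing by column totals gives the precision and the negative predictive value (NPV).
print('Precision is {}'.format(matrix[1,1]/(matrix[0,1]+matrix[1,1])))
print('Negative predictive value is {}'.format(matrix[0,0]/(matrix[0,0]+matrix[1,0])))

Precision is 0.8047619047619048
Negative predictive value is 0.7206896551724138


### Evaluation: ###

This classifier seems to be overfitting the data: accuracy is 0.756 on the full sample but drops to 0.675 when I conduct a one-time holdout of 20%. The cross-validation with 10 folds points the same way, since most of the fold scores (0.66 to 0.76) fall below the in-sample figure. The precision (about 0.80) is worse than for the first three classifiers, but the negative predictive value (about 0.72) is better than for any of them.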

_____________________________________________________________________________________________________________________

### 5.) Naive Bayes Classifier with Positive Keywords list length of 1000 words ###

In [53]:
keywords = ['and',
 'the',
 'was',
 'I',
 'a',
 'is',
 'The',
 'to',
 'good',
 'in',
 'place',
 'food',
 'it',
 'of',
 'great',
 'this',
 'with',
 'for',
 'very',
 'had',
 'are',
 'you',
 'service',
 'were',
 'have',
 'so',
 'on',
 'This',
 'here',
 'friendly',
 'my',
 'amazing',
 's',
 'be',
 'that',
 'back',
 'time',
 'really',
 'they',
 'delicious',
 'we',
 'but',
 'all',
 'Great',
 'nice',
 't',
 'like',
 'our',
 'also',
 'just',
 'not',
 'restaurant',
 'go',
 'staff',
 'We',
 'as',
 'Vegas',
 'at',
 'love',
 'an',
 'will',
 'They',
 'menu',
 'Service',
 'first',
 'their',
 'best',
 'experience',
 'It',
 'made',
 'can',
 'by',
 'My',
 'been',
 'fresh',
 'out',
 'steak',
 'excellent',
 'even',
 'always',
 'atmosphere',
 'awesome',
 'has',
 'which',
 'only',
 'definitely',
 'your',
 'fantastic',
 'ever',
 'pizza',
 'selection',
 'chicken',
 'could',
 'server',
 'came',
 'he',
 'again',
 'what',
 'well',
 'pretty',
 'say',
 'breakfast',
 'up',
 'one',
 'get',
 'spot',
 'm',
 'some',
 'or',
 'happy',
 'prices',
 'there',
 'beer',
 'did',
 'want',
 'tasty',
 'us',
 'when',
 'Their',
 'stars',
 '5',
 've',
 'Best',
 'Food',
 'town',
 'night',
 'come',
 'every',
 'clean',
 'eat',
 'Very',
 'about',
 'taste',
 'from',
 'me',
 'perfect',
 'loved',
 'right',
 'quality',
 'tender',
 'attentive',
 'If',
 'sandwich',
 'salad',
 'inside',
 'wonderful',
 'sauce',
 'd',
 'went',
 'still',
 'buffet',
 'All',
 'spicy',
 'Good',
 'sushi',
 'than',
 'while',
 'quite',
 'order',
 'would',
 'recommend',
 'family',
 'A',
 'tried',
 'So',
 'thing',
 'cooked',
 'any',
 'side',
 'ordered',
 'way',
 'Our',
 'meal',
 'next',
 'worth',
 'ambiance',
 'day',
 'since',
 'going',
 'dish',
 'bread',
 'them',
 'enough',
 'twice',
 'absolutely',
 'more',
 'disappointed',
 'reasonable',
 'little',
 'll',
 'enjoyed',
 'bar',
 'super',
 'fries',
 'too',
 'deal',
 'visit',
 'his',
 'table',
 'huge',
 'sweet',
 'potato',
 'lunch',
 'try',
 'Everything',
 're',
 'price',
 'dishes',
 'better',
 'Nice',
 'steaks',
 'hit',
 'other',
 'i',
 'feel',
 'down',
 'area',
 'how',
 'dining',
 'once',
 'ice',
 'incredible',
 'bacon',
 'Both',
 'make',
 'fun',
 'chef',
 'fast',
 'far',
 'You',
 'wrong',
 'didn',
 'everything',
 'check',
 'won',
 'authentic',
 'times',
 'if',
 'options',
 'quick',
 'because',
 'during',
 'recommendation',
 'beautiful',
 'That',
 'tacos',
 'burger',
 'found',
 'Overall',
 'lot',
 'portions',
 'moist',
 'On',
 'dessert',
 'beef',
 'Greek',
 'hummus',
 'duck',
 'He',
 'left',
 'over',
 'perfectly',
 'second',
 'yummy',
 'party',
 'mouth',
 'where',
 'got',
 'Thai',
 'fine',
 'waitress',
 'who',
 'places',
 'eaten',
 'thought',
 'away',
 'pork',
 'delish',
 'melt',
 'cream',
 'fact',
 'drink',
 'never',
 'seated',
 'pasta',
 'full',
 'wait',
 'Phoenix',
 'healthy',
 'decor',
 'butter',
 'chips',
 'white',
 'salmon',
 'though',
 'folks',
 'special',
 'must',
 'stop',
 'flavorful',
 'know',
 'seafood',
 'each',
 'Just',
 'patio',
 'In',
 'house',
 'tell',
 'And',
 'two',
 'pleased',
 'Pretty',
 'warm',
 'As',
 'cool',
 'don',
 'think',
 'regular',
 'until',
 'dinner',
 'small',
 'now',
 'new',
 'many',
 'homemade',
 'thin',
 'BEST',
 'An',
 'boyfriend',
 'Wow',
 'Loved',
 'Stopped',
 'off',
 'touch',
 'recommended',
 'care',
 'stuff',
 'wall',
 'Mexican',
 'Also',
 'inexpensive',
 'shrimp',
 'dressing',
 'pita',
 'rare',
 'outside',
 'portion',
 'amount',
 'glad',
 'Always',
 '2',
 'drinks',
 'seasoned',
 'Today',
 'Pho',
 'rolls',
 'die',
 'Will',
 'quickly',
 'cafe',
 'marrow',
 'added',
 'extra',
 'waiter',
 'helpful',
 'coming',
 'beat',
 'LOVED',
 'wine',
 'bartender',
 'ambience',
 'music',
 'playing',
 'trip',
 'said',
 'belly',
 'crispy',
 'burgers',
 'Delicious',
 'selections',
 'cheese',
 'real',
 'Subway',
 'seriously',
 'top',
 'itself',
 'decorated',
 'greeted',
 'Bay',
 'Some',
 'joint',
 'different',
 'several',
 'years',
 'ago',
 'priced',
 'used',
 'ladies',
 'vegetarian',
 'Not',
 'Love',
 'something',
 'flavor',
 'What',
 'served',
 'hot',
 'home',
 'watch',
 'wings',
 'feeling',
 'bowl',
 'drive',
 'things',
 'hope',
 'brunch',
 'sides',
 'puree',
 'friend',
 'Waitress',
 'couple',
 'Everyone',
 'customer',
 'Now',
 'Seriously',
 'dark',
 'people',
 'creamy',
 'Fantastic',
 'sticks',
 'around',
 'generous',
 '8',
 'world',
 'kind',
 'job',
 'liked',
 'outstanding',
 'meat',
 'soooo',
 'fish',
 'summer',
 'delightful',
 'expect',
 'considering',
 'potatoes',
 'serve',
 'may',
 'am',
 'OMG',
 'feels',
 'brick',
 'oven',
 'equally',
 'pleasant',
 'treat',
 'tasted',
 'large',
 'comfortable',
 'fry',
 'interesting',
 'highly',
 'stuffed',
 'To',
 'lots',
 'see',
 'perfection',
 'impeccable',
 'Hot',
 'soon',
 'tea',
 'assure',
 'professional',
 'told',
 'These',
 'owners',
 'week',
 'sure',
 'ask',
 'tots',
 'disappoint',
 'enjoy',
 'group',
 'much',
 'favorite',
 'manager',
 'pizzas',
 'lovely',
 'such',
 'need',
 'thumbs',
 'choose',
 'desserts',
 'Perfect',
 'do',
 'tapas',
 'setting',
 'take',
 'felt',
 'friends',
 'satisfying',
 'grilled',
 'especially',
 'vibe',
 'few',
 'seating',
 'Italian',
 'close',
 'him',
 'owner',
 'Awesome',
 'Prices',
 'simple',
 'One',
 'Chicken',
 'Chinese',
 'paper',
 'sat',
 'When',
 'definately',
 'tribute',
 'last',
 'salsa',
 'she',
 'late',
 'May',
 'bank',
 'holiday',
 'Rick',
 'Steve',
 'prompt',
 'Cape',
 'Cod',
 'ravoli',
 'cranberry',
 'mmmm',
 'Highly',
 'cute',
 'less',
 'interior',
 'performed',
 'red',
 'velvet',
 'cake',
 'ohhh',
 'hole',
 'street',
 'combos',
 '23',
 'decent',
 'accident',
 'happier',
 'redeeming',
 'Ample',
 'Hiro',
 'delight',
 'positive',
 'note',
 'provided',
 'prime',
 'rib',
 'section',
 'Firehouse',
 'refreshing',
 'pink',
 'char',
 'running',
 'after',
 'realized',
 'husband',
 'sunglasses',
 'chow',
 'mein',
 'servers',
 'imaginative',
 'power',
 'scallop',
 'receives',
 'APPETIZERS',
 'cocktails',
 'handmade',
 'give',
 'military',
 'discount',
 'Dos',
 'Gringos',
 'Update',
 'finish',
 'included',
 'tastings',
 'Jeff',
 'above',
 'beyond',
 'expected',
 'Really',
 'rice',
 'spring',
 'oh',
 'Omelets',
 'sexy',
 'outrageously',
 'flirting',
 'hottest',
 'person',
 'arrived',
 'serves',
 'wife',
 'loves',
 'roasted',
 'garlic',
 'bone',
 'another',
 'kept',
 'bloddy',
 'mary',
 'Buffet',
 'cannot',
 'mussels',
 'reduction',
 'buffets',
 'Tigerlilly',
 'afternoon',
 'personable',
 'AND',
 'Sooooo',
 'Check',
 'guys',
 'loving',
 'son',
 'worst',
 'venture',
 'further',
 'Phenomenal',
 'Definitely',
 'venturing',
 'strip',
 'return',
 'Penne',
 'vodka',
 'including',
 'massive',
 'meatloaf',
 'wrap',
 'tuna',
 'NYC',
 'bagels',
 'Lox',
 'capers',
 'meet',
 'expectations',
 'solid',
 'bars',
 'empty',
 'suggestions',
 'blanket',
 'moz',
 'done',
 'cover',
 'subpar',
 'bathrooms',
 'fiancé',
 'middle',
 'Mandalay',
 'highlights',
 'nigiri',
 'cut',
 'piece',
 'flavored',
 'Voodoo',
 'gluten',
 'free',
 'immediately',
 'diverse',
 'reasonably',
 'Restaurant',
 'DELICIOUS',
 'hands',
 'metro',
 'Bacon',
 'hella',
 'salty',
 'menus',
 'handed',
 'no',
 'listed',
 'waitresses',
 'Lordy',
 'Khao',
 'Soi',
 'missed',
 'curry',
 'lovers',
 'terrific',
 'thrilled',
 'accommodations',
 'daughter',
 'modern',
 'hip',
 'maintaining',
 'coziness',
 'weekly',
 'haunt',
 'hits',
 'lacking',
 'quantity',
 'Lemon',
 'raspberry',
 'cocktail',
 'Interesting',
 'crepe',
 'station',
 'bits',
 'original',
 'preparing',
 'egg',
 'satisfied',
 'heard',
 'exceeding',
 'dreamed',
 'serivce',
 'inviting',
 'mixed',
 'mushrooms',
 'yukon',
 'gold',
 'corn',
 'beateous',
 'tartar',
 'Extremely',
 'Tasty',
 'Jamaican',
 'mojitos',
 'rich',
 'accordingly',
 'wrapped',
 'dates',
 'unbelievable',
 'BARGAIN',
 'Otto',
 'welcome',
 'pho',
 'whenever',
 'sporting',
 'events',
 'walls',
 'covered',
 'TV',
 'hardest',
 'decision',
 'Honestly',
 'M',
 'supposed',
 'providing',
 'flavourful',
 'delights',
 'Much',
 'AYCE',
 'lighting',
 'set',
 'mood',
 'Owner',
 'peanut',
 '7',
 'exquisite',
 'boot',
 'Plus',
 'bucks',
 'Thus',
 'visited',
 'year',
 'Veggitarian',
 'platter',
 'cant',
 'Madison',
 'Ironman',
 'chefs',
 'goat',
 'taco',
 'skimp',
 'wow',
 'FLAVOR',
 'Bachi',
 'Burger',
 'Pizza',
 'Salads',
 'pulled',
 'incredibly',
 'prepared',
 'dine',
 'charming',
 'outdoor',
 'high',
 'Back',
 'BBQ',
 'lighter',
 'fare',
 'pricing',
 'public',
 'old',
 'ways',
 '20',
 'exceptional',
 'reviews',
 'months',
 'later',
 'returned',
 'Favorite',
 'shawarrrrrrma',
 'black',
 'eyed',
 'peas',
 'UNREAL',
 'vinaigrette',
 'overall',
 'truly',
 'unbelievably',
 'delicioso',
 'Of',
 'vegetables',
 'driving',
 'Tucson',
 'Chipotle',
 'BETTER',
 'Classy',
 'appetizers',
 'succulent',
 'Baseball',
 'app',
 'multiple',
 'treated',
 'genuinely',
 'enthusiastic',
 'evening',
 'life',
 'bathroom',
 'door',
 'Outstanding',
 'Server',
 'handling',
 'rowdy',
 'Would',
 'craving',
 'deserves',
 'space',
 'tiny',
 'elegantly',
 'customize',
 'usual',
 'Eggplant',
 'Green',
 'Bean',
 'stir',
 'part',
 'dinners',
 'outshining',
 'Halibut',
 'Def',
 'ethic',
 'continue',
 'andddd',
 'date',
 'anyone',
 'past',
 'walked',
 'located',
 'Crystals',
 'shopping',
 'mall',
 'Aria',
 'summarize',
 'nay',
 'transcendant',
 'nothing',
 'brings',
 'joy',
 'memory',
 'pneumatic',
 'condiment',
 'dispenser',
 'Kids',
 'kiddos',
 'Cooked',
 'reminds',
 'mom',
 'pop',
 'shops',
 'San',
 'Francisco',
 'Area',
 'Buldogis',
 'Gourmet',
 'Dog',
 'possible',
 'petty',
 'iced',
 'Come',
 'hungry',
 'leave',
 'eating',
 'First',
 'become',
 'looked',
 'overwhelmed',
 'needs',
 'stayed',
 'end',
 'From',
 'companions',
 'texture',
 'No',
 'complaints',
 'expert',
 'connisseur',
 'topic',
 'nicest',
 'across',
 'biscuits',
 'absolutley',
 'Steiners',
 'familiar',
 'Anyway',
 'FS',
 'Each',
 'mention',
 'combination',
 'pears',
 'almonds',
 'big',
 'winner',
 'spicier',
 'prefer',
 'ribeye',
 'mesquite',
 'gooodd',
 'mouthful',
 'enjoyable',
 'relaxed',
 'venue',
 'couples',
 'groups',
 'etc',
 'Nargile',
 'tater',
 'southwest',
 'vanilla',
 'smooth',
 'profiterole',
 'choux',
 'pastry',
 'Im',
 'AZ',
 'margaritas',
 'trimmed',
 '70',
 'claimed',
 '40',
 'handled',
 'beautifully',
 'jewel',
 'Las',
 'exactly',
 'hoping',
 'find',
 'nearly',
 'ten',
 'living',
 'isn',
 'establishment',
 'toro',
 'tartare',
 'cavier',
 'extraordinary',
 'thinly',
 'sliced',
 'wagyu',
 'truffle',
 'How',
 'decide',
 'CONCLUSION',
 'filling',
 'meals',
 'daily',
 'specials',
 'pancake',
 'crawfish',
 'monster',
 'fried',
 'eggs',
 'funny',
 'Mom',
 'multi',
 'grain',
 'pumpkin',
 'pancakes',
 'pecan',
 'fluffy',
 'Cant',
 'hand',
 'pastas',
 'Give',
 'By']
print('Keywords list length is {} words long'.format(len(keywords)))

Keywords list length is 1000 words long


In [44]:
for key in keywords:
    # Note that we add spaces around the key so that we're getting the word,
    # not just pattern matching.
    sentiment_raw[str(key)] = sentiment_raw.message.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )

data = sentiment_raw[keywords]
target = sentiment_raw['sentiment']

# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 196


In [45]:
from sklearn.model_selection import train_test_split

# Use train_test_split to create the necessary training and test groups
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

With 20% Holdout: 0.685
Testing on Sample: 0.804


In [46]:
from sklearn.model_selection import cross_val_score
cross_val_score(bnb, data, target, cv=10)

array([ 0.7 ,  0.67,  0.72,  0.69,  0.7 ,  0.7 ,  0.64,  0.78,  0.67,  0.76])

In [47]:
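from sklearn.metrics import confusion_matrix

# Rows of the matrix are true labels; columns are predicted labels.
matrix = confusion_matrix(target, y_pred)
# Dividing by column totals gives the precision and the negative predictive value (NPV).
print('Precision is {}'.format(matrix[1,1]/(matrix[0,1]+matrix[1,1])))
print('Negative predictive value is {}'.format(matrix[0,0]/(matrix[0,0]+matrix[1,0])))

Precision is 0.874384236453202
Negative predictive value is 0.7558922558922558


### Evaluation: ###

This classifier seems to be overfitting the data: accuracy is 0.804 on the full sample but only 0.685 when I conduct a one-time holdout of 20%. The cross-validation with 10 folds confirms it, since every fold score (0.64 to 0.78) falls below the in-sample figure. The precision (about 0.87) has increased relative to the 215-word classifier (about 0.80), and the negative predictive value (about 0.76) is the highest of all the classifiers.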

_____________________________________________________________________________________________________________________

### Overall Evaluation: ###
The classifiers with long keyword lists tend to overfit. This can be readily seen in the train/test split with a 20% holdout: the 215-word and 1000-word classifiers score 0.756 and 0.804 on the full sample but only 0.675 and 0.685 on the holdout. I believe the classifier that works best without overfitting is the one with the 23-word keyword list. The features that seem to be the most helpful are the top 5 positive sentiment words: with just those five words, about 94% of the reviews predicted to be positive really are positive. That figure is a precision (it divides by the number of predicted positives), and overall accuracy with only those five words is about 57%.
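The overfitting argument can be summarized as the gap between in-sample accuracy and 20% holdout accuracy for each keyword list. The sketch below copies the five pairs of scores reported in the sections above into a dictionary; the names `results` and `gaps` are illustrative.

```python
# Keyword list length -> (20% holdout accuracy, accuracy on the full sample),
# copied from the outputs reported in each section.
results = {
    5: (0.565, 0.565),
    10: (0.595, 0.592),
    23: (0.670, 0.657),
    215: (0.675, 0.756),
    1000: (0.685, 0.804),
}

# A large positive gap means the model scores better on data it has already seen.
gaps = {n: in_sample - holdout for n, (holdout, in_sample) in results.items()}

for n in sorted(gaps):
    print("{:>4} keywords: gap = {:+.3f}".format(n, gaps[n]))
```

The three short lists show gaps within about 0.013 of zero, while the 215-word and 1000-word lists show gaps of roughly 0.08 and 0.12. A single holdout is a rough estimate, so these gaps are indicative rather than exact.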