## Cross-Validating a Naive Bayes Classifier used to Perform Sentiment Analysis

Dataset: UCI sentiment labeled sentences
 https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
 
Positive/Negative word list: University of Pittsburgh Subjectivity Lexicon http://mpqa.cs.pitt.edu/ 
Instructions: 
1. Pick one of the company data files and build your own classifier. 
2. When you're satisfied with its performance (at this point just using the accuracy measure shown in the example), test it on one of the other datasets to see how well these kinds of classifiers translate from one context to another.
3. Include your model and a brief writeup of your feature engineering and selection process to submit and review with your mentor.

In [116]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import sys
import config
import string

# data is binary so I'll use the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

In [117]:
# load text file of negative and positive words from http://mpqa.cs.pitt.edu/

df_positive_words = pd.read_csv('positive-words.txt', header = None)
df_positive_words.columns=['pos_words']
df_negative_words = pd.read_csv('negative-words2.txt', header = None, encoding = "ISO-8859-1")
df_negative_words.columns=['neg_words']

df_positive_words.head()


Unnamed: 0,pos_words
0,a+
1,abound
2,abounds
3,abundance
4,abundant


In [118]:
# load sentiment data and label columns
sentiment_raw = pd.read_csv(filepath_or_buffer='yelp_labelled.txt', delimiter='\t', header=None)
# name new columns
sentiment_raw.columns=['message', 'sentiment']

sentiment_raw['message_cleaned'] = sentiment_raw['message'].apply(lambda x:''.join([i for i in x 
                                                  if i not in string.punctuation]))
sentiment_raw.head(n=10)

Unnamed: 0,message,sentiment,message_cleaned
0,Wow... Loved this place.,1,Wow Loved this place
1,Crust is not good.,0,Crust is not good
2,Not tasty and the texture was just nasty.,0,Not tasty and the texture was just nasty
3,Stopped by during the late May bank holiday of...,1,Stopped by during the late May bank holiday of...
4,The selection on the menu was great and so wer...,1,The selection on the menu was great and so wer...
5,Now I am getting angry and I want my damn pho.,0,Now I am getting angry and I want my damn pho
6,Honeslty it didn't taste THAT fresh.),0,Honeslty it didnt taste THAT fresh
7,The potatoes were like rubber and you could te...,0,The potatoes were like rubber and you could te...
8,The fries were great too.,1,The fries were great too
9,A great touch.,1,A great touch


### Check for class imbalance:

In [119]:
# check to see if there is a dominant class
print("Proportion of positive sentiments in initial training data:")
# mean() of a 0/1 column is the share of positives, without hard-coding the row count
print(sentiment_raw['sentiment'].mean())

Proportion of positive sentiments in initial training data:
0.5


There are equal numbers of positive and negative sentiments in the original (entire) dataset, so the training data should not be affected by class imbalance.

In [120]:

#create an initial dataframe to get frequency of words from positive text file
# positive_features = pd.DataFrame()
# negative_features = pd.DataFrame()


#create a series for negative words and for positive words using the text files 

keywords_positive = df_positive_words['pos_words']
keywords_negative = df_negative_words['neg_words']


#create a binary feature for the presence of positive words
data = pd.DataFrame()
for key in keywords_positive:
    # spaces around the key to get the word,not just pattern matching.
    data[str(key)] = sentiment_raw.message_cleaned.str.contains(' ' + str(key) + ' ', case=False).astype(int)

for key in keywords_negative:
    # spaces around the key to get the word,not just pattern matching.
    data[str(key)] = sentiment_raw.message_cleaned.str.contains(' ' + str(key) + ' ', case=False).astype(int)

data.head()
             

Unnamed: 0,a+,abound,abounds,abundance,abundant,accessable,accessible,acclaim,acclaimed,acclamation,...,wrongly,wrought,yawn,zap,zapped,zaps,zealot,zealous,zealously,zombie
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [121]:
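# return the most common lexicon words found in the training data
sums = data.sum(axis=0)
# Series.sort() was removed from pandas; sort_values() returns the sorted copy
sums = sums.sort_values(ascending=False)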
print(sums)

data.drop(['a+'],axis=1,inplace=True)




a+               194
good              62
like              43
great             43
friendly          21
nice              19
love              16
best              16
pretty            15
better            12
recommend         11
worth             10
enough            10
loved              9
amazing            9
worst              9
fresh              9
well               9
happy              8
hot                8
excellent          8
perfect            8
right              7
clean              7
awesome            7
super              7
warm               6
authentic          6
enjoy              6
slow               6
                ... 
occlude            0
occluded           0
occludes           0
occluding          0
odd                0
odder              0
oddest             0
oddities           0
oddity             0
oddly              0
obstructed         0
obstinately        0
oblivious          0
obstinate          0
obnoxious          0
obnoxiously        0
obscene      
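
The count of 194 for `a+` is consistent with `str.contains` treating the key as a regular expression, which is its default: ` a+ ` then matches the standalone word "a" rather than the literal token "a+". The space padding also misses words at the very start or end of a message. Below is a minimal sketch of an alternative, assuming an escaped whole-word pattern is acceptable; the helper `contains_word` and the toy messages are hypothetical and not part of the original notebook.

```python
import re

import pandas as pd

# Toy messages standing in for sentiment_raw['message_cleaned'] (hypothetical).
messages = pd.Series([
    "A great touch",
    "The fries were great",
    "The taste was a let down",
])


def contains_word(series, word):
    """Return 1 where `word` occurs as a whole word, else 0 (case-insensitive)."""
    # re.escape makes characters such as '+' literal; the lookarounds require
    # that the match is not glued to another letter, digit, or underscore.
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return series.str.contains(pattern, case=False, regex=True).astype(int)


# Space-padded patterns, as used in the notebook.
padded_great = messages.str.contains(" great ", case=False).astype(int)
padded_a_plus = messages.str.contains(" a+ ", case=False).astype(int)

# Escaped whole-word patterns.
word_great = contains_word(messages, "great")
word_a_plus = contains_word(messages, "a+")

print(pd.DataFrame({
    "padded_great": padded_great,
    "word_great": word_great,
    "padded_a_plus": padded_a_plus,
    "word_a_plus": word_a_plus,
}))
```

Switching to this pattern would change the feature counts, so the results reported in this notebook would need to be regenerated.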



In [122]:
# build training data as new dataframe for model and assign target (outcome variable)
target = sentiment_raw['sentiment']

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 251


### Use Cross-Validation to test model

In [None]:
# create variants of feature sets to test with cross validation
#try dropping uncommon feature values
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent
data.drop([col for col, val in data.sum().items() if val < 1], axis=1, inplace=True)

#try dropping a+ feature

# try adding allcaps feature
#data['allcaps'] = sentiment_raw.message_cleaned.str.isupper()

# try adding exclamation points as feature
#
# copy so that later changes to data_2 do not modify data
data_2 = data.copy()
#identify number of folds and split the dataset
n = 10
data_2_split = np.array_split(data_2, n)
target_split = np.array_split(target, n)

#for each fold assign it as the test dataset and fit your model on the remaining data
sum = 0
for x in range(n):
    train_data = []
    train_target = []
   
    for y in range(n):
        if y == x :
            test_data = data_2_split[x]
            test_target = target_split[y]
        else:
            train_data.append(data_2_split[y])
            train_target.append(target_split[y])
            df_train_data = pd.concat(train_data)
            df_train_target = pd.concat(train_target)
    bnb = BernoulliNB()
    bnb.fit(df_train_data, df_train_target)
    y_pred = bnb.predict(test_data)
    print("Test Fold ",x+1, ": Number of mislabeled points out of a total {} points : {}".format(test_data.shape[0], (test_target != y_pred).sum()))
    sum = sum + (test_target != y_pred).sum()
    
print (sum/n)




Test Fold  1 : Number of mislabeled points out of a total 100 points : 36
Test Fold  2 : Number of mislabeled points out of a total 100 points : 35
Test Fold  3 : Number of mislabeled points out of a total 100 points : 31
Test Fold  4 : Number of mislabeled points out of a total 100 points : 34
Test Fold  5 : Number of mislabeled points out of a total 100 points : 35
Test Fold  6 : Number of mislabeled points out of a total 100 points : 27
Test Fold  7 : Number of mislabeled points out of a total 100 points : 29
Test Fold  8 : Number of mislabeled points out of a total 100 points : 42
Test Fold  9 : Number of mislabeled points out of a total 100 points : 24
Test Fold  10 : Number of mislabeled points out of a total 100 points : 12
30.5
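
The manual fold loop above can also be expressed with scikit-learn's built-in helpers. The following is a minimal sketch rather than the notebook's own method: it assumes `KFold` with `shuffle=False`, which should produce the same contiguous, in-order folds as `np.array_split`, and it uses synthetic data (`X`, `y`) as a stand-in for `data` and `target`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary features and labels standing in for `data` and `target`.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 2, size=(200, 5)), columns=list("abcde"))
y = pd.Series((X["a"] + X["b"] >= 1).astype(int))

# shuffle=False keeps the rows in file order, giving contiguous folds.
folds = KFold(n_splits=10, shuffle=False)
accuracy = cross_val_score(BernoulliNB(), X, y, cv=folds, scoring="accuracy")

# Each fold holds 200 / 10 = 20 rows here, so errors per fold follow directly.
fold_size = len(X) // folds.get_n_splits()
mislabeled = np.rint((1 - accuracy) * fold_size).astype(int)
print("Mislabeled per fold:", mislabeled)
print("Average mislabeled per fold:", mislabeled.mean())
```

Because the sentences are split in file order, the spread between folds (42 mislabeled in fold 8 versus 12 in fold 10 above) may partly reflect how the file is ordered; passing `shuffle=True` with a fixed `random_state` is one way to check that.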


### Original model appears to have some overfitting (25.1% mislabeled when tested with the same dataset, 30.5% mislabeled on average when tested using cross-validation)
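
The two printed results can be put on the same scale with a little arithmetic. The sketch below only reuses the counts printed above (251 of 1000 mislabeled on the training data, and the ten per-fold counts); the variable names are illustrative.

```python
# Counts copied from the cell outputs above.
train_mislabeled = 251
n_rows = 1000
fold_mislabeled = [36, 35, 31, 34, 35, 27, 29, 42, 24, 12]
fold_size = 100

# Error rate when the model is fit and tested on the same 1000 rows.
train_error = train_mislabeled / n_rows

# Error rate pooled over the ten held-out folds.
cv_error = sum(fold_mislabeled) / (fold_size * len(fold_mislabeled))

print("Training error:", round(train_error, 3), "accuracy:", round(1 - train_error, 3))
print("Cross-validated error:", round(cv_error, 3), "accuracy:", round(1 - cv_error, 3))
print("Gap (cv - train):", round(cv_error - train_error, 3))
```

The held-out error is higher than the training error, which is the pattern the heading describes as overfitting.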

In [None]:
# Criteria for removing features - remove features one at a time to test impact on model accuracy
sums = []
columns = []

# systematically drop each feature and measure accuracy using a 10 fold split cross-validation
for column in data:
    data_2 = data
    data_3 = data_2.drop([column], axis=1)
    n = 10
    data_3_split = np.array_split(data_3, n)
    target_split = np.array_split(target, n)

    #for each fold assign it as the test dataset and fit your model on the remaining data
    sum = 0
    # for each fold x, make its records the test data and concat the rest of the folds into train data
    for x in range(n):
        train_data = []
        train_target = []
   
        for y in range(n):
            if y == x :
                test_data = data_3_split[x]
                test_target = target_split[y]
            else:
                train_data.append(data_3_split[y])
                train_target.append(target_split[y])
                df_train_data = pd.concat(train_data)
                df_train_target = pd.concat(train_target)
        #fit the model with the train data        
        bnb = BernoulliNB()
        bnb.fit(df_train_data, df_train_target)
        #calculate predictions based on results
        y_pred = bnb.predict(test_data)
        #print("Test Fold ",x+1, ": Number of mislabeled points out of a total {} points : {}".format(test_data.shape[0], (test_target != y_pred).sum()))
        #
        sum = sum + (test_target != y_pred).sum()
        
        
    sum_average = sum/n
    sums.extend([sum_average])
    columns.extend([column])
    #print(column,":",sum_average)


In [None]:
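# collect the per-feature results, then sort so the features whose removal hurts most come first
column_sums = pd.DataFrame({'columns': columns, 'sum_average': sums})
# DataFrame.sort() was removed from pandas; sort_values() replaces it
columns_sums_sort = column_sums.sort_values(['sum_average'], ascending=[False])
print(columns_sums_sort)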
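
The drop-one-feature procedure above can be wrapped in a small function. This is a sketch under stated assumptions, not the notebook's own code: the helper name `drop_one_feature_errors` is hypothetical, it uses scikit-learn's `cross_val_predict` with unshuffled `KFold` in place of the manual fold loop, and it runs on toy data. The output column names `columns` and `sum_average` follow the ones used later in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.naive_bayes import BernoulliNB


def drop_one_feature_errors(features, labels, n_folds=10):
    """Average mislabeled points per fold after dropping each feature in turn."""
    folds = KFold(n_splits=n_folds, shuffle=False)
    rows = []
    for column in features.columns:
        reduced = features.drop(columns=[column])
        # Out-of-fold predictions: every row is predicted by a model that
        # did not see it during fitting.
        predicted = cross_val_predict(BernoulliNB(), reduced, labels, cv=folds)
        total_wrong = int((predicted != labels.to_numpy()).sum())
        rows.append({"columns": column, "sum_average": total_wrong / n_folds})
    result = pd.DataFrame(rows)
    # Features whose removal hurts most come first.
    return result.sort_values("sum_average", ascending=False).reset_index(drop=True)


# Toy data: the label simply copies the 'good' column; the rest is noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 2, size=(100, 3)),
                   columns=["good", "bad", "table"])
toy_labels = toy["good"].copy()

ablation = drop_one_feature_errors(toy, toy_labels)
print(ablation)
```

A feature that lands at the top of this table is one the model relies on, since removing it raises the average number of mislabeled points.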

### Adjusting the model:

Conclusions: The model performs worse on data that was not included in the training set. It was not performing well to begin with, and it is still overfit: on average 30.5% of points are mislabeled under cross-validation, versus 25.1% when the model is trained and tested on the whole dataset.

1. Including all of the features created from the text files resulted in test fold accuracy between 5% and 75%. This is a hugely variable result.
2. Removing features not represented in the message fields (i.e. keeping only features with feature.sum > 0) changed the accuracy in the test folds to between 58% and 88%. This is still not a very accurate model given that it should be 50% accurate by chance.
3. Removing features that show up fewer than 2 times did not seem to improve the accuracy.
4. Adding exclamation points as a feature gave test fold accuracy between 62% and 83%.



In [None]:
# First new classifier: Only including features that hurt the average when removed

keywords_df = columns_sums_sort[columns_sums_sort.sum_average > 30.5]
print(keywords_df.count())
keywords = keywords_df['columns']
data_class1 = pd.DataFrame()
for key in keywords:
    # spaces around the key to get the word,not just pattern matching.
    data_class1[str(key)] = sentiment_raw.message_cleaned.str.contains(' ' + str(key) + ' ', case=False).astype(int)

    
# Model and test on whole dataset
bnb = BernoulliNB()
# Fit our model to the data.
bnb.fit(data_class1, target)
# Classify, storing the result in a new variable.
y_pred = bnb.predict(data_class1)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(data_class1.shape[0],(target != y_pred).sum())) 
    
#create data folds and test
n= 10
data_class1_split = np.array_split(data_class1, n)
target_split = np.array_split(target, n)
sum=0
for x in range(n):
    train_data = []
    train_target = []
    for y in range(n):
        if y == x :
            test_data = data_class1_split[x]
            test_target = target_split[y]
        else:
            train_data.append(data_class1_split[y])
            train_target.append(target_split[y])
            df_train_data = pd.concat(train_data)
            df_train_target = pd.concat(train_target)
    #fit the model with the train data        
    bnb = BernoulliNB()
    bnb.fit(df_train_data, df_train_target)
    #calculate predictions based on results
    y_pred = bnb.predict(test_data)
    print("Test Fold ",x+1, ": Number of mislabeled points out of a total {} points : {}".format(test_data.shape[0], (test_target != y_pred).sum()))
    sum = sum + (test_target != y_pred).sum()
            
sum_average = sum/n
print("Average number wrong:",sum_average)

### Classifier 1 conclusions:
1. Slightly more accurate than original model
2. Does not appear to be overfit - the number of mislabeled points when the model is fit and tested on the same dataset is the same as the average across the cross-validation folds

### Classifier 2: Adjusted Classifier 1

In [None]:
# Second new classifier: Only including features that hurt the average when removed - stricter threshold (>= 31 instead of > 30.5)
keywords_df = columns_sums_sort[columns_sums_sort.sum_average >= 31]
print(keywords_df.count())
keywords = keywords_df['columns']
data_class2 = pd.DataFrame()
for key in keywords:
    # spaces around the key to get the word,not just pattern matching.
    data_class2[str(key)] = sentiment_raw.message_cleaned.str.contains(' ' + str(key) + ' ', case=False).astype(int)

# Model and test on whole dataset
bnb = BernoulliNB()
# Fit our model to the data.
bnb.fit(data_class2, target)
# Classify, storing the result in a new variable.
y_pred = bnb.predict(data_class2)    
# Display results.
print("Number of mislabeled points out of a total {} points : {}".format(data_class2.shape[0],(target != y_pred).sum()))    
n= 10

data_class2_split = np.array_split(data_class2, n)
target_split = np.array_split(target, n)
sum=0
for x in range(n):
    train_data = []
    train_target = []
    for y in range(n):
        if y == x :
            test_data = data_class2_split[x]
            test_target = target_split[y]
        else:
            train_data.append(data_class2_split[y])
            train_target.append(target_split[y])
            df_train_data = pd.concat(train_data)
            df_train_target = pd.concat(train_target)
    #fit the model with the train data        
    bnb = BernoulliNB()
    bnb.fit(df_train_data, df_train_target)
    #calculate predictions based on results
    y_pred = bnb.predict(test_data)
    print("Test Fold ",x+1, ": Number of mislabeled points out of a total {} points : {}".format(test_data.shape[0], (test_target != y_pred).sum()))
    sum = sum + (test_target != y_pred).sum()
            
sum_average = sum/n
print("Average number wrong:",sum_average)

### Classifier 2 conclusions:
1. Not very accurate
2. Does not appear to be overfit - the number of mislabeled points when the model is fit and tested on the same dataset is the same as the average across the cross-validation folds

In [None]:
#Classifier3 - add punctuation and negation features to classifier 1
# copy so that the new columns are not also added to data_class1
data_class3 = data_class1.copy()
data_class3['exclamation'] = sentiment_raw.message.str.contains('!')
data_class3['not'] = sentiment_raw.message_cleaned.str.contains('not')

# Model and test on whole dataset
bnb = BernoulliNB()
# Fit our model to the data.
bnb.fit(data_class3, target)
# Classify, storing the result in a new variable.
y_pred = bnb.predict(data_class3)    
# Display results.
print("Number of mislabeled points out of a total {} points : {}".format(data_class3.shape[0],(target != y_pred).sum()))

# set the number of folds before splitting, and split the classifier 3 features
n= 10
data_class3_split = np.array_split(data_class3, n)
target_split = np.array_split(target, n)

sum=0
for x in range(n):
    train_data = []
    train_target = []
    for y in range(n):
        if y == x :
            test_data = data_class3_split[x]
            test_target = target_split[y]
        else:
            train_data.append(data_class3_split[y])
            train_target.append(target_split[y])
            df_train_data = pd.concat(train_data)
            df_train_target = pd.concat(train_target)
    #fit the model with the train data        
    bnb = BernoulliNB()
    bnb.fit(df_train_data, df_train_target)
    #calculate predictions based on results
    y_pred = bnb.predict(test_data)
    print("Test Fold ",x+1, ": Number of mislabeled points out of a total {} points : {}".format(test_data.shape[0], (test_target != y_pred).sum()))
    sum = sum + (test_target != y_pred).sum()
            
sum_average = sum/n
print("Average number wrong:",sum_average)

### Classifier 3 conclusions:
1. Seems to be the most accurate
2. Does not appear to be overfit - the number of mislabeled points when the model is fit and tested on the same dataset is the same as the average across the cross-validation folds
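
One thing to keep in mind when deriving one feature set from another: plain assignment of a DataFrame binds a second name to the same object, so columns added through one name also appear under the other. The toy example below illustrates the difference between assignment and `.copy()`; the frame and column values are made up for illustration.

```python
import pandas as pd

base = pd.DataFrame({"good": [1, 0, 1]})

# Plain assignment: both names point at the same DataFrame.
alias = base
alias["exclamation"] = [0, 1, 0]
print(list(base.columns))  # 'exclamation' was added to base as well

# .copy(): new columns stay out of the original.
independent = base.copy()
independent["not"] = [0, 0, 1]
print(list(base.columns))  # 'not' is absent from base
```

Using `.copy()` keeps each classifier's feature set separate, so re-running an earlier cell gives the same result as the first time.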

In [None]:
#Classifier4 - remove all but a few features
data_class4 = pd.DataFrame()

data_class4['worst'] = sentiment_raw.message_cleaned.str.contains('worst')
data_class4['not'] = sentiment_raw.message_cleaned.str.contains('not')


# Model and test on whole dataset
bnb = BernoulliNB()
# Fit our model to the data.
bnb.fit(data_class4, target)
# Classify, storing the result in a new variable.
y_pred = bnb.predict(data_class4)    
# Display results.
print("Number of mislabeled points out of a total {} points : {}".format(data_class4.shape[0],(target != y_pred).sum()))

n= 10

data_class4_split = np.array_split(data_class4, n)
target_split = np.array_split(target, n)
sum=0
for x in range(n):
    train_data = []
    train_target = []
    for y in range(n):
        if y == x :
            test_data = data_class4_split[x]
            test_target = target_split[y]
        else:
            train_data.append(data_class4_split[y])
            train_target.append(target_split[y])
            df_train_data = pd.concat(train_data)
            df_train_target = pd.concat(train_target)
    #fit the model with the train data        
    bnb = BernoulliNB()
    bnb.fit(df_train_data, df_train_target)
    #calculate predictions based on results
    y_pred = bnb.predict(test_data)
    print("Test Fold ",x+1, ": Number of mislabeled points out of a total {} points : {}".format(test_data.shape[0], (test_target != y_pred).sum()))
    sum = sum + (test_target != y_pred).sum()
            
sum_average = sum/n
print("Average number wrong:",sum_average)



### Classifier 4 conclusions:
1. Not very accurate
2. Does not appear to be overfit - the number of mislabeled points when the model is fit and tested on the same dataset is the same as the average across the cross-validation folds

In [None]:
#classifier 5: try removing features whose absence seemed to improve the model in analysis above
# copy so that the new columns are not also added to data
data_class5 = data.copy()
#data_class5.drop(['waste','lacked','sick','poor','work','mediocre','liked','slow','gold','like','right','recommend',\
#                  'authentic','convenient','pretty','better','worst'],axis=1,inplace=True)
data_class5['exclamation'] = sentiment_raw.message.str.contains('!')
data_class5['not'] = sentiment_raw.message_cleaned.str.contains('not')

# Model and test on whole dataset
bnb = BernoulliNB()
# Fit our model to the data.
bnb.fit(data_class5, target)
# Classify, storing the result in a new variable.
y_pred = bnb.predict(data_class5)    
# Display results.
print("Number of mislabeled points out of a total {} points : {}".format(data_class5.shape[0],(target != y_pred).sum()))


n= 10

data_class5_split = np.array_split(data_class5, n)
target_split = np.array_split(target, n)
sum=0
for x in range(n):
    train_data = []
    train_target = []
    for y in range(n):
        if y == x :
            test_data = data_class5_split[x]
            test_target = target_split[y]
        else:
            train_data.append(data_class5_split[y])
            train_target.append(target_split[y])
            df_train_data = pd.concat(train_data)
            df_train_target = pd.concat(train_target)
    #fit the model with the train data        
    bnb = BernoulliNB()
    bnb.fit(df_train_data, df_train_target)
    #calculate predictions based on results
    y_pred = bnb.predict(test_data)
    print("Test Fold ",x+1, ": Number of mislabeled points out of a total {} points : {}".format(test_data.shape[0], (test_target != y_pred).sum()))
    sum = sum + (test_target != y_pred).sum()
            
sum_average = sum/n
print("Average number wrong:",sum_average)




### Classifier 5 conclusions:
1. Appears to be overfit - the number of mislabeled points when the same dataset is used to both fit and test the model is much better than the average when folds are used

### Overall Conclusions:
1. Classifier 3 seems to be the best model, with the highest average accuracy (73.2%) when cross-validated.
2. Based on the differences between the average accuracy in the cross-validated tests and the accuracy when testing against the training dataset, both the original model and the 5th adjusted model appear to be overfit.
3. The features that have the greatest impact are almost all words from the positive text dataset, along with the presence of exclamation points and the word "not."
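
The overfitting check used throughout (training error versus average held-out error) can be collected into one helper so that every classifier is judged the same way. This is a minimal sketch: the function name `train_vs_cv_error` is hypothetical, it relies on scikit-learn's `cross_val_predict` with unshuffled folds instead of the manual loop, and the demonstration data is pure noise rather than the Yelp sentences.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.naive_bayes import BernoulliNB


def train_vs_cv_error(features, labels, n_folds=10):
    """Return (training error rate, cross-validated error rate)."""
    actual = labels.to_numpy()

    # Error when the model is fit and tested on the same rows.
    model = BernoulliNB().fit(features, labels)
    train_error = float((model.predict(features) != actual).mean())

    # Error on rows the model did not see while fitting.
    folds = KFold(n_splits=n_folds, shuffle=False)
    held_out = cross_val_predict(BernoulliNB(), features, labels, cv=folds)
    cv_error = float((held_out != actual).mean())
    return train_error, cv_error


# Noise-only toy data: many features, labels unrelated to any of them.
rng = np.random.default_rng(0)
noise = pd.DataFrame(rng.integers(0, 2, size=(100, 30)),
                     columns=[f"w{i}" for i in range(30)])
noise_labels = pd.Series(rng.integers(0, 2, size=100))

train_error, cv_error = train_vs_cv_error(noise, noise_labels)
print("Training error:", train_error)
print("Cross-validated error:", cv_error)
print("Gap:", cv_error - train_error)
```

A gap close to zero matches the "does not appear to be overfit" conclusions above, while a clearly positive gap matches the pattern seen for the original model and Classifier 5.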