## Build a Naive Bayes Classifier to Perform Sentiment Analysis

Dataset: UCI sentiment labeled sentences
 https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
 
Positive/Negative word list: University of Pittsburgh Subjectivity Lexicon
 http://mpqa.cs.pitt.edu/

Instructions: 
1. Pick one of the company data files and build your own classifier. 
2. When you're satisfied with its performance (at this point just using the accuracy measure shown in the example), test it on one of the other datasets to see how well these kinds of classifiers translate from one context to another.
3. Include your model and a brief writeup of your feature engineering and selection process to submit and review with your mentor.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import sys
import config

# The keyword features are binary (present/absent), so I'll use the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

In [2]:
# load text file of negative and positive words from http://mpqa.cs.pitt.edu/

df_positive_words = pd.read_csv('positive-words.txt', header = None)
df_positive_words.columns=['pos_words']
df_negative_words = pd.read_csv('negative-words2.txt', header = None, encoding = "ISO-8859-1")
df_negative_words.columns=['neg_words']

df_positive_words.head()


Unnamed: 0,pos_words
0,a+
1,abound
2,abounds
3,abundance
4,abundant


In [3]:
# load sentiment data and label columns
sentiment_raw = pd.read_csv(filepath_or_buffer='yelp_labelled.txt', delimiter= '\t', header=None)
# name new columns
sentiment_raw.columns = ['message', 'sentiment']
sentiment_raw.head(n=10)

Unnamed: 0,message,sentiment
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
5,Now I am getting angry and I want my damn pho.,0
6,Honeslty it didn't taste THAT fresh.),0
7,The potatoes were like rubber and you could te...,0
8,The fries were great too.,1
9,A great touch.,1


In [23]:

# Word lists loaded from the positive and negative text files.
keywords_positive = df_positive_words['pos_words']
keywords_negative = df_negative_words['neg_words']

# Create one binary feature per keyword: 1 if the keyword appears in the message.
data = pd.DataFrame()
for key in keywords_positive:
    # Pad the key with spaces to match the whole word rather than a substring.
    # Limitations: this misses a word at the start or end of a message or next to
    # punctuation, and str.contains interprets the key as a regular expression.
    data[str(key)] = sentiment_raw.message.str.contains(' ' + str(key) + ' ', case=False).astype(int)

for key in keywords_negative:
    # Same space-padded match for the negative keywords.
    data[str(key)] = sentiment_raw.message.str.contains(' ' + str(key) + ' ', case=False).astype(int)
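
The space-padded match skips a keyword that sits at the start or end of a message or next to punctuation (for example "good" in "Crust is not good."), and `str.contains` reads a keyword such as `a+` as a regular expression rather than literal text. A minimal sketch of one alternative, using only the standard library: escape each keyword and require that it is not attached to another word character. The function name `keyword_flags` and the sample messages are hypothetical and not part of the notebook.

```python
import re

def keyword_flags(messages, keywords):
    """Return {keyword: [0/1 per message]} using whole-word, literal matching."""
    flags = {}
    for key in keywords:
        # re.escape makes characters such as '+' literal; the lookarounds
        # require that the keyword is not glued to another word character.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(str(key)) + r"(?!\w)", re.IGNORECASE
        )
        flags[str(key)] = [int(bool(pattern.search(m))) for m in messages]
    return flags

# Hypothetical sample messages and keywords for illustration only.
messages = ["Crust is not good.", "Wow... Loved this place.", "Goodness, an A+ meal"]
flags = keyword_flags(messages, ["good", "loved", "a+"])
print(flags)
```

The resulting dictionary can be passed to `pd.DataFrame(flags)` to build all feature columns in one step instead of adding them one at a time. Switching to this matching would change the feature values, so the mislabeled counts reported in this notebook would need to be recomputed.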
             

In [24]:

# Naive Bayes assumes the features are independent of one another given the class;
# a pairwise correlation matrix and heatmap give a rough check of that assumption.
sns.heatmap(data.corr())
plt.show()
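
With one column per lexicon word, the heatmap may be hard to read, and any keyword column that is all zeros has no variance, so its correlations come out as NaN. A minimal sketch of narrowing the check to keywords that actually occur; the toy frame and the `min_count` threshold are hypothetical stand-ins for `data` and a cutoff you would choose yourself.

```python
import pandas as pd

# Hypothetical toy feature matrix standing in for `data`.
toy = pd.DataFrame({
    "great": [1, 0, 1, 0, 1, 0],
    "good": [1, 0, 1, 0, 0, 0],
    "nasty": [0, 1, 0, 1, 0, 0],
    "abound": [0, 0, 0, 0, 0, 0],  # never occurs, so its correlation is undefined
})

# Keep only keywords that occur in at least `min_count` messages.
min_count = 1
active = toy.loc[:, toy.sum() >= min_count]

# Pairwise correlations among the remaining keyword features.
corr = active.corr()
print(corr.round(2))
```

The filtered matrix `corr` can then be passed to `sns.heatmap` in place of `data.corr()`.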


In [25]:
# Assign the target (outcome variable); the keyword features in `data` are the inputs.
target = sentiment_raw['sentiment']

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

# Classify the same messages the model was fit on, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 265
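
The count above is measured on the same 1000 messages the model was fit on, so it may overstate how well the classifier handles unseen reviews. A minimal sketch of a holdout check with scikit-learn's `train_test_split`, using randomly generated stand-in features and labels (hypothetical, not the Yelp data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 5))  # hypothetical binary keyword features
y = X[:, 0]                           # hypothetical labels tied to the first feature

# Hold out 20% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = BernoulliNB().fit(X_train, y_train)

# Accuracy on the held-out rows only.
holdout_accuracy = (model.predict(X_test) == y_test).mean()
print("Holdout accuracy: {:.3f}".format(holdout_accuracy))
```

In the notebook, `data` and `target` would take the place of `X` and `y`.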


In [10]:
# Load a second dataset (Amazon) to see how the approach carries over.
sentiment_raw_amazon = pd.read_csv(filepath_or_buffer='amazon_cells_labelled.txt', delimiter= '\t', header=None)
# name new columns
sentiment_raw_amazon.columns = ['message', 'sentiment']
sentiment_raw_amazon.head(n=10)

Unnamed: 0,message,sentiment
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1
5,I have to jiggle the plug to get it to line up...,0
6,If you have several dozen or several hundred c...,0
7,If you are Razr owner...you must have this!,1
8,"Needless to say, I wasted my money.",0
9,What a waste of money and time!.,0


In [11]:
# Create one binary feature per keyword for the Amazon messages.
data = pd.DataFrame()
for key in keywords_positive:
    # Pad the key with spaces to match the whole word rather than a substring.
    data[str(key)] = sentiment_raw_amazon.message.str.contains(' ' + str(key) + ' ', case=False).astype(int)

for key in keywords_negative:
    # Same space-padded match for the negative keywords.
    data[str(key)] = sentiment_raw_amazon.message.str.contains(' ' + str(key) + ' ', case=False).astype(int)
             

In [12]:
# Assign the target (outcome variable) for the Amazon data.
target = sentiment_raw_amazon['sentiment']

# Instantiate a new model. This replaces the model fit on the Yelp data,
# so the result below is for a classifier refit on the Amazon data.
bnb = BernoulliNB()

# Fit the new model to the Amazon features.
bnb.fit(data, target)

# Classify the same Amazon messages the model was fit on.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 259
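
Because the cell above creates and fits a new `BernoulliNB` on the Amazon features, the count it prints describes a model trained on Amazon rather than the Yelp-trained model. A minimal sketch of scoring a second dataset with an already fitted model, assuming both feature frames are built from the same keyword list in the same order. The keyword list, the toy frames, and the helper `build_features` are hypothetical, and plain substring matching is used only to keep the sketch short.

```python
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

keywords = ["great", "good", "bad", "waste"]  # hypothetical short keyword list

def build_features(messages, keywords):
    # Same columns, in the same order, for every dataset.
    return pd.DataFrame({
        key: messages.str.contains(key, case=False, regex=False).astype(int)
        for key in keywords
    })

# Hypothetical stand-ins for the Yelp and Amazon frames.
source = pd.DataFrame({
    "message": ["Great food", "Good service", "Bad crust", "A waste of time"],
    "sentiment": [1, 1, 0, 0],
})
other = pd.DataFrame({
    "message": ["The mic is great", "What a waste of money"],
    "sentiment": [1, 0],
})

# Fit once, on the source dataset only.
model = BernoulliNB().fit(
    build_features(source["message"], keywords), source["sentiment"]
)

# No second fit: the model trained on `source` scores the other dataset.
pred = model.predict(build_features(other["message"], keywords))
transfer_accuracy = (pred == other["sentiment"]).mean()
print("Transfer accuracy: {:.3f}".format(transfer_accuracy))
```

In the notebook this would mean keeping the `bnb` fit on the Yelp features and calling `bnb.predict` on the Amazon features without refitting.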


### Notes on feature selection
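1. Created one binary feature per positive and negative keyword (word list from the University of Pittsburgh), marking whether the word appears in a review. Positive words seem to have a bigger impact.
2. The model reaches only about 74% accuracy on each dataset: 265 of 1000 Yelp reviews and 259 of 1000 Amazon reviews are mislabeled. The two results are consistent, but each is measured on the data the model was fit on, and the Amazon run refits the model rather than reusing the Yelp-trained one.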
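
The mislabeled counts printed by the notebook convert to accuracy directly. A small worked sketch; the helper name `accuracy` is hypothetical.

```python
def accuracy(mislabeled, total):
    """Fraction of points labeled correctly."""
    return 1 - mislabeled / total

# Counts taken from the two printed results in this notebook.
print("Yelp:   {:.1%}".format(accuracy(265, 1000)))
print("Amazon: {:.1%}".format(accuracy(259, 1000)))
```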