# Experimenting with Naive Bayes for Sentiment Analysis

In this project, I use Naive Bayes to see how it performs on sentiment and activity prediction.

- On the one hand, I will build a Naive Bayes classifier to decide whether I **will or won't go hiking** this weekend, based on the weather over the last two months (8 weekends).

- On the other hand, I will use a Naive Bayes model to predict whether people **liked or didn't like** a restaurant, based on the words and expressions in the review data.

## 1. Am I going for a hike this weekend? (Or just staying home reading some comics?)

Let's say the weather and my hiking decision were as follows:

|Outlook|Temperature|Humidity|Windy|GO HIKING|
|-------|-----------|--------|-----|---------|
|Rainy|Hot|High|False|No|
|Rainy|Mild|High|True|No|
|Cloudy|Cool|Normal|False|Yes|
|Sunny|Hot|Normal|False|Yes|
|Sunny|Hot|High|False|No|
|Cloudy|Mild|Normal|True|Yes|
|Sunny|Cool|High|True|Yes|
|Rainy|Cool|Normal|False|Yes|

The weather forecast for this weekend shows that:
- Outlook: Cloudy
- Temperature: Hot
- Humidity: Normal
- Windy: False.

Following the [Naive Bayes formula](https://en.wikipedia.org/wiki/Naive_Bayes_classifier), I have 4 features, taken from this weekend's forecast: x1 = "cloudy", x2 = "hot", x3 = "normal", x4 = "false".

I'm going to compare **the probability of going hiking given this weekend's weather, P(yes|x1,x2,x3,x4)**, with **the probability of NOT going hiking given this weekend's weather, P(no|x1,x2,x3,x4)**.

THE NAIVE BAYES FORMULAE:

P(yes|x1,x2,x3,x4) = P(yes) * P(x1|yes) * P(x2|yes) * P(x3|yes) * P(x4|yes)  /  P(x1,x2,x3,x4)   

P(no|x1,x2,x3,x4) = P(no) * P(x1|no) * P(x2|no) * P(x3|no) * P(x4|no)  /  P(x1,x2,x3,x4)     

If P(yes|x1,x2,x3,x4) > P(no|x1,x2,x3,x4), I'm going for a hike; otherwise I'm staying home.
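
Both formulas divide by the same quantity P(x1,x2,x3,x4), so the comparison depends only on the two numerators. The block below is the standard way of writing this decision rule, restated here for reference rather than taken from the notebook's code:

```latex
\hat{y}
  = \arg\max_{y \in \{\text{yes},\,\text{no}\}} P(y \mid x_1, x_2, x_3, x_4)
  = \arg\max_{y \in \{\text{yes},\,\text{no}\}} P(y) \prod_{i=1}^{4} P(x_i \mid y)

% Under the Naive Bayes model the shared denominator is the sum of the numerators:
P(x_1, x_2, x_3, x_4) = \sum_{y \in \{\text{yes},\,\text{no}\}} P(y) \prod_{i=1}^{4} P(x_i \mid y)
```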

In [1]:
# list of weather conditions and my decisions over the last 8 weekends
weekends = [["rainy", "hot", "high", "false", "NO"],
            ["rainy", "mild", "high", "true", "NO"],
            ["cloudy", "cool", "normal", "false", "YES"],
            ["sunny", "hot", "normal", "false", "YES"],
            ["sunny", "hot", "high", "false", "NO"],
            ["cloudy", "mild", "normal", "true", "YES"],
            ["sunny", "cool", "high", "true", "YES"],
            ["rainy", "cool", "normal", "false", "YES"]]

# the weather condition this weekend
this_wk = ["cloudy", "hot", "normal", "false"]

**Building the functions to calculate the Naive Bayes formulae**

Let **y** stand for either decision, "yes" or "no".

The NB formula then generalizes to: P(y|x1,x2,x3,x4) = P(y) * P(x1|y) * P(x2|y) * P(x3|y) * P(x4|y) / P(x1,x2,x3,x4)

In [2]:
# the fraction of the last 8 weekends on which decision y was made: P(y)
def y_proba(y, weekends):
    return len([w for w in weekends if w[4] == y]) / len(weekends)

# NOTE: the four functions below divide by len(weekends), so each one returns the
# joint frequency of (feature value, decision); the textbook P(x|y) would divide
# by the number of weekends with decision y instead.

# how often an outlook occurs together with one decision (used as P(x1|y))
def outlook_proba_given_y(x1, y, weekends):
    return len([w for w in weekends if w[4] == y and w[0] == x1]) / len(weekends)

# how often a temperature occurs together with one decision (used as P(x2|y))
def temp_proba_given_y(x2, y, weekends):
    return len([w for w in weekends if w[4] == y and w[1] == x2]) / len(weekends)

# how often a humidity occurs together with one decision (used as P(x3|y))
def humi_proba_given_y(x3, y, weekends):
    return len([w for w in weekends if w[4] == y and w[2] == x3]) / len(weekends)

# how often a windy value occurs together with one decision (used as P(x4|y))
def windy_proba_given_y(x4, y, weekends):
    return len([w for w in weekends if w[4] == y and w[3] == x4]) / len(weekends)

# the denominator = the fraction of the last 8 weekends whose weather matches this weekend's exactly: P(x1,x2,x3,x4)
denominator = len([w for w in weekends if 
                   w[0] == this_wk[0] 
                   and w[1] == this_wk[1] 
                   and w[2] == this_wk[2] 
                   and w[3] == this_wk[3]]) / len(weekends) 

This weekend's combination of the 4 weather features never occurred over the last 2 months, so the denominator is 0, hence **the "division by 0" problem**.

To resolve this problem, I will use a **Laplace smoothing** method:
- adding 1 to the numerator
- adding to the denominator the number of feature values, over the weekends in `weekends` with a given decision, that match this weekend's forecast.
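
For comparison, the textbook form of add-one (Laplace) smoothing is applied to each conditional probability separately: P(xi|y) = (count(xi, y) + 1) / (count(y) + k), where k is the number of distinct values feature i can take. The sketch below is a minimal illustration of that textbook form, not the adjustment implemented in this notebook. The helper `smoothed_score` and its `alpha` parameter are hypothetical names, and leaving the prior P(y) unsmoothed is an assumption.

```python
# Data copied from the table in this notebook: four features, then the decision.
weekends = [["rainy", "hot", "high", "false", "NO"],
            ["rainy", "mild", "high", "true", "NO"],
            ["cloudy", "cool", "normal", "false", "YES"],
            ["sunny", "hot", "normal", "false", "YES"],
            ["sunny", "hot", "high", "false", "NO"],
            ["cloudy", "mild", "normal", "true", "YES"],
            ["sunny", "cool", "high", "true", "YES"],
            ["rainy", "cool", "normal", "false", "YES"]]
this_wk = ["cloudy", "hot", "normal", "false"]


def smoothed_score(y, rows, query, alpha=1):
    """Return P(y) * prod_i P(x_i | y), with add-alpha smoothing on each P(x_i | y)."""
    rows_y = [r for r in rows if r[-1] == y]
    score = len(rows_y) / len(rows)  # prior P(y), left unsmoothed (assumption)
    for i, value in enumerate(query):
        n_values = len({r[i] for r in rows})  # distinct values of feature i in the data
        matches = sum(1 for r in rows_y if r[i] == value)
        score *= (matches + alpha) / (len(rows_y) + alpha * n_values)
    return score


# Unnormalized scores: the shared denominator P(x1,x2,x3,x4) is never needed.
scores = {y: smoothed_score(y, weekends, this_wk) for y in ("YES", "NO")}
print(scores)
```

Because every factor is smoothed on its own, no factor can be zero and the shared denominator never has to be computed. The scores this sketch prints can differ from the values obtained with the adjustment used in this notebook.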

In [3]:
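def count(y):
    # number of feature values, over all weekends with decision y,
    # that match this weekend's forecast
    matches = 0
    for w in weekends:
        if w[4] == y:
            for i in range(4):
                if w[i] == this_wk[i]:
                    matches += 1
    return matches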

# ULTIMATELY, THE COMPLETE FORMULA IS:
# decision_prob = ( y_proba(y,weekends) * outlook_proba_given_y(x1,y,weekends) 
#                  * temp_proba_given_y(x2,y,weekends)
#                  * humi_proba_given_y(x3,y,weekends) 
#                  * windy_proba_given_y(x4,y,weekends) + 1 ) / (denominator + count(y))

The probability that I say "YES" to a hike given this weekend's weather: **P(yes|x1,x2,x3,x4)**

In [4]:
yes_prob = ( y_proba("YES",weekends) 
            * outlook_proba_given_y("cloudy","YES",weekends) 
            * temp_proba_given_y("hot","YES",weekends)
            * humi_proba_given_y("normal","YES",weekends) 
            * windy_proba_given_y("false","YES",weekends) + 1 ) / (denominator + count("YES"))

yes_prob

0.1003662109375

The probability that I say "NO" to a hike given this weekend's weather: **P(no|x1,x2,x3,x4)**

In [5]:
no_prob = ( y_proba("NO",weekends) 
            * outlook_proba_given_y("cloudy","NO",weekends) 
            * temp_proba_given_y("hot","NO",weekends)
            * humi_proba_given_y("normal","NO",weekends) 
            * windy_proba_given_y("false","NO",weekends) + 1 ) / (denominator + count("NO"))

no_prob

0.25

**Comparing the two probabilities and making the decision**

In [6]:
decision = "YES"
if yes_prob < no_prob:
    decision = "NO"
print("Am I going for a hike this weekend?  %s" % decision)

Am I going for a hike this weekend?  NO


Okay, let's stay home and read some Marvel comics then ^^

## 2. Restaurant reviews: did you like it or not?



In this section, I'm working with the `Restaurant_Reviews.tsv` file, which contains 1000 reviews of a restaurant. The dataset can be downloaded [here on Kaggle](https://www.kaggle.com/hj5992/restaurantreviews#Restaurant_Reviews.tsv).

Binary classification: A review is classified as 1 if people **liked** the restaurant, and 0 if people **didn't like** it.

I'm going to split the data set into 900 reviews for training and 100 reviews for testing.

Next, I use the Naive Bayes model to train and make predictions. The split is made with `train_test_split`, which shuffles the reviews before separating them into the training and test sets.

**Reading the data**

In [7]:
import pandas as pd

data = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t')
print(len(data))
data.head()

1000


| |Review|Liked|
|-|------|-----|
|0|Wow... Loved this place.|1|
|1|Crust is not good.|0|
|2|Not tasty and the texture was just nasty.|0|
|3|Stopped by during the late May bank holiday of...|1|
|4|The selection on the menu was great and so wer...|1|


**Preprocessing the data**

Since I'm only interested in **words**, all stopwords, digits and special characters need to be removed. Each review is also normalized by stemming.

In [8]:
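import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# nltk.download('stopwords')  # needed once if the stopword list is not installed yet

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per word

corpus = []
for text in data['Review']:
    # keep letters only, lowercase, and split into words
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    # drop stopwords, stem what remains, and join back into one string
    corpus.append(' '.join(ps.stem(word) for word in words if word not in stop_words))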

# print the first 10 reviews to check
corpus[:10]

['wow love place',
 'crust good',
 'tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price',
 'get angri want damn pho',
 'honeslti tast fresh',
 'potato like rubber could tell made ahead time kept warmer',
 'fri great',
 'great touch']
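
One thing this output shows is that "Crust is not good." was reduced to 'crust good': the stopword list contains "not", so the negation is removed along with the other stopwords. Below is a minimal sketch of keeping such words. It uses a tiny hypothetical stopword set instead of NLTK's list and skips stemming so that it runs without NLTK; `clean`, `STOP_WORDS` and `KEEP` are illustrative names, not part of the notebook.

```python
import re

# Hypothetical miniature stopword list; the notebook uses NLTK's full English list.
STOP_WORDS = {"is", "the", "this", "and", "was", "not", "just"}
# Assumption: negations worth keeping for sentiment; extend as needed.
KEEP = {"not"}


def clean(text, stop_words):
    """Keep letters only, lowercase, split, and drop stopwords (no stemming here)."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(w for w in words if w not in stop_words)


review = "Crust is not good."
print(clean(review, STOP_WORDS))         # negation removed
print(clean(review, STOP_WORDS - KEEP))  # negation kept
```

With NLTK, the same idea would be `set(stopwords.words('english')) - KEEP`. Whether this changes the scores reported in this notebook has not been checked.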

**Building and evaluating the model**
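
The model in this section relies on `CountVectorizer`, which turns each cleaned review into a vector of word counts; with `max_df=.05`, any word that occurs in more than 5% of the reviews is dropped from the vocabulary. Here is a rough pure-Python sketch of that idea. The helper `bag_of_words` is hypothetical and ignores details of scikit-learn's tokenization, such as its handling of single-character tokens.

```python
from collections import Counter


def bag_of_words(docs, max_df=1.0):
    """Return (vocabulary, count matrix), dropping words found in more than max_df of the docs."""
    tokenized = [doc.split() for doc in docs]
    # document frequency: in how many documents each word appears
    doc_freq = Counter(word for words in tokenized for word in set(words))
    limit = max_df * len(docs)
    vocabulary = sorted(w for w, df in doc_freq.items() if df <= limit)
    # one row per document, one column per vocabulary word
    matrix = [[Counter(words)[w] for w in vocabulary] for words in tokenized]
    return vocabulary, matrix


# Four cleaned reviews taken from the corpus printed in this notebook.
docs = ["wow love place", "crust good", "fri great", "great touch"]
vocab, X_small = bag_of_words(docs, max_df=0.25)
print(vocab)    # "great" is in 2 of 4 documents (50% > 25%), so it is dropped
print(X_small)
```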

In [9]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_curve, auc, accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np

np.random.seed(3)

# generate counts from text using a vectorizer
vectorizer = CountVectorizer(max_df=.05)
X = vectorizer.fit_transform(corpus).toarray()
y = data['Liked']

# splitting the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=40)

# instantiate the Naive Bayes model, train and make predictions
nb = MultinomialNB(alpha=1.0)
nb.fit(X_train, y_train)

predictions = nb.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
print("accuracy score: ", accuracy)

fpr, tpr, thresholds = roc_curve(y_test, predictions, pos_label=1)
print("multinomial NB area under curve: {}".format(auc(fpr,tpr)))

accuracy score:  0.77
multinomial NB area under curve: 0.7791666666666667


The accuracy on the test set is 77%, and the area under the ROC curve is roughly 0.78.
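
Note that `roc_curve` receives the hard 0/1 predictions here. In that case the ROC curve has a single operating point, and the area under it equals the average of the true-positive and true-negative rates. Ranking the test reviews by a predicted probability (for example from `nb.predict_proba`, which this notebook does not use) generally gives a different value. The sketch below illustrates the difference on made-up labels and scores; `auc_from_scores` is a hypothetical helper based on the rank interpretation of AUC.

```python
def auc_from_scores(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 3 positive and 3 negative reviews with predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
proba = [0.9, 0.6, 0.4, 0.55, 0.3, 0.1]
labels = [int(p >= 0.5) for p in proba]  # hard predictions at a 0.5 threshold

print(auc_from_scores(y_true, proba))   # uses the full ranking
print(auc_from_scores(y_true, labels))  # uses only the 0/1 predictions
```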