In this notebook I will use Naive Bayes for sentiment analysis of tweets.

By Bayes' rule, we want to compute P(class | sentence) = P(sentence | class) * P(class) / P(sentence)

Naive Bayes assumes the words are conditionally independent given the class, so P(sentence | class) = P(word1 | class) * P(word2 | class) * ...
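
As a quick illustration of the two formulas above, the sketch below multiplies per-word probabilities and normalises over both classes. All numbers and names (`word_probs`, `class_prior`, `joint`, `posterior`) are made up for the example and do not come from the tweet data.

```python
# Toy illustration of Bayes' rule under the naive independence assumption.
# All probabilities below are invented for the example.
word_probs = {
    "pos": {"happy": 0.20, "day": 0.10},
    "neg": {"happy": 0.05, "day": 0.10},
}
class_prior = {"pos": 0.5, "neg": 0.5}


def joint(sentence_words, cls):
    """Return P(sentence | cls) * P(cls), treating words as independent."""
    p = class_prior[cls]
    for w in sentence_words:
        p *= word_probs[cls][w]
    return p


def posterior(sentence_words):
    """Normalise the joint scores; dividing by their sum divides by P(sentence)."""
    scores = {cls: joint(sentence_words, cls) for cls in class_prior}
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}


post = posterior(["happy", "day"])
print(post)
```

With these toy numbers the positive class gets a posterior of about 0.8, because "happy" is four times more likely under `pos` while "day" is equally likely under both.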

The notebook goes through five steps:

- get the data
- preprocess the data
- compute freq(word, class)
- compute P(word | neg) and P(word | pos)
- compute the log of Bayes' rule

In [2]:
from utils import process_tweets, build_freqs
from nltk.corpus import stopwords, twitter_samples
import numpy as np
import pandas as pd
import nltk
import string
from nltk.tokenize import TweetTokenizer


In [None]:
# in the sentiment analysis notebook I go through this step by step

In [3]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/shimaimani/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [4]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

Divide the data into train and test sets: the first 4000 tweets of each class for training, the rest for testing.

In [6]:
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]


In [16]:
train_x = train_pos + train_neg
test_x = test_pos + test_neg

train_y = np.append(np.ones((len(train_pos), 1)) , np.zeros((len(train_neg), 1)))
test_y = np.append(np.ones((len(test_pos), 1)) , np.zeros((len(test_neg), 1)))

In [18]:

freqs = build_freqs(train_x, train_y)

In [25]:
freqs[('wrong', 1.0)], freqs[('wrong', 0.0)]

(9, 27)

In [27]:
result = {}
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]
build_freqs(tweets, ys)

defaultdict(int,
            {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2})
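
`build_freqs` is imported from `utils` and its source is not shown in this notebook. Judging from the output above, it counts how often each (processed word, label) pair occurs. A minimal sketch of that idea follows. It assumes a hypothetical `simple_tokens` helper (lowercasing, whitespace split, a two-word stop list) in place of the real `process_tweets`, so the keys are not stemmed; `toy_build_freqs` is likewise an illustrative name, not the `utils` function.

```python
from collections import defaultdict

# Hypothetical stand-in for process_tweets: no stemming, tiny stop list.
STOP_WORDS = {"i", "am"}


def simple_tokens(tweet):
    """Lowercase, split on whitespace, and drop the stop words."""
    return [w for w in tweet.lower().split() if w not in STOP_WORDS]


def toy_build_freqs(tweets, labels):
    """Count occurrences of each (word, label) pair."""
    freqs = defaultdict(int)
    for tweet, label in zip(tweets, labels):
        for word in simple_tokens(tweet):
            freqs[(word, label)] += 1
    return freqs


toy_freqs = toy_build_freqs(
    ["i am happy", "i am tricked", "i am sad", "i am tired", "i am tired"],
    [1, 0, 0, 0, 0],
)
print(dict(toy_freqs))
```

The counts match the cell above; only the keys differ ("tired" instead of the stemmed "tire") because this sketch skips stemming.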

Now we apply Bayes' rule:

P(class | sentence) = P(sentence | class) * P(class) / P(sentence)

Instead of computing each posterior on its own, we compute the ratio P(pos | sentence) / P(neg | sentence), in which P(sentence) cancels:

P(pos | sentence) / P(neg | sentence) = P(sentence | pos) * P(pos) / (P(sentence | neg) * P(neg))

Since the training set has 4000 tweets per class, P(pos) = P(neg), and the ratio reduces to

P(pos | sentence) / P(neg | sentence) = P(sentence | pos) / P(sentence | neg)
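
In practice this ratio is computed in log space, as a sum of per-word log ratios, which is the "log of Bayes' rule" step from the list at the top. One reason is numerical: a product of many small probabilities underflows to zero. The sketch below shows this with invented likelihoods for a 100-word input; none of the numbers come from the tweet data.

```python
import math

# Made-up per-word likelihoods for a 100-word "sentence" (illustrative only).
p_word_pos = [1e-5] * 100
p_word_neg = [2e-5] * 100

# The raw products underflow to 0.0 in double precision,
# so their ratio cannot be computed directly.
prod_pos = math.prod(p_word_pos)
prod_neg = math.prod(p_word_neg)

# The sum of per-word log ratios stays well-behaved.
log_ratio = sum(math.log(p / n) for p, n in zip(p_word_pos, p_word_neg))

print(prod_pos, prod_neg, log_ratio)
```

Both products come out as 0.0, while the log ratio is 100 * log(0.5), a finite negative number that still says "negative is more likely".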

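Without smoothing, P(word | pos) = freq(word, pos) / pos_volume, where pos_volume is the total count of words in the positive tweets.

To avoid zero probabilities for words that never occur in one class, we apply Laplace smoothing: P(word | pos) = (freq(word, pos) + 1) / (pos_volume + V),
where V is the number of unique words in the vocabulary. P(word | neg) is computed the same way from the negative counts.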
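
To make the smoothing formula concrete, here is a small worked example using the counts from the toy `build_freqs` output shown earlier. The helper name `smoothed` is introduced only for this illustration.

```python
import math

# Counts taken from the toy build_freqs output shown earlier in this notebook.
toy_freqs = {("happi", 1): 1, ("trick", 0): 1, ("sad", 0): 1, ("tire", 0): 2}

vocab = {word for word, _ in toy_freqs}
V = len(vocab)                                                    # 4 unique words
pos_volume = sum(c for (_, y), c in toy_freqs.items() if y == 1)  # 1
neg_volume = sum(c for (_, y), c in toy_freqs.items() if y == 0)  # 4


def smoothed(word, label, volume):
    """Laplace-smoothed P(word | label) = (freq + 1) / (volume + V)."""
    return (toy_freqs.get((word, label), 0) + 1) / (volume + V)


p_tire_pos = smoothed("tire", 1, pos_volume)   # (0 + 1) / (1 + 4)
p_tire_neg = smoothed("tire", 0, neg_volume)   # (2 + 1) / (4 + 4)
print(p_tire_pos, p_tire_neg, math.log(p_tire_pos / p_tire_neg))
```

"tire" never occurs in a positive tweet, yet its smoothed positive probability is 0.2 rather than 0, so the log ratio is finite (and negative, as expected for a word seen only in negative tweets).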

In [71]:
def train_naive_bayes(freqs, train_x, train_y):
    """
    Input:
        freqs: dictionary mapping (word, label) to its count
        train_x, train_y: training tweets and labels (not used here)
    Output:
        loglikelihood: dictionary mapping each word to log(P(word|pos) / P(word|neg))

    The log prior is omitted: the classes are balanced, so log(P(pos)/P(neg)) = 0.
    """
    pos_volume = 0
    neg_volume = 0
    loglikelihood = {}

    # total word counts per class, and the vocabulary
    for key, value in freqs.items():
        if isinstance(key, tuple):
            if key[1] == 1.0:
                pos_volume += value
            else:
                neg_volume += value
            loglikelihood[key[0]] = 0
    V = len(loglikelihood)  # number of unique words

    # Laplace-smoothed log likelihood ratio for each word
    for word in loglikelihood:
        pos = (freqs.get((word, 1.0), 0) + 1) / (pos_volume + V)
        neg = (freqs.get((word, 0.0), 0) + 1) / (neg_volume + V)
        loglikelihood[word] = np.log(pos / neg)
    return loglikelihood
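
The function above returns only the per-word log likelihood ratios. With 4000 training tweets per class the prior term log(P(pos) / P(neg)) is zero, so nothing is lost here, but for an unbalanced training set that term would have to be added to every score. A minimal sketch, assuming labels are 1 for positive and 0 for negative; `compute_logprior` is a hypothetical helper, not part of this notebook or of `utils`.

```python
import math


def compute_logprior(labels):
    """Return log(P(pos) / P(neg)) estimated from label counts (1 = pos, 0 = neg)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    return math.log(n_pos / n_neg)


balanced = [1] * 4000 + [0] * 4000   # same class sizes as the training split here
skewed = [1] * 3000 + [0] * 1000     # invented unbalanced example

print(compute_logprior(balanced))    # 0.0
print(compute_logprior(skewed))      # log(3), roughly 1.0986
```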
        

In [73]:
loglikelihood = train_naive_bayes(freqs, train_x, train_y)

In [74]:
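def naive_bayes_predict(tweet, loglikelihood):
    # sum of per-word log likelihood ratios; a value > 0 means positive
    score = 0
    for word in process_tweets(tweet):
        # words not seen in training contribute nothing
        score += loglikelihood.get(word, 0)
    return score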

In [75]:
my_tweet = 'She smiled.'
p = naive_bayes_predict(my_tweet, loglikelihood)
print('The expected output is', p)

The expected output is 1.5741937653457425
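
Words that never appeared in the training tweets have no entry in `loglikelihood`. One common choice, and the one `test_naive_bayes` below makes, is to let such words contribute 0 to the score. The sketch below shows the effect on a made-up dictionary; `score` is a hypothetical helper that takes already-processed words, and the two values in `toy_loglikelihood` are invented.

```python
# Made-up log likelihood ratios; the real dictionary comes from train_naive_bayes.
toy_loglikelihood = {"smile": 1.57, "sad": -2.10}


def score(words, loglikelihood):
    """Sum per-word log ratios; words missing from the dictionary add 0."""
    return sum(loglikelihood.get(w, 0.0) for w in words)


print(score(["smile"], toy_loglikelihood))
print(score(["smile", "zebra"], toy_loglikelihood))   # "zebra" is unseen
print(score(["sad", "smile"], toy_loglikelihood))
```

The first two calls return the same score, since the unseen word is neutral; the third is negative because "sad" outweighs "smile" in this toy dictionary.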


In [90]:
def test_naive_bayes(test_x, test_y, loglikelihood):
    y_hat = []
    for tweet in test_x:
        # sum the log likelihood ratios; unseen words contribute 0
        score = 0
        for word in process_tweets(tweet):
            score += loglikelihood.get(word, 0)
        # positive score -> label 1, otherwise label 0
        y_hat.append(1 if score > 0 else 0)

    # fraction of predictions that match the labels
    accuracy = (np.array(y_hat) == np.squeeze(test_y)).mean()
    return accuracy

In [91]:
test_naive_bayes(test_x, test_y, loglikelihood)

0.994