# Sentiment Analysis with Naive Bayes
### Probability and Bayes's Rule
Suppose we are given a corpus of 20 tweets, of which 7 are classified as negative and 13 as positive. If A is the event that a randomly chosen tweet is positive, then the probability of A is $P(A) = \frac{N_{pos}}{N} = \frac{13}{20} = 0.65$

Now, let's assume that 4 of the tweets contain the word "happy". Three of these tweets are classified as positive and one as negative. To find the probability that a randomly chosen tweet is both positive and contains the word "happy", we need the probability of two events happening together.

![Probability of Two Events Happening](corpus.png)

A Venn diagram makes this concrete. We are interested in the intersection, or overlap, of the two events. The positive tweets and the tweets containing the word "happy" intersect in three boxes. So, out of the 20 tweets, the probability that a randomly chosen tweet is both positive and contains the word "happy" is $P(A, B) = P(A \cap B) = \frac{3}{20} = 0.15$, where B is the event that the tweet contains "happy".


### Conditional Probability 
If we are asked to predict tomorrow's temperature, extra information such as the location and the time of year makes the prediction much easier than having no information at all.

In conditional probability, some condition is already known to hold, and that condition shrinks the sample space. In the previous example, the sample space had 20 elements (the total number of tweets). Now suppose we ask: among the tweets containing the word "happy", what is the probability that a randomly chosen one is positive? Only 4 tweets contain the word "happy", so the condition reduces the sample space from 20 to 4, and 3 of those 4 tweets are positive. We express this mathematically as $P(positive | "happy") = \frac{3}{4} = 0.75$, which reads: the probability that a tweet is positive given that it contains the word "happy". More generally, $P(X|Y)$ is the probability of X given Y.

We can interpret the conditional probability $P(B|A)$ as the probability of B given that A has happened. Put another way, it is the chance that an element drawn from set A also belongs to set B.
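
As a quick numeric check, the sketch below recomputes the marginal, joint, and conditional probabilities from the raw tweet counts of the running example. The variable names are illustrative only.

```python
# Counts taken from the running example in the text.
n_total = 20          # tweets in the corpus
n_pos = 13            # tweets labelled positive
n_happy = 4           # tweets containing "happy"
n_pos_and_happy = 3   # positive tweets containing "happy"

p_pos = n_pos / n_total                      # P(A)
p_pos_and_happy = n_pos_and_happy / n_total  # P(A and B)

# Conditioning on "happy" shrinks the sample space from 20 tweets to 4.
p_pos_given_happy = n_pos_and_happy / n_happy

print(p_pos, p_pos_and_happy, p_pos_given_happy)
```

Note that the conditional probability divides by the size of the reduced sample space (4), not by the size of the whole corpus (20).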

### Bayes's Rule

From the above example, we can generalize the math: 

$$
P(X|Y) = \frac{P(X \cap Y)}{P(Y)} \implies P(X \cap Y) = P(X|Y)P(Y) \\
P(Y|X) = \frac{P(Y \cap X)}{P(X)} \implies P(Y \cap X) = P(Y|X)P(X) \\
P(X|Y)P(Y) = P(Y|X)P(X) \\
P(X|Y) = \frac{P(Y|X)P(X)}{P(Y)}
$$
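
To see the rule at work on the tweet counts from the earlier example, the following sketch plugs them into Bayes's rule using exact fractions. It assumes X is "the tweet is positive" and Y is "the tweet contains happy"; the variable names are illustrative.

```python
from fractions import Fraction

p_pos = Fraction(13, 20)             # P(X): positive tweets in the corpus
p_happy = Fraction(4, 20)            # P(Y): tweets containing "happy"
p_happy_given_pos = Fraction(3, 13)  # P(Y|X): positive tweets that contain "happy"

# Bayes's rule: P(X|Y) = P(Y|X) P(X) / P(Y)
p_pos_given_happy = p_happy_given_pos * p_pos / p_happy

print(p_pos_given_happy)  # 3/4
```

The result agrees with counting directly inside the reduced sample space: 3 of the 4 tweets containing "happy" are positive.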




### Binary Classification with Bayes Rule
Suppose we have four classified tweets, two positive and two negative.

<div style="display: flex; gap: 1rem;">

  <div style="border: 1px solid #333; padding: 0.5rem; flex: 1;">
    <b>Positive Tweets</b> <br>
    I am happy because I am learning NLP.<br>
    I am happy, not sad. 
  </div>

  <div style="border: 1px solid #333; padding: 0.5rem; flex: 1;">
    <b>Negative Tweets </b><br>  
    I am sad, I am not learning NLP.<br> 
    I am sad, not happy. 
  </div>

</div>

Now we build a vocabulary table that records how many times each word occurs in the positive and negative classes, together with the conditional probability of each word given the class.

| Word      | Positive | Negative | Conditional Probability for Positive Class | Conditional Probability for Negative Class |
|-----------|---------:|---------:|-------------------------------------------:|-------------------------------------------:|
| am        |        3 |        3 |                                       0.23 |                                       0.23 |
| because   |        1 |        0 |                                       0.08 |                                          0 |
| happy     |        2 |        1 |                                       0.15 |                                       0.08 |
| i         |        3 |        3 |                                       0.23 |                                       0.23 |
| learning  |        1 |        1 |                                       0.08 |                                       0.08 |
| nlp       |        1 |        1 |                                       0.08 |                                       0.08 |
| not       |        1 |        2 |                                       0.08 |                                       0.15 |
| sad       |        1 |        2 |                                       0.08 |                                       0.15 |
| **Total** |       13 |       13 |                                          1 |                                          1 |

Each conditional probability is the word's count divided by the total count of its class. For example, the positive class sample space consists of 13 elements. Given the positive class, what's the probability of choosing "I"?
$P("I"| Positive) = \frac{3}{13} \approx 0.23$
Hence, for each word, we get the probability of that word appearing in the positive class and in the negative class.
If we study the table, we observe that several words have equal probabilities in both classes (am, i, learning, nlp). These words are "neutral", as their values don't help us decide the tweet's classification. On the other hand, words like sad, not, and happy are "power words": their probabilities differ between the classes, so they can be strong factors in determining the tweet's classification.

### Naive Bayes
It's called "naive" because the algorithm assumes that the features are independent of each other, which is not always true. In real text, words are often highly correlated. "New" and "York" almost always appear together. Similarly, "ice" and "cream" carry a joint meaning. The following expression, which is called the Naive Bayes inference condition rule for binary classification, can be used to find the class of a given tweet:
$$
\prod_{i=1}^m \frac{P(w_i | pos)}{P(w_i | neg)}
$$

Suppose we are given a new tweet: "I am happy today, I am learning."
For each word that appears in the table, we look up its positive and negative conditional probabilities, form their ratio, and multiply all the ratios together. If the result is greater than 1, we classify the tweet as positive; otherwise we classify it as negative.
$$
\prod_{i=1}^m \frac{P(w_i | pos)}{P(w_i | neg)} \\
= \frac{3/13}{3/13}_{I}*\frac{3/13}{3/13}_{am}*\frac{2/13}{1/13}_{happy}*\frac{3/13}{3/13}_{I}*\frac{3/13}{3/13}_{am}*\frac{1/13}{1/13}_{learning} \\
= 2 > 1
$$
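
The table and the product can be reproduced from the four example tweets. The sketch below is a minimal illustration, not the preprocessing used later in the notebook: `tokenize` is a hypothetical helper that simply lowercases the text and keeps alphabetic runs, and no smoothing is applied.

```python
from collections import Counter
import re

positive_tweets = ["I am happy because I am learning NLP.", "I am happy, not sad."]
negative_tweets = ["I am sad, I am not learning NLP.", "I am sad, not happy."]

def tokenize(text):
    # Hypothetical helper: lowercase the text and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

pos_counts = Counter(w for t in positive_tweets for w in tokenize(t))
neg_counts = Counter(w for t in negative_tweets for w in tokenize(t))
n_pos = sum(pos_counts.values())  # total words in the positive class
n_neg = sum(neg_counts.values())  # total words in the negative class

def likelihood_ratio(tweet):
    # Product of P(w|pos) / P(w|neg) over the words found in the vocabulary.
    ratio = 1.0
    for word in tokenize(tweet):
        if word not in pos_counts and word not in neg_counts:
            continue  # out-of-vocabulary word, e.g. "today"
        # Without smoothing, a word never seen in the negative class
        # (e.g. "because") raises ZeroDivisionError here.
        ratio *= (pos_counts[word] / n_pos) / (neg_counts[word] / n_neg)
    return ratio

print(n_pos, n_neg)
print(likelihood_ratio("I am happy today, I am learning."))
```

A result above 1 favors the positive class and a result below 1 favors the negative class.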

## Laplacian Smoothing
We haven't included the word "today" because it doesn't exist in our dictionary. Also, notice that the word "because" never occurs in the negative class, so its negative-class probability is 0. A tweet containing "because" would put a zero in the denominator of the product, which makes the ratio undefined. To solve this problem, the Laplacian smoothing technique is used.

Formally, in Naive Bayes we estimate 
$$
P(w|C) = \frac{count(w, C)}{\sum_{w'}count(w', C)}
$$
But if a word w never appeared in class C in our training data, count(w, C) = 0, so P(w|C) = 0. Multiplying many probabilities means a single zero makes the entire score zero, no matter how strong the other evidence. 





## Laplace Smoothing Formula
To avoid zeroes, add 1 to every word-count, and add V (the vocabulary size) to the denominator: 
$$
\hat{P}(w|C) = \frac{count(w, C)+ 1}{N_C + V} \text{ where } N_C = \sum_{w'} count(w', C)
$$

Suppose, we have two classes and this tiny vocabulary: 

| Word    | count(word, Pos) | count(word, Neg) |
| ------- | :--------------: | :--------------: |
| good    |         2        |         1        |
| bad     |         0        |         3        |
| awesome |         1        |         0        |

- Vocabulary size: V = 3
- Total counts: $N_{Pos} = 2 + 0 + 1 = 3$, and $N_{Neg} = 1 + 3 + 0 = 4$

Adding 1 to the count of each of the V words in a class adds V to that class's total count. Because the probabilities within a class must sum to 1, we add V to the denominator to renormalize.

$$
\sum_{w} \hat P(w \mid C)
\;=\;
\sum_{w} \frac{\mathrm{count}(w,C) + 1}{N_C + V}
\;=\;
\frac{\sum_{w} \bigl(\mathrm{count}(w,C) + 1\bigr)}{\,N_C + V\,}
\;=\;
\frac{N_C + V}{\,N_C + V\,}
\;=\;
1.
$$

If you only added 1 in the numerator but left the denominator as $N_C$, then

$$
\sum_{w} \hat P(w \mid C)
\;=\;
\sum_{w} \frac{\mathrm{count}(w,C) + 1}{N_C}
\;=\;
\frac{N_C + V}{N_C}
\;>\;
1,
$$

which would not be a valid probability distribution.  
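
The normalization argument can be checked numerically on the tiny good/bad/awesome vocabulary. This is a minimal sketch; the `smoothed` helper and the dictionary layout are illustrative choices, not part of any library.

```python
# Counts from the tiny vocabulary table above.
counts = {
    "good":    {"pos": 2, "neg": 1},
    "bad":     {"pos": 0, "neg": 3},
    "awesome": {"pos": 1, "neg": 0},
}
V = len(counts)  # vocabulary size
N = {c: sum(row[c] for row in counts.values()) for c in ("pos", "neg")}

def smoothed(word, cls):
    # Laplace-smoothed estimate: (count + 1) / (N_C + V)
    return (counts[word][cls] + 1) / (N[cls] + V)

for cls in ("pos", "neg"):
    probs = {w: smoothed(w, cls) for w in counts}
    # Every probability is non-zero and each class sums to 1
    # (up to floating-point rounding).
    print(cls, probs, sum(probs.values()))
```

Words with a raw count of 0, such as "bad" in the positive class, now receive a small but non-zero probability.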



If we update our previously calculated table with Laplacian smoothing (here $V = 8$ and $N_{pos} = N_{neg} = 13$, so every denominator is $13 + 8 = 21$), we get

| Word      | Positive | Negative | `P(w∣Positive)` | `P(w∣Negative)` |
|-----------|---------:|---------:|----------------:|----------------:|
| am        |        3 |        3 |            0.19 |            0.19 |
| because   |        1 |        0 |            0.10 |            0.05 |
| happy     |        2 |        1 |            0.14 |            0.10 |
| i         |        3 |        3 |            0.19 |            0.19 |
| learning  |        1 |        1 |            0.10 |            0.10 |
| nlp       |        1 |        1 |            0.10 |            0.10 |
| not       |        1 |        2 |            0.10 |            0.14 |
| sad       |        1 |        2 |            0.10 |            0.14 |
| **Total** |       13 |       13 |            1.00 |            1.00 |


### Ratio of Probabilities
Words have many different meanings, but in general we can divide them into positive, negative, and neutral. If we observe the table above, we see that several words are neutral, as their ratio of positive to negative probability equals 1. If a word has a ratio greater than 1, the word is positive. The larger the value, the more "positive" the word. Similarly, if the ratio is less than 1, the word is negative. The closer the value is to 0, the more "negative" the word. We now expand the table to include each word's ratio of positive to negative probability, according to the formula:
$$
\text{Ratio}(w)
\;=\;
\frac{P\bigl(w \mid \text{Positive}\bigr)}
     {P\bigl(w \mid \text{Negative}\bigr)}.
$$

| Word      | Positive | Negative | `P(w∣Positive)` | `P(w∣Negative)` | Ratio  |
|-----------|---------:|---------:|----------------:|----------------:|-------:|
| am        |        3 |        3 |            0.19 |            0.19 |   1.00 |
| because   |        1 |        0 |            0.10 |            0.05 |   2.00 |
| happy     |        2 |        1 |            0.14 |            0.10 |   1.40 |
| i         |        3 |        3 |            0.19 |            0.19 |   1.00 |
| learning  |        1 |        1 |            0.10 |            0.10 |   1.00 |
| nlp       |        1 |        1 |            0.10 |            0.10 |   1.00 |
| not       |        1 |        2 |            0.10 |            0.14 |   0.71 |
| sad       |        1 |        2 |            0.10 |            0.14 |   0.71 |
| **Total** |       13 |       13 |            1.00 |            1.00 |        |




Finally, when we are given a text to evaluate, we take the logarithm of the ratio column. We use the log likelihood because classifying a text means multiplying many values close to zero, and doing that on a computer can lead to underflow. Underflow happens when the product of many tiny probabilities becomes smaller than the smallest number the computer's floating-point format can represent, so it gets "rounded" all the way down to zero. Once it's zero, all information is lost (we can't recover whether it was 1e-100 or 1e-1000), and further multiplications stay zero. Taking logarithms turns the product into a sum, which stays in a range that floating-point numbers represent comfortably.
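
The effect is easy to reproduce. The sketch below multiplies 100 hypothetical per-word probabilities of 1e-5 each (made-up values, chosen only to trigger underflow in double precision) and compares the result with the sum of their logarithms.

```python
import math

probs = [1e-5] * 100  # hypothetical tiny per-word probabilities

product = 1.0
for p in probs:
    product *= p  # the true value, 1e-500, is too small for a 64-bit float

log_sum = sum(math.log(p) for p in probs)  # same quantity in log space

print(product)  # 0.0
print(log_sum)  # roughly -1151.29
```

The product has collapsed to zero, while the log-space sum still carries the full information.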

Now, given the text "I am happy, because I am learning", let's calculate its log likelihood using the λ column of the following table.

| Word      | Positive | Negative | `P(w∣Positive)` | `P(w∣Negative)` | Ratio | λ = ln(Ratio) |
|-----------|---------:|---------:|----------------:|----------------:|------:|--------------:|
| am        |        3 |        3 |            0.19 |            0.19 |  1.00 |          0.00 |
| because   |        1 |        0 |            0.10 |            0.05 |  2.00 |          0.69 |
| happy     |        2 |        1 |            0.14 |            0.10 |  1.40 |          0.34 |
| i         |        3 |        3 |            0.19 |            0.19 |  1.00 |          0.00 |
| learning  |        1 |        1 |            0.10 |            0.10 |  1.00 |          0.00 |
| nlp       |        1 |        1 |            0.10 |            0.10 |  1.00 |          0.00 |
| not       |        1 |        2 |            0.10 |            0.14 |  0.71 |         −0.34 |
| sad       |        1 |        2 |            0.10 |            0.14 |  0.71 |         −0.34 |
| **Total** |       13 |       13 |            1.00 |            1.00 |  1.00 |          0.00 |

score = $\lambda (happy)$ + $\lambda (because)$ = 0.34 + 0.69 = 1.03 (the neutral words "i", "am", and "learning" each contribute 0)

So, based on Naive Bayes, the given text is likely to be positive, because its score is greater than 0. As we are using logarithms, neutral words correspond to 0, positive words correspond to values greater than 0, and negative words correspond to values less than 0.
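
The same score can be computed directly from the word counts. This sketch is an illustration under one stated difference: it uses the exact smoothed probabilities rather than the two-decimal values shown in the table, so it yields about 1.10 instead of the table-based 1.03. The helper names `lam` and `score` are hypothetical.

```python
import math

# Word counts per class from the example table: word -> (positive, negative).
counts = {
    "am": (3, 3), "because": (1, 0), "happy": (2, 1), "i": (3, 3),
    "learning": (1, 1), "nlp": (1, 1), "not": (1, 2), "sad": (1, 2),
}
V = len(counts)
n_pos = sum(pos for pos, _ in counts.values())
n_neg = sum(neg for _, neg in counts.values())

def lam(word):
    # Log of the ratio of Laplace-smoothed class-conditional probabilities.
    pos, neg = counts[word]
    p_pos = (pos + 1) / (n_pos + V)
    p_neg = (neg + 1) / (n_neg + V)
    return math.log(p_pos / p_neg)

def score(words):
    # Sum of lambdas; words outside the vocabulary are skipped.
    return sum(lam(w) for w in words if w in counts)

tweet = ["i", "am", "happy", "because", "i", "am", "learning"]
print(round(score(tweet), 2))
```

Either way the score is well above 0, so the text is classified as positive.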




## Training Naive Bayes Steps

1. Get or annotate a dataset with positive and negative tweets
2. Preprocess the tweets
   - Lowercase
   - Remove punctuation, urls, handles, names
   - Remove stop words
   - Stemming
   - Tokenize sentences
3. Compute freq(w, class) <br>
   Create a table with the vocabulary with their frequency counts in positive and negative classes. Sum up the frequency counts in each class. 
4. Compute P(w | pos), P(w | neg) <br>
   From the table, calculate these two probabilities for each word in vocabulary. 
5. Compute $\lambda (w)$ <br>
   Using the P(w | pos), P(w | neg) for each word in the vocabulary, calculate $\lambda (w) = 
\log \frac{P\bigl(w \mid \text{Positive}\bigr)}
     {P\bigl(w \mid \text{Negative}\bigr)}$
6. Compute log (prior) = $\log \frac{P(pos)}{P(neg)}$ <br>
   If the dataset is balanced, i.e., the number of examples in positive and negative classes are equal, then we don't need to worry about this term, as it'll be zero. It's important only when the dataset is unbalanced. 
   


# Implementation
We will use the dataset provided by the nltk library. It is a balanced dataset of 5000 annotated positive tweets and 5000 annotated negative tweets.

In [4]:
# Importing libraries 

import pdb
from nltk.corpus import stopwords, twitter_samples
import numpy as np
import pandas as pd
import nltk
import string
from nltk.tokenize import TweetTokenizer
from os import getcwd
import re 
from collections import Counter
import matplotlib.pyplot as plt
from nltk.stem import PorterStemmer
from matplotlib.patches import Ellipse
import matplotlib.transforms as transforms
# Downloading the Twitter samples and stopwords from NLTK   

nltk.download('twitter_samples')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Abir\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Abir\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Preparing the Dataset: Splitting into training and test (validation) sets

In [5]:
# get the sets of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# split the positive and negative tweets into training and test sets
train_pos = all_positive_tweets[:4000]
train_neg = all_negative_tweets[:4000]
test_pos = all_positive_tweets[4000:]
test_neg = all_negative_tweets[4000:]

train_x = train_pos + train_neg
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))

test_x = test_pos + test_neg
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))


### Data Preprocessing

In [12]:
def process_tweet(tweet):
    '''
    Input: 
    tweet: a string containing a tweet
    Output:
    tweets_clean: a list of strings containing the cleaned tweet

    '''
    stemmer = PorterStemmer()
    stopwords_english = set(stopwords.words('english')) # a set gives fast membership tests
    tweet = re.sub(r'\$\w*', '', tweet) # remove stock tickers such as $GE
    tweet = re.sub(r'@\w*', '', tweet) # remove @ sign and the handle after it
    tweet = re.sub(r'#', '', tweet) # remove the # sign but keep the hashtag word
    tweet = re.sub(r'^RT[\s]+', '', tweet) # remove the retweet marker "RT" at the start of the tweet
    tweet = re.sub(r'https?:\/\/\S+', '', tweet) # remove links

    #tokenize the tweet
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    #print(tweet_tokens)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and word not in string.punctuation):
            stem_word = stemmer.stem(word)
            tweets_clean.append(stem_word)
    return tweets_clean


In [9]:
# Test the function
tweet = "I am so happy today :) #happy"
print("Original tweet: ", tweet)
print("Cleaned tweet: ", process_tweet(tweet))

Original tweet:  I am so happy today :) #happy
['i', 'am', 'so', 'happy', 'today', ':)', 'happy']
Cleaned tweet:  ['happi', 'today', ':)', 'happi']


We will now create a function count_tweets that takes a list of tweets and their labels as input, preprocesses the tweets, and returns a dictionary.
- Each key in the dictionary is a tuple containing a stemmed word and its class label, e.g., (happi, 1)
- Each value is the number of times that word appears in that class, e.g., (happi, 1) : 10, (happi, 0): 2

In [10]:
def count_tweets(result, tweets, ys):
    '''
    Input:
    result: a dictionary that will be used to map each pair to its frequency
    tweets: a list of tweets
    ys: a list of labels (1 for positive, 0 for negative)
    Output:
    result: a dictionary that contains the frequency of each pair of words in the tweets
    '''
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in result:
                result[pair] += 1
            else:
                result[pair] = 1
    return result

In [13]:
# Test the function
result = {}
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]
count_tweets(result, tweets, ys)

{('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}

## Train your Model using Naive Bayes

#### So how do you train a Naive Bayes classifier?
- The first part of training a naive bayes classifier is to identify the number of classes that you have. In this project, we have two classes (Positive, Negative). 
- You will create a probability for each class.
$P(D_{pos})$ is the probability that the document (tweet) is positive.
$P(D_{neg})$ is the probability that the document (tweet) is negative.
Use the formulas as follows and store the values in a dictionary:

$$P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}$$

$$P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}$$

Where $D$ is the total number of documents, or tweets in this case, $D_{pos}$ is the total number of positive tweets and $D_{neg}$ is the total number of negative tweets.

#### Prior and Logprior

The prior probability represents the underlying probability in the target population that a tweet is positive versus negative. In other words, if we had no specific information and blindly picked a tweet out of the population set, what is the probability that it will be positive versus that it will be negative? That is the “prior.”

Here we express the prior as the ratio $\frac{P(D_{\text{pos}})}{P(D_{\text{neg}})}$.

Taking logs to rescale, we define the **logprior**:

$$
\text{logprior}
= \log\!\biggl(\frac{P(D_{\text{pos}})}{P(D_{\text{neg}})}\biggr)
= \log P(D_{\text{pos}})\;-\;\log P(D_{\text{neg}}) = \log (D_{pos}) - \log(D_{neg}).
$$
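
The last equality holds because the total $D$ cancels in the ratio. The sketch below checks this numerically; `log_prior` is a hypothetical helper, and the unbalanced counts are made up for illustration.

```python
import math

def log_prior(d_pos, d_neg):
    d = d_pos + d_neg
    via_probabilities = math.log((d_pos / d) / (d_neg / d))
    via_counts = math.log(d_pos) - math.log(d_neg)
    # Both routes give the same number because D cancels in the ratio.
    assert math.isclose(via_probabilities, via_counts, abs_tol=1e-12)
    return via_counts

print(log_prior(4000, 4000))  # balanced data gives 0.0
print(log_prior(6000, 2000))  # made-up unbalanced split gives ln(3), about 1.0986
```

A positive logprior shifts every score toward the positive class, and a negative one shifts it toward the negative class.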


#### Positive and Negative Probability of a Word
To compute the positive probability and the negative probability for a specific word in the vocabulary, we'll use the following inputs:

- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
- $N_{pos}$ and $N_{neg}$ are the total number of positive and negative words for all documents (for all tweets), respectively.
- $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We'll use these to compute the positive and negative probability for a specific word using this formula:

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

#### Log likelihood
To compute the loglikelihood of that very same word, we can implement the following equations:

$$\text{log likelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}$$
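
Equations 4 to 6 can be tried on the small dictionary returned by the `count_tweets` test earlier. This is a minimal sketch; the helper name `word_loglikelihood` is illustrative.

```python
import math

# Toy frequency dictionary from the count_tweets test above.
freqs = {("happi", 1): 1, ("trick", 0): 1, ("sad", 0): 1, ("tire", 0): 2}

vocab = {word for word, _ in freqs}
V = len(vocab)  # 4 unique words
N_pos = sum(c for (_, label), c in freqs.items() if label == 1)  # 1
N_neg = sum(c for (_, label), c in freqs.items() if label == 0)  # 4

def word_loglikelihood(word):
    # Equations 4 and 5: Laplace-smoothed probabilities of the word in each class.
    p_w_pos = (freqs.get((word, 1), 0) + 1) / (N_pos + V)
    p_w_neg = (freqs.get((word, 0), 0) + 1) / (N_neg + V)
    # Equation 6: log of their ratio.
    return math.log(p_w_pos / p_w_neg)

for word in sorted(vocab):
    print(word, round(word_loglikelihood(word), 3))
```

Here "happi" gets a positive log likelihood, while the words seen only in negative tweets get negative ones.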

Now, we want to create and populate freqs dictionary. We will use the count_tweets function and give it an empty dictionary, the train_x, and the train_y lists as input. 

In [14]:
# Build the frequency dictionary for the training set
result = {}
freqs = count_tweets(result, train_x, train_y)

In [None]:
# Convert the dictionary to a pandas DataFrame, we won't use it directly but will be useful for inspection
data = pd.DataFrame.from_dict(freqs, orient='index')
data.reset_index(inplace=True)

|    | index               |    0 |
|---:|:--------------------|-----:|
|  0 | (followfriday, 1.0) |   23 |
|  1 | (top, 1.0)          |   30 |
|  2 | (engag, 1.0)        |    7 |
|  3 | (member, 1.0)       |   14 |
|  4 | (commun, 1.0)       |   27 |
|  5 | (week, 1.0)         |   72 |
|  6 | (:), 1.0)           | 2960 |
|  7 | (hey, 1.0)          |   60 |
|  8 | (jame, 1.0)         |    7 |
|  9 | (odd, 1.0)          |    2 |


In [17]:
data.head(-10)

|       | index               |   0 |
|------:|:--------------------|----:|
|     0 | (followfriday, 1.0) |  23 |
|     1 | (top, 1.0)          |  30 |
|     2 | (engag, 1.0)        |   7 |
|     3 | (member, 1.0)       |  14 |
|     4 | (commun, 1.0)       |  27 |
|   ... | ...                 | ... |
| 11378 | (agov, 0.0)         |   1 |
| 11379 | (brasileirao, 0.0)  |   1 |
| 11380 | (abus, 0.0)         |   1 |
| 11381 | (unpar, 0.0)        |   1 |


Given a freqs dictionary, `train_x` (a list of tweets) and a `train_y` (a list of labels for each tweet), we will now implement a naive bayes classifier.

##### Calculate $V$
- You can then compute the number of unique words that appear in the `freqs` dictionary to get your $V$ (you can use the `set` function).

##### Calculate $freq_{pos}$ and $freq_{neg}$
- Using your `freqs` dictionary, you can compute the positive and negative frequency of each word $freq_{pos}$ and $freq_{neg}$.

##### Calculate $N_{pos}$, and $N_{neg}$
- Using `freqs` dictionary, you can also compute the total number of positive words and total number of negative words $N_{pos}$ and $N_{neg}$.

##### Calculate $D$, $D_{pos}$, $D_{neg}$
- Using the `train_y` input list of labels, calculate the number of documents (tweets) $D$, as well as the number of positive documents (tweets) $D_{pos}$ and number of negative documents (tweets) $D_{neg}$.
- Calculate the probability that a document (tweet) is positive $P(D_{pos})$, and the probability that a document (tweet) is negative $P(D_{neg})$

##### Calculate the logprior
- the logprior is $log(D_{pos}) - log(D_{neg})$

##### Calculate log likelihood
- Finally, you can iterate over each word in the vocabulary and look up its positive frequency, $freq_{pos}$, and its negative frequency, $freq_{neg}$, in the `freqs` dictionary.
- Compute the positive probability of each word $P(W_{pos})$, negative probability of each word $P(W_{neg})$ using equations 4 & 5.

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

**Note:** We'll use a dictionary to store the log likelihoods for each word. The key is the word, and the value is the log likelihood of that word.

- You can then compute the loglikelihood: $log \left( \frac{P(W_{pos})}{P(W_{neg})} \right)$.

In [20]:
def train_naive_bayes(freqs, train_x, train_y):
    ''' 
    Input:
    freqs: a dictionary that contains the frequency of each pair of words in the tweets
    train_x: a list of tweets
    train_y: a list of labels (1 for positive, 0 for negative)
    Output:
    logprior: the log prior
    loglikelihood: the log likelihood of your Naive Bayes model
    '''
    loglikelihood = {}
    logprior = 0
    # Calculate V, the number of unique words in the training set
    vocab = {word for word, _ in freqs}
    V = len(vocab)
    # Calculate N_pos and N_neg, the total word counts in each class
    N_pos = N_neg = 0
    for (word, label), count in freqs.items():
        if label == 1:
            N_pos += count
        else:
            N_neg += count

    # Calculate D, the number of documents in the training set
    D = len(train_x)
    # Convert the labels to an array so the comparisons below are elementwise,
    # even when train_y is passed as a plain Python list
    train_y = np.asarray(train_y)
    # Calculate D_pos and D_neg, the number of positive and negative documents
    D_pos = np.sum(train_y == 1)
    D_neg = np.sum(train_y == 0)
    # Calculate logprior
    logprior = np.log(D_pos) - np.log(D_neg)
    # Calculate and populate loglikelihood 
    for word in vocab:
        # get the positive and negative frequencies of the word
        freq_pos = freqs.get((word, 1), 0)
        freq_neg = freqs.get((word, 0), 0)

        # calculate the probability that each word is positive or negative
        p_word_given_pos = (freq_pos + 1) / (N_pos + V)
        p_word_given_neg = (freq_neg + 1) / (N_neg + V)

        # calculate the log likelihood of each word
        loglikelihood[word] = np.log(p_word_given_pos) - np.log(p_word_given_neg)

    return logprior, loglikelihood




In [22]:
# Test the function
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print("logprior: ", logprior)
print("loglikelihood: ", loglikelihood)
print(len(loglikelihood))

logprior:  0.0
loglikelihood:  {'repres': 0.3911928099737665, 'carlton': -1.1128845868025063, '2ish': 0.6788748824255482, '╰': 1.3720220629854918, 'ngarepfollbackdarinabilahjkt': 0.6788748824255482, 'btw': 0.902018433739757, '4yr': 0.6788748824255482, 'filbarbarian': 0.6788748824255482, 'cypru': -0.7074194786943426, 'symphoni': -0.7074194786943426, 'camsex': -1.8060317673624517, 'percentag': -0.7074194786943426, '346': -0.7074194786943426, 'prize': -0.41973740624256095, "s'okay": 0.6788748824255482, '╱': 1.7774871710936573, 'hasb': -0.7074194786943426, 'knee': 0.49655332563159327, 'justic': -0.014272298134397232, 'pete': -0.014272298134397232, 'wnt': -0.7074194786943426, 'soul': 0.8330255622528071, 'showpo': 0.6788748824255482, 'foto': 0.6788748824255482, 'idaho': -1.1128845868025063, 'hollywood': 1.3720220629854918, 'quiet': -0.7074194786943426, 'oval': 0.6788748824255482, "people'": 0.3911928099737665, '7am': -0.014272298134397232, 'bloi': 0.6788748824255482, 'ye': 0.4077221119249765

### Testing Naive Bayes

Now, we will implement the naive_bayes_predict function to make predictions on tweets. 

* The function takes in the `tweet`, `logprior`, `loglikelihood`.
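* It returns a score: the log of the ratio between the probability that the tweet is positive and the probability that it is negative.
* For each tweet, sum up the loglikelihoods of each word in the tweet.
* Also add the logprior to this sum to get the predicted sentiment score of that tweet. A score greater than 0 means the tweet is predicted to be positive.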

$$ p = logprior + \sum_i^N (loglikelihood_i)$$

#### Note
Note that we calculate the prior from the training data, and that the training data is evenly split between positive and negative labels (4000 positive and 4000 negative tweets). This means that the ratio of positive to negative is 1, and the logprior is 0.

The value of 0.0 means that when we add the logprior to the log likelihood, we're just adding zero to the log likelihood.  However, please remember to include the logprior, because whenever the data is not perfectly balanced, the logprior will be a non-zero value.

In [23]:
def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
    tweet: a string
    logprior: the log prior, a number
    loglikelihood: the log likelihood of your Naive Bayes model, a dictionary
    output:
    p: the sum of all loglikelihoods of the words in the tweet plus the logprior
    '''
    # process the tweet to get the list of words
    tweet = process_tweet(tweet)
    # initialize probability to zero
    p = 0
    # add the logprior to the probability
    p += logprior
    # iterate through the words in the tweet
    for word in tweet:
        # if the word is in the loglikelihood dictionary, add its loglikelihood to the probability
        if word in loglikelihood:
            p += loglikelihood[word]
    return p
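
The decision rule can be tried without the trained model. The sketch below uses a hypothetical stand-in, `predict_score`, that takes already processed tokens, and a hand-made `toy_loglikelihood` dictionary whose values are invented for illustration.

```python
def predict_score(tokens, logprior, loglikelihood):
    # Hypothetical stand-in for naive_bayes_predict: sums the logprior and the
    # log likelihoods of known tokens; unknown tokens contribute nothing.
    return logprior + sum(loglikelihood.get(token, 0.0) for token in tokens)

toy_loglikelihood = {"happi": 1.2, "sad": -1.5, "today": 0.1}  # made-up values

positive_score = predict_score(["happi", "today"], 0.0, toy_loglikelihood)
negative_score = predict_score(["sad", "unknown"], 0.0, toy_loglikelihood)

print(positive_score > 0)  # True: predicted positive
print(negative_score > 0)  # False: predicted negative
```

Skipping unknown words is equivalent to treating them as neutral, with a log likelihood of 0.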

Now, we will test the naive_bayes_predict on the test dataset. 

In [27]:
def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    '''
    input:
    test_x: a list of tweets
    test_y: a list of labels (1 for positive, 0 for negative)
    logprior: the log prior, a number
    loglikelihood: the log likelihood of your Naive Bayes model, a dictionary
    output:
    accuracy: the accuracy of your model on the test set (# of correct predictions / total # of predictions)
    ''' 
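    # convert the labels to an array so the subtraction below is elementwise,
    # even when test_y is passed as a plain Python list
    test_y = np.asarray(test_y)
    y_hats = []
    for tweet in test_x:
        # a score greater than 0 means the tweet is predicted to be positive
        if naive_bayes_predict(tweet, logprior, loglikelihood) > 0:
            y_hat_i = 1
        else:
            y_hat_i = 0
        y_hats.append(y_hat_i)
    # the error is the fraction of predictions that differ from the labels
    error = np.mean(np.abs(np.asarray(y_hats) - test_y))
    accuracy = 1 - error # calculate the accuracy
    return accuracy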
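
The accuracy formula itself can be checked on a handful of made-up labels. This is a pure-Python sketch with a hypothetical helper name, independent of the model.

```python
def accuracy_from_predictions(y_hats, ys):
    # With 0/1 labels, |y_hat - y| is 1 for a wrong prediction and 0 for a
    # correct one, so the mean absolute difference is the error rate.
    errors = sum(abs(y_hat - y) for y_hat, y in zip(y_hats, ys))
    return 1 - errors / len(ys)

predictions = [1, 0, 1, 1]  # made-up predictions
labels = [1, 0, 0, 1]       # made-up true labels
print(accuracy_from_predictions(predictions, labels))  # 0.75
```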

In [28]:
# Test the function
accuracy = test_naive_bayes(test_x, test_y, logprior, loglikelihood)
print("Accuracy: ", accuracy)

Accuracy:  0.9955


#### Filter words by ratio of positive to negative counts
- Some words have more positive counts than others, and can be considered "more positive".  Likewise, some words can be considered more negative than others.
- One way for us to define the level of positiveness or negativeness, without calculating the log likelihood, is to compare the positive to negative frequency of the word.
    - Note that we can also use the log likelihood calculations to compare relative positivity or negativity of words.
- We can calculate the ratio of positive to negative frequencies of a word.
- Once we're able to calculate these ratios, we can also filter a subset of words that have a minimum ratio of positivity / negativity or higher.
- Similarly, we can also filter a subset of words that have a maximum ratio of positivity / negativity or lower (words that are at least as negative, or even more negative than a given threshold). <br>

Now we will implement the get_ratio function, along with a helper function lookup that it relies on. The get_ratio function returns $$ratio = \frac{\text{pos\_words} + 1}{\text{neg\_words} + 1}$$



In [29]:
def lookup(freqs, word, label):
    '''
    Input: 
    freqs: a dictionary that contains the frequency of each pair of words in the tweets
    word: the word to look up
    label: the label corresponding to the word (1 for positive, 0 for negative)
    Output:
    n: the frequency of the word in the tweets with the given label
    '''
    n = 0
    pair = (word, label)
    if pair in freqs:
        n = freqs[pair]
    return n


In [33]:
def get_ratio(freqs, word):
    '''
    Input:
    freqs: dictionary of frequencies
    word: the word to look up
    Output:
    pos_neg_ratio: a dictionary with the word's positive count, negative count,
                   and the ratio (positive + 1) / (negative + 1)
    '''
    pos_neg_ratio = {'positive':0, 'negative':0, 'ratio':0.0}
    pos_neg_ratio['positive'] = lookup(freqs, word, 1)
    pos_neg_ratio['negative'] = lookup(freqs, word, 0)
    pos_neg_ratio['ratio'] = (pos_neg_ratio['positive'] + 1) / (pos_neg_ratio['negative'] + 1) # add 1 to avoid division by zero
    return pos_neg_ratio

In [35]:
# Test the function
get_ratio(freqs, 'happi')

{'positive': 162, 'negative': 18, 'ratio': 8.578947368421053}

We will now implement get_words_by_threshold(freqs, label, threshold).

* If we set the label to 1, then we'll look for all words whose positive/negative ratio is at least as high as the threshold.
* If we set the label to 0, then we'll look for all words whose positive/negative ratio is at most as low as the threshold.
* Use the `get_ratio` function to get a dictionary containing the positive count, negative count, and the ratio of positive to negative counts.
* Store the `get_ratio` result inside another dictionary, where the key is the word and the value is the dictionary `pos_neg_ratio` that is returned by the `get_ratio` function.
An example key-value pair would have this structure:
```
{'happi':
    {'positive': 10, 'negative': 20, 'ratio': 0.524}
}
```

In [36]:
def get_words_by_threshold(freqs, label, threshold, get_ratio = get_ratio):
    '''
    Input:
    freqs: dictionary of frequencies
    label: 1 for positive, 0 for negative
    threshold: ratio that will be used as the cutoff for including a word in the returned dictionary
    Output:
    word_list: dictionary containing the word and information on its positive, negative count, and ratio
    '''
    word_list = {}
    for key in freqs.keys():
        word, _ = key
        pos_neg_ratio = get_ratio(freqs, word)
        if pos_neg_ratio['ratio'] >= threshold and label == 1:
            word_list[word] = pos_neg_ratio
        elif pos_neg_ratio['ratio'] <= threshold and label == 0:
            word_list[word] = pos_neg_ratio
    return word_list

In [37]:
# Test the function
get_words_by_threshold(freqs, label=0, threshold=0.05)

{':(': {'positive': 1, 'negative': 3675, 'ratio': 0.000544069640914037},
 ':-(': {'positive': 0, 'negative': 386, 'ratio': 0.002583979328165375},
 'zayniscomingbackonjuli': {'positive': 0, 'negative': 19, 'ratio': 0.05},
 '26': {'positive': 0, 'negative': 20, 'ratio': 0.047619047619047616},
 '>:(': {'positive': 0, 'negative': 43, 'ratio': 0.022727272727272728},
 'lost': {'positive': 0, 'negative': 19, 'ratio': 0.05},
 '♛': {'positive': 0, 'negative': 210, 'ratio': 0.004739336492890996},
 '》': {'positive': 0, 'negative': 210, 'ratio': 0.004739336492890996},
 'beli̇ev': {'positive': 0, 'negative': 35, 'ratio': 0.027777777777777776},
 'wi̇ll': {'positive': 0, 'negative': 35, 'ratio': 0.027777777777777776},
 'justi̇n': {'positive': 0, 'negative': 35, 'ratio': 0.027777777777777776},
 'ｓｅｅ': {'positive': 0, 'negative': 35, 'ratio': 0.027777777777777776},
 'ｍｅ': {'positive': 0, 'negative': 35, 'ratio': 0.027777777777777776}}
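
For a version that can be run without the full dataset, the sketch below applies the same ratio and threshold logic to a toy dictionary holding only the counts reported above for 'happi' and ':('. The function names are hypothetical, and this variant visits each word once by first collecting the unique words.

```python
# Counts for two words, taken from the outputs shown above.
toy_freqs = {("happi", 1): 162, ("happi", 0): 18, (":(", 1): 1, (":(", 0): 3675}

def ratio(freqs, word):
    # (positive count + 1) / (negative count + 1)
    pos = freqs.get((word, 1), 0)
    neg = freqs.get((word, 0), 0)
    return (pos + 1) / (neg + 1)

def words_by_threshold(freqs, label, threshold):
    words = {word for word, _ in freqs}  # unique words, each visited once
    if label == 1:
        return {w: ratio(freqs, w) for w in words if ratio(freqs, w) >= threshold}
    return {w: ratio(freqs, w) for w in words if ratio(freqs, w) <= threshold}

print(words_by_threshold(toy_freqs, label=1, threshold=5))     # keeps 'happi'
print(words_by_threshold(toy_freqs, label=0, threshold=0.05))  # keeps ':('
```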

In [38]:
# Error Analysis: print each misclassified test tweet as
# true label, predicted label, and the processed tweet
for x, y in zip(test_x, test_y):
    y_hat = naive_bayes_predict(x, logprior, loglikelihood)
    if y != (np.sign(y_hat) > 0):
        print('%d\t%0.2f\t%s' % (y, np.sign(y_hat) > 0, ' '.join(
            process_tweet(x)).encode('ascii', 'ignore')))
        

1	0.00	b'truli later move know queen bee upward bound movingonup'
1	0.00	b'new report talk burn calori cold work harder warm feel better weather :p'
1	0.00	b'harri niall 94 harri born ik stupid wanna chang :d'
1	0.00	b'park get sunlight'
1	0.00	b'uff itna miss karhi thi ap :p'
0	1.00	b'hello info possibl interest jonatha close join beti :( great'
0	1.00	b'u prob fun david'
0	1.00	b'pat jay'
0	1.00	b'sr financi analyst expedia inc bellevu wa financ expediajob job job hire'
