# Naïve Bayes

## Probability and Bayes' Rule

To start, we review what probabilities and conditional probabilities are, how they operate, and how they can be expressed mathematically. Imagine we have an extensive corpus of tweets, each of which can be categorized as expressing either positive or negative sentiment, but not both. 

<img src="images/corpus_tweets.svg" width="15%"/>

Within that corpus, the word "happy" sometimes appears in tweets labeled positive (green) and sometimes in tweets labeled negative (red). One way to think about probabilities is by counting how frequently events occur. Suppose we define event `A` as a tweet being labeled positive. Then the probability of event `A`, written `P(A)`, is the ratio of the count of positive tweets in the corpus to the total number of tweets in the corpus. In this example, that number comes out to 13 over 20, or 0.65. 

$$
P(A) = \frac{|\text{Positive}|}{|\text{Tweets}|} = \frac{N_{pos}}{N} = \frac{13}{20} = 0.65
$$

We can express that value as a percentage, *i.e.* 65% positive. It's worth noting that the complementary probability, which is the probability of a tweet expressing a negative sentiment, is just one minus the probability of a positive sentiment. Note that for this to be true, all tweets must be categorized as either positive or negative, but not both. 

$$
P(\text{Negative}) = 1 - P(\text{Positive}) = 1 - 0.65 = 0.35
$$

Let's define event `B` in a similar way by counting tweets containing the word "happy". 

<img src="images/corpus_tweets_happy.svg" width="15%"/>

In this case, the total number of tweets containing the word "happy", shown here as $N_{\text{happy}}$, is 4. Thus,

$$
P(B) = P(\text{happy}) = \frac{N_{\text{happy}}}{N} = \frac{4}{20} = 0.2
$$

Here is another way of looking at it. Take a look at the section of the diagram where tweets are labeled positive and also contain the word "happy". 

<img src="images/venn_diagram.svg" width="50%"/>

In the context of this diagram, the probability that a tweet is labeled positive and also contains the word "happy" is just the ratio of the area of the intersection to the area of the entire corpus. 

In other words, if there are 20 tweets in the corpus and three of them are labeled positive and also contain the word "happy", then the associated probability is 3 / 20, or 0.15.

$$
P(A \cap B) = P(A,B) = \frac{3}{20} = 0.15
$$
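The counting view above can be reproduced in a few lines of Python. This is only a minimal sketch: the `tweets` list is a made-up stand-in whose counts match the example (20 tweets, 13 positive, 4 containing "happy", 3 both), not the actual corpus.

```python
from fractions import Fraction

# Hypothetical toy corpus: (label, contains_happy) pairs chosen so that the
# counts match the text: 20 tweets, 13 positive, 4 with "happy", 3 both.
tweets = (
    [("pos", True)] * 3      # positive and contains "happy"
    + [("pos", False)] * 10  # positive, no "happy"
    + [("neg", True)] * 1    # negative but contains "happy"
    + [("neg", False)] * 6   # negative, no "happy"
)

n = len(tweets)
n_pos = sum(1 for label, _ in tweets if label == "pos")
n_happy = sum(1 for _, happy in tweets if happy)
n_pos_and_happy = sum(1 for label, happy in tweets if label == "pos" and happy)

p_a = Fraction(n_pos, n)                  # P(A) = P(Positive)
p_not_a = 1 - p_a                         # P(Negative), the complement
p_b = Fraction(n_happy, n)                # P(B) = P(happy)
p_a_and_b = Fraction(n_pos_and_happy, n)  # P(A and B)

print(float(p_a), float(p_not_a), float(p_b), float(p_a_and_b))
```

Using `Fraction` keeps the ratios exact, so the complement rule can be checked without floating-point noise.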


## Bayes' Rule

To derive Bayes' rule, let's first take a look at conditional probabilities. Think about what happens if, instead of the entire corpus, we only consider tweets that contain the word "happy". This is the same as saying "given that a tweet contains the word 'happy'". We would then be considering only the tweets inside the blue circle, where many of the positive tweets are now excluded. 

<img src="images/venn_diagram_happy.svg" width="60%"/>

In this case, the probability that a tweet is positive given that it contains the word "happy" is simply the number of tweets that are positive and also contain the word "happy", divided by the number of tweets that contain the word "happy". 

$$
P(A|B) = P(\text{Positive}|\text{happy}) = \frac{3}{4} = 0.75
$$

As we can see, a tweet has a 75% likelihood of being positive if it contains the word "happy". We can make the same case for positive tweets. The purple area denotes the probability that a positive tweet contains the word "happy". 

<img src="images/venn_diagram_positive.svg" width="50%"/>

In this case, the probability is 3 over 13, which is 0.231. 

$$
P(B|A) = P(\text{happy}|\text{Positive}) = \frac{3}{13} = 0.231
$$
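The two conditional probabilities differ only in which count goes in the denominator. A small sketch using the counts from the running example (the variable names are chosen here for illustration):

```python
from fractions import Fraction

# Counts taken from the running example in the text.
n_pos = 13           # tweets labeled positive
n_happy = 4          # tweets containing "happy"
n_pos_and_happy = 3  # positive tweets containing "happy"

# Conditioning restricts the denominator to the tweets that meet the condition.
p_pos_given_happy = Fraction(n_pos_and_happy, n_happy)  # P(A|B)
p_happy_given_pos = Fraction(n_pos_and_happy, n_pos)    # P(B|A)

print(f"P(Positive|happy) = {float(p_pos_given_happy):.3f}")
print(f"P(happy|Positive) = {float(p_happy_given_pos):.3f}")
```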

With all of this discussion of the probability of meeting certain conditions, we are talking about conditional probabilities. A **conditional probability** can be interpreted as the probability of an outcome `B` knowing that event `A` has already happened. Equivalently, given that we are looking at an element from set `A`, it is the probability that the element also belongs to set `B`. 

There's another way of looking at this with the Venn diagram. Using the previous example, the probability of a tweet being positive given that it has the word "happy" is equal to the probability of the intersection between the tweets that are positive and the tweets that have the word "happy", divided by the probability of a tweet from the corpus having the word "happy". 

$$
P(\text{Positive}|\text{happy}) = \frac{P(\text{Positive} \cap \text{happy})}{P(\text{happy})}
$$

We can also write a similar equation by simply swapping the position of the two conditions. This gives the conditional probability of a tweet containing the word "happy" given that it is a positive tweet. 

$$
P(\text{happy}|\text{Positive}) = \frac{P(\text{Positive} \cap \text{happy})}{P(\text{Positive})}
$$

Armed with both of these equations, we can derive Bayes' rule. To combine these equations, note that the intersection represents the same quantity no matter which way it is written. With a little algebraic manipulation, we are able to arrive at this equation. 

$$
P(\text{Positive}|\text{happy}) = P(\text{happy}|\text{Positive}) \times \frac{P(\text{Positive})}{P(\text{happy})}
$$

This is now an expression of Bayes' rule in the context of the previous sentiment analysis problem. More generally, Bayes' rule states that the probability of `X` given `Y` is equal to the probability of `Y` given `X` times the ratio of the probability of `X` over the probability of `Y`. 

$$
P(X|Y) = P(Y|X) \frac{P(X)}{P(Y)}
$$

This is the basic formulation of Bayes' rule.
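As a quick sanity check, the rule can be applied to the numbers from the tweet example. The helper name `bayes` is introduced here purely for illustration.

```python
from fractions import Fraction

def bayes(p_y_given_x, p_x, p_y):
    """Return P(X|Y) = P(Y|X) * P(X) / P(Y)."""
    return p_y_given_x * p_x / p_y

p_pos = Fraction(13, 20)            # P(Positive)
p_happy = Fraction(4, 20)           # P(happy)
p_happy_given_pos = Fraction(3, 13) # P(happy|Positive)

p_pos_given_happy = bayes(p_happy_given_pos, p_pos, p_happy)
print(p_pos_given_happy)  # 3/4
```

The result agrees with the value of 0.75 obtained earlier by counting directly inside the "happy" circle.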

## Naïve Bayes Introduction

Now, we learn how to solve the problem of sentiment classification using a method called **Naive Bayes**. It's a very good quick-and-dirty baseline for many text classification tasks. Naive Bayes is an example of supervised machine learning, and shares many similarities with the logistic regression method. It's called **naive** because it assumes that the features used for classification are all independent, which in reality is rarely the case. However, it still works nicely as a simple method for sentiment analysis. 

As before, we begin with two corpora, one for the positive tweets and one for the negative tweets. 

**Positive class**
```
I am happy because I am learning NLP
I am happy, not sad
```

**Negative class**
```
I am sad, I am not learning NLP
I am sad, not happy
```


First, we need to extract the vocabulary containing all the different words that appear in the corpus along with their counts. We get the word counts for each occurrence of a word in the positive corpus, then do the same for the negative corpus. Then, we get a total count of all the words in the positive corpus and do the same again for the negative corpus. For positive tweets, there's a total of 13 words, and for negative tweets, a total of 13 words. 

| Vocabulary | PosFreq(1) | NegFreq(0) |
| :--------- | :--------: | :--------: |
| I          |     3      |     3      |
| am         |     3      |     3      |
| happy      |     2      |     1      |
| because    |     1      |     0      |
| learning   |     1      |     1      |
| NLP        |     1      |     1      |
| sad        |     1      |     2      |
| not        |     1      |     2      | 
| **TOTAL**  |  **13**    |  **13**    |

In [1]:
import utils  # helper module that provides create_vocabulary and show_vocabulary
corpus = ['I am happy because I am learning NLP', 
          'I am happy not sad', 
          'I am sad I am not learning NLP', 
          'I am sad not happy']
ys = [1, 1, 0, 0]
vocabulary, V, total_pos, total_neg = utils.create_vocabulary(corpus, ys, sumvals=True)
utils.show_vocabulary(vocabulary)

0,1,2
word,freqPos,freqNeg
I,3,3
NLP,1,1
am,3,3
because,1,0
happy,2,1
learning,1,1
not,1,2
sad,1,2
TOTAL,13,13
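
Since the helper module used above is not shown, here is a self-contained sketch that produces the same frequency table with `collections.Counter`. The structure (one counter per class) is an assumption made for this illustration and is not necessarily how the helper is implemented.

```python
from collections import Counter

corpus = [
    "I am happy because I am learning NLP",
    "I am happy not sad",
    "I am sad I am not learning NLP",
    "I am sad not happy",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# One Counter per class, keyed by word.
freqs = {0: Counter(), 1: Counter()}
for tweet, label in zip(corpus, labels):
    freqs[label].update(tweet.split())

# The vocabulary is the union of the words seen in either class.
vocab = sorted(set(freqs[0]) | set(freqs[1]))
for word in vocab:
    # A Counter returns 0 for words it has never seen.
    print(f"{word:10s} {freqs[1][word]:3d} {freqs[0][word]:3d}")

total_pos = sum(freqs[1].values())
total_neg = sum(freqs[0].values())
print(f"{'TOTAL':10s} {total_pos:3d} {total_neg:3d}")
```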


This is the first new step for Naive Bayes, and it's very important because it allows us to compute the conditional probability of each word given the class, $P(w_i|\text{class})$. Divide the frequency of each word in a class by the total number of words in that class. So for the word "I", the conditional probability for the positive class is 3 over 13. We store that value, 0.23, in a new table. For the word "I" in the negative class, we also get 3 over 13, and store the corresponding value, 0.23. Apply the same procedure to each word in the vocabulary to complete the table of conditional probabilities. 

| Vocabulary | PosFreq(1) | NegFreq(0) |
| :--------- | :--------: | :--------: |
| I          |    0.23    |    0.23    |
| am         |    0.23    |    0.23    |
| happy      |    0.15    |    0.08    |
| because    |    0.08    |    0.00    |
| learning   |    0.08    |    0.08    |
| NLP        |    0.08    |    0.08    |
| sad        |    0.08    |    0.15    |
| not        |    0.08    |    0.15    | 
| **TOTAL**  |   **1**    |   **1**    |

In [2]:
def convert_probability(vocabulary, total_pos, total_neg):
    for word, cl in vocabulary:
        freq = vocabulary[(word, cl)]
        if cl:
            vocabulary[(word, cl)] = freq/float(total_pos)
        else:
            vocabulary[(word, cl)] = freq/float(total_neg)

convert_probability(vocabulary, total_pos, total_neg)
utils.show_vocabulary(vocabulary)

0,1,2
word,freqPos,freqNeg
I,0.23076923076923078,0.23076923076923078
NLP,0.07692307692307693,0.07692307692307693
am,0.23076923076923078,0.23076923076923078
because,0.07692307692307693,0
happy,0.15384615384615385,0.07692307692307693
learning,0.07692307692307693,0.07692307692307693
not,0.07692307692307693,0.15384615384615385
sad,0.07692307692307693,0.15384615384615385
TOTAL,1.0,1.0


One key property of this table is that summing over all the probabilities for each class, we get one. First, note how many words have a nearly identical conditional probability, like "I", "am", "learning", and "NLP". The interesting thing here is words that are equally probable do not add anything to the sentiment. In contrast to these neutral words, look at some of these other words like "happy", "sad", and "not". They have a significant difference between probabilities. These are the power words tending to express one sentiment or the other. These words carry a lot of weight in determining the tweet sentiments. 

Now let's take a look at "because". It only appears in the positive corpus, so its conditional probability for the negative class is zero. When this happens, we have no way of comparing between the two corpora (the ratio would require dividing by zero), which becomes a problem for the calculations. To avoid this, we smooth the probability function, which gives the table of smoothed probabilities shown below. Then, when we get a new tweet from a friend that says, "I am happy today; I am learning," we can use the table of probabilities to predict the sentiment of the whole tweet using the *Naive Bayes inference condition rule* for binary classification. 

$$
\prod\limits_{i=1}^{m} \frac{P(w_i|pos)}{P(w_i|neg)}
$$

This expression says that we take the product, across all of the words in the tweet, of the probability of each word in the positive class divided by its probability in the negative class. 

| Vocabulary | PosFreq(1) | NegFreq(0) |
| :--------- | :--------: | :--------: |
| I          |    0.19    |    0.19    |
| am         |    0.19    |    0.19    |
| happy      |    0.14    |    0.10    |
| because    |    0.10    |    0.05    |
| learning   |    0.10    |    0.10    |
| NLP        |    0.10    |    0.10    |
| sad        |    0.10    |    0.15    |
| not        |    0.10    |    0.15    | 
| **TOTAL**  |   **1**    |   **1**    |


Let's calculate this product for the tweet below. 

> I am happy today; I am learning.

For each word, select its probabilities from the table. So for "I", we get a positive probability of 0.19 and a negative probability of 0.19, so the ratio that goes into the product is just 0.19 over 0.19. For "am", we also get 0.19 over 0.19. For "happy", we get 0.14 over 0.10. For "today", we don't find the word in the table, meaning it is not in the vocabulary, so we don't include any term for it in the score. For the second occurrence of "I", we have 0.19 over 0.19. For the second occurrence of "am", we have 0.19 over 0.19, and "learning" gets 0.10 over 0.10. 

$$
\prod\limits_{i=1}^{m} \frac{P(w_i|pos)}{P(w_i|neg)} = \color{red}{\frac{0.19}{0.19}} * \color{red}{\frac{0.19}{0.19}} * \frac{0.14}{0.10} * \color{red}{\frac{0.19}{0.19}} * \color{red}{\frac{0.19}{0.19}} * \color{red}{\frac{0.10}{0.10}} = \frac{0.14}{0.10} = 1.4
$$

Note that all the neutral words in the tweet, like "I" and "am", cancel out (marked in red in the expression above). What we end up with is 0.14 over 0.10, which is equal to 1.4. (With the unrounded probabilities the ratio is exactly 1.5, which is the score the code below prints.) This value is higher than one, which means that overall, the words in the tweet are more likely to correspond to a positive sentiment, so we conclude that the tweet is positive. 

So far, we have created a table to store the conditional probabilities of words in the vocabulary and applied the Naive Bayes inference condition rule for binary classification of a tweet.

In [15]:
def compute_product(vocabulary, tweet):
    prod = 1.0
    for word in tweet.split():
        if (word, 0) in vocabulary:
            freq_neg = vocabulary[(word, 0)]
            freq_pos = vocabulary[(word, 1)]
            div = freq_pos / float(freq_neg)
            prod *= div
    return prod

tweet = 'I am happy today I am learning'
score = compute_product(vocabulary, tweet)
print('Score: {}'.format(score))
if score > 1:
    print('Positive tweet')
else:
    print('Negative tweet')

Score: 1.5
Positive tweet


## Laplacian Smoothing

Let's now dive into **Laplacian smoothing**, a technique we can use to avoid probabilities being zero. The expression used to calculate the conditional probability of a word given the class is the frequency of word $i$ in that class, divided by the number of words in the class, $N_{\text{class}}$. 

$$
P(w_i|\text{class}) = \frac{\text{freq}(w_i, \text{class})}{N_{\text{class}}} \ \ \ \ \ \text{where class}\ \in\ \text{\{Positive, Negative\}}
$$

Smoothing the probability function means that we use a slightly different formula from the original. Note that we have added a one in the numerator. This little transformation avoids the probability being zero. However, it adds a new term to all the frequencies that is not correctly normalized by $N_{\text{class}}$. To account for this, we add a new term in the denominator $V$ that represents the number of unique words in the whole vocabulary. Now, all the probabilities in each column will sum to one. This process is called **Laplacian smoothing**. 

$$
P(w_i|\text{class}) = \frac{\text{freq}(w_i, \text{class}) + 1}{N_{\text{class}} + V}
$$

Going back to the previous example, let's use the formula on it. The first thing we need to calculate is the number of unique words in the vocabulary. In this example, we have eight unique words. 

| Vocabulary | PosFreq(1) | NegFreq(0) |
| :--------- | :--------: | :--------: |
| I          |     3      |     3      |
| am         |     3      |     3      |
| happy      |     2      |     1      |
| because    |     1      |     0      |
| learning   |     1      |     1      |
| NLP        |     1      |     1      |
| sad        |     1      |     2      |
| not        |     1      |     2      | 
| **TOTAL**  |  **13**    |  **13**    |

So now let's calculate the probability for each word in the positive class. For the word "I" in the positive class, we get 3 plus 1 divided by 13 plus 8, which is 0.19. 

$$
P(I|pos) = \frac{\text{freq(I, Positive)}+1}{N_{pos} + V} = \frac{3+1}{13+8} = 0.19
$$

For the negative class, we have 3 plus 1 divided by 13 plus 8, which is also 0.19, and so on for the rest of the table. 

$$
P(I|neg) = \frac{\text{freq(I, Negative)}+1}{N_{neg} + V} = \frac{3+1}{13+8} = 0.19
$$

The numbers shown here have been rounded, but using this method the sum of probabilities in the table will still be one. 

| Vocabulary | PosFreq(1) | NegFreq(0) |
| :--------- | :--------: | :--------: |
| I          |    0.19    |    0.19    |
| am         |    0.19    |    0.19    |
| happy      |    0.14    |    0.10    |
| because    |    0.10    |    0.05    |
| learning   |    0.10    |    0.10    |
| NLP        |    0.10    |    0.10    |
| sad        |    0.10    |    0.15    |
| not        |    0.10    |    0.15    | 
| **TOTAL**  |   **1**    |   **1**    |

Note that the word "because" no longer has a probability of zero.
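
The smoothing formula can also be checked with a short self-contained sketch. The function name `smoothed_prob` and the use of exact fractions are choices made for this illustration; they show that each column sums to exactly one before rounding.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "I am happy because I am learning NLP",
    "I am happy not sad",
    "I am sad I am not learning NLP",
    "I am sad not happy",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

freqs = {0: Counter(), 1: Counter()}
for tweet, label in zip(corpus, labels):
    freqs[label].update(tweet.split())

vocab = set(freqs[0]) | set(freqs[1])
V = len(vocab)  # number of unique words in the whole vocabulary

def smoothed_prob(word, label):
    """P(word | class) with add-one (Laplacian) smoothing."""
    n_class = sum(freqs[label].values())
    return Fraction(freqs[label][word] + 1, n_class + V)

# "because" never occurs in the negative class, yet its probability is non-zero.
print(smoothed_prob("because", 0))

# Each column of the smoothed table sums to exactly one.
col_sums = {label: sum(smoothed_prob(w, label) for w in vocab) for label in (0, 1)}
print(col_sums)
```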

In [3]:
import utils  # helper module that provides create_vocabulary and show_vocabulary
corpus = ['I am happy because I am learning NLP', 
          'I am happy not sad', 
          'I am sad I am not learning NLP', 
          'I am sad not happy']
ys = [1, 1, 0, 0]
vocabulary, unique, total_pos, total_neg = utils.create_vocabulary(corpus, ys, sumvals=True)
utils.show_vocabulary(vocabulary)

0,1,2
word,freqPos,freqNeg
I,3,3
NLP,1,1
am,3,3
because,1,0
happy,2,1
learning,1,1
not,1,2
sad,1,2
TOTAL,13,13


In [4]:
def laplacian_smoothing(vocabulary, unique, total_pos, total_neg):
    classes = [0, 1]
    totals = [total_neg, total_pos]
    V = len(unique.keys())
    for word in unique.keys():
        for cl in classes:
            total = totals[cl]
            if (word, cl) in vocabulary:
                freq = vocabulary[(word, cl)]
            else:
                freq = 0
            vocabulary[(word, cl)] = (freq + 1.)/(total + V)

vocabulary, unique, total_pos, total_neg = utils.create_vocabulary(corpus, ys, sumvals=True)
laplacian_smoothing(vocabulary, unique, total_pos, total_neg)
utils.show_vocabulary(vocabulary)

0,1,2
word,freqPos,freqNeg
I,0.19047619047619047,0.19047619047619047
NLP,0.09523809523809523,0.09523809523809523
am,0.19047619047619047,0.19047619047619047
because,0.09523809523809523,0.047619047619047616
happy,0.14285714285714285,0.09523809523809523
learning,0.09523809523809523,0.09523809523809523
not,0.09523809523809523,0.14285714285714285
sad,0.09523809523809523,0.14285714285714285
TOTAL,0.9999999999999999,0.9999999999999998


## Log Likelihood, Part 1

Now, we introduce **log likelihoods** that are just logarithms of the probabilities we are calculating. They are way more convenient to work with and they appear throughout deep-learning and NLP. Okay, so let's go back to the table we saw previously that contains the conditional probabilities of each word, for positive or negative sentiment. 

| Vocabulary | PosFreq(1) | NegFreq(0) |
| :--------- | :--------: | :--------: |
| I          |    0.19    |    0.19    |
| am         |    0.19    |    0.19    |
| happy      |    0.14    |    0.10    |
| because    |    0.10    |    0.05    |
| learning   |    0.10    |    0.10    |
| NLP        |    0.10    |    0.10    |
| sad        |    0.10    |    0.15    |
| not        |    0.10    |    0.15    | 
| **TOTAL**  |   **1**    |   **1**    |

Words can have many shades of emotional meaning. But for the purpose of sentiment classification, they are simplified into three categories: *neutral*, *positive*, and *negative*. All can be identified by using their conditional probabilities. These categories can be numerically estimated just by dividing the corresponding conditional probabilities of this table. Now, let's see how this ratio looks for the words in the vocabulary. 

$$
\text{ratio}(w_i) = \frac{P(w_i|pos)}{P(w_i|neg)} \approx \frac{\text{freq}(w_i, 1) + 1}{\text{freq}(w_i, 0) + 1}
$$

So the ratio for the word "I" is 0.19 divided by 0.19, or one. The ratio for the word "am" is again one. The ratio for the word "happy" is 0.14 divided by 0.10, which is 1.4 with these rounded values and exactly 1.5 with the unrounded probabilities (3/21 over 2/21). For "because", the ratio is 0.10 divided by 0.05, or 2.0. For "learning" and "NLP", the ratio is one. For "sad" and "not", the ratio is 0.10 divided by 0.15, or about 0.67. Again, neutral words have a ratio equal to one. Positive words have a ratio larger than one; the larger the ratio, the more positive the word. Negative words have a ratio smaller than one; the smaller the ratio, the more negative the word. The ratios in the table below are computed from the unrounded probabilities.

| Vocabulary | PosFreq(1) | NegFreq(0) | Ratio |
| :--------- | :--------: | :--------: | :---: |
| I          |    0.19    |    0.19    |  1.0  |
| am         |    0.19    |    0.19    |  1.0  |
| happy      |    0.14    |    0.10    |  1.5  |
| because    |    0.10    |    0.05    |  2.0  |
| learning   |    0.10    |    0.10    |  1.0  |
| NLP        |    0.10    |    0.10    |  1.0  |
| sad        |    0.10    |    0.15    | 0.67  |
| not        |    0.10    |    0.15    | 0.67  |
| **TOTAL**  |   **1**    |   **1**    |  ---  |


In [5]:
def compute_ratio(vocabulary, unique):
    ratio = {}
    for word in unique:
        pb_neg = vocabulary[(word, 0)]
        pb_pos = vocabulary[(word, 1)]
        ratio[word] = pb_pos / pb_neg
    return ratio

ratios = compute_ratio(vocabulary, unique)
utils.show_vocabulary(vocabulary, new_column=ratios, clname='Ratio')

0,1,2,3
word,freqPos,freqNeg,Ratio
I,0.19047619047619047,0.19047619047619047,1.0
NLP,0.09523809523809523,0.09523809523809523,1.0
am,0.19047619047619047,0.19047619047619047,1.0
because,0.09523809523809523,0.047619047619047616,2.0
happy,0.14285714285714285,0.09523809523809523,1.5
learning,0.09523809523809523,0.09523809523809523,1.0
not,0.09523809523809523,0.14285714285714285,0.6666666666666666
sad,0.09523809523809523,0.14285714285714285,0.6666666666666666
TOTAL,0.9999999999999999,0.9999999999999998,8.833333333333332


These ratios are essential in Naive Bayes for binary classification. I'll illustrate why using an example we have seen before. Recall that earlier we categorized a tweet as positive if the product of the ratios of every word that appears in the tweet is bigger than one, and as negative if it is less than one. This product is called the **likelihood** (marked in blue in the equation below).

$$
\color{red}{\frac{P(pos)}{P(neg)}} \color{blue}{\prod\limits_{i=1}^{m} \frac{P(w_i|pos)}{P(w_i|neg)}} > 1
$$

If we take the ratio between the number of positive and negative tweets, we have what's called the **prior ratio** (marked in red in the equation above). I haven't mentioned it until now because in this small example, we had exactly the same number of positive and negative tweets, making the ratio one. 

With the addition of the prior ratio, we now have the full **Naive Bayes formula** for binary classification, a simple, fast, and powerful method that we can use to establish a baseline quickly.

Now is a good time to mention another important consideration for the implementation of Naive Bayes. Computing sentiment probabilities requires multiplying many numbers with values between zero and one. Carrying out such multiplications on a computer runs the risk of numerical underflow, where the result is so small that it can't be represented in floating point. 

Luckily, there's a mathematical trick to solve this. It involves using a property of logarithms. 

$$
\log(a * b) = \log(a) + \log(b)
$$
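
The underflow risk, and how the log identity sidesteps it, can be seen in a small experiment. The probabilities below are hypothetical values chosen only to force underflow; they do not come from the corpus in the text.

```python
import math

# 400 hypothetical word probabilities of 0.01 each.
probs = [0.01] * 400

# Multiplying them directly: the true value (1e-800) is far below the
# smallest positive double, so the running product collapses to 0.0.
product = 1.0
for p in probs:
    product *= p
print(product)

# Summing the logarithms instead gives an ordinary negative number.
log_sum = sum(math.log(p) for p in probs)
print(log_sum)
```

Once the product has collapsed to zero, every tweet would get the same score, whereas the sum of logs still distinguishes them.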

Recall that the formula we are using to calculate a score for Naive Bayes is the prior ratio multiplied by the likelihood. 

$$
\frac{P(pos)}{P(neg)} \prod\limits_{i=1}^{m} \frac{P(w_i|pos)}{P(w_i|neg)} > 1
$$

The trick is to use a log of the score instead of the raw score.

$$
\log \left (\frac{P(pos)}{P(neg)} \prod\limits_{i=1}^{m} \frac{P(w_i|pos)}{P(w_i|neg)} \right ) > \log(1) = 0
$$

This allows us to write the previous expression as the sum of the *log prior* and the *log likelihood*, where the log likelihood is the sum of the logarithms of the conditional probability ratios of all the words in the tweet. 

$$
\log \left (\frac{P(pos)}{P(neg)} \prod\limits_{i=1}^{m} \frac{P(w_i|pos)}{P(w_i|neg)} \right ) \ \ \ \rightarrow \ \ \ \log \frac{P(pos)}{P(neg)} + \sum\limits_{i=1}^{m} \log \frac{P(w_i|pos)}{P(w_i|neg)}
$$

Let's use this method to classify the tweet "I'm happy because I'm learning". Remember how we used the Naive Bayes inference condition earlier to get the sentiment score for a tweet. Now we do something very similar to get the log of the score. Consider the table below:

| Vocabulary | PosFreq(1) | NegFreq(0) |
| :--------- | :--------: | :--------: |
| I          |    0.05    |    0.05    |
| am         |    0.04    |    0.04    |
| happy      |    0.09    |    0.01    |
| because    |    0.01    |    0.01    |
| learning   |    0.03    |    0.01    |
| NLP        |    0.02    |    0.02    |
| sad        |    0.01    |    0.09    |
| not        |    0.02    |    0.03    | 

To calculate the log of the score, we need a quantity called **Lambda** ($\lambda$) for each word. This is the log of the ratio of the probability of the word in the positive class to its probability in the negative class. 

$$
\lambda(w) = \log \frac{P(w|pos)}{P(w|neg)}
$$

Now let's calculate Lambda for every word in our vocabulary. So for the word "I", we get the logarithm of 0.05 divided by 0.05. Or the logarithm of one, which is equal to zero. 

$$
\lambda(\text{I}) = \log \frac{0.05}{0.05} = \log(1) = 0
$$

Remember, a tweet is labeled positive if the product is larger than one, which corresponds to a log score larger than zero. By this logic, "I", with a Lambda of zero, is neutral. For "am", we take the log of 0.04 over 0.04, which again is equal to zero. 

$$
\lambda(\text{am}) = \log \frac{0.04}{0.04} = \log(1) = 0
$$

For "happy", we get a Lambda of 2.2, which is greater than zero, indicating a positive sentiment. 

$$
\lambda(\text{happy}) = \log \frac{0.09}{0.01} = 2.2
$$

From here on, we can calculate the log score of an entire tweet just by summing the Lambdas of its words. 

| Vocabulary | PosFreq(1) | NegFreq(0) | $\lambda$ |
| :--------- | :--------: | :--------: | :-------: | 
| I          |    0.05    |    0.05    |    0.0    |
| am         |    0.04    |    0.04    |    0.0    |
| happy      |    0.09    |    0.01    |    2.2    |
| because    |    0.01    |    0.01    |    0.0    |
| learning   |    0.03    |    0.01    |    1.1    |
| NLP        |    0.02    |    0.02    |    0.0    |
| sad        |    0.01    |    0.09    |   -2.2    |
| not        |    0.02    |    0.03    |   -0.4    | 
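
The $\lambda$ column can be reproduced directly from the two probability columns. This sketch copies the table values into a dictionary (the name `probs` is an assumption of the example) and uses the natural logarithm, which is what matches the values shown.

```python
import math

# Conditional probabilities copied from the table: word -> (P(w|pos), P(w|neg)).
probs = {
    "I": (0.05, 0.05), "am": (0.04, 0.04), "happy": (0.09, 0.01),
    "because": (0.01, 0.01), "learning": (0.03, 0.01), "NLP": (0.02, 0.02),
    "sad": (0.01, 0.09), "not": (0.02, 0.03),
}

# Lambda is the natural log of the ratio of the two conditional probabilities.
lambdas = {w: math.log(p_pos / p_neg) for w, (p_pos, p_neg) in probs.items()}

for word, lam in lambdas.items():
    print(f"{word:10s} {lam:+.1f}")
```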

Let's stop here and take a quick look back at what we did so far. Words are often emotionally ambiguous but we can simplify them into three categories and then measure exactly where they fall within those categories for binary classification. We do so by dividing the conditional probabilities of the words in each category. This ratio can be expressed as a logarithm as well, called Lambda. We can use that to reduce the risk of numerical underflow.

$$
\text{Word Sentiment} \left\{\begin{matrix}
\text{ratio}(w) = \frac{P(w|pos)}{P(w|neg)}\\ 
 \lambda(w) = \log \frac{P(w|pos)}{P(w|neg)}
\end{matrix}\right.
$$

## Log Likelihood, Part 2

We have done most of the work to arrive at the log-likelihood. Now, we can calculate the log-likelihood of the tweets as the sum of the Lambdas from each word in the tweet. Consider the tweet and the table below:

> "I'm happy because I'm learning"

| Vocabulary | PosFreq(1) | NegFreq(0) | $\lambda$ |
| :--------- | :--------: | :--------: | :-------: | 
| I          |    0.05    |    0.05    |    0.0    |
| am         |    0.04    |    0.04    |    0.0    |
| happy      |    0.09    |    0.01    |    2.2    |
| because    |    0.01    |    0.01    |    0.0    |
| learning   |    0.03    |    0.01    |    1.1    |
| NLP        |    0.02    |    0.02    |    0.0    |
| sad        |    0.01    |    0.09    |   -2.2    |
| not        |    0.02    |    0.03    |   -0.4    | 

$$
\sum\limits_{i=1}^{m} \log \frac{P(w_i|pos)}{P(w_i|neg)} = \sum\limits_{i=1}^{m} \lambda(w_i)
$$

So for word "I", we add 0, for "am" we add 0, for word "happy" we add 2.2, for words "because", "I" and "am", we add 0 and for word "learning", we add 1.1. This sum is 3.3, and this value is higher than 0. 

$$
\text{log-likelihood} = 0 + 0 + 2.2 + 0 + 0 + 0 + 1.1 = 3.3
$$

Remember how previously we saw that the tweet was positive if the product was bigger than one. 

$$
\prod\limits_{i=1}^{m} \frac{P(w_i|pos)}{P(w_i|neg)} > 1\hspace{3cm} \text{(Negative)}\ \ \ 0\ \ \ \leftarrow\ \ \ 1\ \ \ \rightarrow\ \ \ \infty\ \ \ \text{(Positive)}
$$

With the log of 1 equal to 0, the positive values indicate that the tweet is positive. A value less than 0 would indicate that the tweet is negative. 

$$
\sum\limits_{i=1}^{m} \log \frac{P(w_i|pos)}{P(w_i|neg)} > 0\hspace{3cm} \text{(Negative)}\ \ \ -\infty\ \ \ \leftarrow\ \ \ 0\ \ \ \rightarrow\ \ \ \infty\ \ \ \text{(Positive)}
$$

The log-likelihood for this tweet is 3.3. Since 3.3 is bigger than 0, the tweet is positive. Notice that this score is based entirely on the words "happy" and "learning", both of which carry a positive sentiment. All the other words are neutral and do not contribute to the score. 

Summarizing, we can predict the sentiment of a tweet by summing the Lambdas of each word that appears in the tweet. This score is called the log-likelihood. 

$$
\log \prod\limits_{i=1}^{m} \text{ratio}(w_i) = \sum\limits_{i=1}^{m} \lambda(w_i) > 0
$$

For the log-likelihood the decision threshold is 0 instead of 1. Positive tweets will have a positive log-likelihood above 0, and negative tweets will have a negative log-likelihood below 0. 
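
The decision rule can be sketched as follows, using the rounded Lambda values from the table. The function names are hypothetical, out-of-vocabulary words are assumed to contribute nothing, "I'm" is assumed to have been expanded to "I am", and a score of exactly 0 is arbitrarily treated as negative.

```python
# Lambda values copied from the table above (natural log, rounded).
lambdas = {"I": 0.0, "am": 0.0, "happy": 2.2, "because": 0.0,
           "learning": 1.1, "NLP": 0.0, "sad": -2.2, "not": -0.4}

def log_likelihood(tokens, lambdas):
    """Sum lambda over the tokens; words outside the vocabulary add 0."""
    return sum(lambdas.get(tok, 0.0) for tok in tokens)

def classify(tokens, lambdas, log_prior=0.0):
    """Label a tweet by comparing log prior + log-likelihood with 0."""
    score = log_prior + log_likelihood(tokens, lambdas)
    return "positive" if score > 0 else "negative"

tokens = "I am happy because I am learning".split()
score = log_likelihood(tokens, lambdas)
print(f"log-likelihood = {score:.1f}")
print(classify(tokens, lambdas))
```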

## Training Naïve Bayes

Now we walk through all the steps for building a Naive Bayes model for sentiment analysis using a corpus of tweets that we have already collected. The first step for any supervised machine learning project is to gather the data to train and test the model. 

Positive tweets:<br>
> I am happy because I am learning.
>
> I am happy, not sad @NLP.

Negative tweets:<br>
> I am sad, I am not learning NLP.
>
> I am sad, not happy.

For sentiment analysis of tweets, this step involves getting a corpus of tweets and dividing it into two groups, positive and negative tweets. 

The next step is fundamental to the model's success. The preprocessing step, as described in the previous module, consists of five smaller steps. 

- Lowercase the text
- Remove punctuation, URLs, and handles 
- Remove stop words
- Stemming or reducing words to their common stem
- Tokenizing or splitting the document into single words or tokens
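
The five steps above might be sketched as follows, using only the standard library. Several things here are assumptions made for the illustration: the tiny `STOP_WORDS` set, the regular expressions, and the `stem` parameter, which defaults to the identity function so that a real stemmer (for example NLTK's `PorterStemmer().stem`) can be passed in when one is available. Unlike the toy pipeline later in this section, which strips only the `@` character, this sketch drops a handle such as `@NLP` entirely.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"i", "am", "you", "are", "the", "a"}

def preprocess(tweet, stem=lambda word: word):
    """Lowercase, strip URLs/handles/punctuation, tokenize, drop stop words, stem."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)          # remove handles
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                     # whitespace tokenization
    return [stem(tok) for tok in tokens if tok not in STOP_WORDS]

print(preprocess("I am happy, not sad @NLP. https://example.com"))
```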

In the real world, we might find that gathering and processing the text takes up a big chunk of the project hours. Once we have a clean corpus of processed tweets, we start computing the frequency of each word in each class. This process produces a table like the one shown below. 

| Word    | Pos | Neg |
| :------ | :-: | :-: | 
| happi   |  2  |  1  |
| because |  1  |  0  |
| learn   |  1  |  1  |
| NLP     |  1  |  1  |
| sad     |  1  |  2  |
| not     |  1  |  2  |

In this same step, we also compute the total number of words in each class. From this table of frequencies, we get the conditional probability of each word given a class by using the Laplacian smoothing formula. 

$$
P(w|\text{class}) = \frac{\text{freq}(w, \text{class}) + 1}{N_{\text{class}} + V} \hspace{3cm} \text{where}\ \  V = 6
$$

Note that the number of unique words is $V=6$. We only count the words in the table, not the total number of words in the original corpus. Each class has $N_{\text{class}} = 7$ words in the table, so every denominator is $7 + 6 = 13$. This produces a table of conditional probabilities for each word and each class. This table only contains values greater than 0. 

| Word    | Pos  | Neg  |
| :------ | :--: | :--: | 
| happi   | 0.23 | 0.15 |
| because | 0.15 | 0.08 |
| learn   | 0.15 | 0.15 |
| NLP     | 0.15 | 0.15 |
| sad     | 0.15 | 0.23 |
| not     | 0.15 | 0.23 |

In the next step, we calculate the Lambda score for each word, which is the log of the ratio of the conditional probabilities.

$$
\lambda(w) = \log \frac{P(w|pos)}{P(w|neg)}
$$

Calculating Lambda for each word from the unrounded probabilities, we get the values in column $\lambda$ of the table:

| Word    | Pos  | Neg  | $\lambda$ |
| :------ | :--: | :--: | :-------: |
| happi   | 0.23 | 0.15 |   0.41    |
| because | 0.15 | 0.08 |   0.69    |
| learn   | 0.15 | 0.15 |   0.00    |
| NLP     | 0.15 | 0.15 |   0.00    |
| sad     | 0.15 | 0.23 |  -0.41    |
| not     | 0.15 | 0.23 |  -0.41    |

The last step is the estimation of the log prior. To do this, we count the number of positive ($D_{pos}$) and negative ($D_{neg}$) tweets. The log prior is then the log of the ratio of the number of positive tweets to the number of negative tweets. 

$$
\text{logprior} = \log \frac{D_{pos}}{D_{neg}}
$$

Here, we work with a balanced dataset, so the log prior is equal to 0. However, for unbalanced datasets, this term becomes important.
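
To see why, consider a hypothetical unbalanced dataset (the counts here are made up for illustration) with $D_{pos} = 80$ and $D_{neg} = 20$. Assuming the natural logarithm:

$$
\text{logprior} = \log \frac{80}{20} = \log 4 \approx 1.39
$$

Every tweet would then start with a score of about 1.39 in favor of the positive class before any word is taken into account.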

In summary, training a Naive Bayes model can be divided into six logical steps. 

1. Get or annotate a dataset with positive and negative tweets.
2. Preprocess the raw text to get a corpus of clean, standardized tokens. 
3. Compute the dictionary frequencies for each word and class.
4. Compute the conditional probabilities of each word using the Laplacian smoothing formula. 
5. Compute the Lambda score for each word.
6. Estimate the log prior of the model, or how likely it is to see a positive tweet. 

```python
# Pipeline
import re
import numpy as np
import pandas as pd

def preprocessing(corpus):
    """Remove punctuation and stopwords, and split the text into tokenized sentences."""
    corpus = re.sub("[@|,]", "", corpus)
    stopwords = ['I', 'am', 'You', 'are']
    clean_corpus = []
    # sentences end with '.', so the last split element is empty and is dropped
    for sent in corpus.split('.')[:-1]:
        phrase = []
        for word in sent.strip().split():
            if word not in stopwords:
                phrase.append(word)
        clean_corpus.append(phrase[:])
    return clean_corpus

def compute_dictionary(corpus, ys):
    """Count how often each word appears in each class and return a frequency table."""
    vocabulary = {}
    for sent, y in zip(corpus, ys):
        for word in sent:
            if word not in vocabulary:
                vocabulary[word] = {0: 0, 1: 0}
            vocabulary[word][y] += 1
    values = [(w, vocabulary[w][1], vocabulary[w][0]) for w in vocabulary]
    return pd.DataFrame(values, columns=['Word', 'Pos', 'Neg'])

def laplacian_smoothing(table):
    """Apply Laplacian smoothing to the counts of the table (in place)."""
    # table[...].sum() is N_class and table[...].size is the vocabulary size V
    table['Pos'] = (table['Pos'] + 1.)/(table['Pos'].sum() + table['Pos'].size)
    table['Neg'] = (table['Neg'] + 1.)/(table['Neg'].sum() + table['Neg'].size)

def compute_lambda(table):
    """Compute the lambda score for each word (in place)."""
    lambda_score = np.log(table['Pos'] / table['Neg'])
    table['Lambda'] = lambda_score

def log_prior(ys):
    """Compute the log prior, log(D_pos / D_neg), from the labels."""
    dpos = np.count_nonzero(ys)
    dneg = float(len(ys) - dpos)
    return np.log(dpos/dneg)
```

```python
pos_corpus = 'I am happy because I am learning. I am happy, not sad @NLP.' 
neg_corpus = 'I am sad, I am not learning NLP. I am sad not happy.'
ys = [1, 1, 0, 0]

# preprocessing
corpus = preprocessing(pos_corpus)
corpus.extend(preprocessing(neg_corpus))
print('Corpus = [')
for sent in corpus:
    print(' ', sent)
print(']')
```

```
Corpus = [
  ['happy', 'because', 'learning']
  ['happy', 'not', 'sad', 'NLP']
  ['sad', 'not', 'learning', 'NLP']
  ['sad', 'not', 'happy']
]
```


```python
# compute dictionary
table = compute_dictionary(corpus, ys)
table
```

|     | Word     | Pos | Neg |
| :-: | :------- | :-: | :-: |
|  0  | happy    |  2  |  1  |
|  1  | because  |  1  |  0  |
|  2  | learning |  1  |  1  |
|  3  | not      |  1  |  2  |
|  4  | sad      |  1  |  2  |
|  5  | NLP      |  1  |  1  |


```python
# smooth values
laplacian_smoothing(table)
table
```

|     | Word     |   Pos    |   Neg    |
| :-: | :------- | :------: | :------: |
|  0  | happy    | 0.230769 | 0.153846 |
|  1  | because  | 0.153846 | 0.076923 |
|  2  | learning | 0.153846 | 0.153846 |
|  3  | not      | 0.153846 | 0.230769 |
|  4  | sad      | 0.153846 | 0.230769 |
|  5  | NLP      | 0.153846 | 0.153846 |


```python
# compute lambda scores
compute_lambda(table)
table
```

|     | Word     |   Pos    |   Neg    |  Lambda   |
| :-: | :------- | :------: | :------: | :-------: |
|  0  | happy    | 0.230769 | 0.153846 |  0.405465 |
|  1  | because  | 0.153846 | 0.076923 |  0.693147 |
|  2  | learning | 0.153846 | 0.153846 |  0.0      |
|  3  | not      | 0.153846 | 0.230769 | -0.405465 |
|  4  | sad      | 0.153846 | 0.230769 | -0.405465 |
|  5  | NLP      | 0.153846 | 0.153846 |  0.0      |


```python
# compute log priors
lg_prior = log_prior(ys)
print('Log prior: ', lg_prior)
```

```
Log prior:  0.0
```


## Testing Naïve Bayes

Once we have trained the model, the next step is to test it on real examples. We do so by taking the conditional probabilities we derived and using them to predict the sentiments of new, unseen tweets. After that, we evaluate the model's performance. 

To test the model, we need a test set of annotated tweets and the table with the Lambda score for each unique word in the vocabulary, as shown below.

$$
\lambda(w) = \log \frac{P(w|pos)}{P(w|neg)}
$$

| Word    | $\lambda$ |
| :------ | :-------: |
| I       |   -0.01   |
| the     |   -0.01   |
| happi   |    0.63   |
| because |    0.01   |
| pass    |    0.50   |
| NLP     |    0.00   |
| sad     |   -0.75   |
| not     |   -0.75   |

With the estimation of the log prior, we can predict sentiments on a new tweet.

$$
\text{logprior} = \log \frac{D_{pos}}{D_{neg}} = 0
$$

Here we use a new tweet that says:

> I passed the NLP interview. 

We can use the model to predict if this is a positive or negative tweet. Before anything else, we must pre-process this text by removing the punctuation, stemming the words, and tokenizing to produce a vector of words like the one below. 

```
tweet = ['I', 'pass', 'the', 'NLP', 'interview']
```

Now we look up each word from the vector in the table of Lambda scores. If the word is found, such as "I", "pass", "the", and "NLP", we add its Lambda term to the sum. The words that don't show up in the table of Lambdas, like "interview", are considered neutral and don't contribute anything to this score. 

$$
\text{score} = -0.01 + 0.5 - 0.01 + 0
$$

The model can only give a score for words it's seen before. Now, we add the log prior to account for the balance or imbalance of the classes in the dataset. 

$$
\text{score} = -0.01 + 0.5 - 0.01 + 0 + \text{logprior}
$$

So this score sums up to 0.48. 

$$
\text{score} = -0.01 + 0.5 - 0.01 + 0 + 0 = 0.48
$$

Remember, if this score is bigger than zero, then this tweet has a positive sentiment. Thus, in the model and in real life, passing the NLP interview is a very positive thing. 
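
The lookup-and-sum procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the function name `naive_bayes_score` is made up here, and the Lambda values are copied from the table in this section.

```python
# Lambda scores copied from the table above
lambdas = {
    'I': -0.01, 'the': -0.01, 'happi': 0.63, 'because': 0.01,
    'pass': 0.50, 'NLP': 0.00, 'sad': -0.75, 'not': -0.75,
}

def naive_bayes_score(tokens, lambdas, logprior=0.0):
    """Sum the Lambda of every known token and add the log prior.

    Tokens missing from the table contribute 0, i.e. they are treated as neutral.
    """
    return logprior + sum(lambdas.get(token, 0.0) for token in tokens)

tweet = ['I', 'pass', 'the', 'NLP', 'interview']
score = naive_bayes_score(tweet, lambdas, logprior=0.0)
sentiment = 'positive' if score > 0 else 'negative'
print(round(score, 2), sentiment)  # 0.48 positive
```

Using `dict.get` with a default of 0 is one simple way to express that unseen words such as "interview" are neutral.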

It's time to test the performance of the classifier on unseen data. Let's quickly review that process as applied to naive Bayes. We have a validation set that was set aside during training and is composed of a set of raw tweets, $X_{val}$, and their corresponding sentiments, $Y_{val}$. To get the accuracy of the model, we compute the score of each entry in $X_{val}$, then evaluate whether each score is greater than zero. 

$$
\text{score} = \text{predict}(X_{val}, \lambda, \text{logprior})
$$

Comparing the scores against zero produces a vector populated with zeros and ones, indicating if the predicted sentiment is negative or positive respectively for each tweet in the validation set. 

$$
\left [ \begin{matrix}
0.5 \\
-1 \\
1.3 \\
\vdots \\
\text{score}_m
\end{matrix} \right ] > 0 = \left [ \begin{matrix}
0.5 > 0 \\
-1 >0 \\
1.3 > 0\\
\vdots \\
\text{score}_m > 0
\end{matrix} \right ] = \left [ \begin{matrix}
1 \\
0 \\
1 \\
\vdots \\
\text{pred}_m
\end{matrix} \right ]
$$


With the new predictions vector, we can compute the accuracy of the model over the validation set. 

$$
ACC = \frac{1}{m} \sum\limits_{i=1}^{m} (\text{pred}_i == Y_{val_i})
$$

To do this part, we compare the predictions against the true value for each observation from the validation data, $Y_{val}$. If the values are equal, the prediction is correct and we get a value of 1; otherwise we get 0. 

$$
\left [ \begin{matrix}
0 \\
0 \\
1 \\
\vdots \\
\text{pred}_{m}
\end{matrix} \right ] == \left [ \begin{matrix}
0 \\
1 \\
1 \\
\vdots \\
Y_{val_{m}}
\end{matrix} \right ] = \left [ \begin{matrix}
1 \\
0 \\
1 \\
\vdots \\
\text{pred}_{m} == Y_{val_{m}}
\end{matrix} \right ]
$$

Once we have compared the values of every prediction with the true labels of the validation set, we compute the accuracy as the sum of this vector divided by the number of examples in the validation set.
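
As a rough sketch of this evaluation step with NumPy, the snippet below thresholds a vector of scores and compares the result with the true labels. The scores and labels are hypothetical values chosen for illustration, and `accuracy` is a helper name introduced here.

```python
import numpy as np

def accuracy(scores, y_val):
    """Threshold the scores at zero and return the fraction of correct predictions."""
    preds = (np.asarray(scores) > 0).astype(int)   # 1 = positive, 0 = negative
    correct = preds == np.asarray(y_val)           # element-wise comparison
    return correct.mean()                          # sum of matches divided by m

# Hypothetical validation scores and labels
scores = np.array([0.5, -1.0, 1.3, -0.2])
y_val = np.array([1, 0, 1, 1])

acc = accuracy(scores, y_val)
print(acc)  # 0.75
```

Taking the mean of the boolean vector is equivalent to summing the matches and dividing by the number of examples $m$.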

## Applications of Naïve Bayes

Earlier we used the Naive Bayes method to classify tweets, but it can be used for a number of other things, like identifying the author of a text. When we use Naive Bayes to predict the sentiment of a tweet, what we are actually doing is estimating the probability of each class by using the joint probability of the words within each class. 

$$
\begin{aligned}
P(pos|\text{tweet}) &\propto P(pos)P(\text{tweet}|pos) \\
P(neg|\text{tweet}) &\propto P(neg)P(\text{tweet}|neg)
\end{aligned}
$$

The Naive Bayes formula is just the ratio between these two probabilities, the products of the priors and the likelihoods. 

$$
\frac{P(pos|\text{tweet})}{P(neg|\text{tweet})} = \frac{P(pos)}{P(neg)} \prod\limits_{i=1}^{m} \frac{P(w_i|pos)}{P(w_i|neg)}
$$
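
Taking the logarithm of this ratio turns the product into a sum, which is where a score built from a log prior and Lambda terms comes from. A short derivation, assuming the prior ratio $P(pos)/P(neg)$ is estimated by $D_{pos}/D_{neg}$:

$$
\log \frac{P(pos|\text{tweet})}{P(neg|\text{tweet})} = \log \frac{P(pos)}{P(neg)} + \sum\limits_{i=1}^{m} \log \frac{P(w_i|pos)}{P(w_i|neg)} = \text{logprior} + \sum\limits_{i=1}^{m} \lambda(w_i)
$$

A ratio greater than one therefore corresponds to a log score greater than zero.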

We can use this ratio between conditional probabilities for much more than sentiment analysis. For one, we could do author identification. If we had two large corpora, each written by a different author, we could train the model to recognize whether a new document was written by one or the other. For example, if we had some works by Shakespeare and some works by Hemingway, we could calculate the Lambda for each word to estimate how likely that word is to be used by Shakespeare or, alternatively, by Hemingway, and then score a new book in the same way we scored tweets. 

$$
\frac{P(\text{Shakespeare}|\text{book})}{P(\text{Hemingway}|\text{book})}
$$

Another common use is spam filtering. Using information taken from the sender, subject and content, we could decide whether an email is spam or not. 

$$
\frac{P(\text{spam}|\text{email})}{P(\text{nonspam}|\text{email})}
$$

One of the earliest uses of Naive Bayes was filtering between relevant and irrelevant documents in a database. In this case, given the set of keywords in a query, we only need to calculate the likelihood of each document given the query. 

$$
P(\text{document}_k|\text{query}) \propto \prod\limits_{i=1}^{|\text{query}|} P(\text{query}_i|\text{document}_k)
$$

We cannot know beforehand what a relevant or an irrelevant document looks like. So we compute the likelihood for each document in the dataset and then sort the documents based on their likelihoods. We can choose to keep the first $m$ results or the ones that have a likelihood larger than a certain threshold.

$$
\text{Retrieve document if}\ P(\text{document}_k|\text{query}) > \text{threshold}
$$
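
The ranking idea can be sketched as follows. This is only an illustration under stated assumptions: the three documents and the query are made up, the per-document word probabilities are estimated with add-one (Laplacian) smoothing as in the sentiment example, and log-likelihoods are used instead of raw products to avoid numerical underflow.

```python
import math
from collections import Counter

# Hypothetical document collection, already tokenized
docs = {
    'd1': 'naive bayes is a simple probabilistic classifier'.split(),
    'd2': 'the river bank was flooded after the storm'.split(),
    'd3': 'bayes rule relates conditional probabilities'.split(),
}
vocab_size = len({word for tokens in docs.values() for word in tokens})

def query_log_likelihood(query, doc_tokens, vocab_size):
    """Sum of log P(query_i | document), with add-one smoothing."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in query)

query = ['bayes', 'classifier']
ranked = sorted(
    docs,
    key=lambda name: query_log_likelihood(query, docs[name], vocab_size),
    reverse=True,  # most likely document first
)
print(ranked)  # ['d1', 'd3', 'd2']
```

From here, keeping the first $m$ entries of `ranked`, or filtering by a threshold on the log-likelihood, gives the retrieved set.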

Finally, we can also use Naive Bayes for word disambiguation, which is to say, resolving which meaning of a word is intended from its context. Consider that we have only two possible interpretations of a given word within a text. Let's say we do not know if the word "bank" in the text refers to the bank of a river or to a financial institution. To disambiguate the word, we calculate the score of the document given that it refers to each one of the possible meanings. 

$$
\frac{P(\text{river}|\text{text})}{P(\text{money}|\text{text})} 
$$

In this case, if the text refers to the concept of river instead of the concept of money, then the score will be bigger than one. In summary, Bayes' rule and its naive approximation have a wide range of applications in sentiment analysis, author identification, information retrieval and word disambiguation. It's a popular method since it is relatively simple to train, use and interpret. 

## Naïve Bayes Assumptions

Naive Bayes is a very simple model because it does not require setting any custom parameters. This method is referred to as naive because of the assumptions it makes about the data. The first assumption is independence between the predictors or features associated with each class and the second has to do with the validation sets. 

To illustrate what independence between features looks like, let's look at the following sentence. 

> "It is sunny and hot in the Sahara desert."

Naive Bayes assumes that the words in a piece of text are independent of one another, but as we can see, this typically is not the case. The words "sunny" and "hot" often appear together, as they do in this example. Taken together, they might also be related to the thing they are describing, like a "beach" or a "desert". So the words in a sentence are not always independent of one another, but Naive Bayes assumes that they are. This could lead to under- or overestimation of the conditional probabilities of individual words. 
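
A tiny numerical sketch can make the effect concrete. The four-sentence corpus below is hypothetical and chosen so that "sunny" and "hot" always co-occur; the helper `prob` is a name introduced here. The snippet compares the observed joint probability with the product that the independence assumption would use.

```python
# Hypothetical toy corpus: each inner list is one tokenized sentence
sentences = [
    ['sunny', 'hot', 'desert'],
    ['sunny', 'hot', 'beach'],
    ['cold', 'snowy', 'mountain'],
    ['cold', 'rainy', 'city'],
]

def prob(*words):
    """Fraction of sentences that contain all of the given words."""
    hits = sum(all(w in sent for w in words) for sent in sentences)
    return hits / len(sentences)

p_joint = prob('sunny', 'hot')               # observed co-occurrence
p_independent = prob('sunny') * prob('hot')  # what independence would predict
print(p_joint, p_independent)  # 0.5 0.25
```

In this toy corpus, treating the two words as independent underestimates how often they appear together by a factor of two.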

Suppose, for example, that the task was to complete the sentence

**"It's always cold and snowy in ____."**
- [ ] spring
- [ ] summer
- [ ] fall
- [ ] winter

Naive Bayes might assign equal probability to the words "spring", "summer", "fall", and "winter", even though from the context we can see that "winter" should be the most likely candidate.

Another issue with Naive Bayes is that it relies on the distribution of the training data set. A good data set will contain the same proportion of positive and negative tweets as a random sample would. However, most of the available annotated corpora are artificially balanced. In the real tweet stream, positive tweets tend to occur more often than their negative counterparts. One reason for this is that negative tweets might contain content that is banned by the platform or muted by the user, such as inappropriate or offensive vocabulary. Assuming that reality behaves like the training corpus could result in a very optimistic or very pessimistic model. 

Let's do a quick recap of all this new information. The assumption of independence in Naive Bayes is very difficult to guarantee, but despite that, the model works pretty well in certain situations. 

## Error Analysis

No matter what NLP method we use, we will one day find ourselves faced with an error, for example, a misclassified sentence. Let us consider some possible errors in the model prediction that can be caused by the following issues:
- Semantic meaning lost in the pre-processing step.
- How word order affects the meaning of a sentence. 
- Some quirks of languages come naturally to humans but confuse naive Bayes models. 

One of the main considerations when analyzing errors in NLP systems is what the processed version of the text actually looks like. Let's look at this tweet. 

> "My beloved grandmother :("

The sad face punctuation in this case is very important to the sentiment of the tweet because it tells what is happening. But if we are removing punctuation, then the processed tweet will leave behind only

```
processed_tweet = ['belov', 'grandmoth']
```

which looks like a very positive tweet. "My beloved grandmother!" would be a very different sentiment. So remember, always check what the actual text looks like. It is not just about punctuation either. Check out this tweet. 

> "This is not good, because your attitude is not even close to being nice."

If we remove words that are treated as neutral, like "not" and "this", what we are left with is the following. 

```
processed_tweet = ['good', 'attitude', 'close', 'nice']
```

From this set of words, any classifier will infer that this is something very positive. Double check what the processed text looks like to make sure the model will be able to get an accurate read. The input pipeline isn't the only potential source of trouble. Look at these tweets. 

> "I am happy because I did not go."
>
> "I am not happy because I did go."

The first is a purely positive tweet, while the second has a negative sentiment. In this case, the position of "not" is important to the sentiment but gets missed by the naive Bayes classifier. So word order can be as important as spelling. There are many other factors to consider as well. 
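
The reason is easy to verify: both tweets contain exactly the same words, so any score that is a sum of per-word Lambdas must be identical for the two. A small sketch, assuming whitespace tokenization with no stopword removal and a hypothetical two-entry Lambda table:

```python
from collections import Counter

tweet_1 = 'I am happy because I did not go'.split()
tweet_2 = 'I am not happy because I did go'.split()

# Same bag of words: only the order differs
same_bag = Counter(tweet_1) == Counter(tweet_2)

# Hypothetical Lambda values, for illustration only
lambdas = {'happy': 0.63, 'not': -0.75}

def score(tokens):
    """Order-independent sum of the Lambdas of the known tokens."""
    return sum(lambdas.get(token, 0.0) for token in tokens)

print(same_bag, score(tweet_1) == score(tweet_2))  # True True
```

Whatever the Lambda values are, the classifier cannot tell these two tweets apart.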

Another problem of naive Bayes is something called an adversarial attack. The term adversarial attack describes some common language phenomena, like *sarcasm*, *irony*, and *euphemism*. Humans pick these up quickly, but machines are terrible at it. This tweet, 

> "This is a ridiculously powerful movie. The plot was gripping and I cried right through until the ending!"

contains a somewhat positive movie review, but pre-processing might suggest otherwise. If we pre-process this tweet, we will get a list of mostly negative words, 

```
processed_tweet = ['ridicul', 'power', 'movi', 'plot', 'grip', 'cry', 'end']
```

but as we can see, they were actually used to describe a movie that the author enjoyed. If we use naive Bayes on this list of words, it would end up giving a very negative score regardless.

```python
# count the positions where two label vectors disagree
a = np.array([1, 0, 0])
b = np.array([1, 0, 1])
sum(a != b)
```

```
1
```