# Bayes

Naive Bayes
- **assumes the words in a sentence are independent of one another**

Conditional probability
- $P(B|A)$: probability of B given that A happened, OR
- Looking only at the elements of A, the probability that they also belong to B

Bayes' rule
- $P(X|Y) = P(Y|X) \times \dfrac{P(X)}{P(Y)}$
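
As a quick check of the rule, the sketch below computes $P(X|Y)$ both through Bayes' rule and directly from counts. The counts are hypothetical and chosen only for illustration; here $X$ is "tweet is positive" and $Y$ is "tweet contains happy".

```python
from fractions import Fraction

# Hypothetical counts, chosen only for illustration.
n_tweets = 20          # total tweets
n_pos = 13             # tweets labeled positive (event X)
n_happy = 4            # tweets containing "happy" (event Y)
n_pos_and_happy = 3    # positive tweets containing "happy"

p_pos = Fraction(n_pos, n_tweets)                      # P(X)
p_happy = Fraction(n_happy, n_tweets)                  # P(Y)
p_happy_given_pos = Fraction(n_pos_and_happy, n_pos)   # P(Y|X)

# Bayes' rule: P(X|Y) = P(Y|X) * P(X) / P(Y)
p_pos_given_happy = p_happy_given_pos * p_pos / p_happy

# Direct definition of conditional probability, for comparison.
p_direct = Fraction(n_pos_and_happy, n_happy)

print(p_pos_given_happy, p_direct)  # 3/4 3/4
```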

Word-frequency table from before, with a row $N$ added for the total word count per class
<table>
<tr>
    <td>Vocabulary</td>
    <td>PosFreq(1)</td>
    <td>NegFreq(0)</td>
</tr>
<tr>
    <td>I</td>
    <td>3</td>
    <td>3</td>
</tr>
<tr>
    <td>am</td>
    <td>3</td>
    <td>3</td>
</tr>
<tr>
    <td>happy</td>
    <td>2</td>
    <td>1</td>
</tr>
<tr>
    <td>because</td>
    <td>1</td>
    <td>0</td>
</tr>
<tr>
    <td>learning</td>
    <td>1</td>
    <td>1</td>
</tr>
<tr>
    <td>NLP</td>
    <td>1</td>
    <td>1</td>
</tr>
<tr>
    <td>sad</td>
    <td>1</td>
    <td>2</td>
</tr>
<tr>
    <td>not</td>
    <td>1</td>
    <td>2</td>
</tr>
<tr>
    <td>N</td>
    <td>13</td>
    <td>13</td>
</tr>
</table>

Compute the conditional probability of each word given a class by dividing its count by the class total $N$ (the sum of that column), for example
- $P(I|Pos) = \dfrac{3}{13}, P(I|Neg) = \dfrac{3}{13}$

Then, fill those conditional probabilities into a new table

<table>
<tr>
    <td>Vocabulary</td>
    <td>Pos</td>
    <td>Neg</td>
</tr>
<tr>
    <td>I</td>
    <td>0.23</td>
    <td>0.23</td>
</tr>
<tr>
    <td>am</td>
    <td>0.23</td>
    <td>0.23</td>
</tr>
<tr>
    <td>happy</td>
    <td>0.15</td>
    <td>0.08</td>
</tr>
<tr>
    <td>because</td>
    <td>0.08</td>
    <td>0</td>
</tr>
<tr>
    <td>learning</td>
    <td>0.08</td>
    <td>0.08</td>
</tr>
<tr>
    <td>NLP</td>
    <td>0.08</td>
    <td>0.08</td>
</tr>
<tr>
    <td>sad</td>
    <td>0.08</td>
    <td>0.15</td>
</tr>
<tr>
    <td>not</td>
    <td>0.08</td>
    <td>0.15</td>
</tr>
<tr>
    <td>Sum</td>
    <td>1</td>
    <td>1</td>
</tr>
</table>
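
The table can be reproduced from the raw counts. This is a minimal sketch, assuming the frequency table is stored as a plain dictionary (the name `freq_table` and its layout are illustrative, not part of the original notes); each class total is obtained by summing its column.

```python
# Frequency table: word -> (positive count, negative count)
freq_table = {
    "I": (3, 3), "am": (3, 3), "happy": (2, 1), "because": (1, 0),
    "learning": (1, 1), "NLP": (1, 1), "sad": (1, 2), "not": (1, 2),
}

# Class totals: sum of each column
n_pos = sum(pos for pos, _ in freq_table.values())
n_neg = sum(neg for _, neg in freq_table.values())

# P(w|class) = freq(w, class) / N_class
cond_prob = {
    word: (pos / n_pos, neg / n_neg)
    for word, (pos, neg) in freq_table.items()
}

for word, (p_pos, p_neg) in cond_prob.items():
    print(f"{word:<10}{p_pos:.2f}  {p_neg:.2f}")
```

Note that `because` gets probability 0 for the negative class, which is the case smoothing is meant to handle.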

Naive Bayes binary classification rule
- $\displaystyle\prod_{i=1}^{m}\dfrac{P({w_{i}}|POS)}{P({w_{i}}|NEG)}$

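For the example sentence "I am happy today; I am learning", each in-vocabulary word contributes one ratio ("today" is not in the vocabulary and is skipped). The ratios use rounded, smoothed probabilities (see Laplacian smoothing below); a result above $1$ means the sentence is classified as positive.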
- $\dfrac{0.2}{0.2} \times \dfrac{0.2}{0.2} \times \dfrac{0.14}{0.10} \times \dfrac{0.2}{0.2} \times \dfrac{0.2}{0.2} \times \dfrac{0.1}{0.1} = 1.4$
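
The same product can be computed in a few lines. This is a sketch only: the probabilities are the rounded values used in the worked example above, and the regex tokenizer is a hypothetical stand-in for whatever preprocessing is actually used.

```python
import math
import re

# Rounded (P(w|pos), P(w|neg)) pairs as used in the worked example above.
probs = {
    "I": (0.2, 0.2),
    "am": (0.2, 0.2),
    "happy": (0.14, 0.10),
    "learning": (0.1, 0.1),
}

def likelihood_ratio(sentence, probs):
    """Product of P(w|pos) / P(w|neg) over the words found in probs."""
    tokens = re.findall(r"[A-Za-z]+", sentence)  # hypothetical tokenizer
    ratios = [probs[w][0] / probs[w][1] for w in tokens if w in probs]
    return math.prod(ratios)

score = likelihood_ratio("I am happy today; I am learning", probs)
print(round(score, 2))  # 1.4
```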

### Laplacian smoothing
- Avoids zero probabilities for words that never appear in a class (a zero would make the ratio zero or undefined)
- $P(w_{i}|class) = \dfrac{freq(w_{i},class)+1}{N_{class}+V}$
    - $N_{class}$ : total count of all words in the class
    - $V$ : number of unique words in vocabulary
- For example, $P(I|POS) = \dfrac{3+1}{13+8} = 0.19$
- Then, the table becomes

<table>
<tr>
    <td>Vocabulary</td>
    <td>Pos</td>
    <td>Neg</td>
</tr>
<tr>
    <td>I</td>
    <td>0.19</td>
    <td>0.19</td>
</tr>
<tr>
    <td>am</td>
    <td>0.19</td>
    <td>0.19</td>
</tr>
<tr>
    <td>happy</td>
    <td>0.14</td>
    <td>0.10</td>
</tr>
<tr>
    <td>because</td>
    <td>0.10</td>
    <td>0.05</td>
</tr>
<tr>
    <td>learning</td>
    <td>0.10</td>
    <td>0.10</td>
</tr>
<tr>
    <td>NLP</td>
    <td>0.10</td>
    <td>0.10</td>
</tr>
<tr>
    <td>sad</td>
    <td>0.10</td>
    <td>0.14</td>
</tr>
<tr>
    <td>not</td>
    <td>0.10</td>
    <td>0.14</td>
</tr>
<tr>
    <td>Sum</td>
    <td>1</td>
    <td>1</td>
</tr>
</table>
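
A minimal sketch of the smoothing formula applied to the raw counts, assuming the counts live in a plain dictionary (the names `counts` and `smoothed_probs` are illustrative). Class totals are the column sums and $V$ is the number of distinct words.

```python
# Word counts: word -> (positive count, negative count)
counts = {
    "I": (3, 3), "am": (3, 3), "happy": (2, 1), "because": (1, 0),
    "learning": (1, 1), "NLP": (1, 1), "sad": (1, 2), "not": (1, 2),
}

def smoothed_probs(counts):
    """Return word -> (P(w|pos), P(w|neg)) with Laplacian (add-one) smoothing."""
    V = len(counts)                                  # unique words
    n_pos = sum(pos for pos, _ in counts.values())   # words in positive class
    n_neg = sum(neg for _, neg in counts.values())   # words in negative class
    return {
        w: ((pos + 1) / (n_pos + V), (neg + 1) / (n_neg + V))
        for w, (pos, neg) in counts.items()
    }

probs = smoothed_probs(counts)
print(round(probs["I"][0], 2))   # 0.19, i.e. (3 + 1) / (13 + 8)
print(probs["because"][1] > 0)   # True: no zero probabilities remain
```

Because every word gets one extra count and the denominator grows by $V$, each column still sums to 1.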

### Log likelihood

To do inference, we check whether
- $\dfrac{P(pos)}{P(neg)}\displaystyle\prod_{i=1}^{m}\dfrac{P({w_{i}}|pos)}{P({w_{i}}|neg)} \gt 1$

To avoid numerical underflow as $m$ gets larger (a product of many small probabilities approaches $0$), we take the log, which turns the product into a sum
- $\log\left(\dfrac{P(pos)}{P(neg)}\displaystyle\prod_{i=1}^{m}\dfrac{P({w_{i}}|pos)}{P({w_{i}}|neg)}\right) = \log\left(\dfrac{P(pos)}{P(neg)}\right) + \displaystyle\sum_{i=1}^{m}\log\left(\dfrac{P({w_{i}}|pos)}{P({w_{i}}|neg)}\right)$
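
The numerical issue is easy to reproduce: multiplying many probabilities below 1 drives the result under the smallest representable float, while the sum of logs stays finite. The probabilities below are hypothetical and chosen only to make the effect visible.

```python
import math

# Hypothetical per-word probabilities, chosen only to make the effect visible.
word_probs = [1e-5] * 100

product = math.prod(word_probs)                  # too small for a float
log_sum = sum(math.log(p) for p in word_probs)   # stays finite

print(product)   # 0.0
print(log_sum)
```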

Define $\lambda$ for each word $w$ as
- $\lambda(w) = \log\left(\dfrac{P(w|pos)}{P(w|neg)}\right)$

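The log likelihood of a sentence is given by
- $\displaystyle\sum_{i=1}^{m}\lambda(w_{i}) = \displaystyle\sum_{i=1}^{m}\log\left(\dfrac{P(w_{i}|pos)}{P(w_{i}|neg)}\right)$
- The sentence is classified as positive when $\log\left(\dfrac{P(pos)}{P(neg)}\right) + \displaystyle\sum_{i=1}^{m}\lambda(w_{i}) \gt 0$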
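
A small sketch of the log-likelihood score on the toy counts. It assumes Laplacian smoothing, a balanced corpus (so the log prior defaults to 0), and a hypothetical regex tokenizer; the names `lam` and `score` are illustrative.

```python
import math
import re

# Word counts: word -> (positive count, negative count)
counts = {
    "I": (3, 3), "am": (3, 3), "happy": (2, 1), "because": (1, 0),
    "learning": (1, 1), "NLP": (1, 1), "sad": (1, 2), "not": (1, 2),
}
V = len(counts)
n_pos = sum(pos for pos, _ in counts.values())
n_neg = sum(neg for _, neg in counts.values())

def lam(word):
    """lambda(w) = log(P(w|pos) / P(w|neg)), with Laplacian smoothing."""
    pos, neg = counts[word]
    return math.log(((pos + 1) / (n_pos + V)) / ((neg + 1) / (n_neg + V)))

def score(sentence, log_prior=0.0):
    """Log prior plus the sum of lambda(w) over in-vocabulary words."""
    tokens = re.findall(r"[A-Za-z]+", sentence)  # hypothetical tokenizer
    return log_prior + sum(lam(w) for w in tokens if w in counts)

s = score("I am happy today; I am learning")
print(s > 0)  # True: classified as positive
```

Only `happy` contributes here, since the other in-vocabulary words are equally likely in both classes and have $\lambda = 0$.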

import numpy as np

def count_tweets(result, tweets, ys):
    '''
    Input:
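        result: a dictionary that will be used to map each (word, label) pair to its frequency
        tweets: a list of tweets
        ys: a list corresponding to the sentiment of each tweet (either 0 or 1)
    Output:
        result: a dictionary mapping each (word, label) pair to its frequency
    Note:
        process_tweet is a helper defined elsewhere; it is assumed to turn a tweet into a list of words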
    '''

    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            # define the key, which is the word and label tuple
            pair = (word, y)

            # if the key exists in the dictionary, increment the count
            if pair in result:
                result[pair] += 1

            # else, if the key is new, add it to the dictionary and set the count to 1
            else:
                result[pair] = 1
    
    return result

def train_naive_bayes(freqs, train_x, train_y):
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tweets
        train_y: a list of labels corresponding to the tweets (0 or 1)
    Output:
        logprior: the log prior, log(D_pos / D_neg)
        loglikelihood: a dictionary mapping each word to log(P(w|pos) / P(w|neg))
    '''
    loglikelihood = {}
    logprior = 0

    # calculate V, the number of unique words in the vocabulary
    vocab = {word for word, _ in freqs}
    V = len(vocab)

    # calculate N_pos and N_neg
    N_pos = N_neg = 0
    for pair in freqs.keys():
        # if the label is positive (greater than zero)
        if pair[1] > 0:

            # Increment the number of positive words by the count for this (word, label) pair
            N_pos += freqs[pair]

        # else, the label is negative
        else:

            # increment the number of negative words by the count for this (word,label) pair
            N_neg += freqs[pair]

    # Calculate D, the number of documents
    D = len(train_y)
    
    # Calculate D_pos, the number of positive documents
    D_pos = sum(1 for label in train_y if label == 1)

    # Calculate D_neg, the number of negative documents
    D_neg = D - D_pos

    # Calculate logprior
    logprior = np.log(D_pos) - np.log(D_neg)

    # For each word in the vocabulary...
    for word in vocab:
        # get the positive and negative frequency of the word
        # (lookup is a helper defined elsewhere; assumed to return 0 for unseen pairs)
        freq_pos = lookup(freqs, word, 1)
        freq_neg = lookup(freqs, word, 0)

        # calculate the Laplace-smoothed probability of the word given each class
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)
        
    return logprior, loglikelihood

def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
        tweet: a string
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the log prior plus the sum of the log likelihoods of each word in the tweet (words not in the dictionary are skipped)

    '''
    
    # process the tweet to get a list of words
    word_l = process_tweet(tweet)
    
    # start the score at the log prior (p is a log score, not a probability)
    p = logprior
    
    for word in word_l:

        # check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            # add the log likelihood of that word to the score
            p += loglikelihood[word]

    return p

def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    """
    Input:
        test_x: A list of tweets
        test_y: the corresponding labels for the list of tweets
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of tweets classified correctly)/(total # of tweets)
    """

    y_hats = []
    for tweet in test_x:
        # if the prediction is > 0
        if naive_bayes_predict(tweet, logprior, loglikelihood) > 0:
            # the predicted class is 1
            y_hat_i = 1
        else:
            # otherwise the predicted class is 0
            y_hat_i = 0

        # append the predicted class to the list y_hats
        y_hats.append(y_hat_i)

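    # error is the mean absolute difference between predictions and labels
    # (both are flattened to arrays so this also works when test_y is a list or a column vector)
    error = np.mean(np.absolute(np.asarray(y_hats) - np.asarray(test_y).ravel()))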
            
    # Accuracy is 1 minus the error
    accuracy = 1 - error
    
    return accuracy
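
The functions above call `process_tweet` and `lookup`, which are not defined in these notes. To run the code on toy data, something like the following stand-ins could be used. Both are hypothetical simplifications: a real `process_tweet` would typically also remove stop words, handles, and URLs, and apply stemming.

```python
import re

def process_tweet(tweet):
    """Hypothetical stand-in: lowercase the text and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", tweet.lower())

def lookup(freqs, word, label):
    """Hypothetical stand-in: count for the (word, label) pair, 0 if unseen."""
    return freqs.get((word, label), 0)

print(process_tweet("I am happy, not sad!"))  # ['i', 'am', 'happy', 'not', 'sad']
print(lookup({("happy", 1): 2}, "happy", 1))  # 2
```

With these in place, `count_tweets({}, train_x, train_y)` builds the frequency dictionary that `train_naive_bayes` expects.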