# Naive Bayes Classifiers

-----------

_Author: Dhavide Aruliah_

### Assignment Contents

- [Question 1: Computing Discrete Probabilities](#q-club-black)
- [Question 2: Computing Conditional probabilities](#q-cond-p)
- [Question 3: Reasoning about Spam Messages](#q-spam)
- [Question 4: Preparing the SMS Messaging Data](#q-preparing)
- [Question 5: Getting priors](#q-priors)
- [Question 6: Computing likelihoods](#q-likelihoods)
- [Question 7: Computing smoothed likelihoods](#q-smoothed)
- [Question 8: Predicting spam](#q-predict)

#### EXPECTED TIME 2.0 HRS  

## Activities in this Assignment

This assignment provides an overview of *Naive Bayes classifiers* as an approach to classification problems in supervised learning. Despite the "naive" assumptions involved, these classifiers work very well in practice, particularly for text analysis tasks such as spam filtering or document classification. Accordingly, this assignment is built around a very simple model of spam filtering to give a sense of how naive Bayes classification works.

The primary goals are:
+ to review notions of probability as related to Bayes' theorem (notably independent events & conditional probability).
+ to practice the application of Bayes' theorem for probabilistic reasoning.
+ to develop a (highly simplified) model of text analysis for spam classification using the naive Bayes classification framework.

---

## Reminders from Discrete Probability

For finite sets, probabilities of distinct events can be modeled using *sets*.

As an example, consider a standard deck of playing cards. There are 52 cards in a deck with four suits (clubs (♣), diamonds (♢), hearts (♡), and spades (♠)) each with one of thirteen ranks (two through ten, jack, queen, king, and ace). We can represent a deck in Python by a collection of tuples.

In [29]:
suits = ['♣', '♢', '♡', '♠']
print(suits)

ranks = ['2', '3', '4', '5', '6', '7', '8' ,'9', '10', 'J', 'Q', 'K', 'A']
deck = [ (r,s) for r in ranks for s in suits ]
deck[:5]

['♣', '♢', '♡', '♠']


[('2', '♣'), ('2', '♢'), ('2', '♡'), ('2', '♠'), ('3', '♣')]

In [30]:
# Put the deck into random order.
import random
random.shuffle(deck)
print(deck[:5])

[('6', '♠'), ('K', '♢'), ('9', '♣'), ('3', '♢'), ('7', '♣')]


An *event* is any subset of a set of possible outcomes. For instance, let $E_{\text{black}}$ be the event of drawing a black card from the deck and let $E_{\text{club}}$ be the event of drawing a club from the deck.

In [31]:
E_black = {card for card in deck if ((card[1]=='♠') or (card[1]=='♣'))}
E_club = {card for card in deck if (card[1]=='♣')}
print(len(E_black)/len(deck), len(E_club)/len(deck))

0.5 0.25
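
Independence of two events can be checked in the same set-based style: $A$ and $B$ are independent exactly when $p(A \cap B) = p(A)\,p(B)$. The sketch below rebuilds the deck so that it runs on its own; the events `E_ace` and `E_king` and the helpers `prob` and `independent` are illustrative names that do not appear elsewhere in this assignment.

```python
from fractions import Fraction

suits = ['♣', '♢', '♡', '♠']
ranks = ['2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K', 'A']
deck = {(r, s) for r in ranks for s in suits}

def prob(event):
    """Probability of an event (a subset of the deck) under a uniform draw."""
    return Fraction(len(event), len(deck))

def independent(A, B):
    """True when p(A ∩ B) equals p(A) * p(B) exactly."""
    return prob(A & B) == prob(A) * prob(B)

E_black = {card for card in deck if card[1] in ('♠', '♣')}
E_ace = {card for card in deck if card[0] == 'A'}     # hypothetical extra event
E_king = {card for card in deck if card[0] == 'K'}    # hypothetical extra event

print(independent(E_ace, E_black))   # True: knowing the colour says nothing about the rank
print(independent(E_ace, E_king))    # False: the events are disjoint, so one excludes the other
```

Using `Fraction` keeps the comparison exact; comparing floating-point products with `==` could fail because of rounding.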


[Back to top](#Assignment-Contents)
<a id="q-club-black"></a>

---

### Question 1: Computing Discrete Probabilities

Your first task is to answer a few questions about probability and a standard deck of cards.

+ What is $p(E_{\text{black}})$, the probability of drawing a single black card (i.e., a card that is either of the club suit or the spades suit) from the deck? Assign your answer to `p_black`.
  + You can simply assign the number or you can compute it empirically using the Python sets `E_black` & `deck`.
+ What is $p(E_{\text{club}})$, the probability of drawing a single club (i.e., a card from the club suit) from the deck? Assign your answer to `p_club`.
  + You can simply assign the number or you can compute it empirically using the Python sets `E_club` & `deck`.
+ **True** or **False**:  the events $E_{\text{black}}$ & $E_{\text{club}}$ are independent. Assign your answer as a Python boolean literal `True` or `False` to `independent_club_black`.

In [32]:
### GRADED
### QUESTION 1:
### Assign values to p_black, p_club, and independent_club_black as described above.
### YOUR SOLUTION HERE:
p_black = len(E_black) / len(deck)
p_club = len(E_club) / len(deck)
independent_club_black = False
### For verifying answer:
print('p_black = {}'.format(p_black))
print('p_club = {}'.format(p_club))
print('independent_club_black = {}'.format(independent_club_black))

p_black = 0.5
p_club = 0.25
independent_club_black = False


In [33]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


---

[Back to top](#Assignment-Contents)
<a id="q-cond-p"></a>

---

### Question 2: Computing Conditional probabilities

+ Suppose I draw a card from the deck and I tell you it is a black card. What is the probability that the card drawn is also a club? That is, what is the *conditional probability* $p(E_{\text{club}}\,|\,E_{\text{black}})$?
  + Assign the value of $p(E_{\text{club}}\,|\,E_{\text{black}})$ to `p_club_black` as a standard Python floating-point numeric value.
+ Alternatively, suppose I draw a card from the deck and I tell you it is a club, i.e., a card from the club suit. What is the probability that the card drawn is also black? That is, what is the *conditional probability* $p(E_{\text{black}}\,|\,E_{\text{club}})$?
  + Assign the value of $p(E_{\text{black}}\,|\,E_{\text{club}})$ to `p_black_club` as a standard Python floating-point numeric value.

In [34]:
### GRADED
### QUESTION 2:
### Assign numeric values to p_club_black & p_black_club as described above.
### YOUR SOLUTION HERE:
p_club_int_black = len(E_club.intersection(E_black))/len(deck)
p_club_black = p_club_int_black / p_black
p_black_club = p_club_int_black / p_club
### For verifying answer:
print('p_club_black = {}'.format(p_club_black))
print('p_black_club = {}'.format(p_black_club))

p_club_black = 0.5
p_black_club = 1.0


In [35]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

Having reviewed a little about independence and conditional probability, recall *Bayes' theorem*:

$$\displaystyle{\boxed{p(A\,|\,B) = \frac{p(B\,|\,A) p(A)}{p(B)}}}$$

+ $p(A\,|\,B)$ is the "*posterior* probability of $A$ given $B$";
+ $p(B\,|\,A)$ is the "*likelihood* of $B$ given $A$";
+ $p(A)$ is the "*prior* probability of $A$"; and
+ $p(B)$ is the "*evidence*" (normalizing factor).
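
As a quick sanity check (a worked illustration using the card events from Questions 1 and 2, taking $A = E_{\text{club}}$ and $B = E_{\text{black}}$), Bayes' theorem reproduces the conditional probability obtained by direct counting:

$$p(E_{\text{club}}\,|\,E_{\text{black}}) = \frac{p(E_{\text{black}}\,|\,E_{\text{club}})\, p(E_{\text{club}})}{p(E_{\text{black}})} = \frac{1 \times \tfrac{1}{4}}{\tfrac{1}{2}} = \frac{1}{2}$$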

The next questions require you to apply Bayes' theorem to reason about spam messages. Remember, the goal is to identify messages as *spam* (not wanted, undesirable) or *ham* (i.e., the opposite of spam messages).

[Back to top](#Assignment-Contents)
<a id="q-spam"></a>

---

### Question 3: Reasoning about Spam Messages

Assume in the following that you have a training set of 2,000 messages known to be spam and 1,000 messages known to be ham (i.e., not spam). Suppose further that the word "bargain" occurs in 250 of the spam messages and 5 of the ham messages.

+ Assume the empirical prior probability of an incoming message being ham or spam is provided by the respective fractions of ham or spam messages in the training set.
  + Assign the (estimated) prior probability of spam to `prior_spam` (i.e., $p(\text{spam})$).
  + Assign the (estimated) prior probability of ham to `prior_ham` (i.e., $p(\text{ham})$).
+ Assume the empirical likelihood of the word "bargain" occurring in a message known to be spam (respectively, ham) is given by the counts above.
  + Assign the (estimated) likelihood of "bargain" occurring in an incoming spam message to `likelihood_bargain_spam` (i.e., $p(\text{bargain}\,|\,\text{spam})$).
  + Assign the (estimated) likelihood of "bargain" occurring in an incoming ham message to `likelihood_bargain_ham` (i.e., $p(\text{bargain}\,|\,\text{ham})$).
+ Finally, combine the preceding computations to estimate the *posterior* probability of an incoming message being spam given that it contains the word "bargain" (that is, $p(\text{spam}\,|\,\text{bargain})$).
  + Assign the posterior probability $p(\text{spam}\,|\,\text{bargain})$ to `posterior_spam_bargain`.
+ Express all the values computed here as Python floating-point values accurate to at least three decimal places.

In [36]:
### GRADED
### QUESTION 3:
### Assign floating-point values to prior_spam, prior_ham, likelihood_bargain_spam,
###   likelihood_bargain_ham, and posterior_spam_bargain as described above.
### Provide results accurate to at least 3 decimal places (i.e., an absolute tolerance of 1.0e-3).
### YOUR SOLUTION HERE:
n_spam = 2000
n_ham = 1000
prior_spam = n_spam / (n_spam + n_ham)
prior_ham = n_ham / (n_spam + n_ham)
likelihood_bargain_spam = 250 / n_spam
likelihood_bargain_ham = 5 / n_ham
posterior_spam_bargain = likelihood_bargain_spam * prior_spam / (likelihood_bargain_spam * prior_spam + likelihood_bargain_ham * prior_ham)

### For verifying answer:
print('prior_spam: {:5.3f}'.format(prior_spam))
print('prior_ham:  {:5.3f}'.format(prior_ham))
print('likelihood_bargain_spam: {:5.3f}'.format(likelihood_bargain_spam))
print('likelihood_bargain_ham : {:5.3f}'.format(likelihood_bargain_ham))
print('posterior_spam_bargain: {:5.3f}'.format(posterior_spam_bargain))

prior_spam: 0.667
prior_ham:  0.333
likelihood_bargain_spam: 0.125
likelihood_bargain_ham : 0.005
posterior_spam_bargain: 0.980
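
The same reasoning can be wrapped in a small reusable function. This is a minimal sketch, not part of the graded solution; the name `posterior_from_counts` and its argument names are hypothetical. The evidence $p(\text{bargain})$ is expanded with the law of total probability over the two classes.

```python
def posterior_from_counts(k_spam, n_spam, k_ham, n_ham):
    """Posterior p(spam | word) from training counts via Bayes' theorem.

    k_spam, k_ham: number of spam/ham messages containing the word
    n_spam, n_ham: total number of spam/ham messages
    """
    prior_spam = n_spam / (n_spam + n_ham)
    prior_ham = n_ham / (n_spam + n_ham)
    likelihood_spam = k_spam / n_spam
    likelihood_ham = k_ham / n_ham
    # Evidence: total probability of seeing the word in any message.
    evidence = likelihood_spam * prior_spam + likelihood_ham * prior_ham
    return likelihood_spam * prior_spam / evidence

p = posterior_from_counts(250, 2000, 5, 1000)
print('{:5.3f}'.format(p))                  # 0.980
print('{:5.3f}'.format(250 / (250 + 5)))    # 0.980
```

With empirical priors and likelihoods, the class sizes cancel, so the posterior is simply the fraction of "bargain" messages that are spam, 250 out of 255.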


In [37]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

## Filtering Spam from SMS Messages

For the next questions, you'll work with a dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), namely the [*SMS Spam Collection*](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection), a public set of labeled SMS messages that have been collected for mobile phone spam research.

[Back to top](#Assignment-Contents)
<a id="q-preparing"></a>

---

### Question 4: Preparing the SMS Messaging Data

Your task now is to load the SMS messaging data into a Pandas DataFrame.

+ The data is stored in a file whose location is provided for you as `FILE_PATH`. Use the function `pd.read_csv` with the options `sep="\t"` and `header=None`.
+ Assign the resulting `DataFrame` object to the identifier `messages`.
+ Give the DataFrame meaningful column headers by assigning the list `['target', 'msg']` to `messages.columns`.
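
To see what these options do without touching the real file, here is a small sketch that parses two made-up rows from an in-memory string. The rows and the name `toy` are invented for illustration; only the layout (label, tab, message text, no header row) mirrors the description above.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as the SMS file: label, tab, message text.
raw = "ham\tOk, see you at lunch\nspam\tFree entry! Reply WIN to claim your prize\n"

# sep="\t" splits on tabs (so commas inside messages are kept);
# header=None tells pandas that the first line is data, not column names.
toy = pd.read_csv(io.StringIO(raw), sep="\t", header=None)
toy.columns = ['target', 'msg']
print(toy)
```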

In [41]:
### GRADED
### QUESTION 4:
### Prepare the dataframe messages as specified above. 
###
# Necessary imports
import numpy as np, pandas as pd
FILE_PATH = 'data/SMSSpamCollection.txt'
### YOUR SOLUTION HERE:
messages = pd.read_csv(FILE_PATH,sep='\t',header=None)
messages.columns = ['target','msg']
### For verifying answer:
display(messages.head())

Unnamed: 0,target,msg
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [42]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

As usual in supervised learning, you want to split the data into training and testing data sets so that the model can be assessed after fitting it to the data. You'll use the `train_test_split` function from the Scikit-Learn module `sklearn.model_selection` that does this work for you. Notice the use of the keyword argument `stratify` to ensure that the proportions of ham and spam messages match in the training and the testing datasets.

In [43]:
from sklearn.model_selection import train_test_split
messages_train, messages_test = train_test_split(messages, random_state=13, stratify=messages['target'])
print('There are {} training observations & {} testing observations.\n'
        .format(len(messages_train), len(messages_test)))

There are 4179 training observations & 1393 testing observations.
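
The effect of `stratify` is easier to see on a small synthetic example. The sketch below uses a made-up DataFrame `toy` with 80 ham and 20 spam rows (an assumption for illustration only) and checks that both parts of the split keep the overall 20% spam fraction.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 ham and 20 spam messages (20% spam overall).
toy = pd.DataFrame({
    'target': ['ham'] * 80 + ['spam'] * 20,
    'msg': ['message {}'.format(i) for i in range(100)],
})

toy_train, toy_test = train_test_split(toy, random_state=0, stratify=toy['target'])

# By default, train_test_split holds out 25% of the rows for testing.
print(len(toy_train), len(toy_test))
# With stratification, both parts keep the overall class proportions.
print(toy_train['target'].value_counts(normalize=True))
print(toy_test['target'].value_counts(normalize=True))
```

Without `stratify`, a random split of an imbalanced dataset can leave the test set with noticeably more or fewer spam messages than the training set.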



[Back to top](#Assignment-Contents)
<a id="q-priors"></a>

---

### Question 5: Getting priors

Estimate prior probabilities of messages being `ham` and `spam` empirically using the training set.

+ Construct a Pandas Series `priors` with Index values `'ham'` and `'spam'` and corresponding values given by the fraction of `ham` and `spam` messages (respectively) in the training set `messages_train`.
+ HINT: the Pandas Series method `value_counts` is useful here.

In [44]:
### GRADED
### QUESTION 5:
### Assign a Pandas Series to priors as described above.
### YOUR SOLUTION HERE:
priors = messages_train['target'].value_counts(normalize=True)
### For verifying answer:
print('Training priors (%):\n===================\n{}\n'.format(100 * priors))

Training priors (%):
ham     86.599665
spam    13.400335
Name: target, dtype: float64



In [45]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

## Processing SMS messages

To get a sense of how to build the Naive Bayes classifier, you can examine a few of the spam messages in the training dataset.

In [46]:
is_spam = (messages_train['target']=='spam')
display(messages_train.loc[is_spam].head())

Unnamed: 0,target,msg
3862,spam,Free Msg: Ringtone!From: http://tms. widelive....
3421,spam,"As a valued customer, I am pleased to advise y..."
5164,spam,Congrats 2 mobile 3G Videophones R yours. call...
159,spam,Customer service annoncement. You have a New Y...
3571,spam,Customer Loyalty Offer:The NEW Nokia6650 Mobil...


A number of these messages flagged as spam include the word *"customer"*. Notice that the word is sometimes capitalized, but not always. To catch this, you can write a boolean-valued utility function `has_word` that signals the presence of a word as a substring of a message in a case-insensitive way.

In [47]:
def has_word(msg, word):
    '''Returns True if string word is contained within string msg
    Ignores case, i.e., matches on lower or upper case match.'''
    return (word.lower() in msg.lower())

This matching is *extremely* simplified. A practical filter would do more than convert the message to lower case: it would split the message into individual words, remove punctuation & stop words, likely *stem* or lemmatize the words (so that, for example, *organize*, *organizes*, *organized*, and *organizing* all match), and so on. For this assignment, you will use this very crude form of matching to avoid some of those complications. You'll find that even with this crude model, spam and ham can be flagged reasonably well.
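
One consequence of substring matching is that short words match inside longer ones. The sketch below contrasts `has_word` with a hypothetical token-based variant, `has_token`, which is not used anywhere in this assignment and is shown only to illustrate the difference.

```python
import re

def has_word(msg, word):
    """Substring test used in this assignment (case-insensitive)."""
    return word.lower() in msg.lower()

def has_token(msg, word):
    """Hypothetical stricter variant: match whole lower-cased tokens only."""
    tokens = re.findall(r"[a-z0-9']+", msg.lower())
    return word.lower() in tokens

msg = "This thing is fine"
print(has_word(msg, 'hi'))            # True: 'hi' is a substring of 'This' and 'thing'
print(has_token(msg, 'hi'))           # False: no token equals 'hi'
print(has_token("Hi there!", 'hi'))   # True
```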

Let's extract one of the messages found above and apply `has_word` to it.

In [48]:
k, word = 159, 'customer'
msg = messages_train.loc[k, 'msg']
# Verifying that has_word works as intended.
print(has_word(msg, word), '\n\n', msg)

True 

 Customer service annoncement. You have a New Years delivery waiting for you. Please call 07046744435 now to arrange delivery


[Back to top](#Assignment-Contents)
<a id="q-likelihoods"></a>

---

### Question 6: Computing likelihoods

Your task now is to encapsulate the logic above into another function that operates on DataFrames. That is, complete the function `get_likelihoods` that computes the *empirical likelihoods* of a word being in a message given all possible categories (`ham` and `spam` in this case).
+ The inputs are `word` (a string) and `df` (a DataFrame). The DataFrame is assumed to have a categorical column `'target'` and a column `'msg'` containing strings (the messages). This is precisely the form of `messages_train`.
+ The function returns a Pandas Series whose Index contains the categories of `df['target']` (`ham` and `spam` in this case). The values of the Series are the fractions of rows in each category in which the column `'msg'` contains `word` as a substring (case-insensitive).
+ Do *not* assume Laplace smoothing here, i.e., likelihoods of zero are possible if a given word does not appear in the message DataFrame's `'msg'` column.

In [53]:
### GRADED
### QUESTION 6:
### Complete the function get_likelihoods as described above.
###
def get_likelihoods(word, df):
    '''Computes empirical fractions of rows of *df* that have *word* in 'msg' column.
    INPUT:
      word:        String to match (case-insensitive)
      df:          DataFrame with columns 'target' (categorical) and 'msg' (string objects)
    OUTPUT:
      likelihoods: Series indexed by categories of df['target'] with empirical fraction of
                   rows in which df['msg'] includes word as a substring (case-insensitive).
                   Do *not* employ Laplace smoothing here.
    EXAMPLE:
    >>> l = get_likelihoods('customer', messages_train)
    >>> print(l)
        ham     0.002211
        spam    0.067857
        dtype: float64
    >>> l = get_likelihoods('congrats', messages_train)
    >>> print(l)
        ham     0.001934
        spam    0.016071
        dtype: float64 
    '''
###
### YOUR CODE HERE
###
    filt = lambda m: has_word(m,word)
    fractions = {}
    for category in np.unique(df.target):
        df_new = df.loc[df['target'] == category]
        counts = df_new['msg'].map(filt).sum()
        fractions[category] = counts / len(df_new)
        
    likelihoods = pd.Series(fractions)
    return likelihoods
        

### For verifying answer:
for word in ['customer', 'congrats', 'meeting']:
    print('{}:\n={}'.format(word,'='*len(word)))
    print(get_likelihoods(word, messages_train), '\n')

customer:
ham     0.002211
spam    0.067857
dtype: float64 

congrats:
ham     0.001934
spam    0.016071
dtype: float64 

meeting:
ham     0.008842
spam    0.000000
dtype: float64 
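
The explicit loop over categories can also be written with Pandas string methods and `groupby`. The following is a sketch of that alternative, run on a made-up DataFrame `toy`; the function name `get_likelihoods_vectorized` is hypothetical. Unlike the loop version, the returned Series carries the names `'msg'` and `'target'` on its values and index.

```python
import pandas as pd

def get_likelihoods_vectorized(word, df):
    """Fraction of rows per category whose 'msg' contains word (case-insensitive)."""
    # regex=False treats word as a literal substring, like has_word does.
    contains = df['msg'].str.contains(word, case=False, regex=False)
    return contains.astype(float).groupby(df['target']).mean()

# Hypothetical toy data: 4 ham messages and 2 spam messages.
toy = pd.DataFrame({
    'target': ['ham', 'ham', 'ham', 'ham', 'spam', 'spam'],
    'msg': ['See you at the meeting', 'Ok lar', 'Customer called again', 'Lunch?',
            'Dear CUSTOMER, you won', 'Free entry now'],
})
result = get_likelihoods_vectorized('customer', toy)
print(result)   # ham 0.25 (1 of 4), spam 0.50 (1 of 2)
```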



In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

## Building the Naive Bayes model

You may have noticed in Question 6 an unusual result with the input word "meeting":

```
>>> get_likelihoods('meeting', messages_train)
meeting:
========
ham     0.008842
spam    0.000000
dtype: float64 

```
That is, the empirical likelihood of finding the word "meeting" in a spam message is zero. This is an artifact of the particular corpus of training messages; that is, that word does not happen to occur in any of the finite number of spam messages in the training set. It is not reasonable to conclude that the likelihood of the word "meeting" occurring in *any* incoming spam message is zero.

To compensate for this kind of problem, you can use [*Laplace smoothing*](https://en.wikipedia.org/wiki/Additive_smoothing). That is, modify the computation of an empirical likelihood as follows:

$$ p(w\,|\,C) \simeq \frac{|C \cap w| + \color{red}{\beta}}{n_{C} + \color{red}{n_{\text{w}}} } $$

In the above, $C$ refers to any of the available categories in the classification problem (`ham` or `spam` here), $|C \cap w|$ is the number of messages in category $C$ in which the word $w$ occurs, $n_C$ is the total number of messages in category $C$, $\beta$ is a *smoothing parameter* (typically 1), and $n_w$ is another parameter (in this case, the number of words in the vocabulary to scan for). This has the effect of moving likelihoods away from zero; zero likelihoods are problematic when multiplying likelihoods together, as is required for naive Bayes classification.
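
For concreteness, the sketch below evaluates the formula by hand for the spam category. The counts are inferred from outputs shown elsewhere in this assignment rather than stated directly: the priors in Question 5 imply 560 spam messages among the 4179 training messages, and the unsmoothed likelihood 0.067857 for "customer" then corresponds to 38 spam messages. The helper name `smoothed_likelihood` is hypothetical.

```python
def smoothed_likelihood(count, n_c, beta=1, n_w=1):
    """(count + beta) / (n_c + n_w), the smoothed estimate of p(w | C)."""
    return (count + beta) / (n_c + n_w)

n_spam = 560   # inferred: 0.13400335 * 4179 training messages
n_w = 3        # size of the vocabulary ['customer', 'congrats', 'meeting']

print(round(smoothed_likelihood(0, n_spam, beta=1, n_w=n_w), 6))    # 'meeting':  0.001776
print(round(smoothed_likelihood(38, n_spam, beta=1, n_w=n_w), 6))   # 'customer': 0.069272
print(round(smoothed_likelihood(38, n_spam, beta=0, n_w=n_w), 6))   # 'customer', beta=0: 0.067496
```

These agree with the values that `get_smoothed_likelihoods` reports in Question 7. Note that with `beta=0` the result (0.067496) still differs slightly from the unsmoothed 0.067857, because $n_w$ remains in the denominator.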

[Back to top](#Assignment-Contents)
<a id="q-smoothed"></a>

---

### Question 7: Computing smoothed likelihoods

Your task now is to generalize the function `get_likelihoods` from Question 6 to yield a function `get_smoothed_likelihoods`. 
+ The inputs are almost the same as in Question 6. The first input is a list `words` consisting of the vocabulary of words to look for in messages. The second input `df` is a DataFrame with the same requirements as before. There is also an extra input `beta` (whose default value is one) that is used to compute the smoothed likelihoods.
+ The output is a Pandas DataFrame whose Index contains the categories of `df['target']` (`ham` and `spam` in this case) and whose columns are the words in the input `words`. The corresponding values of the DataFrame are the fractions of rows in each category in which the column `'msg'` contains the corresponding word as a substring (case-insensitive).
+ In this case, *apply Laplace smoothing* i.e., likelihoods of zero are not possible even if a given word does not appear in the message DataFrame's `'msg'` column.
+ *NOTE*: Although the construction demonstrated in the lecture video 19-4 is useful for these exercises, be aware that the DataFrame of likelihoods obtained in the video is *transposed* relative to the requirements here. That is, the comparable DataFrame computed in the lecture video has the categories `spam` and `ham` as *column* index labels and the list of words as *row* index labels. Again, this is the *opposite* of what is required here.

In [59]:
### GRADED
### QUESTION 7:
### Complete the function get_smoothed_likelihoods as described above.
###
def get_smoothed_likelihoods(words, df, beta=1):
    '''Computes empirical fractions of rows of *df* that have *word* in 'msg' column.
    INPUT:
      words:       List of strings to match (case-insensitive)
      df:          DataFrame with columns 'target' (categorical) and 'msg' (string objects)
      beta:        (default value 1) Smoothing constant to add to numerator
    OUTPUT:
      likelihoods: DataFrame with categories of df['target'] (row) Index and words as column
                   index. The entries are the empirical fractions of rows in which df['msg'] includes
                   each word as a substring (case-insensitive).
                   Empirical fractions computed using Laplace smoothing here.
    EXAMPLE:
    >>> words = ['customer', 'congrats', 'meeting']
    >>> likelihoods = get_smoothed_likelihoods(words, messages_train)
    >>> print(likelihoods)
              customer  congrats   meeting
        ham   0.002485  0.002209  0.009111
        spam  0.069272  0.017762  0.001776
    >>> likelihoods = get_smoothed_likelihoods(words, messages_train, beta=0)
    >>> print(likelihoods) # without smoothing (mostly)
              customer  congrats   meeting
        ham   0.002209  0.001933  0.008835
        spam  0.067496  0.015986  0.000000
    '''
###
### YOUR CODE HERE
###
    n_words = len(words)
    data = {}
    for word in words:
        filt = lambda m: has_word(m, word)
        fractions = {}
        for category in np.unique(df.target):
            df_new = df.loc[df['target'] == category]
            counts = df_new['msg'].map(filt).sum()
            fractions[category] = (counts + beta) / (len(df_new) + n_words)
        data[word] = pd.Series(fractions)
    likelihoods = pd.DataFrame(data=data)
    return likelihoods
    
### For verifying answer:
words = ['customer', 'congrats', 'meeting']
print(get_smoothed_likelihoods(words, messages_train), '\n')
print(get_smoothed_likelihoods(words, messages_train, beta=0)) # without smoothing (mostly)

      customer  congrats   meeting
ham   0.002485  0.002209  0.009111
spam  0.069272  0.017762  0.001776 

      customer  congrats   meeting
ham   0.002209  0.001933  0.008835
spam  0.067496  0.015986  0.000000


In [60]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

## Putting it all together

Finally, you can put these pieces together to classify the testing data as spam or ham. Given the oversimplifications in the feature extraction used, it's more instructive to consider only a small vocabulary of words and a selected subset of testing messages.

In [61]:
# this is a highly restricted vocabulary of words to help distinguish spam & ham
vocab = ['winner', 'congratulations', 'free', 'contact', 'holiday', 'price',
         'urgent', 'cost', 'credit', 'won', 'new', 'work', 'loan', 'return',
         'insurance', 'bank', 'sale', 'safe', 'red', 'alright', 'place',
         'house', 'buy', 'hello', 'hi', 'got', 'have', 'well', 'vacation',
         'thing', 'cat', 'car', 'kitchen', 'want', 'waiting', 'dog']

In [62]:
# Create a restricted set of test messages to examine
selected = [3420, 624]
test_set = messages_test.loc[selected]
test_set

Unnamed: 0,target,msg
3420,spam,Do you want a new Video phone? 600 anytime any...
624,ham,"sorry, no, have got few things to do. may be i..."


Let's extract the first message, see which words from our vocabulary it contains and then compute the smoothed likelihoods.

In [63]:
idx = 3420
test_msg = test_set.loc[idx, 'msg']
y_true = test_set.loc[idx,'target']
words = [word for word in vocab if has_word(test_msg, word)]
likelihoods = get_smoothed_likelihoods(words, messages_train)

The key assumption in [*naive Bayes classification*](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) is that the *likelihoods are independent* (i.e., that the likelihood of each word occurring in a message known to be spam or ham is independent of the other words occurring). This is the "naive" assumption, but it makes the computations much simpler (particularly when the vocabulary is large, say, thousands of words).

Under this assumption, the joint likelihood of the words $w_1$ and $w_2$ occurring in category $C$ can be determined as a product:
 $$ \boxed{p(w_1 \cap w_2 \,|\,C) = p(w_1\,|\,C)\times p(w_2\,|\,C)}. $$
 This generalizes easily to arbitrarily many words/features (and even arbitrarily many classes in a multi-class classification problem): 
  $$ \boxed{p(w_1, w_2, \dotsc, w_d\,|\, C) = \prod_{k=1}^{d}p(w_{k}\,|\,C)}.$$


In the implementation developed so far, this computation can be carried out easily using the Pandas DataFrame method `prod`. Once the joint likelihoods are known, they can be combined with the priors to compute the posterior probabilities as Bayes' theorem tells us. And when the posterior probabilities are known for both the ham and spam classes, the larger of the two can be used to decide how to classify a new message instance.
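
As a self-contained illustration of these three steps (product, Bayes' theorem, largest posterior), the sketch below hard-codes the smoothed likelihoods printed in Question 7 and the priors printed in Question 5. It corresponds to a hypothetical message that contains all three words "customer", "congrats", and "meeting"; no such message is taken from the dataset.

```python
import pandas as pd

# Smoothed likelihoods and priors copied from the outputs of Questions 7 and 5.
likelihoods = pd.DataFrame(
    {'customer': [0.002485, 0.069272],
     'congrats': [0.002209, 0.017762],
     'meeting':  [0.009111, 0.001776]},
    index=['ham', 'spam'])
priors = pd.Series({'ham': 0.86599665, 'spam': 0.13400335})

joint_likelihoods = likelihoods.prod(axis=1)      # product across words, one value per class
unnormalized = joint_likelihoods * priors         # numerator of Bayes' theorem
posteriors = unnormalized / unnormalized.sum()    # divide by the evidence
print(posteriors)
print(posteriors.idxmax())                        # class with the largest posterior
```

Because the evidence is the same for every class, it does not change which posterior is largest; it only rescales the values so that they sum to one.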

In [64]:
joint_likelihoods = likelihoods.prod(axis=1)    # Use assumption of independence
evidence = (joint_likelihoods * priors).sum()
posteriors = joint_likelihoods * priors / evidence  # Bayes' theorem
print('Posteriors:\n{}\n'.format(posteriors))
y_pred = posteriors.idxmax()     # Take largest posterior probability to predict class
print('{}\n\ny_true = {}\ny_pred = {}'.format(test_msg, y_true, y_pred))

Posteriors:
ham     0.036228
spam    0.963772
dtype: float64

Do you want a new Video phone? 600 anytime any network mins 400 Inclusive Video calls AND downloads 5 per week Free delTOMORROW call 08002888812 or reply NOW

y_true = spam
y_pred = spam


Let's repeat the preceding computation on the other test message. First, let's extract the words from `vocab` that actually occur in this message.

In [65]:
idx = 624
test_msg = test_set.loc[idx, 'msg']
y_true = test_set.loc[idx,'target']
words = [word for word in vocab if has_word(test_msg, word)]

Next, let's compute the likelihoods and the evidence (remember, you computed the priors empirically from the training data in Question 5).

In [66]:
likelihoods = get_smoothed_likelihoods(words, messages_train)
joint_likelihoods = likelihoods.prod(axis=1)    # Use assumption of independence
evidence = (joint_likelihoods * priors).sum()

We now have the pieces in place to apply Bayes' theorem to compute the posteriors:

In [67]:
posteriors = joint_likelihoods * priors / evidence  # Bayes' theorem
print('Posteriors:\n{}\n'.format(posteriors))

Posteriors:
ham     0.98111
spam    0.01889
dtype: float64



Finally, the posterior probabilities for each class can be used to make a decision about classification.

In [68]:
y_pred = posteriors.idxmax()     # Take largest posterior probability to predict class
print('{}\n\ny_true = {}\ny_pred = {}'.format(test_msg, y_true, y_pred))

sorry, no, have got few things to do. may be in pub later.

y_true = ham
y_pred = ham


For the last question, you will use a larger (carefully chosen) subset of the test data to see if you can use this selection of words to classify ham and spam messages.

In [69]:
selected = [1303, 2124, 4073, 4967, 2686, 1944, 4923, 1895, 2876, 3455, 113, 4729, 2946,
            4801, 4009, 838, 3509, 3675, 3595, 5535, 4918, 4935, 1936, 1491, 3772, 4905,
            3789, 901, 3164, 5566, 5482, 4133, 4543, 2422, 3456, 2849, 4497, 2514]
test_set = messages_test.loc[selected]

[Back to top](#Assignment-Contents)
<a id="q-predict"></a>

---

### Question 8: Predicting spam

Your final task is to complete a function `predict_nb` that implements the computations just applied to the preceding two messages.
+ The mandatory input is `test_msg` (a *single message*, i.e., a Python string).
+ As optional keyword arguments, `predict_nb` accepts `word_list` (with default value `vocab` as provided above) and `data` (with default value `messages_train` as computed earlier).
+ The value returned is a category from `data['target']` (`ham` or `spam` in this case).
+ The function should be ready to map onto a Series as shown below.



In [70]:
### GRADED
### QUESTION 8:
### Complete the function predict_nb as described above
###
def predict_nb(test_msg, word_list=vocab, data=messages_train):
    '''Returns a category from data.target according to how test_msg classifies
    INPUT:
      test_msg:    String (message)
      word_list:   List of strings to match (case-insensitive); default value vocab
      data:        DataFrame with columns 'target' (categorical) and 'msg' (string objects);
                   default value messages_train
    OUTPUT:
      y_pred:      category from data.target
    EXAMPLE:
    >>> msg = messages_test.loc[624, 'msg']
    >>> print(predict_nb(msg))
        ham
    >>> msg = messages_test.loc[3420, 'msg']
    >>> print(predict_nb(msg))
        spam
    '''
###
### YOUR CODE HERE
###
    words = [word for word in word_list if has_word(test_msg,word)]
    likelihoods = get_smoothed_likelihoods(words, data)
    joint_likelihoods = likelihoods.prod(axis=1)
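    # Estimate the priors from `data` itself instead of relying on the global `priors`.
    class_priors = data['target'].value_counts(normalize=True)
    evidence = (joint_likelihoods*class_priors).sum()
    posteriors = (joint_likelihoods*class_priors) / evidence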
    y_pred = posteriors.idxmax()
    return y_pred

### For verifying answer:
test_set['pred'] = test_set['msg'].map(predict_nb)
result = test_set[['target', 'pred', 'msg']]
result

Unnamed: 0,target,pred,msg
1303,ham,spam,FRAN I DECIDED 2 GO N E WAY IM COMPLETELY BROK...
2124,spam,spam,+123 Congratulations - in this week's competit...
4073,spam,spam,Loans for any purpose even if you have Bad Cre...
4967,spam,spam,URGENT! We are trying to contact U. Todays dra...
2686,spam,spam,URGENT! We are trying to contact U. Todays dra...
1944,ham,ham,I got lousy sleep. I kept waking up every 2 ho...
4923,ham,ham,Hi Dear Call me its urgnt. I don't know whats ...
1895,spam,spam,"FreeMsg Hey U, i just got 1 of these video/pic..."
2876,ham,spam,"Idk. You keep saying that you're not, but sinc..."
3455,ham,ham,I dont have any of your file in my bag..i was ...
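
The rows displayed above can be summarized with a cross-tabulation of true and predicted labels, which shows what kind of mistakes are made. This is a sketch that hard-codes only the ten rows shown in the output (not the full selected test set); the names `shown`, `accuracy`, and `confusion` are illustrative.

```python
import pandas as pd

# (target, pred) pairs for the ten rows displayed in the Question 8 output.
shown = pd.DataFrame({
    'target': ['ham', 'spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham'],
    'pred':   ['spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'spam', 'spam', 'ham'],
})

accuracy = (shown['target'] == shown['pred']).mean()      # fraction of correct predictions
confusion = pd.crosstab(shown['target'], shown['pred'])   # rows: true label, columns: prediction
print(accuracy)
print(confusion)
```

On these ten rows, both errors are ham messages flagged as spam, while every spam message is caught.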


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Assignment-Contents)

As you can see, the 38 selected test messages are classified with about 79% accuracy:

In [None]:
(result['target'] == result['pred']).sum() / len(result)

Of course, the test examples were selected to maximize overlap with the selected vocabulary. Nevertheless, this very simple implementation can be extended readily with more realistic feature extraction and a broader vocabulary. Scikit-Learn's `naive_bayes` and `feature_extraction` modules provide tools for building more useful models.
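
As a rough idea of what that looks like, here is a minimal sketch using Scikit-Learn on a tiny made-up training set (the messages and labels below are invented for illustration, not taken from the SMS Spam Collection). `CountVectorizer` tokenizes the messages and `BernoulliNB` models word presence or absence. Note that `BernoulliNB` also accounts for vocabulary words that are *absent* from a message, so it is related to, but not identical to, the model built in this assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, not taken from the SMS Spam Collection.
train_msgs = ['free prize winner call now', 'urgent free loan offer',
              'see you at the meeting', 'are you home for dinner']
train_labels = ['spam', 'spam', 'ham', 'ham']

# binary=True records word presence/absence, similar in spirit to has_word;
# alpha=1.0 is Scikit-Learn's additive (Laplace) smoothing parameter.
model = make_pipeline(CountVectorizer(binary=True), BernoulliNB(alpha=1.0))
model.fit(train_msgs, train_labels)

predictions = model.predict(['free loan now', 'dinner at home'])
print(list(predictions))
```

In this setup the vocabulary is learned from the training messages themselves rather than chosen by hand.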

[Back to top](#Assignment-Contents)

## References

+ [*Naive Bayes Classifier*](https://en.wikipedia.org/wiki/Naive_bayes) (Wikipedia)
+ [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html) (Scikit-Learn documentation)
+ [*Python Data Science Handbook*](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake Vanderplas
