# Naive Bayes Classifier

### What is a classifier?

A classifier is a machine learning model that assigns objects to categories based on their features.

In machine learning, naïve Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. 

In the statistics and computer science literature, naive Bayes models are known under a variety of names, including simple Bayes and independence Bayes. All these names reference the use of Bayes' theorem in the classifier's decision rule, but naïve Bayes is not (necessarily) a Bayesian method.

## Introduction
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.

A naive Bayes classifier is a probabilistic machine learning model used for classification tasks. The crux of the classifier is Bayes' theorem.

### Probabilistic Model
Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector $ X = (x_1, \dots, x_n) $ of $n$ features (independent variables), it assigns to this instance probabilities 
     $ p(C_k | x_1, \dots, x_n) $
for each of $K$ possible outcomes or classes $ C_k $.

The problem with the above formulation is that if the number of features n is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, the conditional probability can be decomposed as

$\qquad p(C_k | X) = \frac{p(C_k)\,p(X | C_k)}{p(X)} $

In plain English, using Bayesian probability terminology, the above equation can be written as
    
 $\qquad posterior = \frac{prior \: \times \: likelihood}{evidence} $

Above, 
<ul>
    <li> $ p(C_k | X) $ is the posterior probability of class (c, target) given predictor (x, attributes).</li>
    <li>$ p(C_k) $ is the prior probability of class.</li>
    <li>$ p(X | C_k) $ is the likelihood which is the probability of predictor given class.</li>
    <li>$ p(X) $ is the evidence, i.e. the marginal probability of the predictor.</li>
</ul>

In simple words, if $H$ is some hypothesis and $E$ is the evidence, then:

$\qquad p(H|E) = \frac{p(E|H)\,p(H)}{p(E)}$

- Posterior $ p(H|E) $: How probable is our hypothesis given the observed evidence?
- Likelihood $ p(E|H) $: How probable is the evidence given that our hypothesis is true? 
- Prior $ p(H) $: How probable was our hypothesis before observing the evidence? 

Bayes' theorem gives:
$ p(C_k | X) = \frac{p(C_k)\,p(X | C_k)}{p(X)} $<br><br>
Substituting $X = (x_1, \dots, x_n)$ and applying the naive assumption that the features are conditionally independent given the class, so that $p(X | C_k) = \prod_{i=1}^n p(x_i | C_k)$, we get: <br><br>
$\qquad p(C_k|x_1,...,x_n)=\frac{p(C_k)\,p(x_1|C_k)\,p(x_2|C_k)\cdots p(x_n|C_k)}{p(x_1,...,x_n)}$ <br><br>
In practice, only the numerator of that fraction matters, because the denominator does not depend on $C_k$ and the values of the features $x_i$ are given, so that the denominator is effectively constant. Thus: <br><br>
$\qquad p(C_k|x_1,...,x_n) \:\propto\: p(C_k)\prod_{i=1}^np(x_i|C_k)$

### Constructing a classifier from the probability model
The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the ___maximum a posteriori___ or MAP decision rule. The corresponding classifier, a ___Bayes classifier___, is the function that assigns a class label $\hat{y} = C_k$ for some $k$ as follows: <br><br>
$\qquad\hat{y}\: = \: \underset{k \in \{1,...,K\}}{\operatorname{argmax}}\; p(C_k)\prod_{i=1}^np(x_i|C_k)$
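
The MAP rule above can be sketched in a few lines of Python. This is a minimal illustration, not a full implementation: the function name `map_predict`, the dictionary layout, and all the numbers are assumptions chosen for the example.

```python
from math import prod

def map_predict(priors, likelihoods, x):
    """Return the MAP class and the unnormalized score of every class.

    priors:      {class: p(C_k)}
    likelihoods: {class: [ {value: p(x_i = value | C_k)} for each feature i ]}
    x:           tuple of observed feature values
    """
    scores = {
        c: priors[c] * prod(likelihoods[c][i][v] for i, v in enumerate(x))
        for c in priors
    }
    # argmax over classes of p(C_k) * prod_i p(x_i | C_k)
    return max(scores, key=scores.get), scores

# Hypothetical two-class, two-feature model (illustrative numbers only)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{"yes": 0.8, "no": 0.2}, {"short": 0.7, "long": 0.3}],
    "ham":  [{"yes": 0.1, "no": 0.9}, {"short": 0.4, "long": 0.6}],
}
label, scores = map_predict(priors, likelihoods, ("yes", "short"))
print(label, scores)
```

Note that the scores are not probabilities; they only need to be compared with each other, which is why the evidence term can be dropped.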

### Types of Naive Bayes Classifier:
#### Multinomial Naive Bayes:
This is mostly used for document classification problems, i.e., deciding whether a document belongs to the category of sports, politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words present in the document.

#### Bernoulli Naive Bayes:
This is similar to multinomial naive Bayes, but the predictors are boolean variables. The features used to predict the class variable take only the values yes or no, for example whether a word occurs in the text or not.

#### Gaussian Naive Bayes:
When the predictors take continuous rather than discrete values, we assume that these values are sampled from a Gaussian distribution.
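
As a rough sketch of the Gaussian case, the per-class likelihood of a continuous feature can be taken as the normal density with the mean and variance estimated from that class's training values. The helper names `fit_gaussian` and `gaussian_likelihood` and the temperature readings below are hypothetical.

```python
import math

def fit_gaussian(values):
    """Estimate the mean and (population) variance of one feature within one class."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

def gaussian_likelihood(v, mean, var):
    """Density of N(mean, var) at v, used in place of p(x_i = v | C_k)."""
    return math.exp(-((v - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical temperatures observed on days labelled "Yes"
mean, var = fit_gaussian([20.0, 22.0, 24.0])
density = gaussian_likelihood(22.0, mean, var)
print(mean, var, density)
```

The density value replaces the table lookup used for discrete features; everything else in the decision rule stays the same.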

### Conclusion:
Naive Bayes algorithms are mostly used in sentiment analysis, spam filtering, recommendation systems, etc. They are fast and easy to implement, but their biggest disadvantage is the requirement that the predictors be independent. In most real-life cases the predictors are dependent, which hinders the performance of the classifier.

## Working through an example

___Problem Statement:___ To predict whether a person will play cricket given a specific combination of Outlook, Temperature, Humidity, and Windy conditions.

|Day |OUTLOOK   |TEMPERATURE|HUMIDITY|WINDY|PLAY CRICKET|
|----|----------|-----------|--------|-----|------------|
|1   |Rainy     |Hot        |High    |False|No          |
|2   |Rainy     |Hot        |High    |True |No          |
|3   |Overcast  |Hot        |High    |False|Yes         |
|4   |Sunny     |Mild       |High    |False|Yes         |
|5   |Sunny     |Cool       |Normal  |False|Yes         |
|6   |Sunny     |Cool       |Normal  |True |No          |
|7   |Overcast  |Cool       |Normal  |True |Yes         |
|8   |Rainy     |Mild       |High    |False|No          |
|9   |Rainy     |Cool       |Normal  |False|Yes         |
|10  |Sunny     |Mild       |Normal  |False|Yes         |
|11  |Rainy     |Mild       |Normal  |True |Yes         |
|12  |Overcast  |Mild       |High    |True |Yes         |
|13  |Overcast  |Hot        |Normal  |False|Yes         |
|14  |Sunny     |Mild       |High    |True |No          |

##### Now let's make a frequency table for each of the attributes

__Outlook:__
<table style="border: 1px solid black;
  border-collapse: collapse;">
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th colspan="2" rowspan="2" style="border: 1px solid black;
  border-collapse: collapse;">Frequency Table</th>
        <th colspan="2" style="border: 1px solid black;
  border-collapse: collapse;">Play</th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Yes</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">No</th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th rowspan="3" style="border: 1px solid black;
  border-collapse: collapse;">Outlook</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">Sunny</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">3</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">2</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Overcast</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">4</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">0</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Rainy</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">2</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">3</td>
    </tr>
</table>

__Temperature:__
<table style="border: 1px solid black;
  border-collapse: collapse;">
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2" rowspan="2">Frequency Table</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2">Play</th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Yes</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">No</th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" rowspan="3">Temperature</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">Hot</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">2</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">2</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Mild</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">4</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">2</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Cool</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">3</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">1</td>
    </tr>
</table>

__Humidity:__
<table style="border: 1px solid black;
  border-collapse: collapse;">
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2" rowspan="2">Frequency Table</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2">Play</th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Yes</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">No</th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" rowspan="2">Humidity</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">High</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">3</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">4</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Normal</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">6</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">1</td>
    </tr>
</table>

__Windy:__
<table style="border: 1px solid black;
  border-collapse: collapse;">
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2" rowspan="2">Frequency Table</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2">Play</th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Yes</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">No</th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" rowspan="2">Windy</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">False</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">6</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">2</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">True</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">3</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">3</td>
    </tr>
</table>
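
The frequency tables above can be reproduced programmatically. The following sketch uses only the standard library; the variable names (`rows`, `features`, `freq`, `class_counts`) are choices made for this example, and the rows are copied from the data table.

```python
from collections import Counter

# (Outlook, Temperature, Humidity, Windy, Play) for days 1..14
rows = [
    ("Rainy", "Hot", "High", False, "No"),
    ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Sunny", "Mild", "High", True, "No"),
]
features = ["Outlook", "Temperature", "Humidity", "Windy"]

# How many days fall into each class
class_counts = Counter(row[-1] for row in rows)

# For each attribute, count (attribute value, class) pairs
freq = {
    name: Counter((row[i], row[-1]) for row in rows)
    for i, name in enumerate(features)
}

print(class_counts)
print(freq["Outlook"][("Sunny", "Yes")], freq["Outlook"][("Overcast", "No")])
```

A `Counter` returns 0 for pairs that never occur, such as (Overcast, No), which matches the zero entry in the Outlook table.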

Suppose we want to know if the person will play on a day with the following values: <br>
    $\qquad$Outlook = Sunny, Temperature = Hot, Humidity = Normal, Windy = False

___Now let's make the likelihood table of each attribute___

Total number of Yes: 9 <br>
Total number of No: 5

__Outlook:__
<table style="border: 1px solid black;
  border-collapse: collapse;">
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th colspan="2" rowspan="2" style="border: 1px solid black;
  border-collapse: collapse;">Likelihood Table</th>
        <th colspan="2" style="border: 1px solid black;
  border-collapse: collapse;">Play</th>
        <th rowspan="2" style="border: 1px solid black;
  border-collapse: collapse;"></th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Yes</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">No</th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th rowspan="3" style="border: 1px solid black;
  border-collapse: collapse;">Outlook</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">Sunny</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">3/9</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">2/5</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">5/14</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Overcast</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">4/9</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">0/5</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">4/14</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Rainy</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">2/9</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">3/5</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">5/14</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2"></th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">9/14</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">5/14</td>
    </tr>
</table>

__Temperature:__
<table style="border: 1px solid black;
  border-collapse: collapse;">
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2" rowspan="2">Likelihood Table</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2">Play</th>
        <th rowspan="2" style="border: 1px solid black;
  border-collapse: collapse;"></th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Yes</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">No</th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" rowspan="3">Temperature</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">Hot</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">2/9</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">2/5</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">4/14</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Mild</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">4/9</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">2/5</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">6/14</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Cool</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">3/9</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">1/5</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">4/14</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2"></th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">9/14</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">5/14</td>
    </tr>
</table>

__Humidity:__
<table style="border: 1px solid black;
  border-collapse: collapse;">
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2" rowspan="2">Likelihood Table</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2">Play</th>
        <th rowspan="2" style="border: 1px solid black;
  border-collapse: collapse;"></th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Yes</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">No</th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" rowspan="2">Humidity</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">High</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">3/9</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">4/5</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">7/14</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Normal</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">6/9</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">1/5</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">7/14</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2"></th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">9/14</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">5/14</td>
    </tr>
</table>

__Windy:__
<table style="border: 1px solid black;
  border-collapse: collapse;">
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2" rowspan="2">Likelihood Table</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2">Play</th>
        <th rowspan="2" style="border: 1px solid black;
  border-collapse: collapse;"></th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">Yes</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">No</th>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" rowspan="2">Windy</th>
        <th style="border: 1px solid black;
  border-collapse: collapse;">False</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">6/9</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">2/5</td>        
        <td style="border: 1px solid black;
  border-collapse: collapse;">8/14</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;">True</th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">3/9</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">3/5</td>        
        <td style="border: 1px solid black;
  border-collapse: collapse;">6/14</td>
    </tr>
    <tr style="border: 1px solid black;
  border-collapse: collapse;">
        <th style="border: 1px solid black;
  border-collapse: collapse;" colspan="2"></th>
        <td style="border: 1px solid black;
  border-collapse: collapse;">9/14</td>
        <td style="border: 1px solid black;
  border-collapse: collapse;">5/14</td>
    </tr>
</table>

#### Evaluating Posterior

Likelihood of "Yes" on that day  <br>
Where $B$ equals:
- Outlook = Sunny
- Temperature = Hot
- Humidity = Normal
- Windy = False<br>

Let $A = Yes$ <br>

$\qquad p(A|B) = p(Yes|Outlook = Sunny, Temperature = Hot, Humidity = Normal, Windy = False)$<br>
$\qquad \propto p(Outlook=Sunny|Yes)\times p(Temperature=Hot|Yes)\times p(Humidity=Normal|Yes)\times p(Windy=False|Yes)\times p(Yes)$<br>
$\qquad = 3/9 \times 2/9 \times 6/9 \times 6/9 \times 9/14$<br>
$\qquad = 0.0212$<br><br>

Likelihood of "No" on that day  <br>
Where $B$ equals:
- Outlook = Sunny
- Temperature = Hot
- Humidity = Normal
- Windy = False<br>

Let $A = No$ <br>

$\qquad p(A|B) = p(No|Outlook = Sunny, Temperature = Hot, Humidity = Normal, Windy = False)$<br>
$\qquad \propto p(Outlook=Sunny|No)\times p(Temperature=Hot|No)\times p(Humidity=Normal|No)\times p(Windy=False|No)\times p(No)$<br>
$\qquad = 2/5 \times 2/5 \times 1/5 \times 2/5 \times 5/14$<br>
$\qquad = 0.0046$<br><br>

Normalizing the two scores so that they sum to 1 gives the probability of playing on that day: <br>
$\qquad p(Yes|B) = 0.0212\:/\:(0.0212+0.0046) = 0.82$<br><br>
and the probability of not playing: <br>
$\qquad p(No|B) = 0.0046\:/\:(0.0212+0.0046) = 0.18$<br>

___Our model predicts that there is an 82% chance that there will be a game on that day.___
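
The arithmetic of this worked example can be checked with exact fractions. This is a small verification sketch; the variable names are chosen for the example and the fractions are read off the likelihood tables for (Sunny, Hot, Normal, False).

```python
from fractions import Fraction as F

# p(Sunny|c) * p(Hot|c) * p(Normal|c) * p(Windy=False|c) * p(c)
score_yes = F(3, 9) * F(2, 9) * F(6, 9) * F(6, 9) * F(9, 14)
score_no = F(2, 5) * F(2, 5) * F(1, 5) * F(2, 5) * F(5, 14)

# Normalize so the two posteriors sum to 1
evidence = score_yes + score_no
p_yes = score_yes / evidence
p_no = score_no / evidence

print(float(score_yes), float(score_no))
print(round(float(p_yes), 2), round(float(p_no), 2))
```

Working with `Fraction` avoids rounding in the intermediate steps, so the final 0.82 and 0.18 are not affected by rounding the scores to four decimals first.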

# Implementing Naive Bayes Classifier with Python

We now implement the naive Bayes classifier from scratch in Python and then apply it to detect spam emails.

In [1]:
# Importing modules, glob and os, in order to find all the .txt email files and initialize variables 
# keeping text data and labels
import glob
import os
emails, labels = [], []

In [2]:
# Loading the spam email files
filepath = 'enron1/spam'
for filename in glob.glob(os.path.join(filepath, '*.txt')):
    with open(filename, 'r', encoding = "ISO-8859-1") as infile:
        emails.append(infile.read())
        labels.append(1)

In [3]:
# Loading the ham email files
filepath = 'enron1/ham'
for filename in glob.glob(os.path.join(filepath, '*.txt')):
    with open(filename, 'r', encoding = "ISO-8859-1") as infile:
        emails.append(infile.read())
        labels.append(0)

In [4]:
len(emails)

5163

In [5]:
len(labels)

5163

## Data preprocessing

We filter out names using the Names corpus from the NLTK module and lemmatize the emails.

In [6]:
# Initialize the Names corpus and a lemmatizer
# (run nltk.download('names') and nltk.download('wordnet') once beforehand)
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer
def letters_only(astr):
    return astr.isalpha()
all_names = set(names.words())
lemmatizer = WordNetLemmatizer()

In [7]:
# Text cleaning function that filters each doc and lemmatizes it
def clean_test(docs):
    cleaned_docs = []
    for doc in docs:
        cleaned_docs.append(
        ' '.join([lemmatizer.lemmatize(word.lower())
                 for word in doc.split()
                 if letters_only(word)
                 and word not in all_names]))
    return cleaned_docs

In [8]:
cleaned_emails = clean_test(emails)

In [9]:
cleaned_emails[0]

'weekend entertainment alpha male plus the only multiple orgasm supplement for men prevent premature become the ultimate sex machine multiple orgasm with no erection loss your easy to use solution is here http hfg biz alpha utopia link below is for that people who dislike adv http hfg biz alpha o html'

## Feature extraction

The next step is to remove stop words and extract the features, which are the term frequencies from the cleaned text data:

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words="english", max_features=500)

Here the max_features parameter is set to _500_, so only the 500 most frequent terms are considered. We can tweak this parameter later on in order to achieve better accuracy.

The vectorizer turns the documents into a matrix of term counts in which each row is the sparse term-frequency vector of one document (here, one email).

In [11]:
term_docs = cv.fit_transform(cleaned_emails)
print(term_docs[0])

  (0, 250)	1
  (0, 465)	1
  (0, 402)	1
  (0, 197)	2
  (0, 239)	1
  (0, 319)	1
  (0, 196)	1
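
To make the `(row, column)  count` output above more concrete, here is a standard-library sketch of the same idea on two toy documents. It is only an illustration of term counting, not how `CountVectorizer` is implemented; it skips stop-word removal and the `max_features` cut-off, and the names `docs`, `vocabulary` and `to_sparse_row` are hypothetical.

```python
from collections import Counter

docs = ["free cash free prize", "meeting agenda for monday"]  # toy documents

# Vocabulary: every distinct term, sorted so each term gets a stable column index
vocabulary = {
    term: idx
    for idx, term in enumerate(sorted({w for d in docs for w in d.split()}))
}

def to_sparse_row(doc):
    """Map a document to {column index: count}, ignoring terms outside the vocabulary."""
    counts = Counter(w for w in doc.split() if w in vocabulary)
    return {vocabulary[w]: c for w, c in counts.items()}

rows = [to_sparse_row(d) for d in docs]
for i, row in enumerate(rows):
    for col, count in sorted(row.items()):
        print((i, col), count)
```

Only non-zero counts are stored, which is what makes the representation sparse: a row with 500 columns usually needs just a handful of entries.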


In [12]:
# get_feature_names() was removed in scikit-learn 1.2; use it only on versions < 1.0
feature_names = list(cv.get_feature_names_out())
feature_names[250]

'loss'

In [13]:
feature_names[196]

'html'

In [14]:
feature_mapping = cv.vocabulary_

With the feature matrix term_docs just generated, we can now build and train our naive Bayes model.

Starting with the prior, we first group the data by label:

In [15]:
# Group the data by label
def get_label_index(labels):
    from collections import defaultdict
    label_index = defaultdict(list)
    for index, label in enumerate(labels):
        label_index[label].append(index)
    return label_index
label_index = get_label_index(labels)

The label_index looks like $\{ 1 : [ 0, 1, 2, 3,...,], 0 :[ 1500, 1501, 1502, 1503,...,]\}$ where training sample indices are grouped by class. With this, we calculate the prior:

In [16]:
# prior
def get_prior(label_index):
    """ Compute prior based on training samples
    Args:
        label_index (grouped sample indices by class)
    Returns:
        dictionary, with class label as key, corresponding
        prior as the value
    """
    prior = {label: len(index) for label, index in label_index.items()}
    total_count = sum(prior.values())
    for label in prior:
        prior[label] /= float(total_count)
    return prior

In [17]:
prior = get_prior(label_index)

In [18]:
prior

{1: 0.2905287623474724, 0: 0.7094712376525276}

In [19]:
# likelihood
import numpy as np
def get_likelihood(term_document_matrix, label_index, smoothing=0):
    """ Compute likelihood based on training examples
    Args:
        term_document_matrix (sparse matrix)
        label_index (grouped sample indices by class)
        smoothing (integer, additive Laplace smoothing)
    Returns:
        dictionary, with class label as key, corresponding
        conditional probability P(feature|class) vector as
        value
    """
    likelihood = {}
    for label, index in label_index.items():
        likelihood[label] = term_document_matrix[index, :].sum(axis=0) + smoothing
        likelihood[label] = np.asarray(likelihood[label])[0]
        total_count = likelihood[label].sum()
        likelihood[label] = likelihood[label] / float(total_count)
    return likelihood
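
To see what the `smoothing` argument does, consider the Outlook counts for the "No" class from the cricket example, where Overcast never occurs. The sketch below applies the same formula as `get_likelihood` (add the smoothing value to every count, then renormalize) to a plain dictionary; the helper name `smoothed_probs` is an assumption made for this example.

```python
def smoothed_probs(counts, smoothing=1):
    """Additive smoothing: (count + smoothing) / (total + smoothing * number of values)."""
    total = sum(counts.values()) + smoothing * len(counts)
    return {value: (count + smoothing) / total for value, count in counts.items()}

# Outlook counts on the five "No" days of the cricket example
outlook_given_no = {"Sunny": 2, "Overcast": 0, "Rainy": 3}

raw = smoothed_probs(outlook_given_no, smoothing=0)
smooth = smoothed_probs(outlook_given_no, smoothing=1)
print(raw)
print(smooth)
```

Without smoothing, a single zero likelihood forces the whole product for that class to zero (and its logarithm is undefined); with smoothing, unseen values get a small but non-zero probability.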

The smoothing parameter is set to 1 here. It can also be 0 for no smoothing, or any other positive value; in practice it is tuned for the best classification performance:

In [20]:
smoothing = 1
likelihood = get_likelihood(term_docs, label_index, smoothing)
len(likelihood[0])

500

likelihood[0] is the conditional probability $P(feature | legitimate)$ vector of length 500 (500 features) for the legitimate class. For example, the following are the probabilities for the first five features:

In [21]:
likelihood[0][:5]

array([0.00108883, 0.0009604 , 0.00088223, 0.00084314, 0.00010051])

Similarly, here are the first five elements of the conditional probability $P(feature|spam)$ vector:

In [22]:
likelihood[1][:5]

array([0.00108953, 0.00141844, 0.00456368, 0.00053448, 0.0042142 ])

We can check the corresponding terms:

In [23]:
feature_names[:5]

['able', 'access', 'account', 'accounting', 'act']

With prior and likelihood ready, we can now compute the posterior. Multiplying hundreds or thousands of small conditional probabilities can cause floating-point underflow. So, instead of multiplying, we sum their natural logarithms and then convert the result back with the exponential function:

In [24]:
def get_posterior(term_document_matrix, prior, likelihood):
    """ Compute posterior of testing samples, based on prior
        and likelihood
    Args:
        term_document_matrix (sparse matrix)
        prior (dictionary, with class label as key,
        corresponding prior as the value)
        likelihood (dictionary, with class label as key,
        corresponding conditional probability vector as value)
    Returns:
        list of dictionaries (one per sample), each with class
        label as key and corresponding posterior as value
    """
    num_docs = term_document_matrix.shape[0]
    posteriors = []
    for i in range(num_docs):
        # posterior is proportional to prior * likelihood
        # = exp(log(prior * likelihood))
        # = exp(log(prior) + log(likelihood))
        posterior = {key: np.log(prior_label)
                    for key, prior_label in prior.items()}
        for label, likelihood_label in likelihood.items():
            term_document_vector = term_document_matrix.getrow(i)
            counts = term_document_vector.data
            indices = term_document_vector.indices
            for count, index in zip(counts, indices):
                posterior[label] += np.log(likelihood_label[index]) * count
        min_log_posterior = min(posterior.values())
        for label in posterior:
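            try:
                # np.longdouble is the portable name for extended precision
                # (np.float128 does not exist on every platform)
                posterior[label] = np.exp(posterior[label] - min_log_posterior, dtype=np.longdouble)
            except (OverflowError, FloatingPointError):
                # if the exponent is excessively large, assign infinity
                posterior[label] = float('inf')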
        # normalizing so that all sums up to 1
        sum_posterior = sum(posterior.values())
        for label in posterior:
            if posterior[label] == float('inf'):
                posterior[label] = 1.0
            else:
                posterior[label] /= sum_posterior
        posteriors.append(posterior.copy())
    return posteriors
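
The reason for working in log space can be seen in a small standard-library experiment. The sketch also normalizes the log scores by subtracting the maximum (the log-sum-exp trick), which is a different choice from the minimum used in `get_posterior` above; the probabilities and the helper name `normalize_log_scores` are hypothetical.

```python
import math

probs = [1e-4] * 100  # 100 hypothetical per-term likelihoods

naive_product = math.prod(probs)             # 1e-400 is below float64 range: becomes 0.0
log_score = sum(math.log(p) for p in probs)  # about -921, perfectly representable

def normalize_log_scores(log_scores):
    """Turn per-class log scores into probabilities that sum to 1.

    Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow.
    """
    m = max(log_scores.values())
    exp_scores = {label: math.exp(v - m) for label, v in log_scores.items()}
    total = sum(exp_scores.values())
    return {label: v / total for label, v in exp_scores.items()}

# Two classes whose log scores differ by 5
posterior = normalize_log_scores({1: log_score, 0: log_score - 5.0})
print(naive_product, log_score)
print(posterior)
```

Because only the differences between log scores matter after normalization, shifting all of them by the same constant does not change the resulting probabilities.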

The prediction function is finished. Let's verify our algorithm.

In [25]:
emails_test = ['''Subject: make $ 171
hello ,
we sent you an email a while ago , because you now qualify for a new mortgage .
you could get $ 300 , 000 for as little as $ 700 a month !
bad credit is no problem , you can pull cash out or refinance .
please click on this link for free consultation by a mortgage broker :
http : / / www . hgkkdc . com /
best regards ,
jamie higgs
no thanks : http : / / www . hgkkdc . com / rl
- - - - system information - - - -
lesser describe be problems available market actually non - java
services think collating cpp doesn ' t known c : i - 025 :
zh parameter implies in preferences behaviors used does''']

In [26]:
cleaned_test = []
cleaned_test = clean_test(emails_test)
term_docs_test = cv.transform(cleaned_test)
posterior = get_posterior(term_docs_test, prior, likelihood)
print(posterior)

[{1: 0.99999999999502214203, 0: 4.977857950431403897e-12}]


The algorithm classifies the email as spam, with a posterior probability above 99.99%.

Let's split our data to training samples and test samples.

We use the train_test_split function from scikit-learn to do the random splitting. Note that it preserves the percentage of samples for each class only when the stratify argument is passed, which is not done here.

In [27]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(cleaned_emails, labels, test_size=0.33, random_state=42)

Assigning a fixed random_state (here, 42) during experiments guarantees that the same training and testing sets are generated every time the program runs.
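
If matching class proportions between the two sets matters, scikit-learn's `stratify` parameter can be used. The sketch below runs on made-up toy labels (not the email data) to show the effect; applying it to the cell above would produce a different split, and therefore different numbers from those reported in the rest of this notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 70 legitimate (0) and 30 spam (1) samples.
toy_labels = [0] * 70 + [1] * 30
toy_docs = ['doc {}'.format(i) for i in range(len(toy_labels))]

# test_size is given as an absolute number of samples here.
docs_train, docs_test, y_train, y_test = train_test_split(
    toy_docs, toy_labels, test_size=30, random_state=42, stratify=toy_labels)

# With stratify, both subsets keep the 70/30 class ratio.
print(Counter(y_train))
print(Counter(y_test))
```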

In [28]:
len(X_train), len(Y_train)

(3459, 3459)

In [29]:
len(X_test), len(Y_test)

(1704, 1704)

In [30]:
term_docs_train = cv.fit_transform(X_train)
label_index = get_label_index(Y_train)
prior = get_prior(label_index)
likelihood = get_likelihood(term_docs_train, label_index, smoothing)

Predicting the posterior of the testing/new dataset:

In [31]:
term_docs_test = cv.transform(X_test)
posterior = get_posterior(term_docs_test, prior, likelihood)

In [32]:
term_docs_test

<1704x500 sparse matrix of type '<class 'numpy.int64'>'
	with 36711 stored elements in Compressed Sparse Row format>

Evaluating the model's performance via the proportion of correct predictions:

In [33]:
correct = 0.0
for pred, actual in zip(posterior, Y_test):
    if actual == 1:
        if pred[1] >= 0.5:
            correct += 1
    elif pred[0] > 0.5:
        correct += 1
print('The accuracy on {0} testing samples is: {1:.1f}%'.format(len(Y_test), correct/len(Y_test)*100))

The accuracy on 1704 testing samples is: 91.1%


The naive Bayes classifier we just developed from scratch correctly classifies 91% of emails!

# Using MultinomialNB

scikit-learn provides the MultinomialNB class, which can be used for naive Bayes classification:

In [34]:
from sklearn.naive_bayes import MultinomialNB

We initialize a model with a smoothing factor (alpha in scikit-learn) and a prior learned from the training data (fit_prior=True in scikit-learn):

In [35]:
clf = MultinomialNB(alpha=1.0, fit_prior=True)

Training the classifier with the fit method:

In [36]:
clf.fit(term_docs_train, Y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Obtaining prediction results with the predict_proba method:

In [37]:
prediction_prob = clf.predict_proba(term_docs_test)
prediction_prob[0:10]

array([[1.00000000e+000, 3.36246989e-257],
       [9.97892386e-001, 2.10761350e-003],
       [1.00000000e+000, 2.63274791e-039],
       [5.54127654e-021, 1.00000000e+000],
       [9.99999969e-001, 3.12245125e-008],
       [1.00000000e+000, 1.87687745e-015],
       [1.92912323e-019, 1.00000000e+000],
       [2.24108715e-002, 9.77589128e-001],
       [1.00000000e+000, 7.35124116e-019],
       [9.99999981e-001, 1.93413647e-008]])

Directly acquiring the predicted class values with the predict method:

In [38]:
prediction = clf.predict(term_docs_test)
prediction[:10]

array([0, 0, 0, 1, 0, 0, 1, 1, 0, 0])
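
For a fitted `MultinomialNB`, `predict` returns the class with the highest posterior, so the same labels can be recovered from `predict_proba` by taking an argmax and mapping it through `classes_`. A small sketch on made-up count data (not the email term matrix):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy term-count matrix (assumed for illustration): rows are documents,
# columns are counts of three hypothetical terms.
X_toy = np.array([[3, 0, 1],
                  [2, 0, 0],
                  [0, 4, 1],
                  [0, 3, 2]])
y_toy = np.array([0, 0, 1, 1])

toy_clf = MultinomialNB(alpha=1.0, fit_prior=True)
toy_clf.fit(X_toy, y_toy)

# Columns of predict_proba follow the order of toy_clf.classes_.
proba = toy_clf.predict_proba(X_toy)
from_proba = toy_clf.classes_[np.argmax(proba, axis=1)]

print(from_proba)
print(toy_clf.predict(X_toy))
```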

Measuring the accuracy by calling the score method:

In [39]:
accuracy = clf.score(term_docs_test, Y_test)
print('The accuracy using MultinomialNB is: {0:.1f}%'.format(accuracy*100))

The accuracy using MultinomialNB is: 91.1%


## Performance evaluation

So far we have evaluated the classifier by its prediction accuracy. Beyond accuracy, several other measurements give more insight and are less affected by class imbalance.

__Confusion matrix__ summarizes testing instances by their predicted values and true values, presented as a contingency table:

<table>
  <tr>
    <th></th>
    <th></th>
    <th colspan="2">Predicted</th>
  </tr>
  <tr>
    <td></td>
    <td></td>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td>TN</td>
    <td>FP</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td>FN</td>
    <td>TP</td>
  </tr>
</table>

Here TN = true negative, FP = false positive, FN = false negative, and TP = true positive.

Using the scikit-learn confusion_matrix function, we compute the confusion matrix of our naive Bayes classifier:

In [40]:
from sklearn.metrics import confusion_matrix
confusion_matrix(Y_test, prediction, labels=[0,1])

array([[1085,  103],
       [  48,  468]])

The confusion matrix shows 103 false positive cases (where a legitimate email is misclassified as spam) and 48 false negative cases (where a spam email goes undetected).

The classification accuracy is just the proportion of all cases that are classified correctly (true negatives plus true positives):

$ \frac{TN+TP}{TN+TP+FP+FN} = \frac{1085+468}{1704} = 91.1\%$

Precision measures the fraction of positive calls that are correct, that is 

$\frac{TP}{TP+FP} = \frac{468}{468+103} = 0.820$

Recall measures the fraction of actual positives that are correctly identified:

$ \frac{TP}{TP+FN}= \frac{468}{468+48} = 0.907 $ 

Recall is also called true positive rate.

The F1 score combines precision and recall into a single number, their harmonic mean: $F1 = 2 \times \frac{precision \times recall}{precision + recall}$

Let's compute these three measurements:

In [41]:
from sklearn.metrics import precision_score, recall_score, f1_score
precision_score(Y_test, prediction, pos_label=1)

0.819614711033275

In [42]:
recall_score(Y_test, prediction, pos_label=1)

0.9069767441860465

In [43]:
f1_score(Y_test, prediction, pos_label=1)

0.861085556577737
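
As a cross-check, the same values can be reproduced by hand from the four counts in the confusion matrix above. This is plain arithmetic on those counts, not a replacement for the scikit-learn functions:

```python
# Counts taken from the confusion matrix above (labels=[0, 1]).
tn, fp, fn, tp = 1085, 103, 48, 468

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)    # fraction of predicted spam that is spam
recall = tp / (tp + fn)       # fraction of actual spam that is detected
f1 = 2 * precision * recall / (precision + recall)

print('accuracy:  {:.4f}'.format(accuracy))
print('precision: {:.4f}'.format(precision))
print('recall:    {:.4f}'.format(recall))
print('f1:        {:.4f}'.format(f1))
```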

To obtain the precision, recall, and f1 score for each class, the quickest way is to call the classification_report function:

In [45]:
from sklearn.metrics import classification_report
report = classification_report(Y_test, prediction)
print(report)

              precision    recall  f1-score   support

           0       0.96      0.91      0.93      1188
           1       0.82      0.91      0.86       516

    accuracy                           0.91      1704
   macro avg       0.89      0.91      0.90      1704
weighted avg       0.92      0.91      0.91      1704
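
The last two rows of the report are averages of the per-class rows: `macro avg` is the unweighted mean over the classes, while `weighted avg` weights each class by its support. A rough check of the precision column, again using the counts from the confusion matrix above:

```python
# Counts taken from the confusion matrix above (labels=[0, 1]).
tn, fp, fn, tp = 1085, 103, 48, 468

precision_by_class = {
    0: tn / (tn + fn),   # of the emails predicted legitimate, fraction that are
    1: tp / (tp + fp),   # of the emails predicted spam, fraction that are
}
support = {0: tn + fp, 1: fn + tp}   # number of true samples per class
total = sum(support.values())

macro_precision = sum(precision_by_class.values()) / len(precision_by_class)
weighted_precision = sum(
    precision_by_class[c] * support[c] for c in support) / total

print(round(macro_precision, 2), round(weighted_precision, 2))
```

The weighted average sits closer to the class 0 value because class 0 has more than twice the support of class 1.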

