# Naive Bayes

![thomas_bayes.png](attachment:thomas_bayes.png)

## Bayes Theorem Recap

<p style="line-height:1.75;font-size:16px">
Earlier in the course we talked about Bayes' theorem and how to use it to calculate conditional probabilities. Let's remind ourselves how it works:

![bayes_theorem_explanation.jpeg](attachment:bayes_theorem_explanation.jpeg)

<p style="line-height:1.75;font-size:16px">
So, for example:
<center>
<h3>
$P(Fire\mid Smoke)=\frac{P(Smoke\mid Fire)\cdot P(Fire)}{P(Smoke)}$
</h3>
</center>
<p style="line-height:1.75;font-size:16px">
Which can also be written as:
<center>
<h3>
$P(Fire\mid Smoke)=\frac{P(Smoke\mid Fire)\cdot P(Fire)}{P(Smoke\mid Fire)\cdot P(Fire) + P(Smoke\mid No~Fire)\cdot P(No~Fire)}$
</h3>
</center>
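
<p style="line-height:1.75;font-size:16px">
To make the expanded form concrete, here is a minimal sketch in Python. The three input probabilities are hypothetical numbers chosen only for illustration; they do not come from real data.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(A|B) from Bayes' theorem, with P(B) expanded by total probability."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical values, for illustration only
p_fire = 0.01                 # P(Fire)
p_smoke_given_fire = 0.9      # P(Smoke | Fire)
p_smoke_given_no_fire = 0.1   # P(Smoke | No Fire)

p_fire_given_smoke = posterior(p_fire, p_smoke_given_fire, p_smoke_given_no_fire)
print(round(p_fire_given_smoke, 4))
```

Even with a high $P(Smoke\mid Fire)$, the posterior stays small here because the assumed prior $P(Fire)$ is small.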

<div style="line-height:1.75;background:#3464a2;padding-left:20px;padding-top:5px;padding-bottom:5px;border-radius:5px 5px 0px 0px">
<i class="fa fa-question" style="font-size:40px;color:#e6f1ff;"></i>
</div>
<div>
<p style="line-height:1.75;font-size:16px;background:#e6f1ff;padding:20px;border-radius:0px 0px 5px 5px">
You are on a game show, hosted by Monty Hall, and are asked to choose between three doors. Behind one door there is a car; behind each of the other two there is a goat. You choose a door. The host, Monty Hall, picks one of the other doors, which he knows has a goat behind it, and opens it, showing you the goat. You know, by the rules of the game, that Monty will always reveal a goat. Monty then asks whether you would like to switch your choice of door to the other remaining door. Assuming you would rather have a car than a goat, do you choose to switch or not to switch?</p></div>

![monty_hall.png](attachment:monty_hall.png)

<p style="line-height:1.75;font-size:16px">
Let's use Bayes' theorem to try and solve this problem. Assume that we chose door 1 and Monty opened door 3. There are two options now:<br>
1\. The car is behind door number 1 and we shouldn't switch.<br>
2\. The car is behind door number 2 and we should switch.<br>
<p style="line-height:1.75;font-size:16px">
In terms of Bayes' theorem, we have two posteriors we want to calculate:<br>
1\. $P(door=1\mid opened=3)$<br>
2\. $P(door=2\mid opened=3)$<br>
<p style="line-height:1.75;font-size:16px">
<b>Prior: P(A)</b> <br>
The probability of any door being correct before we pick a door is 1/3. Prizes are randomly arranged behind doors and we have no other information. So the prior, $P(A)$, of any door being correct is 1/3:<br>
1\. $P(door=1)=\frac{1}{3}$<br>
2\. $P(door=2)=\frac{1}{3}$<br>

<p style="line-height:1.75;font-size:16px">
<b>Likelihood: P(B|A)</b> <br>
If the car is actually behind door 1, then Monty can open door 2 or door 3, so the probability of him opening either one is 50%. If the car is actually behind door 2, then Monty can only open door 3: he cannot open door 1, the door we picked, and he cannot open door 2 because it has the car behind it.<br>
1\. $P(opened=3\mid door=1)=\frac{1}{2}$<br>
2\. $P(opened=3\mid door=2)=1$<br>

<p style="line-height:1.75;font-size:16px">
<b>P(B)</b> <br>
We saw earlier that we can write $P(B)$ as:
<center>
<h3>
$P(B)=P(B\mid A)\cdot P(A) + P(B\mid Not~A)\cdot P(Not~A)$
</h3>
</center>
<p style="line-height:1.75;font-size:16px">
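Putting it into our terms, "not door 1" splits into door 2 and door 3. Monty never opens the door hiding the car, so the door 3 term is zero:
<center>
<h3>
$P(opened=3)=P(opened=3\mid door=1)\cdot P(door=1) + P(opened=3\mid door=2)\cdot P(door=2) + P(opened=3\mid door=3)\cdot P(door=3)=\frac{1}{2}\cdot \frac{1}{3} + 1\cdot \frac{1}{3} + 0\cdot \frac{1}{3}=\frac{1}{2}$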
</h3>
</center>

<p style="line-height:1.75;font-size:16px">
<b>Posterior: P(A|B)</b> <br>
All that's left now is to plug in all of the values:<br>
1\. $P(door=1\mid opened=3)=\frac{P(opened=3|door=1)P(door=1)}{P(opened=3)}=\frac{\frac{1}{2}\cdot \frac{1}{3}}{\frac{1}{2}}=\frac{1}{3}$<br>
2\. $P(door=2\mid opened=3)=\frac{P(opened=3|door=2)P(door=2)}{P(opened=3)}=\frac{1\cdot \frac{1}{3}}{\frac{1}{2}}=\frac{2}{3}$
<p style="line-height:1.75;font-size:16px">
Switching doors doubles our chances of winning!
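
<p style="line-height:1.75;font-size:16px">
The same calculation can be checked with exact fractions. This is a small sketch that assumes, as in the text, that we picked door 1 and Monty opened door 3; the variable names are ours.

```python
from fractions import Fraction

# Prior: the car is equally likely to be behind any door
prior = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

# Likelihood: P(opened=3 | car behind door d), given that we picked door 1
likelihood = {1: Fraction(1, 2), 2: Fraction(1), 3: Fraction(0)}

# P(opened=3), summed over every possible location of the car
evidence = sum(likelihood[d] * prior[d] for d in prior)

# Posterior: P(car behind door d | opened=3)
posterior = {d: likelihood[d] * prior[d] / evidence for d in prior}

print(evidence, posterior[1], posterior[2])  # 1/2 1/3 2/3
```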

## The Model

<p style="line-height:1.75;font-size:16px">
Naive Bayes is a supervised learning algorithm for classification, so the task is to find the class of an observation (data point) given the values of its features. A naive Bayes classifier calculates the probability of each class given a set of feature values:
<center>
<h3>
$p(y_i\mid x_1, x_2, ...,~x_n)=\frac{p(x_1,~x_2,~...,~x_n\mid y_i)\cdot p(y_i)}{p(x_1,~x_2,~...,~x_n)}$
</h3>
</center>
<p style="line-height:1.75;font-size:16px">
It is infeasible to estimate $p(x_1,~x_2,~...,~x_n\mid y_i)$ directly, as we would need a huge dataset to estimate the probability of every possible combination of feature values. That's why we assume that the features are independent of one another given the class, which enables us to use:
<center>
<h3>
$p(x_1,~x_2,~...,~x_n\mid y_i)=p(x_1\mid y_i)\cdot p(x_2\mid y_i)\cdot\cdot\cdot p(x_n\mid y_i)$
</h3>
</center>
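
<p style="line-height:1.75;font-size:16px">
To get a feel for how much the independence assumption saves, the sketch below counts the probabilities we would have to estimate per class. It assumes, purely for illustration, that every feature is binary; the function names are ours.

```python
def joint_parameters(n_features):
    # A full joint table over n binary features has 2**n cells;
    # one of them is fixed because the probabilities must sum to 1
    return 2 ** n_features - 1

def naive_parameters(n_features):
    # Under the naive assumption we need one probability per binary feature
    return n_features

for n in (5, 10, 30):
    print(n, joint_parameters(n), naive_parameters(n))
```

With 30 binary features the full joint distribution needs over a billion probabilities per class, while the naive factorization needs 30.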

## Naive Bayes Classification Example

<p style="line-height:1.75;font-size:16px">
You're starting your own software company BayesBook which will also provide email services. Since you don't want all of your users to suffer from spam, you decide to create a spam filtering mechanism. You have a dataset of 12 emails, 4 of them marked as 'spam' and 8 marked as 'normal'. You decide to count the different words in them to get a better sense of how you could filter spam.

<p style="line-height:1.75;font-size:16px">
<table>
<tr>
    <th>
        Word
    </th>
    <th>
        Count Normal
    </th>
    <th>
        Count Spam
    </th>
</tr>
<tr>
    <th>
        Dear
    </th>
    <th>
        8
    </th>
    <th>
        2
    </th>
</tr>
<tr>
    <th>
        Friend
    </th>
    <th>
        5
    </th>
    <th>
        1
    </th>
</tr>
<tr>
    <th>
        Lunch
    </th>
    <th>
        3
    </th>
    <th>
        0
    </th>
</tr>
<tr>
    <th>
        Money
    </th>
    <th>
        1
    </th>
    <th>
        4
    </th>
</tr>
</table>

<p style="line-height:1.75;font-size:16px">
The next step is to turn these counts into probabilities. Since we know that we want to distinguish between normal messages and spam messages, we'll calculate the probabilities of words <b>given</b> the class. For instance:
<center>
<h3>
$P(Dear\mid Normal)=\frac{8}{17}=0.47$
</h3>
</center>
<center>
<h3>
$P(Dear\mid Spam)=\frac{2}{7}=0.29$
</h3>
</center>
<p style="line-height:1.75;font-size:16px">
After calculating all of the values you get:

<p style="line-height:1.75;font-size:16px">
<table>
<tr>
    <th>
        Word
    </th>
    <th>
        Probability Normal
    </th>
    <th>
        Probability Spam
    </th>
</tr>
<tr>
    <th>
        Dear
    </th>
    <th>
        0.47
    </th>
    <th>
        0.29
    </th>
</tr>
<tr>
    <th>
        Friend
    </th>
    <th>
        0.29
    </th>
    <th>
        0.14
    </th>
</tr>
<tr>
    <th>
        Lunch
    </th>
    <th>
        0.18
    </th>
    <th>
        0
    </th>
</tr>
<tr>
    <th>
        Money
    </th>
    <th>
        0.06
    </th>
    <th>
        0.57
    </th>
</tr>
</table>

<p style="line-height:1.75;font-size:16px">
What we've calculated here is the likelihood part ($P(B\mid A)$) of Bayes' theorem. We'll soon see how this ties in to the problem we're trying to solve. For now, let's assume you got an email with the words 'Dear Friend' and you want to say whether it's a normal message or spam. We'll start by estimating our prior probability $P(Normal)$. This can be done by looking at all of our training data and calculating the proportion of normal messages:
<center>
<h3>
$P(Normal)=\frac{8}{8+4}=0.67$
</h3>
</center>
<p style="line-height:1.75;font-size:16px">
Next, we'll multiply this prior by the probability that the word 'Dear' appears in the message (given that it is normal) and by the probability that the word 'Friend' appears in the message (again, given that it's normal):
<center>
<h3>
$P(Normal\mid Dear~Friend)\propto P(Normal)\cdot P(Dear\mid Normal)\cdot P(Friend\mid Normal)=0.67\cdot 0.47\cdot 0.29=0.09$
</h3>
</center>
<p style="line-height:1.75;font-size:16px">
This product is the numerator of Bayes' theorem. It is not the posterior $P(Normal\mid Dear~Friend)$ itself, but it is proportional to it, because the denominator $P(Dear~Friend)$ is the same for every class. This means that the higher the score, the more likely the message is to be normal.<br>
<p style="line-height:1.75;font-size:16px">
Now, let's repeat this process, only this time we'll treat the message as if it were spam:
<center>
<h3>
$P(Spam)=\frac{4}{8+4}=0.33$
</h3>
</center>
<center>
<h3>
$P(Spam\mid Dear~Friend)\propto P(Spam)\cdot P(Dear\mid Spam)\cdot P(Friend\mid Spam)=0.33\cdot 0.29\cdot 0.14=0.01$
</h3>
</center>
<p style="line-height:1.75;font-size:16px">
Again, this score is proportional to $P(Spam\mid Dear~Friend)$, with the same denominator as before.<br>
<p style="line-height:1.75;font-size:16px">
Since the spam score (0.01) is lower than the normal score (0.09), we'll classify this message as a normal message.
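
<p style="line-height:1.75;font-size:16px">
The steps above (prior times per-word likelihoods, then pick the larger score) can be sketched in a few lines of Python. The counts come from the table; the function and variable names are ours, and the code uses unrounded fractions, so the scores come out at roughly 0.092 and 0.014 rather than the rounded 0.09 and 0.01.

```python
# Word counts per class, taken from the table above
counts = {
    "normal": {"dear": 8, "friend": 5, "lunch": 3, "money": 1},
    "spam": {"dear": 2, "friend": 1, "lunch": 0, "money": 4},
}
# Number of training messages per class
n_messages = {"normal": 8, "spam": 4}

def score(words, label):
    """Prior times the per-word likelihoods (proportional to the posterior)."""
    total_words = sum(counts[label].values())
    result = n_messages[label] / sum(n_messages.values())  # prior P(label)
    for word in words:
        result *= counts[label][word] / total_words  # P(word | label)
    return result

message = ["dear", "friend"]
scores = {label: score(message, label) for label in counts}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```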


<div style="line-height:1.75;background:#1e7b1e;padding-left:20px;padding-top:5px;padding-bottom:5px;border-radius:5px 5px 0px 0px"><i class="fa fa-pencil" style="font-size:40px;color:#c1f0c1;"></i>
</div>
<div>
<p style="line-height:1.75;font-size:16px;background:#c1f0c1;padding:20px;border-radius:0px 0px 5px 5px">
BayesBook is doing great and you are nearing a unicorn valuation. However, you are getting many complaints about your spam filter. You decide to check things for yourself so you extract 3 random emails from your database:<br>
1\. Lunch Dear Friend<br>
2\. Friend Money Money Money Money<br>
3\. Lunch Money Money Money Money<br>
Perform the same steps we took earlier for each of these messages. Before starting, take a minute to think about which of these should be classified as spam and which should not. Do the results match your expectations?</p>
</div>

<h3>1</h3>
<center>
<h3>
$P(Spam\mid Lunch~Dear~Friend)\propto P(Spam)\cdot P(Lunch\mid Spam)\cdot P(Dear\mid Spam)\cdot P(Friend\mid Spam)=0.33\cdot 0 \cdot 0.29\cdot 0.14=0$
</h3>
</center>
<center>
<h3>
$P(Normal\mid Lunch~Dear~Friend)\propto P(Normal)\cdot P(Lunch\mid Normal)\cdot P(Dear\mid Normal)\cdot P(Friend\mid Normal)=0.67\cdot 0.18 \cdot 0.47\cdot 0.29=0.0164$
</h3>
</center>
<h3>2</h3>
<center>
<h3>
$P(Spam\mid Friend~Money~Money~Money~Money)\propto P(Spam)\cdot P(Friend\mid Spam)\cdot P(Money\mid Spam)\cdot P(Money\mid Spam)\cdot P(Money\mid Spam)\cdot P(Money\mid Spam)=0.33\cdot 0.14\cdot 0.57 \cdot 0.57\cdot 0.57\cdot 0.57=0.0049$
</h3>
</center>
<center>
<h3>
$P(Normal\mid Friend~Money~Money~Money~Money)\propto P(Normal)\cdot P(Friend\mid Normal)\cdot P(Money\mid Normal)\cdot P(Money\mid Normal)\cdot P(Money\mid Normal)\cdot P(Money\mid Normal)=0.67\cdot 0.29\cdot 0.06 \cdot 0.06\cdot 0.06\cdot 0.06= 2.5\times 10^{-6}$
</h3>
</center>
<h3>3</h3>
<center>
<h3>
$P(Spam\mid Lunch~Money~Money~Money~Money)\propto P(Spam)\cdot P(Lunch\mid Spam)\cdot P(Money\mid Spam)\cdot P(Money\mid Spam)\cdot P(Money\mid Spam)\cdot P(Money\mid Spam)=0.33\cdot 0\cdot 0.57 \cdot 0.57\cdot 0.57\cdot 0.57=0$
</h3>
</center>
<center>
<h3>
$P(Normal\mid Lunch~Money~Money~Money~Money)\propto P(Normal)\cdot P(Lunch\mid Normal)\cdot P(Money\mid Normal)\cdot P(Money\mid Normal)\cdot P(Money\mid Normal)\cdot P(Money\mid Normal)=0.67\cdot 0.18\cdot 0.06 \cdot 0.06\cdot 0.06\cdot 0.06= 1.56\times 10^{-6}$
</h3>
</center>

<p style="line-height:1.75;font-size:16px">
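The first and third messages get a spam score of 0 only because 'Lunch' never appeared in a spam message in our training data, no matter how many times 'Money' appears. To avoid a single unseen word zeroing out the whole product, we can simply add a count of 1 to each of our words in each class (this parameter is sometimes referred to as $\alpha$, and the technique as Laplace smoothing). So now, our new table will look like this: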


<p style="line-height:1.75;font-size:16px">
<table>
<tr>
    <th>
        Word
    </th>
    <th>
        Count Normal
    </th>
    <th>
        Count Spam
    </th>
</tr>
<tr>
    <th>
        Dear
    </th>
    <th>
        9
    </th>
    <th>
        3
    </th>
</tr>
<tr>
    <th>
        Friend
    </th>
    <th>
        6
    </th>
    <th>
        2
    </th>
</tr>
<tr>
    <th>
        Lunch
    </th>
    <th>
        4
    </th>
    <th>
        1
    </th>
</tr>
<tr>
    <th>
        Money
    </th>
    <th>
        2
    </th>
    <th>
        5
    </th>
</tr>
</table>
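
<p style="line-height:1.75;font-size:16px">
As a rough sketch of the effect of adding $\alpha=1$ to every count, the code below recomputes the word probabilities from the original counts and rescores the third message, 'Lunch Money Money Money Money'. The function and variable names are ours, and the priors are left unchanged because smoothing only touches the word counts.

```python
ALPHA = 1  # count added to every word in every class

# Original (unsmoothed) word counts per class
counts = {
    "normal": {"dear": 8, "friend": 5, "lunch": 3, "money": 1},
    "spam": {"dear": 2, "friend": 1, "lunch": 0, "money": 4},
}
n_messages = {"normal": 8, "spam": 4}

def smoothed_likelihoods(label, alpha=ALPHA):
    """P(word | label) after adding alpha to every word count."""
    word_counts = counts[label]
    total = sum(word_counts.values()) + alpha * len(word_counts)
    return {word: (count + alpha) / total for word, count in word_counts.items()}

def score(words, label):
    """Prior times the smoothed per-word likelihoods."""
    likelihoods = smoothed_likelihoods(label)
    result = n_messages[label] / sum(n_messages.values())
    for word in words:
        result *= likelihoods[word]
    return result

message = ["lunch", "money", "money", "money", "money"]
scores = {label: score(message, label) for label in counts}
prediction = max(scores, key=scores.get)
print(smoothed_likelihoods("spam")["lunch"], prediction)
```

With smoothing, $P(Lunch\mid Spam)$ becomes $\frac{1}{11}$ instead of 0, so the four occurrences of 'Money' can now push the message towards spam.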

## Gaussian Naive Bayes

<p style="line-height:1.75;font-size:16px">
Until this point we talked about discrete data (word counts), so we used a variant of naive Bayes called <b>Multinomial Naive Bayes</b>. However, when dealing with continuous variables we need to use a different variant, called <b>Gaussian Naive Bayes</b>. It works very similarly to the multinomial version, except that it models each feature within each class with a Gaussian distribution (estimated from that class's mean and variance) and uses its density as the likelihood.

![gaussian_nb.png](attachment:gaussian_nb.png)
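
<p style="line-height:1.75;font-size:16px">
A minimal sketch of the idea, using only the standard library: each class stores a prior plus a mean and standard deviation per feature, and the score is the prior times the Gaussian density of every feature value. The class names and all the numbers are hypothetical, chosen only for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Hypothetical per-class statistics for two continuous features
stats = {
    "a": {"prior": 0.5, "mean": [5.0, 1.0], "std": [1.0, 0.5]},
    "b": {"prior": 0.5, "mean": [8.0, 2.0], "std": [1.5, 0.5]},
}

def score(x, label):
    """Prior times the per-feature Gaussian likelihoods."""
    s = stats[label]
    result = s["prior"]
    for value, mean, std in zip(x, s["mean"], s["std"]):
        result *= gaussian_pdf(value, mean, std)
    return result

x = [6.0, 1.2]  # a new observation
scores = {label: score(x, label) for label in stats}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```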

## Why Is It Naive?

<div style="line-height:1.75;background:#3464a2;padding-left:20px;padding-top:5px;padding-bottom:5px;border-radius:5px 5px 0px 0px">
<i class="fa fa-question" style="font-size:40px;color:#e6f1ff;"></i>
</div>
<div>
<p style="line-height:1.75;font-size:16px;background:#e6f1ff;padding:20px;border-radius:0px 0px 5px 5px">
After solving the spam filter issues on BayesBook you are now confronted with a new issue - misinformation. You already have a misinformation filter that was built similarly to the spam filter, but its dataset was misinformation articles vs. real information articles. What do you think is the problem?
</p></div>

<p style="line-height:1.75;font-size:16px">
Consider the following sentence: 'Earth is round not flat'. You run it through the same process we used earlier, assuming that it's real information, and get a score of 0.3. Then you run the sentence 'Earth is flat not round' through the same process and get the same score, because both sentences contain exactly the same words. The naive Bayes classifier is naive because it doesn't take word order into account. In a more general context, as previously stated, naive Bayes assumes that the features in our samples are independent of one another given the class.
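
<p style="line-height:1.75;font-size:16px">
The sketch below illustrates why the two sentences must get the same score: they contain the same bag of words, and a product does not depend on the order of its factors. The per-word probabilities are hypothetical and are stored as exact fractions so that the comparison is not affected by floating-point rounding.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical P(word | real information), for illustration only
word_prob = {
    "earth": Fraction(1, 10),
    "is": Fraction(3, 10),
    "round": Fraction(1, 5),
    "not": Fraction(1, 4),
    "flat": Fraction(3, 20),
}

def likelihood(sentence):
    """Product of the per-word probabilities, ignoring word order."""
    result = Fraction(1)
    for word in sentence.lower().split():
        result *= word_prob[word]
    return result

a = "Earth is round not flat"
b = "Earth is flat not round"

same_bag = Counter(a.lower().split()) == Counter(b.lower().split())
print(same_bag, likelihood(a) == likelihood(b))  # True True
```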

## Pros & Cons

<p style="line-height:1.75;font-size:16px">
<b>Pros</b>
- <p style="line-height:1.75;font-size:16px">
When the independence assumption holds, it can perform well compared to other models such as logistic regression, and it requires less training data.
- <p style="line-height:1.75;font-size:16px">
The assumption that all features are independent makes the naive Bayes algorithm very fast compared to more complicated algorithms.
- <p style="line-height:1.75;font-size:16px">
Works well with high-dimensional data such as text classification.

<p style="line-height:1.75;font-size:16px">
<b>Cons</b>
- <p style="line-height:1.75;font-size:16px">
The assumption that all features are independent rarely holds in real life, so naive Bayes is often less accurate than more complicated algorithms.

## Naive Bayes with SKLearn

<p style="line-height:1.75;font-size:16px">
The naive Bayes classifiers can be imported from sklearn using:<br>
`from sklearn.naive_bayes import MultinomialNB, GaussianNB`
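
<p style="line-height:1.75;font-size:16px">
As a small usage sketch, the multinomial variant can be combined with <code>CountVectorizer</code> to classify short texts. The five training messages and their labels below are made up for illustration; they are not the 12 emails from the earlier example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set
train_messages = [
    "dear friend lunch",
    "dear friend",
    "lunch dear",
    "money money money",
    "money friend money",
]
train_labels = ["normal", "normal", "normal", "spam", "spam"]

# Turn each message into a vector of word counts
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(train_messages)

# alpha=1.0 is the add-one smoothing discussed earlier
model = MultinomialNB(alpha=1.0)
model.fit(X_train_counts, train_labels)

new_messages = ["money money", "dear lunch"]
predictions = model.predict(vectorizer.transform(new_messages)).tolist()
print(predictions)
```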

<div style="line-height:1.75;background:#1e7b1e;padding-left:20px;padding-top:5px;padding-bottom:5px;border-radius:5px 5px 0px 0px"><i class="fa fa-pencil" style="font-size:40px;color:#c1f0c1;"></i>
</div>
<div>
<p style="line-height:1.75;font-size:16px;background:#c1f0c1;padding:20px;border-radius:0px 0px 5px 5px">
Using a naive Bayes classifier, try to detect whether a woman has breast cancer or not given a set of features from sklearn's breast cancer dataset (sample code showing how to load the data is provided below). Compare the results of this classifier to a logistic regression classifier.</p>
</div>

In [37]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

breast_cancer = load_breast_cancer()
df = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
df['target'] = pd.Series(breast_cancer.target)
df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0


In [38]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

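# Fit the scaler on the training split only, then rebuild both splits as new
# DataFrames instead of assigning into slices of df (avoids SettingWithCopyWarning)
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns=X.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)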

nb = GaussianNB()
nb.fit(X_train, y_train)
print(f'Naive Bayes: {nb.score(X_test, y_test)}')

lr = LogisticRegression()
lr.fit(X_train, y_train)
print(f'Logistic Regression: {lr.score(X_test, y_test)}')

Naive Bayes: 0.9210526315789473
Logistic Regression: 0.9736842105263158

