
# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png)  Naive Bayes classifier
Week 8 | 2.1

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Describe Naive Bayes
- Choose a Naive Bayes implementation based on your use case
- Implement a Naive Bayes model through scikit-learn

### STUDENT PRE-WORK
*Before this lesson, you should already be able to:*
- Work with methods in scikit-learn
- Conceptually explain the Bayesian posterior distribution

### LESSON GUIDE
| Timing | Type | Topic |
| --- | --- | --- |
| 5 min | [Opening](#opening) | Bayes' theorem and Naive Bayes |
| 25 min | [Introduction](#introduction) | The basics of Naive Bayes |
| 25 min | [Guided Practice](#Guided)  | Using the Naive Bayes Implementation in Scikit-learn |
| 25 min | [Independent Practice](#Indy) | Apply your Naive Bayes on the data |
| 5 min |  [Conclusion](#conclusion)| Concluding Remarks |

---


### Bayes' theorem, again:


### $$P\left(\;A\;|\;B\;\right) = \frac{P\left(\;B\;|\;A\;\right)P\left(\;A\;\right)}{P(\;B\;)}$$


### $$P\left(\;model\;|\;data\;\right) = \frac{P\left(\;data\;|\;model\;\right)P\left(\;model\;\right)}{P(\;data\;)} $$


### Applying Bayes in supervised machine learning

> Check: How would you apply this in a machine learning context?



We can use this for classification problems.\* Its canonical use case is spam classification (or text classification generally).




<sub><sup>\*Or regression. But it doesn't work well.</sup></sub>

### What would our formula look like?

Let's say we're trying to predict 419 scam emails. M = 'the email contains the word "million"', S = 'the email is spam'.

#### $$P\left(\;S\;|\;M\;\right) = \frac{P\left(\;M\;|\;S\;\right)P\left(\;S\;\right)}{P(\;M\;)} = \frac{P\left(\;M\;|\;S\;\right)P\left(\;S\;\right)}{P(\;M\;|\;S)P(\;S\;) + P(\;M\;|\;\neg{S})P(\;\neg{S}\;)}$$



We can make some simplifying assumptions. Let's start by assuming an equal chance of spam / not spam. So:

### $$ P\left(\;S\;|\;M\;\right) =
\frac{P\left(\;M\;|\;S\;\right)}
{P(\;M\;|\;S) + P(\;M\;|\;\neg{S})}$$

$\neg{S}$ is "not spam"
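To make the single-word case concrete, here is a minimal sketch of the calculation. The likelihood values (0.20 and 0.05) and the function name `posterior_spam` are made up for illustration; they are not estimated from any real corpus.

```python
def posterior_spam(p_m_given_s, p_m_given_not_s, p_s=0.5):
    """Return P(S | M) from the two likelihoods and the prior P(S)."""
    numerator = p_m_given_s * p_s
    evidence = numerator + p_m_given_not_s * (1.0 - p_s)
    return numerator / evidence

# Hypothetical likelihoods: 'million' appears in 20% of spam and 5% of non-spam.
equal_prior = posterior_spam(0.20, 0.05)           # equal chance of spam / not spam
rare_spam = posterior_spam(0.20, 0.05, p_s=0.10)   # spam is only 10% of all mail

print(round(equal_prior, 3))
print(round(rare_spam, 3))
```

With equal priors the $P(S)$ terms cancel, which is why the simplified formula has no prior in it. Once spam is assumed to be rare, the same evidence yields a much lower posterior.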

But we'll use more than one feature. Really, we want to see some feature vector $X_1, X_2, ..., X_n$:

### $$P\left(\;S\;|\;X_1, X_2, ..., X_n\;\right) = \frac{P\left(\;X_1, X_2, ..., X_n\;|\;S\;\right)}{P(\;X_1, X_2, ..., X_n\;|\;S) + P(\;X_1, X_2, ..., X_n\;|\;\neg{S})}$$

Since these features can take on different values in each observation, our calculation is really:

### $$P\left(\;S\;|\;X_{1=x1}, X_{2=x2}, ..., X_{n=xn}\;\right) = \frac{P\left(\;X_{1=x1}, X_{2=x2}, ..., X_{n=xn}\;|\;S\;\right)}{P(\;X_{1=x1}, X_{2=x2}, ..., X_{n=xn}\;|\;S) + P(\;X_{1=x1}, X_{2=x2}, ..., X_{n=xn}\;|\;\neg{S})}$$


With a lot of features, calculating their joint probabilities could get hairy.

### Simplify again, naively

Joint probabilities are NBD if we *assume the features are independent given the class*:
$P\left(\;X_{1=x1}, X_{2=x2}, ..., X_{n=xn}\;|\;S\;\right) = P\left(\;X_{1=x1} |\;S\;\right) * P\left(\;X_{2=x2} |\;S\;\right) ... P\left(\;X_{n=xn} |\;S\;\right)$

$$P\left(\;S\;|\;X_{1=x1}, X_{2=x2}, ..., X_{n=xn}\;\right) = \prod_{i=1}^{n}P(X_i = x_i | \;S\;) / C$$

Where C is a normalizing constant (the denominator from the previous formula). It depends only on the observed data, so it is the same for every class.
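A minimal sketch of the naive product for two classes with equal priors. The per-word likelihoods and the helper `naive_joint` are hypothetical, and for brevity the sketch only multiplies likelihoods for words that are present in the email.

```python
# Hypothetical per-word likelihoods P(word present | class); values are made up.
likelihoods = {
    "spam":     {"million": 0.20, "prince": 0.10, "meeting": 0.01},
    "not_spam": {"million": 0.05, "prince": 0.01, "meeting": 0.20},
}

def naive_joint(words, class_likelihoods):
    """P(x_1, ..., x_n | class) under the independence assumption."""
    joint = 1.0
    for word in words:
        joint *= class_likelihoods[word]
    return joint

email = ["million", "prince"]  # words observed in the email
joint_spam = naive_joint(email, likelihoods["spam"])
joint_not_spam = naive_joint(email, likelihoods["not_spam"])

# With equal priors, C is the sum of the two joint likelihoods.
C = joint_spam + joint_not_spam
p_spam = joint_spam / C
print(round(p_spam, 4))
```

Each class needs only one likelihood per feature, rather than one per combination of feature values. That is the saving the independence assumption buys.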

### This gives a handy decision function (generalizable to k classes)

![](./assets/images/nb_decision_rule.png)
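The rule pictured above is presumably the usual maximum a posteriori rule: score each class by its prior times the product of per-feature likelihoods, then pick the class with the highest score. A minimal sketch under that assumption, with made-up priors and likelihoods and a hypothetical helper `predict_class`:

```python
def predict_class(features, priors, likelihoods):
    """Return the class k that maximizes P(k) * prod_i P(x_i | k), plus all scores."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in features:
            score *= likelihoods[label][feature]
        scores[label] = score
    best = max(scores, key=scores.get)
    return best, scores

# Three hypothetical classes instead of two.
priors = {"spam": 0.3, "promo": 0.2, "personal": 0.5}
likelihoods = {
    "spam":     {"million": 0.20, "sale": 0.05},
    "promo":    {"million": 0.02, "sale": 0.30},
    "personal": {"million": 0.01, "sale": 0.01},
}

label_sale, scores_sale = predict_class(["sale"], priors, likelihoods)
label_million, scores_million = predict_class(["million"], priors, likelihoods)
print(label_sale, label_million)
```

The normalizing constant is left out on purpose: it is identical for every class, so it cannot change which class wins.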

### Using our Naive Bayes model

How do we code this and instantiate models?



How would you?

> Check: With a partner, jot down (pseudo)code for a Naive Bayes classifier. What are the inputs and outputs? How did you calculate probabilities? What implementation wrinkles do you notice?

### Moving toward a production implementation

Possible issues to contend with:




- [Underflow](http://stackoverflow.com/questions/3704570/in-python-small-floats-tending-to-zero). Probabilities may be very, very small, too small for floating point arithmetic. We can solve this by leveraging:

$$log(ab) = log\ a + log\ b$$

$$exp(log\ x) = x$$

So $P_1\ *\ P_2\ ...\ *\ P_n = exp(log\ P_1 + ... + log\ P_n)$


```python
import math

p1 = .03
p2 = .05

# The direct product and the log-sum-exp route agree up to rounding error
print(p2 * p1)
print(math.exp(math.log(p1) + math.log(p2)))
```

```
0.0015
0.0014999999999999994
```
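With only two probabilities the product is still representable. A sketch of what happens with many features, using 100 hypothetical likelihoods of `1e-5` each:

```python
import math

probs = [1e-5] * 100  # 100 hypothetical per-feature likelihoods

# Direct product: the true value (1e-500) is far below the smallest double.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: stays comfortably within floating point range.
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

The direct product collapses to `0.0`, while the log sum is an ordinary negative number. In practice there is no need to exponentiate back: class scores can be compared directly in log space, since the logarithm preserves ordering.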

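- Zero probabilities. What if you never saw a feature value in your training data? Its estimated likelihood would be 0, which wipes out the whole product. We can use Laplace smoothing:

$$\hat\theta_i= \frac{x_i + \alpha}{N + \alpha d}  \qquad (i=1,\ldots,d)$$

Where $x_i$ is the number of times value $i$ was observed, $N$ is the total number of observations, $d$ is the number of possible values, and $\alpha > 0$ is the smoothing parameter.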
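A minimal sketch of the smoothing formula. The counts and the helper name `laplace_smooth` are hypothetical.

```python
def laplace_smooth(counts, alpha=1.0):
    """Smoothed estimates (x_i + alpha) / (N + alpha * d) for each of the d values."""
    total = sum(counts)   # N: total number of observations
    d = len(counts)       # d: number of possible values
    return [(x + alpha) / (total + alpha * d) for x in counts]

counts = [3, 0, 1]  # hypothetical counts; the second value was never observed
smoothed = laplace_smooth(counts, alpha=1.0)
print(smoothed)
```

The unseen value now gets a small non-zero estimate (1/7 here), and the estimates still sum to 1.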

- Real-valued features. This brings us to *distributions*.

### The likelihood functions

$P\left(\;X_{1=x1}, X_{2=x2}, ..., X_{n=xn}\;|\;S\;\right)$

Bayesians tend to talk in terms of distributions of belief. Rather than point estimates of probabilities, we can use distributions.

For a binary event, probability can be modeled with the **Bernoulli distribution** (a binomial with a single trial).

For > 2 discrete outcomes, the **multinomial distribution**.

And if features are real-valued? **Gaussian**.
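For the Gaussian case, a sketch of how a likelihood for a real-valued feature could be computed: estimate a mean and variance per class, then evaluate the normal density at the observed value. The email lengths and the helper names `fit_gaussian` and `gaussian_pdf` are made up for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_gaussian(values):
    """Maximum likelihood estimates of the mean and variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

# Hypothetical feature: email length in words, per class.
spam_lengths = [120.0, 150.0, 130.0, 160.0]
ham_lengths = [40.0, 60.0, 50.0, 70.0]

spam_mean, spam_var = fit_gaussian(spam_lengths)
ham_mean, ham_var = fit_gaussian(ham_lengths)

x = 125.0  # length of a new email
like_spam = gaussian_pdf(x, spam_mean, spam_var)
like_ham = gaussian_pdf(x, ham_mean, ham_var)
print(like_spam > like_ham)
```

These densities play the role of $P(X_i = x_i | S)$ in the product. They are not probabilities in the strict sense, but they can be compared across classes in the same way.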
 
 

## Guided practice: Scikit-learn to the rescue

<a name = "demo"></a>
### Using the Naive Bayes Implementation in Scikit-learn (15 mins)


```python
from sklearn.naive_bayes import GaussianNB
import numpy as np

# Build a small toy dataset as numpy arrays
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

# Initialize a Gaussian Naive Bayes classifier and fit it to the data
clf = GaussianNB()
clf.fit(X, Y)

# Predict a new instance
print(clf.predict([[-0.8, -1]]))

# partial_fit trains incrementally; the full set of classes must be passed on the first call
clf_pf = GaussianNB()
clf_pf.partial_fit(X, Y, np.unique(Y))
print(clf_pf.predict([[-0.8, -1]]))

```


```python
from sklearn.naive_bayes import GaussianNB
import numpy as np

# Build a small toy dataset as numpy arrays
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

# Initialize a Gaussian Naive Bayes classifier and fit it to the data
clf = GaussianNB()
clf.fit(X, Y)

# Predict a few instances
print(clf.predict([[5, -1]]))
clf_pf = GaussianNB()
clf_pf.partial_fit(X, Y, np.unique(Y))
print(clf_pf.predict([[55, -1]]))
```

```
[2]
[2]
```

```python
print(clf_pf.predict([[-5, -1]]))
```

```
[1]
```
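Beyond hard predictions, a fitted `GaussianNB` also exposes what it learned. The sketch below refits the same toy data and looks at the class priors, the per-class feature means, and the posterior probabilities for one point.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

clf = GaussianNB().fit(X, Y)

print(clf.classes_)      # class labels, in the order used by the arrays below
print(clf.class_prior_)  # P(class), estimated from class frequencies
print(clf.theta_)        # per-class mean of each feature

# Posterior probability of each class for a new point
proba = clf.predict_proba([[-0.8, -1]])
print(proba)
```

Each row of `predict_proba` sums to 1, and `predict` returns the class with the largest entry.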


<a name = "Guided"></a>
## Independent practice: Naive-Bayes classifier with real data (25 mins)

We're now going to try our hand at classifying some spam.

```python
# Work here
from sklearn import naive_bayes
import numpy as np
import pandas as pd

data = pd.read_csv('./assets/datasets/spam_base.csv')
```

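```python
data.head()
```

| Unnamed: 0 | 0 | 0.64 | 0.64.1 | 0.1 | 0.32 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | ... | 0.40 | 0.41 | 0.42 | 0.778 | 0.43 | 0.44 | 3.756 | 61 | 278 | 1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.21 | 0.28 | 0.5 | 0.0 | 0.0 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 1 | 0.06 | 0.0 | 0.71 | 0.0 | 0.0 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.85 | 0.0 | 0.0 | ... | 0.0 | 0.223 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 15 | 54 | 1 |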


<a name = "Indy"></a>
## Apply your Naive Bayes on the data  (25 min)

Now we should take the data loaded above and try our hand with Naive Bayes. Which Naive Bayes classifier should we use? There are three variants (Gaussian, Bernoulli, Multinomial). Could we convert the data and try one or the other? How should we think about diagnosing the model's performance?

Again, we must defer to the docs:

- [GaussianNB docs](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
- [MultinomialNB docs](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
- [BernoulliNB docs](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html)

The differences can be summarized as follows:

-    ***BernoulliNB*** is designed for binary/boolean features
-    ***MultinomialNB*** is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as `tf-idf` may also work
-    ***GaussianNB*** is designed for continuous features, which are assumed to be normally distributed within each class
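One way to explore the choice is to fit all three variants on the same features and compare cross-validated accuracy. The sketch below uses synthetic count data (Poisson draws with made-up, class-dependent rates) rather than the spam dataset, so the numbers say nothing about which variant suits the real data. `BernoulliNB(binarize=0.0)` turns each count into a present/absent flag before fitting.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
n_per_class = 100

# Hypothetical Poisson rates for 4 count features; one row per class.
rates = np.array([[0.2, 1.0, 3.0, 0.5],
                  [1.5, 0.3, 1.0, 2.0]])
X = np.vstack([rng.poisson(rates[k], size=(n_per_class, 4)) for k in (0, 1)])
y = np.repeat([0, 1], n_per_class)

models = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(binarize=0.0),
}

# Mean accuracy over 5 cross-validation folds for each variant
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The same loop can be pointed at a real feature matrix, provided the features are non-negative (a requirement of `MultinomialNB`).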

```python
# Work here
from sklearn.naive_bayes import MultinomialNB

# We need to separate the features from the target (the last column).
# Assumes `data` is the DataFrame loaded above; drop any saved index column first.
numpy_data_mat = data.values
feature_set = numpy_data_mat[:, :-1]
target = numpy_data_mat[:, -1]

# Define several different feature sets. Do we get more or better accuracy? Is more always better?

# Discuss... and think about what kind of diagnosis metrics we could utilize for the model
clf = MultinomialNB()  # try different values of the smoothing parameter alpha
clf.fit(feature_set, target)
print(clf.predict(feature_set[:5]))  # e.g. predictions for the first five rows
```

```python
from sklearn.naive_bayes import GaussianNB
import pandas as pd
import numpy as np

# Create a data frame containing your data; each column can be accessed by df['column name']
df = pd.read_csv('/your/path/yourFile.csv')

# Assumption: the last column of the file holds the 0/1 class labels
targets = df.iloc[:, -1].astype(int).values
target_names = np.array(['Positives', 'Negatives'])

# Add columns to your data frame
df['is_train'] = np.random.uniform(0, 1, len(df)) <= 0.75
df['Type'] = pd.Categorical.from_codes(targets, target_names)  # pd.Factor no longer exists
df['Targets'] = targets

# Define training and test sets
train = df[df['is_train']]
test = df[~df['is_train']]

trainTargets = np.array(train['Targets']).astype(int)
testTargets = np.array(test['Targets']).astype(int)

# Columns you want to model
features = df.columns[0:7]

# Call the Gaussian Naive Bayes class with default parameters
gnb = GaussianNB()

# Train the model, then predict on the training rows
y_gnb = gnb.fit(train[features], trainTargets).predict(train[features])
```
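Predictions on the training rows tend to look better than the model really is, so diagnosis usually happens on held-out data. A sketch using synthetic Gaussian features (made up for illustration, not the spam data) with scikit-learn's `train_test_split`, `accuracy_score`, and `confusion_matrix`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)

# Two hypothetical classes whose 3 features differ in mean.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),
               rng.normal(1.5, 1.0, size=(100, 3))])
y = np.repeat([0, 1], 100)

# Hold out 25% of the rows, keeping the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

gnb = GaussianNB().fit(X_train, y_train)
y_pred = gnb.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print(acc)
print(cm)
```

The confusion matrix shows which kind of mistake the model makes. For spam filtering that matters, because flagging a legitimate email is usually more costly than letting one spam message through.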

<a name = "conclusion"></a>
## Conclusion (5 min)


How does Naive Bayes fit into your toolkit? What are the pros and cons? How do you choose between variants?

#### Additional Resources

- [An interesting slide deck from a Stanford MOOC which had a section on Naive Bayes](https://web.stanford.edu/class/cs124/lec/naivebayes.pdf)
- [A much more technical paper comparing Naive Bayes to Logistic Regression](https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf)
- [More exposition on Naive Bayes](http://blog.yhat.com/posts/naive-bayes-in-python.html)