### Naive Bayes in Sklearn

We will again use the iris data.

In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

In [2]:
# Load the data, which is included in sklearn.
iris = load_iris()
print('Iris target names:', iris.target_names)
print('Iris feature names:', iris.feature_names)
X, Y = iris.data, iris.target

# Shuffle the data, but make sure that the features and accompanying labels stay in sync.
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

# Split into train and test.
train_data, train_labels = X[:100], Y[:100]
test_data, test_labels = X[100:], Y[100:]

Iris target names: ['setosa' 'versicolor' 'virginica']
Iris feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


Binarize the features so they suit a Bernoulli model, first by thresholding each feature at its mean.

In [44]:
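# Per-feature mean over all samples (one value per column).
a = np.mean(X, axis=0)
# True where a measurement is above its feature's mean.
X2 = X > a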
train_data2 = X2[:100]
test_data2 = X2[100:]

In [40]:
# Alternative: threshold each feature at its median instead of its mean.
b = np.median(X, axis=0)
X2 = X > b
train_data2 = X2[:100]
test_data2 = X2[100:]

In [45]:
# Alternative: flag values that lie more than one standard deviation
# from the feature mean `a` computed above.
b = np.std(X, axis=0)
X4 = np.abs(X - a) > b
train_data2 = X4[:100]
test_data2 = X4[100:] 

Sklearn provides several Naive Bayes variants. We look at three: Gaussian, Bernoulli, and multinomial.

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

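What is the difference between them? Each one assumes a different distributional form for the components of P(X|Y), that is, for the distribution of each individual feature given the class: Gaussian for continuous values, Bernoulli for binary values, and multinomial for counts.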
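
To make the distinction concrete, here is a minimal sketch that scores one measurement of one feature under the Gaussian and the Bernoulli assumptions. The choice of feature (petal length), class (setosa), test value `x_new`, and the mean threshold are all illustrative assumptions, and no smoothing is applied, unlike in sklearn's classifiers.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, Y = iris.data, iris.target

# One feature (petal length, column 2) for one class (setosa, label 0).
feature = X[Y == 0, 2]
x_new = 1.5  # hypothetical measurement to score

# Gaussian: model the feature with a per-class mean and variance.
mu, var = feature.mean(), feature.var()
gaussian_density = np.exp(-(x_new - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Bernoulli: model a binarized version of the feature with one probability.
threshold = X[:, 2].mean()            # assumed cut point: the overall mean
p_on = (feature > threshold).mean()   # unsmoothed estimate of P(bit = 1 | class)
bernoulli_prob = p_on if x_new > threshold else 1 - p_on

print('Gaussian density at %.1f: %.3f' % (x_new, gaussian_density))
print('Bernoulli probability of the observed bit: %.3f' % bernoulli_prob)
```

The Gaussian model keeps the full measurement, while the Bernoulli model only sees which side of the threshold it falls on.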

Try each of these on the iris data. You will have to prepare the data for the multinomial and Bernoulli models.
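
For the multinomial model, one possible preparation is sketched below. It assumes that converting each measurement from centimeters to whole millimeters gives a reasonable count-like feature. That is only one choice among many (binning is another), and the names `X_counts`, `train_counts`, `test_counts`, and `mul` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import MultinomialNB

iris = load_iris()
X, Y = iris.data, iris.target

# Same shuffle and 100/50 split as in the cells above.
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

# Assumed preparation: centimeters -> whole millimeters, so every
# feature is a non-negative integer that can be treated as a count.
X_counts = np.rint(X * 10).astype(int)
train_counts, test_counts = X_counts[:100], X_counts[100:]
train_labels, test_labels = Y[:100], Y[100:]

mul = MultinomialNB()
mul.fit(train_counts, train_labels)
multinomial_accuracy = mul.score(test_counts, test_labels)
print('multinomial accuracy: %3.2f' % multinomial_accuracy)
```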

In [21]:
gau = GaussianNB()
gau.fit(train_data, train_labels)

print('gaussian accuracy: %3.2f' % gau.score(test_data, test_labels))

gaussian accuracy: 0.96


In [46]:
ber = BernoulliNB()
ber.fit(train_data2, train_labels)

print('bernoulli accuracy: %3.2f' % ber.score(test_data2, test_labels))

bernoulli accuracy: 0.78


What choices did you make when manipulating the features above? Try tuning these choices. Can you improve the accuracy?

Investigate the effect of the smoothing parameter alpha on the Bernoulli and multinomial classifiers. What happens when alpha is very high? Is there an optimal value?
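
One way to start this investigation is a simple sweep over alpha, sketched here for `BernoulliNB`. The grid of alpha values and the mean-threshold binarization are assumptions chosen for illustration. Alpha is the additive smoothing term, so as it grows, every estimated feature probability is pulled toward 0.5 and the features carry less and less information about the class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import BernoulliNB

iris = load_iris()
X, Y = iris.data, iris.target

# Same shuffle and 100/50 split as in the cells above.
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

# Assumed binarization: 1 where a value exceeds its feature's mean.
X_bin = (X > X.mean(axis=0)).astype(int)
train_bin, test_bin = X_bin[:100], X_bin[100:]
train_labels, test_labels = Y[:100], Y[100:]

# Hypothetical grid of smoothing values, from almost none to very heavy.
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
accuracies = []
for alpha in alphas:
    ber = BernoulliNB(alpha=alpha)
    ber.fit(train_bin, train_labels)
    accuracies.append(ber.score(test_bin, test_labels))
    print('alpha = %8.3f  accuracy: %3.2f' % (alpha, accuracies[-1]))
```

The same loop works for `MultinomialNB` once the features have been turned into non-negative counts.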

Does increasing alpha add bias or variance to our estimator?