# Naive Bayes with Sklearn

This notebook builds and evaluates a [Naive Bayes classifier with Sklearn](http://scikit-learn.org/stable/modules/naive_bayes.html).

* Method: Gaussian
* Dataset: Breast Cancer Wisconsin Diagnostic Database

Sklearn provides several Naive Bayes models, including these three:

* Gaussian: for continuous features; the likelihood of each feature is assumed to be Gaussian within each class
* Multinomial: for multinomially distributed data; one of the two classic naive Bayes variants used in text classification
* Bernoulli: for data that is distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a binary-valued (Bernoulli, boolean) variable
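
As a rough illustration of how the variants listed above map to feature types, the sketch below fits each one on a tiny hand-made dataset. The arrays `X_continuous`, `X_counts`, and `X_binary` are hypothetical toy data invented for this example and are not part of the notebook's dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Hypothetical toy labels shared by all three examples
y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_continuous = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 4.2]])
gaussian_pred = GaussianNB().fit(X_continuous, y).predict([[1.1, 2.0]])

# Non-negative counts (for example word counts) -> MultinomialNB
X_counts = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])
multinomial_pred = MultinomialNB().fit(X_counts, y).predict([[3, 0]])

# Binary indicators (feature present or absent) -> BernoulliNB
X_binary = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
bernoulli_pred = BernoulliNB().fit(X_binary, y).predict([[1, 0]])

print(gaussian_pred, multinomial_pred, bernoulli_pred)
```

The breast cancer features used in this notebook are continuous measurements, which is why the Gaussian variant is the natural choice here.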

## Imports

In [None]:
import numpy as np

from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

## Load and Prepare the Data

In [None]:
# Load the dataset
data = load_breast_cancer()

In [None]:
# Get information on the dataset
print(data.DESCR)

In [None]:
# Split the data into labels (targets) and features
label_names = data['target_names']
labels = data['target']

feature_names = data['feature_names']
features = data['data']

In [None]:
# View the data
print(label_names)
print(labels[0])
print("")
print(feature_names)
print(features[0])

In [None]:
# Create test and training sets
X_train, X_test, Y_train, Y_test = train_test_split(features,
                                                    labels,
                                                    test_size=0.33,
                                                    random_state=42)

## Fit a Naive Bayes Model

In [None]:
# Create an instance of the GaussianNB classifier
gnb = GaussianNB()

# Train the model
gnb.fit(X_train, Y_train)
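
After fitting, it can be useful to look at what the model learned. The sketch below repeats the load, split, and fit steps so that it runs on its own, then prints the fitted attributes `classes_`, `class_prior_`, and `theta_` (the per-class feature means). Which attributes are worth inspecting is a choice made for this example, not something the notebook prescribes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Rebuild the same split used in this notebook
data = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)

gnb = GaussianNB().fit(X_train, Y_train)

# Class labels seen during training
print(gnb.classes_)

# Estimated prior probability of each class (sums to 1)
print(gnb.class_prior_)

# Per-class mean of each feature: one row per class, one column per feature
print(gnb.theta_.shape)
```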

## Create Predictions

In [None]:
# Create predictions
predictions = gnb.predict(X_test)
print(predictions)
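
Besides hard labels, `GaussianNB` can also return class probabilities through `predict_proba`. The following is a minimal, self-contained sketch that repeats the earlier load, split, and fit steps; showing only the first five rows is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Rebuild the same split and model used in this notebook
data = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)
gnb = GaussianNB().fit(X_train, Y_train)

# One row per test sample, one column per class (ordered as gnb.classes_)
probabilities = gnb.predict_proba(X_test)

# Show the first five samples alongside the class names
print(data.target_names)
print(probabilities[:5])
```

Naive Bayes probabilities tend to be pushed toward 0 or 1 because of the independence assumption, so they are often better read as a ranking than as calibrated confidence values.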

## Model Evaluation

### Accuracy

The accuracy score is either the fraction of correct predictions (the default) or, with `normalize=False`, the number of correct predictions.

In [None]:
print("Accuracy Score: %.2f" % accuracy_score(Y_test, predictions))
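
To make the difference between the two modes concrete, here is a small sketch on hand-made labels. The lists `y_true` and `y_pred` are hypothetical values chosen so that four of the five predictions are correct.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: the third prediction is wrong, the rest are right
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Default: fraction of correct predictions (4 out of 5)
fraction_correct = accuracy_score(y_true, y_pred)

# normalize=False: number of correct predictions
count_correct = accuracy_score(y_true, y_pred, normalize=False)

print(fraction_correct, count_correct)
```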

### K-Fold Cross Validation

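This estimates the accuracy of a Gaussian Naive Bayes model by splitting the data into 5 folds. For each fold, a fresh model is fitted on the other four folds and scored on the held-out fold. With an integer `cv` and a classifier, Sklearn uses stratified folds, so each fold keeps roughly the same class proportions. The result is an array of 5 scores, one per fold.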

In [None]:
# Get scores for 5 folds over the data
clf = GaussianNB()
scores = cross_val_score(clf, data.data, data.target, cv=5)
print(scores)
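
The per-fold scores are often summarized by their mean and standard deviation. The sketch below is one common way to report them; it recomputes the scores so that it runs on its own, and the output format is only a suggestion.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()

# Accuracy on each of the 5 held-out folds
scores = cross_val_score(GaussianNB(), data.data, data.target, cv=5)

# Summarize the folds: the mean estimates accuracy, the std shows its spread
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```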