# Naive Bayes Classification

**Basic Description**

Naive Bayes models are a group of extremely fast and simple classiciation algorithms that are often suitable for high-dimensional datasets. Because they are so fast and have so few tunable parameters, they are useful as a quick-and-dirty baseline for a classification problem.

In Bayesian classification, we're interested in finding the probability of a label given some observed features. As a generative model, Naive Bayes specifies the hypothetical random process that generates the data. The "naive" in Naive Bayes comes from the fact that naive assumptions are made about the generative model for each label. Here I choose a Gaussian Naive Bayes Classifier because our model features are continuous.

**Bias-Variance Tradeoff** 

**Upsides**
- Fast for training and prediction
- Straightforward probabilistic prediction
- Easily interprettable
- Few, if any, tuning parameters

**Downsides**
- Strong assumptions often not met

**Other Notes**


## Load Packages and Prep Data

In [1]:
# custom utils
import utils
print(utils.__file__)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_selection import RFECV
from sklearn.linear_model import Lasso
from sklearn.naive_bayes import GaussianNB

/Users/shea/Projects/ancient_toolmaking/utils.py


In [2]:
GaussianNB?

[0;31mInit signature:[0m [0mGaussianNB[0m[0;34m([0m[0;34m*[0m[0;34m,[0m [0mpriors[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m [0mvar_smoothing[0m[0;34m=[0m[0;36m1e-09[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0;31mDocstring:[0m     
Gaussian Naive Bayes (GaussianNB).

Can perform online updates to model parameters via :meth:`partial_fit`.
For details on algorithm used to update feature means and variance online,
see Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:

    http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

Read more in the :ref:`User Guide <gaussian_naive_bayes>`.

Parameters
----------
priors : array-like of shape (n_classes,)
    Prior probabilities of the classes. If specified, the priors are not
    adjusted according to the data.

var_smoothing : float, default=1e-9
    Portion of the largest variance of all features that is added to
    variances for calculation stability.

    .. versionadded:: 0.20

Attributes

In [3]:
# load data
X_train, y_train, X_test, y_test = utils.load_data()

X_train (62889, 42)
y_train (62889,)
X_test (15723, 42)
y_test (15723,)


## Model 1
- Defaults

In [4]:
# fit model
classifier = GaussianNB()
nb_1 = classifier.fit(X_train, y_train)

In [5]:
# cross validation with f1 scoring
score = utils.f1_cv(nb_1, X_train, y_train)

[0.1327 0.14   0.1371 0.1426 0.1424]
0.139


## Model 2
- Select important features

### Feature Selection

In [None]:
## recursive feature elimination via lasso regression
model_rfe = RFECV(Lasso(alpha = 0.0001), cv = 5)
x = model_rfe.fit(X_train, y_train)

In [71]:
# feature ranking
rfe = model_rfe.ranking_
features = X_train.columns
rfe_df = pd.DataFrame({'features': features, 'rfe_rank': rfe})
rfe_df.sort_values(by = 'rfe_rank', ascending = True)

Unnamed: 0,features,rfe_rank
0,da,1
36,fiber_width,1
34,ellipticity,1
33,angularity,1
30,t_w_ratio,1
29,w_t_ratio,1
28,w_l_ratio,1
25,curvature,1
24,transparency,1
19,circularity,1


In [72]:
# features ranked 1 are the most important
# select only the more important features as a means of regularization
selected_features = rfe_df[rfe_df['rfe_rank'] == 1]['features'].values
X_train_new = X_train[selected_features]
X_test_new = X_test[selected_features]

### Fit Model

In [73]:
# fit model
classifier = GaussianNB()
nb_2 = classifier.fit(X_train_new, y_train)

### Cross-Validation

In [74]:
# cross validation with f1 scoring
score = utils.f1_cv(nb_2, X_train, y_train)

[0.1327 0.14   0.1371 0.1426 0.1424]
0.139


## Test
- Test selected model

In [75]:
# predict on test
y_pred = model_2.predict(X_test_new)

# scores
utils.pred_metrics(y_test, y_pred)

# confusion matrix
utils.cm_plot(y_test,y_pred)