# Naive Bayes Classifier

[Naive Bayes classifiers](https://en.wikipedia.org/wiki/Naive_Bayes_classifier), also known as 'simple Bayes models', are probabilistic classification models based on Bayes' theorem.

The model assumes that features are conditionally independent of one another given the class. That assumption does not hold for every data set, so the model is not always a good choice, but it is appropriate for our data.
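To make the independence assumption concrete, here is a minimal sketch of how a naive Bayes model could score a document. It assumes each word contributes independently once the class is known, so the score is the log prior plus a sum of per-word log likelihoods. The class names, priors and word probabilities are made up for illustration and are not taken from our data.

```python
import math

# Hypothetical class priors P(c) and per-class word likelihoods P(word | c).
priors = {"spam": 0.5, "legitimate": 0.5}
likelihoods = {
    "spam": {"free": 0.6, "meeting": 0.1, "offer": 0.3},
    "legitimate": {"free": 0.1, "meeting": 0.7, "offer": 0.2},
}


def log_score(words, label):
    """Unnormalised log posterior: log P(c) + sum of log P(word | c)."""
    return math.log(priors[label]) + sum(
        math.log(likelihoods[label][word]) for word in words
    )


def predict(words):
    """Return the class with the highest unnormalised log posterior."""
    return max(priors, key=lambda label: log_score(words, label))


prediction = predict(["free", "offer", "free"])
print(prediction)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.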

In this notebook we load our training data, transform it into feature vectors using the feature extraction technique developed in notebook [01-feature-engineering](01-feature-engineering.ipynb), and train a model. We then evaluate the performance of that model and save it to file.

In [None]:
import pandas as pd
import numpy as np

import os.path

training_data = pd.read_parquet(os.path.join("data", "training.parquet"))

In [None]:
training_data.sample(10)

## Feature Engineering 

In [None]:
import cloudpickle as cp

# Load the saved feature extraction pipeline, closing the file when done
with open("feature_pipeline.sav", "rb") as f:
    feature_pipeline = cp.load(f)

In [None]:
train_vecs = feature_pipeline.fit_transform(training_data["Text"])
train_vecs

## Model Training

In [None]:
from sklearn import naive_bayes

In [None]:
nb = naive_bayes.MultinomialNB()

nb.fit(train_vecs, training_data["Category"])
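For intuition about what fitting estimates, the sketch below shows additive (Laplace) smoothing of per-class word probabilities, the kind of estimate a multinomial naive Bayes model makes when its features are counts. It is a simplified illustration, not scikit-learn's implementation. The function name, vocabulary and documents are hypothetical, and `alpha=1.0` is chosen to mirror the default smoothing parameter of `MultinomialNB`.

```python
from collections import Counter


def smoothed_word_probs(documents, vocabulary, alpha=1.0):
    """Estimate P(word | class) from the tokenised documents of one class.

    Additive smoothing gives words unseen in this class a non-zero probability.
    """
    counts = Counter(word for doc in documents for word in doc)
    total = sum(counts[word] for word in vocabulary)
    denominator = total + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}


# Hypothetical vocabulary and documents belonging to a single class.
vocabulary = ["free", "meeting", "offer"]
spam_docs = [["free", "offer"], ["free", "free"]]

probs = smoothed_word_probs(spam_docs, vocabulary)
print(probs)
```

Without smoothing, a word never seen in a class would get probability zero and would veto that class for any document containing it.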

## Evaluating model performance

Now that we have trained a model we can evaluate its performance on both the training and testing set. 

In [None]:
nb.score(train_vecs, training_data["Category"])

In [None]:
testing_data = pd.read_parquet(os.path.join("data", "testing.parquet"))
testing_vecs = feature_pipeline.transform(testing_data["Text"])

In [None]:
testing_vecs

In [None]:
nb.score(testing_vecs, testing_data["Category"])

The scores suggest that the model performs slightly better on the training set than on the testing set, so it may have overfit. However, a single number says little about how a classifier behaves: it cannot show which classes are being confused with one another. Instead, we consider the confusion matrix for the testing data. This interactive plot illustrates the performance of the model when classifying samples from each of the classes:


In [None]:
from mlworkflows import plot

In [None]:
df, chart = plot.confusion_matrix(testing_data.Category, nb.predict(testing_vecs))

In [None]:
chart
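For intuition about what a confusion matrix contains, and why a single accuracy value can mislead, here is a minimal sketch using only the standard library. The labels are hypothetical and deliberately imbalanced. The "classifier" simply predicts the majority class every time.

```python
from collections import Counter

# Hypothetical labels: nine "legitimate" samples and one "spam" sample.
y_true = ["legitimate"] * 9 + ["spam"]
# A degenerate classifier that always predicts the majority class.
y_pred = ["legitimate"] * 10

# Accuracy: the fraction of samples whose prediction matches the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion counts keyed by (true label, predicted label).
confusion = Counter(zip(y_true, y_pred))

print(f"accuracy: {accuracy:.2f}")
for (true_label, predicted_label), count in sorted(confusion.items()):
    print(f"true={true_label:<10} predicted={predicted_label:<10} count={count}")
```

The accuracy looks high, yet the confusion counts show that no "spam" sample was ever identified as such.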

✅ Do you notice anything interesting about the misclassifications made by the model? 


✅ In what situations would you be happy with summarising the model performance by a single 'score' value, rather than a more robust visualisation or set of values?

We can look at individual metrics for the classes like so:

In [None]:
from sklearn.metrics import classification_report
print(classification_report(testing_data.Category, nb.predict(testing_vecs)))
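The report above lists precision, recall and F1 score for each class. As a rough sketch of how those values are defined, the hypothetical helper below computes them for one class from true and predicted labels. The labels are made up, and returning `0.0` when a denominator is zero is a choice made for this illustration.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1), treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(t == label and p == label for t, p in pairs)
    false_pos = sum(t != label and p == label for t, p in pairs)
    false_neg = sum(t == label and p != label for t, p in pairs)

    predicted = true_pos + false_pos  # samples predicted as `label`
    actual = true_pos + false_neg     # samples that truly are `label`
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical true and predicted labels.
y_true = ["spam", "spam", "spam", "legitimate", "legitimate", "legitimate"]
y_pred = ["spam", "spam", "legitimate", "legitimate", "legitimate", "spam"]

precision, recall, f1 = per_class_metrics(y_true, y_pred, "spam")
print(precision, recall, f1)
```

Precision measures how often a predicted label is correct, while recall measures how many samples of a class were found. F1 is their harmonic mean.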

In [None]:
from mlworkflows import util

util.serialize_to(nb, "model.sav")