# Naive Bayes Classifiers

Principle: Naive Bayes classifiers are a family of classifiers that are quite similar to the linear models discussed in the previous section, but they tend to train even faster. The price of this efficiency is that naive Bayes models often generalize slightly worse than linear classifiers such as LogisticRegression and LinearSVC.
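
For reference, the "naive" part is the assumption that features are conditionally independent given the class. This is the standard textbook formulation rather than anything specific to this notebook; here $x_1, \dots, x_n$ denote the feature values of one sample and $y$ a candidate class:

$$
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The variants used below (Gaussian, Bernoulli, Multinomial) differ only in the distribution assumed for each $P(x_i \mid y)$.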

In [2]:
%load_ext autoreload
%autoreload
from utils import feature_selection

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

In [4]:
gt = pd.read_csv('../dumps/2020.01.13-14.25.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)

    gnb = GaussianNB()

    gnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(gnb.score(data_test, target_test)))

Test set accuracy: 0.33
Test set accuracy: 0.34
Test set accuracy: 0.34
Test set accuracy: 0.40
Test set accuracy: 0.46
Test set accuracy: 0.51


As we can see, no matter how big the test size is, the performance of the Gaussian classifier is really bad on this dataset. Let's try with more samples.

In [5]:
gt = pd.read_csv('../dumps/2020.02.10-12.14.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)

    gnb = GaussianNB()

    gnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(gnb.score(data_test, target_test)))

Test set accuracy: 0.13
Test set accuracy: 0.17
Test set accuracy: 0.21
Test set accuracy: 0.24
Test set accuracy: 0.32
Test set accuracy: 0.45


This is even worse! The reason this algorithm is so fast but also so bad at generalization is that it learns its parameters by looking at each feature individually and collecting simple per-class statistics from each one. Since our dataset is highly diverse, GaussianNB gives quite bad results.
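
To make "simple per-class statistics" concrete, here is a minimal sketch on synthetic data (the arrays `X` and `y` are made up for illustration and are not the dumps used above). It checks that the means stored by a fitted `GaussianNB` in its `theta_` attribute are just the per-class averages of each feature, computed one feature at a time:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real dump: 3 classes, 4 continuous features.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
X = rng.normal(loc=y[:, None], scale=1.0, size=(300, 4))

gnb = GaussianNB().fit(X, y)

# Per-class, per-feature means computed by hand, ignoring all other features.
manual_means = np.array([X[y == c].mean(axis=0) for c in gnb.classes_])

# theta_ has shape (n_classes, n_features) and should match the manual means.
means_match = np.allclose(gnb.theta_, manual_means)
print(means_match)
```

Because nothing beyond these per-feature summaries is estimated, training is a single pass over the data, but interactions between features are never modelled.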

Let's look at the Bernoulli distribution for different test sizes.

In [50]:
gt = pd.read_csv('../dumps/2020.02.10-12.14.csv')
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
print(data.shape)
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)

    bnb = BernoulliNB()  # thresholds features at 0 by default (binarize=0.0)

    bnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(bnb.score(data_test, target_test)))

(7977, 119)
Test set accuracy: 0.72
Test set accuracy: 0.70
Test set accuracy: 0.71
Test set accuracy: 0.70
Test set accuracy: 0.73
Test set accuracy: 0.76


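The performance is not that bad, but note that BernoulliNB assumes binary data (as opposed to GaussianNB, which works for any kind of continuous data). With its default `binarize=0.0`, scikit-learn thresholds every feature at 0 before fitting, so the scores above come from a binarized view of the data. We should therefore perform some tuning and only keep boolean values in our dataset.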
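
As a hedged illustration of what BernoulliNB does with non-binary input, the sketch below uses synthetic data (not the dump above) and compares the default estimator with one fitted on features that were thresholded by hand. The `binarize` parameter is part of scikit-learn's `BernoulliNB`; the variable names are made up for this example:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # continuous, signed features
y = (X[:, 0] > 0).astype(int)

# Default: binarize=0.0, so X is thresholded at 0 inside the estimator.
implicit = BernoulliNB().fit(X, y)

# The same thresholding done explicitly, with the internal step switched off.
X_bool = (X > 0).astype(int)
explicit = BernoulliNB(binarize=None).fit(X_bool, y)

same_predictions = np.array_equal(implicit.predict(X), explicit.predict(X_bool))
print(same_predictions)
```

If 0 is not a meaningful cut-off for a feature, a different threshold or an explicit conversion to booleans is probably the safer choice.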

Now for the Multinomial distribution. Note that MultinomialNB only accepts non-negative values, so we have to filter our dataset and remove all rows in which any feature value is below 0:

In [8]:
gt = pd.read_csv('../dumps/2020.02.10-12.14.csv')
cols = [col for col in gt.columns if col not in ['label']]
# Keep only the rows in which no feature is negative
gt = gt[~(gt[cols] < 0).any(axis=1)]
data = gt[cols]
print(data.shape)
target = gt['label']

values = [0.1,0.2,0.4,0.6,0.8,0.9]

for i in values:
    data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = i, random_state = 0)

    mnb = MultinomialNB()  # requires non-negative feature values

    mnb.fit(data_train, target_train)
    print("Test set accuracy: {:.2f}".format(mnb.score(data_test, target_test)))

(3771, 119)
Test set accuracy: 0.77
Test set accuracy: 0.80
Test set accuracy: 0.83
Test set accuracy: 0.87
Test set accuracy: 0.92
Test set accuracy: 0.95


After dropping more than 50% of the rows (3771 of the original 7977 remain), we reach quite acceptable accuracy.
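
The row filtering can be sketched on a tiny made-up frame (the column names `f1`, `f2` and the values are hypothetical). It shows that MultinomialNB raises a `ValueError` on negative input and that a boolean mask removes the offending rows. Keep in mind that the filtered subset is a different sample from the full dataset, so its accuracy is not directly comparable with the Gaussian and Bernoulli runs above:

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy frame standing in for the real dump.
toy = pd.DataFrame({
    "f1": [3, 0, -1, 2, 5, 1],
    "f2": [1, 4, 2, -2, 0, 3],
    "label": [0, 1, 0, 1, 0, 1],
})
features = [c for c in toy.columns if c != "label"]

# MultinomialNB refuses negative feature values.
try:
    MultinomialNB().fit(toy[features], toy["label"])
    rejected = False
except ValueError:
    rejected = True

# Keep only the rows in which no feature is negative.
kept = toy[~(toy[features] < 0).any(axis=1)]
print(rejected, len(toy), "->", len(kept))

# Fitting now succeeds on the filtered rows.
mnb = MultinomialNB().fit(kept[features], kept["label"])
```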

Conclusion: apart from MultinomialNB, naive Bayes classifiers are not well suited to our problem of classifying very sparse features.

### Features relevance

Since Naive Bayes assumes independence and outputs class probabilities, most feature importance criteria are not a direct fit. In effect, a feature's importance should be no different from the skewness of its distribution in the dataset.
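
One model-agnostic alternative, which this notebook does not use, is permutation importance: shuffle one feature at a time and measure how much the score drops. The sketch below is only an illustration on synthetic data in which a single feature carries the class signal; `permutation_importance` comes from `sklearn.inspection`, and everything else is made up for the example:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = rng.normal(size=(500, 4))
X[:, 0] += 3 * y   # only feature 0 carries class information

clf = GaussianNB().fit(X, y)

# Shuffle each feature 10 times and record the mean drop in accuracy.
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)

# Feature indices sorted from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking[0])
```

In practice the importances would be computed on a held-out split rather than on the training data, and strongly correlated features can share or hide importance.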

### Test with Thomas datasets

In [12]:
gt = pd.read_csv("../dumps/2019-08.Merged_thomas.csv")
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = 0.20, random_state = 0)

# Gaussian naive Bayes on the full dataset
clf = GaussianNB()
clf.fit(data_train, target_train)
print("Accuracy on training set: {:.3f}".format(clf.score(data_train, target_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(data_test, target_test)))

# Bernoulli naive Bayes on the full dataset
clf = BernoulliNB()
clf.fit(data_train, target_train)
print("Accuracy on training set: {:.3f}".format(clf.score(data_train, target_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(data_test, target_test)))

# MultinomialNB needs non-negative input: keep only rows with no negative feature
gt = gt[~(gt[cols] < 0).any(axis=1)]
data = gt[cols]
target = gt['label']

data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = 0.20, random_state = 0)

# Multinomial naive Bayes on the filtered dataset
clf = MultinomialNB()
clf.fit(data_train, target_train)
print("Accuracy on training set: {:.3f}".format(clf.score(data_train, target_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(data_test, target_test)))

Accuracy on training set: 0.087
Accuracy on test set: 0.086
Accuracy on training set: 0.829
Accuracy on test set: 0.829
Accuracy on training set: 0.248
Accuracy on test set: 0.244


In [13]:
gt = pd.read_csv("../dumps/2019-09.Merged_thomas.csv")
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
target = gt['label']

data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = 0.20, random_state = 0)

# Gaussian naive Bayes on the full dataset
clf = GaussianNB()
clf.fit(data_train, target_train)
print("Accuracy on training set: {:.3f}".format(clf.score(data_train, target_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(data_test, target_test)))

# Bernoulli naive Bayes on the full dataset
clf = BernoulliNB()
clf.fit(data_train, target_train)
print("Accuracy on training set: {:.3f}".format(clf.score(data_train, target_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(data_test, target_test)))

# MultinomialNB needs non-negative input: keep only rows with no negative feature
gt = gt[~(gt[cols] < 0).any(axis=1)]
data = gt[cols]
target = gt['label']
data_train, data_test, target_train, target_test = train_test_split(data,target, test_size = 0.20, random_state = 0)

# Multinomial naive Bayes on the filtered dataset
clf = MultinomialNB()
clf.fit(data_train, target_train)
print("Accuracy on training set: {:.3f}".format(clf.score(data_train, target_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(data_test, target_test)))

Accuracy on training set: 0.173
Accuracy on test set: 0.172
Accuracy on training set: 0.897
Accuracy on test set: 0.887
Accuracy on training set: 0.223
Accuracy on test set: 0.218
