# A Notebook to Use Naïve Bayes Classifiers

This notebook shows how to train Naïve Bayes classifiers to classify unseen instances.

For those interested in understanding the code: it relies on predefined functions from the [sklearn](http://scikit-learn.org) machine learning library.

The Naïve Bayes classifier is a probabilistic classifier based on Bayes' theorem.

In particular, assume that the data are stored in a table with attributes $x_1,x_2,\ldots,x_m$ and label $c$.
The Naïve Bayes classifier (NBC) assumes that the attributes are conditionally independent given the class, so the posterior probability of the class given the attributes is approximated, up to a normalizing constant, by the class prior times the product of the conditional probabilities of each attribute given the class:

$$
P(c \mid x_1,x_2,\ldots,x_m) \propto P(c)\prod_{j=1}^m P(x_j \mid c)
$$
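
As a worked illustration of the formula above, the sketch below computes the posterior for a single instance by hand. The priors and conditional probabilities are made-up toy numbers (two classes, two binary attributes), not values taken from the datasets used in this notebook.

```python
import numpy as np

# Hypothetical toy numbers: two classes, two binary attributes.
prior = np.array([0.6, 0.4])            # P(c) for c = 0, 1
# likelihood[c, j] = P(x_j = 1 | c)
likelihood = np.array([[0.2, 0.7],
                       [0.9, 0.5]])

x = np.array([1, 0])                    # observed instance: x_1 = 1, x_2 = 0

# P(x_j | c) for the observed value of each attribute
per_attribute = np.where(x == 1, likelihood, 1.0 - likelihood)

# Unnormalized posterior: P(c) * prod_j P(x_j | c)
unnormalized = prior * per_attribute.prod(axis=1)

# Dividing by the sum turns the proportionality into an equality
posterior = unnormalized / unnormalized.sum()
print(posterior)
```

The predicted class is the one with the largest posterior, here class 1.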

First, download the data if needed (uncomment the lines below), then import the libraries and define the helper functions.

In [9]:
#!wget https://raw.githubusercontent.com/khider/INF549/master/Homework%20Assignments/Homework%204/Dataset/lenses.csv
#!wget https://raw.githubusercontent.com/khider/INF549/master/Homework%20Assignments/Homework%204/Dataset/iris.csv 

In [2]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB,GaussianNB,MultinomialNB
from sklearn.model_selection import cross_val_score

# For this exercise we use gaussian_NB.csv or multinomial_NB.csv.
# In the homework you will have to use lenses.csv and iris.csv.

def loadDataSet(dataset):
    """Read a CSV file whose last column is the class label.

    Returns the attribute names, a float array of instances, and the labels.
    """
    with open(dataset) as f:
        data = f.readlines()
    # Header row: every column except the last is an attribute name
    attributes = data[0].rstrip().split(',')[:-1]
    rows = [entry.rstrip().split(',') for entry in data[1:]]
    # All columns but the last are numeric features; the last one is the label
    instances = np.array([[float(value) for value in row[:-1]] for row in rows])
    labels = [row[-1] for row in rows]
    return attributes, instances, labels



def predict(testset):
    """Print predictions from whichever of clf_G and clf_M has been defined."""
    if "clf_G" in globals():
        prediction=clf_G.predict(testset)
        print("GaussianNB: ",prediction)
    if "clf_M" in globals():
        prediction=clf_M.predict(testset)
        print("MultinomialNB: ",prediction)

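The tables displayed later in this notebook show a first column named `Unnamed: 0`, which is what pandas reports when a CSV file contains a saved row index. If that is the case here, `loadDataSet` passes this index to the classifier as a feature, because it keeps every column except the last. The sketch below shows one possible alternative loader that skips such a column. The function name `load_with_pandas` and the toy CSV are hypothetical; the layout of the real files is an assumption based on the displayed tables.

```python
import io
import pandas as pd

# Toy CSV in the assumed layout: unnamed index column, features, then label.
toy_csv = io.StringIO(
    ",x1,x2,label\n"
    "0,0.5,1.2,0\n"
    "1,0.1,0.9,0\n"
    "2,4.8,5.1,1\n"
)

def load_with_pandas(source):
    """Hypothetical alternative to loadDataSet that drops the index column."""
    df = pd.read_csv(source, index_col=0)      # first column becomes the index
    attributes = list(df.columns[:-1])          # all columns but the label
    instances = df.iloc[:, :-1].to_numpy(dtype=float)
    labels = df.iloc[:, -1].astype(str).tolist()
    return attributes, instances, labels

attributes, instances, labels = load_with_pandas(toy_csv)
print(attributes, instances.shape, labels)
```

Note that switching loaders would change the features seen by the classifiers, so the scores reported in this notebook would not necessarily be reproduced.
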
## Building and Evaluating Naïve Bayes Classifiers

We will look at the performance of two different Naïve Bayes classifiers:

* Multinomial Naïve Bayes: suitable for classification with discrete features.
* Gaussian Naïve Bayes: suitable for classification with continuous features.
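
The difference between the two classifiers shows up as soon as the wrong kind of data is supplied. The sketch below uses randomly generated toy data (not the notebook's datasets) to show that `MultinomialNB` accepts non-negative counts but raises a `ValueError` for continuous data containing negative values, whereas `GaussianNB` fits such data without complaint.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
# Hypothetical toy data, for illustration only
X_cont = rng.normal(size=(20, 3))            # continuous, includes negatives
X_count = rng.integers(0, 5, size=(20, 3))   # non-negative counts
y = np.array([0, 1] * 10)

GaussianNB().fit(X_cont, y)       # continuous features: fine
MultinomialNB().fit(X_count, y)   # count features: fine

try:
    MultinomialNB().fit(X_cont, y)
    rejected = False
except ValueError as err:
    rejected = True
    print("MultinomialNB rejected the input:", err)
```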

### Gaussian Naïve Bayes Classifier

In [3]:
#dataset=input('Please Enter Your Dataset:')
dataset = "Dataset/gaussian_NB.csv"
attributes,instances,labels=loadDataSet(dataset)
clf_G = GaussianNB()
clf_G.fit(instances, labels)
print("Gaussian Naïve Bayes is used.")

Gaussian Naïve Bayes is used.
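
Once a classifier has been fitted, unseen instances are classified with `predict`, and `predict_proba` returns the estimated posterior probability of each class. The sketch below is self-contained and uses synthetic two-feature data rather than `gaussian_NB.csv`, so the classifier name `clf` and the numbers are illustrative assumptions only.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)
# Synthetic training data: class "0" centered at 0, class "1" centered at 5
X_train = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                     rng.normal(5.0, 1.0, size=(50, 2))])
y_train = ["0"] * 50 + ["1"] * 50

clf = GaussianNB().fit(X_train, y_train)

# Two unseen instances, one near each class center
unseen = np.array([[0.2, -0.1],
                   [4.7, 5.3]])
pred = clf.predict(unseen)          # most probable class per instance
proba = clf.predict_proba(unseen)   # one row per instance, one column per class
print(pred)
print(proba.round(3))
```

With the notebook's own classifier, the equivalent call would be `predict(testset)`, where `testset` is a 2D array with the same number of columns as `instances`.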


Display the data.

In [4]:
df=pd.read_csv(dataset)
display(df)

| Unnamed: 0 | x1 | x2 | x3 | x4 | x5 | label |
|---|---|---|---|---|---|---|
| 0 | -1.413704 | -0.077863 | 1.091286 | -0.705300 | -0.932561 | 0 |
| 1 | -0.800891 | -0.576351 | 0.239464 | 0.266842 | -0.695066 | 0 |
| 2 | 0.869034 | -0.358989 | -1.961448 | -0.647346 | -1.548050 | 0 |
| 3 | 1.188028 | 1.732099 | 1.205512 | -0.059275 | 0.434703 | 0 |
| 4 | -0.015910 | -0.487646 | 0.267645 | 0.102246 | 0.796127 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 295 | 3.923467 | 5.072645 | 5.030650 | 4.450783 | 4.968945 | 2 |
| 296 | 4.656775 | 3.570640 | 3.053496 | 5.304592 | 5.847753 | 2 |
| 297 | 2.950417 | 4.103848 | 4.357519 | 4.547723 | 2.969420 | 2 |
| 298 | 3.228452 | 3.367882 | 4.276642 | 5.247006 | 3.992887 | 2 |


In [5]:
#n_foldCV=int(input("Please Enter the Number of Folds:"))
n_foldCV = 4
attributes,instances,labels=loadDataSet(dataset)
clf_G = GaussianNB()
scores = cross_val_score(clf_G, instances, labels, cv=n_foldCV)
print("======GaussianNB======")
print(np.array2string(scores,separator=","))
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[1.        ,1.        ,0.96      ,0.98666667]
Accuracy: 0.99 (+/- 0.03)
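
To make explicit what `cross_val_score` does with an integer `cv` and a classifier: it splits the data into stratified folds, trains on all folds but one, and scores on the held-out fold using the estimator's default score (accuracy for these classifiers). The sketch below reproduces this by hand on synthetic three-class data (an assumption for illustration, not `gaussian_NB.csv`) and checks that both approaches agree.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Synthetic data: three classes with 40 instances each
X = np.vstack([rng.normal(c, 1.0, size=(40, 3)) for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 40)

clf = GaussianNB()
auto_scores = cross_val_score(clf, X, y, cv=4)

# Manual equivalent: stratified folds, fresh model per fold, accuracy on the rest
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=4).split(X, y):
    model = clone(clf).fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

print(auto_scores)
print(manual_scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (manual_scores.mean(), manual_scores.std() * 2))
```

The `+/-` figure printed in this notebook is twice the standard deviation of the per-fold scores. It gives a rough sense of the spread across folds and should not be read as a formal confidence interval.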


### Multinomial Naïve Bayes Classifier

In [6]:
#dataset=input('Please Enter Your Dataset:')
dataset = "Dataset/multinomial_NB.csv"
attributes,instances,labels=loadDataSet(dataset)
clf_M = MultinomialNB()
clf_M.fit(instances, labels)
print("Multinomial Naïve Bayes is used.")

Multinomial Naïve Bayes is used.
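
For readers curious about what a fitted `MultinomialNB` stores: it estimates $P(x_j \mid c)$ from the total count of feature $j$ within class $c$, with additive (Laplace) smoothing controlled by `alpha`, which defaults to 1.0. The sketch below recomputes these estimates by hand on a tiny made-up count matrix and compares them with the `feature_log_prob_` attribute. The data are hypothetical and unrelated to `multinomial_NB.csv`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count data: 4 instances, 3 features, 2 classes
X = np.array([[3, 0, 1],
              [1, 2, 0],
              [0, 4, 1],
              [1, 1, 3]])
y = np.array([0, 0, 1, 1])

clf = MultinomialNB()   # alpha=1.0 by default
clf.fit(X, y)

alpha = 1.0
manual = []
for c in clf.classes_:
    counts = X[y == c].sum(axis=0)                        # per-feature totals in class c
    smoothed = (counts + alpha) / (counts.sum() + alpha * X.shape[1])
    manual.append(np.log(smoothed))
manual = np.array(manual)

print(np.exp(manual).round(3))            # smoothed P(x_j | c), one row per class
print(np.allclose(manual, clf.feature_log_prob_))
```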


Display the data.

In [7]:
df=pd.read_csv(dataset)
display(df)

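| Unnamed: 0 | x1 | x2 | x3 | x4 | x5 | label |
|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 0 | 2 | 0 | 0 |
| 1 | 1 | 2 | 0 | 1 | 1 | 0 |
| 2 | 1 | 4 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 1 | 1 | 1 | 0 |
| 4 | 3 | 0 | 1 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 295 | 3 | 1 | 0 | 0 | 1 | 2 |
| 296 | 1 | 3 | 1 | 0 | 0 | 2 |
| 297 | 2 | 2 | 1 | 0 | 0 | 2 |
| 298 | 1 | 2 | 0 | 0 | 2 | 2 |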


In [8]:
#n_foldCV=int(input("Please Enter the Number of Folds:"))
n_foldCV = 4
attributes,instances,labels=loadDataSet(dataset)
clf_M = MultinomialNB()
scores = cross_val_score(clf_M, instances, labels, cv=n_foldCV)
print("======MultinomialNB======")
print(np.array2string(scores,separator=","))
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[0.58666667,0.62666667,0.58666667,0.69333333]
Accuracy: 0.62 (+/- 0.09)
