## Supplement 4: Classification

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd

np.set_printoptions(precision=4)
pd.set_option('display.precision',4)


### 4.1 Programming Task: Gaussian Naive-Bayes Classifier
The Iris dataset, containing measurements of flower parts from three species of the Iris plant, is provided in the file __iris.csv__. The first four columns contain the measurements used as input features for the model, and the last column contains the class label of the plant species: Iris-setosa, Iris-versicolor, or Iris-virginica.
The goal of this task is to implement a Gaussian Naive-Bayes classifier for the Iris dataset.

i\. What are the assumptions on the dataset required for the Gaussian Naive-Bayes model?

ii\. Split the dataset into train and test by the 80:20 ratio.
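
As a reference for question i, the two modelling assumptions that the code in this task relies on can be written out as below. The notation is introduced here for illustration and is not part of the original task: x_j is feature j, y is the class label, d is the number of features, and mu_jc, sigma_jc are the mean and standard deviation of feature j within class c.

```latex
% Assumption 1: the d features are conditionally independent given the class
p(x_1, \dots, x_d \mid y = c) = \prod_{j=1}^{d} p(x_j \mid y = c)

% Assumption 2: within each class, every feature follows a univariate Gaussian
p(x_j \mid y = c) = \frac{1}{\sigma_{jc}\sqrt{2\pi}}
    \exp\!\left(-\frac{1}{2}\left(\frac{x_j - \mu_{jc}}{\sigma_{jc}}\right)^{2}\right)
```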


In [2]:
dataset = pd.read_csv('iris.csv')
dataset.head()


|   | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [3]:
# Class labels present in this dataset
class_labels = list(dataset['Species'].unique())
print(class_labels)

input_features = list(dataset.columns[:-1])
print(input_features)

['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']


In [4]:
# Shuffle the dataset
dataset = dataset.sample(frac=1).reset_index(drop=True)
dataset.head()

|   | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|
| 0 | 5.0 | 2.0 | 3.5 | 1.0 | Iris-versicolor |
| 1 | 7.2 | 3.0 | 5.8 | 1.6 | Iris-virginica |
| 2 | 7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica |
| 3 | 5.9 | 3.0 | 4.2 | 1.5 | Iris-versicolor |
| 4 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |


In [5]:
# Split the dataset
trainset_size = int(len(dataset) * 0.8)

trainset = dataset[ :trainset_size]
testset = dataset[trainset_size: ]

print('Train set size:',len(trainset))
print('Test set size: ',len(testset))

Train set size: 120
Test set size:  30
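
The shuffle above is unseeded, so the split (and therefore the accuracy reported later) can change from run to run. A minimal sketch of a reproducible variant follows. The helper name `shuffle_split`, the seed value, and the toy table are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def shuffle_split(df, train_fraction=0.8, seed=0):
    """Shuffle rows with a fixed seed, then cut into train and test parts."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    cut = int(len(shuffled) * train_fraction)
    return shuffled.iloc[:cut], shuffled.iloc[cut:]

# Toy stand-in for the Iris table (values are made up for illustration)
toy = pd.DataFrame({
    'feature': np.arange(10, dtype=float),
    'Species': ['a', 'b'] * 5,
})

train, test = shuffle_split(toy)
print('Train set size:', len(train))
print('Test set size: ', len(test))
```

Calling `shuffle_split` again with the same seed returns the same rows in the same order, which makes results comparable across runs.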


iii\. Estimate the parameters of the Gaussian Naive-Bayes classifier using the train set.
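
The quantities estimated by the code below can be summarised as follows. The symbols are introduced here for illustration: N is the number of training samples, N_c the number of training samples in class c, x_ij the value of feature j for training sample i, and d the number of features.

```latex
% Class prior: fraction of training samples that belong to class c
\hat{\pi}_c = \frac{N_c}{N}

% Per-class, per-feature mean
\hat{\mu}_{jc} = \frac{1}{N_c} \sum_{i \,:\, y_i = c} x_{ij}

% Per-class, per-feature sample variance (the N_c - 1 divisor matches the pandas default for .std())
\hat{\sigma}_{jc}^{2} = \frac{1}{N_c - 1} \sum_{i \,:\, y_i = c} \left(x_{ij} - \hat{\mu}_{jc}\right)^{2}

% Decision rule: choose the class with the largest unnormalized posterior
\hat{y} = \arg\max_{c} \; \hat{\pi}_c \prod_{j=1}^{d}
    \mathcal{N}\!\left(x_j \mid \hat{\mu}_{jc}, \hat{\sigma}_{jc}^{2}\right)
```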


In [6]:

def gaussian_probability_function(x, mean, std):
    """Density of a univariate Gaussian with the given mean and std, evaluated at x."""
    arg = -0.5 * ((x - mean)/std)**2
    prob = 1/(std * (np.sqrt(2*np.pi))) * np.exp(arg)
    return prob



In [7]:

def get_posterior(test_sample, class_name):
    """Return the unnormalized posterior p(class) * p(features | class) for test_sample."""

    # Get features from test sample
    test_sepal_l = test_sample['SepalLengthCm']     
    test_sepal_w = test_sample['SepalWidthCm']     
    test_petal_l = test_sample['PetalLengthCm']    
    test_petal_w = test_sample['PetalWidthCm']     



    # Get train samples belonging to the given class
    trainset_given_class = trainset[trainset['Species']==class_name]


    # Get prior
    prior = len(trainset_given_class) / len(trainset)

    # Per-feature mean and sample standard deviation (pandas default ddof=1) for this class
    mean_given_class = trainset_given_class[input_features].mean()
    std_given_class = trainset_given_class[input_features].std()

    # Model p(feature | class) for each feature as a Gaussian
    prob_sepal_l_given_class = gaussian_probability_function(test_sepal_l, mean_given_class['SepalLengthCm'],std_given_class['SepalLengthCm'])

    prob_sepal_w_given_class  = gaussian_probability_function(test_sepal_w, mean_given_class['SepalWidthCm'],std_given_class['SepalWidthCm'])

    prob_petal_l_given_class  = gaussian_probability_function(test_petal_l, mean_given_class['PetalLengthCm'],std_given_class['PetalLengthCm'])

    prob_petal_w_given_class  = gaussian_probability_function(test_petal_w, mean_given_class['PetalWidthCm'],std_given_class['PetalWidthCm'])

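    # Naive-Bayes assumption: features are conditionally independent given the class,
    # so the per-feature likelihoods multiply. The evidence p(features) is omitted,
    # which leaves the posterior unnormalized but does not change the argmax.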
    posterior_class =  prob_sepal_l_given_class * prob_sepal_w_given_class * prob_petal_l_given_class * prob_petal_w_given_class * prior

    return posterior_class
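
Multiplying four densities and a prior is numerically safe here, but with many features the product can underflow to zero. A common alternative is to sum log-densities instead. The sketch below illustrates the idea; the function names `gaussian_log_density` and `log_posterior_score` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_log_density(x, mean, std):
    """Log of the univariate Gaussian density, evaluated elementwise."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)

def log_posterior_score(x, means, stds, prior):
    """Unnormalized log posterior: log prior plus the sum of per-feature log densities."""
    x, means, stds = (np.asarray(a, dtype=float) for a in (x, means, stds))
    return np.log(prior) + gaussian_log_density(x, means, stds).sum(axis=-1)

# Made-up parameters for two hypothetical classes with two features each
x = np.array([1.0, 2.0])
score_a = log_posterior_score(x, means=[1.0, 2.0], stds=[0.5, 0.5], prior=0.5)
score_b = log_posterior_score(x, means=[4.0, 5.0], stds=[0.5, 0.5], prior=0.5)

# The sample sits exactly at the mean of class a, so class a scores higher
print(score_a > score_b)
```

Because the logarithm is monotonic, taking the argmax over log scores selects the same class as taking it over the products.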

 

iv\. Using the learned parameters, predict the classes for the samples in the test set.


In [8]:

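# get_posterior is vectorized: passing the whole test set returns one
# unnormalized posterior per test sample for the given class
posterior_setosa = get_posterior(testset, class_name='Iris-setosa')
posterior_versicolor = get_posterior(testset, class_name='Iris-versicolor')
posterior_virginica = get_posterior(testset, class_name='Iris-virginica')

# One column per class, in the same order as class_labels
posterior = pd.concat((posterior_setosa, posterior_versicolor, posterior_virginica), axis=1)
posterior.columns = class_labels

# Predict the class with the largest posterior for each sample
predicted_labels = posterior.idxmax(axis=1)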

print(predicted_labels)




120     Iris-virginica
121    Iris-versicolor
122     Iris-virginica
123        Iris-setosa
124    Iris-versicolor
125    Iris-versicolor
126     Iris-virginica
127    Iris-versicolor
128    Iris-versicolor
129        Iris-setosa
130     Iris-virginica
131     Iris-virginica
132        Iris-setosa
133        Iris-setosa
134     Iris-virginica
135    Iris-versicolor
136     Iris-virginica
137    Iris-versicolor
138     Iris-virginica
139    Iris-versicolor
140        Iris-setosa
141    Iris-versicolor
142    Iris-versicolor
143    Iris-versicolor
144        Iris-setosa
145     Iris-virginica
146        Iris-setosa
147    Iris-versicolor
148     Iris-virginica
149     Iris-virginica
dtype: object
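
The values compared above are unnormalized posteriors, which is enough for picking the most likely class. If class probabilities are wanted, each row can be divided by its sum. The sketch below shows this on a small table of made-up scores; in the notebook the same operation could be applied to `posterior`.

```python
import pandas as pd

# Made-up unnormalized scores for three samples (rows) and three classes (columns)
scores = pd.DataFrame({
    'Iris-setosa':     [0.20, 0.00, 0.01],
    'Iris-versicolor': [0.05, 0.30, 0.02],
    'Iris-virginica':  [0.00, 0.10, 0.07],
})

# Divide each row by its sum so the class probabilities add up to 1
probabilities = scores.div(scores.sum(axis=1), axis=0)

print(probabilities.round(3))
print(probabilities.idxmax(axis=1).tolist())
```

Normalizing rescales each row by a positive constant, so the predicted labels are unchanged.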


What is the accuracy of the model on the test set?

In [9]:
ground_truth = testset['Species']

# Count the test samples whose predicted label matches the true label
correct_predictions = (predicted_labels == ground_truth).sum()
print(correct_predictions)

accuracy = correct_predictions/len(testset)
print(accuracy)


28
0.9333333333333333
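
Accuracy alone does not show which classes are being confused. A confusion matrix, built here with `pd.crosstab`, gives that breakdown. The labels below are made up for illustration; in the notebook the inputs would be `ground_truth` and `predicted_labels`.

```python
import pandas as pd

# Made-up labels standing in for ground_truth and predicted_labels
truth = pd.Series(
    ['setosa', 'versicolor', 'virginica', 'versicolor', 'virginica'], name='actual')
pred = pd.Series(
    ['setosa', 'versicolor', 'versicolor', 'versicolor', 'virginica'], name='predicted')

# Rows are true classes, columns are predicted classes
confusion = pd.crosstab(truth, pred)

# Accuracy is the fraction of matching labels
accuracy = (truth == pred).mean()

print(confusion)
print(accuracy)
```

Off-diagonal entries count the misclassified samples; in this toy example one virginica sample is predicted as versicolor.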


In [10]:
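# Cross-check against scikit-learn's Gaussian Naive-Bayes implementation
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(trainset[input_features], trainset['Species'])

# score() returns the mean accuracy on the given test data
print(model.score(testset[input_features], testset['Species']))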

0.9333333333333333
