**The Iris dataset contains 3 classes of species: setosa, versicolor and virginica. It was originally introduced in 1936 by Ronald Fisher. Using the various features of the flower (the independent variables), we have to classify a given flower using a Naive Bayes classification model.**

---

### 1. Importing the Libraries

The first step is to import the libraries we need: NumPy, Pandas and Matplotlib.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### 2. Importing Dataset Iris

In this step, we import the Iris Flower dataset, which is stored in my github repository as IrisDataset.csv (the cell below reads a copy named iris.csv), and save it to the variable dataset. After this, we assign the 4 independent variables to X and the dependent variable 'variety' (the species) to y. The first 10 rows of the dataset are displayed.

In [4]:
dataset = pd.read_csv('/content/drive/MyDrive/Pendata/iris.csv')
X = dataset.iloc[:,:4].values
y = dataset['variety'].values
dataset.head(10)

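|   | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Setosa |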
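If the CSV file is not at hand, the same data can be obtained from the copy bundled with scikit-learn. The following is a minimal sketch under that assumption; the names `X_alt` and `y_alt` are illustrative and not part of the original notebook. Note that the bundled copy uses different column names (for example `sepal length (cm)`) and lowercase species names.

```python
from sklearn.datasets import load_iris

# Load the copy of the Iris data that ships with scikit-learn as pandas objects.
iris = load_iris(as_frame=True)
features = iris.data  # DataFrame with the 4 measurements
# Map the integer targets (0, 1, 2) to their species names.
labels = iris.target.map(dict(enumerate(iris.target_names)))

X_alt = features.values  # illustrative counterpart of X
y_alt = labels.values    # illustrative counterpart of y

print(X_alt.shape)
print(labels.value_counts().to_dict())
```

The shape and the per-class counts give a quick check that the data matches the description of 150 rows with 50 rows per class.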


### 3. Splitting the dataset into the Training set and Test set

Once we have loaded the dataset, we split it into a training set and a test set. The dataset has 150 rows, with 50 rows for each of the 3 classes. Because the rows are ordered by class, the split has to be random; train_test_split shuffles the rows by default. Here, we set test_size=0.5, which means that 50% of the dataset (75 rows) is held out as the test set and the remaining 50% (75 rows) is used as the training set for the Naive Bayes classification model.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5)
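A plain random split gives a different result on every run and can leave the classes unevenly represented in the training set. As a hedged variation (not what the cell above does), the sketch below passes `stratify` and `random_state` to `train_test_split` so that the split is reproducible and each class contributes equally. It uses scikit-learn's bundled Iris data and illustrative variable names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled Iris data; targets are the integers 0, 1, 2.
X_demo, y_demo = load_iris(return_X_y=True)

# stratify keeps the class proportions equal in both halves;
# random_state makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=0
)

train_counts = np.bincount(y_tr)  # samples per class in the training half
test_counts = np.bincount(y_te)   # samples per class in the test half
print(train_counts, test_counts)
```

With 50 rows per class and a 50/50 split, stratification places 25 rows of each class in each half.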

### 4. Feature Scaling

The features are standardized with StandardScaler, which rescales each feature to zero mean and unit variance. The scaler is fitted on X_train only and the same transformation is then applied to X_test, so that no information from the test set is used during training.

In [6]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
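To make the fit/transform distinction concrete, here is a small sketch on made-up numbers (the arrays `train` and `test` are illustrative, not taken from the dataset split). It checks that `transform` on the test data uses the mean and standard deviation learned from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny illustrative arrays with two features each.
train = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.2]])
test = np.array([[5.0, 3.4], [6.0, 3.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train
test_scaled = scaler.transform(test)        # reuses the train statistics

# The same transformation written out by hand (population standard deviation).
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(test_scaled, manual))
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The scaled training data has mean 0 and standard deviation 1 per feature; the scaled test data generally does not, because it is shifted and scaled by the training statistics.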

### 5. Training the Naive Bayes Classification model on the Training Set

In this step, we import the GaussianNB class from the sklearn.naive_bayes module. We use the Gaussian model here; scikit-learn also provides other variants such as Bernoulli, Categorical and Multinomial. We create a GaussianNB instance, assign it to the variable classifier, and fit it to the X_train and y_train values. The class_count_ attribute then shows how many training samples were seen for each class.

In [7]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
classifier.class_count_

array([26., 28., 21.])
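To show roughly what the Gaussian model computes, the sketch below estimates a per-class mean and variance for every feature and picks the class with the highest log prior plus Gaussian log-likelihood. It is an illustration under simplifying assumptions: it uses scikit-learn's bundled Iris data, illustrative variable names, and omits the small variance smoothing that GaussianNB adds, so it is not a drop-in replacement for the library.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X_nb, y_nb = load_iris(return_X_y=True)
model = GaussianNB().fit(X_nb, y_nb)

classes = np.unique(y_nb)
log_post = []
for c in classes:
    Xc = X_nb[y_nb == c]
    mean, var = Xc.mean(axis=0), Xc.var(axis=0)  # per-feature statistics
    log_prior = np.log(len(Xc) / len(X_nb))      # log P(class)
    # Sum of independent per-feature Gaussian log-densities (the "naive" assumption).
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * var) + (X_nb - mean) ** 2 / var, axis=1
    )
    log_post.append(log_prior + log_lik)

# Choose the class with the largest unnormalized log posterior.
manual_pred = classes[np.argmax(np.array(log_post), axis=0)]
agreement = (manual_pred == model.predict(X_nb)).mean()
print(agreement)
```

The printed value is the fraction of samples on which the hand-written rule and the library agree.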

### 6. Predicting the Test set results

Once the model is trained, we use classifier.predict() to predict the classes of the test set and store the predictions in the variable y_pred.

In [8]:
y_pred = classifier.predict(X_test) 
y_pred

array(['Virginica', 'Setosa', 'Setosa', 'Versicolor', 'Setosa',
       'Versicolor', 'Virginica', 'Setosa', 'Virginica', 'Virginica',
       'Setosa', 'Versicolor', 'Setosa', 'Virginica', 'Setosa',
       'Versicolor', 'Versicolor', 'Virginica', 'Versicolor', 'Virginica',
       'Versicolor', 'Versicolor', 'Versicolor', 'Versicolor',
       'Virginica', 'Virginica', 'Versicolor', 'Virginica', 'Setosa',
       'Virginica', 'Setosa', 'Versicolor', 'Virginica', 'Virginica',
       'Setosa', 'Setosa', 'Versicolor', 'Setosa', 'Versicolor',
       'Versicolor', 'Virginica', 'Virginica', 'Setosa', 'Setosa',
       'Setosa', 'Setosa', 'Virginica', 'Versicolor', 'Virginica',
       'Virginica', 'Virginica', 'Versicolor', 'Versicolor', 'Setosa',
       'Setosa', 'Virginica', 'Setosa', 'Setosa', 'Versicolor', 'Setosa',
       'Versicolor', 'Versicolor', 'Versicolor', 'Versicolor',
       'Virginica', 'Virginica', 'Versicolor', 'Virginica', 'Setosa',
       'Virginica', 'Setosa', 'Virginica', 'Vir
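Besides hard labels, a fitted GaussianNB can also report a probability for each class through `predict_proba`. The sketch below is a small illustration on scikit-learn's bundled Iris data (integer targets, illustrative names), not on the split used above. The columns of the result follow the order of `classes_`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X_p, y_p = load_iris(return_X_y=True)
clf = GaussianNB().fit(X_p, y_p)

# One row per sample, one column per class (ordered as in clf.classes_).
proba = clf.predict_proba(X_p[:5])
# Taking the most probable class per row reproduces predict().
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

print(clf.classes_)
print(proba.round(3))
```

Each row of probabilities sums to 1, and the column with the largest value corresponds to the label returned by `predict`.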

### 7. Confusion Matrix and Accuracy

This step is common to most classification tasks. We compute the accuracy of the trained model and display the confusion matrix.

The confusion matrix is a table that shows the number of correct and incorrect predictions on a classification problem when the real values of the test set are known. In scikit-learn's confusion_matrix, the entry in row i and column j is the number of samples whose real class is i and whose predicted class is j, so the correct predictions lie on the diagonal.

In [9]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
from sklearn.metrics import accuracy_score 
print ("Accuracy : ", accuracy_score(y_test, y_pred))
cm

Accuracy :  0.9733333333333334


array([[24,  0,  0],
       [ 0, 22,  0],
       [ 0,  2, 27]])
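The accuracy can be read directly off the confusion matrix: it is the sum of the diagonal divided by the total number of samples. The sketch below redoes that arithmetic for the matrix shown above and also derives per-class recall and precision; the variable names are illustrative, and the class order (Setosa, Versicolor, Virginica) assumes the sorted label order that `confusion_matrix` uses by default.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: real, columns: predicted).
cm_doc = np.array([[24, 0, 0],
                   [0, 22, 0],
                   [0, 2, 27]])

accuracy = np.trace(cm_doc) / cm_doc.sum()        # (24 + 22 + 27) / 75
recall = np.diag(cm_doc) / cm_doc.sum(axis=1)     # correct / real count per class
precision = np.diag(cm_doc) / cm_doc.sum(axis=0)  # correct / predicted count per class

print(accuracy)
print(recall.round(3), precision.round(3))
```

Here 73 of the 75 test samples are on the diagonal, which matches the reported accuracy of about 0.973. The two errors are both in the last row, second column: flowers whose real class is the third one but which were predicted as the second.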

### 8. Comparing the Real Values with Predicted Values

In this step, a Pandas DataFrame is created to compare the real classes of the test set (y_test) with the predicted classes (y_pred) side by side.

In [10]:
df = pd.DataFrame({'Real Values':y_test, 'Predicted Values':y_pred})
df

|   | Real Values | Predicted Values |
|---|---|---|
| 0 | Virginica | Virginica |
| 1 | Setosa | Setosa |
| 2 | Setosa | Setosa |
| 3 | Virginica | Versicolor |
| 4 | Setosa | Setosa |
| ... | ... | ... |
| 70 | Setosa | Setosa |
| 71 | Virginica | Virginica |
| 72 | Virginica | Virginica |
| 73 | Virginica | Virginica |
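With a comparison table like this, it is often more useful to look only at the rows where the two columns disagree. The following is a minimal sketch using the first five rows shown above; the DataFrame name `demo` is illustrative, and the same filter could be applied to `df`.

```python
import pandas as pd

# First five rows of the comparison table above, used as a small example.
demo = pd.DataFrame({
    'Real Values':      ['Virginica', 'Setosa', 'Setosa', 'Virginica', 'Setosa'],
    'Predicted Values': ['Virginica', 'Setosa', 'Setosa', 'Versicolor', 'Setosa'],
})

# Keep only the rows where the prediction differs from the real class.
mismatches = demo[demo['Real Values'] != demo['Predicted Values']]
print(mismatches)
```

In this excerpt only row 3 is kept, where a Virginica flower was predicted as Versicolor.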
