## About:
In this notebook we explore the Iris dataset and use the Gaussian Naive Bayes algorithm to predict the species of an Iris flower from its sepal length, sepal width, petal length, and petal width. The Iris dataset is balanced: each of the three species has 50 records.

#### Importing all the required packages. We will use sklearn to import the Gaussian Naive Bayes algorithm.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

#### Reading the Iris dataset through pandas library

In [3]:
data = pd.read_csv("Iris.csv",encoding = 'Latin-1')

In [4]:
data.head()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


#### We remove the Id column because it carries no information about the species of the Iris flower.

In [7]:
data.drop('Id',axis = 1,inplace = True)

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
SepalLengthCm    150 non-null float64
SepalWidthCm     150 non-null float64
PetalLengthCm    150 non-null float64
PetalWidthCm     150 non-null float64
Species          150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


#### Summary of data

In [9]:
data.describe()

| | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


In [10]:
data["Species"].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
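The claim that the dataset is balanced can be checked by counting rows per species. The following is a minimal sketch, assuming scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for `Iris.csv`; its labels are `setosa`, `versicolor`, and `virginica`, without the `Iris-` prefix used in the CSV.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
iris = load_iris()

# Map the integer targets (0, 1, 2) back to species names.
species = pd.Series(iris.target_names[iris.target], name="Species")

# Count rows per class; a balanced dataset has equal counts.
counts = species.value_counts()
print(counts)
```

On the notebook's own DataFrame, `data["Species"].value_counts()` performs the same check.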

**We have now prepared our dataset. Next we separate it into the features X (the sepal and petal measurements) and the target y (the Species), and then split both into a training set and a test set. We will use train_test_split from sklearn to randomly split the rows in a 70:30 ratio: 70% for training and 30% for testing.**

#### Building the feature matrix X from the sepal and petal measurements

In [11]:
X = data[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']]

#### Building the target vector y from the Species column

In [12]:
y = data['Species']

In [13]:
X.head()

| | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [14]:
y.head()

0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: Species, dtype: object

In [17]:
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 42,test_size = 0.30)
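A 70:30 split of 150 rows leaves 105 rows for training and 45 for testing. The sketch below illustrates this, again assuming `load_iris` as a stand-in for `Iris.csv`. The `stratify=y` argument is an addition that the notebook does not use: it keeps the class proportions identical in both parts, so each species contributes 15 rows to the test set.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X, y = load_iris(return_X_y=True)

# stratify=y is not in the notebook; it preserves class balance in the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # rows and columns of each part
test_counts = Counter(y_test.tolist())
print(test_counts)                  # rows per class in the test set
```

Without `stratify`, the split is purely random, so the per-class counts in the test set may differ slightly from one another.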

#### Initializing the classifier

In [15]:
gnb = GaussianNB()

#### Fitting the Gaussian classifier to the training data

In [18]:
gnb.fit(X_train,y_train)

GaussianNB(priors=None)
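To show what fitting a Gaussian Naive Bayes model involves, here is a simplified NumPy sketch. It is not scikit-learn's implementation. For each class it estimates a prior and a per-feature mean and variance, then predicts the class with the highest log posterior, treating the features as independent Gaussians. The function names, the toy data, and the `1e-9` variance floor are all illustrative assumptions.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Illustrative small constant to keep variances strictly positive.
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    """Return the class with the highest log posterior for each row of X."""
    classes, priors, means, variances = model
    # diff has shape (n_samples, n_classes, n_features).
    diff = X[:, None, :] - means[None, :, :]
    # Sum of per-feature Gaussian log densities (the "naive" independence step).
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * variances)[None] + diff**2 / variances[None],
        axis=2,
    )
    log_post = np.log(priors)[None, :] + log_lik
    return classes[np.argmax(log_post, axis=1)]

# Hypothetical toy data: two well-separated classes with two features.
X_toy = np.array([[1.0, 2.0], [1.2, 1.9], [0.8, 2.1],
                  [5.0, 6.0], [5.2, 5.9], [4.8, 6.1]])
y_toy = np.array(["a", "a", "a", "b", "b", "b"])

model = fit_gaussian_nb(X_toy, y_toy)
pred = predict_gaussian_nb(model, np.array([[1.1, 2.0], [5.1, 6.0]]))
print(pred)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.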

#### We will now use the trained model gnb to predict the labels for X_test. The predictions will be stored in the variable y_pred.

In [19]:
y_pred = gnb.predict(X_test)

#### Now we will calculate the accuracy of the model by comparing the true labels of the test data with the predicted labels.

In [20]:
acc = accuracy_score(y_test,y_pred)

In [24]:
print("accuracy of the model: ",round(acc*100,2),"%")

accuracy of the model:  97.78 %
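Accuracy is the fraction of test labels that the model predicts correctly. With 45 test rows, 97.78% corresponds to 44 correct predictions and one mistake. The sketch below reproduces that arithmetic in plain Python. The label lists are hypothetical and only illustrate the calculation; they are not the notebook's actual `y_test` and `y_pred`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Hypothetical labels: 45 test rows, exactly one of them misclassified.
y_true_demo = (["Iris-setosa"] * 15
               + ["Iris-versicolor"] * 15
               + ["Iris-virginica"] * 15)
y_pred_demo = list(y_true_demo)
y_pred_demo[20] = "Iris-virginica"  # hypothetical single error

acc_demo = accuracy(y_true_demo, y_pred_demo)
print(round(acc_demo * 100, 2))  # 44 / 45 -> 97.78
```

Because the test set is small, a single additional error would lower the accuracy to 43/45, or 95.56%.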


**We trained our model and calculated its accuracy. The model predicts the Iris species correctly for 97.78% of the test set, which is 44 of the 45 test flowers. This is very good accuracy, and it shows that Naive Bayes is a decent and fast algorithm to use for this kind of prediction.**