## **Iris Flower Classification using Classification Machine Learning algorithms**

---
In this project, we use machine learning algorithms to fit models on our training data and then evaluate them on our testing data using standard evaluation metrics.

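We will use the following algorithms:
1) Linear Regression (fitted to the one-hot encoded species labels)
2) K-Nearest Neighbors (KNN)
3) Decision Trees

We will evaluate the KNN and Decision Tree models using the accuracy score, and the Linear Regression model using MAE, MSE, and the R2 score.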

Finally, we will try to use data visualization techniques to show which model works the best for our given dataset.

### About the Dataset

The iris flower has three species (setosa, versicolor, and virginica), which differ in their measurements. Given the measurements of iris flowers labelled by species, the task is to train a machine learning model that learns from these measurements and classifies each flower into its species.

### Importing necessary modules

In [44]:
import warnings
warnings.filterwarnings('ignore')

In [45]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
import sklearn.metrics as metrics
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
%matplotlib inline

### Loading the Dataset

Dataset: [Link](https://www.kaggle.com/datasets/saurabh00007/iriscsv)

Reading the CSV file into a Pandas dataframe

In [46]:
df = pd.read_csv("iris.csv")

In [47]:
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [48]:
df.tail()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
149,150,5.9,3.0,5.1,1.8,Iris-virginica


In [49]:
# statistical summary of dataframe
df.describe()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


In [50]:
# shape of pandas dataframe
df.shape

(150, 6)

**One Hot Encoding**

I used the get_dummies function on the dataframe to convert the categorical variable (Species) into dummy/indicator variables.

Link to the official documentation: [**get_dummies**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)

In [51]:
# make sure to use dtype=int to have 1/0 value which
# indicates presence or absence of a species
encoded_df = pd.get_dummies(df, columns = ['Species'], dtype=int)
encoded_df.head(10)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species_Iris-setosa,Species_Iris-versicolor,Species_Iris-virginica
0,1,5.1,3.5,1.4,0.2,1,0,0
1,2,4.9,3.0,1.4,0.2,1,0,0
2,3,4.7,3.2,1.3,0.2,1,0,0
3,4,4.6,3.1,1.5,0.2,1,0,0
4,5,5.0,3.6,1.4,0.2,1,0,0
5,6,5.4,3.9,1.7,0.4,1,0,0
6,7,4.6,3.4,1.4,0.3,1,0,0
7,8,5.0,3.4,1.5,0.2,1,0,0
8,9,4.4,2.9,1.4,0.2,1,0,0
9,10,4.9,3.1,1.5,0.1,1,0,0
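To make the encoding step concrete, here is a minimal sketch on a tiny made-up frame (the `toy` dataframe and its values are assumptions for illustration, not part of the dataset). It shows the indicator columns that `pd.get_dummies` creates and one possible way to map them back to the original label.

```python
import pandas as pd

# Tiny hypothetical frame with one categorical column (not the real dataset).
toy = pd.DataFrame({
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# One indicator column per distinct species; dtype=int gives 1/0 instead of True/False.
toy_encoded = pd.get_dummies(toy, columns=["Species"], dtype=int)
print(toy_encoded.columns.tolist())

# Recover the label: pick the indicator column holding the 1, then strip the prefix.
indicator_cols = [c for c in toy_encoded.columns if c.startswith("Species_")]
recovered = (
    toy_encoded[indicator_cols]
    .idxmax(axis=1)
    .str.replace("Species_", "", regex=False)
)
print(recovered.tolist())
```

Keeping a way back to a single label column is handy when a model or metric expects one class per row rather than three indicator columns.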


**Creating train and test dataset**

Train/Test Split involves splitting the dataset into training and testing sets, which are mutually exclusive. After this, we train the model on the training set and test it on the testing set.

Here, roughly 75% of the dataset will be used for training and 25% for testing. We create a mask with the np.random.rand() function: each row is assigned to the training set with probability 0.75, so the exact split sizes vary from run to run.

In [52]:
mask = np.random.rand(len(encoded_df)) < 0.75
train = encoded_df[mask]
test = encoded_df[~mask]
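The mask-based split can be sketched in isolation as follows. This is an illustrative sketch only: the seeded generator `rng` and the name `n_rows` are assumptions added here so the result is repeatable, whereas the cell above uses the unseeded `np.random.rand`.

```python
import numpy as np

rng = np.random.default_rng(0)      # seed chosen arbitrarily for repeatability
n_rows = 150                        # same row count as the iris dataframe

# Each row is independently True with probability 0.75, so the training
# share is only approximately 75%.
mask = rng.random(n_rows) < 0.75

n_train = int(mask.sum())           # rows selected by the mask
n_test = int((~mask).sum())         # rows selected by the inverted mask

print(n_train, n_test)
```

Because `~mask` is the exact complement of `mask`, every row lands in exactly one of the two sets, which is what makes them mutually exclusive.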

In [53]:
test.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species_Iris-setosa,Species_Iris-versicolor,Species_Iris-virginica
8,9,4.4,2.9,1.4,0.2,1,0,0
20,21,5.4,3.4,1.7,0.2,1,0,0
30,31,4.8,3.1,1.6,0.2,1,0,0
31,32,5.4,3.4,1.5,0.4,1,0,0
33,34,5.5,4.2,1.4,0.2,1,0,0


In [54]:
train.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species_Iris-setosa,Species_Iris-versicolor,Species_Iris-virginica
0,1,5.1,3.5,1.4,0.2,1,0,0
1,2,4.9,3.0,1.4,0.2,1,0,0
2,3,4.7,3.2,1.3,0.2,1,0,0
3,4,4.6,3.1,1.5,0.2,1,0,0
4,5,5.0,3.6,1.4,0.2,1,0,0


### Linear Regression

In [55]:
regr = linear_model.LinearRegression()
x_train1 = np.asanyarray(train[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']])
y_train1 = np.asanyarray(train[['Species_Iris-setosa','Species_Iris-versicolor','Species_Iris-virginica']])

regr.fit(x_train1,y_train1)

In [56]:
print('Coefficients: ', regr.coef_)
print('Intercept: ', regr.intercept_)

Coefficients:  [[ 0.07232429  0.2383981  -0.2096338  -0.10332349]
 [-0.03192034 -0.42732746  0.1913708  -0.395534  ]
 [-0.04040395  0.18892935  0.018263    0.49885749]]
Intercept:  [ 0.09724414  1.57669503 -0.67393917]


In [57]:
x_test1 = np.asanyarray(test[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']])
y_test1 = np.asanyarray(test[['Species_Iris-setosa','Species_Iris-versicolor','Species_Iris-virginica']])
y_hat1 = regr.predict(x_test1)


print("MAE: %.2f" % np.mean(np.absolute( y_hat1 - y_test1 )))
print("MSE: %.2f" % np.mean( ( y_hat1 - y_test1 ) ** 2 ))
# r2_score expects (y_true, y_pred); regr.score computes the same quantity
print("R2 score: %.2f" % r2_score( y_test1, y_hat1 ))
print('R2 score (regr.score): %.2f' % regr.score( x_test1, y_test1 ))

MAE: 0.23
MSE: 0.09
R2 score: 0.62
R2 score (regr.score): 0.62
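The R2 score is not symmetric in its arguments: scikit-learn's `r2_score` expects the true values first and the predictions second. The following pure-Python sketch uses made-up numbers (the helper `r2` and both lists are assumptions for illustration) to show how swapping the two changes the result.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot,
    where SS_tot is measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up 0/1 targets and regression outputs, for illustration only.
y_true = [1, 0, 0, 1]
y_pred = [0.8, 0.3, 0.1, 0.6]

correct_order = r2(y_true, y_pred)   # true values first
swapped_order = r2(y_pred, y_true)   # arguments swapped

print(round(correct_order, 3), round(swapped_order, 3))
```

With these numbers the correct order gives about 0.7, while the swapped call is slightly negative. The residual sum is the same in both calls; only the total sum of squares changes, because it is taken around the mean of whichever list is passed first.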


### KNN

In [58]:
df

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


In [59]:
df['Species'].value_counts()

Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

In [60]:
encoded_df.value_counts()

Id   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species_Iris-setosa  Species_Iris-versicolor  Species_Iris-virginica
1    5.1            3.5           1.4            0.2           1                    0                        0                         1
95   5.6            2.7           4.2            1.3           0                    1                        0                         1
97   5.7            2.9           4.2            1.3           0                    1                        0                         1
98   6.2            2.9           4.3            1.3           0                    1                        0                         1
99   5.1            2.5           3.0            1.1           0                    1                        0                         1
                                                                                                                                      ..
51   7.0            3.2           4.7         

In [61]:
encoded_df.describe()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species_Iris-setosa,Species_Iris-versicolor,Species_Iris-virginica
count,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667,0.333333,0.333333,0.333333
std,43.445368,0.828066,0.433594,1.76442,0.763161,0.472984,0.472984,0.472984
min,1.0,4.3,2.0,1.0,0.1,0.0,0.0,0.0
25%,38.25,5.1,2.8,1.6,0.3,0.0,0.0,0.0
50%,75.5,5.8,3.0,4.35,1.3,0.0,0.0,0.0
75%,112.75,6.4,3.3,5.1,1.8,1.0,1.0,1.0
max,150.0,7.9,4.4,6.9,2.5,1.0,1.0,1.0


In [62]:
encoded_df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species_Iris-setosa,Species_Iris-versicolor,Species_Iris-virginica
0,1,5.1,3.5,1.4,0.2,1,0,0
1,2,4.9,3.0,1.4,0.2,1,0,0
2,3,4.7,3.2,1.3,0.2,1,0,0
3,4,4.6,3.1,1.5,0.2,1,0,0
4,5,5.0,3.6,1.4,0.2,1,0,0


In [63]:
# selecting the four feature columns from the pandas dataframe
x2 = encoded_df[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']]
x2[0:5]

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [64]:
y2 = encoded_df[['Species_Iris-setosa','Species_Iris-versicolor','Species_Iris-virginica']]
y2[0:5]

Unnamed: 0,Species_Iris-setosa,Species_Iris-versicolor,Species_Iris-virginica
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0


In [65]:
# standardizing the features: StandardScaler rescales each column to zero mean and unit variance
x2_ = preprocessing.StandardScaler().fit(x2).transform(x2.astype(float))
x2_[0:5]

array([[-0.90068117,  1.03205722, -1.3412724 , -1.31297673],
       [-1.14301691, -0.1249576 , -1.3412724 , -1.31297673],
       [-1.38535265,  0.33784833, -1.39813811, -1.31297673],
       [-1.50652052,  0.10644536, -1.2844067 , -1.31297673],
       [-1.02184904,  1.26346019, -1.3412724 , -1.31297673]])
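As a worked check of the standardization, the first scaled value can be reproduced by hand from the summary statistics shown earlier. The sketch below is a hand calculation, not the scaler's internals; it assumes the rounded mean and standard deviation from the `df.describe()` output, so the result only agrees to a few decimal places.

```python
import math

# Values taken from the df.describe() output for SepalLengthCm.
n = 150
mean = 5.843333
sample_std = 0.828066          # pandas describe() reports the sample std (ddof=1)

# StandardScaler divides by the population std (ddof=0), so rescale first.
population_std = sample_std * math.sqrt((n - 1) / n)

x = 5.1                        # SepalLengthCm of the first row
z = (x - mean) / population_std

print(round(z, 4))
```

The result is close to -0.9007, which agrees with the first entry of the scaled array above.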

In [66]:
# train test split
# note: this splits the unscaled x2; the standardized x2_ computed above is not used here
x_train2, x_test2, y_train2, y_test2 = train_test_split( x2,y2, test_size= 0.2, random_state= 5 )
print('Train set: ', x_train2.shape, y_train2.shape)
print('Test set: ', x_test2.shape, y_test2.shape)

Train set:  (120, 4) (120, 3)
Test set:  (30, 4) (30, 3)


In [67]:
k = 4

# train the model
neigh = KNeighborsClassifier(n_neighbors=k).fit(x_train2, y_train2)
neigh

In [68]:
yhat2 = neigh.predict(x_test2)
yhat2[0:5]

array([[0, 1, 0],
       [0, 0, 1],
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1]])

In [69]:
print('Train set accuracy: ', metrics.accuracy_score(y_train2, neigh.predict(x_train2)))
print('Test set accuracy: ', metrics.accuracy_score(y_test2, yhat2))

Train set accuracy:  0.975
Test set accuracy:  0.9333333333333333
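Because the targets here are one-hot rows rather than a single label column, `accuracy_score` treats them as multilabel data and counts a row as correct only when all three entries match (subset accuracy). The sketch below reproduces that rule with NumPy on made-up arrays; the all-zero prediction row is a hypothetical case included to show how such a row would be scored.

```python
import numpy as np

# Made-up one-hot targets and predictions, for illustration only.
y_true = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0],
                   [0, 0, 1]])
y_pred = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0],
                   [0, 0, 0]])   # hypothetical row where no species is predicted

# A row counts as correct only if every indicator matches.
row_matches = np.all(y_true == y_pred, axis=1)
subset_accuracy = row_matches.mean()

print(subset_accuracy)           # 3 of 4 rows match exactly -> 0.75
```

Fitting the classifier on a single label column instead (for example the original `Species` column) would avoid this multilabel behaviour, at the cost of departing from the one-hot setup used in this notebook.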


In [70]:
k = 2
neigh2 = KNeighborsClassifier(n_neighbors = k).fit(x_train2, y_train2)
yhat_2 = neigh2.predict(x_test2)
print('Train set accuracy: ', metrics.accuracy_score(y_train2, neigh2.predict(x_train2)))
print('Test set accuracy: ', metrics.accuracy_score(y_test2, yhat_2))

Train set accuracy:  0.975
Test set accuracy:  0.8666666666666667


In [71]:
k = 4
neigh4 = KNeighborsClassifier(n_neighbors = k).fit(x_train2, y_train2)
yhat_4 = neigh4.predict(x_test2)
print('Train set accuracy: ', metrics.accuracy_score(y_train2, neigh4.predict(x_train2)))
print('Test set accuracy: ', metrics.accuracy_score(y_test2, yhat_4))

Train set accuracy:  0.975
Test set accuracy:  0.9333333333333333


In [72]:
k = 6
neigh6 = KNeighborsClassifier(n_neighbors = k).fit(x_train2, y_train2)
yhat_6 = neigh6.predict(x_test2)
print('Train set accuracy: ', metrics.accuracy_score(y_train2, neigh6.predict(x_train2)))
print('Test set accuracy: ', metrics.accuracy_score(y_test2, yhat_6))

Train set accuracy:  0.9833333333333333
Test set accuracy:  0.9333333333333333


In [73]:
k = 8
neigh8 = KNeighborsClassifier(n_neighbors = k).fit(x_train2, y_train2)
yhat_8 = neigh8.predict(x_test2)
print('Train set accuracy: ', metrics.accuracy_score(y_train2, neigh8.predict(x_train2)))
print('Test set accuracy: ', metrics.accuracy_score(y_test2, yhat_8))

Train set accuracy:  0.975
Test set accuracy:  0.9666666666666667


In [74]:
k = 10
neigh10 = KNeighborsClassifier(n_neighbors = k).fit(x_train2, y_train2)
yhat_10 = neigh10.predict(x_test2)
print('Train set accuracy: ', metrics.accuracy_score(y_train2, neigh10.predict(x_train2)))
print('Test set accuracy: ', metrics.accuracy_score(y_test2, yhat_10))

Train set accuracy:  0.9833333333333333
Test set accuracy:  0.9666666666666667


In [75]:
k = 12
neigh12 = KNeighborsClassifier(n_neighbors = k).fit(x_train2, y_train2)
yhat_12 = neigh12.predict(x_test2)
print('Train set accuracy: ', metrics.accuracy_score(y_train2, neigh12.predict(x_train2)))
print('Test set accuracy: ', metrics.accuracy_score(y_test2, yhat_12))

Train set accuracy:  0.9833333333333333
Test set accuracy:  0.9666666666666667
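The repeated cells for k = 2 to 12 could be condensed into a single loop. The sketch below is one possible way to do that, under assumptions that differ from the notebook: it loads the copy of the iris data bundled with scikit-learn (`load_iris`) instead of `iris.csv`, and it uses a single integer label per flower instead of one-hot rows, so the printed accuracies may not match the values above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data with integer class labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5
)

# Map each k to its (train accuracy, test accuracy) pair.
results = {}
for k in range(2, 13, 2):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results[k] = (
        accuracy_score(y_train, model.predict(X_train)),
        accuracy_score(y_test, model.predict(X_test)),
    )

for k, (train_acc, test_acc) in results.items():
    print(f"k={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

Collecting the scores in a dictionary also makes it straightforward to plot accuracy against k later on.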


### Decision Tree

In [76]:
x3 = x2_

In [77]:
y3 = y2

In [78]:
x_train3, x_test3, y_train3, y_test3 = train_test_split( x3, y3, test_size= 0.3, random_state= 3 )
print('Train set: ', x_train3.shape, y_train3.shape)
print('Test set: ', x_test3.shape, y_test3.shape)

Train set:  (105, 4) (105, 3)
Test set:  (45, 4) (45, 3)


In [79]:
x_train3

array([[ 6.74501145e-01,  3.37848329e-01,  4.21564419e-01,
         3.96171883e-01],
       [-9.00681170e-01,  5.69251294e-01, -1.17067529e+00,
        -9.18557817e-01],
       [ 1.89829664e-01, -8.19166497e-01,  7.62758643e-01,
         5.27644853e-01],
       [-5.25060772e-02, -8.19166497e-01,  7.62758643e-01,
         9.22063763e-01],
       [-5.37177559e-01, -1.24957601e-01,  4.21564419e-01,
         3.96171883e-01],
       [-1.26418478e+00, -1.24957601e-01, -1.34127240e+00,
        -1.18150376e+00],
       [-1.02184904e+00,  3.37848329e-01, -1.45500381e+00,
        -1.31297673e+00],
       [-5.25060772e-02, -8.19166497e-01,  7.62758643e-01,
         9.22063763e-01],
       [ 1.40150837e+00,  3.37848329e-01,  5.35295827e-01,
         2.64698913e-01],
       [ 3.10997534e-01, -5.87763531e-01,  5.35295827e-01,
         1.75297293e-03],
       [-1.14301691e+00,  1.06445364e-01, -1.28440670e+00,
        -1.44444970e+00],
       [ 5.53333275e-01,  8.00654259e-01,  1.04708716e+00,
      

In [80]:
dtree = DecisionTreeClassifier(criterion="entropy", max_depth= 4)
dtree

In [81]:
dtree.fit(x_train3, y_train3)

In [82]:
predtree = dtree.predict(x_test3)

In [83]:
predtree

array([[1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 1, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 0, 1],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [1, 0, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [1, 0, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [1, 0, 0],
       [0, 0, 1]])

In [84]:
print("Decision Tree's accuracy: ", metrics.accuracy_score(y_test3, predtree))

Decision Tree's accuracy:  0.9333333333333333
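To see what a tree of this kind has learned, its split rules can be printed as text. The sketch below is illustrative and rests on assumptions not in the notebook: it uses scikit-learn's bundled iris data with integer labels, unscaled features, and a fixed `random_state` for the tree, so its rules and accuracy need not match the result above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: scikit-learn's bundled iris data with integer class labels.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=3
)

# Same criterion and depth limit as the notebook's tree.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
tree.fit(X_train, y_train)

# Human-readable view of the learned thresholds.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

tree_accuracy = accuracy_score(y_test, tree.predict(X_test))
print("Accuracy:", tree_accuracy)
```

Each line of the printed rules is a threshold on one measurement, which makes it easy to check which features the tree relies on.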
