# The Iris Classification Problem
![](http://wildadirondacks.org/images/Adirondack-Wildflowers-Blue-Flag-Iris-Iris-versicolor-Cemetery-Road-Wetlands-12-June-2018-71.jpg)

#### Hello Dear Kagglers !!

This is a very basic tutorial for learning machine learning from scratch using the Iris dataset. By following this notebook, you will learn how to apply machine learning to a given dataset. I try to explain every step of the implementation in detail, and I hope you find it useful learning material.

For a more advanced notebook that covers some more detailed concepts, have a look at this notebook.

If you find this notebook useful, **please upvote**!!!

## The Iris Classification Problem
Imagine you are a botanist seeking an automated way to categorize each Iris flower you find. Machine learning provides many algorithms to classify flowers statistically. For instance, a sophisticated machine learning program could classify flowers based on photographs. Our ambitions are more modest—we're going to classify Iris flowers based on the length and width measurements of their sepals and petals.

The Iris genus comprises about 300 species, but our program will classify only the following three:

* Iris setosa
* Iris virginica
* Iris versicolor


In [None]:
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

### Import necessary libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_palette('husl')
import matplotlib.pyplot as plt
%matplotlib inline

### Import or upload dataset

In [None]:
#Load Data
iris = pd.read_csv('../input/Iris.csv')

#### Preview of Data

In [None]:
iris.shape

* There are 150 observations with 4 features each (sepal length, sepal width, petal length, petal width). The remaining columns are the Id and the Species label.
* There are no null values, so we don't have to worry about that.
* There are 50 observations of each species (setosa, versicolor, virginica).

### Let's view the top 5 and bottom 5 records of the dataset

In [None]:
# View Top 5 records
iris.head()

In [None]:
# View bottom 5 records
iris.tail()

#### Let's check whether there is any inconsistency in the dataset

In [None]:
iris.info()

As we see above, there are no null values in the dataset, so the data can be processed as is.

### Exploratory Data Analysis (EDA)

#### Let's check some statistical facts

In [None]:
iris.describe()

In [None]:
iris['Species'].value_counts()

#### Data Visualization

* After graphing the features in a pair plot, it is clear that the relationship between pairs of features of iris-setosa (in pink) is distinctly different from those of the other two species.
* There is some overlap in the pairwise relationships of the other two species, iris-versicolor (brown) and iris-virginica (green).

In [None]:
iris1 = iris.drop('Id', axis=1)
g = sns.pairplot(iris1, hue='Species', markers='+')
plt.show()

#### Visualization1 : Sepal Length VS Width
This graph shows the relationship between the **sepal length** and **sepal width**.

In [None]:
fig = iris1[iris1.Species=='Iris-setosa'].plot(kind='scatter',x='SepalLengthCm',y='SepalWidthCm',color='orange', label='Setosa')
iris1[iris1.Species=='Iris-versicolor'].plot(kind='scatter',x='SepalLengthCm',y='SepalWidthCm',color='blue', label='versicolor',ax=fig)
iris1[iris1.Species=='Iris-virginica'].plot(kind='scatter',x='SepalLengthCm',y='SepalWidthCm',color='green', label='virginica', ax=fig)
fig.set_xlabel("Sepal Length")
fig.set_ylabel("Sepal Width")
fig.set_title("Sepal Length VS Width")
fig=plt.gcf()
fig.set_size_inches(12,8)
plt.show()

#### Visualization2 : Petal Length VS Width
> This graph shows the relationship between the **petal length** and **petal width**.

In [None]:
fig = iris1[iris1.Species=='Iris-setosa'].plot.scatter(x='PetalLengthCm',y='PetalWidthCm',color='orange', label='Setosa')
iris1[iris1.Species=='Iris-versicolor'].plot.scatter(x='PetalLengthCm',y='PetalWidthCm',color='blue', label='versicolor',ax=fig)
iris1[iris1.Species=='Iris-virginica'].plot.scatter(x='PetalLengthCm',y='PetalWidthCm',color='green', label='virginica', ax=fig)
fig.set_xlabel("Petal Length")
fig.set_ylabel("Petal Width")
fig.set_title("Petal Length VS Width")
fig=plt.gcf()
fig.set_size_inches(12,8)
plt.show()

As we can see, the petal features give a better cluster separation than the sepal features. This indicates that the petal measurements may lead to better and more accurate predictions than the sepal measurements. We will check that later.

#### Now let us see how the length and width are distributed

In [None]:
iris1.hist(edgecolor='black')
fig=plt.gcf()
fig.set_size_inches(12,6)
plt.show()

#### Now let us see how the length and width vary according to the species

In [None]:
g = sns.violinplot(y='Species', x='SepalLengthCm', data=iris1, inner='quartile')
plt.show()
g = sns.violinplot(y='Species', x='SepalWidthCm', data=iris1, inner='quartile')
plt.show()
g = sns.violinplot(y='Species', x='PetalLengthCm', data=iris1, inner='quartile')
plt.show()
g = sns.violinplot(y='Species', x='PetalWidthCm', data=iris1, inner='quartile')
plt.show()

As we know, the given problem is a classification problem, so we will use classification algorithms to build a model.

Before we start, we need to clear up some ML terminology.

**Attributes-->** An attribute is a property of an instance that may be used to determine its classification. In this dataset, the attributes are the petal and sepal length and width. Attributes are also known as features.

**Target Variable-->** In the machine learning context, the target variable is the variable that is, or should be, the output. Here the target variable is the species, which takes one of 3 values.

Now, when we train any algorithm, the number of features and their correlation play an important role. If there are many features and many of them are highly correlated, training an algorithm with all of the features may reduce its accuracy, so feature selection should be done carefully. This dataset has few features, but we will still look at the correlation.

In [None]:
plt.figure(figsize=(10,8)) 
sns.heatmap(iris1.drop('Species', axis=1).corr(), annot=True, cmap='cubehelix_r')  # draws a heatmap of the correlation matrix of the numeric columns (Species is text, so it is dropped first)
plt.show()

The sepal width and sepal length are only weakly correlated. The petal width and petal length are highly correlated.
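
To make the numbers in the heatmap more concrete, here is a minimal sketch of the Pearson correlation coefficient, which is the default method used by pandas' `corr()`. The helper name `pearson_r` and the toy measurements are made up for illustration and are not taken from the Iris data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy measurements, for illustration only.
length = [1.0, 2.0, 3.0, 4.0, 5.0]
width_up = [0.2, 0.4, 0.6, 0.8, 1.0]    # rises in lockstep with length
width_down = [1.0, 0.8, 0.6, 0.4, 0.2]  # falls in lockstep with length

print(pearson_r(length, width_up))    # close to +1: strong positive correlation
print(pearson_r(length, width_down))  # close to -1: strong negative correlation
```

Values near +1 or -1 mean the two features move together almost perfectly, while values near 0 mean there is little linear relationship.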

We will use all the features for training the algorithm and check the accuracy.

Then we will train on just 1 petal feature and 1 sepal feature, choosing a pair that is not highly correlated, and check the accuracy again. Less-correlated features carry less redundant information, which may help accuracy. We will check it later.
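
As a preview of that comparison, here is a hedged sketch that trains the same classifier on different feature subsets. It assumes scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for `Iris.csv`; the choice of KNN with 3 neighbours, the specific feature pairs, and the variable names are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X_all, y_all = load_iris(return_X_y=True)

# Column order in load_iris: sepal length, sepal width, petal length, petal width.
feature_sets = {
    'sepal only': [0, 1],
    'petal only': [2, 3],
    'sepal width + petal length': [1, 2],
}

subset_scores = {}
for name, cols in feature_sets.items():
    X_sub = X_all[:, cols]
    # Same split settings for every subset, so the scores are comparable.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_sub, y_all, test_size=0.3, random_state=10)
    model = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
    subset_scores[name] = model.score(X_te, y_te)  # accuracy on the test set
    print(name, subset_scores[name])
```

The exact scores depend on the split, so treat a single run as a rough indication rather than a definitive ranking.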

## Build Model with Scikit-learn

#### Steps To Be followed When Applying an Algorithm
**Step 1:** Split the dataset into a training set and a testing set. The testing set is generally smaller than the training set, so that most of the data is available for training the model.

**Step 2:** Select an algorithm suited to the type of problem (classification or regression).

**Step 3:** Pass the training dataset to the algorithm to train it. We use the **.fit()** method.

**Step 4:** Pass the testing data to the trained algorithm to predict the outcome. We use the **.predict()** method.

**Step 5:** Check the accuracy by passing the actual outputs and the predicted outcomes to a scoring function (here, accuracy_score from sklearn.metrics).
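
To show what the accuracy check in the last step computes, here is a minimal sketch of accuracy as the fraction of predictions that match the actual labels. The function name `accuracy` and the four toy labels are made up for illustration; in the notebook itself, scikit-learn's `accuracy_score` does this job.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels, for illustration only.
y_true = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-virginica']
y_pred = ['Iris-setosa', 'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica']

print(accuracy(y_true, y_pred))  # 3 of 4 predictions are correct -> 0.75
```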

### Split the training and test dataset

In [None]:
X = iris.drop(['Id', 'Species'], axis=1)
y = iris['Species']
# print(X.head())
print(X.shape)
# print(y.head())
print(y.shape)

#### Advantages
* By splitting the dataset pseudo-randomly into two separate sets, we can train using one set and test using the other.
* This ensures that we won't use the same observations in both sets.
* It is more flexible and faster than creating a model using all of the dataset for training.

#### Disadvantages
* The accuracy scores for the testing set can vary depending on what observations are in the set.
* This disadvantage can be countered using k-fold cross-validation.
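
A minimal sketch of k-fold cross-validation, assuming scikit-learn's bundled Iris data (`load_iris`) in place of `Iris.csv`. The choices of 5 folds and a 3-neighbour KNN model are illustrative, not something this notebook prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)

knn_demo = KNeighborsClassifier(n_neighbors=3)

# Split the data into 5 folds; each fold is used once as the test set
# while the model is trained on the remaining 4 folds.
cv_scores = cross_val_score(knn_demo, X_demo, y_demo, cv=5, scoring='accuracy')

print(cv_scores)         # one accuracy score per fold
print(cv_scores.mean())  # average accuracy across the folds
```

Averaging over several folds gives an estimate that depends less on which observations happen to land in a single test set.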

#### Notes
* The accuracy score of the models depends on the observations in the testing set, which is determined by the seed of the pseudo-random number generator (random_state parameter).
* As a model's complexity increases, the training accuracy (accuracy you get when you train and test the model on the same data) increases.
* If a model is too complex or not complex enough, the testing accuracy is lower.
* For KNN models, the value of k determines the level of complexity. A lower value of k means that the model is more complex.
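
The notes above can be explored with a short sketch that compares training and testing accuracy for a few values of k. It assumes scikit-learn's bundled Iris data (`load_iris`) in place of `Iris.csv`, and the particular k values are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10)

results = {}
for k in (1, 5, 15, 50):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # score() returns accuracy; store (training accuracy, testing accuracy).
    results[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(k, results[k])
```

With k=1 each training point is typically its own nearest neighbour, so the training accuracy tends to be at or near 1.0 even when the testing accuracy is lower.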

In [None]:
from sklearn.model_selection import train_test_split  #to split the dataset for training and testing

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

Let's check the Train and Test Dataset

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
y_train.head()

In [None]:
y_test.head()

#### Classification algorithms we use with this Iris (structured) dataset

1. Logistic Regression
2. Decision Tree
3. Support Vector Machine (SVM)
4. K-Nearest Neighbours

In [None]:
# Import all the packages needed to use the various classification algorithms

from sklearn.linear_model import LogisticRegression  # for the Logistic Regression algorithm
from sklearn.tree import DecisionTreeClassifier  # for the Decision Tree algorithm
from sklearn import svm  # for the Support Vector Machine (SVM) algorithm
from sklearn.neighbors import KNeighborsClassifier  # for K-Nearest Neighbours
from sklearn import metrics  # for checking the model accuracy

### Logistic Regression

In [None]:
logr = LogisticRegression()
logr.fit(X_train,y_train)
y_pred = logr.predict(X_test)
acc_log = metrics.accuracy_score(y_test, y_pred)  # argument order is (actual, predicted)
print('The accuracy of the Logistic Regression is', acc_log)

### Decision Tree

In [None]:
dt = DecisionTreeClassifier()
dt.fit(X_train,y_train)
y_pred = dt.predict(X_test)
acc_dt = metrics.accuracy_score(y_test, y_pred)  # argument order is (actual, predicted)
print('The accuracy of the Decision Tree is', acc_dt)

### Support Vector Machine (SVM)

In [None]:
sv = svm.SVC() #select the algorithm
sv.fit(X_train,y_train) # we train the algorithm with the training data and the training output
y_pred = sv.predict(X_test) #now we pass the testing data to the trained algorithm
acc_svm = metrics.accuracy_score(y_test, y_pred)  # argument order is (actual, predicted)
print('The accuracy of the SVM is:', acc_svm)

### K-Nearest Neighbours

In [None]:
knc = KNeighborsClassifier(n_neighbors=3) #this examines 3 neighbours for putting the new data into a class
knc.fit(X_train,y_train)
y_pred = knc.predict(X_test)
acc_knn = metrics.accuracy_score(y_test, y_pred)  # argument order is (actual, predicted)
print('The accuracy of the KNN is', acc_knn)
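
To illustrate what "examining 3 neighbours" means, here is a from-scratch sketch of the idea behind K-Nearest Neighbours: find the k closest training points by Euclidean distance and take a majority vote of their labels. The function `knn_predict` and the toy points are made up for illustration; scikit-learn's `KNeighborsClassifier` is a far more efficient implementation of the same idea.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Pair every training point's distance to the query with its label,
    # then sort so that the closest points come first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbours wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up (petal length, petal width) points, for illustration only.
train_points = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2),
                (4.5, 1.5), (4.7, 1.4), (4.4, 1.3)]
train_labels = ['setosa', 'setosa', 'setosa',
                'versicolor', 'versicolor', 'versicolor']

print(knn_predict(train_points, train_labels, (1.6, 0.25)))  # near the setosa points
print(knn_predict(train_points, train_labels, (4.6, 1.45)))  # near the versicolor points
```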

Let's check the accuracy for various values of n_neighbors for K-Nearest Neighbours

In [None]:
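k_values = list(range(1, 11))  # values of n_neighbors to try
scores = []  # test accuracy for each value of n_neighbors
for k in k_values:
    kcs = KNeighborsClassifier(n_neighbors=k)
    kcs.fit(X_train, y_train)
    y_pred = kcs.predict(X_test)
    # Collect scores in a plain list (Series.append was removed in pandas 2.0)
    scores.append(metrics.accuracy_score(y_test, y_pred))
plt.plot(k_values, scores)
plt.xticks(k_values)
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.show()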

In [None]:
models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'Support Vector Machines',
              'K-Nearest Neighbours'],
    'Score': [acc_log, acc_dt, acc_svm, acc_knn]})
models.sort_values(by='Score', ascending=False)

![](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Machine+Learning+R/iris-machinelearning.png)

Here, we have implemented some of the common machine learning classification algorithms. Since the given dataset is small and has very few features, I didn't cover some concepts that become relevant only when there are many features.

I hope this kernel is useful for learning machine learning from scratch with the Iris dataset.

If this notebook helped you learn, **please upvote.**

##### Thank You!!