# ML From Scratch With IRIS Dataset

### In this notebook we will learn:
* Basic EDA.
* Plotly visualisation.
* Splitting the dataset into a training set and a test set.
* Dealing with categorical data.
* K-fold cross-validation to check accuracy.
* Classification models:
    * Logistic Regression
    * Support Vector Machine (SVM) with Linear kernel
    * Support Vector Machine (SVM) with Gaussian kernel
    * K-Nearest Neighbour (KNN)
    * Naive Bayes
    * Decision Tree
    * Random Forest
* Prediction on new values.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot, download_plotlyjs
import cufflinks as cf
init_notebook_mode(connected=True)
cf.go_offline()
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import warnings
warnings.filterwarnings('ignore')

import os
print(os.listdir("../input"))
print()
print("The files in the dataset are:-- ")
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

['database.sqlite', 'Iris.csv']

The files in the dataset are:-- 
Iris.csv
database.sqlite



In [2]:
# Importing the dataset
df = pd.read_csv("../input/Iris.csv")

In [3]:
# Checking the top 5 entries
df.head()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


# BASIC EDA

In [4]:
print(f"The number of rows and columns in the dataset are \t {df.shape}")

The number of rows and columns in the dataset are 	 (150, 6)


In [5]:
# Let's check the unique species in the dataset, which we will predict in the end.
print(df['Species'].unique())
print("There are 3 species.")

['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
There are 3 species.


In [6]:
# Let us check whether we have null values in the dataset or not.
print(df.isnull().sum())
print()
print()
print("As one can see, there are no null values in the dataset.")

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64


As one can see, there are no null values in the dataset.


In [7]:
# Let us remove the unwanted columns/features which will not help us to predict the Species of the Flower
df.drop('Id', axis=1, inplace=True)

#### Let us get Some Statistical Knowledge about the dataset

In [8]:
# Let us see the distribution of SepalLength, SepalWidth, PetalLength and PetalWidth
# and get a statistical summary of the dataset.
temp_df = df[['SepalLengthCm','SepalWidthCm','PetalLengthCm', 'PetalWidthCm']]
# A box plot shows the measured values on the y-axis, not frequencies.
temp_df.iplot(kind='box', title='Distribution of Length and Width of Sepal and Petal in Cm', yTitle='Measurement (cm)')
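The box plot gives a visual summary; the same statistics can be read as numbers with `describe()`. The sketch below is a minimal illustration that assumes scikit-learn's bundled copy of Iris as a stand-in for `../input/Iris.csv` (a few values may differ marginally from the CSV). The name `demo_df` is introduced here only for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for ../input/Iris.csv.
iris = load_iris()
columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
demo_df = pd.DataFrame(iris.data, columns=columns)

# count, mean, std, min, quartiles and max for every feature:
# the quartile rows are the numbers that the box plot draws.
summary = demo_df.describe()
print(summary.round(2))
```

The `25%`, `50%` and `75%` rows correspond to the bottom edge, middle line and top edge of each box.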

#### Let's see the correlation between the different features

In [9]:
# Species is a string column, so drop it before computing correlations.
df.drop(columns='Species').corr().iplot(kind='heatmap')

#### Observation:-
* SepalLength and SepalWidth are only weakly (and negatively) correlated.
* SepalLength is highly correlated with PetalLength and PetalWidth.
* SepalWidth is moderately, and negatively, correlated with PetalLength and PetalWidth.
* PetalLength and PetalWidth are very highly correlated.
* Even so, we will use all four features for prediction.
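The observations above can be checked numerically rather than read off a heatmap. This is a small sketch, again assuming scikit-learn's bundled Iris data in place of the CSV; `demo_df` and `corr` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for ../input/Iris.csv.
iris = load_iris()
columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
demo_df = pd.DataFrame(iris.data, columns=columns)

# Pearson correlation between every pair of features.
corr = demo_df.corr()
print(corr.round(2))

# Pull out single pairs to compare against the observations.
print("Petal length vs petal width:", round(corr.loc['PetalLengthCm', 'PetalWidthCm'], 2))
print("Sepal length vs sepal width:", round(corr.loc['SepalLengthCm', 'SepalWidthCm'], 2))
```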

#### =====================================================================================================

# PREDICTION WITH ML MODELS

#### Here we will use 7 classification algorithms/models of machine learning.
#### They are as follows:-
* Logistic Regression
* Support Vector Machine (SVM) with Linear kernel
* Support Vector Machine (SVM) with Gaussian kernel
* K-Nearest Neighbour (K-NN)
* Naive Bayes
* Decision Tree
* Random Forest Model

#### We will train the models one by one, check the accuracy of each, compare them, and then choose the best model for our dataset.

In [10]:
# Let us Import the Important Libraries  to train our Model for Machine Learning 
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder # To deal with Categorical Data in Target Vector.
from sklearn.model_selection import train_test_split  # To Split the dataset into training data and testing data.
from sklearn.model_selection import cross_val_score   # To check the accuracy of the model.

In [11]:
df.head()

| | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


#### Let us do the data preprocessing step
* Here we will deal with four concepts:
#### MCSS
* 'M' for Missing: dealing with missing data.
* 'C' for Categorical: dealing with categorical data.
* 'S' for Splitting: splitting the dataset into a training set and a test set.
* 'S' for Scaling: scaling the features so that variables can be compared on the same scale.

In [12]:
# Creating the feature matrix and target vector.
X = df.iloc[:,:-1].values
Y = df.iloc[:,-1].values

In [13]:
# Let us check whether we have null values in the dataset or not.
print(df.isnull().sum())
print()
print()
print("As one can see, there are no null values in the dataset.")

SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64


As one can see, there are no null values in the dataset.


In [14]:
# The target vector holds categorical (string) data, so we convert it
# into integer codes that the models can work with.

label_y = LabelEncoder()
Y = label_y.fit_transform(Y)

In [15]:
Y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [16]:
df['Species'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

* 0 means Iris-setosa, 1 means Iris-versicolor, 2 means Iris-virginica
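The mapping between integer codes and species names does not have to be written down by hand: a fitted `LabelEncoder` keeps it in `classes_`. The sketch below uses a tiny hand-made array (`species`, an illustrative stand-in for the `Species` column) to show how the codes can be read back.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative stand-in for the Species column.
species = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa'])

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ is sorted; the string at index i is encoded as integer i.
mapping = {code: str(name) for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original labels.
decoded = encoder.inverse_transform(codes)
print(decoded)
```

Using `inverse_transform` on model predictions avoids hard-coding the 0/1/2 mapping later on.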

#### Let us split the dataset into training set and Test set so that we can check accuracy of model.

In [17]:
x_train,x_test,y_train,y_test = train_test_split(X,Y, test_size=0.2)
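The split above is random, so every run produces different training and test sets and therefore different scores. A possible variant, sketched here under the assumption that scikit-learn's bundled Iris data replaces the CSV, fixes `random_state` for reproducibility and uses `stratify` to keep the three species balanced in both parts. The seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for ../input/Iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)

# random_state makes the split repeatable; stratify keeps class proportions equal.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(x_tr.shape, x_te.shape)
print("Test samples per class:", np.bincount(y_te))
```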

In [18]:
# All four features are measured in cm and are on comparable scales, so we skip feature scaling here.
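Whether scaling matters can be tested instead of assumed. The following sketch compares an RBF-kernel SVM with and without `StandardScaler`, wrapped in a pipeline so the scaler is fitted only on the training folds. It assumes scikit-learn's bundled Iris data; the names `raw_svm` and `scaled_svm` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled Iris data stands in for ../input/Iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)

raw_svm = SVC(kernel='rbf')
# The pipeline refits the scaler inside every fold, so no test-fold data leaks in.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

raw_scores = cross_val_score(raw_svm, X_demo, y_demo, cv=10)
scaled_scores = cross_val_score(scaled_svm, X_demo, y_demo, cv=10)

print(f"Without scaling: {raw_scores.mean():.3f}")
print(f"With scaling:    {scaled_scores.mean():.3f}")
```

If the two means are close, skipping the scaling step is a defensible choice for this dataset.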

#### =================================================================================================
### Let us make Models one by one.

### 1). Logistic Regression

In [19]:
# First step is to train our model .

classifier_logi = LogisticRegression()
classifier_logi.fit(x_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [20]:
# Let's predict on the test set with our model.
y_pred = classifier_logi.predict(x_test)

In [21]:
# Let us estimate the accuracy of the model with 10-fold cross-validation on the training set.
accuracy = cross_val_score(estimator=classifier_logi, X=x_train, y=y_train, cv=10)
print(f"The accuracy of the Logistic Regressor Model is \t {accuracy.mean()}")
print(f"The deviation in the accuracy is \t {accuracy.std()}")

The accuracy of the Logistic Regressor Model is 	 0.9405594405594405
The deviation in the accuracy is 	 0.066212366375176


* Here we get a mean cross-validation accuracy of about 94%, with a standard deviation of about 6.6%, which is a reasonable baseline.
* Let us check the accuracy of other models.
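Note that the cross-validation score is computed on the training set only, while `y_pred` from the test set is never scored. A minimal sketch of scoring held-out predictions follows. It assumes scikit-learn's bundled Iris data, an arbitrary `random_state=0`, and `max_iter=1000` (chosen here only to avoid convergence warnings); none of these come from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for ../input/Iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(x_tr, y_tr)
pred = model.predict(x_te)

# Fraction of held-out samples that were classified correctly.
test_accuracy = accuracy_score(y_te, pred)
# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_te, pred, labels=[0, 1, 2])

print(f"Test accuracy: {test_accuracy:.3f}")
print(matrix)
```

The confusion matrix shows which species are mistaken for each other, which a single accuracy number hides.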

#### ====================================================================================================

### 2). Support Vector Machine (SVM) with Linear kernel.

In [22]:
# Let us train the model
classifier_svm1 = SVC(kernel='linear')
classifier_svm1.fit(x_train,y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [23]:
# Let's predict on test dataset.
y_pred = classifier_svm1.predict(x_test)

In [24]:
# Check the accuracy.
accuracy = cross_val_score(estimator=classifier_svm1, X=x_train, y=y_train, cv=10)
print(f"The accuracy of the SVM linear kernel Model is \t {accuracy.mean()}")
print(f"The deviation in the accuracy is \t {accuracy.std()}")

The accuracy of the SVM linear kernel Model is 	 0.9923076923076923
The deviation in the accuracy is 	 0.02307692307692306


* Here we get a mean cross-validation accuracy of about 99%.

#### =====================================================================================================

### 3). SVM with Gaussian kernel

In [25]:
# Train the model
classifier_svm2 = SVC(kernel='rbf')
classifier_svm2.fit(x_train,y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [26]:
# Predict on test set.
y_pred = classifier_svm2.predict(x_test)

In [27]:
# Check the accuracy.
accuracy = cross_val_score(estimator=classifier_svm2, X=x_train, y=y_train, cv=10)
print(f"The accuracy of the SVM Gaussian kernel Model is \t {accuracy.mean()}")
print(f"The deviation in the accuracy is \t {accuracy.std()}")

The accuracy of the SVM Gaussian kernel Model is 	 0.9823076923076923
The deviation in the accuracy is 	 0.03575889015129063


#### =================================================================================================


### 4). K-Nearest Neighbour (KNN)

In [28]:
# Train model
classifier_knn = KNeighborsClassifier()
classifier_knn.fit(x_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [29]:
# predict on test set
y_pred = classifier_knn.predict(x_test)

In [30]:
# Check the accuracy.
accuracy = cross_val_score(estimator=classifier_knn, X=x_train, y=y_train, cv=10)
print(f"The accuracy of the KNN Model is \t {accuracy.mean()}") 
print(f"The deviation in the accuracy is \t {accuracy.std()}")

The accuracy of the KNN Model is 	 0.9732167832167834
The deviation in the accuracy is 	 0.04124101788348962


#### ================================================================================================

### 5). Naive Bayes Model.

In [31]:
# Train Model
classifier_bayes = GaussianNB()
classifier_bayes.fit(x_train,y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [32]:
# Predict on test set.
y_pred = classifier_bayes.predict(x_test)

In [33]:
# Check the accuracy and deviation in the accuracy
accuracy = cross_val_score(estimator=classifier_bayes, X=x_train, y=y_train, cv=10)
print(f"The accuracy of the Naive Bayes Model is \t {accuracy.mean()}") 
print(f"The deviation in the accuracy is \t {accuracy.std()}")

The accuracy of the Naive Bayes Model is 	 0.965034965034965
The deviation in the accuracy is 	 0.059172575667989294


#### ====================================================================================================

### 6). Decision Tree Model

In [34]:
# Train model
classifier_deci = DecisionTreeClassifier()
classifier_deci.fit(x_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [35]:
# Predict on test set
y_pred = classifier_deci.predict(x_test)

In [36]:
# Check the accuracy and deviation in the accuracy
accuracy = cross_val_score(estimator=classifier_deci, X=x_train, y=y_train, cv=10)
print(f"The accuracy of the Decision Tree Model is \t {accuracy.mean()}") 
print(f"The deviation in the accuracy is \t {accuracy.std()}")

The accuracy of the Decision Tree Model is 	 0.9832167832167833
The deviation in the accuracy is 	 0.033711806414528526


#### =====================================================================================================

### 7). Random Forest Model


In [37]:
# Train Model
classifier_ran = RandomForestClassifier()
classifier_ran.fit(x_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [38]:
# Predict on test set.
y_pred = classifier_ran.predict(x_test)

In [39]:
# Check the accuracy and deviation in the accuracy
accuracy = cross_val_score(estimator=classifier_ran, X=x_train, y=y_train, cv=10)
print(f"The accuracy of the Random Forest Model is \t {accuracy.mean()}") 
print(f"The deviation in the accuracy is \t {accuracy.std()}")

The accuracy of the Random Forest Model is 	 0.9832167832167833
The deviation in the accuracy is 	 0.033711806414528526


#### Observation:-
* We have now trained all seven classification models.
* Let us choose the best one on the basis of mean cross-validation accuracy and its standard deviation.
* In this run the SVM with linear kernel scores highest, with a mean accuracy of about 99.2% and a standard deviation of about 2.3%. The SVM with Gaussian kernel, Decision Tree and Random Forest follow closely at about 98.2% to 98.3%.
* One standard deviation around the mean puts the linear SVM roughly between 97% and 100% across folds. Because `train_test_split` was called without a `random_state`, these numbers will change from run to run.
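The comparison above can also be produced in one loop, which keeps the numbers side by side and avoids copying them by hand. This is a sketch under stated assumptions: scikit-learn's bundled Iris data replaces the CSV, the split is stratified with an arbitrary `random_state=42`, and `max_iter=1000` is set on the logistic regression only to avoid convergence warnings. The exact scores will therefore differ from the outputs shown earlier.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Iris data stands in for ../input/Iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM (linear)': SVC(kernel='linear'),
    'SVM (Gaussian)': SVC(kernel='rbf'),
    'KNN': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
}

# 10-fold cross-validation on the training set for every model.
rows = []
for name, model in models.items():
    scores = cross_val_score(model, x_tr, y_tr, cv=10)
    rows.append({'model': name, 'mean_accuracy': scores.mean(), 'std': scores.std()})

results = pd.DataFrame(rows).sort_values('mean_accuracy', ascending=False)
print(results.to_string(index=False))
```

On a dataset this small the differences between the top models are often within one standard deviation, so the ranking should be read with caution.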

## Now Let us make Prediction on new values of SepalLength, SepalWidth, PetalLength, PetalWidth.

In [40]:
# Let's make a prediction on new values.
try:
    sepalLength = float(input("Enter Sepal Length:\t"))
    sepalWidth = float(input("Enter Sepal Width:\t"))
    petalLength = float(input("Enter Petal Length:\t"))
    petalWidth = float(input("Enter Petal Width:\t"))

    new_values = [[sepalLength, sepalWidth, petalLength, petalWidth]]  # 2-D array: one row, four features.

    species = classifier_svm2.predict(new_values)  # Using the SVM with Gaussian kernel.

    # Map the integer code back to the species name with the fitted LabelEncoder.
    flag = label_y.inverse_transform(species)[0]

    print()
    print()
    print(f"*** The Species is: \t {flag} ****")

except Exception:
    # Reached when input() is unavailable (e.g. a non-interactive run) or a value is not a number.
    print("Run this code with Python")



Run this code with Python
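Because `input()` does not work in a non-interactive run, the cell above only prints the fallback message. A non-interactive alternative is to wrap the prediction in a function. The sketch below is one possible shape, not the notebook's own method: `predict_species` is a hypothetical helper, the model is trained on scikit-learn's bundled Iris data, and the returned names follow that dataset's `target_names` (`setosa`, not `Iris-setosa`).

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled Iris data stands in for ../input/Iris.csv.
iris = load_iris()
model = SVC(kernel='rbf').fit(iris.data, iris.target)


def predict_species(sepal_length, sepal_width, petal_length, petal_width):
    """Hypothetical helper: return the predicted species name for one flower (cm)."""
    row = [[sepal_length, sepal_width, petal_length, petal_width]]  # one row, four features
    code = model.predict(row)[0]
    return str(iris.target_names[code])


# Measurements of the first flower shown in df.head().
print(predict_species(5.1, 3.5, 1.4, 0.2))
```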

