# Iris Flower Classification using Machine Learning
<hr>

![Iris Flowers on unsplash](https://raw.githubusercontent.com/Dutta-SD/Images_Unsplash/master/Kaggle/olga-mandel-gK6f8bKKic0-unsplash.jpg)

The Iris Flower Dataset is a very useful dataset. It is often the first dataset one comes across when starting on their ML journey.

It consists of the dimensions of various flowers, and the task is to predict the species of each flower.

Let's explore this dataset and apply ML on it.

Specifically, we will be using:
* **_Repeated Stratified K Fold Cross Validation_**
* **Removing Feature interdependence using _PCA_**
* **_Naive Bayes_ Model**
* **Random Forest Model**

## _Initial Kaggle Environment Setup_

This is the initial cell for most Kaggle Notebooks. The output of the cell below gives the path to the dataset.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

* As we see, the dataset is a `.csv` file named *_IRIS.csv_*. 
We import the dataset using <b><u>pandas(pd)</u></b> and store it in a DataFrame object.

In [None]:
# Importing the data
iris_data = pd.read_csv("/kaggle/input/iris-flower-dataset/IRIS.csv")
# See the first few rows of the dataset
iris_data.head()

As we see the dataset consists of 5 columns.

Let us use `DataFrame.info()` to get insights about various datatypes of the data.

In [None]:
# Information about DataType
iris_data.info()

Some insights which we derive from the info:
* There are 5 columns: 4 have datatype `float64` and 1 has datatype `object` (which can be treated as a string).
* There are 150 entries in each column.
* All entries are non-null, which means every cell of the DataFrame has a value. Real-world data usually contains `NaN` values (which serve as placeholders for missing data).
* `sepal_length`, `sepal_width`, `petal_length` and `petal_width` are the features. We have to use them to predict `species`.
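
On data that does contain gaps, the non-null check can be made explicit. The following is a minimal sketch on a small made-up DataFrame; the `toy` frame and its values are hypothetical and not part of the Iris data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing measurement (not the Iris data)
toy = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 6.3],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Count missing values per column; all zeros would mean there is no NaN at all
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if any cell in the frame is missing
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

Running `iris_data.isna().sum()` in the same way should report zero for every column, matching what `info()` shows.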

### Converting species column to numeric form

We will convert the species to label-encoded form using `Series.map()` on the `species` column.

In [None]:
# species is a string column, which would be a problem for the model, so convert it to integers
iris_data['species'] = iris_data['species'].map({  
                                                'Iris-setosa' : 0, 
                                                'Iris-versicolor' : 1,
                                                'Iris-virginica' : 2
                                                  })
iris_data.head()

In [None]:
# Get dependent and independent features
X, y = iris_data.iloc[:, :-1], iris_data.iloc[:, -1]   ### The last feature is 'species'. This is dependent feature
X.shape, y.shape

# Data Visualisation

Data visualisation is an important step in Data Science. We can use it for:
* **Outlier Detection** - Check whether any data point is very different from the others. If so, we may need to remove it, as otherwise it can skew our observations heavily. Visualising the data helps us spot such points.


* **Class Imbalance Detection** - Real-world data often has many observations from one class and only a few from another. This is problematic because a dumb model that predicts the majority class all the time will still get a good accuracy.<br><br>**If we have 99 'YES' and 1 'NO' in the dataset, then a model predicting 'YES' all the time will get 99% accuracy. Visualisation helps us detect this.**
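
The 99-to-1 example can be checked with a few lines of NumPy. This is only an illustrative sketch; the labels are made up to match the example, and the per-class recall and balanced accuracy are added here as one possible way to expose the problem.

```python
import numpy as np

# Hypothetical labels: 99 'YES' and 1 'NO', as in the example above
y_true = np.array(["YES"] * 99 + ["NO"])

# A "dumb" model that always predicts the majority class
y_pred = np.full(y_true.shape, "YES")

# Plain accuracy looks excellent
accuracy = np.mean(y_true == y_pred)

# Recall per class exposes the problem: the 'NO' class is never found
recall_yes = np.mean(y_pred[y_true == "YES"] == "YES")
recall_no = np.mean(y_pred[y_true == "NO"] == "NO")
balanced_accuracy = (recall_yes + recall_no) / 2

print(accuracy, recall_no, balanced_accuracy)
```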

In [None]:
# Visualisation Libraries
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### 1. PairPlot of all data Points
Pairplot generates plots between all pairs of features. This is helpful for visualising the whole dataset in one go.

In [None]:
### hue = species colours plot as per species
### It will give 3 colours in the plot

sns.set_style('whitegrid')   ### Sets grid style
sns.pairplot(data = iris_data,
             hue = 'species',
             palette = sns.color_palette("husl", 3),   ### one colour per species
             markers=["o", "s", "D"],
             corner = True,
             diag_kind = 'kde'             
             ); 

### PairPlot insights into the data
1. `petal_length` and `petal_width` seem to be positively correlated (they appear to have a linear relationship).
2. Iris-Setosa seems to have smaller petal length and petal width than the others.
3. Overall, Iris-Setosa seems to have smaller dimensions than the other flowers.

### 2. BarPlot for Class Distribution
We will generate a bar plot to see if there is any class imbalance (that is, whether some flower species is more frequent than the others).

In [None]:
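# Bar plot of the number of observations per species
sns.set_style('darkgrid')
# Take labels and counts from the same Series so that they always line up
species_counts = iris_data['species'].value_counts().sort_index()
sns.barplot(x = species_counts.index,
            y = species_counts.values,
            palette=sns.color_palette(["#e74c3c", "#34495e", "#2ecc71"]));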

### Bar Plot Inference
1. There are exactly 50 observations in each class. So the dataset has no class imbalance. (Lucky for us! :P)

### 3. HeatMap

A heatmap of the _correlation matrix_ shows how strongly each pair of variables is correlated. **Correlation** measures the extent to which two variables vary together (`DataFrame.corr()` computes the Pearson correlation by default). Features that vary strongly with the dependent variable tend to be better predictors of it.
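
To make the number in each heatmap cell concrete, here is a small sketch that computes a Pearson correlation by hand and compares it with NumPy's built-in. The arrays `x` and `y` are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical measurements, chosen only to illustrate the calculation
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation: covariance divided by the product of standard deviations
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# NumPy's built-in gives the same number (off-diagonal entry of the 2x2 matrix)
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

A value close to +1 means the two variables rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.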

In [None]:
# Correlation matrix will be generated by DataFrame.corr() function
plt.figure(figsize = (8, 8))
sns.heatmap(iris_data.corr(),
            annot=True,
            linewidths=3,
            cmap="terrain",
            square = True);

### Heatmap Inference

* Petal length, petal width and sepal length have the strongest positive correlation with species.
Sepal width has a small negative correlation with species.
This suggests that petal length, petal width and sepal length are sufficient predictors.

* However, the features are also correlated with each other. So we have multicollinearity, that is, multiple features carrying the same information, and our model is effectively being fed the same data more than once.

This means we need to remove it using **PCA**.

# Plotly Plots

In [None]:
import plotly.express as px
fig = px.sunburst(iris_data,
                  path = ['species', 'petal_length'],
                  color ='petal_width',
                 title = 'Petal Dimension Distribution')
fig.show()

## Removing Interdependency by PCA

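PCA reduces the number of dimensions and can also be used to remove the interdependency between features.

* **NOTE** : PCA assumes the data is centered, that is, the mean of each feature is 0. scikit-learn's `PCA` centers the data internally, so the manual centering in the cell below is not strictly required, but it makes this assumption explicit.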
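
The effect of PCA with whitening can be checked directly. The sketch below assumes that scikit-learn's bundled copy of Iris (`load_iris`) is an acceptable stand-in for `IRIS.csv`; the names `X_demo`, `Z_raw` and `Z_centered` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Assumption: scikit-learn's bundled Iris copy stands in for IRIS.csv
X_demo = load_iris().data

# Same settings as in this notebook: keep all 4 components and whiten them
Z_raw = PCA(n_components=4, whiten=True).fit_transform(X_demo)

# Centering by hand first should give the same components, since PCA
# subtracts the per-feature mean internally before the decomposition
X_centered = X_demo - X_demo.mean(axis=0)
Z_centered = PCA(n_components=4, whiten=True).fit_transform(X_centered)

# Covariance of the whitened components: close to the identity matrix,
# i.e. the new features are uncorrelated and have unit variance
cov = np.cov(Z_raw, rowvar=False)
print(np.round(cov, 3))
```

The off-diagonal entries being (numerically) zero is what "removing interdependency" means here: the transformed features no longer carry linearly overlapping information.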

In [None]:
## PCA
from sklearn.decomposition import PCA

pca_cleaner = PCA(4, whiten=True) ## First one is number of components

## Remove the mean of each feature to center the data.
## Reassigning (instead of `X -= ...`) avoids modifying a slice of iris_data in place
X = X - X.mean()

iris_data_X = pca_cleaner.fit_transform(X)
iris_data_X.shape    ### This becomes a numpy array

## Data Splitting(Repeated Stratified K-Fold)

Now we split the data into training and validation sets. This is a small dataset, so let us use **_Repeated Stratified K Fold Cross Validation_** for this.

Let us understand what Repeated Stratified K Fold Cross Validation is:
* **K Fold Cross Validation** - We split the data into $k$ equal parts (folds). One fold is used as the validation set and the remaining $k-1$ folds as training data. This is repeated $k$ times, so that each fold serves as the validation set exactly once, with a fresh model trained each time.

* **Stratified** - The folds are made in such a way that each one has roughly the same class proportions as the whole dataset. Here, each fold must have a roughly equal number of flowers of each species.

* **Repeated** - We will repeat this k fold splitting process a number of times.

This technique lets every observation be used for both training and validation, so we make the most of a small dataset and get a more stable estimate of the score.
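
The three ideas can be seen together in a short sketch. It uses stand-in arrays with the same shape as Iris (150 rows, 50 per class) rather than the real measurements; `X_demo`, `y_demo` and `fold_class_counts` are names introduced here only for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Stand-in data with the same shape as Iris: 150 rows, 3 classes of 50 each
X_demo = np.zeros((150, 4))
y_demo = np.repeat([0, 1, 2], 50)

# 5 folds repeated 10 times (these are also scikit-learn's defaults)
splitter = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=40)

fold_class_counts = []
for train_idx, valid_idx in splitter.split(X_demo, y_demo):
    # Count how many flowers of each class land in the validation fold
    fold_class_counts.append(np.bincount(y_demo[valid_idx], minlength=3))

print(len(fold_class_counts))   # number of train/validation splits
print(fold_class_counts[0])     # class counts in the first validation fold
```

With these settings there is one score per split, which is why the array of scores plotted later in the notebook has many entries rather than just one.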

In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold

rskfSplitter = RepeatedStratifiedKFold(random_state = 40)    ### RSKF object declaration. Use it when training model

## Model Training

We will select multiple models to check accuracy and see what gives us the best results.


<hr>

## Evaluation metric
The evaluation metric will be the cross-validation score, that is, the accuracy averaged over all validation folds.


### 1. Naive Bayes Model
Now we will make a `Naive Bayes` Model for predicting the flower type.

Naive Bayes outputs the probability of a data point belonging to each class. The class with the highest probability is taken to be the class of the flower.

It uses Bayes Theorem to calculate the probabilities for a flower having particular characteristics to belong to a class. When it sees new data, it uses its previous calculations to predict the class of new data.

It assumes that the features of the data are independent from one another.
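
The "probability per class, then pick the highest" behaviour can be seen on a tiny example. This is a sketch on hypothetical one-feature data (`X_toy`, `y_toy`), not on the Iris measurements.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical 1-feature data: class 0 has small values, class 1 large ones
X_toy = np.array([[1.0], [1.2], [0.8], [3.0], [3.2], [2.8]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB().fit(X_toy, y_toy)

# One row of class probabilities per new point; each row sums to 1
X_new = np.array([[1.1], [2.9]])
proba = model.predict_proba(X_new)

# The predicted class is the one with the highest probability
pred_from_proba = model.classes_[np.argmax(proba, axis=1)]
pred = model.predict(X_new)

print(proba)
print(pred)
```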

In [None]:
# import Naive Bayes Model
from sklearn.naive_bayes import GaussianNB

# import cross validation score
from sklearn.model_selection import cross_val_score

naiveBayesModel = GaussianNB()    ### Predicting Model

### 2. Get cross Validation Scores

In [None]:
# get Cross Validation Scores
scores = cross_val_score(naiveBayesModel, iris_data_X, y, cv = rskfSplitter, n_jobs = 4) #### n_jobs - for parallelization

In [None]:
scores

## Get final Score

We also plot the cross validation scores using pyplot of matplotlib.

In [None]:
### Plotting using Matplotlib
### We customise each element like this
plt.figure(figsize = (20, 2))
plt.title("Cross Validation Scores in Naive Bayes")
plt.xlabel("Number of Tests")
plt.ylabel("Model Accuracy")
plt.plot(scores);
plt.show();

# Final Score
print(f"Final Accuracy of Gaussian Naive Bayes Model is:  {scores.mean()}")

## Random Forest Model

We are going to use `Random Forest Model` for prediction next. 

Idea of Random Forest Model :
* Makes many small decision trees which do not classify all that well on their own
* Takes a vote of the predictions of all the decision trees to classify.

This is a very high level description of Random Forest.
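
The voting idea can be inspected on a fitted forest. The sketch below assumes scikit-learn's bundled Iris copy as a stand-in for `IRIS.csv` and uses an arbitrary forest size of 25 trees. Note that scikit-learn combines the trees by averaging their predicted class probabilities (a "soft" vote) rather than counting one hard vote per tree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: scikit-learn's bundled Iris copy stands in for IRIS.csv
X_demo, y_demo = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_
tree_probas = np.array([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average the per-tree class probabilities, then pick the most probable class
mean_proba = tree_probas.mean(axis=0)
vote_pred = forest.classes_[np.argmax(mean_proba, axis=1)]
forest_pred = forest.predict(X_demo)

print(len(forest.estimators_))
print(np.array_equal(forest_pred, vote_pred))
```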

In [None]:
## This code will be very similar to the Naive Bayes model code.

#---------------------------------------------------------------

# import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

rForestModel = RandomForestClassifier()    ### Predicting Model

# get Cross Validation Scores
scoresRF = cross_val_score(rForestModel, iris_data_X, y, cv = rskfSplitter, n_jobs = 4) #### n_jobs - for parallelization

### Plotting using Matplotlib
plt.figure(figsize = (20, 2))
plt.title("Cross Validation Scores in Random Forest")
plt.xlabel("Number of Tests")
plt.ylabel("Model Accuracy")
plt.plot(scoresRF);
plt.show();

# Final Score
print(f"Final Accuracy of Random Forest Model is:  {scoresRF.mean()}")

## Final Conclusion

We achieved a cv-score of **0.952** with the Random Forest Model. Tuning the hyperparameters may improve the score further.
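
One possible way to try hyperparameter tuning is a small grid search. This is only a sketch: the grid values are hypothetical and deliberately small, it uses scikit-learn's bundled Iris copy without the PCA step, and it is not guaranteed to beat the score above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Assumption: scikit-learn's bundled Iris copy stands in for IRIS.csv
X_demo, y_demo = load_iris(return_X_y=True)

# Hypothetical, deliberately small grid; the values are illustrative only
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 3],
}

# Try every combination in the grid and score each with stratified 5-fold CV
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=40),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```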

This is one analysis of the Iris dataset. It can be improved in many ways.
<hr>
<center>!!Any suggestions to improve is sincerely welcome!!</center>