<a href="https://colab.research.google.com/github/rajathAgalkote/OIBSIP_IrisFlowerClassification/blob/main/Iris_Flower_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Iris Flower Classification



##### **Project Type**    - Classification
##### **By**    - Rajathadri A S

# **Project Summary -**

The Iris Flower Species Prediction Project aimed to develop a machine learning model that could accurately classify Iris flowers into their respective species based on their measurement specifications. The dataset consisted of 150 instances of Iris flowers, with four features: sepal length, sepal width, petal length, and petal width, and the target variable was the species of the flower.

The project began with exploratory data analysis, which indicated that all four features are distributed differently across the three species, so the features carry enough signal for ML models to separate the species. Two classification models were developed, and their performance was evaluated using confusion matrices and classification reports.

The results showed that the K-Nearest Neighbors (KNN) algorithm without hyperparameter tuning achieved the best accuracy, precision, recall, and F1-score for all classes, scoring 100% on each metric on the held-out test set (20% of the data, i.e. 30 flowers). The model classified every test flower into its correct species.

In conclusion, the project produced a machine learning model that accurately predicts the species of Iris flowers from their measurements, a task with practical applications in botany, ecology, and horticulture. The KNN algorithm without hyperparameter tuning yielded the best results, and its performance was validated on the held-out test set using a confusion matrix and a classification report.

# **GitHub Link -**

https://github.com/rajathAgalkote/OIBSIP_IrisFlowerClassification

# **Problem Statement**


The goal of this project is to develop a machine learning classifier that accurately predicts the species of the Iris flower based on its measurement specifications. The dataset consists of 150 instances of Iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The target variable is the species of the flower, which can be one of three categories: Setosa, Versicolor, and Virginica.

The project requires developing an efficient classification algorithm that can accurately classify Iris flowers into their respective species. The algorithm should be trained on a subset of the available data and then validated on a separate subset to ensure that it generalizes well to new data. The project's success will be measured based on the accuracy of the classification algorithm's predictions on the validation set.

The solution to this problem will have several practical applications, including botany, ecology, and horticulture, as it will help classify Iris flowers based on their morphological characteristics. The problem is important because accurate classification of Iris flowers is essential for understanding their ecological roles, conserving their biodiversity, and developing new cultivars for horticulture.

## ***1. Exploring the Dataset***

### Importing Libraries

In [None]:
# Import Libraries


# Data Wrangling Libraries
import numpy as np
import pandas as pd

# Graphing/Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# ML libraries (duplicate imports merged, grouped by module)
from scipy import stats
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Miscellaneous libraries
import warnings
from google.colab import drive

In [None]:
# Visualization style 
sns.set_style('whitegrid')
plt.rcParams['font.size'] = 14
plt.rcParams['figure.figsize'] = (7,4)
plt.rcParams['figure.facecolor'] = '#00000000'
plt.rcParams["figure.autolayout"] = True

In [None]:
# Ignoring all warnings

warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
# mount Drive

drive.mount('/content/drive')

In [None]:
raw_data = pd.read_csv("/content/drive/MyDrive/OIBSIP/Iris_Classifier/Iris.csv")

# Work on a copy so that raw_data keeps the original, unmodified values
df = raw_data.copy()

### Dataset First View

In [None]:
# Dataset First Look

df.head()

In [None]:
df.shape

In [None]:
df.columns

### Dataset Information

In [None]:
# Dataset Info

df.info()

The target variable, `Species`, has the object dtype: it stores the species names as strings.

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

No duplicates.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values

plt.figure(figsize=(10,3))
sns.heatmap(df.isnull(), cbar=False)

## ***2. Understanding the Variables***

In [None]:
# Dataset Describe

df.describe().T

### Check Unique Values of species

In [None]:
# Check Unique Values for species

df['Species'].value_counts()

## 3. ***Data Wrangling***

The `Id` column is only a row identifier and carries no information about the species, so we drop it.

In [None]:
# Dropping unwanted features
df=df.drop(columns=['Id'])

df.head()

`Species` has the object dtype, so we convert it to a numerical variable using the following mapping:

Iris-setosa      -->  2

Iris-versicolor  -->  3

Iris-virginica   -->  4

In [None]:
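# Replace the species names with the numeric codes 2, 3 and 4.
# Assigning the result back avoids calling replace(..., inplace=True) on a
# column selection, which newer pandas versions do not reliably apply.

df["Species"] = df["Species"].map({"Iris-setosa": 2, "Iris-versicolor": 3, "Iris-virginica": 4})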

In [None]:
df
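After this step, every chart and confusion matrix shows the codes 2, 3 and 4 rather than species names. The sketch below shows one way to keep the mapping reversible so that codes can be translated back to names when reading results. It uses a tiny hand-made frame; `demo`, `species_to_code` and `code_to_species` are hypothetical names introduced for illustration and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Same mapping as in the notebook, stored once so it can be inverted.
species_to_code = {"Iris-setosa": 2, "Iris-versicolor": 3, "Iris-virginica": 4}
code_to_species = {code: name for name, code in species_to_code.items()}

# Toy frame standing in for df (assumption: the real df has the same column name).
demo = pd.DataFrame({"Species": ["Iris-setosa", "Iris-virginica", "Iris-versicolor"]})

# Encode names to codes, then decode them again to confirm the round trip.
demo["SpeciesCode"] = demo["Species"].map(species_to_code)
demo["Decoded"] = demo["SpeciesCode"].map(code_to_species)

print(demo)
```

Keeping the inverse dictionary around makes it straightforward to relabel plot axes or report rows with species names later on.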

## ***4. EDA AND Visualizations***

#### Chart - 1

In [None]:
species_counts = df['Species'].value_counts()
plt.pie(species_counts, labels=species_counts.index, autopct='%1.1f%%')
plt.title('Species Distribution')
plt.show()

#### Chart - 2

In [None]:
# Create the Distribution plot
sns.displot(x=df['Species'], y= df['SepalLengthCm'])
plt.xlabel('Species')
plt.ylabel('Sepal Length')
plt.show()

#### Chart - 3

In [None]:
# Create the Distribution plot
sns.displot(x=df['Species'], y= df['SepalWidthCm'])
plt.xlabel('Species')
plt.ylabel('Sepal Width')
plt.show()

#### Chart - 4

In [None]:
# Create the distribution plot
sns.displot(x=df['Species'], y= df['PetalLengthCm'])
plt.xlabel('Species')
plt.ylabel('Petal Length')
plt.show()

#### Chart - 5

In [None]:
# Create the distribution plot
sns.displot(x=df['Species'], y= df['PetalWidthCm'])
plt.xlabel('Species')
plt.ylabel('Petal Width')
plt.show()

#### Chart - 6 - Correlation Heatmap

In [None]:
# Checking for multi-collinearity
correlation = df.corr()

plt.figure(figsize=[15, 5])
sns.heatmap(correlation, annot=True, annot_kws={'fontsize': 10})
plt.show()

## ***5. Feature Engineering & Data Pre-processing***

### Data Scaling

In [None]:
# Scaling the data

# Defining X and y

X = df.drop(['Species'], axis = 1)
y = df['Species']

In [None]:
# Scaling values of X

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

### Data Splitting

In [None]:
# Splitting the dataset into train (80%) and test (20%) sets

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.20, random_state = 42)
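The cells above fit `MinMaxScaler` on all 150 rows before splitting, so the minimum and maximum of the test rows influence the scaling applied to the training rows. On Iris the effect is likely small, but a common alternative is to split first and let a pipeline fit the scaler on the training rows only. The following is a minimal sketch of that alternative, not the notebook's method. It assumes scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for `Iris.csv`, and the names `Xd_train`, `leak_free_model` and so on are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Assumption: the bundled Iris data stands in for the CSV stored on Drive.
X_demo, y_demo = load_iris(return_X_y=True)

# Split first; stratify keeps the three species balanced in both subsets.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# The pipeline fits MinMaxScaler on the training rows only and reuses the
# learned min/max values when it transforms the test rows.
leak_free_model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
leak_free_model.fit(Xd_train, yd_train)

test_accuracy = leak_free_model.score(Xd_test, yd_test)
print("Test rows per class:", np.bincount(yd_test))
print(f"Test accuracy: {test_accuracy:.3f}")
```

Because the scaler lives inside the pipeline, the same object can also be passed to `GridSearchCV` or `cross_val_score`, and the scaling is then refitted within each fold.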

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
y_train.shape

In [None]:
y_test.shape

## ***6. ML Model Implementation***

### ML Model - 1
### Logistic Regression

In [None]:
# Logistic Regression Model

lr = LogisticRegression()
lr.fit(X_train, y_train)

In [None]:
# Prediction

y_pred_test = lr.predict(X_test)
y_pred_train = lr.predict(X_train)

In [None]:
# Evaluation Metrics for Train data

# Classification Report
print('Classification report for Logistic Regression (Train set) = \n')
# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_train, y_pred_train))

In [None]:
# Evaluation Metrics for test data

# Classification Report
print('Classification report for Logistic Regression (Test set)= \n')
# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_test, y_pred_test))
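scikit-learn's metric functions take the true labels as the first argument and the predictions as the second. Accuracy is unaffected by the order, but per-class precision and recall trade places when the arguments are swapped. The toy example below illustrates this with hand-made labels (`y_true_toy` and `y_pred_toy` are hypothetical and unrelated to the notebook's data).

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels using the notebook's codes: one true class-2 flower is predicted as 3.
y_true_toy = [2, 2, 2, 3, 3, 4]
y_pred_toy = [2, 2, 3, 3, 3, 4]

# Correct order: (y_true, y_pred). Scores are reported for class 3 only.
precision_ok = precision_score(y_true_toy, y_pred_toy, labels=[3], average=None)[0]
recall_ok = recall_score(y_true_toy, y_pred_toy, labels=[3], average=None)[0]

# Swapped order: the two metrics exchange their values.
precision_swapped = precision_score(y_pred_toy, y_true_toy, labels=[3], average=None)[0]
recall_swapped = recall_score(y_pred_toy, y_true_toy, labels=[3], average=None)[0]

print(f"correct order: precision={precision_ok:.3f}, recall={recall_ok:.3f}")
print(f"swapped order: precision={precision_swapped:.3f}, recall={recall_swapped:.3f}")
```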

In [None]:
# Use GridSearchCV to find the best result

lr = LogisticRegression()

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(lr, param_grid, cv=5)
grid.fit(X_train, y_train)

In [None]:
# predict on test data

y_pred = grid.predict(X_test)

In [None]:
# print classification report for test data

print("Classification report for test data: \n")
print(classification_report(y_test, y_pred))

In [None]:
# Generating the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)

print(cf_matrix)
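A printed confusion matrix is a bare array, so it is easy to forget that rows are the true classes and columns the predicted ones. One option is to wrap it in a labelled DataFrame. The sketch below uses small hand-made label lists rather than the notebook's predictions; `code_names`, `y_true_cm` and `y_pred_cm` are hypothetical names.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Species codes used in the notebook and readable names for them.
code_names = {2: "setosa", 3: "versicolor", 4: "virginica"}
codes = list(code_names)

# Toy labels: one versicolor flower is misclassified as virginica.
y_true_cm = [2, 2, 3, 3, 3, 4, 4]
y_pred_cm = [2, 2, 3, 4, 3, 4, 4]

# Passing labels= fixes the row/column order regardless of which codes occur.
cm = confusion_matrix(y_true_cm, y_pred_cm, labels=codes)

cm_table = pd.DataFrame(
    cm,
    index=[f"true {code_names[c]}" for c in codes],
    columns=[f"pred {code_names[c]}" for c in codes],
)
print(cm_table)
```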

**Logistic Regression Model Report:**

Logistic Regression Classifier Model had the following metrics:

1. Train Set Accuracy score : 93%
2. Test Set Accuracy score : 97%
3. Test Set Accuracy score after Hyperparameter tuning : 100%

Hence, our LR model is performing well on the Iris classification dataset.
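With a 30-sample test set, a single misclassified flower changes accuracy by about 3.3 percentage points, so the gap between 97% and 100% corresponds to one flower. The notebook imports `cross_val_score` but never calls it. The sketch below shows how it could be used to get a per-fold view of logistic regression accuracy. It is an illustration under assumptions, not part of the original analysis: it loads scikit-learn's bundled Iris data instead of `Iris.csv`, and `cv_model`, `folds` and `cv_scores` are hypothetical names.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Assumption: the bundled Iris data stands in for the CSV stored on Drive.
X_cv, y_cv = load_iris(return_X_y=True)

# Scaling sits inside the pipeline, so it is refitted on each training fold.
cv_model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Five stratified folds: every flower is used for testing exactly once.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(cv_model, X_cv, y_cv, cv=folds, scoring="accuracy")

print("Fold accuracies:", cv_scores.round(3))
print(f"Mean: {cv_scores.mean():.3f}, std: {cv_scores.std():.3f}")
```

The spread across folds gives a sense of how much a single train/test split can move the reported accuracy.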

### ML Model - 2
### K-Nearest Neighbors (KNN)

In [None]:
# K Nearest Neighbors Model

# Number of neighbors that vote on each prediction
k = 7

clf = KNeighborsClassifier(n_neighbors=k)

clf.fit(X_train, y_train)

In [None]:
# Making the predictions

y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)

In [None]:
# Evaluation Metrics for Train data

# Classification Report
print('Classification report for KNN Classifier (Train set) = \n')
# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_train, y_pred_train))

In [None]:
# Evaluation Metrics for test data

# Classification Report
print('Classification report for KNN Classifier (Test set)= \n')
# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_test, y_pred_test))

In [None]:
# Generate the confusion matrix

cf_matrix = confusion_matrix(y_test, y_pred_test)

print(cf_matrix)

**KNN Classifier Model Report:**

K-Nearest Neighbors Classifier Model had the following metrics:

1. Train Set Accuracy score : 97%
2. Test Set Accuracy score : 100%
3. Test Set Accuracy score after Hyperparameter tuning : not performed

We did not tune the KNN hyperparameters, because the classifier with k = 7 already predicted every sample in the test set correctly.
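The value k = 7 was chosen by hand. For reference, the sketch below shows how k could instead be selected by cross-validation on the training set, mirroring the `GridSearchCV` step used for logistic regression. It is a hedged illustration, not a result from this project: it assumes scikit-learn's bundled Iris data in place of `Iris.csv`, the candidate list of k values is arbitrary, and names such as `knn_pipeline` and `knn_grid` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Assumption: the bundled Iris data stands in for the CSV stored on Drive.
X_knn, y_knn = load_iris(return_X_y=True)
Xk_train, Xk_test, yk_train, yk_test = train_test_split(
    X_knn, y_knn, test_size=0.20, random_state=42, stratify=y_knn
)

# Scaling and KNN in one estimator, so each CV fold is scaled independently.
knn_pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])

# Arbitrary odd candidates for k; odd values reduce the chance of tied votes.
candidate_k = [1, 3, 5, 7, 9, 11, 13, 15]
knn_grid = GridSearchCV(knn_pipeline, {"knn__n_neighbors": candidate_k}, cv=5)
knn_grid.fit(Xk_train, yk_train)

best_k = knn_grid.best_params_["knn__n_neighbors"]
knn_test_accuracy = knn_grid.score(Xk_test, yk_test)
print(f"Best k by 5-fold CV: {best_k}")
print(f"Mean CV accuracy: {knn_grid.best_score_:.3f}")
print(f"Test accuracy: {knn_test_accuracy:.3f}")
```

Selecting k on the training folds keeps the test set untouched until the final evaluation.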

# **Conclusion**

For our Iris flower species prediction project, the goal was to build a predictive model that identifies the species of an Iris flower from its measurements.

Our EDA indicates that all four features are distributed differently across the species, which helps our ML models tell the species apart.

We then developed two classifier models and evaluated them mainly with confusion matrices and classification reports. From these results, we conclude that the KNN algorithm without hyperparameter tuning yielded the best results in predicting the species of Iris flowers.

The KNN classifier achieved 100% accuracy, precision, recall, and F1-score for all three classes, which indicates that the model performs very well on the test set and handles every class equally well.

To conclude, our model displayed optimal performance on the test set and is able to classify Iris flowers accurately into their respective species.

### ***KNN Classifier Model displayed best Classification Accuracy.***