Project Name: Iris Species

Date: 15/10/2025

Tools Used: Google Colab | Python | Pandas | NumPy | Scikit-learn | Matplotlib | Seaborn

Problem Statement:

The Iris dataset is a well-known dataset in pattern recognition and classification.

It was introduced by R.A. Fisher in 1936 in his paper “The Use of Multiple Measurements in Taxonomic Problems.”

The objective of this project is to classify iris flowers into one of three species (Setosa, Versicolor, and Virginica) based on four measured features:

	•	Sepal Length (cm)
	•	Sepal Width (cm)
	•	Petal Length (cm)
	•	Petal Width (cm)


Project Objectives:

	•	To understand and explore the Iris dataset using statistical and visual methods.
	•	To perform data preprocessing and exploratory data analysis (EDA).
	•	To develop and train a machine learning classification model.
	•	To evaluate model performance using suitable metrics such as accuracy and confusion matrix.
	•	To identify the most important features influencing the classification.


Dataset Source: https://www.kaggle.com/datasets/uciml/iris

Dataset Description:
The dataset includes 150 samples of iris flowers divided equally among three species:

	1.	Iris Setosa

	2.	Iris Versicolor
  
	3.	Iris Virginica


Each species contributes 50 samples, and every sample records the four measurements listed above. One species (Iris Setosa) is linearly separable from the other two, but the other two are not linearly separable from each other.
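The separability claim can be checked with a single feature. The sketch below is illustrative only: it assumes scikit-learn's bundled copy of the dataset (`load_iris`) as a stand-in for the Kaggle CSV, so the column names differ from the ones used in this notebook.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's built-in Iris copy stands in for the Kaggle CSV.
iris = load_iris(as_frame=True)
df = iris.frame  # four measurement columns plus an integer 'target' column

petal_length = df["petal length (cm)"]
is_setosa = df["target"] == 0  # in this copy, 0 encodes setosa

# If the longest setosa petal is shorter than the shortest petal of the
# other two species, one threshold on this feature separates setosa.
setosa_max = petal_length[is_setosa].max()
others_min = petal_length[~is_setosa].min()
separable = setosa_max < others_min

print(f"Longest setosa petal: {setosa_max} cm")
print(f"Shortest non-setosa petal: {others_min} cm")
print("Separable on petal length alone:", separable)
```

No comparable single threshold exists between Versicolor and Virginica, which is why those two classes account for the errors a classifier makes on this dataset.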

The columns in this dataset are:

Id

SepalLengthCm

SepalWidthCm

PetalLengthCm

PetalWidthCm

Species


#### Data Preprocessing and Exploratory Data Analysis

In [None]:
# Importing all the required modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [None]:
data = pd.read_csv("/content/drive/MyDrive/Me/My project/Iris Species/Iris.csv")

In [None]:
# Checking the top 5 data
data.head()

The dataset has six columns: Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, and Species.

The Id column is only a row identifier and carries no predictive information, so we will drop it. The remaining columns are needed for training the model.

In [None]:
data.shape

In [None]:
data.drop('Id', axis=1, inplace=True)

In [None]:
# Checking for null values
data.isnull().sum()

There are no null values, so no imputation is needed. Next, we check for duplicated rows in the dataset.

In [None]:
# Checking for duplicated values
data.duplicated().sum()

In [None]:
# Dropping the duplicated values
data = data.drop_duplicates()

In [None]:
# Checking for duplicated values
data.duplicated().sum()

We have successfully dropped the duplicated values from our dataset.
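The order of these steps matters: while the unique Id column is present, no two rows can be identical, so duplicates only become visible after it is dropped. A minimal sketch on a hypothetical three-row frame (the values are made up for illustration and are not taken from Iris.csv):

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 2 differ only in their Id.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "PetalLengthCm": [1.4, 1.4, 4.7],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor"],
})

# With Id present, every row is unique, so nothing is flagged.
dupes_with_id = int(toy.duplicated().sum())

# Without Id, the second row repeats the first.
without_id = toy.drop(columns="Id")
dupes_without_id = int(without_id.duplicated().sum())

# drop_duplicates keeps the first occurrence of each repeated row by default.
deduped = without_id.drop_duplicates()

print(dupes_with_id, dupes_without_id, len(deduped))
```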

In [None]:
# Display statistical summary of the dataset (mean, std, min, max, quartiles).
data.describe()

In [None]:
data.head()

In [None]:
# Checking for value counts in Species column
data['Species'].value_counts()

In [None]:
list(data['Species'].unique())

The ‘Species’ column contains three categorical values: [‘Iris-setosa’, ‘Iris-versicolor’, ‘Iris-virginica’].

Since most machine learning algorithms work with numerical data, we will convert these text labels into numeric codes for model training:

	•	Iris-setosa → 0
	•	Iris-versicolor → 1
	•	Iris-virginica → 2

In [None]:
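def convert_text_to_numerical(label):
  # Map a species name to its integer code; any other label falls through to 2
  if label == "Iris-setosa":
    return 0
  elif label == "Iris-versicolor":
    return 1
  else:
    return 2

# Replace the text labels in the Species column with their numeric codes
data['Species'] = data['Species'].apply(convert_text_to_numerical)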
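The if/elif/else chain above sends any label it does not recognize to 2, so a misspelled species name would be silently counted as Iris-virginica. One possible alternative, shown here as a sketch, is a dictionary lookup that fails loudly instead. The names `SPECIES_CODES` and `encode_species` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical names; the codes match the mapping described in the text.
SPECIES_CODES = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

def encode_species(labels: pd.Series) -> pd.Series:
    """Map species names to integer codes, rejecting unknown labels."""
    codes = labels.map(SPECIES_CODES)  # labels missing from the dict become NaN
    unknown = labels[codes.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unexpected species labels: {list(unknown)}")
    return codes.astype(int)

sample = pd.Series(["Iris-setosa", "Iris-virginica", "Iris-versicolor"])
encoded = encode_species(sample)
print(encoded.tolist())
```

In the notebook this would be applied as `data['Species'] = encode_species(data['Species'])`.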

In [None]:
data.head()

In [None]:
# Checking for value counts in Species column
data['Species'].value_counts()

In [None]:
data.info()

In [None]:
sns.pairplot(data, hue='Species')

In [None]:
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')

We can observe that PetalWidthCm and PetalLengthCm have the strongest correlation with the target column (Species), while SepalLengthCm has a moderate correlation with it.

Based on this, we can derive some new features:

PetalAreaCm, the product of PetalLengthCm and PetalWidthCm.

SepalArea, the product of SepalLengthCm and SepalWidthCm.

PetalSepalRatio, the ratio of PetalLengthCm to SepalLengthCm.

This could improve model performance, especially for the overlapping Versicolor and Virginica classes.

In [None]:
#Creating new features
data['PetalAreaCm'] = data['PetalLengthCm'] * data['PetalWidthCm']
data['SepalArea'] = data['SepalLengthCm'] * data['SepalWidthCm']
data['PetalSepalRatio'] = data['PetalLengthCm'] / data['SepalLengthCm']

In [None]:
data.head()

In [None]:
features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'PetalAreaCm', 'SepalArea', 'PetalSepalRatio']
for feature in features:
    Q1 = data[feature].quantile(0.25)
    Q3 = data[feature].quantile(0.75)
    IQR = Q3 - Q1
    outliers = data[(data[feature] < Q1 - 1.5*IQR) | (data[feature] > Q3 + 1.5*IQR)]
    print(f"{feature} - Number of outliers: {outliers.shape[0]}")

There are few outliers, so we leave them in place. We can now move on to splitting the data and training the models.

In [None]:
X = data.drop('Species', axis=1)
y = data['Species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Feature shapes (full, train, test):", X.shape, X_train.shape, X_test.shape)
print("Target shapes (full, train, test):", y.shape, y_train.shape, y_test.shape)

We have now split the dataset into training and testing sets.
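A plain random split does not guarantee equal class shares in the test set; the classification reports further down show test supports of 11, 10 and 9. Passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. The sketch below assumes scikit-learn's bundled Iris copy (150 rows, no duplicates removed) rather than the notebook's `data` frame, so its counts will not match the notebook's exactly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the built-in dataset stands in for the notebook's X and y.
X_demo, y_demo = load_iris(return_X_y=True)

# Plain random split, as in the cell above.
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Stratified split: each class keeps its share of the test set.
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

plain_counts = np.bincount(y_test_plain, minlength=3)
strat_counts = np.bincount(y_test_strat, minlength=3)
print("Plain split test counts per class:     ", plain_counts)
print("Stratified split test counts per class:", strat_counts)
```

In the notebook, the equivalent change would be adding `stratify=y` to the existing call.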

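Next, we standardize the features to zero mean and unit variance. The scaler is fitted on the training set only and then applied to the test set, so no information from the test data leaks into training. Scaling matters mainly for the SVM, which is sensitive to feature magnitudes; the Random Forest is largely unaffected by it.

We will train two models, a support vector machine (SVC) and a RandomForestClassifier, and use the test set to evaluate their performance.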
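A single 30-sample test set gives a fairly noisy estimate. As a complementary check, the scaler and the SVM could be wrapped in a pipeline and scored with cross-validation, so that the scaler is refitted inside every fold. This is a sketch, not the notebook's method: it assumes scikit-learn's bundled Iris copy with the four original features only, and `svm_pipeline` is a hypothetical name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: built-in dataset, original four features, no engineered columns.
X_demo, y_demo = load_iris(return_X_y=True)

# The pipeline fits the scaler on each training fold only, then applies it
# to the matching validation fold before the SVM predicts.
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42))

# For classifiers, an integer cv uses stratified folds.
cv_scores = cross_val_score(svm_pipeline, X_demo, y_demo, cv=5)
print("Fold accuracies:", cv_scores)
print("Mean accuracy:", cv_scores.mean())
```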

In [None]:
# Creating the scaler
scaler = StandardScaler()

# Fitting the scaler on the training data, then transforming both sets
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Loading the models
svm_model = SVC(kernel='rbf', random_state=42)
random_forest_classifier_model = RandomForestClassifier(n_estimators=100, random_state=42)

In [None]:
# Training the models
svm_model.fit(X_train, y_train)
random_forest_classifier_model.fit(X_train, y_train)

In [None]:
# Predicting the results on the unseen data

svm_predict = svm_model.predict(X_test)
rf_predict = random_forest_classifier_model.predict(X_test)

In [None]:
# Evaluating the model

print("SVM Model Accuracy:", accuracy_score(y_test, svm_predict))
print("Random Forest Accuracy:", accuracy_score(y_test, rf_predict))

print("\nSVM Classification Report:\n", classification_report(y_test, svm_predict))
print("\nRandom Forest Classification Report:\n", classification_report(y_test, rf_predict))

SVM Model Accuracy: 0.9666666666666667
Random Forest Accuracy: 0.9333333333333333

SVM Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.90      0.95        10
           2       0.90      1.00      0.95         9

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30


Random Forest Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       0.90      0.90      0.90        10
           2       0.89      0.89      0.89         9

    accuracy                           0.93        30
   macro avg       0.93      0.93      0.93        30
weighted avg       0.93      0.93      0.93        30



The SVM model achieved an accuracy of 96.67% (29 of 30 test samples correct), while the Random Forest model achieved 93.33% (28 of 30) on the held-out test set.
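Accuracy alone does not show which species get confused, which is what a confusion matrix adds. In the notebook it could be obtained with `confusion_matrix(y_test, svm_predict)`. The sketch below instead uses label lists reconstructed from the precision, recall and support figures in the SVM report above; they are an illustration consistent with that report, not the notebook's actual predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Reconstructed from the SVM report: supports of 11, 10 and 9, with one
# class-1 (versicolor) sample predicted as class 2 (virginica).
y_true = [0] * 11 + [1] * 10 + [2] * 9
y_pred_svm = [0] * 11 + [1] * 9 + [2] * 1 + [2] * 9

# Rows are true classes, columns are predicted classes.
cm_svm = confusion_matrix(y_true, y_pred_svm)
acc_svm = accuracy_score(y_true, y_pred_svm)

print(cm_svm)
print("Accuracy:", acc_svm)
```

The single off-diagonal entry sits in the Versicolor row and Virginica column, the one class pair that is not linearly separable.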

This indicates that the dataset is relatively easy to classify: Setosa is cleanly separated from the other two species, and the three classes are nearly balanced.

Although SVM and Random Forest rely on different learning mechanisms, the SVM model slightly outperformed the Random Forest here. With only 30 test samples the gap amounts to a single misclassified flower, so it suggests, rather than establishes, that the SVM captured the class boundaries better on this dataset.
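The remaining project objective, identifying the most influential features, could be approached through the impurity-based importances that a fitted random forest exposes as `feature_importances_`. The sketch below is a stand-in under stated assumptions: it trains on scikit-learn's bundled Iris copy with the four original features, so the ranking it prints is not the notebook's result.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: built-in dataset, original four features, no engineered columns.
iris = load_iris(as_frame=True)
X_demo, y_demo = iris.data, iris.target

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Importances are normalized to sum to 1; higher means the feature
# contributed more to impurity reduction across the trees.
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

In the notebook, the same idea would use the already fitted `random_forest_classifier_model` with `X.columns` as the index, because `X_train` becomes a plain NumPy array after scaling and no longer carries column names.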