# The iris flower has three species (setosa, versicolor, and virginica) that differ in their measurements. Given the measurements of iris flowers labeled by species, our task is to train a machine learning model that learns from these measurements and classifies flowers by species.

Code by VIKRAM RAWAT

# 1. Import necessary libraries

In [1]:
import numpy as np
import pandas  as pd

In [2]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# 2. Load the Iris dataset

In [3]:
iris = pd.read_csv("Iris Flower - Iris.csv") 
#Here we use the pandas library to read the Iris Flower CSV file into a DataFrame

In [4]:
iris

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


# 3. Data Preprocessing

In [5]:
iris.isnull().sum() #count the null values in each column

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

In [6]:
iris.drop_duplicates() #returns a copy without duplicate rows; iris itself is unchanged unless the result is assigned back

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
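
`DataFrame.drop_duplicates()` returns a new DataFrame by default, so the call above displays a de-duplicated copy but does not change `iris`. A minimal sketch on a small made-up frame (the `toy` data is hypothetical, not taken from the Iris file) showing the difference between discarding and keeping the result:

```python
import pandas as pd

# Hypothetical toy frame with one exact duplicate row (rows 0 and 1).
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, 5.1, 4.9],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

toy.drop_duplicates()            # result is discarded; toy still has 3 rows
rows_before = len(toy)

deduped = toy.drop_duplicates()  # keep the returned copy instead
rows_after = len(deduped)

print(rows_before, rows_after)   # prints "3 2"
```

To make the change stick in the notebook, the result would have to be assigned back, for example `iris = iris.drop_duplicates()`.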


In [7]:
iris.keys() #returns the column names as an Index

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [8]:
a = list(iris.Species.unique()) #list of the unique values in the Species column
a

['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

In [9]:
label_encoder = LabelEncoder()

iris['Species_Encoded'] = label_encoder.fit_transform(iris['Species']) #encode the categorical Species column as integers

In [10]:
iris

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,Species_Encoded
0,1,5.1,3.5,1.4,0.2,Iris-setosa,0
1,2,4.9,3.0,1.4,0.2,Iris-setosa,0
2,3,4.7,3.2,1.3,0.2,Iris-setosa,0
3,4,4.6,3.1,1.5,0.2,Iris-setosa,0
4,5,5.0,3.6,1.4,0.2,Iris-setosa,0
...,...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica,2
146,147,6.3,2.5,5.0,1.9,Iris-virginica,2
147,148,6.5,3.0,5.2,2.0,Iris-virginica,2
148,149,6.2,3.4,5.4,2.3,Iris-virginica,2
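
`LabelEncoder` assigns integer codes in the sorted order of the labels, which is why setosa becomes 0 and virginica becomes 2 in the table above. The sketch below is only an illustration: the short `species` list is a stand-in for the `Species` column, and it shows how `classes_` and `inverse_transform` recover the names from the codes.

```python
from sklearn.preprocessing import LabelEncoder

# Stand-in for iris['Species']: the three labels from the notebook, unordered.
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ holds the labels in sorted order; a label's code is its index there.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes back into species names.
decoded = [str(name) for name in encoder.inverse_transform([0, 2])]
print(decoded)
```

Keeping the fitted encoder around is useful later, since model predictions come back as codes and usually need to be reported as species names.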


# 4. Feature Selection

Extract the target column and the data columns from the Iris DataFrame and convert them to NumPy arrays.

For Data we take 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm' columns

In [11]:
Data = iris[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].to_numpy()
Data #the four feature columns as a 2-D NumPy array

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [12]:
Data.shape

(150, 4)

For Target we take 'Species_Encoded' column

In [13]:
Target = iris['Species_Encoded'].to_numpy()
Target #the target column as a 1-D NumPy array

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [14]:
Target.shape

(150,)

# 5. Train-Test-Split

Now divide the data rows and target rows into two parts in a 7:3 ratio: 70% of the rows for training and 30% for testing.

In [15]:
Data_train , Data_test , Target_train , Target_test = train_test_split(Data , Target , test_size = 0.3 , random_state = 1)

In [16]:
Data_train.shape

(105, 4)

In [17]:
Target_train.shape

(105,)

In [18]:
Data_test.shape

(45, 4)

In [19]:
Target_test.shape

(45,)
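
With 150 rows, `test_size=0.3` gives 45 test rows and 105 training rows, matching the shapes above. The sketch below repeats that arithmetic under two assumptions: scikit-learn's bundled copy of the dataset (`load_iris`) stands in for the CSV file, and the optional `stratify` argument, which the notebook does not use, is added to keep the three species equally represented in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for the CSV file.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)

# With stratify=y each species keeps its one-third share in both parts.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```

Without `stratify`, the split is purely random, so the test set can end up with unequal numbers of each species.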

# 6. Apply Classification Algorithms

# A. Decision Tree Classification

In [20]:
DTC = DecisionTreeClassifier(random_state = 0) 
DTC

`random_state=0` fixes the seed for the random choices made inside the algorithm. For a decision tree these are the order in which features are examined at each split and how ties between equally good splits are broken. With a fixed seed, the same tree is built every time you run the code, which is crucial for reproducibility: others (or you) can obtain exactly the same results when rerunning the model. The choice of 0 is arbitrary; you could set it to any integer (e.g., `random_state=42` is also common). The key is that using the same value produces the same results each time.
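
A minimal sketch of that reproducibility, assuming scikit-learn's bundled Iris data (`load_iris`) in place of the CSV file: two trees created with the same seed and fitted on the same data end up with the same split features and thresholds.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Iris data stands in for the CSV file.
X, y = load_iris(return_X_y=True)

# Two trees built with the same seed on the same data.
tree_a = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_b = DecisionTreeClassifier(random_state=0).fit(X, y)

# Compare the fitted structures node by node: split feature and threshold.
same_structure = (
    np.array_equal(tree_a.tree_.feature, tree_b.tree_.feature)
    and np.array_equal(tree_a.tree_.threshold, tree_b.tree_.threshold)
)
print(same_structure)  # prints "True"
```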

In [21]:
DTC.fit(Data_train , Target_train)

In [22]:
DTC_prediction = DTC.predict(Data_test)
DTC_prediction

array([0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1,
       2, 0, 2, 1, 0, 0, 1, 2, 1, 2, 1, 2, 2, 0, 1, 0, 1, 2, 2, 0, 1, 2,
       1])

In [23]:
DTC_prediction.shape

(45,)

In [24]:
a1 = accuracy_score(Target_test , DTC_prediction) #argument order is (true labels, predictions)
print(f"Decision Tree Classifier Accuracy: {a1 * 100:.2f}%")

Decision Tree Classifier Accuracy: 95.56%
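
Accuracy is simply the fraction of predictions that match the true labels. The sketch below recomputes the 95.56% figure by hand from arrays printed in this notebook. One assumption is made: because `Target_test` is never printed, the true labels are taken to be the SVM predictions from cell In [27], which the notebook reports as 100% accurate.

```python
import numpy as np

# Decision tree predictions, copied from the output of cell In [22].
dtc_pred = np.array([
    0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1,
    2, 0, 2, 1, 0, 0, 1, 2, 1, 2, 1, 2, 2, 0, 1, 0, 1, 2, 2, 0, 1, 2,
    1,
])

# Assumed true labels: the SVM predictions from cell In [27], which the
# notebook reports as 100% accurate and which therefore match Target_test.
y_true = np.array([
    0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1,
    1, 0, 2, 1, 0, 0, 1, 2, 1, 2, 1, 2, 2, 0, 1, 0, 1, 2, 2, 0, 2, 2,
    1,
])

n_correct = int(np.sum(dtc_pred == y_true))  # positions where both agree
accuracy = n_correct / len(y_true)           # fraction of correct predictions
print(f"{n_correct}/{len(y_true)} correct = {accuracy * 100:.2f}%")
# prints "43/45 correct = 95.56%"
```

On a 45-row test set each misclassified flower costs about 2.2 percentage points, so 95.56% corresponds to exactly two errors.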


# B. SVM Classification

In [25]:
svm_class = svm.SVC(kernel = "linear")
svm_class

In [26]:
svm_class.fit(Data_train , Target_train)

In [27]:
svm_class_prediction = svm_class.predict(Data_test)
svm_class_prediction

array([0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1,
       1, 0, 2, 1, 0, 0, 1, 2, 1, 2, 1, 2, 2, 0, 1, 0, 1, 2, 2, 0, 2, 2,
       1])

In [28]:
a2 = accuracy_score(Target_test , svm_class_prediction) #argument order is (true labels, predictions)
print(f"SVM Accuracy: {a2 * 100:.2f}%")

SVM Accuracy: 100.00%


# C. Gaussian NB Classification

In [29]:
g = GaussianNB()
g

In [30]:
g.fit(Data_train , Target_train)

In [31]:
g_prediction = g.predict(Data_test)
g_prediction

array([0, 1, 1, 0, 2, 2, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1,
       2, 0, 2, 1, 0, 0, 1, 2, 1, 2, 1, 2, 2, 0, 1, 0, 1, 2, 2, 0, 1, 2,
       1])

In [32]:
a3 = accuracy_score(Target_test , g_prediction) #argument order is (true labels, predictions)
print(f"GaussianNB Accuracy: {a3 * 100:.2f}%")

GaussianNB Accuracy: 93.33%
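
Gaussian Naive Bayes models each feature within each class as an independent normal distribution and predicts the class with the highest posterior probability. The sketch below, which assumes scikit-learn's bundled Iris data (`load_iris`) in place of the CSV file, inspects the per-class feature means the model learns and the class probabilities behind its predictions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Assumption: scikit-learn's bundled Iris data stands in for the CSV file.
X, y = load_iris(return_X_y=True)
nb = GaussianNB().fit(X, y)

# theta_ holds one row of per-feature means for each class: shape (3, 4).
print(nb.theta_.round(2))

# predict_proba returns the class posteriors; predict picks the largest one.
proba = nb.predict_proba(X[:5])
pred = nb.predict(X[:5])
print(proba.round(3))
print(pred)
```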


# D. Random Forest Classification

In [33]:
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(Data_train, Target_train)

In [34]:
RFC_pred = model.predict(Data_test)
RFC_pred

array([0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1,
       2, 0, 2, 1, 0, 0, 1, 2, 1, 2, 1, 2, 2, 0, 1, 0, 1, 2, 2, 0, 1, 2,
       1])

In [35]:
a4 = accuracy_score(Target_test , RFC_pred) #argument order is (true labels, predictions)
print(f"Random Forest Classifier Accuracy: {a4 * 100:.2f}%")

Random Forest Classifier Accuracy: 95.56%
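
A fitted random forest also reports how much each feature contributed to its splits. The sketch below assumes scikit-learn's bundled Iris data (`load_iris`) in place of the CSV file, so the feature names differ slightly from the CSV column names; it is an illustration, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: scikit-learn's bundled Iris data stands in for the CSV file.
iris_data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris_data.data, iris_data.target)

# feature_importances_ is normalised to sum to 1 across the four features.
importances = dict(zip(iris_data.feature_names, forest.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

On this dataset the petal measurements typically receive most of the importance, which fits the large petal-size gap between setosa and the other two species visible in the table at the top.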


# Important Points To Remember

# Random state

Setting `random_state` to a fixed integer makes the random steps in the code give the same result on every run. In this notebook that covers shuffling the rows in `train_test_split` (`random_state=1`), breaking ties in the decision tree (`random_state=0`), and building the trees of the random forest (`random_state=42`). This is crucial for reproducibility, so that others (or you) can obtain exactly the same results when rerunning the notebook.
The choice of value is arbitrary; any integer works (`random_state=42` is a common convention). The key is that using the same value produces the same results each time.

# Accuracy Scores of different classifiers

1. Decision Tree (95.56%): The decision tree classifies 43 of the 45 test flowers correctly. Both errors are confusions between versicolor and virginica, the two species whose measurements overlap.

2. SVM (100%): The linear SVM classifies all 45 test flowers correctly. This suggests the classes are close to linearly separable on this split; with a test set this small, it does not guarantee perfect accuracy on new data.

3. GaussianNB (93.33%): Naive Bayes classifies 42 of 45 correctly, again confusing only versicolor and virginica. It assumes the features are independent within each class, which does not hold well here (petal length and petal width are strongly correlated), so it can lose a little accuracy.

4. Random Forest (95.56%): The random forest makes exactly the same predictions as the single decision tree on this test set, so ensembling brings no gain on this split.
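
Each percentage above comes from a single 45-row test set, so one misclassified flower shifts the score by about 2.2 points. A hedged way to check how stable the comparison is, not part of the original notebook, is 5-fold cross-validation. The sketch assumes scikit-learn's bundled Iris data (`load_iris`) in place of the CSV file and reuses the notebook's model settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Iris data stands in for the CSV file.
X, y = load_iris(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM (linear)": SVC(kernel="linear"),
    "GaussianNB": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_results = {}
for name, model in models.items():
    # Five accuracy scores per model, one for each held-out fold.
    scores = cross_val_score(model, X, y, cv=5)
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean() * 100:.2f}% +/- {scores.std() * 100:.2f}")
```

If the mean scores lie within one standard deviation of each other, the ranking from a single split should be read as suggestive and not as conclusive.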

# Methods Ranked by Typical Performance on Iris Flower Classification

1. Support Vector Machines (SVM): Effective when the classes can be separated by a clear margin, and also in high-dimensional spaces.

2. Random Forest: Often achieves high accuracy due to ensemble learning. Averaging many trees reduces the overfitting of any single tree.

3. Decision Trees: Easy to interpret, but prone to overfitting without pruning or a depth limit.

4. Gaussian Naive Bayes: Fast, simple, and designed for continuous features such as these measurements, but its assumption that features are independent within a class can cost accuracy when the features are correlated.

# Conclusions

The goal of this project is to classify iris flowers into three species—Iris-setosa, Iris-versicolor, and Iris-virginica—based on their sepal and petal measurements using various machine learning algorithms. The Iris dataset is used for this purpose, containing features like SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm.

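**SVM** achieved perfect accuracy (100%) on the 45-row test set, making it the best model on this split, followed closely by **Decision Tree** and **Random Forest** (95.56% each, two misclassified flowers).
**GaussianNB** scored slightly lower at 93.33% (three misclassified flowers), consistent with its feature-independence assumption fitting this data less well. With a test set this small, the gaps between models amount to one to three flowers, so the ranking may change with a different split.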

This project demonstrates the effectiveness of various machine learning models for simple classification tasks and highlights the importance of choosing the right model for the given data.