***About Dataset***

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his
1936 paper "The use of multiple measurements in taxonomic problems". It is sometimes called Anderson's Iris data set because
Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. The data
set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four
features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

This dataset became a typical test case for many statistical classification techniques in machine learning, such as support
vector machines.

Content
The dataset contains 150 records with 5 attributes: Sepal Length, Sepal Width, Petal Length, Petal Width, and
Class (Species).
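
As a quick sanity check on the structure described above, the sketch below loads the copy of Iris that ships with scikit-learn and confirms the 150 x 5 shape and the 50 samples per class. This is an assumption-laden stand-in for the CSV used in this notebook: the bundled copy uses different column names (for example `sepal length (cm)`) and plain species names, and the helper `load_iris_frame` is a hypothetical name introduced only for this illustration.

```python
from sklearn.datasets import load_iris


def load_iris_frame():
    """Return scikit-learn's bundled Iris data as a single DataFrame.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    bunch = load_iris(as_frame=True)
    frame = bunch.frame.copy()  # four feature columns plus an integer 'target'
    # Map the integer target back to species names for readability.
    frame["species"] = frame["target"].map(dict(enumerate(bunch.target_names)))
    return frame.drop(columns="target")


iris_check = load_iris_frame()
print(iris_check.shape)
print(iris_check["species"].value_counts())
```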

# Importing Necessary Libraries

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Loading Data

In [2]:
iris = pd.read_csv(r"C:\Users\yoges\Downloads\archive\IRIS.csv")

In [3]:
iris.head()

   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


In [4]:
iris.shape

(150, 5)

# Data Cleaning & Data Preprocessing

In [5]:
iris.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

In [6]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [7]:
iris.species.value_counts()

species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

In [8]:
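# Encode the class labels as integers. Assigning the result back to the column
# avoids an in-place update on an attribute-accessed Series, which newer pandas
# versions warn about and may not propagate to the DataFrame.
iris['species'] = iris['species'].map({'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2})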

In [9]:
iris.species.value_counts()

species
0    50
1    50
2    50
Name: count, dtype: int64

In [10]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB


# Sampling Train & Test Data

In [11]:
from sklearn.model_selection import train_test_split

In [12]:

# 80/20 split. No random_state is set, so the split and the scores below change from run to run.
iris_train, iris_test = train_test_split(iris, test_size=0.2)

In [13]:
# Features are every column except the last; the target is the last column (species).
iris_train_x = iris_train.iloc[:, 0:-1]
iris_train_y = iris_train.iloc[:, -1]

In [14]:
# Same feature/target separation for the held-out test rows.
iris_test_x = iris_test.iloc[:, 0:-1]
iris_test_y = iris_test.iloc[:, -1]
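
The split above is unseeded and unstratified, so class counts in the test set can vary (11/10/9 in the run shown in this notebook). A minimal sketch of a reproducible, stratified alternative follows. It assumes scikit-learn's bundled Iris data instead of the CSV, and the seed value 42 is an arbitrary choice, not something taken from the original run.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled dataset stands in for the CSV loaded earlier.
X, y = load_iris(return_X_y=True, as_frame=True)

# random_state fixes the shuffle; stratify=y keeps class proportions equal
# in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_test.value_counts().sort_index())
```

With stratification, each of the three classes contributes exactly 10 of the 30 test samples, which makes per-class metrics easier to compare across runs.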

# Random Forest

In [15]:
from sklearn.ensemble import RandomForestClassifier

In [16]:
ran_iris = RandomForestClassifier()

In [17]:
ran_iris.fit(iris_train_x, iris_train_y)

In [18]:
pred = ran_iris.predict(iris_test_x)
#pred = np.argmax(iris_test_x, axis=1)

In [19]:
pred

array([1, 1, 1, 0, 2, 2, 1, 0, 2, 1, 0, 1, 2, 0, 2, 1, 2, 1, 0, 0, 2, 1,
       0, 2, 0, 0, 2, 2, 0, 0], dtype=int64)

In [20]:
from sklearn.metrics import confusion_matrix

In [21]:
iris_tab = confusion_matrix(iris_test_y, pred)

In [22]:
iris_tab

array([[11,  0,  0],
       [ 0,  9,  1],
       [ 0,  0,  9]], dtype=int64)
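
Rows of the confusion matrix are true classes and columns are predicted classes, so the single off-diagonal entry is one class-1 flower predicted as class 2. As a rough illustration, the per-class recall, precision, and overall accuracy can be recomputed from the matrix above with NumPy alone (the variable names here are illustrative):

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[11, 0, 0],
               [0, 9, 1],
               [0, 0, 9]])

true_positives = np.diag(cm)

# Recall per class: TP / (TP + FN), i.e. diagonal over row sums.
recall_per_class = true_positives / cm.sum(axis=1)

# Precision per class: TP / (TP + FP), i.e. diagonal over column sums.
precision_per_class = true_positives / cm.sum(axis=0)

# Accuracy: correctly classified samples over all samples.
accuracy = true_positives.sum() / cm.sum()

print(recall_per_class, precision_per_class, accuracy)
```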

In [23]:
from sklearn.metrics import accuracy_score

In [24]:
accuracy_score(iris_test_y, pred) * 100

96.66666666666667

In [25]:
# Recall (TPR)
# TP / (TP + FN)

In [26]:
from sklearn.metrics import recall_score

In [27]:
recall_score(iris_test_y, pred , average='micro') * 100

96.66666666666667

In [28]:
from sklearn.metrics import precision_score

In [29]:
precision_score(iris_test_y, pred,average='micro') * 100

96.66666666666667

In [30]:
from sklearn.metrics import f1_score

In [31]:
f1_score(iris_test_y, pred, average='micro') *100

96.66666666666667
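
The accuracy, recall, precision, and F1 values above are identical because of `average='micro'`: in a single-label multiclass problem every misclassification counts as one false positive (for the predicted class) and one false negative (for the true class), so the micro-averaged scores all reduce to accuracy. The sketch below rebuilds label arrays from the confusion matrix shown earlier (a reconstruction for illustration, not the original test rows) and contrasts micro with macro averaging.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Confusion matrix from the Random Forest run above (rows: true, columns: predicted).
cm = np.array([[11, 0, 0],
               [0, 9, 1],
               [0, 0, 9]])
classes = np.arange(3)

# Rebuild label vectors that produce exactly this confusion matrix.
y_true = np.repeat(classes, cm.sum(axis=1))
y_pred = np.concatenate([np.repeat(classes, row) for row in cm])

acc = accuracy_score(y_true, y_pred)
micro = {
    "precision": precision_score(y_true, y_pred, average="micro"),
    "recall": recall_score(y_true, y_pred, average="micro"),
    "f1": f1_score(y_true, y_pred, average="micro"),
}
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(acc, micro, macro_f1)
```

Macro averaging weights each class equally, so it can differ from accuracy and is often the more informative summary when per-class behaviour matters.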

In [32]:
#pred_prob = ran_iris.predict_proba(iris_test_x)
#pred_prob

In [33]:
from sklearn.metrics import classification_report
# print() renders the line breaks; without it the cell shows the raw string with '\n' escapes
print(classification_report(iris_test_y, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.90      0.95        10
           2       0.90      1.00      0.95         9

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30

# SVM (Support Vector Machines)

In [34]:
from sklearn.svm import SVC

In [35]:
svm_iris = SVC()

In [36]:
svm_iris.fit(iris_train_x, iris_train_y)

In [37]:
pred = svm_iris.predict(iris_test_x)

In [38]:
pred

array([1, 1, 1, 0, 2, 2, 1, 0, 2, 1, 0, 1, 2, 0, 2, 1, 2, 1, 0, 0, 2, 1,
       0, 2, 0, 0, 2, 2, 0, 0], dtype=int64)

In [39]:
from sklearn.metrics import accuracy_score

In [40]:
accuracy_score(iris_test_y, pred) * 100

96.66666666666667

In [41]:
from sklearn.metrics import recall_score

In [42]:
recall_score(iris_test_y, pred , average='micro') * 100

96.66666666666667

In [43]:
from sklearn.metrics import precision_score

In [44]:
precision_score(iris_test_y, pred,average='micro') *100

96.66666666666667

In [45]:
from sklearn.metrics import f1_score

In [46]:
f1_score(iris_test_y, pred,average='micro') * 100

96.66666666666667

In [47]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
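
The ROC imports above are not used in this notebook. One possible way to use `roc_auc_score` for a three-class problem is a one-vs-rest AUC over predicted class probabilities, sketched below. The assumptions: scikit-learn's bundled Iris data replaces the CSV, a Random Forest is used because it exposes `predict_proba` by default (an `SVC` would need `probability=True`), and the seed 0 is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# One probability column per class; each row sums to 1.
proba = model.predict_proba(X_test)

# multi_class="ovr" averages the AUC of each class against the rest.
auc_ovr = roc_auc_score(y_test, proba, multi_class="ovr")
print(auc_ovr)
```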

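Random Forest:

Achieved 96.67% accuracy (29 of 30 test samples correct).
Robust performance with default hyperparameters; also suited to high-dimensional data, although Iris has only four features.
Provides feature importance scores.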
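
The feature importance scores mentioned above are exposed on a fitted forest as `feature_importances_`. A minimal sketch, assuming scikit-learn's bundled Iris data (whose column names differ from the CSV) and an arbitrary seed of 0:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True, as_frame=True)
forest = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances, one per feature, normalised to sum to 1.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```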

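SVM:

Achieved 96.67% accuracy, with predictions identical to the Random Forest on this test split.
Effective in high-dimensional spaces.
Often requires careful hyperparameter tuning, although the default SVC settings were sufficient here.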
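
This notebook used `SVC()` with default settings. If tuning were needed, one common approach is a cross-validated grid search over `C` and `gamma`, sketched below. The grid values, the added `StandardScaler` step, and the use of scikit-learn's bundled Iris data are all assumptions for illustration rather than choices made in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scaling inside the pipeline keeps each CV fold free of leakage.
pipeline = make_pipeline(StandardScaler(), SVC())

# make_pipeline names the SVC step "svc", hence the "svc__" prefix.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.1, 1],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```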

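Comparison:

Both models performed equally well on this single 80/20 split; with only 30 test samples, one misclassification shifts accuracy by about 3.3 percentage points.
Random Forest is simple and robust, while SVM offers effective separation.
Choice depends on interpretability, efficiency, and dataset characteristics.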
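
Because the comparison rests on a single train/test split, a cross-validated estimate may give a steadier picture. The sketch below runs 5-fold stratified cross-validation for both models with default hyperparameters. It assumes scikit-learn's bundled Iris data in place of the CSV, and the seeds are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Same folds for both models so the scores are directly comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}

# One accuracy value per fold for each model.
cv_scores = {name: cross_val_score(model, X, y, cv=cv) for name, model in models.items()}

for name, fold_scores in cv_scores.items():
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```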