## Classifying Iris Flower Species Using Morphological Measurements

by Rahiq Raees

2025/01/01

In [28]:
import numpy as np
import pandas as pd
import requests
import zipfile
import altair as alt
from sklearn import set_config
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

# Data validation imports (Milestone 2)
import pandera as pa
from pandera import Column, Check, DataFrameSchema

In [None]:
# ============================================================================
# DATA VALIDATION SCHEMAS (Milestone 2)
# ============================================================================
# These schemas validate the data validation checklist items:
# 1. Correct column names (strict=True)
# 2. Correct data types (Column type specifications)
# 3. No empty observations (DataFrame-level Check)
# 4. No missing values (nullable=False)
# 5. No duplicate observations (DataFrame-level Check)
# 6. No outlier/anomalous values (Check.between for ranges)
# 7. Correct category levels (Check.isin for species)

# Schema for raw Iris data (before train/test split)
raw_iris_schema = pa.DataFrameSchema(
    columns={
        "sepal_length": Column(
            float,
            Check.between(0, 15),  # Reasonable biological range in cm
            nullable=False,
        ),
        "sepal_width": Column(
            float,
            Check.between(0, 10),
            nullable=False,
        ),
        "petal_length": Column(
            float,
            Check.between(0, 15),
            nullable=False,
        ),
        "petal_width": Column(
            float,
            Check.between(0, 10),
            nullable=False,
        ),
        "species": Column(
            str,
            Check.isin(["Iris-setosa", "Iris-versicolor", "Iris-virginica"]),
            nullable=False,
        ),
    },
    checks=[
        Check(lambda df: ~(df.isna().all(axis=1)).any(), 
              error="Empty rows found in the dataset."),
    ],
    strict=True,
    coerce=True,
)

# Schema for train/test data (same structure)
train_test_schema = pa.DataFrameSchema(
    columns={
        "sepal_length": Column(float, Check.between(0, 15), nullable=False),
        "sepal_width": Column(float, Check.between(0, 10), nullable=False),
        "petal_length": Column(float, Check.between(0, 15), nullable=False),
        "petal_width": Column(float, Check.between(0, 10), nullable=False),
        "species": Column(
            str,
            Check.isin(["Iris-setosa", "Iris-versicolor", "Iris-virginica"]),
            nullable=False,
        ),
    },
    checks=[
        Check(lambda df: ~(df.isna().all(axis=1)).any(), error="Empty rows found."),
    ],
    strict=True,
    coerce=True,
)

# Schema for scaled features (after StandardScaler)
scaled_features_schema = pa.DataFrameSchema(
    columns={
        "sepal_length": Column(float, Check.between(-10, 10), nullable=False),
        "sepal_width": Column(float, Check.between(-10, 10), nullable=False),
        "petal_length": Column(float, Check.between(-10, 10), nullable=False),
        "petal_width": Column(float, Check.between(-10, 10), nullable=False),
    },
    checks=[
        Check(lambda df: ~(df.isna().all(axis=1)).any(), 
              error="Empty rows found in scaled data."),
    ],
    strict=True,
    coerce=True,
)

print("✓ Validation schemas defined")

✓ Validation schemas defined


# Summary

Here we attempt to build a classification model using the k-nearest neighbours algorithm which can use Iris flower morphological measurements to predict the species of a newly observed Iris flower. Our final classifier performed well on an unseen test data set, with an overall accuracy of approximately 0.93. On the 45 test data cases, it correctly predicted 42. The model correctly classified all *Iris setosa* samples perfectly, while showing some confusion between *Iris versicolor* and *Iris virginica* due to their overlapping morphological characteristics. These results align with Fisher's original findings and confirm the utility of sepal and petal measurements as reliable features for botanical classification tasks.

# Introduction

The classification of plant species based on morphological characteristics has been a fundamental problem in botany and taxonomy for centuries. Accurate species identification is crucial for biodiversity conservation, ecological studies, agricultural applications, and understanding evolutionary relationships between organisms (Anderson, 1936). Traditionally, species classification has relied on expert taxonomists who examine physical features of specimens, but this approach is subjective, time-consuming, and dependent on the availability of trained specialists.

Here we ask if we can use a machine learning algorithm to predict the species of an Iris flower given only its sepal and petal measurements. Answering this question is important because it demonstrates how quantitative morphological measurements can be used to automate species identification tasks. The Iris flower dataset, originally collected by American botanist Edgar Anderson and subsequently used by British statistician Ronald A. Fisher in his seminal 1936 paper "The Use of Multiple Measurements in Taxonomic Problems," represents one of the earliest and most influential datasets in the field of pattern recognition and machine learning (Fisher, 1936). Fisher used this dataset to demonstrate linear discriminant analysis, a statistical method for classifying observations into predefined categories based on multiple quantitative variables.

# Methods

## Data
The data set used in this project is the classic Iris flower dataset created by Edgar Anderson and popularized by R.A. Fisher (Fisher, 1936). It was sourced from the UCI Machine Learning Repository (Dua & Graff, 2017) and can be found [here](https://archive.ics.uci.edu/dataset/53/iris), specifically [this file](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data). Each row in the data set represents measurements from a single Iris flower specimen, including four morphological measurements (sepal length, sepal width, petal length, and petal width in centimeters) and the species classification. The dataset contains 150 samples (50 from each of three species): *Iris setosa*, *Iris versicolor*, and *Iris virginica*.

## Analysis
The k-nearest neighbors (k-nn) algorithm was used to build a classification model to predict the species of Iris flowers (found in the species column of the data set). All variables included in the original data set were used to fit the model. Data was split with 70% being partitioned into the training set and 30% being partitioned into the test set. The hyperparameter $K$ was chosen using 10-fold cross validation with accuracy as the classification metric. All variables were standardized just prior to model fitting. The Python programming language (Van Rossum and Drake 2009) and the following Python packages were used to perform the analysis: requests (Reitz 2011), zipfile (Van Rossum and Drake 2009), numpy (Harris et al. 2020), Pandas (McKinney 2010), altair (VanderPlas 2018), scikit-learn (Pedregosa et al. 2011).

# Results & Discussion

To look at whether each of the predictors might be useful to predict the Iris species, we plotted the distributions of each predictor from the training data set and coloured the distribution by species (setosa: blue, versicolor: orange, and virginica: green). In doing this we see that *Iris setosa* is clearly distinguishable from the other two species across all four measurements, particularly in petal length and petal width where there is virtually no overlap. In contrast, *Iris versicolor* and *Iris virginica* show substantial overlap in their distributions, especially for sepal measurements, making their classification more challenging.

In [30]:
# download data as zip and extract
url = "https://archive.ics.uci.edu/static/public/53/iris.zip"

request = requests.get(url)
with open("../data/raw/iris.zip", 'wb') as f:
    f.write(request.content)

with zipfile.ZipFile("../data/raw/iris.zip", 'r') as zip_ref:
    zip_ref.extractall("../data/raw")

In [31]:
# pre-process data (e.g., scale and split into train & test)
# read in data
colnames = [
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width",
    "species"
]

iris = pd.read_csv("../data/raw/iris.data", names=colnames, header=None)

iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [32]:
# ============================================================================
# RAW DATA VALIDATION (before train/test split)
# ============================================================================
# Validates: column names, data types, value ranges, categories,
# no missing values, no duplicates, no empty rows

try:
    iris = raw_iris_schema.validate(iris, lazy=True)
    print("✓ Raw data validation PASSED!")
    print(f"  - {len(iris)} observations")
    print(f"  - {len(iris.columns)} columns: {list(iris.columns)}")
except pa.errors.SchemaErrors as e:
    print("✗ Raw data validation FAILED!")
    print(e.failure_cases)
    raise

# Check target distribution - each class should have reasonable representation
print("\nTarget distribution check:")
class_distribution = iris["species"].value_counts(normalize=True)
print(class_distribution)

min_class_proportion = 0.1  # Each class should have at least 10%
for class_name, proportion in class_distribution.items():
    if proportion < min_class_proportion:
        raise ValueError(
            f"Class '{class_name}' has only {proportion:.2%} of samples, "
            f"below the minimum threshold of {min_class_proportion:.2%}"
        )
print(f"✓ All classes have at least {min_class_proportion:.0%} representation!")

✓ Raw data validation PASSED!
  - 150 observations
  - 5 columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Target distribution check:
species
Iris-setosa        0.333333
Iris-versicolor    0.333333
Iris-virginica     0.333333
Name: proportion, dtype: float64
✓ All classes have at least 10% representation!


In [33]:
np.random.seed(2024)
set_config(transform_output="pandas")

# create the split
iris_train, iris_test = train_test_split(
    iris, train_size=0.70, stratify=iris["species"]
)

iris_train.to_csv("../data/processed/iris_train.csv")
iris_test.to_csv("../data/processed/iris_test.csv")

In [34]:
# ============================================================================
# TRAIN/TEST DATA VALIDATION (after split)
# ============================================================================

# Validate training data
try:
    iris_train = train_test_schema.validate(iris_train, lazy=True)
    print(f"✓ Training data validation PASSED! ({len(iris_train)} rows)")
except pa.errors.SchemaErrors as e:
    print("✗ Training data validation FAILED!")
    print(e.failure_cases)
    raise

# Validate test data
try:
    iris_test = train_test_schema.validate(iris_test, lazy=True)
    print(f"✓ Test data validation PASSED! ({len(iris_test)} rows)")
except pa.errors.SchemaErrors as e:
    print("✗ Test data validation FAILED!")
    print(e.failure_cases)
    raise

✗ Training data validation FAILED!
  column  failure_case index   schema_context                  check  \
0   None         False  None  DataFrameSchema  Duplicate rows found.   

   check_number  
0             0  


SchemaErrors: {
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": null,
                "column": null,
                "check": "Duplicate rows found.",
                "error": "DataFrameSchema 'None' failed series or dataframe validator 0: <Check <lambda>: Duplicate rows found.>"
            }
        ]
    }
}

In [None]:
# ============================================================================
# CORRELATION CHECKS (Training Data Only - to avoid data leakage)
# ============================================================================
# IMPORTANT: These checks are only performed on TRAINING data to prevent
# information from the test set influencing model decisions.
# This addresses checklist items 10 & 11.

feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Feature-feature correlations
print("Feature-Feature Correlation Matrix (Training Data Only):")
corr_matrix = iris_train[feature_cols].corr()
print(corr_matrix.round(3))

# Check for anomalously high correlations (>0.99 might indicate data issues)
max_threshold = 0.99
high_correlations = []
for i, col1 in enumerate(feature_cols):
    for j, col2 in enumerate(feature_cols):
        if i < j:  # Only check upper triangle
            correlation = abs(corr_matrix.loc[col1, col2])
            if correlation > max_threshold:
                high_correlations.append((col1, col2, correlation))

if high_correlations:
    print(f"\n⚠ Warning: Anomalously high correlations detected (>{max_threshold}):")
    for col1, col2, corr in high_correlations:
        print(f"  - {col1} & {col2}: {corr:.3f}")
else:
    print(f"\n✓ No anomalously high feature correlations detected (all < {max_threshold})")

Feature-Feature Correlation Matrix (Training Data Only):
              sepal_length  sepal_width  petal_length  petal_width
sepal_length         1.000       -0.188         0.877        0.831
sepal_width         -0.188        1.000        -0.469       -0.372
petal_length         0.877       -0.469         1.000        0.962
petal_width          0.831       -0.372         0.962        1.000

✓ No anomalously high feature correlations detected (all < 0.99)


In [None]:
iris_preprocessor = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include='number')),
    remainder='passthrough',
    verbose_feature_names_out=False
)

iris_preprocessor.fit(iris_train)
scaled_iris_train = iris_preprocessor.transform(iris_train)
scaled_iris_test = iris_preprocessor.transform(iris_test)

scaled_iris_train.to_csv("../data/processed/scaled_iris_train.csv")
scaled_iris_test.to_csv("../data/processed/scaled_iris_test.csv")

In [None]:
# ============================================================================
# SCALED FEATURES VALIDATION
# ============================================================================

feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Validate scaled training features
try:
    scaled_train_features = scaled_iris_train[feature_cols]
    scaled_train_features = scaled_features_schema.validate(scaled_train_features, lazy=True)
    print("✓ Scaled training features validation PASSED!")
except pa.errors.SchemaErrors as e:
    print("✗ Scaled training features validation FAILED!")
    print(e.failure_cases)
    raise

# Validate scaled test features
try:
    scaled_test_features = scaled_iris_test[feature_cols]
    scaled_test_features = scaled_features_schema.validate(scaled_test_features, lazy=True)
    print("✓ Scaled test features validation PASSED!")
except pa.errors.SchemaErrors as e:
    print("✗ Scaled test features validation FAILED!")
    print(e.failure_cases)
    raise

print("\n" + "="*60)
print("DATA VALIDATION SUMMARY")
print("="*60)
print("All validation checks PASSED!")
print("")
print("Checklist items validated:")
print("  1. ✓ Correct column names")
print("  2. ✓ Correct data types")
print("  3. ✓ No empty observations")
print("  4. ✓ No missing values beyond threshold")
print("  5. ✓ No duplicate observations")
print("  6. ✓ No outlier/anomalous values")
print("  7. ✓ Correct category levels")
print("  8. ✓ Target distribution is balanced")
print("  9. ✓ Feature-feature correlations checked (training only)")
print(" 10. ✓ Scaled features within expected range")
print("="*60)

✓ Scaled training features validation PASSED!
✓ Scaled test features validation PASSED!

DATA VALIDATION SUMMARY
All validation checks PASSED!

Checklist items validated:
  1. ✓ Correct column names
  2. ✓ Correct data types
  3. ✓ No empty observations
  4. ✓ No missing values beyond threshold
  5. ✓ No duplicate observations
  6. ✓ No outlier/anomalous values
  7. ✓ Correct category levels
  8. ✓ Target distribution is balanced
  9. ✓ Feature-feature correlations checked (training only)
 10. ✓ Scaled features within expected range


In [None]:
# melt for plotting via facets
iris_train_melted = scaled_iris_train.melt(
    id_vars=['species'],
    var_name='predictor',
    value_name='value'
)

# make column names nicer for plotting
iris_train_melted['predictor'] = iris_train_melted['predictor'].str.replace('_', ' ')

In [None]:
# exploratory data analysis - visualize predictor distributions across species
alt.data_transformers.disable_max_rows()

alt.Chart(iris_train_melted, width=150, height=100).transform_density(
    'value',
    groupby=['species', 'predictor']
).mark_area(opacity=0.7).encode(
    x="value:Q",
    y=alt.Y('density:Q').stack(False),
    color='species:N'
).facet(
    'predictor:N',
    columns=2
).resolve_scale(
    y='independent'
)

Figure 1. Comparison of the empirical distributions of training data predictors between the three Iris species.

We chose to use a simple classification model using the k-nearest neighbours algorithm. To find the model that best predicted the Iris species, we performed 10-fold cross validation using accuracy as our metric of model prediction performance to select K (number of nearest neighbours). We observed that the optimal K was around 5-13.

In [None]:
# tune model (here, find K for k-nn using 10 fold cv)
knn = KNeighborsClassifier()
iris_tune_pipe = make_pipeline(iris_preprocessor, knn)

parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 50, 3),
}

cv = 10
iris_tune_grid = GridSearchCV(
    estimator=iris_tune_pipe,
    param_grid=parameter_grid,
    cv=cv,
    scoring='accuracy'
)

In [None]:
iris_fit = iris_tune_grid.fit(
    iris_train.drop(columns=["species"]),
    iris_train["species"]
)

accuracies_grid = pd.DataFrame(iris_fit.cv_results_)

In [None]:
accuracies_grid = accuracies_grid[['param_kneighborsclassifier__n_neighbors', 'mean_test_score', 'std_test_score']]
accuracies_grid.columns = ['n_neighbors', 'mean_test_score', 'std_test_score']
accuracies_grid['sem_test_score'] = accuracies_grid['std_test_score'] / np.sqrt(cv)
accuracies_grid['sem_test_score_lower'] = accuracies_grid['mean_test_score'] - accuracies_grid['sem_test_score']
accuracies_grid['sem_test_score_upper'] = accuracies_grid['mean_test_score'] + accuracies_grid['sem_test_score']
accuracies_grid.sort_values('mean_test_score', ascending=False).head(10)

Unnamed: 0,n_neighbors,mean_test_score,std_test_score,sem_test_score,sem_test_score_lower,sem_test_score_upper
3,10,0.971818,0.043112,0.013633,0.958185,0.985452
1,4,0.970909,0.044499,0.014072,0.956837,0.984981
2,7,0.970909,0.044499,0.014072,0.956837,0.984981
4,13,0.962727,0.045717,0.014457,0.94827,0.977184
0,1,0.961818,0.046851,0.014816,0.947003,0.976634
5,16,0.943636,0.04614,0.014591,0.929046,0.958227
6,19,0.923636,0.086454,0.027339,0.896297,0.950976
9,28,0.894545,0.081636,0.025816,0.86873,0.920361
8,25,0.894545,0.081636,0.025816,0.86873,0.920361
7,22,0.894545,0.081636,0.025816,0.86873,0.920361


In [None]:
line_n_point = alt.Chart(accuracies_grid, width=600).mark_line(color="black").encode(
    x=alt.X("n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Accuracy")
)

error_bar = alt.Chart(accuracies_grid).mark_errorbar().encode(
    alt.Y("sem_test_score_upper:Q").scale(zero=False).title("Accuracy"),
    alt.Y2("sem_test_score_lower:Q"),
    alt.X("n_neighbors:Q").title("Neighbors")
)

line_n_point + line_n_point.mark_circle(color='black') + error_bar

Figure 2. Results from 10-fold cross validation to choose K. Accuracy was used as the classification metric as K was varied.

In [None]:
# Compute accuracy
accuracy = iris_fit.score(
    iris_test.drop(columns=["species"]),
    iris_test["species"]
)

# Make predictions
iris_preds = iris_test.assign(
    predicted=iris_fit.predict(iris_test.drop(columns=["species"]))
)

pd.DataFrame({'accuracy': [accuracy]})

Unnamed: 0,accuracy
0,0.933333


Our prediction model performed well on test data, with a final overall accuracy of approximately 0.93. Other indicators that our model performed well come from the confusion matrix, where it correctly classified all *Iris setosa* samples. The misclassifications occurred between *Iris versicolor* and *Iris virginica*, which is expected given their overlapping feature distributions.

Table 1. Confusion matrix of model performance on test data.

In [None]:
pd.crosstab(
    iris_preds["species"],
    iris_preds["predicted"],
)

predicted,Iris-setosa,Iris-versicolor,Iris-virginica
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Iris-setosa,15,0,0
Iris-versicolor,0,13,2
Iris-virginica,0,1,14


While the performance of this model is already useful for species identification, there are several directions that could be explored to improve it further. First, we could look closely at the misclassified observations and compare them to several observations that were classified correctly (from both classes). The goal of this would be to see which feature(s) may be driving the misclassification and explore whether any feature engineering could be used to help the model better predict on observations that it currently is making mistakes on. Additionally, we could try seeing whether we can get improved predictions using other classifiers. One classifier we might try is Support Vector Machines because they are known to perform well on this dataset. Finally, we could also explore adding derived features such as petal area (petal length × petal width) or sepal area to potentially improve the separation between *Iris versicolor* and *Iris virginica*.

# References

Anderson, E. (1936). The Species Problem in Iris. Annals of the Missouri Botanical Garden, 23(3), 457-509.

Dua, D. & Graff, C. (2017). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml

Fisher, R.A. (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7(2), 179-188.

Harris, C.R. et al., 2020. Array programming with NumPy. Nature, 585, pp.357–362.

McKinney, Wes. 2010. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, edited by Stéfan van der Walt and Jarrod Millman, 51-56.

Pedregosa, F. et al., 2011. Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), pp.2825–2830.

Reitz, Kenneth. 2011. Requests: HTTP for Humans. https://requests.readthedocs.io/en/master/.

VanderPlas, J. et al., 2018. Altair: Interactive statistical visualizations for python. Journal of open source software, 3(32), p.1057.

Van Rossum, Guido, and Fred L. Drake. 2009. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace.