# Iris Species Classification using KNN
This notebook loads the Kaggle Iris dataset (`Iris.csv`), fits KNN classifiers with Euclidean and Manhattan distances, and tunes the number of neighbors `k` against test accuracy.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Kaggle Iris dataset
df = pd.read_csv('Iris.csv')

# Rename columns for consistency
df.rename(columns={
    'SepalLengthCm': 'sepal_length',
    'SepalWidthCm': 'sepal_width',
    'PetalLengthCm': 'petal_length',
    'PetalWidthCm': 'petal_width',
    'Species': 'species'
}, inplace=True)
df.head()

## Data Exploration
Scatter plot of `petal_length` vs `petal_width`, colored by species.

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=df, x='petal_length', y='petal_width', hue='species')
plt.title('Petal Length vs Petal Width by Species')
plt.show()

## KNN Classification
Split the data 80/20 into train and test sets, then compare the test accuracy of KNN (k=5) with Euclidean and Manhattan distances.

In [None]:
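# Features: the four measurements. 'Id' is a row index, not a predictor.
X = df.drop(['Id','species'], axis=1)
# Target: the species label
y = df['species']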

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Euclidean
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
acc_euclidean = knn_euclidean.score(X_test, y_test)

# Manhattan
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
acc_manhattan = knn_manhattan.score(X_test, y_test)

acc_euclidean, acc_manhattan
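
To make the two metrics compared above concrete, here is a minimal sketch that computes both distances by hand for a pair of points. The two measurement vectors are hypothetical values chosen for illustration, not rows taken from `Iris.csv`.

```python
import numpy as np

# Two hypothetical flowers (illustrative values, not rows from Iris.csv):
# sepal_length, sepal_width, petal_length, petal_width in cm
a = np.array([5.0, 3.5, 1.5, 0.5])
b = np.array([6.0, 3.0, 4.5, 1.5])

diff = a - b  # per-feature differences: [-1.0, 0.5, -3.0, -1.0]

# Euclidean (L2): square root of the sum of squared differences
euclidean = np.sqrt(np.sum(diff ** 2))  # sqrt(1 + 0.25 + 9 + 1) = sqrt(11.25)

# Manhattan (L1): sum of absolute differences
manhattan = np.sum(np.abs(diff))  # 1 + 0.5 + 3 + 1 = 5.5

print(f"Euclidean: {euclidean:.3f}")
print(f"Manhattan: {manhattan:.3f}")
```

Because Euclidean distance squares each difference, the petal-length gap of 3 cm contributes 9 of the 11.25 under the square root, while under Manhattan it contributes 3 of 5.5.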

## Hyperparameter Tuning
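Test k values from 1 to 10 using Euclidean distance. Note that `best_k` is selected on the test set here, so `best_accuracy` is an optimistic estimate of performance on unseen data.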

In [None]:
accuracies = []
k_values = range(1, 11)

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    knn.fit(X_train, y_train)
    acc = knn.score(X_test, y_test)
    accuracies.append(acc)

plt.figure(figsize=(8,6))
plt.plot(k_values, accuracies, marker='o')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Test Accuracy')
plt.title('KNN Accuracy vs k (Euclidean)')
plt.show()

best_k = k_values[np.argmax(accuracies)]
best_accuracy = max(accuracies)
best_k, best_accuracy
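
The loop above picks `k` by test accuracy. A common alternative is to choose `k` by cross-validation on the training split and to touch the test set only once at the end. The following is a sketch under stated assumptions, not a replacement for the cell above: it uses scikit-learn's bundled Iris data (`load_iris`) as a stand-in for `Iris.csv`, adds a stratified split, and the names `X_demo`, `k_candidates`, `cv_means`, and `cv_best_k` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

k_candidates = list(range(1, 11))
cv_means = []
for k in k_candidates:
    model = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    # 5-fold cross-validation on the training split only
    scores = cross_val_score(model, X_tr, y_tr, cv=5)
    cv_means.append(scores.mean())

# np.argmax returns the first maximum, so ties go to the smallest k
cv_best_k = k_candidates[int(np.argmax(cv_means))]

# Refit with the selected k and score once on the held-out test set
final_model = KNeighborsClassifier(n_neighbors=cv_best_k, metric='euclidean')
final_model.fit(X_tr, y_tr)
test_accuracy = final_model.score(X_te, y_te)

print(cv_best_k, test_accuracy)
```

With only 150 samples, the cross-validated means for neighboring values of `k` may differ by a single misclassified flower, so the selected `k` should be read as one reasonable choice rather than a sharp optimum.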

## Reflection
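- Euclidean vs Manhattan: On the Iris dataset the two metrics usually give very similar accuracy, because the species are well separated in the petal measurements; compare `acc_euclidean` and `acc_manhattan` above for this particular split. Euclidean distance squares each feature difference, so a single large deviation (for example an outlier in one feature) weighs more heavily than under Manhattan distance, which sums absolute differences.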
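- Preprocessing suggestion: **Scaling (StandardScaler or MinMaxScaler)** is important for KNN because distances are dominated by the features with the largest numeric spread. All four Iris features are in centimeters, but petal length varies over a wider range than petal width, so without scaling it contributes more to each distance. The scaler should be fit on the training split only and then applied to the test split.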
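
The scaling suggestion can be sketched with a scikit-learn pipeline, which fits the scaler on the training data only and reuses those statistics at prediction time. This is an illustrative sketch rather than part of the analysis above: it assumes scikit-learn's bundled Iris data (`load_iris`) in place of `Iris.csv`, and the names `scaled_knn` and `acc_scaled` are introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline standardizes each feature (zero mean, unit variance)
# using training statistics, then applies KNN to the scaled features
scaled_knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric='euclidean'),
)
scaled_knn.fit(X_tr, y_tr)
acc_scaled = scaled_knn.score(X_te, y_te)

# Inspect the per-feature mean and standard deviation learned from X_tr
scaler = scaled_knn.named_steps['standardscaler']
print("means:", scaler.mean_)
print("stds: ", scaler.scale_)
print("scaled test accuracy:", acc_scaled)
```

On Iris the features share a unit and similar magnitudes, so scaling may change accuracy only slightly in either direction. The effect tends to be larger on datasets that mix units or orders of magnitude.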