# KNN Classifier for penguins

The following exercise aims to apply a KNN classifier to distinguish different species of penguins based on the Palmer Penguins dataset imported via Seaborn.

#### 1) Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline


#### 2) Load data

In [None]:
# Load penguins dataset
penguins = sns.load_dataset("penguins")

# Show first rows
penguins.head()

#### 3) Inspect the data
Apply the basic tools of EDA (summary statistics, visualisation) and work out which features look well suited for separating the species.
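
A minimal sketch of what such an inspection could look like. So that it runs on its own, it uses a tiny hand-made stand-in frame called `toy` (a hypothetical name; the values are invented for illustration). In the notebook, the same calls would be applied to `penguins`.

```python
import pandas as pd

# Tiny stand-in with column names taken from the penguins data.
# The numbers are invented for illustration; use `penguins` in the notebook.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "bill_length_mm": [39.0, 38.0, 47.0, 49.0, 49.5, None],
    "flipper_length_mm": [181.0, 186.0, 217.0, 221.0, 196.0, 193.0],
})

# How many values are missing per column?
missing = toy.isna().sum()
print(missing)

# How many animals per species? (class balance)
counts = toy["species"].value_counts()
print(counts)

# Per-species means: features whose means differ strongly between species
# are promising candidates for classification.
means = toy.groupby("species").mean(numeric_only=True)
print(means)
```

For a visual check, `sns.pairplot(penguins, hue="species")` draws every pair of numeric features coloured by species, which makes well-separated features easy to spot.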

#### 4) Data cleaning
Drop rows with missing numerical values or missing labels (hint: ideally via the `subset` argument; don't forget to comment your code), then separate the features (X) from the target (y).

In [None]:
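# Drop only rows missing a feature value or the label; gaps in unused columns (e.g. "sex") are harmless here
penguins = penguins.dropna(subset=["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "species"])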

In [None]:
# Features (X) and target (y)
X = penguins[[
    "bill_length_mm",
    "bill_depth_mm",
    "flipper_length_mm",
    "body_mass_g"
]]

y = penguins["species"]

#### 5) Train-test split
Later, check how a different split ratio influences the prediction.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    stratify=y
)
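
The `stratify=y` argument keeps the class proportions roughly equal in both parts of the split. The sketch below illustrates this on invented, imbalanced stand-in labels (`X_toy` and `y_toy` are hypothetical names, not the penguin data), so it can be run independently of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced stand-in labels (invented counts) and one dummy feature.
y_toy = pd.Series(["Adelie"] * 60 + ["Gentoo"] * 30 + ["Chinstrap"] * 10)
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.3,
    random_state=42,
    stratify=y_toy,
)

# Class shares in each part; with stratification they should stay close
# to the shares in the full data (0.6 / 0.3 / 0.1).
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share)
print(test_share)
```

Rerunning it with a different `test_size` (or without `stratify`) shows how the split ratio and the class shares interact, especially for the rarest class.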

#### 6) Feature scaling
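kNN is a distance-based algorithm, so scaling is essential: without it, features with large numeric ranges (such as `body_mass_g`) dominate the distance. Fit the scaler on the training set only and reuse it to transform the test set.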

In [None]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
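
To see why scaling matters for a distance-based method, the following sketch compares Euclidean distances before and after standardisation. The three "penguins" are invented values for illustration, and `dist` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: bill_length_mm, body_mass_g (invented values for illustration).
X_toy = np.array([
    [39.0, 3700.0],   # penguin A
    [49.0, 3750.0],   # penguin B: very different bill, similar mass
    [39.5, 4300.0],   # penguin C: similar bill, different mass
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw units: the mass difference (in grams) dominates the distance,
# so B looks far closer to A than C does.
raw_ab = dist(X_toy[0], X_toy[1])
raw_ac = dist(X_toy[0], X_toy[2])

# After standardisation both features contribute on a comparable scale.
X_std = StandardScaler().fit_transform(X_toy)
std_ab = dist(X_std[0], X_std[1])
std_ac = dist(X_std[0], X_std[2])

print(f"raw:    A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"scaled: A-B = {std_ab:.2f}, A-C = {std_ac:.2f}")
```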

#### 7) Train the model
As a start, use k=5; then try higher and lower values and see how the prediction changes.

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)

knn.fit(X_train_scaled, y_train)

#### 8) Predictions
Predict the labels for both the training and the test set.

In [None]:
y_test_pred = knn.predict(X_test_scaled)
y_train_pred = knn.predict(X_train_scaled)

#### 9) Evaluation of the model

In [None]:
print("Accuracy on training set:", accuracy_score(y_train, y_train_pred))
print("Accuracy on test set:", accuracy_score(y_test, y_test_pred))
print("\n------------------------------------------------")
print("\nClassification Report:\n")
print(classification_report(y_test, y_test_pred))

In [None]:
cm = confusion_matrix(y_test, y_test_pred)

plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=knn.classes_,
            yticklabels=knn.classes_,
            cmap="Blues")

plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

#### 10) Explore
- Try different k
- See what happens if you omit the scaling
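
One possible way to organise this exploration is a loop over several values of k that fits each model once with and once without scaling. The sketch below uses `Pipeline` to bundle scaler and classifier; it runs on synthetic stand-in data from `make_blobs` (an assumption made to keep it self-contained, not the penguin data), so the printed numbers say nothing about penguins. In the notebook, the unscaled `X_train`, `X_test`, `y_train` and `y_test` would take the place of the synthetic split.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (three classes, four features), NOT the penguins.
X_syn, y_syn = make_blobs(n_samples=300, centers=3, n_features=4, random_state=42)
X_syn[:, 3] *= 1000  # blow up one feature's scale, like grams vs. millimetres

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

results = {}
for k in (1, 5, 15):
    # The pipeline fits the scaler on the training data only, then the classifier.
    scaled = Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=k)),
    ])
    unscaled = KNeighborsClassifier(n_neighbors=k)

    results[k] = (
        scaled.fit(X_tr, y_tr).score(X_te, y_te),      # test accuracy with scaling
        unscaled.fit(X_tr, y_tr).score(X_te, y_te),    # test accuracy without scaling
    )
    print(f"k={k:>2}  scaled: {results[k][0]:.3f}  unscaled: {results[k][1]:.3f}")
```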

#### 11) Predict labels for unknown data

- Predict the labels for the (generated) penguins in "penguins_mystery.csv".
- Invent your own penguin and see which class the KNN assigns to it. You can exaggerate a bit ;-)
- Bonus: Use an AI tool to create an image of your fantasy penguin and share it on the Teams channel.
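
A minimal sketch of how an invented penguin could be classified. To run on its own it trains a small model on a tiny invented training set (`train`, `toy_scaler`, `toy_knn` and `fantasy` are hypothetical names, and all values are made up); in the notebook, the already fitted `scaler` and `knn` would be reused instead.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

FEATURES = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]

# Tiny invented training set so the sketch runs on its own;
# in the notebook, reuse the fitted `scaler` and `knn` instead.
train = pd.DataFrame(
    [
        [39.0, 18.5, 185.0, 3700.0, "Adelie"],
        [38.0, 18.0, 190.0, 3600.0, "Adelie"],
        [40.0, 19.0, 188.0, 3800.0, "Adelie"],
        [47.0, 15.0, 217.0, 5100.0, "Gentoo"],
        [48.0, 14.5, 220.0, 5200.0, "Gentoo"],
        [46.0, 15.5, 215.0, 5000.0, "Gentoo"],
    ],
    columns=FEATURES + ["species"],
)

toy_scaler = StandardScaler().fit(train[FEATURES])
toy_knn = KNeighborsClassifier(n_neighbors=3).fit(
    toy_scaler.transform(train[FEATURES]), train["species"]
)

# An invented, slightly exaggerated penguin. Column names and order must match
# the training features, and it must pass through the SAME fitted scaler.
fantasy = pd.DataFrame([[60.0, 14.0, 240.0, 7000.0]], columns=FEATURES)
prediction = toy_knn.predict(toy_scaler.transform(fantasy))
print(prediction[0])
```

The same pattern should carry over to the mystery file: load it with `pd.read_csv("penguins_mystery.csv")`, select the four feature columns, transform them with the fitted scaler and call `predict`. This assumes the file uses the same column names as the training data, so check its header first.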