# Unsupervised Clustering Case Study – Penguin Segmentation

## Objective
Use unsupervised learning (K-Means) to identify natural groupings within penguin measurements.

## What this notebook covers
- Data loading and quick audit
- Preprocessing (scaling numeric features + encoding categorical features)
- Selecting the number of clusters using Elbow + Silhouette
- Clustering with K-Means
- PCA-based visualisation
- Cluster summary statistics


In [None]:
# Core libraries
import pandas as pd
import numpy as np

# Visualisation
import matplotlib.pyplot as plt

# ML
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


## 1. Load dataset

In [None]:
df = pd.read_csv("penguins.csv")
df.head()

In [None]:
df.info()

In [None]:
# Missing values overview
df.isna().sum()

## 2. Basic cleaning
We drop rows with missing values for simplicity, which is common in demo and portfolio notebooks.

> A more robust alternative is to replace this step with imputation.

In [None]:
df_clean = df.dropna().copy()
df_clean.shape
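
As a rough illustration of the imputation alternative, the sketch below fills missing numeric values with the median and missing categories with the most frequent value before scaling and encoding. The toy frame and its column names are assumptions standing in for `penguins.csv`, and the imputation strategies are one reasonable choice rather than the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for penguins.csv (column names are assumptions)
toy = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, np.nan, 210.0, 215.0, 220.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 5400.0, 5700.0],
    "sex": pd.Series(["MALE", "FEMALE", np.nan, "FEMALE", "MALE", "MALE"], dtype=object),
})

numeric_cols = ["flipper_length_mm", "body_mass_g"]
categorical_cols = ["sex"]

# Numeric branch: fill gaps with the column median, then standardise
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical branch: fill gaps with the most frequent category, then one-hot encode
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(drop="first")),
])

impute_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_prepared = impute_preprocessor.fit_transform(toy)
print(toy_prepared.shape)  # no rows are lost to missing values
```

With this approach every row is kept, at the cost of introducing estimated values that can pull imputed rows towards the centre of the data.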

## 3. Preprocessing pipeline
Numeric features are scaled; categorical features (e.g., `sex`) are one-hot encoded.

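K-Means relies on Euclidean distance, so scaling keeps large-valued features such as body mass from dominating the result, and one-hot encoding is a simple (if imperfect) way to include categorical features.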

In [None]:
X = df_clean.copy()

numeric_features = X.select_dtypes(include=["number"]).columns.tolist()
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()

numeric_features, categorical_features

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), categorical_features)
    ]
)

# Transform the dataset into a numeric matrix suitable for clustering
X_prepared = preprocessor.fit_transform(X)
X_prepared.shape
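
To see why the numeric features are scaled before clustering, the sketch below compares Euclidean distances on raw and standardised values. The four rows are made-up measurements in the order `[flipper_length_mm, body_mass_g]`, chosen only to illustrate the effect.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements: [flipper_length_mm, body_mass_g]
toy = np.array([
    [180.0, 3500.0],
    [220.0, 3600.0],
    [182.0, 4500.0],
    [218.0, 5500.0],
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw units: body mass (grams) swamps flipper length (millimetres)
raw_01 = dist(toy[0], toy[1])  # 40 mm and 100 g apart
raw_02 = dist(toy[0], toy[2])  # 2 mm and 1000 g apart

# Standardised units: both features contribute on a comparable scale
scaled = StandardScaler().fit_transform(toy)
scaled_01 = dist(scaled[0], scaled[1])
scaled_02 = dist(scaled[0], scaled[2])

print(f"raw:    d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```

In raw units row 0 looks far closer to row 1 than to row 2, because the distance is almost entirely a difference in grams. After scaling, the large flipper difference between rows 0 and 1 counts as well and the ordering reverses.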

## 4. Select K (number of clusters)
We use:
- **Elbow Method** (inertia)
- **Silhouette Score** (higher is better)

Choosing K from these diagnostics avoids hardcoding an arbitrary value such as `k=4`.

In [None]:
k_values = range(2, 11)
inertias = []
sil_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_prepared)
    inertias.append(kmeans.inertia_)
    sil_scores.append(silhouette_score(X_prepared, labels))

best_k = k_values[int(np.argmax(sil_scores))]
best_k
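
For readers unfamiliar with the two diagnostics, the sketch below recomputes them by hand on synthetic data: inertia is the sum of squared distances from each point to its assigned centroid, and the silhouette score is the mean of the per-sample silhouette values. The `make_blobs` data is an assumption used only so the example runs without `penguins.csv`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for X_prepared (three well-separated blobs)
X_demo, _ = make_blobs(
    n_samples=150, centers=[[0, 0], [6, 6], [-6, 6]], cluster_std=0.8, random_state=0
)

km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_demo)

# Inertia: squared distance from every point to the centroid of its own cluster
assigned_centres = km.cluster_centers_[km.labels_]
manual_inertia = float(((X_demo - assigned_centres) ** 2).sum())

# Silhouette score: average of the per-sample silhouette coefficients
manual_silhouette = float(silhouette_samples(X_demo, km.labels_).mean())

print(manual_inertia, km.inertia_)
print(manual_silhouette, silhouette_score(X_demo, km.labels_))
```

Inertia always decreases as K grows, which is why the elbow plot is read for a bend rather than a minimum, whereas the silhouette score can be compared directly across values of K.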

In [None]:
plt.figure(figsize=(8, 5))
plt.plot(list(k_values), inertias, marker="o")
plt.title("Elbow Method (Inertia) to Select K")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.xticks(list(k_values))
plt.grid(True)
plt.show()

In [None]:
plt.figure(figsize=(8, 5))
plt.plot(list(k_values), sil_scores, marker="o")
plt.title("Silhouette Score to Select K (Higher is Better)")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Silhouette Score")
plt.xticks(list(k_values))
plt.grid(True)
plt.show()

print(f"Best K by silhouette score: {best_k}")

## 5. Fit final K-Means model

In [None]:
final_k = best_k  # you can manually override this if needed
kmeans_final = KMeans(n_clusters=final_k, random_state=42, n_init=10)
clusters = kmeans_final.fit_predict(X_prepared)

df_result = df_clean.copy()
df_result["Cluster"] = clusters
df_result.head()
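
If the fitted model needs to assign clusters to new observations, one option is to wrap preprocessing and K-Means in a single `Pipeline`. The sketch below is a minimal illustration on a made-up frame; the column names, the choice of `n_clusters=2`, and the `new_bird` row are assumptions for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: three smaller and three larger birds
toy = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, 190.0, 215.0, 220.0, 225.0],
    "body_mass_g": [3700.0, 3800.0, 3650.0, 5400.0, 5700.0, 5550.0],
    "sex": ["MALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "MALE"],
})

cluster_pipeline = Pipeline([
    ("prep", ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), ["flipper_length_mm", "body_mass_g"]),
            ("cat", OneHotEncoder(drop="first"), ["sex"]),
        ],
        sparse_threshold=0,  # always return a dense array
    )),
    ("kmeans", KMeans(n_clusters=2, random_state=42, n_init=10)),
])

# Fit preprocessing and clustering together, returning a label per row
toy_labels = cluster_pipeline.fit_predict(toy)

# A new observation goes through the same scaling and encoding before assignment
new_bird = pd.DataFrame(
    {"flipper_length_mm": [222.0], "body_mass_g": [5600.0], "sex": ["MALE"]}
)
new_label = cluster_pipeline.predict(new_bird)[0]
print(toy_labels, new_label)
```

Keeping both steps in one object means the scaler statistics learned during fitting are reused for new rows, so they are never rescaled on their own.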

## 6. Visualise clusters using PCA (2D)
We reduce the prepared feature matrix to 2 dimensions and plot the clusters.

In [None]:
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_prepared)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters)
plt.legend(*scatter.legend_elements(), title="Cluster")  # map colours to cluster labels
plt.title(f"K-Means Clusters Visualised with PCA (K={final_k})")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.grid(True)
plt.show()
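
When reading a 2D PCA plot it helps to know how much of the total variance the two components retain. The sketch below shows one way to check this with `explained_variance_ratio_`; the random matrix `X_demo` is a stand-in for `X_prepared`, so the printed numbers say nothing about the penguin data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the prepared feature matrix (200 rows, 5 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))

pca_demo = PCA(n_components=2, random_state=42).fit(X_demo)

# Share of total variance captured by each component, and by both together
per_component = pca_demo.explained_variance_ratio_
retained = float(per_component.sum())

print(per_component.round(3))
print(f"Variance retained in 2D: {retained:.1%}")
```

A low retained share suggests the plot may understate how well separated the clusters are in the full feature space.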

## 7. Cluster summary statistics
We summarise numeric features by cluster to interpret how clusters differ.

In [None]:
summary = df_result.groupby("Cluster")[numeric_features].mean().round(2)
summary
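
Cluster means are easier to interpret alongside cluster sizes and the mix of categories in each cluster. The sketch below shows one way to compute both with `value_counts` and `pd.crosstab`; the small `toy_result` frame is made up and stands in for `df_result`.

```python
import pandas as pd

# Hypothetical stand-in for df_result with a cluster label and one categorical column
toy_result = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1, 1, 1, 2],
    "sex": ["MALE", "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "MALE"],
})

# Number of rows in each cluster
sizes = toy_result["Cluster"].value_counts().sort_index()

# Row-normalised table: share of each sex within each cluster
sex_share = pd.crosstab(
    toy_result["Cluster"], toy_result["sex"], normalize="index"
).round(2)

print(sizes)
print(sex_share)
```

A very small cluster, or one made up almost entirely of a single category, is worth checking before drawing conclusions from its means.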

## 8. Quick interpretation guide
- Look for clusters separated mainly by **body_mass_g** and **flipper_length_mm** (often strong differentiators).
- If one cluster has consistently higher body mass and flipper length, it may represent a distinct group.
- If clusters overlap in PCA space, the groups may have similar measurements, or the 2D projection may be hiding separation that exists in the full feature space.


## Next Improvements (Optional)
- Add imputation rather than dropping missing values
- Try alternative clustering methods (GMM, DBSCAN)
- Add cluster profiling including categorical proportions (e.g., sex distribution)
- Save plots to `/visualisations/` for the README
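
As a starting point for the alternative-methods item above, the sketch below fits Gaussian mixture models over a range of component counts and compares them by BIC (lower is better). The synthetic blobs, the range of `k`, and the use of BIC as the selection criterion are assumptions for illustration, not results from the penguin data.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for X_prepared (three well-separated blobs)
X_demo, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]], cluster_std=0.6, random_state=0
)

# Fit one mixture per candidate component count and record its BIC
bic = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=42, n_init=3).fit(X_demo)
    bic[k] = gmm.bic(X_demo)

best_k_bic = min(bic, key=bic.get)

# Refit with the selected count and assign each row to its most likely component
gmm_final = GaussianMixture(n_components=best_k_bic, random_state=42, n_init=3)
gmm_labels = gmm_final.fit_predict(X_demo)

print(best_k_bic)
```

Unlike K-Means, a Gaussian mixture can model elongated clusters and provides per-row membership probabilities through `predict_proba`.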
