<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/K-Nearest%20Neighbours%20(KNN)%20Classification/Case%20Study%3A%20K-Nearest%20Neighbours%20Classification%20with%20Wine%20Dataset%20(UCI).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Case Study: KNN Classification with Wine Dataset (UCI)

K‑Nearest Neighbors (KNN) is a simple yet powerful machine learning algorithm that classifies new examples based on similarity to known examples. In this case study, we’ll step through a practical example using the **Wine recognition** dataset (from the UCI Machine Learning Repository) to illustrate key concepts and best practices of KNN classification. This dataset contains chemical analysis results for wines from three cultivars (classes), with 13 continuous features (e.g. alcohol content, acidity, magnesium, phenols, color intensity, etc.). We simulate the scenario of predicting a wine’s cultivar from its chemical properties, akin to a chemist identifying origin by lab measurements.

## What we’ll cover
- **Data exploration and preparation:** Understanding feature scales and splitting data into training, validation, and test sets.  
- **Impact of feature scaling:** Demonstrating how scaling features affects KNN performance.  
- **Choosing the number of neighbors (K):** Tuning K to balance model complexity (bias vs. variance).  
- **Distance metric considerations:** How the choice of distance measure can affect KNN.  
- **Model evaluation:** Evaluating the final model on a test set to ensure it generalizes well to unseen data.


## Learning Objectives

By the end of this case study, you will be able to:

1. **Explain why feature scaling is critical for KNN** and demonstrate its impact on classification accuracy
2. **Apply proper data splitting strategies** (train/validation/test) to avoid data leakage and obtain unbiased performance estimates
3. **Tune hyperparameters systematically** by evaluating K values and distance metrics on validation data
4. **Interpret the bias-variance tradeoff** in the context of KNN's K parameter and identify signs of overfitting
5. **Evaluate classification models comprehensively** using accuracy, balanced accuracy, F1 scores, and confusion matrices
6. **Understand when KNN is appropriate** for real-world classification problems and recognize its limitations

These skills form the foundation for applying distance-based learning algorithms to practical classification tasks.

## Exploring the Dataset
Before diving into modeling, let's load the dataset and examine its features. The dataset has 178 samples, each with 13 features. The target `class` is an integer (0, 1, or 2) representing the wine cultivar.

**Typical feature ranges (intuition):**  
- Alcohol ~ 11–15  
- Malic acid ~ 0.7–6  
- Alcalinity of ash ~ 10–30  
- Magnesium ~ 70–160  
- Color intensity ~ 1–13  
- Proline ~ 280–1700  

Large differences in magnitude (e.g., *Proline* in the hundreds to thousands vs. *Malic acid* in single digits) motivate **scaling** before using distance-based models.


In [None]:
# Imports and data loading
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, f1_score,
    precision_recall_fscore_support, confusion_matrix, classification_report
)
from sklearn.metrics import pairwise_distances
from collections import Counter

# Load the wine dataset
data = load_wine()
X = data.data
y = data.target
feature_names = data.feature_names

# Create DataFrame for exploration
df = pd.DataFrame(X, columns=feature_names)
df['class'] = y
df.head()



Let's examine the class distribution and feature statistics before splitting our data.

**Data Splitting Strategy:**  
We use a three-way split (60/20/20) to create distinct training, validation, and test sets:
- **Training set (60%)**: Used to fit the model (learn patterns from the data)
- **Validation set (20%)**: Used to tune hyperparameters (select best K, compare distance metrics, etc.)
- **Test set (20%)**: Held out completely until final evaluation to assess real-world performance

The split is **stratified** to ensure each subset maintains the same class proportions as the original dataset. This is critical for classification problems to avoid biased performance estimates.

> **Question**: After tuning hyperparameters on the validation set, why do we evaluate the final model on a separate test set instead of reporting validation performance?
>  
> A) To validate that our chosen hyperparameters work well across different random seeds
>
> B) To get an unbiased estimate of how the model will perform on completely new data in production
>
> C) To ensure the model complexity matches the data complexity
>
> D) To verify that feature scaling was applied correctly across all splits

The test set provides an honest assessment of generalization performance because it played no role in model selection or hyperparameter tuning.

In [None]:
# Examine class distribution
print("Class distribution:\n", df['class'].value_counts().sort_index(), "\n")

# Descriptive statistics of features
display(df[feature_names].describe().T[['mean', 'std', 'min', 'max']])

# Stratified split into train, validation, and test sets (60/20/20)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)
print("Train size:", X_train.shape[0], "Validation size:", X_val.shape[0], "Test size:", X_test.shape[0])

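These figures match the scikit-learn `load_wine` description: 178 samples, 13 features, and 3 classes with 59, 71, and 48 samples respectively. With this dataset size, the 60/20/20 split yields 106 training, 36 validation, and 36 test samples.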
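
To make the effect of stratification concrete, the sketch below repeats the same split and prints the class proportions in each subset. The helper `class_proportions` is a hypothetical name introduced only for this illustration; everything else mirrors the cell above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Same stratified 60/20/20 split as in the cell above
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

def class_proportions(labels, n_classes=3):
    """Hypothetical helper: fraction of samples in each class."""
    return np.bincount(labels, minlength=n_classes) / len(labels)

for name, labels in [("full", y), ("train", y_train), ("val", y_val), ("test", y_test)]:
    props = np.round(class_proportions(labels), 3)
    print(f"{name:>5}: n={len(labels):3d}  proportions={props}")
```

If stratification works as intended, the proportions in every subset should sit close to those of the full dataset, differing only by rounding.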

## Effect of Feature Scaling on KNN

KNN uses distance to find nearest neighbors; if features are on very different scales, distance calculations will be dominated by the feature with the largest range. The tiny demo below illustrates how a difference in *Proline* (hundreds) can swamp a difference in *Malic acid* (tenths). Therefore, scaling features to comparable ranges is critical for KNN.

In [None]:
# Demonstrate distance dominance (hypothetical differences)
from math import sqrt

delta_proline_large = 100.0
delta_malic_small = 0.5

d1 = sqrt(delta_proline_large**2 + 0.0**2)
d2 = sqrt(0.0**2 + delta_malic_small**2)

print("Distance if only Proline differs by +100:", round(d1, 3))
print("Distance if only Malic differs by +0.5  :", round(d2, 3))
print("Ratio (Proline / Malic):", round(d1 / d2, 1))


Next, we train a baseline KNN model with `K=5` using **unscaled** features and **scaled** features to compare validation performance. Note that we scale using parameters learned from the training set only to avoid leakage.


In [None]:
# Baseline without scaling
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_raw.fit(X_train, y_train)
raw_val_acc = accuracy_score(y_val, knn_raw.predict(X_val))

# Baseline with scaling (fit on train only)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled   = scaler.transform(X_val)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
scaled_val_acc = accuracy_score(y_val, knn_scaled.predict(X_val_scaled))

print(f"Validation accuracy without scaling: {raw_val_acc:.3f}")
print(f"Validation accuracy with scaling:  {scaled_val_acc:.3f}")


The scaled model usually performs markedly better on this dataset because, after standardization, every feature contributes to the distance on a comparable scale instead of *Proline* dominating.
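
For intuition about what `StandardScaler` does, here is a minimal sketch that computes the z-score, $z = (x - \mu) / \sigma$, by hand from training statistics and checks it against scikit-learn. The single 60/40 split and the names `X_tr` and `X_ho` are assumptions made to keep the example short; they are not the notebook's three-way split.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
# Illustrative split: X_tr plays the role of training data, X_ho of held-out data
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

# Statistics come from the training split only, to avoid leakage
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

X_ho_manual = (X_ho - mu) / sigma
X_ho_sklearn = StandardScaler().fit(X_tr).transform(X_ho)

print("Manual z-scores match StandardScaler:", np.allclose(X_ho_manual, X_ho_sklearn))
print("Largest / smallest raw std:", round(sigma.max() / sigma.min(), 1))
```

The held-out data is transformed with the training mean and standard deviation, so its own mean and variance will generally not be exactly 0 and 1.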


**t‑SNE visualization of the wine dataset after feature scaling.**  
t‑SNE is fit on the **full dataset** purely for visualization. It preserves local structure but should **not** be used for tuning or evaluation. This does **not** leak information into the model.


In [None]:
from sklearn.manifold import TSNE

X_scaled_full = StandardScaler().fit_transform(X)
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled_full)

plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', edgecolor='k', alpha=0.7)
plt.xlabel("$X_1$ (t-SNE)")
plt.ylabel("$X_2$ (t-SNE)")
plt.axis([-20, 20, -20, 20])
plt.title("Wine dataset — t-SNE (Scaled Features)")
plt.grid(True)
plt.show()

## Distance Metric Considerations

The choice of distance metric is a hyperparameter that can significantly impact KNN performance, yet it's often overlooked. Common distance metrics for continuous features include:

- **Euclidean (L2)**: Measures straight-line distance; computed as √(Σ(xᵢ - yᵢ)²)
- **Manhattan (L1)**: Measures city-block distance; computed as Σ|xᵢ - yᵢ|  
- **Minkowski**: Generalizes both L1 and L2 with parameter p (p=1 gives Manhattan, p=2 gives Euclidean)
- **Cosine**: Measures angle between vectors; commonly used for text data and high-dimensional sparse features
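
The formulas above can be checked on a tiny example. The sketch below uses two made-up vectors and a hypothetical helper `minkowski`; it is meant only to show how the definitions relate, not how scikit-learn computes distances internally.

```python
import numpy as np

# Two illustrative (made-up) feature vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

def minkowski(u, v, p):
    """Hypothetical helper: Minkowski distance of order p."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

manhattan = minkowski(a, b, p=1)  # |3| + |4| + |0| = 7
euclidean = minkowski(a, b, p=2)  # sqrt(9 + 16 + 0) = 5
# Cosine distance = 1 - cosine similarity
cosine_dist = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Manhattan (p=1): {manhattan:.3f}")
print(f"Euclidean (p=2): {euclidean:.3f}")
print(f"Cosine distance: {cosine_dist:.3f}")
```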

Different metrics can produce different neighbor sets and different predictions, especially when features have varying scales or distributions. The Wine dataset contains chemical measurements that may occasionally include extreme values due to measurement errors, unusual growing conditions, or genuinely exceptional wines.

> **Question**: For datasets with continuous features that may contain occasional extreme outliers, which statement about distance metrics in KNN is most accurate?
>  
> A) Euclidean distance is generally more robust to outliers because squaring differences normalizes their relative impact
>
> B) Manhattan distance is generally more robust to outliers because it uses absolute differences rather than squared differences
>
> C) Both metrics are equally affected by outliers once features are properly scaled to the same range
>
> D) Cosine distance is always the best choice for robustness because it normalizes for vector magnitude

The choice of distance metric should be validated empirically on your specific dataset using held-out validation data.
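
One way to run such an empirical check is sketched below: the same scaled KNN pipeline is fit with three metrics and scored on the validation split only. The choice of K=5 (matching the earlier baseline) and the particular list of metrics are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

metric_scores = {}
for metric in ["euclidean", "manhattan", "cosine"]:
    # The scaler is refit inside each pipeline, on training data only
    pipe = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5, metric=metric),  # K=5 is an assumption
    )
    pipe.fit(X_train, y_train)
    metric_scores[metric] = pipe.score(X_val, y_val)  # the test set stays untouched

for metric, acc in metric_scores.items():
    print(f"{metric:>10}: validation accuracy = {acc:.3f}")
```

With only 36 validation samples, differences of one or two misclassified wines are within noise, so small gaps between metrics should be read cautiously.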

## Choosing K: Bias–Variance Trade‑off

The hyperparameter K fundamentally controls how KNN makes predictions and directly impacts model performance:

- **Small K (e.g., K=1)**: Each prediction is determined by very few neighbors, creating highly flexible decision boundaries that adapt closely to individual training points
- **Large K (e.g., K=50)**: Predictions average over many neighbors, producing smoother decision boundaries that change gradually across the feature space
- **Optimal K**: Typically found somewhere in between, balancing the ability to capture genuine patterns while avoiding sensitivity to noise or outliers
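
To see where K enters the prediction, here is a minimal from-scratch sketch of KNN with Euclidean distance and majority voting. The function `knn_predict`, the single 60/40 split, and the K values tried are all assumptions for illustration; this is not scikit-learn's implementation.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def knn_predict(X_train, y_train, X_query, k):
    """Hypothetical helper: majority vote among the k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k closest training points for each query row
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote; argmax over bincount breaks ties toward the smallest label
    return np.array([np.bincount(y_train[row]).argmax() for row in nearest])

X, y = load_wine(return_X_y=True)
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_ho_s = scaler.transform(X_tr), scaler.transform(X_ho)

for k in (1, 5, 25):
    acc = np.mean(knn_predict(X_tr_s, y_tr, X_ho_s, k) == y_ho)
    print(f"k={k:2d}  hold-out accuracy = {acc:.3f}")
```

With k=1 every training point is its own nearest neighbor, which is why training accuracy is perfect at K=1 regardless of how noisy the labels are.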

We'll systematically evaluate K values from 1 to 20, tracking both training and validation accuracy. The gap between these curves reveals how well each K value generalizes to unseen data. Large gaps suggest the model is memorizing training-specific patterns rather than learning generalizable relationships.

> **Question**: After training KNN with K=1, you observe 100% training accuracy but only 88% validation accuracy (a 12 percentage point gap). Which approach is most likely to improve validation performance?
>  
> A) Increase K to create smoother decision boundaries and improve generalization to new data
>
> B) Keep K=1 but collect more training samples to reduce the performance gap
>
> C) Keep K=1 but apply more sophisticated feature engineering to capture better patterns  
>
> D) Switch to weighted KNN with K=1 where closer neighbors have more influence on predictions

The train-validation gap is a key diagnostic for detecting when a model is too flexible for the available data.

In [None]:
train_acc, val_acc = [], []
k_sweep = range(1, 21)

for k in k_sweep:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train_scaled, y_train)
    train_acc.append(accuracy_score(y_train, model.predict(X_train_scaled)))
    val_acc.append(accuracy_score(y_val, model.predict(X_val_scaled)))

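# Best K by validation (np.argmax returns the first maximum, so ties go to the smallest K)
best_k_idx = int(np.argmax(val_acc))
chosen_k = k_sweep[best_k_idx]  # index into the sweep instead of assuming it starts at 1
best_val = val_acc[best_k_idx]
max_gap = np.max(np.array(train_acc) - np.array(val_acc))

# Use Euclidean distance (equivalent to scikit-learn's default, Minkowski with p=2)
chosen_metric = 'euclidean'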

print("Selected hyperparameters:")
print(f"  K = {chosen_k}")
print(f"  Distance metric = {chosen_metric}")
print(f"  Validation accuracy = {best_val:.3f}")
print(f"Max (train - validation) gap across K: {max_gap:.3f}")

# Plot train vs validation accuracy vs K
plt.figure()
plt.scatter(list(k_sweep), train_acc, label='Train Accuracy')
plt.scatter(list(k_sweep), val_acc, label='Validation Accuracy')
plt.axvline(chosen_k, linestyle='--', label=f'Best K={chosen_k}')
plt.axis([0, 21, 0.8, 1.05])  # leave room so the K=20 markers are not clipped at the edge
plt.xlabel('K')
plt.ylabel('Accuracy')
plt.grid()
plt.legend()
plt.tight_layout()
plt.show()

## Model Evaluation on Test Set

After selecting our hyperparameters using the validation set, we're ready for the final evaluation. At this stage we:

1. **Combine training and validation sets**: This provides the model with the maximum available data for learning, since we've already locked in our hyperparameter choices
2. **Refit the complete pipeline**: The scaler learns standardization parameters from the combined dataset, and KNN memorizes all combined training examples
3. **Evaluate once on the test set**: This held-out data provides our unbiased estimate of real-world performance

**Critical principle**: The test set has played zero role in any modeling decisions—no hyperparameter selection, no feature engineering choices, no model architecture decisions. It therefore provides an honest estimate of how the model will perform when deployed on genuinely new data from the same distribution.

> **Question**: After final evaluation, your test accuracy (94.4%) is slightly lower than your best validation accuracy (97.2%). Before deployment, which interpretation and next step is most appropriate?
>  
> A) This small decrease is normal variation; verify that test performance meets your accuracy requirements and document the results
>
> B) Re-evaluate hyperparameters using the test set to identify values that achieve better performance on this data split
>
> C) This indicates potential data leakage between validation and test sets; recreate the splits and re-run the experiment  
>
> D) Average the validation and test accuracies to obtain a more stable estimate of expected production performance

Remember: the test set is used exactly once for evaluation. Any optimization based on test results invalidates its role as an unbiased estimator.

In [None]:
# Combine training and validation sets for final training
X_train_all = np.vstack([X_train, X_val])
y_train_all = np.hstack([y_train, y_val])

# Build pipeline (scaler + KNN) with chosen hyperparameters; uniform (unweighted) neighbor voting
final_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=chosen_k, metric=chosen_metric))
])

final_pipe.fit(X_train_all, y_train_all)

# Predict on test set
test_pred = final_pipe.predict(X_test)
test_acc  = accuracy_score(y_test, test_pred)

print("Test accuracy:", round(test_acc, 3))


Beyond accuracy, we examine **balanced accuracy** (the mean of per-class recall, which is less sensitive to class imbalance) and **macro F1** (the unweighted mean of per-class F1 scores). We also print a **classification report**, show a **per-class table**, and plot both the raw and normalized confusion matrices.


In [None]:
# Print classification report and per-class metrics
print("\nClassification report (test):")
print(classification_report(y_test, test_pred, digits=3, target_names=[str(c) for c in np.unique(y)]))

labels = list(np.unique(y))
prec, rec, f1, sup = precision_recall_fscore_support(y_test, test_pred, labels=labels)

per_class_df = pd.DataFrame({
    'precision': prec,
    'recall': rec,
    'f1': f1,
    'support': sup
}, index=labels)
display(per_class_df)

# Balanced accuracy & macro F1
print("Balanced accuracy (test):", round(balanced_accuracy_score(y_test, test_pred), 3))
print("Macro F1 (test):         ", round(f1_score(y_test, test_pred, average='macro'), 3))

# Confusion matrices: raw and normalized
cm_raw = confusion_matrix(y_test, test_pred, labels=labels)
cm_norm = confusion_matrix(y_test, test_pred, labels=labels, normalize='true')

# Raw confusion matrix heatmap
fig, ax = plt.subplots()
im = ax.imshow(cm_raw, cmap='Blues')
ax.set_xticks(range(len(labels)))
ax.set_yticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
ax.set_xlabel("Predicted")
ax.set_ylabel("True")
ax.set_title("Confusion Matrix (Raw)")
for i in range(len(labels)):
    for j in range(len(labels)):
        ax.text(j, i, cm_raw[i, j], ha='center', va='center', color='black')
fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()

# Normalized confusion matrix heatmap
fig, ax = plt.subplots()
im = ax.imshow(cm_norm, vmin=0, vmax=1, cmap='Blues')
ax.set_xticks(range(len(labels)))
ax.set_yticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
ax.set_xlabel("Predicted")
ax.set_ylabel("True")
ax.set_title("Confusion Matrix (Normalized)")
for i in range(len(labels)):
    for j in range(len(labels)):
        ax.text(j, i, f"{cm_norm[i, j]:.2f}", ha='center', va='center', color='black')
fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
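
To make the metrics reported above less opaque, the sketch below derives balanced accuracy and macro F1 by hand from a confusion matrix. The labels are made up for illustration (class 2 is deliberately under-represented) and are not results from the wine model.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, f1_score

# Made-up labels: class 2 has only two samples
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 0, 2, 1])

cm = confusion_matrix(y_true, y_pred)
tp = np.diag(cm)
recall = tp / cm.sum(axis=1)     # row sums = true counts per class
precision = tp / cm.sum(axis=0)  # column sums = predicted counts per class
f1 = 2 * precision * recall / (precision + recall)

print("Plain accuracy   :", round(tp.sum() / cm.sum(), 3))
print("Balanced accuracy:", round(recall.mean(), 3))
print("Macro F1         :", round(f1.mean(), 3))
print("Matches sklearn  :",
      np.isclose(recall.mean(), balanced_accuracy_score(y_true, y_pred)),
      np.isclose(f1.mean(), f1_score(y_true, y_pred, average="macro")))
```

Here plain accuracy is 0.8 while balanced accuracy is 0.75, because the error on the small class counts as much as errors on the larger ones.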


## Limitations (Current Scope) & What’s Next
This notebook uses a **single hold-out validation** set, which is simple but sensitive to how the data happen to be split. In practice, data scientists often use **k-fold cross-validation** or nested validation to obtain more reliable estimates and avoid overfitting hyperparameters to a single split. We also relied on scikit-learn's default neighbor search (`algorithm='auto'`) and didn't compare brute-force search with KD-trees or Ball Trees, or explore approximate nearest neighbor methods (e.g. FAISS, HNSW). These become important when the reference dataset grows to millions of rows or predictions must be low-latency. Finally, we didn't address class imbalance or cost-sensitive KNN; these are natural extensions for more advanced courses.
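
As a rough sketch of the cross-validation alternative mentioned above, the code below tunes K and the distance metric with 5-fold stratified cross-validation on an 80% development set, keeping 20% aside as a test set. The 80/20 split, the number of folds, and the parameter grid are assumptions for illustration, not the procedure used in this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
# Hold out a test set; cross-validate on the remaining development data
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {
    "knn__n_neighbors": list(range(1, 21)),
    "knn__metric": ["euclidean", "manhattan"],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_dev, y_dev)

print("Best params:", search.best_params_)
print("Mean CV accuracy:", round(search.best_score_, 3))
```

Because every development sample is used for validation exactly once, the selected hyperparameters depend less on one lucky or unlucky split.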

## Common Pitfalls and Best Practices

### Critical Mistakes to Avoid:

1. **Forgetting to scale features** ❌  
   KNN is distance-based; features with larger numeric ranges will dominate distance calculations. Always use StandardScaler or similar normalization.

2. **Using the test set for hyperparameter tuning** ❌  
   This creates data leakage and inflates performance estimates. Use a separate validation set or cross-validation for all tuning decisions.

3. **Choosing K=1 for production systems** ❌  
   While K=1 gives perfect training accuracy (each training point is its own nearest neighbor), it's highly sensitive to noise and outliers. Compare it against larger K values on held-out data before deploying.

4. **Ignoring computational cost** ❌  
   KNN requires computing distances to all training points at prediction time. For large datasets (>100K samples), consider approximate nearest neighbor methods or alternative algorithms.

5. **Treating class imbalance casually** ❌  
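   If one class has 90% of samples, KNN's majority vote will naturally favor that class. `KNeighborsClassifier` has no `class_weight` option, so consider resampling the training data, distance-weighted voting (`weights='distance'`), and imbalance-aware evaluation metrics (balanced accuracy, F1).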
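
Regarding computational cost (pitfall 4), scikit-learn exposes the exact search strategy through the `algorithm` parameter. The sketch below checks that brute-force, KD-tree, and Ball Tree searches return the same neighbor distances on the scaled wine data; on a dataset this small the speed difference is negligible, so the point is only that the tree structures change speed, not results. Scaling the full dataset is acceptable here because no model is evaluated.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # no evaluation here, so no leakage concern

neighbor_dists = {}
for algorithm in ["brute", "kd_tree", "ball_tree"]:
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X_scaled)
    dist, idx = nn.kneighbors(X_scaled[:10])  # 5 nearest neighbors of the first 10 wines
    neighbor_dists[algorithm] = dist

# Small tolerance: brute force may compute distances via a different formula
same_kd = np.allclose(neighbor_dists["brute"], neighbor_dists["kd_tree"], atol=1e-6)
same_ball = np.allclose(neighbor_dists["brute"], neighbor_dists["ball_tree"], atol=1e-6)
print("KD-tree matches brute force:  ", same_kd)
print("Ball Tree matches brute force:", same_ball)
```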

### Best Practices:

✅ **Always scale features** before applying KNN to continuous data  
✅ **Use stratified splits** to maintain class proportions across train/val/test sets  
✅ **Validate hyperparameters** (K, distance metric) on separate validation data  
✅ **Consider dimensionality**: KNN performance degrades in very high dimensions (curse of dimensionality); consider dimensionality reduction (PCA, feature selection) for >50 features  
✅ **Monitor the train-validation gap** to detect overfitting early  
✅ **Use domain knowledge**: For some applications (text, images), specialized distance metrics (cosine, Hamming) may work better than Euclidean
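
For the dimensionality point above, a PCA step can be slotted into the pipeline between scaling and KNN. With only 13 features the Wine dataset does not really need this, so the sketch below is purely illustrative; the 95% variance threshold, K=5, and the single 60/40 split are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

pca_knn = Pipeline([
    ("scaler", StandardScaler()),
    # Keep just enough components to explain 95% of the training variance
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pca_knn.fit(X_tr, y_tr)

n_kept = pca_knn.named_steps["pca"].n_components_
holdout_acc = pca_knn.score(X_ho, y_ho)
print(f"Components kept: {n_kept} of {X.shape[1]}")
print(f"Hold-out accuracy: {holdout_acc:.3f}")
```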

## Conclusion
- **Scaling** prevents large‑range features from dominating distance computations.  
- **Tuning K** via validation balances bias and variance; a very small K overfits, a very large K underfits.  
- **Distance metric and K** are both hyperparameters; here we tuned K and kept Euclidean distance, but even a small grid over both can reveal meaningful differences.  
- KNN remains a powerful, intuitive baseline—use it to build intuition about distance and similarity before advancing to more sophisticated models.


# Appendix

## Real‑World Applications: Where This Is Useful (Concrete Ops)
- **Authenticity & origin (PDO/PGI):** Check that a lot labeled “Cultivar A / Region X” matches historical chemical profiles; flag likely mislabels or adulteration.  
- **Supplier & intake QA:** Compare incoming lots against past lots of the same cultivar to catch off‑spec deliveries early, saving tank space and time.  
- **Process monitoring & early warnings:** Periodic lab panels classified against expected states; abnormal neighbors trigger investigation of contamination or process drift.
- **Counterfeit screening:** Rapidly triage shipments before expensive sensory panels or full mass‑spec profiling.  

## Why KNN specifically
- **Small/medium data with strong locality:** Wine labs typically have hundreds or thousands of historical lots—KNN thrives here without heavy parametric assumptions.  
- **Example‑based explanation:** You can justify a prediction by saying, “8/10 nearest wines were Cultivar 2 with similar magnesium, phenolics, and color intensity.”  
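
The example-based explanation described above can be produced with `KNeighborsClassifier.kneighbors`, which returns the indices and distances of the nearest training points. The sketch below explains one held-out wine; the 80/20 split and K=10 are assumptions chosen to mirror the "8/10 nearest wines" phrasing.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_tr, X_ho, y_tr, y_ho = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=10).fit(scaler.transform(X_tr), y_tr)

# Explain the prediction for the first held-out wine
query = scaler.transform(X_ho[:1])
dist, idx = knn.kneighbors(query)        # distances and training indices of the 10 neighbors
neighbor_labels = y_tr[idx[0]]
votes = np.bincount(neighbor_labels, minlength=3)
predicted = knn.predict(query)[0]

print(f"Predicted cultivar: {data.target_names[predicted]}")
print(f"Votes per class among 10 neighbors: {votes}")
print(f"Distance to closest / farthest neighbor: {dist[0, 0]:.2f} / {dist[0, -1]:.2f}")
```

The neighbor indices in `idx` can also be used to look up the original lab measurements of those wines, which is what makes the explanation inspectable by a domain expert.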
