<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/K-Nearest%20Neighbhours%20Classification/KNN%20Classification%20with%20Wine%20Dataset%20Case%20Study.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Case Study: KNN Classification with Wine Dataset (UCI)

K‑Nearest Neighbors (KNN) is a simple yet powerful machine learning algorithm that classifies new examples based on similarity to known examples. In this case study, we’ll step through a practical example using the **Wine recognition** dataset (from the UCI Machine Learning Repository) to illustrate key concepts and best practices of KNN classification. This dataset contains chemical analysis results for wines from three cultivars (classes), with 13 continuous features (e.g. alcohol content, acidity, magnesium, phenols, color intensity, etc.). We simulate the scenario of predicting a wine’s cultivar from its chemical properties, akin to a chemist identifying origin by lab measurements.

## What we’ll cover
- **Data exploration and preparation:** Understanding feature scales and splitting data into training, validation, and test sets.  
- **Impact of feature scaling:** Demonstrating how scaling features affects KNN performance.  
- **Choosing the number of neighbors (K):** Tuning K to balance model complexity (bias vs. variance).  
- **Distance metric considerations:** How the choice of distance measure can affect KNN.  
- **Model evaluation:** Evaluating the final model on a test set to ensure it generalizes well to unseen data.


## Learning Objectives

By the end of this case study, you will be able to:

1. **Explain why feature scaling is critical for KNN** and demonstrate its impact on classification accuracy
2. **Apply proper data splitting strategies** (train/validation/test) to avoid data leakage and obtain unbiased performance estimates
3. **Tune hyperparameters systematically** by evaluating K values and distance metrics on validation data
4. **Interpret the bias-variance tradeoff** in the context of KNN's K parameter and identify signs of overfitting
5. **Evaluate classification models comprehensively** using accuracy, balanced accuracy, F1 scores, and confusion matrices
6. **Understand when KNN is appropriate** for real-world classification problems and recognize its limitations

These skills form the foundation for applying distance-based learning algorithms to practical classification tasks.

## Exploring the Dataset
Before diving into modeling, let's load the dataset and examine its features. The dataset has 178 samples, each with 13 features. The target `class` is an integer (0, 1, or 2) representing the wine cultivar.

**Typical feature ranges (intuition):**  
- Alcohol ~ 11–15  
- Malic acid ~ 0.7–6  
- Alcalinity of ash ~ 10–30  
- Magnesium ~ 70–160  
- Color intensity ~ 1–13  
- Proline ~ 280–1700  

Large differences in magnitude (e.g., *Proline* in the hundreds to thousands vs. *Malic acid* in single digits) motivate **scaling** before using distance-based models.


In [None]:
# Imports and data loading
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, f1_score,
    precision_recall_fscore_support, confusion_matrix, classification_report
)
from sklearn.metrics import pairwise_distances
from collections import Counter

# Load the wine dataset
data = load_wine()
X = data.data
y = data.target
feature_names = data.feature_names

# Create DataFrame for exploration
df = pd.DataFrame(X, columns=feature_names)
df['class'] = y
df.head()



Let's examine the class distribution and feature statistics before splitting our data.

**Data Splitting Strategy:**  
We use a three-way split (60/20/20) to create distinct training, validation, and test sets:
- **Training set (60%)**: Used to fit the model (learn patterns from the data)
- **Validation set (20%)**: Used to tune hyperparameters (select best K, compare distance metrics, etc.)
- **Test set (20%)**: Held out completely until final evaluation to assess real-world performance

The split is **stratified** to ensure each subset maintains the same class proportions as the original dataset. This is critical for classification problems to avoid biased performance estimates.
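As a quick sanity check on the stratification claim, the sketch below repeats the same two-step split and compares class proportions in each subset against the full dataset. The helper `class_proportions` and the `_demo` variable names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_wine(return_X_y=True)

# Same two-step stratified split as in this case study (roughly 60/20/20)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

def class_proportions(labels, n_classes=3):
    """Fraction of samples belonging to each class (illustrative helper)."""
    return np.bincount(labels, minlength=n_classes) / len(labels)

full_props = class_proportions(y_demo)
split_props = {
    "train": class_proportions(y_tr),
    "validation": class_proportions(y_va),
    "test": class_proportions(y_te),
}

print(f"{'full':<10}", np.round(full_props, 3))
for name, props in split_props.items():
    print(f"{name:<10}", np.round(props, 3))
```

With stratification, each row should differ from the full-dataset proportions only by rounding, since class counts must be whole numbers.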

> **Question**: After tuning hyperparameters on the validation set, why do we evaluate the final model on a separate test set instead of reporting validation performance?
>  
> A) To validate that our chosen hyperparameters work well across different random seeds
>
> B) To get an unbiased estimate of how the model will perform on completely new data in production
>
> C) To ensure the model complexity matches the data complexity
>
> D) To verify that feature scaling was applied correctly across all splits

The test set provides an honest assessment of generalization performance because it played no role in model selection or hyperparameter tuning.

<details>

<summary>Click to reveal answer</summary>

**Correct Answer: B**

**Explanation:**

**B is TRUE**: This is the fundamental purpose of a test set in machine learning.
- During hyperparameter tuning, we evaluate many different K values on the validation set
- We select the K that performs best on validation data (e.g., highest accuracy)
- This selection process means our final model was **chosen specifically because it performed well on the validation set**
- Validation performance is therefore **optimistically biased** - it represents the best case among all values we tried
- The test set, which played no role in any decisions, provides an **unbiased estimate** of real-world performance

**A is FALSE**: While cross-validation with different random seeds is good practice, that's not why we have a separate test set
- The test set exists specifically to avoid the optimistic bias from hyperparameter selection
- Cross-validation on different seeds would still be done on training/validation data, not the test set

**C is FALSE**: The test set doesn't verify model complexity matching data complexity
- Model complexity in KNN is controlled by K (and we already selected that using validation data)
- The test set simply estimates generalization performance

**D is FALSE**: Checking that scaling was applied correctly is a pipeline-correctness concern, not the reason for holding out a test set
- We apply the same scaling pipeline to all splits (train, validation, test)
- The test set exists to measure generalization, not verify preprocessing

**Key Insight**: The test set is your "truth detector" - it reveals whether your model selection process (which used validation data) produced a genuinely good model or just one that happened to work well on that particular validation split.

**Real-World Analogy**: If you're studying for an exam by taking practice tests, your practice test scores will be optimistically biased because you're using them to guide your study strategy. The actual exam (test set) provides an unbiased measure of your true knowledge.

</details>

In [None]:
# Examine class distribution
print("Class distribution:\n", df['class'].value_counts().sort_index(), "\n")

# Descriptive statistics of features
display(df[feature_names].describe().T[['mean', 'std', 'min', 'max']])

# Stratified split into train, validation, and test sets (60/20/20)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)
print("Train size:", X_train.shape[0], "Validation size:", X_val.shape[0], "Test size:", X_test.shape[0])

## Effect of Feature Scaling on KNN

KNN uses distance to find nearest neighbors. If features are on vastly different scales, distance calculations will be dominated by the feature with the largest range. For example, in the Wine dataset, Proline ranges from roughly 280 to 1,700 while Malic acid ranges from roughly 0.7 to 6. Without scaling, differences in Proline overwhelm differences in Malic acid, so KNN effectively ignores the smaller-scale features.

Let's demonstrate this by training a baseline KNN model with K=5 using both **unscaled** and **scaled** features, then comparing their validation performance.

We fit the scaler using only the training set to avoid data leakage, then transform both train and validation sets.
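To make the dominance effect concrete, here is a small sketch that measures what share of the total squared Euclidean distance (summed over all pairs of wines) comes from each feature, before and after standardization. The helper `squared_diff_share` is hypothetical, and the scaler is fit on the full dataset purely for illustration; for modeling, fit it on the training set only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine_demo = load_wine()
X_demo = wine_demo.data
names_demo = list(wine_demo.feature_names)

def squared_diff_share(X):
    """Share of the total squared Euclidean distance (over all pairs)
    contributed by each feature. Illustrative helper."""
    diffs = X[:, None, :] - X[None, :, :]          # shape (n, n, d)
    per_feature = (diffs ** 2).sum(axis=(0, 1))    # one total per feature
    return per_feature / per_feature.sum()

raw_share = squared_diff_share(X_demo)
# Illustration only: fitting on all rows is fine here because no model is evaluated
scaled_share = squared_diff_share(StandardScaler().fit_transform(X_demo))

proline_idx = names_demo.index("proline")
print(f"Proline share of squared distance, raw:    {raw_share[proline_idx]:.4f}")
print(f"Proline share of squared distance, scaled: {scaled_share[proline_idx]:.4f}")
```

On raw features, proline accounts for nearly all of the squared distance. After standardization every feature has unit variance, so each of the 13 features contributes an equal share.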

In [None]:
# Baseline without scaling
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_raw.fit(X_train, y_train)
raw_val_acc = accuracy_score(y_val, knn_raw.predict(X_val))

# Baseline with scaling (fit on train only)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled   = scaler.transform(X_val)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
scaled_val_acc = accuracy_score(y_val, knn_scaled.predict(X_val_scaled))

print(f"Validation accuracy without scaling: {raw_val_acc:.3f}")
print(f"Validation accuracy with scaling:  {scaled_val_acc:.3f}")


## Distance Metric Considerations

The choice of distance metric is a hyperparameter that can significantly impact KNN performance, yet it's often overlooked. Common distance metrics for continuous features include:

- **Euclidean (L2)**: Measures straight-line distance; computed as √(Σ(xᵢ - yᵢ)²)
- **Manhattan (L1)**: Measures city-block distance; computed as Σ|xᵢ - yᵢ|  
- **Minkowski**: Generalizes both L1 and L2 with parameter p (p=1 gives Manhattan, p=2 gives Euclidean)
- **Cosine**: Measures angle between vectors; commonly used for text data and high-dimensional sparse features

Different metrics can produce different neighbor sets and different predictions, especially when features have varying scales or distributions. The Wine dataset contains chemical measurements that may occasionally include extreme values due to measurement errors, unusual growing conditions, or genuinely exceptional wines.
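The formulas in the list above can be written out directly in NumPy. This is a minimal sketch on two made-up vectors `u` and `v`; the function names are illustrative.

```python
import numpy as np

def minkowski(u, v, p):
    """Minkowski distance of order p: (sum |u_i - v_i|^p)^(1/p)."""
    return float((np.abs(u - v) ** p).sum() ** (1.0 / p))

def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v."""
    return float(1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])   # differences: [3, 4, 0]

euclidean = float(np.sqrt(((u - v) ** 2).sum()))   # sqrt(9 + 16 + 0)
manhattan = float(np.abs(u - v).sum())              # 3 + 4 + 0

print("Euclidean:      ", euclidean)
print("Manhattan:      ", manhattan)
print("Minkowski p=2:  ", minkowski(u, v, 2))
print("Minkowski p=1:  ", minkowski(u, v, 1))
print("Cosine distance:", round(cosine_distance(u, v), 4))
# Cosine distance ignores magnitude: a vector and a rescaled copy are at distance ~0
print("Cosine(u, 2u):  ", round(cosine_distance(u, 2 * u), 12))
```

The Minkowski values for p=2 and p=1 coincide with the Euclidean and Manhattan results, which is the sense in which Minkowski generalizes both.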

> **Question**: For datasets with continuous features that may contain occasional extreme outliers, which statement about distance metrics in KNN is most accurate?
>  
> A) Euclidean distance is generally more robust to outliers because squaring differences normalizes their relative impact
>
> B) Manhattan distance is generally more robust to outliers because it uses absolute differences rather than squared differences
>
> C) Both metrics are equally affected by outliers once features are properly scaled to the same range
>
> D) Cosine distance is always the best choice for robustness because it normalizes for vector magnitude

The choice of distance metric should be validated empirically on your specific dataset using held-out validation data.

<details>

<summary>Click to reveal answer</summary>

**Correct Answer: B**

**Explanation:**

**B is TRUE**: Manhattan distance is more robust to outliers than Euclidean distance.
- **Euclidean distance** squares the differences: distance = √(Σ(xᵢ - yᵢ)²)
  - Squaring amplifies the impact of outliers dramatically
  - Example: If one feature has a difference of 10, squaring makes it 100
  - A single outlier can dominate the entire distance calculation
- **Manhattan distance** uses absolute values: distance = Σ|xᵢ - yᵢ|
  - Differences grow linearly, not quadratically
  - Example: A difference of 10 contributes 10, not 100
  - Outliers have proportionally less influence

**Concrete Example:**
```
Point A: [1, 1, 1]
Point B: [2, 2, 2]  
Point C: [1, 1, 100]  # Outlier in 3rd feature

Euclidean distance from A to B: √((1)² + (1)² + (1)²) = √3 ≈ 1.73
Euclidean distance from A to C: √((0)² + (0)² + (99)²) = 99.0

Manhattan distance from A to B: |1| + |1| + |1| = 3
Manhattan distance from A to C: |0| + |0| + |99| = 99

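Relative to the A-B distance, the outlier inflates Euclidean distance by 99/1.73 ≈ 57x
but Manhattan distance by only 99/3 = 33x, so it dominates less extremely with Manhattan.
Inside the Euclidean sum, the outlier term contributes 99² = 9,801 versus 1 for an ordinary difference.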
```

**A is FALSE**: Squaring does NOT normalize - it **amplifies** outliers
- Squaring makes large differences disproportionately larger (the growth is quadratic, not linear)
- This is the opposite of robustness

**C is FALSE**: Scaling to the same range doesn't eliminate the difference in outlier sensitivity
- Even with scaled features (e.g., all in [0, 1]), Euclidean still squares differences
- Manhattan's linear behavior remains more robust

**D is FALSE**: Cosine distance is NOT always best for robustness
- Cosine distance compares the angle between vectors and ignores their magnitude
- It's useful for high-dimensional sparse data (like text), not necessarily for outlier robustness
- For continuous chemical measurements, Manhattan or Euclidean are more appropriate

**Key Insight**: The mathematical operation matters. Squaring (L2) amplifies errors; absolute values (L1) treat them linearly. This is why robust regression often uses L1 loss instead of L2 (mean squared error).

</details>
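One way to validate the metric empirically is to fit the same scaled KNN pipeline with several metrics and compare validation accuracy. The sketch below is an assumption about how such a comparison could look, not a step the notebook performs: the notebook fixes Euclidean distance, and the candidate list and K=5 here are arbitrary illustrative choices.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)

# Recreate the train/validation split used in this case study
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)
X_va, _, y_va, _ = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

metric_val_acc = {}
for metric in ("euclidean", "manhattan", "chebyshev"):
    pipe = Pipeline([
        ("scaler", StandardScaler()),  # fit on training data only, inside the pipeline
        ("knn", KNeighborsClassifier(n_neighbors=5, metric=metric)),
    ])
    pipe.fit(X_tr, y_tr)
    metric_val_acc[metric] = pipe.score(X_va, y_va)

for metric, acc in metric_val_acc.items():
    print(f"{metric:<10} validation accuracy: {acc:.3f}")
```

With only about 36 validation wines, small differences between metrics may reflect a single sample, so treat any ranking from a comparison like this as tentative.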

## Choosing K: Bias–Variance Trade‑off

The hyperparameter K fundamentally controls how KNN makes predictions and directly impacts model performance:

- **Small K (e.g., K=1)**: Each prediction is determined by very few neighbors, creating highly flexible decision boundaries that adapt closely to individual training points
- **Large K (e.g., K=50)**: Predictions average over many neighbors, producing smoother decision boundaries that change gradually across the feature space
- **Optimal K**: Typically found somewhere in between, balancing the ability to capture genuine patterns while avoiding sensitivity to noise or outliers

We'll systematically evaluate K values from 1 to 20, tracking both training and validation accuracy. The gap between these curves reveals how well each K value generalizes to unseen data. Large gaps suggest the model is memorizing training-specific patterns rather than learning generalizable relationships.

> **Question**: After training KNN with K=1, you observe 100% training accuracy but only 88% validation accuracy (a 12 percentage point gap). Which approach is most likely to improve validation performance?
>  
> A) Increase K to create smoother decision boundaries and improve generalization to new data
>
> B) Keep K=1 but collect more training samples to reduce the performance gap
>
> C) Keep K=1 but apply more sophisticated feature engineering to capture better patterns  
>
> D) Switch to weighted KNN with K=1 where closer neighbors have more influence on predictions

The train-validation gap is a key diagnostic for detecting when a model is too flexible for the available data.

<details>

<summary>Click to reveal answer</summary>

**Correct Answer: A**

**Explanation:**

**A is TRUE**: Increasing K is the direct solution to overfitting in KNN.
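- **The problem**: K=1 has 100% training accuracy but only 88% validation accuracy
  - With K=1, every training point is its own nearest neighbor, so 100% training accuracy is guaranteed (barring duplicate points with conflicting labels) and says nothing about generalization
  - The 12-point gap shows that the model is memorizing training data, noise included, rather than learning general patterns
  - Each training point becomes its own "class island," which is perfect for training but fragile on new data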
- **The solution**: Increase K (e.g., try K=5, K=10, K=15)
  - Larger K averages over more neighbors, smoothing decision boundaries
  - Reduces sensitivity to individual noisy training points
  - Trades some training accuracy for better validation accuracy (which is what we want!)

**Expected outcome**: As K increases, you'll see:
- Training accuracy decrease (e.g., from 100% to 95%)
- Validation accuracy increase (e.g., from 88% to 95%)
- The gap shrinks, indicating better generalization

**B is FALSE**: More training samples help, but don't fix the fundamental problem
- K=1 will still memorize the training data, even with more samples
- You'd just memorize more examples without generalizing better
- This is treating the symptom, not the cause

**C is FALSE**: Feature engineering doesn't address overfitting from K=1
- The problem isn't lack of signal; it's the model's excessive flexibility
- K=1 can already perfectly fit the training data - more features won't help validation performance
- In fact, adding more features with K=1 could make overfitting worse (curse of dimensionality)

**D is FALSE**: Weighted KNN with K=1 doesn't solve overfitting
- Weighted KNN gives closer neighbors more influence
- With K=1, there's only one neighbor, so weighting is meaningless
- This doesn't address the core issue of decision boundaries that are too complex

**Key Insight**: The train-validation gap is your overfitting detector. When you see a large gap with small K, the solution is to increase K to regularize the model (make it less flexible). This is the bias-variance tradeoff in action.

**Analogy**: If you memorize answers to practice problems (K=1) instead of understanding concepts (larger K), you'll ace the practice test but fail the real exam.

</details>
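The point about option D can be checked directly: with a single neighbor there is nothing to weight, so uniform and distance-weighted 1-NN should predict identically. The following sketch recreates the notebook's split under illustrative variable names.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)
X_va, _, y_va, _ = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

# Scale using training statistics only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s, X_va_s = scaler_demo.transform(X_tr), scaler_demo.transform(X_va)

uniform_1nn = KNeighborsClassifier(n_neighbors=1, weights="uniform").fit(X_tr_s, y_tr)
distance_1nn = KNeighborsClassifier(n_neighbors=1, weights="distance").fit(X_tr_s, y_tr)

# A single neighbor casts the only vote, whatever its weight
same_predictions = np.array_equal(
    uniform_1nn.predict(X_va_s), distance_1nn.predict(X_va_s)
)
print("Identical validation predictions:", same_predictions)
print("1-NN training accuracy:", uniform_1nn.score(X_tr_s, y_tr))
```

Distance weighting only starts to matter once K is greater than 1, where it can be tuned on validation data like any other hyperparameter.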

In [None]:
train_acc, val_acc = [], []
k_sweep = range(1, 21)

for k in k_sweep:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train_scaled, y_train)
    train_acc.append(accuracy_score(y_train, model.predict(X_train_scaled)))
    val_acc.append(accuracy_score(y_val, model.predict(X_val_scaled)))

# Best K by validation (ties resolve to the smallest K because argmax returns the first maximum)
best_k_idx = int(np.argmax(val_acc))
chosen_k = k_sweep[best_k_idx]
best_val = val_acc[best_k_idx]
max_gap = np.max(np.array(train_acc) - np.array(val_acc))

# Use Euclidean distance (default and most commonly used for continuous features)
chosen_metric = 'euclidean'

print("Selected hyperparameters:")
print(f"  K = {chosen_k}")
print(f"  Distance metric = {chosen_metric}")
print(f"  Validation accuracy = {best_val:.3f}")
print(f"Max (train - validation) gap across K: {max_gap:.3f}")

# Plot train vs validation accuracy vs K
plt.figure()
plt.scatter(list(k_sweep), train_acc, label='Train Accuracy')
plt.scatter(list(k_sweep), val_acc, label='Validation Accuracy')
plt.axvline(chosen_k, linestyle='--', label=f'Best K={chosen_k}')
plt.axis([0, 21, 0.8, 1.05])  # x-limit of 21 keeps the K=20 points off the plot edge
plt.xlabel('K')
plt.ylabel('Accuracy')
plt.grid()
plt.legend()
plt.tight_layout()
plt.show()

## Model Evaluation on Test Set

After selecting our hyperparameters using the validation set, we're ready for the final evaluation. At this stage we:

1. **Combine training and validation sets**: This provides the model with the maximum available data for learning, since we've already locked in our hyperparameter choices
2. **Refit the complete pipeline**: The scaler learns standardization parameters from the combined dataset, and KNN memorizes all combined training examples
3. **Evaluate once on the test set**: This held-out data provides our unbiased estimate of real-world performance

**Critical principle**: The test set has played zero role in any modeling decisions—no hyperparameter selection, no feature engineering choices, no model architecture decisions. It therefore provides an honest estimate of how the model will perform when deployed on genuinely new data from the same distribution.

> **Question**: After final evaluation, your test accuracy (94.4%) is slightly lower than your best validation accuracy (97.2%). Before deployment, which interpretation and next step is most appropriate?
>  
> A) This small decrease is normal variation; verify that test performance meets your accuracy requirements and document the results
>
> B) Re-evaluate hyperparameters using the test set to identify values that achieve better performance on this data split
>
> C) This indicates potential data leakage between validation and test sets; recreate the splits and re-run the experiment  
>
> D) Average the validation and test accuracies to obtain a more stable estimate of expected production performance

Remember: the test set is used exactly once for evaluation. Any optimization based on test results invalidates its role as an unbiased estimator.

<details>

<summary>Click to reveal answer</summary>

**Correct Answer: A**

**Explanation:**

**A is TRUE**: This small decrease is expected and normal; focus on whether it meets requirements.
- **Why the decrease is normal**:
  - Validation accuracy (97.2%) was from a specific 20% subset during hyperparameter tuning
  - Test accuracy (94.4%) is from a different 20% subset
  - Random variation in data splits naturally causes differences of a few percentage points; with roughly 36 wines in each 20% split, 2.8 points corresponds to a single misclassified wine
  - The model was selected because it performed well on validation data specifically
- **What to do next**:
  - Check if 94.4% meets your business/application requirements
  - Document both validation and test results
  - Consider the test accuracy (94.4%) as your expected production performance
  - Deploy if requirements are met

**B is FALSE**: Re-evaluating hyperparameters using the test set defeats its purpose
- **This invalidates the test set**: Once you optimize on test data, it's no longer "held-out"
- **Creates data leakage**: You've now leaked test information into model selection
- **Leads to optimistic bias**: Future performance will likely be worse than test results
- **Breaks the ML workflow**: Test set must be used exactly once, for final evaluation only

**C is FALSE**: This performance difference does NOT indicate data leakage
- **Expected pattern**: Test performance slightly different from validation is normal
- **Data leakage would show**: Suspiciously high performance on both validation and test (e.g., both >99%)
- **2.8-point difference** is well within normal variation range
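- **Actual leakage signs**: Scores that are unrealistically high on both sets, or a test score far above validation; a test score slightly above validation can occur by chance and is not evidence of leakage on its own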

**D is FALSE**: Averaging validation and test accuracies is statistically invalid
- **Test set is the true estimate**: It's your unbiased measure of production performance
- **Validation was used for selection**: It's optimistically biased
- **Mixing them** combines biased and unbiased estimates incorrectly
- **Report test accuracy**: 94.4% is your expected production performance

**Key Insight**: The test set is your reality check. Small differences from validation are normal. Large unexpected gains or losses require investigation. Never re-optimize using test set results.

**Real-World Practice**:
```
Acceptable patterns:
- Validation: 97.2%, Test: 94.4% ✓ (small expected decrease)
- Validation: 95.0%, Test: 94.5% ✓ (very close)

Concerning patterns:  
- Validation: 80%, Test: 95% ⚠ (suggests data leakage or distribution shift)
- Validation: 95%, Test: 70% ⚠ (suggests severe overfitting to validation set)
```

</details>
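To put "normal variation" into numbers, a confidence interval on the test accuracy is a useful companion to the point estimate. The sketch below computes a 95% Wilson score interval, assuming the 94.4% in the question corresponds to 34 correct out of 36 test wines (an assumption based on the 20% split of 178 samples). The function name is illustrative.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = correct / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width

# Assumed counts: 34 of 36 test wines classified correctly (94.4%)
low, high = wilson_interval(correct=34, total=36)
print(f"Test accuracy: {34 / 36:.3f}")
print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

The interval spans roughly 82% to 98%, so the 97.2% validation score sits comfortably inside it. On a test set this small, a 2.8-point gap carries very little information.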

In [None]:
# Combine training and validation sets for final training
X_train_all = np.vstack([X_train, X_val])
y_train_all = np.hstack([y_train, y_val])

# Build pipeline (scaler + KNN) with chosen hyperparameters and default uniform (unweighted) voting
final_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=chosen_k, metric=chosen_metric))
])

final_pipe.fit(X_train_all, y_train_all)

# Predict on test set
test_pred = final_pipe.predict(X_test)
test_acc  = accuracy_score(y_test, test_pred)

print("Test accuracy:", round(test_acc, 3))


Beyond accuracy, we examine **balanced accuracy** (the mean of per-class recall, so each class counts equally regardless of its size) and **macro F1** (the unweighted mean of per-class F1 scores), print a **classification report**, show a **per-class table**, and plot both the raw and normalized confusion matrices.
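To see what these two summary metrics compute, here is a small sketch that derives them by hand from a confusion matrix and compares the results with scikit-learn. The toy labels are made up for illustration and have nothing to do with the Wine results.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, f1_score

# Made-up labels: class 1 is the minority class
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_toy = np.array([0, 0, 0, 1, 1, 1, 2, 2, 0, 2])

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)   # rows = true, columns = predicted
recall_toy = np.diag(cm_toy) / cm_toy.sum(axis=1)   # correct / actual count per class
precision_toy = np.diag(cm_toy) / cm_toy.sum(axis=0)  # correct / predicted count per class
f1_toy = 2 * precision_toy * recall_toy / (precision_toy + recall_toy)

balanced_acc_manual = recall_toy.mean()  # mean of per-class recall
macro_f1_manual = f1_toy.mean()          # unweighted mean of per-class F1

print("Per-class recall:   ", np.round(recall_toy, 3))
print("Balanced accuracy:  ", round(balanced_acc_manual, 3),
      "| sklearn:", round(balanced_accuracy_score(y_true_toy, y_pred_toy), 3))
print("Macro F1:           ", round(macro_f1_manual, 3),
      "| sklearn:", round(f1_score(y_true_toy, y_pred_toy, average='macro'), 3))
```

Plain accuracy on these labels is 8/10 = 0.8, while balanced accuracy is (0.75 + 1.0 + 0.75)/3 ≈ 0.833, because the perfectly recalled minority class counts as much as the larger classes.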


In [None]:
# Print classification report and per-class metrics
print("\nClassification report (test):")
print(classification_report(y_test, test_pred, digits=3, target_names=[str(c) for c in np.unique(y)]))

labels = list(np.unique(y))
prec, rec, f1, sup = precision_recall_fscore_support(y_test, test_pred, labels=labels)

per_class_df = pd.DataFrame({
    'precision': prec,
    'recall': rec,
    'f1': f1,
    'support': sup
}, index=labels)
display(per_class_df)

# Balanced accuracy & macro F1
print("Balanced accuracy (test):", round(balanced_accuracy_score(y_test, test_pred), 3))
print("Macro F1 (test):         ", round(f1_score(y_test, test_pred, average='macro'), 3))

# Confusion matrices: raw and normalized
cm_raw = confusion_matrix(y_test, test_pred, labels=labels)
cm_norm = confusion_matrix(y_test, test_pred, labels=labels, normalize='true')

# Raw confusion matrix heatmap
fig, ax = plt.subplots()
im = ax.imshow(cm_raw, cmap='Blues')
ax.set_xticks(range(len(labels)))
ax.set_yticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
ax.set_xlabel("Predicted")
ax.set_ylabel("True")
ax.set_title("Confusion Matrix (Raw)")
for i in range(len(labels)):
    for j in range(len(labels)):
        ax.text(j, i, cm_raw[i, j], ha='center', va='center', color='black')
fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()

# Normalized confusion matrix heatmap
fig, ax = plt.subplots()
im = ax.imshow(cm_norm, vmin=0, vmax=1, cmap='Blues')
ax.set_xticks(range(len(labels)))
ax.set_yticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
ax.set_xlabel("Predicted")
ax.set_ylabel("True")
ax.set_title("Confusion Matrix (Normalized)")
for i in range(len(labels)):
    for j in range(len(labels)):
        ax.text(j, i, f"{cm_norm[i, j]:.2f}", ha='center', va='center', color='black')
fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()


## Computational Complexity: Understanding KNN's Performance Characteristics

Unlike eager learners (e.g., logistic regression, decision trees) that build a compact model during training, KNN is a **lazy learner**: it stores all training examples and defers computation until prediction time. This has important implications for computational cost:

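- **Training time**: essentially zero for brute-force KNN, which only stores the data (scikit-learn's default `algorithm='auto'` may additionally build a KD-tree or Ball Tree index at fit time)
- **Prediction time (brute force)**: O(n·d) per query to compute distances, plus the cost of selecting the k nearest, where n = training samples, d = features, k = neighbors
  - Must compute distance to all n training points
  - Each distance calculation involves d features
  - Must find the k smallest distances (a partial sort does this in about O(n) on average, or O(n log k) with a heap)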
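The cost breakdown above can be read off a from-scratch brute-force predictor. This is a minimal NumPy sketch on made-up 2-D points, not how scikit-learn implements neighbor search; the function name `knn_predict_brute` is hypothetical.

```python
import numpy as np

def knn_predict_brute(X_train, y_train, X_query, k):
    """Brute-force KNN classification with Euclidean distance and majority vote."""
    preds = []
    for x in X_query:
        # O(n * d): one distance per training point, each over d features
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # Partial sort: indices of the k smallest distances, about O(n) on average
        nearest = np.argpartition(dists, k - 1)[:k]
        # Majority vote among the k neighbors (ties go to the smallest label)
        votes = np.bincount(y_train[nearest])
        preds.append(votes.argmax())
    return np.array(preds)

# Two well-separated toy clusters
X_train_demo = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train_demo = np.array([0, 0, 0, 1, 1, 1])
X_query_demo = np.array([[0.2, 0.1], [5.5, 5.2]])

preds_demo = knn_predict_brute(X_train_demo, y_train_demo, X_query_demo, k=3)
print(preds_demo)  # → [0 1]
```

Note that there is no training step at all: every query pays the full scan over the training set, which is why prediction cost grows with n.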

Let's measure the actual time cost for training and prediction on our Wine dataset:

In [None]:
import time

# time.perf_counter() is a monotonic, high-resolution clock suited to short intervals
# Measure training time (includes fitting the scaler)
start = time.perf_counter()
timing_knn = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=chosen_k, metric=chosen_metric))
])
timing_knn.fit(X_train_all, y_train_all)
train_time = time.perf_counter() - start

# Measure prediction time on test set
start = time.perf_counter()
_ = timing_knn.predict(X_test)
predict_time = time.perf_counter() - start

# Measure single prediction time
start = time.perf_counter()
_ = timing_knn.predict(X_test[0:1])
single_predict_time = time.perf_counter() - start

print(f"Training time: {train_time*1000:.2f} ms")
print(f"Prediction time for {len(X_test)} samples: {predict_time*1000:.2f} ms")
print(f"Average prediction time per sample: {predict_time/len(X_test)*1000:.3f} ms")
print(f"Single prediction time: {single_predict_time*1000:.3f} ms")
print(f"\nDataset characteristics:")
print(f"  Training samples: {len(X_train_all)}")
print(f"  Features: {X_train_all.shape[1]}")
print(f"  K: {chosen_k}")
print(f"\nNote: For datasets with >100K samples or low-latency requirements (<1ms),")
print(f"consider approximate nearest neighbor methods (FAISS, Annoy, HNSW).")

## Feature Importance: Which Chemical Properties Matter Most?

Unlike tree-based models, KNN doesn't have built-in feature importance. However, we can use **permutation importance** to identify which features contribute most to classification accuracy. This technique randomly shuffles each feature and measures how much the model's performance degrades—important features cause larger drops in accuracy when permuted.

Understanding feature importance helps:
- **Interpret the model**: Which chemical properties distinguish wine cultivars?
- **Feature selection**: Could we achieve similar accuracy with fewer features?
- **Domain validation**: Do the important features align with wine chemistry knowledge?
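The shuffle-and-measure idea is simple enough to sketch by hand. The version below is a simplified illustration under stated assumptions: it uses a single shuffle per feature on the validation split and a fixed K=5, whereas the notebook relies on `sklearn.inspection.permutation_importance` with repeated shuffles. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

wine_demo = load_wine()
X_demo, y_demo = wine_demo.data, wine_demo.target

X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)
X_va, _, y_va, _ = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),  # K=5 is an arbitrary choice here
]).fit(X_tr, y_tr)

baseline_acc = pipe_demo.score(X_va, y_va)
rng = np.random.default_rng(0)

drops = []
for j in range(X_va.shape[1]):
    X_perm = X_va.copy()
    # Shuffle one column: breaks its link to the label, keeps its distribution
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    drops.append(baseline_acc - pipe_demo.score(X_perm, y_va))
drops = np.array(drops)

# Print features from largest to smallest accuracy drop
for j in np.argsort(drops)[::-1]:
    print(f"{wine_demo.feature_names[j]:<30} drop in accuracy: {drops[j]:+.3f}")
```

A single shuffle is noisy, especially on a small evaluation set, which is why repeating the shuffle and reporting a mean and standard deviation is the more reliable approach.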

In [None]:
from sklearn.inspection import permutation_importance

# Compute permutation importance on test set
perm_importance = permutation_importance(
    final_pipe, X_test, y_test,
    n_repeats=10,
    random_state=42,
    scoring='accuracy'
)

# Create DataFrame sorted by importance
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance_mean': perm_importance.importances_mean,
    'importance_std': perm_importance.importances_std
}).sort_values('importance_mean', ascending=False)

print("Feature Importance (Permutation):\n")
print(importance_df.to_string(index=False))

# Visualize feature importance
plt.figure(figsize=(10, 6))
indices = importance_df.index[:10]  # Top 10 features
plt.barh(range(len(indices)), importance_df.loc[indices, 'importance_mean'],
         xerr=importance_df.loc[indices, 'importance_std'], align='center')
plt.yticks(range(len(indices)), importance_df.loc[indices, 'feature'])
plt.xlabel('Decrease in Accuracy (Permutation Importance)')
plt.ylabel('Feature')
plt.title('Top 10 Most Important Features for Wine Classification')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

# Identify top features
top_features = importance_df.head(3)['feature'].tolist()
print(f"\nTop 3 most important features: {', '.join(top_features)}")
print(f"These chemical properties are most discriminative for identifying wine cultivars.")

## Error Analysis: Understanding Misclassifications

Not all predictions are equally confident. For some wines, the K nearest neighbors all agree on the class (high confidence), while for others, the neighbors are mixed between multiple classes (low confidence/high ambiguity). Analyzing where the model makes mistakes helps us:

- **Identify boundary cases**: Wines that are chemically intermediate between cultivars
- **Assess prediction confidence**: Use neighbor agreement as a proxy for uncertainty
- **Guide data collection**: If high-ambiguity regions have many errors, collect more labeled samples there
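With uniform voting, neighbor agreement is the same quantity that `predict_proba` reports for the predicted class, so it can be obtained without looping over neighbors. The sketch below checks this equivalence on the validation split, with K=5 and the variable names chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)
X_va, _, y_va, _ = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),  # uniform weights by default
]).fit(X_tr, y_tr)

# Route 1: class probabilities are neighbor vote fractions under uniform weights
confidence = pipe_demo.predict_proba(X_va).max(axis=1)

# Route 2: count neighbor labels explicitly
_, idx = pipe_demo.named_steps["knn"].kneighbors(
    pipe_demo.named_steps["scaler"].transform(X_va)
)
agreement = np.array(
    [np.bincount(y_tr[row], minlength=3).max() / len(row) for row in idx]
)

print("Routes agree:", np.allclose(confidence, agreement))
print("Samples with unanimous neighbors:", int((confidence == 1.0).sum()), "/", len(y_va))
```

The equivalence holds only for `weights='uniform'`; with distance weighting, `predict_proba` returns weighted vote fractions instead.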

Let's analyze the test set predictions by examining neighbor homogeneity:

In [None]:
# Get the KNN model from pipeline and transform test data
X_test_scaled = final_pipe.named_steps['scaler'].transform(X_test)
knn_model = final_pipe.named_steps['knn']

# Find nearest neighbors for each test point
distances, neighbor_indices = knn_model.kneighbors(X_test_scaled)

# For each test point, compute neighbor class agreement
neighbor_homogeneity = []
for i, neighbors in enumerate(neighbor_indices):
    neighbor_classes = y_train_all[neighbors]
    # Homogeneity: fraction of neighbors that match the majority class
    majority_class = np.bincount(neighbor_classes).argmax()
    agreement = np.sum(neighbor_classes == majority_class) / len(neighbor_classes)
    neighbor_homogeneity.append(agreement)

neighbor_homogeneity = np.array(neighbor_homogeneity)

# Identify correct and incorrect predictions
correct_mask = (test_pred == y_test)

# Statistics (guard against empty groups, which would otherwise produce NaN means)
print("Prediction Confidence Analysis:\n")
if correct_mask.any():
    print(f"Correct predictions - Mean neighbor agreement: {neighbor_homogeneity[correct_mask].mean():.3f}")
if (~correct_mask).any():
    print(f"Incorrect predictions - Mean neighbor agreement: {neighbor_homogeneity[~correct_mask].mean():.3f}")
else:
    print("Incorrect predictions - none on the test set")
print(f"\nHigh confidence predictions (100% neighbor agreement): {np.sum(neighbor_homogeneity == 1.0)} / {len(y_test)}")
print(f"Low confidence predictions (<80% neighbor agreement): {np.sum(neighbor_homogeneity < 0.8)} / {len(y_test)}")

# Visualize relationship between confidence and correctness
plt.figure(figsize=(10, 5))

# Plot 1: Histogram of neighbor homogeneity by correctness
plt.subplot(1, 2, 1)
plt.hist(neighbor_homogeneity[correct_mask], bins=10, alpha=0.7, label='Correct', color='green', edgecolor='black')
plt.hist(neighbor_homogeneity[~correct_mask], bins=10, alpha=0.7, label='Incorrect', color='red', edgecolor='black')
plt.xlabel('Neighbor Homogeneity (Agreement)')
plt.ylabel('Count')
plt.title('Prediction Confidence vs Correctness')
plt.legend()
plt.grid(axis='y', alpha=0.3)

# Plot 2: Misclassifications by actual class
plt.subplot(1, 2, 2)
misclassified_classes = y_test[~correct_mask]
if len(misclassified_classes) > 0:
    class_counts = np.bincount(misclassified_classes, minlength=len(np.unique(y)))
    plt.bar(range(len(class_counts)), class_counts, edgecolor='black', color='salmon')
    plt.xlabel('True Class')
    plt.ylabel('Number of Misclassifications')
    plt.title('Errors by Wine Cultivar')
    plt.xticks(range(len(class_counts)))
else:
    plt.text(0.5, 0.5, 'No misclassifications!', ha='center', va='center', fontsize=14)
    plt.axis('off')

plt.tight_layout()
plt.show()

# Show details of ambiguous cases
if np.sum(~correct_mask) > 0:
    print(f"\nMisclassified samples: {np.sum(~correct_mask)}")
    print("Most ambiguous misclassifications (lowest neighbor agreement):")
    misclassified_indices = np.where(~correct_mask)[0]
    ambiguous_errors = misclassified_indices[np.argsort(neighbor_homogeneity[~correct_mask])[:min(3, len(misclassified_indices))]]
    for idx in ambiguous_errors:
        neighbors = neighbor_indices[idx]
        neighbor_classes = y_train_all[neighbors]
        print(f"  Test sample {idx}: True={y_test[idx]}, Predicted={test_pred[idx]}, "
              f"Neighbor agreement={neighbor_homogeneity[idx]:.2f}, "
              f"Neighbor classes={neighbor_classes}")
else:
    print("\nPerfect classification - no errors to analyze!")

## Limitations (Current Scope) & What’s Next
This notebook uses a **single hold-out validation** set, which is simple but sensitive to how the data happen to be split. In practice, data scientists often use **k-fold cross-validation** or nested cross-validation to obtain more reliable estimates and to avoid overfitting hyperparameters to a single split. We also used brute-force neighbor search (`algorithm='brute'`) and didn't explore scalability techniques such as KD-trees, Ball Trees, or approximate nearest neighbor methods (e.g. FAISS, HNSW). These become important when the training set grows to millions of rows or when predictions must be served with low latency. Finally, we didn't address class imbalance or cost-sensitive KNN; these are natural extensions for more advanced courses.
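
As a rough illustration of the k-fold alternative mentioned above, the sketch below tunes K with 5-fold stratified cross-validation. The data are reloaded with `load_wine` so the block runs on its own; the split ratio, `random_state`, fold count, and the grid of K values are all assumptions chosen for illustration, not the settings used earlier in this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the Wine data and hold out a test set that tuning never sees
X, y = load_wine(return_X_y=True)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # assumed split settings
)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid of K values (an assumption, not the notebook's grid)
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9, 11, 15]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_dev, y_dev)

print("Best K:", search.best_params_["knn__n_neighbors"])
print(f"Mean CV accuracy for best K: {search.best_score_:.3f}")
print(f"Test accuracy of refit model: {search.score(X_test, y_test):.3f}")
```

Because the scaler is a pipeline step, each fold's validation data never influences the scaling statistics, which keeps the cross-validated estimate free of that form of leakage.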

## Common Pitfalls and Best Practices

### Critical Mistakes to Avoid:

1. **Forgetting to scale features** ❌  
   KNN is distance-based, so features with larger numeric ranges dominate the distance calculation. Always apply `StandardScaler` or a similar normalization, fitted on the training data only.

2. **Using the test set for hyperparameter tuning** ❌  
   This creates data leakage and inflates performance estimates. Use a separate validation set or cross-validation for all tuning decisions.

3. **Choosing K=1 for production systems** ❌  
   With K=1, every training point is its own nearest neighbor, so training accuracy looks perfect, yet the model is highly sensitive to noise and outliers. Always compare K=1 against larger values of K on held-out data.

4. **Ignoring computational cost** ❌  
   Brute-force KNN computes distances to all training points at prediction time. For large datasets (>100K samples), consider tree-based indexes, approximate nearest neighbor methods, or alternative algorithms.

5. **Treating class imbalance casually** ❌  
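   If one class has 90% of samples, KNN's majority vote will naturally favor that class. scikit-learn's `KNeighborsClassifier` has no `class_weight` option, so consider resampling, distance-weighted voting (`weights='distance'`), stratified sampling, and appropriate evaluation metrics (balanced accuracy, F1).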
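
Pitfall 3 can be checked directly. The sketch below is a minimal illustration, assuming a fresh 70/30 stratified split of the Wine data and three arbitrary K values (none of these settings come from the notebook). It prints training and validation accuracy side by side so the optimistic training score at K=1 is visible.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42  # assumed split settings
)

train_acc, val_acc = {}, {}
for k in [1, 5, 15]:  # illustrative K values
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=k)),
    ])
    pipe.fit(X_train, y_train)
    # When scoring on training data with K=1, each point finds itself at distance 0
    train_acc[k] = pipe.score(X_train, y_train)
    val_acc[k] = pipe.score(X_val, y_val)
    print(f"K={k:>2}: train={train_acc[k]:.3f}, validation={val_acc[k]:.3f}")
```

The training score at K=1 says nothing about generalization; only the validation column is useful for choosing K.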

### Best Practices:

✅ **Always scale features** before applying KNN to continuous data  
✅ **Use stratified splits** to maintain class proportions across train/val/test sets  
✅ **Validate hyperparameters** (K, distance metric) on separate validation data  
✅ **Consider dimensionality**: KNN performance degrades in very high dimensions (curse of dimensionality) because distances between points become less informative; as a rough rule of thumb, consider dimensionality reduction (PCA, feature selection) beyond about 50 features  
✅ **Monitor the train-validation gap** to detect overfitting early  
✅ **Use domain knowledge**: For some applications, specialized distance metrics may work better than Euclidean (e.g. cosine for text or embedding vectors, Hamming for binary or categorical features)
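
To make the stratified-split practice concrete, here is a small sketch that builds a train/validation/test split with `stratify` and compares class proportions across the parts. The 60/20/20 ratio, `random_state`, and the helper `class_proportions` are assumptions introduced for this example, not the notebook's own split.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Split off the test set first, then carve a validation set out of the remainder
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)

def class_proportions(labels):
    """Return the fraction of samples in each of the three cultivars."""
    counts = np.bincount(labels, minlength=3)
    return counts / counts.sum()

for name, labels in [("full", y), ("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name:>5} (n={len(labels):>3}): {np.round(class_proportions(labels), 3)}")
```

Without `stratify`, a small split can over- or under-represent a cultivar by chance, which makes validation and test accuracy harder to compare.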

## Conclusion

In this case study, we've worked through a complete KNN classification workflow on the Wine dataset, covering the essential concepts and practical techniques:

### Key Takeaways:

1. **Feature scaling is non-negotiable** for distance-based algorithms. Without it, KNN effectively ignores smaller-scale features, leading to poor performance.

2. **Hyperparameter tuning requires systematic validation**. We used a dedicated validation set to select K and the distance metric, ensuring our choices generalize beyond the training data.

3. **The bias-variance tradeoff is visible in the train-validation gap**. Very small K (high flexibility) leads to overfitting; very large K (high rigidity) leads to underfitting.

4. **KNN is computationally expensive at prediction time**. Unlike parametric models that compress knowledge into parameters, KNN stores the training set and, with brute-force search, compares each query against every training example.

5. **Error analysis reveals model confidence**. Neighbor homogeneity provides a natural measure of prediction uncertainty, helping identify boundary cases where the model is less certain.
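
For takeaway 5, it may help to know that with uniform voting (`weights='uniform'`, the scikit-learn default) the largest entry of `predict_proba` is the same neighbor-agreement fraction computed by hand in the error analysis. The sketch below checks this on a freshly fitted pipeline; the split, `n_neighbors=5`, and the 0.8 threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # assumed split settings
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),  # uniform weights by default
])
pipe.fit(X_train, y_train)

# With uniform weights, each probability is the fraction of neighbors in that class
confidence = pipe.predict_proba(X_test).max(axis=1)

# Cross-check against an explicit neighbor lookup
X_test_scaled = pipe.named_steps["scaler"].transform(X_test)
_, neighbor_idx = pipe.named_steps["knn"].kneighbors(X_test_scaled)
manual = np.array([
    np.bincount(y_train[idx], minlength=3).max() / len(idx)
    for idx in neighbor_idx
])

print("Shortcut matches manual computation:", np.allclose(confidence, manual))
print("Test points with <80% agreement:", np.where(confidence < 0.8)[0])
```

This equivalence no longer holds with `weights='distance'`, where votes are weighted by inverse distance.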

### When to Use KNN:

✅ **Good fit:**
- Small to medium datasets (<100K samples)
- Problems where local similarity is meaningful
- Situations requiring interpretable, example-based reasoning
- Establishing baselines before trying complex models

❌ **Poor fit:**
- Large datasets requiring low-latency predictions
- High-dimensional data (>50 features) without dimensionality reduction
- Problems where global patterns matter more than local similarity
- Datasets with severe class imbalance (without special handling)

### Next Steps:

- **Try cross-validation** instead of a single validation split for more robust hyperparameter selection
- **Experiment with distance metrics** (Manhattan, Minkowski with different p values) tailored to your data
- **Explore weighted KNN** where closer neighbors have more influence (use `weights='distance'`)
- **Consider dimensionality reduction** (PCA, feature selection) if working with high-dimensional data
- **Compare against other classifiers** (Logistic Regression, Random Forest, SVM) to see if KNN's simplicity is sufficient
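
Two of the next steps above, distance metrics and weighted KNN, can be explored together. The following sketch compares uniform and distance-weighted voting under Manhattan (`p=1`) and Euclidean (`p=2`) distance using 5-fold cross-validation on a development split. K=5, the fold count, and the split settings are assumptions for illustration only.

```python
from itertools import product

from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
# Keep a test set aside; it is deliberately not used in this comparison
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for weights, p in product(["uniform", "distance"], [1, 2]):
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        # The default metric is Minkowski: p=1 is Manhattan, p=2 is Euclidean
        ("knn", KNeighborsClassifier(n_neighbors=5, weights=weights, p=p)),
    ])
    scores = cross_val_score(pipe, X_dev, y_dev, cv=cv, scoring="accuracy")
    results[(weights, p)] = scores.mean()
    print(f"weights={weights:<8} p={p}: mean CV accuracy = {scores.mean():.3f}")
```

On a dataset this small, differences between configurations are often within fold-to-fold noise, so treat small gaps with caution.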

KNN remains one of the most intuitive machine learning algorithms: its "similar inputs produce similar outputs" principle mirrors how humans naturally reason about new situations by comparing them to past experiences.