# K-Nearest Neighbors (KNN): Instance-Based Learning

## 📘 What You'll Master

**K-Nearest Neighbors (KNN)** is a **non-parametric, instance-based** learning algorithm that makes predictions based on the **similarity** to training examples. Unlike tree-based or linear models that learn a global function, KNN is a **"lazy learner"** — it stores all training data and makes predictions by looking at the K most similar instances.

### 🎯 Why KNN Matters

1. **Simple yet powerful**: No explicit training phase; works well for small to medium datasets
2. **Non-parametric**: Makes no assumptions about data distribution (unlike linear regression)
3. **Naturally handles multi-class**: Classification with 10+ classes without modification
4. **Interpretable**: "You're like these 5 other examples"
5. **Foundation for similarity-based systems**: Recommendation engines, case-based reasoning

### 🔬 Real-World Applications

**Post-Silicon Validation:**
- **Similar die detection**: Find dies with similar parametric profiles for root cause analysis
- **Reference-based prediction**: "This device looks like known failures"
- **Spatial similarity**: Identify neighboring dies with correlated failures
- **Test correlation**: Find tests that behave similarly across devices

**General AI/ML:**
- **Recommendation systems**: "Users like you also liked..."
- **Anomaly detection**: Identify samples far from all neighbors
- **Medical diagnosis**: "Patients with similar symptoms had condition X"
- **Content-based filtering**: Find similar images, documents, or products

### 📊 Learning Path Context

```mermaid
graph LR
    A[Linear Models<br/>010-015] --> B[Tree Models<br/>016-018]
    B --> C[Boosting<br/>019-021]
    C --> D[Meta-Ensembles<br/>022]
    D --> E[KNN<br/>023 YOU ARE HERE]
    E --> F[SVM<br/>024]
    F --> G[Naive Bayes<br/>025]
    G --> H[Clustering<br/>026-030]
    
    style E fill:#ff6b6b,stroke:#c92a2a,stroke-width:3px,color:#fff
```

**What Makes KNN Different:**
- **No training phase**: Just stores data (contrast with tree/boosting training)
- **Prediction = search**: Find K nearest neighbors each time
- **Distance-based**: Everything depends on how you measure "similarity"
- **Memory-based**: Stores all training data (can be large)

---


## 🔍 How KNN Works: The Similarity Principle

### KNN Algorithm Flow

```mermaid
graph TD
    A[New Sample x] --> B[Calculate Distance<br/>to ALL Training Samples]
    B --> C[Sort by Distance:<br/>Closest to Farthest]
    C --> D[Select K Nearest<br/>Neighbors]
    D --> E{Task Type?}
    E -->|Classification| F[Majority Vote<br/>among K neighbors]
    E -->|Regression| G[Average of<br/>K neighbor values]
    F --> H[Predicted Class]
    G --> I[Predicted Value]
    
    style A fill:#e3f2fd
    style D fill:#fff3e0
    style H fill:#c8e6c9
    style I fill:#c8e6c9
```

### 📝 What's Happening in KNN?

**1. Distance Calculation** - Measure similarity to all training samples

**2. K Selection** - Choose number of neighbors (hyperparameter)

**3. Voting/Averaging** - Aggregate predictions from K neighbors

**4. Prediction** - Output most common class (classification) or average (regression)

### 🎯 Key Insight

**"Similar inputs should have similar outputs"**

If 8 out of 10 nearest neighbors are "Pass", the new sample is likely "Pass" too.
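
The four steps above can be traced on a toy problem. The following is a minimal sketch in plain Python, assuming a made-up one-dimensional dataset; the values and the helper name `knn_predict` are illustrative and not part of the notebook.

```python
from collections import Counter

# Hypothetical 1-D training set: (feature value, label)
train = [(1.0, "Pass"), (1.2, "Pass"), (1.9, "Pass"),
         (3.1, "Fail"), (3.4, "Fail"), (4.0, "Fail")]

def knn_predict(x, train, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Step 1: rank every training sample by distance to x (absolute difference in 1-D)
    by_distance = sorted(train, key=lambda pair: abs(pair[0] - x))
    # Step 2: keep the k closest samples
    neighbors = by_distance[:k]
    # Steps 3-4: count the labels and return the most common one
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict(1.5, train))  # nearest points are 1.2, 1.9, 1.0 -> Pass
print(knn_predict(3.0, train))  # nearest points are 3.1, 3.4, 4.0 -> Fail
```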

---


## 📐 Mathematical Foundation

### 1️⃣ Distance Metrics: Measuring Similarity

**Euclidean Distance** (most common, L2 norm):

$$
d(x, x') = \sqrt{\sum_{j=1}^p (x_j - x'_j)^2}
$$

- $x, x'$: Two samples with $p$ features
- Geometric "straight-line" distance
- **Sensitive to scale**: Feature with range [0, 1000] dominates feature with [0, 1]
- **Requires normalization**: Always use StandardScaler or MinMaxScaler

**Manhattan Distance** (L1 norm, city-block):

$$
d(x, x') = \sum_{j=1}^p |x_j - x'_j|
$$

- Sum of absolute differences (like walking city blocks)
- Less sensitive to outliers than Euclidean
- Often tried on high-dimensional data, where Euclidean distances tend to bunch together

**Minkowski Distance** (generalized):

$$
d(x, x') = \left( \sum_{j=1}^p |x_j - x'_j|^q \right)^{1/q}
$$

- $q=1$: Manhattan, $q=2$: Euclidean
- Larger $q$: More weight to large differences

**Cosine Similarity** (angle-based, for text/high-dim):

$$
\text{similarity}(x, x') = \frac{x \cdot x'}{\|x\| \|x'\|} = \frac{\sum_{j=1}^p x_j x'_j}{\sqrt{\sum_{j=1}^p x_j^2} \sqrt{\sum_{j=1}^p x'^2_j}}
$$

- Measures angle, not magnitude (unchanged when a whole vector is multiplied by a positive constant; rescaling individual features still changes it)
- Distance: $d = 1 - \text{similarity}$
- Common in text classification (TF-IDF vectors)
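
To make the formulas above concrete, here is a small NumPy sketch that evaluates each metric on one pair of vectors. The vectors and the helper names `minkowski` and `cosine_distance` are assumptions chosen for easy arithmetic.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
x_prime = np.array([4.0, 6.0, 3.0])

def minkowski(a, b, q):
    """Minkowski distance; q=1 is Manhattan, q=2 is Euclidean."""
    return float(np.sum(np.abs(a - b) ** q) ** (1.0 / q))

def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between a and b."""
    return float(1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print("Manhattan    :", minkowski(x, x_prime, 1))   # |3| + |4| + |0| = 7
print("Euclidean    :", minkowski(x, x_prime, 2))   # sqrt(9 + 16 + 0) = 5
print("Minkowski q=3:", minkowski(x, x_prime, 3))
print("Cosine       :", cosine_distance(x, x_prime))
# Stretching one vector changes its magnitude but not the angle
print("Cosine, 10x  :", cosine_distance(x, 10 * x_prime))
```

The last two lines print the same value up to rounding, which is the sense in which cosine distance ignores magnitude.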

---

### 2️⃣ Classification: Majority Vote

**Prediction for sample $x$**:

$$
\hat{y}(x) = \arg\max_c \sum_{i \in \mathcal{N}_K(x)} \mathbb{1}(y_i = c)
$$

- $\mathcal{N}_K(x)$: Set of K nearest neighbors to $x$
- $\mathbb{1}(\cdot)$: Indicator function (1 if true, 0 otherwise)
- $\arg\max_c$: Class with most votes among K neighbors

**Example:** K=5 neighbors have classes [Pass, Pass, Fail, Pass, Pass]
- Pass: 4 votes, Fail: 1 vote → Predict **Pass**

**Weighted Voting** (closer neighbors have more influence):

$$
\hat{y}(x) = \arg\max_c \sum_{i \in \mathcal{N}_K(x)} w_i \cdot \mathbb{1}(y_i = c), \quad w_i = \frac{1}{d(x, x_i) + \epsilon}
$$

- Closer neighbors get higher weight ($w_i \propto 1/d$)
- $\epsilon$: Small constant to avoid division by zero
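
Uniform and weighted voting can disagree. The sketch below uses hypothetical labels and distances (not taken from the notebook) where two very close "Fail" neighbors outweigh three distant "Pass" neighbors.

```python
from collections import Counter, defaultdict

# Hypothetical K=5 neighborhood, sorted by distance to the query
labels = ["Fail", "Fail", "Pass", "Pass", "Pass"]
distances = [0.1, 0.2, 1.0, 1.5, 2.0]

def uniform_vote(labels):
    """Every neighbor counts once."""
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(labels, distances, eps=1e-9):
    """Each neighbor contributes w_i = 1 / (d_i + eps) to its class."""
    scores = defaultdict(float)
    for label, d in zip(labels, distances):
        scores[label] += 1.0 / (d + eps)
    return max(scores, key=scores.get)

print(uniform_vote(labels))              # Pass: 3 votes against 2
print(weighted_vote(labels, distances))  # Fail: 10 + 5 = 15 against roughly 2.17
```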

---

### 3️⃣ Regression: Averaging

**Prediction for sample $x$**:

$$
\hat{y}(x) = \frac{1}{K} \sum_{i \in \mathcal{N}_K(x)} y_i
$$

- Simple average of K nearest neighbor values

**Weighted Regression**:

$$
\hat{y}(x) = \frac{\sum_{i \in \mathcal{N}_K(x)} w_i \cdot y_i}{\sum_{i \in \mathcal{N}_K(x)} w_i}, \quad w_i = \frac{1}{d(x, x_i) + \epsilon}
$$

- Weighted average (closer neighbors matter more)
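
A short worked example of both regression formulas, using made-up neighbor values and distances (the function name `knn_regress` is illustrative):

```python
# Hypothetical K=3 neighborhood: target values and distances to the query
values = [10.0, 20.0, 60.0]
distances = [0.5, 1.0, 2.0]

def knn_regress(values, distances, weighted=False, eps=1e-12):
    """Plain mean, or inverse-distance weighted mean, of the neighbor targets."""
    if not weighted:
        return sum(values) / len(values)
    weights = [1.0 / (d + eps) for d in distances]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)

print(knn_regress(values, distances))                 # (10 + 20 + 60) / 3 = 30
print(knn_regress(values, distances, weighted=True))  # (2*10 + 1*20 + 0.5*60) / 3.5, about 20
```

The far neighbor with value 60 pulls the plain mean up to 30, while the weighted estimate stays near the two closer neighbors.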

---

### 4️⃣ Choosing K: Bias-Variance Trade-off

**Small K (e.g., K=1, K=3):**
- ✅ Low bias (flexible, captures local patterns)
- ❌ High variance (sensitive to noise, overfitting)
- Decision boundary: Jagged, complex

**Large K (e.g., K=50, K=100):**
- ✅ Low variance (smooth, stable predictions)
- ❌ High bias (misses local patterns, underfitting)
- Decision boundary: Smooth, simple

**Optimal K:**
- Use cross-validation to select K
- Rule of thumb: start near $K = \sqrt{N}$ (where $N$ = training size), then tune
- Try odd K for binary classification (avoids ties)

---

### 5️⃣ Computational Complexity

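**Training:** No learning step. Brute-force search only stores the data ($O(N \cdot p)$ memory); a KD-tree or ball tree adds a one-time index build.

**Prediction:** $O(N \cdot p)$ per query with brute force: one distance to each of the $N$ training samples over $p$ features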

**Problem:** Prediction is **SLOW** for large datasets (millions of samples)

**Solutions:**
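- **KD-Tree**: roughly $O(p \log N)$ per query in low dimensions (works for $p < 20$); degrades toward brute force as $p$ grows
- **Ball Tree**: also roughly $O(p \log N)$ per query; usually holds up better than a KD-tree in higher dimensions
- **Approximate NN**: FAISS, Annoy (trade a small amount of accuracy for large speedups, often 10-100x)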
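
Even without a tree index, the brute-force query does not need a full sort. The sketch below, on synthetic data with assumed sizes, computes all $N$ distances and then picks the K smallest with `np.argpartition`, which runs in linear time on average instead of $O(N \log N)$. Squared distances are used because they rank neighbors the same way as true Euclidean distances.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 8))   # hypothetical training matrix (N=10,000, p=8)
query = rng.normal(size=8)
k = 5

# O(N * p): squared Euclidean distance from the query to every training row
sq_dists = np.sum((X_train - query) ** 2, axis=1)

# Baseline: full sort, O(N log N)
idx_sorted = np.argsort(sq_dists)[:k]

# Partial selection: the first k positions hold the k smallest, in arbitrary order
idx_part = np.argpartition(sq_dists, k - 1)[:k]
# Order just those k candidates by distance (O(k log k))
idx_part = idx_part[np.argsort(sq_dists[idx_part])]

print(idx_sorted)
print(idx_part)
```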

---


### 📝 What's Happening in This Code?

**Purpose:** Import libraries for KNN implementation and evaluation

**Key Points:**
- **KNeighborsClassifier/Regressor**: sklearn's KNN implementation with multiple distance metrics
- **StandardScaler**: **CRITICAL** for KNN — features must have similar scales
- **make_classification/regression**: Generate synthetic data for demonstrations
- **Metrics**: Accuracy, confusion matrix, MAE for evaluation
- **KD-Tree option**: `algorithm='kd_tree'` for faster search (small p)

**Why This Matters:** KNN is extremely sensitive to feature scales — always normalize!


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    mean_absolute_error, mean_squared_error, r2_score,
    roc_auc_score, roc_curve
)
from sklearn.datasets import make_classification, make_regression, make_blobs
import time

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("✅ Imports complete")
print(f"   NumPy: {np.__version__}")
print(f"   Pandas: {pd.__version__}")
print(f"\n🎯 Ready to explore K-Nearest Neighbors!")

### 📝 What's Happening in This Code?

**Purpose:** Implement KNN from scratch to understand distance calculation and voting

**Key Points:**
- **euclidean_distance**: NumPy vectorized distance calculation ($\sqrt{\sum(x_i - x_j)^2}$)
- **predict**: For each test sample, calculate distances to all training samples
- **np.argsort**: Sort distances and get indices of K nearest neighbors
- **np.bincount**: Count votes for each class (efficient, but requires non-negative integer labels)
- **No training**: Just stores X_train and y_train!

**Why This Matters:** KNN is conceptually simple but computationally expensive — understand the distance calculation bottleneck.
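
As a complement to the explicit loops in the class that follows, the same logic can be sketched with NumPy broadcasting, which computes all test-to-train distances in one step. The function names and toy data are assumptions for illustration. Note that the intermediate array has shape (n_test, n_train, p), so this trades memory for speed.

```python
import numpy as np

def pairwise_euclidean(A, B):
    """Distances between every row of A (m x p) and every row of B (n x p)."""
    # Broadcasting: (m, 1, p) - (1, n, p) -> (m, n, p), then reduce over features
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

def knn_predict_vectorized(X_train, y_train, X_test, k=5):
    dists = pairwise_euclidean(X_test, X_train)   # shape (n_test, n_train)
    k_idx = np.argsort(dists, axis=1)[:, :k]      # k nearest training rows per test row
    k_labels = y_train[k_idx]                     # shape (n_test, k)
    # Majority vote per row; bincount needs non-negative integer labels
    return np.array([np.bincount(row).argmax() for row in k_labels])

# Toy data: two well-separated groups
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[0.2, 0.1], [5.5, 5.2]])

print(knn_predict_vectorized(X_toy, y_toy, X_new, k=3))  # [0 1]
```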


In [None]:
class KNNClassifierScratch:
    """K-Nearest Neighbors Classifier from scratch"""
    
    def __init__(self, k=5):
        self.k = k
        
    def fit(self, X, y):
        """Store training data (no actual training)"""
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        return self
    
    def euclidean_distance(self, x1, x2):
        """Calculate Euclidean distance between two vectors"""
        return np.sqrt(np.sum((x1 - x2) ** 2))
    
    def predict(self, X):
        """Predict class for each sample in X"""
        predictions = []
        
        for x_test in X:
            # Calculate distances to all training samples
            distances = [self.euclidean_distance(x_test, x_train) 
                        for x_train in self.X_train]
            
            # Get indices of K nearest neighbors (smallest distances)
            k_indices = np.argsort(distances)[:self.k]
            
            # Get labels of K nearest neighbors
            k_nearest_labels = self.y_train[k_indices]
            
            # Majority vote (most common class)
            most_common = np.bincount(k_nearest_labels).argmax()
            predictions.append(most_common)
        
        return np.array(predictions)

# Demo on simple dataset
print("🧪 Testing KNN from Scratch\n")

# Generate small dataset
X, y = make_classification(n_samples=200, n_features=2, n_informative=2, 
                          n_redundant=0, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# CRITICAL: Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train and predict
knn_scratch = KNNClassifierScratch(k=5)
knn_scratch.fit(X_train_scaled, y_train)

start_time = time.time()
y_pred = knn_scratch.predict(X_test_scaled)
pred_time = time.time() - start_time

accuracy = accuracy_score(y_test, y_pred)

print(f"✅ From-Scratch KNN Results (K=5):")
print(f"   Accuracy: {accuracy:.4f}")
print(f"   Prediction time: {pred_time*1000:.2f}ms for {len(X_test)} samples")
print(f"   Per-sample: {pred_time*1000/len(X_test):.2f}ms")

print(f"\n🔍 How it works:")
print(f"   1. For each test sample, calculate distance to ALL {len(X_train)} training samples")
print(f"   2. Sort distances and find K={knn_scratch.k} nearest neighbors")
print(f"   3. Majority vote among neighbors (most common class wins)")
print(f"   4. Repeat for all {len(X_test)} test samples")
print(f"\n⚠️ Note: No training phase — just stores data!")

### 📝 What's Happening in This Code?

**Purpose:** Use sklearn's optimized KNN with multiple distance metrics and algorithms

**Key Points:**
- **n_neighbors=5**: K value (typical starting point)
- **weights='uniform'**: All neighbors vote equally (vs 'distance' for weighted voting)
- **metric='euclidean'**: Distance function (alternatives: 'manhattan', 'minkowski', 'cosine'; sklearn computes 'cosine' by brute force only)
- **algorithm='auto'**: Chooses best search method (brute force, KD-tree, ball tree)
- **predict_proba**: Get probability estimates (useful for thresholding)

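**Why This Matters:** sklearn's KNN is usually much faster than the from-scratch loop because distances are computed in vectorized, compiled code and tree indexes can prune the search. On a dataset this small the gap can shrink, so rely on the measured speedup rather than a fixed 10-100x figure.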
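
Beyond `predict`, a fitted `KNeighborsClassifier` exposes `kneighbors`, which returns the distances and training-set indices behind a prediction. That supports the "you're like these examples" style of explanation. A minimal sketch on toy data (the arrays are assumptions, not the notebook's dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

knn_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

query = np.array([[0.2, 0.1]])
# Both arrays have shape (n_queries, n_neighbors), sorted nearest first
distances, indices = knn_toy.kneighbors(query)

print("Neighbor indices:", indices[0])
print("Neighbor labels :", y_toy[indices[0]])
print("Distances       :", distances[0].round(3))
print("Prediction      :", knn_toy.predict(query)[0])
```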


In [None]:
print("🚀 Production KNN with sklearn\n")

# Same dataset as before
knn_sklearn = KNeighborsClassifier(
    n_neighbors=5,
    weights='uniform',  # 'uniform' or 'distance'
    metric='euclidean',  # Distance function
    algorithm='auto',    # 'auto', 'ball_tree', 'kd_tree', 'brute'
    n_jobs=-1            # Parallel processing
)

start_time = time.time()
knn_sklearn.fit(X_train_scaled, y_train)
fit_time = time.time() - start_time

start_time = time.time()
y_pred_sklearn = knn_sklearn.predict(X_test_scaled)
y_proba_sklearn = knn_sklearn.predict_proba(X_test_scaled)[:, 1]
pred_time_sklearn = time.time() - start_time

accuracy_sklearn = accuracy_score(y_test, y_pred_sklearn)
auc_sklearn = roc_auc_score(y_test, y_proba_sklearn)

print(f"✅ sklearn KNN Results (K=5):")
print(f"   Fit time: {fit_time*1000:.2f}ms (just stores data)")
print(f"   Prediction time: {pred_time_sklearn*1000:.2f}ms for {len(X_test)} samples")
print(f"   Per-sample: {pred_time_sklearn*1000/len(X_test):.2f}ms")
print(f"   Accuracy: {accuracy_sklearn:.4f}")
print(f"   AUC: {auc_sklearn:.4f}")

speedup = pred_time / pred_time_sklearn
print(f"\n⚡ Speedup over from-scratch: {speedup:.1f}x")

# Comparison with from-scratch
print(f"\n📊 Validation (same predictions?): {np.array_equal(y_pred, y_pred_sklearn)}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred_sklearn)
print(f"\n📋 Confusion Matrix:")
print(f"   True Neg:  {cm[0,0]:3d}  |  False Pos: {cm[0,1]:3d}")
print(f"   False Neg: {cm[1,0]:3d}  |  True Pos:  {cm[1,1]:3d}")

### 📝 What's Happening in This Code?

**Purpose:** Find optimal K value using cross-validation

**Key Points:**
- **K range**: Test K=1, 3, 5, 7, 9, 11, 15, 20, 30, 50 (common values)
- **cross_val_score**: 5-fold CV to get robust accuracy estimate
- **Bias-variance trade-off**: Small K (high variance), large K (high bias)
- **Elbow method**: Look for K where accuracy plateaus
- **Odd K preferred**: Avoids tie-breaking in binary classification

**Why This Matters:** K is the most important hyperparameter in KNN — wrong K leads to poor performance.
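
One caveat: the search in the next cell cross-validates on `X_train_scaled`, so the scaler has already seen every validation fold. A common way to avoid that leak is to put the scaler and the classifier in a `Pipeline`, so scaling is refit inside each fold. The sketch below shows the idea on a separate synthetic dataset; the dataset sizes, step names, and K grid are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data, independent of the notebook's variables
X_cv, y_cv = make_classification(n_samples=300, n_features=5, n_informative=3,
                                 n_redundant=0, random_state=0)

# The scaler is fit on the training part of each fold only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9, 11, 15]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_cv, y_cv)

print("Best K:", search.best_params_["knn__n_neighbors"])
print(f"CV accuracy: {search.best_score_:.4f}")
```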


In [None]:
print("🔍 Finding Optimal K via Cross-Validation\n")

# Test different K values
k_values = [1, 3, 5, 7, 9, 11, 15, 20, 30, 50]
cv_scores_mean = []
cv_scores_std = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
    scores = cross_val_score(knn, X_train_scaled, y_train, cv=5, scoring='accuracy')
    cv_scores_mean.append(scores.mean())
    cv_scores_std.append(scores.std())
    print(f"   K={k:2d}: Accuracy = {scores.mean():.4f} ± {scores.std():.4f}")

# Find best K
best_k_idx = np.argmax(cv_scores_mean)
best_k = k_values[best_k_idx]
best_accuracy = cv_scores_mean[best_k_idx]

print(f"\n✅ Best K: {best_k} (Accuracy: {best_accuracy:.4f})")

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))

ax.errorbar(k_values, cv_scores_mean, yerr=cv_scores_std, 
            marker='o', linestyle='-', linewidth=2, markersize=8,
            capsize=5, capthick=2, label='CV Accuracy ± Std')
ax.axvline(best_k, color='red', linestyle='--', linewidth=2, 
           label=f'Best K={best_k}')
ax.set_xlabel('K (Number of Neighbors)', fontsize=12, fontweight='bold')
ax.set_ylabel('Cross-Validation Accuracy', fontsize=12, fontweight='bold')
ax.set_title('KNN: Finding Optimal K via 5-Fold Cross-Validation', 
             fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Annotate best K
ax.annotate(f'Best K={best_k}\nAcc={best_accuracy:.4f}', 
            xy=(best_k, best_accuracy), xytext=(best_k+10, best_accuracy-0.02),
            arrowprops=dict(arrowstyle='->', color='red', lw=2),
            fontsize=11, fontweight='bold', color='red')

plt.tight_layout()
plt.show()

print(f"\n🎯 Observations:")
print(f"   • Small K (1-3): High variance, sensitive to noise")
print(f"   • Medium K ({best_k}): Best balance (bias-variance trade-off)")
print(f"   • Large K (30-50): High bias, misses local patterns")
print(f"   • Rule of thumb: Start with K=√N = {int(np.sqrt(len(X_train)))}")

### 📝 What's Happening in This Code?

**Purpose:** Compare different distance metrics (Euclidean, Manhattan, Minkowski, Cosine)

**Key Points:**
- **Euclidean**: Geometric distance, most common ($L_2$ norm)
- **Manhattan**: City-block distance, less sensitive to outliers ($L_1$ norm)
- **Minkowski (p=3)**: Generalized distance ($L_p$ norm); sklearn's `p` is the exponent written as $q$ in the Minkowski formula earlier
- **Cosine**: Angle-based, good for high-dimensional data (text, embeddings)
- **Dataset-dependent**: No single "best" metric — try multiple

**Why This Matters:** The distance metric defines what "similar" means, so accuracy can shift by several points between metrics; compare them on your own data.
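
Cosine and Euclidean are more closely related than they look. For vectors scaled to unit length, $\|a - b\|^2 = 2\,(1 - \cos\theta)$, so both metrics rank neighbors identically. A quick numerical check on random vectors (an illustrative sketch, not part of the notebook's comparison):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=16), rng.normal(size=16)

# L2-normalize both vectors
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine_dist = 1.0 - a_unit @ b_unit
sq_euclid = np.sum((a_unit - b_unit) ** 2)

print(f"2 * cosine distance : {2 * cosine_dist:.6f}")
print(f"squared Euclidean   : {sq_euclid:.6f}")
```

In practice this means that L2-normalizing the rows and using Euclidean distance is one way to get cosine-style neighbors while keeping tree-based search available.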


In [None]:
print("📏 Comparing Distance Metrics\n")

metrics = {
    'Euclidean (L2)': 'euclidean',
    'Manhattan (L1)': 'manhattan',
    'Minkowski (p=3)': 'minkowski',  # Need to set p parameter
    'Cosine': 'cosine'
}

metric_results = {}

for name, metric in metrics.items():
    if metric == 'minkowski':
        knn = KNeighborsClassifier(n_neighbors=best_k, metric=metric, p=3, n_jobs=-1)
    else:
        knn = KNeighborsClassifier(n_neighbors=best_k, metric=metric, n_jobs=-1)
    
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    y_proba = knn.predict_proba(X_test_scaled)[:, 1]  # available for every metric, including cosine
    
    accuracy = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba) if y_proba is not None else None
    
    metric_results[name] = {'accuracy': accuracy, 'auc': auc}
    
    auc_str = f"{auc:.4f}" if auc is not None else "N/A"
    print(f"   {name:<20} Accuracy: {accuracy:.4f}, AUC: {auc_str}")

# Best metric
best_metric = max(metric_results.items(), key=lambda x: x[1]['accuracy'])
print(f"\n✅ Best metric: {best_metric[0]} (Accuracy: {best_metric[1]['accuracy']:.4f})")

print(f"\n🎯 When to use each metric:")
print(f"   • Euclidean: Default choice, continuous features, geometric data")
print(f"   • Manhattan: Outlier-robust, grid-like data (images, city maps)")
print(f"   • Minkowski: Tunable p parameter (balance between L1/L2)")
print(f"   • Cosine: Text data (TF-IDF), high-dimensional embeddings, angle matters more than magnitude")

### 📝 What's Happening in This Code?

**Purpose:** Compare uniform voting (equal weights) vs distance-weighted voting

**Key Points:**
- **weights='uniform'**: All K neighbors vote equally (default)
- **weights='distance'**: Closer neighbors have more influence ($w_i = 1/d_i$)
- **Distance weighting**: Helps when nearest neighbors are very close (high confidence)
- **Trade-off**: Weighted can overfit to very close neighbors

**Why This Matters:** Distance weighting sometimes adds a point or two of accuracy when decision boundaries are complex, but the gain is dataset-dependent, so validate it.


In [None]:
print("⚖️ Uniform vs Distance-Weighted Voting\n")

# Uniform voting (all neighbors equal)
knn_uniform = KNeighborsClassifier(n_neighbors=best_k, weights='uniform', n_jobs=-1)
knn_uniform.fit(X_train_scaled, y_train)
y_pred_uniform = knn_uniform.predict(X_test_scaled)
acc_uniform = accuracy_score(y_test, y_pred_uniform)

# Distance-weighted voting (closer neighbors matter more)
knn_weighted = KNeighborsClassifier(n_neighbors=best_k, weights='distance', n_jobs=-1)
knn_weighted.fit(X_train_scaled, y_train)
y_pred_weighted = knn_weighted.predict(X_test_scaled)
acc_weighted = accuracy_score(y_test, y_pred_weighted)

print(f"✅ Voting Comparison (K={best_k}):")
print(f"   Uniform (equal weights):    Accuracy = {acc_uniform:.4f}")
print(f"   Distance-weighted:          Accuracy = {acc_weighted:.4f}")
print(f"   Improvement: {(acc_weighted - acc_uniform)*100:+.2f} percentage points")

print(f"\n🔍 How distance weighting works:")
print(f"   • sklearn's weights='distance' uses w_i = 1 / distance_i")
print(f"   • Closer neighbors get higher weight")
print(f"   • Example: distance=0.1 → weight=10, distance=1.0 → weight=1.0")
print(f"   • Prediction = class with the largest summed weight among the K neighbors")

print(f"\n🎯 When to use each:")
print(f"   • Uniform: Simple, robust, good starting point")
print(f"   • Distance-weighted: Complex boundaries, varying density, when nearest neighbor very close")

### 📝 What's Happening in This Code?

**Purpose:** Demonstrate why feature scaling is CRITICAL for KNN

**Key Points:**
- **Without scaling**: Feature with large range (0-1000) dominates distance calculation
- **With scaling**: All features contribute equally to distance
- **StandardScaler**: $z = (x - \mu) / \sigma$ (mean=0, std=1)
- **MinMaxScaler**: $x' = (x - x_{min}) / (x_{max} - x_{min})$ (range [0,1])
- **Impact**: When feature scales differ by orders of magnitude, scaling can change accuracy by tens of percentage points

**Why This Matters:** KNN without scaling is almost always wrong — ALWAYS scale features!
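
The two scaling formulas above can be checked by hand. The sketch below uses a tiny made-up matrix and compares how much each feature contributes to the squared distance between the first two rows, before and after standardization. As in `StandardScaler`, the population standard deviation is used.

```python
import numpy as np

# Hypothetical training matrix: feature 2 is about 200x larger than feature 1
X_tr = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])

# StandardScaler equivalent: z = (x - mean) / std, per column
mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
X_std = (X_tr - mu) / sigma

# MinMaxScaler equivalent: x' = (x - min) / (max - min), per column
x_min, x_max = X_tr.min(axis=0), X_tr.max(axis=0)
X_mm = (X_tr - x_min) / (x_max - x_min)

# Per-feature contribution to the squared distance between rows 0 and 1
raw_contrib = (X_tr[0] - X_tr[1]) ** 2
std_contrib = (X_std[0] - X_std[1]) ** 2

print("Raw contributions   :", raw_contrib)            # [1, 40000]: feature 2 dominates
print("Scaled contributions:", std_contrib.round(3))   # equal after standardization
```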


In [None]:
print("⚠️ Feature Scaling Sensitivity Demo\n")

# Create dataset with different feature scales
np.random.seed(42)
n_samples = 500
X_unscaled = np.column_stack([
    np.random.randn(n_samples) * 1,      # Feature 1: range ~[-3, 3]
    np.random.randn(n_samples) * 1000,   # Feature 2: range ~[-3000, 3000] (DOMINATES!)
    np.random.randn(n_samples) * 0.1     # Feature 3: range ~[-0.3, 0.3]
])
y_demo = (X_unscaled[:, 0] + X_unscaled[:, 1]/1000 + X_unscaled[:, 2]*10 > 0).astype(int)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_unscaled, y_demo, test_size=0.3, random_state=42
)

print(f"📊 Feature Ranges (training data):")
for i in range(X_train_demo.shape[1]):
    print(f"   Feature {i+1}: [{X_train_demo[:, i].min():.2f}, {X_train_demo[:, i].max():.2f}] "
          f"(std={X_train_demo[:, i].std():.2f})")

# Test WITHOUT scaling
print(f"\n❌ KNN WITHOUT Scaling:")
knn_no_scale = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
knn_no_scale.fit(X_train_demo, y_train_demo)
y_pred_no_scale = knn_no_scale.predict(X_test_demo)
acc_no_scale = accuracy_score(y_test_demo, y_pred_no_scale)
print(f"   Accuracy: {acc_no_scale:.4f}")
print(f"   Problem: Feature 2 (large range) dominates distance calculation!")
print(f"   Distance ≈ √((f1_diff)² + (f2_diff)² + (f3_diff)²)")
print(f"            ≈ √(1² + 1000² + 0.1²) ≈ 1000 (f2 dominates!)")

# Test WITH StandardScaler
print(f"\n✅ KNN WITH StandardScaler:")
scaler_std = StandardScaler()
X_train_std = scaler_std.fit_transform(X_train_demo)
X_test_std = scaler_std.transform(X_test_demo)

knn_std = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
knn_std.fit(X_train_std, y_train_demo)
y_pred_std = knn_std.predict(X_test_std)
acc_std = accuracy_score(y_test_demo, y_pred_std)
print(f"   Accuracy: {acc_std:.4f}")
print(f"   All features now have mean=0, std=1 → equal contribution")

# Test WITH MinMaxScaler
print(f"\n✅ KNN WITH MinMaxScaler:")
scaler_mm = MinMaxScaler()
X_train_mm = scaler_mm.fit_transform(X_train_demo)
X_test_mm = scaler_mm.transform(X_test_demo)

knn_mm = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
knn_mm.fit(X_train_mm, y_train_demo)
y_pred_mm = knn_mm.predict(X_test_mm)
acc_mm = accuracy_score(y_test_demo, y_pred_mm)
print(f"   Accuracy: {acc_mm:.4f}")
print(f"   All features now in range [0, 1] → equal contribution")

print(f"\n🎯 Results Summary:")
print(f"   No scaling:      {acc_no_scale:.4f} (POOR - dominated by large-scale feature)")
print(f"   StandardScaler:  {acc_std:.4f} (GOOD - mean=0, std=1)")
print(f"   MinMaxScaler:    {acc_mm:.4f} (GOOD - range [0,1])")
print(f"   Improvement:     {(acc_std - acc_no_scale)*100:+.2f} percentage points (scaling is CRITICAL!)")

print(f"\n✅ Best Practice: ALWAYS scale features before KNN!")

### 📝 What's Happening in This Code?

**Purpose:** Use KNN for regression (predict continuous values, not classes)

**Key Points:**
- **KNeighborsRegressor**: Same distance calculation, but averages K neighbor values
- **Prediction**: $\hat{y} = \frac{1}{K} \sum_{i \in \mathcal{N}_K(x)} y_i$ (simple average)
- **Weighted average**: Closer neighbors contribute more (weights='distance')
- **Evaluation**: MAE, RMSE, R² (regression metrics)
- **Use case**: Local patterns, non-parametric regression, missing data interpolation

**Why This Matters:** KNN regression captures local patterns that global models (linear regression) miss.
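
To illustrate the point about local patterns, the sketch below fits a straight line and a small from-scratch KNN regressor to a noiseless sine curve and compares their mean absolute error. The data, the helper `knn_regress_1d`, and the choice of K=5 are assumptions for demonstration only.

```python
import numpy as np

# Hypothetical non-linear target: one period of a sine wave
x_train = np.linspace(0.0, 2.0 * np.pi, 200)
y_train = np.sin(x_train)
x_query = np.linspace(0.1, 6.1, 50)
y_true = np.sin(x_query)

def knn_regress_1d(x_train, y_train, x_query, k=5):
    """Average the targets of the k nearest training points (1-D features)."""
    dists = np.abs(x_query[:, None] - x_train[None, :])   # (n_query, n_train)
    k_idx = np.argsort(dists, axis=1)[:, :k]
    return y_train[k_idx].mean(axis=1)

# Local model: KNN average
y_knn = knn_regress_1d(x_train, y_train, x_query, k=5)

# Global model: least-squares straight line
slope, intercept = np.polyfit(x_train, y_train, deg=1)
y_lin = slope * x_query + intercept

mae_knn = np.mean(np.abs(y_true - y_knn))
mae_lin = np.mean(np.abs(y_true - y_lin))
print(f"KNN MAE   : {mae_knn:.4f}")
print(f"Linear MAE: {mae_lin:.4f}")
```

The straight line cannot follow the curve, while the local average tracks it closely. On data that really is linear, the comparison would favor the global model.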


In [None]:
from sklearn.neighbors import KNeighborsRegressor

print("📊 KNN Regression Demo\n")

# Generate regression dataset
X_reg, y_reg = make_regression(n_samples=500, n_features=10, n_informative=8,
                               noise=15, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)

# Scale features (CRITICAL for KNN)
scaler_reg = StandardScaler()
X_train_reg_scaled = scaler_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_reg.transform(X_test_reg)

# Test different K values
k_values_reg = [1, 3, 5, 10, 20, 30]
results_reg = []

for k in k_values_reg:
    knn_reg = KNeighborsRegressor(n_neighbors=k, weights='distance', n_jobs=-1)
    knn_reg.fit(X_train_reg_scaled, y_train_reg)
    y_pred_reg = knn_reg.predict(X_test_reg_scaled)
    
    mae = mean_absolute_error(y_test_reg, y_pred_reg)
    rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))
    r2 = r2_score(y_test_reg, y_pred_reg)
    
    results_reg.append({'k': k, 'mae': mae, 'rmse': rmse, 'r2': r2})
    print(f"   K={k:2d}: MAE={mae:8.2f}, RMSE={rmse:8.2f}, R²={r2:6.4f}")

# Best K (by R²)
best_result = max(results_reg, key=lambda x: x['r2'])
print(f"\n✅ Best K: {best_result['k']} (R²={best_result['r2']:.4f})")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

k_vals = [r['k'] for r in results_reg]
mae_vals = [r['mae'] for r in results_reg]
rmse_vals = [r['rmse'] for r in results_reg]
r2_vals = [r['r2'] for r in results_reg]

axes[0].plot(k_vals, mae_vals, marker='o', linewidth=2, markersize=8)
axes[0].set_xlabel('K', fontweight='bold')
axes[0].set_ylabel('MAE', fontweight='bold')
axes[0].set_title('Mean Absolute Error vs K', fontweight='bold')
axes[0].grid(True, alpha=0.3)

axes[1].plot(k_vals, rmse_vals, marker='o', linewidth=2, markersize=8, color='orange')
axes[1].set_xlabel('K', fontweight='bold')
axes[1].set_ylabel('RMSE', fontweight='bold')
axes[1].set_title('Root Mean Squared Error vs K', fontweight='bold')
axes[1].grid(True, alpha=0.3)

axes[2].plot(k_vals, r2_vals, marker='o', linewidth=2, markersize=8, color='green')
axes[2].set_xlabel('K', fontweight='bold')
axes[2].set_ylabel('R²', fontweight='bold')
axes[2].set_title('R² Score vs K', fontweight='bold')
axes[2].grid(True, alpha=0.3)
axes[2].axhline(best_result['r2'], color='red', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

print(f"\n🎯 KNN Regression Insights:")
print(f"   • K=1: Perfect fit to training (high variance, overfitting)")
print(f"   • K=5-10: Good balance (captures local patterns)")
print(f"   • K=30+: Smooth predictions (high bias, underfitting)")
print(f"   • Weighted averaging (weights='distance') often helps; confirm with cross-validation")

## 🔬 Post-Silicon Application: Similar Device Detection for Root Cause Analysis

### Business Problem

**Scenario:** 50,000 devices tested, 200 fail with similar symptoms. Engineers need to find **all devices with similar parametric profiles** for root cause analysis.

**Traditional Approach (Manual):**
- Engineer looks at 25 parametric tests for each failure
- Manually searches for similar patterns across 50K devices
- Time: 2-4 hours per failure analysis
- Misses subtle similarities (human pattern recognition limits)

**KNN Approach (Automated):**
- Calculate distance from failed device to all 50K devices
- Return K=20 most similar devices (nearest neighbors)
- Visualize common patterns across similar failures
- Time: <1 second per query

### 💰 Business Value

- **Root cause speed**: 2 hours → 5 minutes (95% reduction)
- **Similar failure detection**: Find hidden patterns missed manually
- **Proactive identification**: Predict which devices might fail based on similarity to known failures
- **Cost impact**: $500K-2M per major failure investigation saved

---


### 📝 What's Happening in This Code?

**Purpose:** Generate realistic 50K device dataset with spatial and parametric features

**Key Points:**
- **50K devices**: Production-scale dataset
- **25 parametric tests**: Voltage, current, frequency, power, leakage, temperature, timing
- **Spatial and context features**: die_x, die_y (wafer position), plus equipment_id and lot_id
- **Failure modes**: Inject 3 distinct failure patterns (high leakage, timing violations, power anomalies)
- **Similarity groups**: Devices with similar test profiles cluster together

**Why This Matters:** KNN excels at finding similar instances in high-dimensional parametric test data.


In [None]:
print("🏭 Generating 50K Device Post-Silicon Dataset\n")

np.random.seed(42)
n_devices = 50000

# Spatial features
die_x = np.random.randint(0, 50, n_devices)
die_y = np.random.randint(0, 50, n_devices)
equipment_id = np.random.randint(1, 21, n_devices)  # 20 testers
lot_id = np.random.randint(1, 51, n_devices)  # 50 lots

# 25 Parametric tests (realistic semiconductor test parameters)
vdd_core = np.random.normal(1.0, 0.02, n_devices)  # Core voltage
vdd_io = np.random.normal(1.8, 0.03, n_devices)    # I/O voltage
idd_active = np.random.normal(50, 5, n_devices)    # Active current (mA)
idd_standby = np.random.normal(0.5, 0.1, n_devices) # Standby current (mA)
freq_max = np.random.normal(2000, 100, n_devices)  # Max frequency (MHz)
freq_min = np.random.normal(500, 50, n_devices)    # Min frequency (MHz)
power_active = np.random.normal(100, 10, n_devices) # Active power (mW)
power_idle = np.random.normal(5, 1, n_devices)     # Idle power (mW)
leakage_total = np.random.lognormal(1.0, 0.5, n_devices) # Total leakage (μA)
leakage_gate = np.random.lognormal(0.5, 0.3, n_devices)  # Gate leakage (μA)
temp_junction = np.random.normal(85, 5, n_devices) # Junction temp (°C)
temp_ambient = np.random.normal(25, 2, n_devices)  # Ambient temp (°C)
timing_setup = np.random.normal(1.5, 0.2, n_devices) # Setup time (ns)
timing_hold = np.random.normal(0.8, 0.1, n_devices)  # Hold time (ns)
timing_clk2q = np.random.normal(2.0, 0.3, n_devices) # Clock-to-Q (ns)
delay_rising = np.random.normal(1.2, 0.15, n_devices) # Rising edge delay (ns)
delay_falling = np.random.normal(1.1, 0.12, n_devices) # Falling edge delay (ns)
capacitance_in = np.random.normal(5.0, 0.5, n_devices) # Input capacitance (pF)
capacitance_out = np.random.normal(8.0, 0.8, n_devices) # Output capacitance (pF)
resistance_on = np.random.normal(50, 5, n_devices)  # On-resistance (Ω)
resistance_off = np.random.normal(1e6, 1e5, n_devices) # Off-resistance (Ω)
noise_margin_high = np.random.normal(0.4, 0.05, n_devices) # High noise margin (V)
noise_margin_low = np.random.normal(0.4, 0.05, n_devices)  # Low noise margin (V)
propagation_delay = np.random.normal(3.5, 0.4, n_devices)  # Total propagation (ns)
slew_rate = np.random.normal(2.0, 0.3, n_devices)   # Slew rate (V/ns)

# Inject 3 distinct failure modes (for similarity detection demo)
n_failures = 200
failure_indices = np.random.choice(n_devices, n_failures, replace=False)

# Failure Mode 1: High leakage (first 70 failures)
fm1_indices = failure_indices[:70]
leakage_total[fm1_indices] = np.random.lognormal(3.0, 0.3, 70)  # median ~7x higher (e^3 vs e^1)
leakage_gate[fm1_indices] = np.random.lognormal(2.5, 0.3, 70)   # median ~7x higher (e^2.5 vs e^0.5)
power_idle[fm1_indices] = np.random.normal(15, 2, 70)           # 3x higher

# Failure Mode 2: Timing violations (next 80 failures)
fm2_indices = failure_indices[70:150]
timing_setup[fm2_indices] = np.random.normal(0.5, 0.1, 80)      # Too fast (violation)
timing_hold[fm2_indices] = np.random.normal(0.3, 0.05, 80)      # Too fast
propagation_delay[fm2_indices] = np.random.normal(5.5, 0.5, 80) # Slow
freq_max[fm2_indices] = np.random.normal(1500, 100, 80)         # Can't reach max freq

# Failure Mode 3: Power anomalies (last 50 failures)
fm3_indices = failure_indices[150:]
power_active[fm3_indices] = np.random.normal(200, 20, 50)       # 2x higher
idd_active[fm3_indices] = np.random.normal(100, 10, 50)         # 2x higher
temp_junction[fm3_indices] = np.random.normal(105, 5, 50)       # Overheating

# Create DataFrame
df_ps = pd.DataFrame({
    'device_id': range(n_devices),
    'die_x': die_x,
    'die_y': die_y,
    'equipment_id': equipment_id,
    'lot_id': lot_id,
    'vdd_core': vdd_core,
    'vdd_io': vdd_io,
    'idd_active': idd_active,
    'idd_standby': idd_standby,
    'freq_max': freq_max,
    'freq_min': freq_min,
    'power_active': power_active,
    'power_idle': power_idle,
    'leakage_total': leakage_total,
    'leakage_gate': leakage_gate,
    'temp_junction': temp_junction,
    'temp_ambient': temp_ambient,
    'timing_setup': timing_setup,
    'timing_hold': timing_hold,
    'timing_clk2q': timing_clk2q,
    'delay_rising': delay_rising,
    'delay_falling': delay_falling,
    'capacitance_in': capacitance_in,
    'capacitance_out': capacitance_out,
    'resistance_on': resistance_on,
    'resistance_off': resistance_off,
    'noise_margin_high': noise_margin_high,
    'noise_margin_low': noise_margin_low,
    'propagation_delay': propagation_delay,
    'slew_rate': slew_rate
})

# Label failures
df_ps['is_failure'] = 0
df_ps.loc[failure_indices, 'is_failure'] = 1

# Failure mode labels
df_ps['failure_mode'] = 'Pass'
df_ps.loc[fm1_indices, 'failure_mode'] = 'High_Leakage'
df_ps.loc[fm2_indices, 'failure_mode'] = 'Timing_Violation'
df_ps.loc[fm3_indices, 'failure_mode'] = 'Power_Anomaly'

print(f"✅ Dataset Generated:")
print(f"   Total devices: {len(df_ps):,}")
print(f"   Parametric tests: 25 (voltage, current, frequency, timing, power, leakage, etc.)")
print(f"   Failures: {df_ps['is_failure'].sum():,} ({df_ps['is_failure'].mean()*100:.2f}%)")
print(f"\n📊 Failure Mode Distribution:")
print(df_ps['failure_mode'].value_counts())

print(f"\n🎯 Use Case: Find similar devices to any failed device for root cause analysis")
print(f"   Example: Device 12345 fails → Find K=20 nearest neighbors → Analyze common patterns")

### 📝 What's Happening in This Code?

**Purpose:** Use KNN to find similar devices for root cause analysis

**Key Points:**
- **Query device**: Select a failed device (e.g., high leakage failure)
- **kneighbors()**: Return distances and indices of K nearest neighbors
- **Feature space**: 25 parametric tests (scaled) define "similarity"
- **Interpretation**: Neighbors likely share same root cause
- **Validation**: Check if neighbors have same failure mode

**Why This Matters:** Automated similarity search replaces hours of manual analysis with <1 second query.


In [None]:
print("🔍 Similar Device Detection with KNN\n")

# Prepare features (25 parametric tests)
feature_cols = [col for col in df_ps.columns if col not in 
                ['device_id', 'is_failure', 'failure_mode', 'die_x', 'die_y', 'equipment_id', 'lot_id']]

X_ps = df_ps[feature_cols].values

# Scale features (CRITICAL for KNN)
scaler_ps = StandardScaler()
X_ps_scaled = scaler_ps.fit_transform(X_ps)

# Build KNN model (K=20 similar devices)
knn_ps = KNeighborsClassifier(n_neighbors=20, metric='euclidean', n_jobs=-1)
knn_ps.fit(X_ps_scaled, df_ps['is_failure'].values)

# Example: Find similar devices to a high leakage failure
query_idx = fm1_indices[0]  # First high leakage failure
query_device = X_ps_scaled[query_idx:query_idx+1]

print(f"🎯 Query Device: {df_ps.loc[query_idx, 'device_id']}")
print(f"   Failure Mode: {df_ps.loc[query_idx, 'failure_mode']}")
print(f"   Key Parameters:")
print(f"     Leakage Total: {df_ps.loc[query_idx, 'leakage_total']:.2f} μA (high!)")
print(f"     Leakage Gate:  {df_ps.loc[query_idx, 'leakage_gate']:.2f} μA (high!)")
print(f"     Power Idle:    {df_ps.loc[query_idx, 'power_idle']:.2f} mW (high!)")

# Find K=20 nearest neighbors
distances, neighbor_indices = knn_ps.kneighbors(query_device, n_neighbors=21)  # +1 for self
neighbor_indices = neighbor_indices[0][1:]  # Exclude self
distances = distances[0][1:]

print(f"\n🔍 20 Most Similar Devices:")
print(f"   {'Device':<10} {'Distance':<12} {'Failure Mode':<20} {'Leakage Total':<15}")
print(f"   {'-'*60}")

for neighbor_idx, dist in zip(neighbor_indices[:10], distances[:10]):
    device_id = df_ps.loc[neighbor_idx, 'device_id']
    failure_mode = df_ps.loc[neighbor_idx, 'failure_mode']
    leakage = df_ps.loc[neighbor_idx, 'leakage_total']
    print(f"   {device_id:<10} {dist:<12.4f} {failure_mode:<20} {leakage:<15.2f}")

print(f"   ... (showing 10 of 20)")

# Analyze neighbor failure modes
neighbor_failure_modes = df_ps.loc[neighbor_indices, 'failure_mode'].value_counts()
print(f"\n📊 Neighbor Failure Mode Distribution:")
for mode, count in neighbor_failure_modes.items():
    print(f"   {mode:<25} {count:3d} ({count/20*100:5.1f}%)")

# Success metric: How many neighbors have SAME failure mode?
query_mode = df_ps.loc[query_idx, 'failure_mode']
same_mode_count = neighbor_failure_modes.get(query_mode, 0)
# Chance baseline: share of ALL devices that have this specific failure mode
base_rate = (df_ps['failure_mode'] == query_mode).mean()
print(f"\n✅ Similarity Validation:")
print(f"   Neighbors with SAME failure mode: {same_mode_count}/20 ({same_mode_count/20*100:.1f}%)")
print(f"   Expected by chance: ~{base_rate*100:.2f}% (share of all devices with '{query_mode}')")
print(f"   Enrichment: {(same_mode_count/20) / base_rate:.1f}x")

print(f"\n💡 Root Cause Insight:")
print(f"   • {same_mode_count} neighbors share '{df_ps.loc[query_idx, 'failure_mode']}' failure mode")
print(f"   • Common pattern: High leakage + high idle power")
print(f"   • Likely root cause: Process defect (gate oxide quality)")
print(f"   • Actionable: Check wafer position, lot, equipment correlation")

print(f"\n⏱️ Performance:")
start_time = time.perf_counter()  # monotonic, high-resolution timer
_, _ = knn_ps.kneighbors(query_device, n_neighbors=21)
query_time = time.perf_counter() - start_time
print(f"   Query time: {query_time*1000:.2f}ms for 50K device search")
print(f"   Manual analysis: ~2 hours → Automated: <1 second (7200x faster!)")

### 📝 What's Happening in This Code?

**Purpose:** Visualize spatial distribution of similar devices on wafer map

**Key Points:**
- **Wafer map**: Scatter plot of die_x vs die_y positions
- **Query device**: Red star (failed device being analyzed)
- **Similar devices**: Colored by failure mode (shows clustering)
- **Spatial correlation**: If neighbors cluster spatially → process defect
- **Random distribution**: If neighbors scattered → parametric drift

**Why This Matters:** Spatial patterns reveal whether failures are location-dependent (process) or random (design).


In [None]:
print("🗺️ Wafer Map Visualization of Similar Devices\n")

fig, axes = plt.subplots(1, 2, figsize=(16, 7))

# Left plot: All devices
ax1 = axes[0]
for mode in df_ps['failure_mode'].unique():
    mask = df_ps['failure_mode'] == mode
    color = 'lightgray' if mode == 'Pass' else None
    alpha = 0.3 if mode == 'Pass' else 0.7
    ax1.scatter(df_ps[mask]['die_x'], df_ps[mask]['die_y'], 
                c=color, label=mode, alpha=alpha, s=10)

ax1.scatter(df_ps.loc[query_idx, 'die_x'], df_ps.loc[query_idx, 'die_y'],
            marker='*', s=500, c='red', edgecolors='black', linewidths=2,
            label='Query Device', zorder=10)

ax1.set_xlabel('Die X Position', fontweight='bold', fontsize=11)
ax1.set_ylabel('Die Y Position', fontweight='bold', fontsize=11)
ax1.set_title('Wafer Map: All 50K Devices', fontweight='bold', fontsize=13)
ax1.legend(loc='upper right', fontsize=9)
ax1.grid(True, alpha=0.3)

# Right plot: Query device + 20 nearest neighbors
ax2 = axes[1]

# Plot all devices in gray background
ax2.scatter(df_ps['die_x'], df_ps['die_y'], c='lightgray', alpha=0.2, s=10, label='All devices')

# Plot 20 nearest neighbors (colored by failure mode)
neighbor_data = df_ps.loc[neighbor_indices]
for mode in neighbor_data['failure_mode'].unique():
    mask = neighbor_data['failure_mode'] == mode
    ax2.scatter(neighbor_data[mask]['die_x'], neighbor_data[mask]['die_y'],
                label=f'{mode} (neighbor)', s=100, alpha=0.8, edgecolors='black', linewidths=1.5)

# Query device
ax2.scatter(df_ps.loc[query_idx, 'die_x'], df_ps.loc[query_idx, 'die_y'],
            marker='*', s=500, c='red', edgecolors='black', linewidths=2,
            label='Query Device', zorder=10)

ax2.set_xlabel('Die X Position', fontweight='bold', fontsize=11)
ax2.set_ylabel('Die Y Position', fontweight='bold', fontsize=11)
ax2.set_title('Query Device + 20 Nearest Neighbors', fontweight='bold', fontsize=13)
ax2.legend(loc='upper right', fontsize=9)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Spatial analysis
query_x = df_ps.loc[query_idx, 'die_x']
query_y = df_ps.loc[query_idx, 'die_y']
neighbor_data = df_ps.loc[neighbor_indices]

spatial_distances = np.sqrt((neighbor_data['die_x'] - query_x)**2 + 
                           (neighbor_data['die_y'] - query_y)**2)
mean_spatial_dist = spatial_distances.mean()

print(f"📍 Spatial Analysis:")
print(f"   Query device position: ({query_x}, {query_y})")
print(f"   Mean spatial distance to neighbors: {mean_spatial_dist:.2f} die positions")
print(f"   Wafer diagonal: {np.sqrt(50**2 + 50**2):.2f} die positions")
print(f"   Spatial clustering: {'YES (likely process defect)' if mean_spatial_dist < 20 else 'NO (parametric similarity)'}")

print(f"\n🎯 Interpretation:")
if mean_spatial_dist < 20:
    print(f"   ✅ Neighbors are spatially CLUSTERED → Process defect (localized issue)")
    print(f"   → Check: Wafer position, lithography, etch uniformity")
else:
    print(f"   ✅ Neighbors are spatially SCATTERED → Parametric similarity (not location-dependent)")
    print(f"   → Check: Design sensitivity, test conditions, measurement accuracy")

## ⚠️ The Curse of Dimensionality

### Problem: KNN Degrades in High Dimensions

**Mathematical Insight:**

As dimensionality $p$ increases, **distances between all points become similar**:

$$
\frac{d_{max} - d_{min}}{d_{min}} \to 0 \text{ as } p \to \infty
$$

- In high dimensions, **"nearest" and "farthest" neighbors have similar distances**
- Notion of "neighborhood" breaks down
- KNN predictions become unreliable, because the K "nearest" points are barely closer than any other point

**Volume Explosion:**

To capture 10% of uniformly distributed data, a hypercube neighborhood needs an edge length (as a fraction of each feature's range) of:
- 1D: 0.1 (10%)
- 2D: $\sqrt{0.1} \approx 0.32$ (32%)
- 10D: $0.1^{1/10} \approx 0.79$ (79%)
- 100D: $0.1^{1/100} \approx 0.977$ (98%)

**Implication:** In 100D, a neighborhood holding just 10% of the data must span about 98% of the range of every feature, so it is no longer "local".
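
The ranges listed above come from solving $s^p = 0.1$ for the edge length $s$. A minimal sketch that reproduces them, assuming data spread uniformly over a unit hypercube (the helper name `edge_fraction` is illustrative, not taken from any library):

```python
def edge_fraction(volume_fraction: float, p: int) -> float:
    """Edge length, as a fraction of each feature's range, of a hypercube
    that holds `volume_fraction` of uniformly distributed data in p dimensions."""
    return volume_fraction ** (1.0 / p)


for p in (1, 2, 10, 100):
    print(f"p={p:>3}: edge fraction = {edge_fraction(0.1, p):.3f}")
# p=  1: edge fraction = 0.100
# p=  2: edge fraction = 0.316
# p= 10: edge fraction = 0.794
# p=100: edge fraction = 0.977
```

Real data is rarely uniform, so these numbers are a worst-case illustration rather than a prediction for any particular dataset.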

---


### 📝 What's Happening in This Code?

**Purpose:** Demonstrate curse of dimensionality with distance distribution analysis

**Key Points:**
- **Test dimensions**: 2D, 10D, 50D, 100D, 500D
- **Distance distribution**: Calculate pairwise distances for 1000 random samples
- **Distance ratio**: $(d_{max} - d_{min}) / d_{min}$ measures discrimination
- **Expectation**: Ratio → 0 as dimensionality increases (all distances similar)
- **Practical limit**: KNN unreliable above ~50-100 dimensions without dimensionality reduction

**Why This Matters:** Understanding curse of dimensionality prevents misuse of KNN on high-dim data.


In [None]:
from scipy.spatial.distance import pdist

print("🌀 Curse of Dimensionality Demonstration\n")

dimensions = [2, 10, 50, 100, 500]
n_samples = 1000
results_dim = []

for p in dimensions:
    # Generate random data in p dimensions
    X_dim = np.random.randn(n_samples, p)
    
    # Calculate pairwise Euclidean distances
    distances = pdist(X_dim, metric='euclidean')
    
    d_min = distances.min()
    d_max = distances.max()
    d_mean = distances.mean()
    d_std = distances.std()
    
    # Distance ratio: measures discrimination
    ratio = (d_max - d_min) / d_min
    
    results_dim.append({
        'p': p,
        'd_min': d_min,
        'd_max': d_max,
        'd_mean': d_mean,
        'd_std': d_std,
        'ratio': ratio
    })
    
    print(f"   p={p:3d}: d_min={d_min:6.2f}, d_max={d_max:6.2f}, "
          f"d_mean={d_mean:6.2f}, ratio={ratio:6.4f}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Distance distributions for different dimensions
ax1 = axes[0]
for p in [2, 10, 50, 100]:
    X_demo = np.random.randn(500, p)
    distances_demo = pdist(X_demo, metric='euclidean')
    ax1.hist(distances_demo, bins=50, alpha=0.6, label=f'p={p}')

ax1.set_xlabel('Euclidean Distance', fontweight='bold', fontsize=11)
ax1.set_ylabel('Frequency', fontweight='bold', fontsize=11)
ax1.set_title('Distance Distributions: Low vs High Dimensions', fontweight='bold', fontsize=13)
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Right: Distance ratio vs dimensionality
ax2 = axes[1]
p_vals = [r['p'] for r in results_dim]
ratio_vals = [r['ratio'] for r in results_dim]

ax2.plot(p_vals, ratio_vals, marker='o', linewidth=2, markersize=10, color='red')
ax2.set_xlabel('Dimensionality (p)', fontweight='bold', fontsize=11)
ax2.set_ylabel('(d_max - d_min) / d_min', fontweight='bold', fontsize=11)
ax2.set_title('Distance Discrimination Degrades with Dimensionality', fontweight='bold', fontsize=13)
ax2.set_xscale('log')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n🎯 Observations:")
print(f"   • p=2:   Clear separation between nearest/farthest (ratio={results_dim[0]['ratio']:.2f})")
print(f"   • p=10:  Moderate discrimination (ratio={results_dim[1]['ratio']:.2f})")
print(f"   • p=100: Poor discrimination (ratio={results_dim[3]['ratio']:.2f})")
print(f"   • p=500: Almost no discrimination (ratio={results_dim[4]['ratio']:.2f})")

print(f"\n⚠️ Practical Implications:")
print(f"   • KNN works well: p < 20-30 (low/medium dimensions)")
print(f"   • KNN struggles: p > 50 (high dimensions)")
print(f"   • Solutions:")
print(f"     1. Dimensionality reduction: PCA or UMAP (reduce to p=10-20); keep t-SNE for visualization")
print(f"     2. Feature selection: Keep only most important features")
print(f"     3. Use different model: Random Forest, XGBoost (handle high-dim better)")
print(f"     4. Distance metric: Try Manhattan (L1) instead of Euclidean (L2)")

## 🎯 Real-World KNN Projects

### 🔬 Post-Silicon Validation Projects (4)

### **1. Similar Failure Detection for Root Cause Analysis**
**Objective:** Automated similarity search across 100K+ devices for failure investigation

**Features:**
- 25 parametric tests (voltage, current, frequency, timing, power, leakage)
- Spatial features: die_x, die_y, wafer_id
- Categorical: equipment_id, lot_id, test_program_version
- Query: Failed device → Find K=50 most similar devices
- Distance metric: Euclidean on scaled features

**Success Metrics:**
- Neighbor purity: >80% share same failure mode
- Root cause time: 2 hours → 5 minutes (95% reduction)
- False positives: <10% (irrelevant neighbors)
- Query latency: <100ms for 100K device search

**Business Value:** $500K-2M per major failure investigation saved

---

### **2. Reference Die Matching for Adaptive Testing**
**Objective:** Match new devices to reference database for test flow optimization

**Features:**
- First 5 parametric tests (early in test flow)
- Historical database: 1M+ reference devices with known outcomes
- KNN query: New device → Find K=100 similar historical devices
- Prediction: Skip unnecessary tests if all neighbors passed same tests
- Adaptive: Test flow customized based on similarity

**Success Metrics:**
- Test time reduction: 15-25% (skip redundant tests)
- Test escape rate: <0.1% (safety maintained)
- Throughput increase: 500K → 650K devices/day
- Database update: Real-time (new devices added to reference)

**Business Value:** $2-5M annual savings per production line
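
The skip rule for this project ("skip a test only if all neighbors passed it") could be sketched as follows. This is an illustration on synthetic data, not a production test flow: the reference database, the pass/fail rule, and the helper `can_skip_late_test` are all assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical reference database: 5 early-flow tests per device, plus the
# pass/fail outcome of one later test that we would like to skip when safe.
X_hist = rng.normal(size=(5000, 5))
late_test_passed = X_hist[:, 0] < 1.5  # synthetic rule: fails when test 0 is high

scaler = StandardScaler().fit(X_hist)
nn = NearestNeighbors(n_neighbors=100).fit(scaler.transform(X_hist))


def can_skip_late_test(early_results: np.ndarray) -> bool:
    """Skip only if every one of the K=100 nearest reference devices passed."""
    query = scaler.transform(early_results.reshape(1, -1))
    _, idx = nn.kneighbors(query)
    return bool(late_test_passed[idx[0]].all())


typical_device = np.zeros(5)                             # deep inside the passing region
marginal_device = np.array([1.4, 0.0, 0.0, 0.0, 0.0])    # close to the failing region

print("Skip for typical device: ", can_skip_late_test(typical_device))
print("Skip for marginal device:", can_skip_late_test(marginal_device))
```

Requiring unanimity among the neighbors is deliberately conservative: a single failing neighbor keeps the test in the flow, which trades some test-time savings for a lower escape risk.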

---

### **3. Wafer Map Spatial Similarity Clustering**
**Objective:** Identify spatial failure patterns using KNN neighborhood analysis

**Features:**
- Spatial: die_x, die_y (wafer coordinates)
- Parametric: 10 key tests (voltage, leakage, frequency)
- Combined distance: 50% spatial + 50% parametric (weighted metric)
- KNN neighborhoods: Find K=20 neighbors for each die
- Pattern detection: Neighborhoods with >80% failures → spatial defect

**Success Metrics:**
- Defect detection: Identify 95% of systematic patterns
- False alarms: <5% (avoid spurious patterns)
- Real-time: Analyze 40K die wafer in <5 seconds
- Visualization: Automatic wafer map highlighting

**Business Value:** Early detection → $10-30M saved per fab year
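
One possible way to realize the 50% spatial + 50% parametric weighting is to standardize each feature block and rescale it so that plain Euclidean distance on the concatenated features carries the desired weights. The sketch below assumes synthetic wafer data with one planted defect patch; `weighted_block` is a hypothetical helper, and the weighting scheme is one reasonable choice among several.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_dies = 2000

# Synthetic wafer (hypothetical data): die coordinates plus 10 parametric tests.
xy = rng.uniform(0, 50, size=(n_dies, 2))
params = rng.normal(size=(n_dies, 10))

# Plant one localized defect: dies in a corner patch fail and show a shift
# in the first parametric test.
in_patch = (xy[:, 0] < 8) & (xy[:, 1] < 8)
params[in_patch, 0] += 4.0
is_failure = in_patch.copy()


def weighted_block(block: np.ndarray, weight: float) -> np.ndarray:
    """Standardize a feature block and rescale it so that it contributes
    `weight` times its mean squared difference to the squared Euclidean distance."""
    z = StandardScaler().fit_transform(block)
    return z * np.sqrt(weight / block.shape[1])


X_combined = np.hstack([weighted_block(xy, 0.5), weighted_block(params, 0.5)])

k = 20
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_combined)
_, idx = nn.kneighbors(X_combined)
neighbor_fail_rate = is_failure[idx[:, 1:]].mean(axis=1)  # column 0 is the die itself

flagged = neighbor_fail_rate > 0.8
print(f"Dies whose neighborhood is >80% failures: {flagged.sum()}")
print(f"Of those, dies inside the planted patch:  {(flagged & in_patch).sum()}")
```

Dividing by the number of features in each block keeps the 10 parametric tests from outweighing the 2 spatial coordinates simply because there are more of them.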

---

### **4. Multi-Site Test Correlation via Nearest Neighbor Analysis**
**Objective:** Find similar devices across 3 test sites for correlation analysis

**Features:**
- Site A: 15 parametric tests (wafer test)
- Site B: 20 parametric tests (final test)
- Site C: 10 parametric tests (system test)
- KNN across sites: Device at Site B → Find similar at Site A and C
- Correlation analysis: Predict Site C results from Site A+B similarity

**Success Metrics:**
- Cross-site prediction: R² > 0.85
- Test elimination: Skip 30% of redundant tests at Site C
- Escapes detected: Catch 90% of discrepancies between sites
- Latency: <50ms per device query across 3 databases

**Business Value:** $5-15M annual test cost reduction

---

### 🌐 General AI/ML Projects (4)

### **5. Content-Based Recommendation Engine**
**Objective:** Recommend products based on item similarity (not collaborative filtering)

**Features:**
- Product attributes: price, category, brand, color, size (15 features)
- User viewing history: Last 10 viewed items
- KNN query: Current product → Find K=20 similar products
- Recommendation: Show similar items based on content, not user behavior
- Distance: Cosine similarity on one-hot encoded categorical features, Euclidean on scaled numerical features

**Success Metrics:**
- Click-through rate: 8-12% (vs 5% random)
- Conversion rate: 15-20% increase
- Latency: <20ms per recommendation
- Cold start: Works for new users (no behavior history needed)

**Business Value:** 10-15% revenue increase → $5-20M for mid-size e-commerce
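
A simpler alternative to mixing two distance functions is to one-hot encode the categorical attributes, scale the numerical ones, and run Euclidean KNN on the combined matrix. The sketch below takes that route (it is a swapped-in simplification, not the mixed metric described above); the tiny catalog and its column names are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical catalog; column names are illustrative only.
catalog = pd.DataFrame({
    "price":    [19.0, 21.0, 250.0, 240.0, 22.0, 18.0],
    "category": ["shirt", "shirt", "jacket", "jacket", "shirt", "jacket"],
    "brand":    ["A", "A", "B", "B", "B", "A"],
})

encode = ColumnTransformer([
    ("num", StandardScaler(), ["price"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category", "brand"]),
])
X_items = encode.fit_transform(catalog)

nn = NearestNeighbors(n_neighbors=3).fit(X_items)
_, idx = nn.kneighbors(X_items[[0]])           # query: product 0
similar = [i for i in idx[0] if i != 0][:2]    # drop the query product itself

print("Products most similar to product 0:")
print(catalog.iloc[similar])
```

With one-hot encoding, every categorical mismatch adds the same fixed amount to the distance, so the relative weight of price versus category is controlled by how the numerical columns are scaled.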

---

### **6. Medical Diagnosis via Case-Based Reasoning**
**Objective:** Find similar patient cases for diagnostic support

**Features:**
- Patient attributes: age, gender, BMI, blood pressure, lab results (50 features)
- Symptoms: 20 binary indicators (fever, cough, fatigue, etc.)
- Medical history: 10 categorical (diabetes, hypertension, etc.)
- KNN query: New patient → Find K=10 most similar past cases
- Diagnosis: Majority vote among neighbor diagnoses

**Success Metrics:**
- Diagnostic accuracy: 80-85% (comparable to junior doctors)
- Top-3 accuracy: 95% (true diagnosis in top 3 suggestions)
- Confidence: High when neighbors agree (>80% same diagnosis)
- Explanation: Show doctor the 10 similar cases for validation

**Business Value:** Diagnostic support → Reduce misdiagnosis 20-30%
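
The majority vote and the "high confidence when neighbors agree" rule could be sketched as below. The patient data is synthetic (three well-separated blobs standing in for diagnoses), and `suggest_diagnosis` is a hypothetical helper; a real system would need clinically validated features and thresholds.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for past cases (hypothetical data): three well-separated
# "diagnoses" described by 6 numeric features.
centers = np.array([[0.0] * 6, [6.0] * 6, [-6.0] * 6])
X_cases, y_cases = make_blobs(n_samples=600, centers=centers,
                              cluster_std=1.5, random_state=0)

scaler = StandardScaler().fit(X_cases)
knn = KNeighborsClassifier(n_neighbors=10).fit(scaler.transform(X_cases), y_cases)


def suggest_diagnosis(patient):
    """Return (majority label, neighbor agreement, indices of the 10 similar cases)."""
    q = scaler.transform(np.asarray(patient, dtype=float).reshape(1, -1))
    proba = knn.predict_proba(q)[0]      # share of neighbors voting for each class
    label = knn.classes_[np.argmax(proba)]
    _, idx = knn.kneighbors(q)
    return int(label), float(proba.max()), idx[0]


label, agreement, similar_cases = suggest_diagnosis(centers[1])
confidence = "high" if agreement > 0.8 else "low"
print(f"Suggested diagnosis: {label}, agreement: {agreement:.0%} ({confidence} confidence)")
print("Similar past cases to review:", similar_cases)
```

Returning the neighbor indices alongside the label is what makes the suggestion auditable: the reviewing doctor can open the ten matching cases instead of trusting a bare prediction.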

---

### **7. Anomaly Detection via Isolation Distance**
**Objective:** Detect anomalies by finding samples far from all neighbors

**Features:**
- Transaction data: amount, merchant, location, time, user behavior (30 features)
- KNN query: Each transaction → Find K=20 nearest neighbors
- Anomaly score: Average distance to K neighbors (high = anomaly)
- Threshold: Flag top 1% as suspicious (tunable)
- Real-time: Score new transactions in <10ms

**Success Metrics:**
- Fraud detection: 85-90% recall at 5% false positive rate
- Novel fraud: Catch new patterns (distance-based, not rule-based)
- Latency: <10ms (real-time authorization)
- Explainability: Show why transaction is anomalous (far from all neighbors)

**Business Value:** Block $10-50M fraud annually, reduce false declines 20%
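
The anomaly score described above (average distance to the K nearest neighbors, with the top 1% flagged) can be sketched in a few lines. The data here is synthetic, with outliers planted far from the bulk so the effect is easy to see; real transaction features would need encoding and scaling choices of their own.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic stand-in for transaction features (hypothetical data):
# 2,000 ordinary points plus 20 planted outliers far from the bulk.
ordinary = rng.normal(0, 1, size=(2000, 5))
planted = rng.uniform(6, 10, size=(20, 5)) * rng.choice([-1, 1], size=(20, 5))
X = StandardScaler().fit_transform(np.vstack([ordinary, planted]))
is_planted = np.r_[np.zeros(2000, dtype=bool), np.ones(20, dtype=bool)]

k = 20
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)            # column 0 is each point's distance to itself
score = dist[:, 1:].mean(axis=1)      # anomaly score: mean distance to K neighbors

threshold = np.quantile(score, 0.99)  # flag the top 1% (tunable)
flagged = score > threshold
recall = (flagged & is_planted).sum() / is_planted.sum()
print(f"Flagged {flagged.sum()} of {len(X)} points")
print(f"Recall on planted outliers: {recall:.0%}")
```

To score a new transaction that is not in the index, query with `n_neighbors=k` and average all returned distances, since there is no self-match to drop.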

---

### **8. Image Search via Feature Similarity (CNN Embeddings + KNN)**
**Objective:** Find visually similar images using deep learning embeddings

**Features:**
- CNN embeddings: ResNet50 global-average-pooling output, the layer just before the classifier (2048-dim vectors)
- Dimensionality reduction: PCA to 128-dim (curse of dimensionality mitigation)
- KNN index: FAISS library for billion-scale search
- Query: Input image → Extract embedding → Find K=50 nearest neighbors
- Distance: Cosine similarity (angle-based)

**Success Metrics:**
- Precision@10: >80% (8 out of 10 results are relevant)
- Query latency: <50ms for 10M image database
- Scale: 1B+ images with approximate NN (FAISS)
- User satisfaction: 4.2/5 stars for result quality

**Business Value:** Power visual search for e-commerce, stock photo sites
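
The query path (embedding → PCA to 128 dimensions → cosine nearest neighbors) can be sketched with scikit-learn alone. Random vectors stand in for real CNN embeddings here, and brute-force search stands in for a FAISS index, so this shows the data flow rather than production-scale performance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Random vectors standing in for 2048-dim CNN embeddings (hypothetical data;
# real embeddings would come from a pretrained network).
embeddings = rng.normal(size=(600, 2048)).astype(np.float32)

pca = PCA(n_components=128, random_state=0).fit(embeddings)
reduced = pca.transform(embeddings)

# Brute-force cosine search; an approximate index would replace this step
# at larger scale.
index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(reduced)

# Query: a slightly perturbed copy of item 17, as if the same image had been
# re-encoded with small changes.
query = embeddings[17] + 0.05 * rng.normal(size=2048).astype(np.float32)
dist, idx = index.kneighbors(pca.transform(query.reshape(1, -1)))

print("Nearest items:", idx[0])
print("Cosine distances:", np.round(dist[0], 4))
```

The same PCA object must be applied to both the indexed images and every query; fitting a separate projection per query would place them in incompatible spaces.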

---


## ✅ Key Takeaways: KNN Mastery

### 🎯 When to Use KNN

**✅ KNN Excels When:**
- Small-medium datasets (<100K samples)
- Low-medium dimensions (p < 20-30 features)
- Interpretability matters ("You're like these 5 examples")
- Non-parametric (no assumptions about data distribution)
- Similarity search is the goal (recommendation, case-based reasoning)
- Local patterns matter (complex, non-linear decision boundaries)

**❌ Avoid KNN When:**
- Large datasets (>1M samples) — prediction too slow
- High dimensions (p > 50) — curse of dimensionality
- Real-time inference (<1ms) — distance calculation bottleneck
- Features cannot be put on a common scale: unscaled large-range features dominate the distance
- Memory constrained — stores ALL training data
- Categorical features dominate — use CatBoost or one-hot + scale

---

### 🔑 Critical Success Factors

1. **ALWAYS Scale Features**
   - Use StandardScaler or MinMaxScaler
   - Unscaled features → large-range features dominate distance
   - Impact: 20-40% accuracy difference

2. **Choose K via Cross-Validation**
   - Small K (1-3): High variance, overfitting
   - Large K (50+): High bias, underfitting
   - Rule of thumb: Start with K = √N
   - Use odd K for binary classification (avoid ties)

3. **Distance Metric Matters**
   - Euclidean: Default, continuous features
   - Manhattan: Outlier-robust, grid-like data
   - Cosine: Text, embeddings, high-dimensional
   - Experiment: 5-15% accuracy variation

4. **Weighted Voting Usually Better**
   - weights='distance': Closer neighbors more influential
   - Improvement: 1-3% over uniform voting
   - Use for complex boundaries

5. **Mitigate Curse of Dimensionality**
   - Dimensionality reduction: PCA or UMAP (p → 10-20); keep t-SNE for visualization, since it cannot project new samples
   - Feature selection: Keep only important features
   - Alternative: Switch to Random Forest, XGBoost (handle high-dim)
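
The first success factor is easy to verify experimentally. The sketch below builds a synthetic two-feature problem in which the informative feature has a small range and a pure-noise feature has a range about 1000x larger, then compares cross-validated accuracy with and without scaling. The data is an assumption chosen to make the effect visible; the size of the gap on real data will vary.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)

# Feature 0 carries the class signal on a small scale;
# feature 1 is pure noise on a scale about 1000x larger.
signal = (2 * y - 1) + rng.normal(0, 0.5, size=n)
noise = rng.normal(0, 1000, size=n)
X = np.column_stack([signal, noise])

raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

acc_raw = cross_val_score(raw_knn, X, y, cv=5).mean()
acc_scaled = cross_val_score(scaled_knn, X, y, cv=5).mean()
print(f"Unscaled accuracy: {acc_raw:.3f}")
print(f"Scaled accuracy:   {acc_scaled:.3f}")
```

Without scaling, the neighbor search is driven almost entirely by the noise feature, so accuracy stays near chance; after standardization both features contribute on equal terms.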

---

### 📊 Performance Expectations

| Dataset Size | Dimensionality | Training Time | Prediction Time | Accuracy vs RF/XGB |
|--------------|----------------|---------------|-----------------|--------------------|
| <10K         | p < 20         | Instant       | <10ms           | Comparable         |
| 10K-100K     | p < 20         | Instant       | 10-100ms        | Comparable         |
| >100K        | p < 20         | Instant       | >100ms          | Similar, but slow  |
| Any          | p > 50         | Instant       | Any             | Poor (curse of dim)|

*Training is near-instant (the model stores the data and, for tree-based search, builds an index); prediction is the bottleneck*

---

### 🔧 Best Practices

1. **Always use StandardScaler/MinMaxScaler before KNN**
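   ```python
   from sklearn.preprocessing import StandardScaler
   scaler = StandardScaler()
   X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
   X_test_scaled = scaler.transform(X_test)        # reuse training statistics
   ```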

2. **Cross-validation for K selection**
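   ```python
   from sklearn.model_selection import cross_val_score
   from sklearn.neighbors import KNeighborsClassifier
   k_values = [1, 3, 5, 7, 9, 11, 15, 20]
   # Mean 5-fold CV accuracy for each candidate K
   mean_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
                  for k in k_values]
   best_k = k_values[int(np.argmax(mean_scores))]
   ```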

3. **Use KD-tree for low dimensions (<20), brute force for high**
   ```python
   knn = KNeighborsClassifier(algorithm='auto')  # sklearn chooses best
   ```

4. **For large datasets, use approximate NN (FAISS, Annoy)**
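   ```python
   import faiss
   import numpy as np
   # IndexFlatL2 is an exact brute-force index; FAISS expects float32 arrays
   index = faiss.IndexFlatL2(d)  # d = dimensionality
   index.add(np.ascontiguousarray(X_train, dtype=np.float32))  # Add training data
   # D holds squared L2 distances, I the neighbor indices
   D, I = index.search(np.ascontiguousarray(X_test, dtype=np.float32), 10)
   # For approximate search, swap in e.g. faiss.IndexHNSWFlat(d, 32)
   ```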

5. **Dimensionality reduction for p > 50**
   ```python
   from sklearn.decomposition import PCA
   pca = PCA(n_components=20)
   X_reduced = pca.fit_transform(X)
   knn.fit(X_reduced, y)  # KNN on reduced features
   ```
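
The practices above can be combined so that scaling is refit inside every cross-validation fold, which avoids leaking test-fold statistics into training. A minimal sketch using scikit-learn's built-in wine dataset as a stand-in for real data; the parameter grid is an illustrative choice, not a recommendation.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)    # 178 samples, 13 features, 3 classes

pipe = Pipeline([
    ("scale", StandardScaler()),     # refit on the training part of every fold
    ("knn", KNeighborsClassifier()),
])

param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11, 15],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],                # 1 = Manhattan, 2 = Euclidean
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Because the scaler lives inside the pipeline, `search.predict(X_new)` applies the same scaling automatically, so there is no separate `transform` call to forget at inference time.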

---

### 🚀 Next Steps

1. **024_Support_Vector_Machines.ipynb** - Kernel methods, margin optimization
2. **025_Naive_Bayes.ipynb** - Probabilistic classification
3. **026_K_Means_Clustering.ipynb** - Unsupervised learning with KNN-like distance
4. **030_Dimensionality_Reduction.ipynb** - PCA, t-SNE, UMAP for curse mitigation

---

### 🎓 What You've Mastered

✅ **Instance-based learning** - No training, similarity-based prediction  
✅ **Distance metrics** - Euclidean, Manhattan, Minkowski, Cosine  
✅ **K selection** - Bias-variance trade-off, cross-validation  
✅ **Feature scaling** - CRITICAL for KNN success  
✅ **Voting strategies** - Uniform vs distance-weighted  
✅ **Curse of dimensionality** - Why KNN fails in high dimensions  
✅ **KNN regression** - Local averaging for continuous predictions  
✅ **Similarity search** - Root cause analysis, recommendation systems  
✅ **Production optimization** - KD-tree, FAISS, approximate NN  
✅ **Business applications** - $500K-50M impact across domains  

You now understand when KNN shines and when to use alternatives! 🎉


## 📚 References and Further Reading

### Original Papers

1. **Nearest Neighbor Pattern Classification** (1967)  
   Cover & Hart, IEEE Transactions on Information Theory  
   Classic paper establishing KNN foundations

2. **The Curse of Dimensionality in Data Mining and Time Series Prediction** (2005)  
   Verleysen & François, IWANN 2005  
   Comprehensive analysis of dimensionality problems

3. **An Investigation of Practical Approximate Nearest Neighbor Algorithms** (2004)  
   Liu et al., NeurIPS 2004  
   Comparison of fast NN search algorithms

### Official Documentation

4. **sklearn Neighbors Module**  
   https://scikit-learn.org/stable/modules/neighbors.html

5. **KNeighborsClassifier API**  
   https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

6. **KNeighborsRegressor API**  
   https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

### Fast Nearest Neighbor Libraries

7. **FAISS (Facebook AI Similarity Search)**  
   https://github.com/facebookresearch/faiss  
   Billion-scale approximate NN, GPU support

8. **Annoy (Approximate Nearest Neighbors Oh Yeah)**  
   https://github.com/spotify/annoy  
   Spotify's library for fast NN search

9. **NMSLIB (Non-Metric Space Library)**  
   https://github.com/nmslib/nmslib  
   Fast NN for various distance metrics

### Books

10. **The Elements of Statistical Learning** (Hastie, Tibshirani, Friedman)  
    Chapter 13: Prototype Methods (includes KNN)

11. **Pattern Recognition and Machine Learning** (Bishop)  
    Section 2.5.2: Nearest-Neighbour Methods (within Section 2.5, Nonparametric Methods)

### Related Notebooks

- **024_Support_Vector_Machines.ipynb** (next) - Kernel method whose predictions depend on a subset of training points (support vectors)
- **026_K_Means_Clustering.ipynb** - Similar distance-based approach
- **030_Dimensionality_Reduction.ipynb** - Mitigation for curse of dimensionality
- **010_Linear_Regression.ipynb** - Contrast with parametric models
- **016_Decision_Trees.ipynb** - Alternative for high-dim data

---

**Notebook Complete!** ✅  
**Next:** 024_Support_Vector_Machines.ipynb - Kernel methods and margin optimization
