# Week 8 - K-Nearest Neighbors and distance metrics

Euclidean, Manhattan, Minkowski, Cosine


In [1]:
!pip install statsmodels


In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import statsmodels.api as sm
import networkx as nx

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.metrics import classification_report, confusion_matrix

In [3]:
# Dataset 1
df = pd.read_csv("diabetes_012_health_indicators_BRFSS2015.csv")

# Dataset 2
df_pima = pd.read_csv("pima_indian_diabetes_dataset.csv") 

# Dataset 1

K-Nearest Neighbors with the following distance metrics: Euclidean, Manhattan, Minkowski, Cosine
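As a quick reference for what each metric measures, the sketch below computes all four for a pair of toy vectors with `scipy.spatial.distance`. The vectors `a` and `b` are made up for illustration and are not drawn from either dataset, and `p=3` is an arbitrary choice for the Minkowski order.

```python
import numpy as np
from scipy.spatial import distance

# Toy vectors (illustrative only, not taken from either dataset)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 8.0])

d_euclidean = distance.euclidean(a, b)       # sqrt(sum((a - b)**2))
d_manhattan = distance.cityblock(a, b)       # sum(|a - b|)
d_minkowski = distance.minkowski(a, b, p=3)  # (sum(|a - b|**p))**(1/p)
d_cosine = distance.cosine(a, b)             # 1 - cosine similarity

for name, value in [
    ("Euclidean", d_euclidean),
    ("Manhattan", d_manhattan),
    ("Minkowski (p=3)", d_minkowski),
    ("Cosine", d_cosine),
]:
    print(f"{name:<16} {value:.4f}")

# Minkowski generalizes the first two: p=1 is Manhattan, p=2 is Euclidean
print(np.isclose(distance.minkowski(a, b, p=1), d_manhattan))
print(np.isclose(distance.minkowski(a, b, p=2), d_euclidean))
```

Cosine distance depends only on the angle between the two vectors and ignores their magnitude, while the other three depend on the coordinate differences.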

In [5]:
# --- KNN Classification on a Balanced 15,000-Row Sample (Dataset 1) ---

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils import resample
import numpy as np

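# --- 1. Create a balanced sample: 5,000 rows per class (15,000 rows in total) ---
# Note: replace=True draws with replacement, so a class with fewer than 5,000 rows
# is oversampled, and duplicated rows can land in both the train and test splits.
# Sampling each group in a list comprehension avoids the groupby.apply deprecation warning.
df_small = pd.concat(
    [grp.sample(n=5000, random_state=42, replace=True)
     for _, grp in df.groupby('Diabetes_012')]
)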


X = df_small.drop(columns=['Diabetes_012'])
y = df_small['Diabetes_012']

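# --- 2. Scale features ---
# Note: the scaler is fit on all rows before the split, so test-set statistics
# leak into training; fitting it on X_train only (or inside a Pipeline) avoids this.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)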

# --- 3. Split data ---
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=42
)

# --- 4. Evaluate multiple distance metrics ---
distance_metrics = ['euclidean', 'manhattan', 'minkowski', 'cosine']

for metric in distance_metrics:
    print(f"\n🔹 Distance Metric: {metric.upper()}")
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    
    # 5-fold cross-validation
    cv_scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1)
    print(f"Cross-Validation Accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
    
    # Fit and test
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.3f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))



🔹 Distance Metric: EUCLIDEAN
Cross-Validation Accuracy: 0.501 ± 0.013
Test Accuracy: 0.509
Classification Report:
              precision    recall  f1-score   support

         0.0       0.52      0.59      0.55      1000
         1.0       0.49      0.52      0.50      1000
         2.0       0.53      0.41      0.46      1000

    accuracy                           0.51      3000
   macro avg       0.51      0.51      0.51      3000
weighted avg       0.51      0.51      0.51      3000


🔹 Distance Metric: MANHATTAN
Cross-Validation Accuracy: 0.508 ± 0.013
Test Accuracy: 0.524
Classification Report:
              precision    recall  f1-score   support

         0.0       0.53      0.61      0.57      1000
         1.0       0.49      0.53      0.51      1000
         2.0       0.56      0.43      0.49      1000

    accuracy                           0.52      3000
   macro avg       0.53      0.52      0.52      3000
weighted avg       0.53      0.52      0.52      3000


🔹 Distance Metric: MINKOWSKI
Cross-Va

I used 5,000 samples per class (15,000 rows in total) because the full dataset of more than 200,000 rows took too long to run. With 33,333 samples per class, the run failed with a memory error.
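Sampling with `replace=True` before splitting can put copies of the same row in both the training and test sets, which may inflate test accuracy. One alternative, sketched below, is to hold out the test set first and then downsample each class in the training set without replacement. This is an illustration under stated assumptions, not the sampling used above: the data come from `make_classification`, and the class weights, column names, and sizes are all hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced 3-class table (hypothetical data)
X_arr, y_arr = make_classification(
    n_samples=6000, n_features=10, n_informative=5, n_classes=3,
    weights=[0.8, 0.05, 0.15], random_state=42,
)
demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
demo["target"] = y_arr

# 1. Hold out a test set first so no test row can reappear in training
train_df, test_df = train_test_split(
    demo, test_size=0.2, stratify=demo["target"], random_state=42
)

# 2. Downsample every class in the training set without replacement,
#    capped at the size of the smallest class
n_per_class = train_df["target"].value_counts().min()
balanced_train = pd.concat(
    [grp.sample(n=n_per_class, random_state=42)
     for _, grp in train_df.groupby("target")]
)

print(balanced_train["target"].value_counts().sort_index())
print("Rows shared between train and test:",
      len(balanced_train.index.intersection(test_df.index)))
```

The test set keeps the original class proportions here, so accuracy on it reflects the imbalanced population rather than a balanced one.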

# Dataset 2

In [6]:
# --- KNN Classification with Multiple Distance Metrics (Dataset 2) ---

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# --- 1. Prepare data ---
X = df_pima.drop(columns=['Outcome'])
y = df_pima['Outcome']

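# --- 2. Scale features ---
# Note: the scaler is fit on all rows before the split, so test-set statistics
# leak into training; fitting it on X_train only (or inside a Pipeline) avoids this.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)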

# --- 3. Train/test split ---
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=42
)

# --- 4. Define distance metrics to test ---
distance_metrics = ['euclidean', 'manhattan', 'minkowski', 'cosine']

# --- 5. Run KNN for each distance metric ---
for metric in distance_metrics:
    print(f"\n🔹 Distance Metric: {metric.upper()}")
    
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    
    # Cross-validation accuracy (5 folds)
    cv_scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1)
    print(f"Cross-Validation Accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
    
    # Fit on full training data
    knn.fit(X_train, y_train)
    
    # Test set evaluation
    y_pred = knn.predict(X_test)
    print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.3f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))



🔹 Distance Metric: EUCLIDEAN
Cross-Validation Accuracy: 0.733 ± 0.014
Test Accuracy: 0.708
Classification Report:
              precision    recall  f1-score   support

           0       0.76      0.80      0.78       100
           1       0.59      0.54      0.56        54

    accuracy                           0.71       154
   macro avg       0.68      0.67      0.67       154
weighted avg       0.70      0.71      0.70       154


🔹 Distance Metric: MANHATTAN
Cross-Validation Accuracy: 0.730 ± 0.026
Test Accuracy: 0.734
Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.83      0.80       100
           1       0.64      0.56      0.59        54

    accuracy                           0.73       154
   macro avg       0.71      0.69      0.70       154
weighted avg       0.73      0.73      0.73       154


🔹 Distance Metric: MINKOWSKI
Cross-Validation Accuracy: 0.733 ± 0.014
Test Accuracy: 0.708
Classification Report:
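The Minkowski results above match the Euclidean ones exactly. That is expected: in scikit-learn, `metric='minkowski'` uses the default `p=2`, which is the Euclidean distance, so the loop effectively tests Euclidean twice. To evaluate a genuinely different Minkowski metric, `p` has to be set explicitly. A minimal sketch on synthetic data (the dataset, the helper `knn_predict`, and the choice `p=3` are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only to compare metric settings
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

def knn_predict(**kwargs):
    """Fit a 5-NN classifier with the given metric settings and predict on X_te."""
    model = KNeighborsClassifier(n_neighbors=5, **kwargs)
    return model.fit(X_tr, y_tr).predict(X_te)

pred_euclidean = knn_predict(metric="euclidean")
pred_manhattan = knn_predict(metric="manhattan")
pred_minkowski_default = knn_predict(metric="minkowski")  # p defaults to 2
pred_minkowski_p1 = knn_predict(metric="minkowski", p=1)
pred_minkowski_p3 = knn_predict(metric="minkowski", p=3)  # a distinct metric

print("minkowski (default) == euclidean:",
      np.array_equal(pred_minkowski_default, pred_euclidean))
print("minkowski (p=1) == manhattan:",
      np.array_equal(pred_minkowski_p1, pred_manhattan))
print("p=3 test accuracy:", (pred_minkowski_p3 == y_te).mean())
```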
 

NOTES 

KNN doesn’t handle class imbalance well: with majority voting, neighborhoods tend to be dominated by the majority class.
- What I did: stratify=y, a parameter of train_test_split that keeps class proportions consistent across the train and test sets. It preserves the imbalance rather than correcting it.
- Could also try distance-weighted KNN: KNeighborsClassifier(weights='distance') gives closer neighbors more influence than distant ones. This is easy to try and may help minority-class points that sit close to the query, but it weights neighbors, not classes.
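To see what `stratify` does in practice, the sketch below compares the minority-class share of the test set with and without it. The label vector is hypothetical (900 rows of class 0, 100 of class 1) and the single feature is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 900 of class 0, 100 of class 1
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# Plain random split: the minority share in the test set varies with the seed
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Stratified split: the test set keeps the overall 10% minority share
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

print("Minority share, no stratify:", y_test_plain.mean())
print("Minority share, stratified: ", y_test_strat.mean())
```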

| Dataset                      | Situation                                          | Recommendation                                                                                                 |
| ---------------------------- | -------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| **Dataset 1 (Diabetes_012)** | Heavily imbalanced (class 0 dominates)             | 🔹 Use **resampling** or **weights='distance'** in KNN.<br>🔹 Keep the balanced 15,000-row subset used above. |
| **Dataset 2 (Pima)**         | Moderately imbalanced (about 65/35), ~770 rows     | ✅ Stratification is likely enough; no resampling needed.                                                      |


In [None]:
# Distance-weighted KNN: closer neighbors get larger votes,
# which may reduce the bias toward the majority class
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean', weights='distance')


Increasing k: the model becomes smoother (less sensitive to noise). Accuracy may improve slightly, but the model can underfit if k is too large because it loses local detail.

Decreasing k: the model becomes more flexible and fits local variations better, but it can overfit and become noisy. With k = 1, each training point is its own nearest neighbor, so the model memorizes the training set.
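The effect of k can be checked by comparing training and test accuracy across a few values. The sketch below uses synthetic data from `make_classification` (an assumption; neither diabetes dataset is used), so the numbers it prints say nothing about the results above. The values of k are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, somewhat noisy data (flip_y randomly reassigns ~10% of labels)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1
)

train_acc, test_acc = {}, {}
for k in (1, 5, 25, 101):
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_acc[k] = knn_k.score(X_tr, y_tr)
    test_acc[k] = knn_k.score(X_te, y_te)
    print(f"k={k:>3}  train={train_acc[k]:.3f}  test={test_acc[k]:.3f}")
```

With k = 1 the training accuracy is 1.0 whenever no two identical points carry different labels, which is why training accuracy alone is a poor guide for choosing k.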

Common starting range: k ≈ 3–10, though the best value depends on the dataset and its size.
You can tune it with GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': range(3, 21)}, cv=5).
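A minimal sketch of that search, assuming a `Pipeline` so the scaler is refit inside each cross-validation fold. It runs on scikit-learn's bundled breast-cancer dataset purely as a stand-in; applying it to either dataset above would mean passing that dataset's unscaled training split instead. The extra `weights` entry in the grid is an optional addition, not something the notes call for.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (not one of the diabetes datasets)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Scaling lives inside the pipeline, so each fold scales on its own training part
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

param_grid = {
    "knn__n_neighbors": list(range(3, 21)),
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy:    {search.score(X_te, y_te):.3f}")
```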

TO DO 

- analyze the output results and compare to models from previous week 
- test different k numbers 
- tune k 