**K-Nearest Neighbors (KNN) is a supervised, non-parametric, and instance-based (lazy learning) machine learning algorithm.
It is used for classification and regression problems.**

# Core Idea of KNN

**A data point is classified or predicted based on the K closest data points in the training dataset.**

Classification → Majority voting

Regression → Average of neighbors’ values

# How KNN Works

1. Choose the value of K

2. Calculate the distance between the new data point and all training points

3. Select the K nearest neighbors

4. Make a prediction:

Classification → Most common class

Regression → Mean of target values
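
The four steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `knn_predict`, its `task` parameter, and the toy data are assumptions made up for this example, and Euclidean distance is assumed as the metric.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict the target of x_new from its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x_new = np.asarray(x_new, dtype=float)

    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]

    # Step 4: majority vote (classification) or mean (regression)
    if task == "classification":
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    return float(neighbor_targets.mean())


# Hypothetical toy data: two well-separated clusters
X_toy = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]
y_class = ["A", "A", "A", "B", "B", "B"]
y_value = [10, 12, 11, 50, 52, 51]

print(knn_predict(X_toy, y_class, [2, 2], k=3))                          # A
print(knn_predict(X_toy, y_value, [6.5, 6.5], k=3, task="regression"))   # 51.0
```

Note that step 1 (choosing K) is simply the `k` argument here; nothing is learned before prediction time, which is why KNN is called a lazy learner.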

# Distance Metrics in Machine Learning

**Distance metrics measure how similar or dissimilar two data points are.
They are the foundation of algorithms like KNN, K-Means, DBSCAN, Hierarchical Clustering, etc.**

## Why Distance Metrics Matter

**Machine learning models don’t understand meaning — only numbers.
Distance metrics help answer:**

“How close are two observations?”

Wrong metric → wrong neighbors → wrong prediction

**1. Euclidean Distance (Most Common)**

Straight-line (as-the-crow-flies) distance

Points: (2,3) and (5,7)

In [1]:
import numpy as np

In [2]:
np.sqrt((2-5)**2+(3-7)**2)

np.float64(5.0)

Use When

1. Continuous numeric data

2. Features on similar scale

**Sensitive to outliers, because each coordinate difference is squared before summing**

**2. Manhattan Distance (City Block Distance)**

Distance traveled on grid roads (no diagonals)

In [3]:
abs(2-5)+abs(3-7)

7

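Use When

1. Grid-like data

2. Outliers are present (absolute differences are not squared, so it is more robust to outliers than Euclidean distance)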
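
To make the outlier claim concrete, the sketch below measures how much of each distance comes from a single extreme feature. The two points are hypothetical values chosen for illustration, with the third coordinate acting as the outlier.

```python
import numpy as np

# Hypothetical points: the third feature differs far more than the others
x = np.array([0.0, 0.0, 0.0])
y = np.array([1.0, 1.0, 10.0])

diff = np.abs(x - y)

manhattan = diff.sum()                   # 1 + 1 + 10
euclidean = np.sqrt((diff ** 2).sum())   # sqrt(1 + 1 + 100)

# Share of each distance's sum that is due to the outlying feature
manhattan_share = diff[2] / diff.sum()
euclidean_share = diff[2] ** 2 / (diff ** 2).sum()

print(f"Manhattan: {manhattan:.2f}, outlier share of sum: {manhattan_share:.0%}")
print(f"Euclidean: {euclidean:.2f}, outlier share of squared sum: {euclidean_share:.0%}")
```

Because squaring amplifies large differences, the outlying feature accounts for a larger share of the Euclidean sum than of the Manhattan sum.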

**3. Minkowski Distance (Generalized Distance)**

$$
d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}
$$


| p value | Distance  |
| ------- | --------- |
| p = 1   | Manhattan |
| p = 2   | Euclidean |
| p → ∞   | Chebyshev |
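
The table can be checked numerically with a small helper. This is a sketch: the function name `minkowski` is chosen for this example (SciPy ships its own implementation in `scipy.spatial.distance`), and it reuses the points (2,3) and (5,7) from the examples above.

```python
import numpy as np


def minkowski(x, y, p):
    """Minkowski distance of order p; p = np.inf gives the Chebyshev distance."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        # Limit p -> infinity: the largest coordinate difference
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))


a, b = (2, 3), (5, 7)

print(minkowski(a, b, 1))        # Manhattan: 7.0
print(minkowski(a, b, 2))        # Euclidean: 5.0
print(minkowski(a, b, 50))       # large p: already very close to 4
print(minkowski(a, b, np.inf))   # Chebyshev: 4.0
```

As p grows, the largest coordinate difference (here |3 - 7| = 4) dominates the sum, which is why the limit is the Chebyshev distance.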


**KNN for Classification (Example)**

Problem: Email → Spam or Not Spam

K = 5

3 neighbors → Spam

2 neighbors → Not Spam

Prediction: Spam
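
The vote in this example can be reproduced with `collections.Counter`. The order of labels in the list below is an assumption for illustration; only the 3-to-2 split comes from the example.

```python
from collections import Counter

# Labels of the K = 5 nearest neighbors (3 Spam, 2 Not Spam)
neighbor_labels = ["Spam", "Not Spam", "Spam", "Spam", "Not Spam"]

votes = Counter(neighbor_labels)
prediction, count = votes.most_common(1)[0]

print(prediction)                    # Spam
print(count / len(neighbor_labels))  # 0.6, the fraction of neighbors that agree
```

The vote fraction (0.6 here) is a rough indication of how confident the prediction is.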

**KNN for Regression (Example)**

Problem: Predict house price 

K = 3

Neighbor prices:

₹50L

₹55L

₹60L

Prediction = (50 + 55 + 60) / 3 = ₹55L
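
The same calculation can be sketched with scikit-learn's `KNeighborsRegressor`. The house areas below are hypothetical values invented so that the three nearest neighbors of the query have the prices from the example (50, 55, and 60 lakh); the other two houses are also made up.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: area in square feet -> price in lakh rupees
areas = np.array([[800], [1000], [1050], [1100], [1500]])
prices = np.array([35, 50, 55, 60, 90])

model = KNeighborsRegressor(n_neighbors=3)  # uniform weights: plain average
model.fit(areas, prices)

# The three nearest areas to 1040 are 1050, 1000, and 1100
predicted_price = float(model.predict([[1040]])[0])
print(predicted_price)  # 55.0
```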

## Choosing the Value of K

| K Value       | Effect                            |
| ------------- | --------------------------------- |
| Small K (1–3) | High variance, sensitive to noise |
| Large K       | High bias, smoother boundary      |
| Odd K         | Avoids ties in binary classification (ties remain possible with 3+ classes) |


Best Practice: Use cross-validation to choose the optimal K
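
One way to apply this is to score several candidate values of K with cross-validation and keep the best one. The sketch below assumes the Iris dataset, 5-fold cross-validation, and odd K values from 1 to 21; all three are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

scores_by_k = {}
for k in range(1, 22, 2):  # odd values 1, 3, ..., 21
    # The pipeline refits the scaler inside each fold, so no information
    # from the validation fold leaks into the scaling step
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores_by_k[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores_by_k, key=scores_by_k.get)
print(f"Best K: {best_k}, mean CV accuracy: {scores_by_k[best_k]:.3f}")
```

`GridSearchCV` from `sklearn.model_selection` automates the same loop when more than one hyperparameter is being tuned.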

**Feature Scaling Is Mandatory**

KNN is distance-based → scaling is crucial

Use:

Standardization (Z-score)

Min-Max Scaling

Without scaling → features with larger numeric ranges dominate the distance, which biases the choice of neighbors
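
The effect is easy to see on a tiny example. The three people below are hypothetical, with features on very different scales (age in years, income in rupees). Before scaling, income alone decides who is nearest; after standardization, age counts as well and the nearest neighbor changes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: [age in years, income in rupees]
people = np.array([
    [25, 500_000],   # person 0
    [60, 510_000],   # person 1: very different age, similar income
    [26, 515_000],   # person 2: similar age, slightly larger income gap
], dtype=float)


def euclidean(a, b):
    return float(np.sqrt(((a - b) ** 2).sum()))


# Raw features: the income difference swamps the age difference
raw_d01 = euclidean(people[0], people[1])
raw_d02 = euclidean(people[0], people[2])
print(f"raw:    d(0,1) = {raw_d01:.1f}, d(0,2) = {raw_d02:.1f}")

# Standardized features: both columns have mean 0 and unit variance
scaled = StandardScaler().fit_transform(people)
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])
print(f"scaled: d(0,1) = {scaled_d01:.3f}, d(0,2) = {scaled_d02:.3f}")
```

On the raw data person 1 is the nearest neighbor of person 0 despite the 35-year age gap; after scaling, person 2 is nearer.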

**Advantages of KNN**

✔ Simple & intuitive
✔ No explicit training phase (the model only stores the training data)
✔ Handles multi-class problems naturally
✔ Works well with small datasets

**Disadvantages of KNN**

❌ Slow prediction on large datasets (distances to the training points are computed at query time)
❌ High memory usage
❌ Sensitive to noise & outliers
❌ Feature scaling required

**When to Use KNN**

✅ Small to medium datasets
✅ Low-dimensional data
✅ Benchmarking / baseline model
❌ Very large or high-dimensional datasets

In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

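# Load data
X, y = load_iris(return_X_y=True)

# Train-test split first, so the test set stays unseen during scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Scaling: fit on the training set only, then apply to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)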

# KNN Model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Prediction
print(knn.predict(X_test))


Sample output (the split is random, so the predicted labels vary between runs):

[1 1 0 2 2 2 2 2 2 2 1 0 0 2 2 1 1 1 2 0 1 0 2 2 0 1 0 2 2 0]
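
Printing raw predictions does not show how good the model is. A possible follow-up, sketched below, compares the predictions with the true test labels. The fixed `random_state=42`, the stratified split, and the use of a pipeline are assumptions added here for reproducibility and are not part of the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Fixed seed and stratified split so that the result is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data only
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")
```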
