
# K-Nearest Neighbors Classification: Code Walk Through

This notebook walks through the **computational steps** of the K-Nearest Neighbors (KNN) classification algorithm.

## What We'll Cover:
1. **Visualize the data** - understand the dataset
2. **Calculate distances** - measure similarity between points
3. **Find K nearest neighbors** - identify closest training points
4. **Make prediction** - use majority voting

We'll show both **loop versions** (to understand the logic) and **vectorized NumPy versions** (for efficiency).

## Step 1: Import Libraries

We need:
- **NumPy** for numerical operations
- **Matplotlib** for visualization
- **sklearn.metrics.pairwise_distances** for efficient distance calculation

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances

## Step 2: Create Training Data

We have:
- **10 training points** with **2 features** each
- **2 classes**: class 0 and class 1
- First 5 points belong to class 0
- Last 5 points belong to class 1

In [None]:
# Training data: 10 points with 2 features
X_train = np.array( [ [1.536, 3.554],   # Point 0, class 0
                      [1.771, 2.783],   # Point 1, class 0
                      [2.506, 2.880],   # Point 2, class 0
                      [2.652, 4.545],   # Point 3, class 0
                      [3.590, 3.784],   # Point 4, class 0
                      [1.279, 1.443],   # Point 5, class 1
                      [2.000, 2.325],   # Point 6, class 1
                      [2.096, 0.583],   # Point 7, class 1
                      [2.539, 1.541],   # Point 8, class 1
                      [3.251, 0.080] ] ) # Point 9, class 1

# Labels: which class each point belongs to (0 or 1)
y_train = np.array( [0, 0, 0, 0, 0, 1, 1, 1, 1, 1] )

print("Training data shape:", X_train.shape)  # (10, 2) = 10 points, 2 features
print("Labels shape:", y_train.shape)         # (10,) = 10 labels
print("\nFirst few training points:")
print(X_train[:3])
print("\nCorresponding labels:")
print(y_train[:3])

## Step 3: Visualize the Data

Let's plot our training data to see how it's distributed in 2D space.

**Note:** We start by just looking at the data points, **without worrying about their classes yet**.

In [None]:
# Simple scatter plot of all training points
plt.figure(figsize=(8, 6))
plt.scatter(X_train[:,0], X_train[:,1],
           c='steelblue', s=100, alpha=0.6,
           edgecolors='black', linewidths=1.5)
plt.xlabel('Feature 1 ($x_1$)', fontsize=12)
plt.ylabel('Feature 2 ($x_2$)', fontsize=12)
plt.title('Training Data Visualization', fontsize=14)
plt.grid(True, alpha=0.3)
plt.axis([0, 5, 0, 5])
plt.show()

print(f"We have {len(X_train)} training points in 2D space")

## Step 4: Define Test Point

Now we have a new point **[2.0, 2.0]** that we want to classify.

**Question:** Should this point be classified as class 0 or class 1?

KNN will answer this by finding the K nearest training points and using **majority voting**.

In [None]:
# Test point: a new point we want to classify
X_test = np.array([[2.0, 2.0]])

print("Test point:", X_test[0])
print("Shape:", X_test.shape)  # (1, 2) = 1 point, 2 features

# Visualize test point with training data
plt.figure(figsize=(8, 6))
plt.scatter(X_train[:,0], X_train[:,1],
           c='steelblue', s=100, alpha=0.6,
           edgecolors='black', linewidths=1.5,
           label='Training points')
plt.scatter(X_test[:,0], X_test[:,1],
           c='red', s=300, marker='*',
           edgecolors='black', linewidths=2,
           label='Test point')
plt.xlabel('Feature 1 ($x_1$)', fontsize=12)
plt.ylabel('Feature 2 ($x_2$)', fontsize=12)
plt.title('Test Point to Classify', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.axis([0, 5, 0, 5])
plt.show()

## Step 5: Calculate Distances

To find nearest neighbors, we need to calculate the distance from the test point to each training point.

We'll use **Euclidean distance**:

$$d = \sqrt{(x_1 - x_1')^2 + (x_2 - x_2')^2}$$
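The formula above covers the two-feature case used in this notebook. As a sketch of the general case, assuming $n$ features (the symbols $n$ and $j$ are introduced here for illustration and do not appear elsewhere in the notebook):

$$d(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{j=1}^{n} (x_j - x_j')^2}$$

With $n = 2$ this reduces to the expression above.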

### Manual Calculation Example

Let's manually calculate the distance from test point **[2.0, 2.0]** to the **first training point [1.536, 3.554]**:

In [None]:
# Manual calculation for first training point
test_point = X_test[0]       # [2.0, 2.0]
first_train_point = X_train[0]  # [1.536, 3.554]

print("Test point:         ", test_point)
print("First training point:", first_train_point)
print()

# Step 1: Calculate differences
diff_1 = test_point[0] - first_train_point[0]
diff_2 = test_point[1] - first_train_point[1]
print("Step 1 - Differences:")
print(f"  Feature 1: {test_point[0]:.3f} - {first_train_point[0]:.3f} = {diff_1:.3f}")
print(f"  Feature 2: {test_point[1]:.3f} - {first_train_point[1]:.3f} = {diff_2:.3f}")
print()

# Step 2: Square the differences
squared_1 = diff_1 ** 2
squared_2 = diff_2 ** 2
print("Step 2 - Square the differences:")
print(f"  ({diff_1:.3f})² = {squared_1:.3f}")
print(f"  ({diff_2:.3f})² = {squared_2:.3f}")
print()

# Step 3: Sum the squared differences
sum_squared = squared_1 + squared_2
print("Step 3 - Sum:")
print(f"  {squared_1:.3f} + {squared_2:.3f} = {sum_squared:.3f}")
print()

# Step 4: Take square root
distance = np.sqrt(sum_squared)
print("Step 4 - Square root:")
print(f"  √{sum_squared:.3f} = {distance:.3f}")
print()
print(f"Distance from test point to first training point: {distance:.3f}")
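
As a quick cross-check of the four manual steps, the same distance can be obtained in one call with `np.linalg.norm`, which computes the Euclidean (L2) norm of a vector by default. This is only a sketch: the names `p`, `q`, and `d` are chosen here for illustration and are not used elsewhere in the notebook.

```python
import numpy as np

p = np.array([2.0, 2.0])      # the test point
q = np.array([1.536, 3.554])  # the first training point

# Euclidean (L2) norm of the difference vector = Euclidean distance
d = np.linalg.norm(p - q)
print(f"{d:.4f}")  # 1.6218
```

The result should agree with the value printed at the end of the manual calculation.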

### Approach 1: Using a Loop (Explicit Logic)

Now let's calculate distances to **all** training points using a loop.

This shows the logic clearly: we go through each training point one by one.

In [None]:
# Calculate distances using a loop
distances_loop = []

for i in range(len(X_train)):
    # Get the training point
    train_point = X_train[i]

    # Calculate difference for each feature
    diff = test_point - train_point

    # Square the differences
    squared_diff = diff ** 2

    # Sum and take square root
    distance = np.sqrt(np.sum(squared_diff))

    # Store the distance
    distances_loop.append(distance)

    print(f"Distance to point {i}: {distance:.4f}")

# Convert to numpy array
distances_loop = np.array(distances_loop)
print(f"\nDistances shape: {distances_loop.shape}")

### Approach 2: Vectorized NumPy (Broadcasting)

Instead of looping, we can use **NumPy broadcasting** to calculate all distances at once!

**Key idea:** 
- When we compute `X_test - X_train`, with `X_test` of shape 1×2 and `X_train` of shape 10×2, NumPy automatically broadcasts the single test row against every training row
- This creates a 10×2 array of differences - one row for each training point
- Then we square, sum across features (axis=1), and take the square root

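This is typically **much faster** than a Python loop for large datasets, because the arithmetic runs inside NumPy's compiled routines instead of one iteration at a time in the interpreter!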
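To get a feel for the speed difference, here is a rough timing sketch. It assumes a hypothetical larger dataset of 20,000 random points (`X_big`) and a query point `x_query`; both names are made up for this example, and the exact timings will vary from machine to machine.

```python
import time
import numpy as np

# Hypothetical larger dataset: random values, for illustration only
rng = np.random.default_rng(0)
X_big = rng.random((20_000, 2))   # 20,000 points, 2 features
x_query = np.array([[2.0, 2.0]])  # one query point, shape (1, 2)

# Loop version: one distance per iteration
start = time.perf_counter()
d_loop = np.array([np.sqrt(np.sum((x_query[0] - row) ** 2)) for row in X_big])
t_loop = time.perf_counter() - start

# Vectorized version: broadcasting (1, 2) against (20000, 2)
start = time.perf_counter()
d_vec = np.sqrt(((x_query - X_big) ** 2).sum(axis=1))
t_vec = time.perf_counter() - start

print("Results match:", np.allclose(d_loop, d_vec))
print(f"Loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s")
```

Both versions compute the same distances; only the time taken differs.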

In [None]:
# Calculate distances using vectorized NumPy operations
# Step 1: Calculate differences (broadcasting automatically expands dimensions)
diff = X_test - X_train
print("Differences shape:", diff.shape)  # (10, 2) - one row per training point
print("First few differences:")
print(diff[:3])
print()

# Step 2: Square the differences
squared_diff = diff ** 2
print("Squared differences shape:", squared_diff.shape)  # (10, 2)
print()

# Step 3: Sum across features (axis=1)
sum_squared = squared_diff.sum(axis=1)
print("Sum of squared differences shape:", sum_squared.shape)  # (10,)
print("Sum of squared differences:")
print(sum_squared)
print()

# Step 4: Take square root
distances_numpy = np.sqrt(sum_squared)
print("Distances (NumPy vectorized):")
print(distances_numpy)
print()

# Or more concisely in one line:
distances_numpy_compact = np.sqrt(((X_test - X_train) ** 2).sum(axis=1))
print("Same result (compact version):")
print(distances_numpy_compact)
print()

# Verify both approaches give same result
print("Loop and NumPy results match:", np.allclose(distances_loop, distances_numpy))

### Approach 3: Using sklearn's `pairwise_distances`

For convenience, sklearn provides `pairwise_distances` which does all of this for us.

This library function uses Euclidean distance by default and supports other distance metrics through its `metric` parameter.

**Note:** `pairwise_distances` returns a 2D array (matrix of distances), so we use `.ravel()` to flatten it to 1D.
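As a small illustration of the 2D output shape and of switching metrics, here is a sketch on toy points whose distances are easy to check by hand. The arrays `A` and `B` are made up for this example; `metric="manhattan"` is one of the metric names that `pairwise_distances` accepts.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Toy points chosen so the distances are easy to verify by hand
A = np.array([[0.0, 0.0]])              # one query point
B = np.array([[3.0, 4.0], [1.0, 1.0]])  # two reference points

d_euclidean = pairwise_distances(A, B)                      # default metric: Euclidean
d_manhattan = pairwise_distances(A, B, metric="manhattan")  # sum of absolute differences

print(d_euclidean.shape)    # (1, 2): one row per query point, one column per reference point
print(d_euclidean.ravel())  # first entry is 5.0 (the 3-4-5 triangle)
print(d_manhattan.ravel())  # first entry is 7.0 (3 + 4)
```

The rest of this notebook keeps the default Euclidean metric.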

In [None]:
# Calculate distances using pairwise_distances from sklearn
distances_2d = pairwise_distances(X_test, X_train)
print("2D array shape:", distances_2d.shape)  # (1, 10) = 1 test point, 10 training points
print("2D array:")
print(distances_2d)
print()

# Flatten to 1D array using .ravel()
distances_sklearn = distances_2d.ravel()
print("1D array shape:", distances_sklearn.shape)  # (10,)
print("1D array:")
print(distances_sklearn)
print()

# Verify all three approaches give same result
print("All approaches match:")
print("  Loop vs NumPy:", np.allclose(distances_loop, distances_numpy))
print("  NumPy vs sklearn:", np.allclose(distances_numpy, distances_sklearn))

## Step 6: Find K Nearest Neighbors

Now we have distances to all training points. We need to find the **5 closest points** (K=5).

**How do we find them?**
We need to:
1. Sort the distances from smallest to largest
2. Get the **indices** (positions) of the 5 smallest distances

### What is `argsort()`?

`argsort()` returns the **indices** that would sort an array, not the sorted values themselves.

**Example:**
- Array: [4.5, 2.1, 7.3, 1.8, 3.2]
- `argsort()` returns: [3, 1, 4, 0, 2]
- This means: index 3 has the smallest value (1.8), then index 1 (2.1), then index 4 (3.2), etc.
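The example above can be checked directly. The sketch below also shows `np.argpartition` as an optional alternative when only the K smallest entries are needed and their order does not matter. It is not used in the rest of this notebook, and the names `values`, `order`, and `smallest_k` are chosen just for this example.

```python
import numpy as np

values = np.array([4.5, 2.1, 7.3, 1.8, 3.2])

# argsort gives the indices that would sort the array
order = np.argsort(values)
print(order)          # [3 1 4 0 2]
print(values[order])  # [1.8 2.1 3.2 4.5 7.3]

# Optional alternative: argpartition finds the k smallest without a full sort.
# The first k indices are the k smallest values, in no guaranteed order.
k = 2
smallest_k = np.argpartition(values, k)[:k]
print(sorted(smallest_k.tolist()))  # [1, 3]
```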

In [None]:
# Let's see which indices argsort returns
sorted_indices = np.argsort(distances_sklearn)
print("All indices sorted by distance:")
print(sorted_indices)
print()

# Get the first 5 indices (K=5 nearest neighbors)
K = 5
nearest_indices = sorted_indices[:K]
print(f"Indices of {K} nearest neighbors:")
print(nearest_indices)
print()

# Show the actual distances
print(f"Distances to these {K} nearest neighbors:")
for i, idx in enumerate(nearest_indices):
    print(f"  Neighbor {i+1}: point {idx}, distance = {distances_sklearn[idx]:.4f}")

## Step 7: Get Labels of Nearest Neighbors

Now we know **which** training points are closest: their indices are stored in `nearest_indices`.

Let's see what **classes** these neighbors belong to.

In [None]:
# Get the labels of the K nearest neighbors
neighbor_labels = y_train[nearest_indices]

print(f"Labels of {K} nearest neighbors:")
print(neighbor_labels)
print()

# Show details
print("Detailed view:")
for i, idx in enumerate(nearest_indices):
    print(f"  Neighbor {i+1}: training point {idx}, class = {y_train[idx]}, distance = {distances_sklearn[idx]:.4f}")

## Step 8: Make Prediction via Majority Voting

Now we have the labels of the K nearest neighbors, stored in `neighbor_labels`.

**Voting:**
- Count how many neighbors belong to each class
- The class with the most votes wins!

### Approach 1: Manual Counting

In [None]:
# Count votes manually
count_class_0 = 0
count_class_1 = 0

for label in neighbor_labels:
    if label == 0:
        count_class_0 += 1
    elif label == 1:
        count_class_1 += 1

print("Vote counts:")
print(f"  Class 0: {count_class_0} votes")
print(f"  Class 1: {count_class_1} votes")
print()

# Determine winner
if count_class_0 > count_class_1:
    prediction_manual = 0
    print(f"Winner: Class 0 (with {count_class_0} votes)")
else:
    prediction_manual = 1
    print(f"Winner: Class 1 (with {count_class_1} votes)")

print(f"\nPredicted class: {prediction_manual}")

### Approach 2: Using NumPy's `unique()`

`np.unique()` with `return_counts=True` counts occurrences of each unique value.

This is more concise than manual counting, and it works for any number of classes without adding extra `if` branches.
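One detail worth knowing: with K=5 and two classes a tie cannot happen, but with an even K it can, and the two approaches then disagree. The sketch below uses a hypothetical 2-2 tie (`tied_labels` is made up for illustration) to show that `np.argmax` picks the first maximum, while the `if`/`else` rule from the manual approach falls through to class 1.

```python
import numpy as np

# Hypothetical K=4 neighborhood with a 2-2 tie
tied_labels = np.array([0, 1, 1, 0])

labels, counts = np.unique(tied_labels, return_counts=True)
print(labels, counts)  # [0 1] [2 2]

# np.argmax returns the first maximum, so the smaller label wins the tie
winner_argmax = labels[np.argmax(counts)]

# The manual if/else rule: class 0 only on a strict majority, otherwise class 1
winner_if_else = 0 if counts[0] > counts[1] else 1

print(winner_argmax, winner_if_else)  # 0 1
```

Choosing an odd K for a two-class problem is a common way to avoid this situation.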

In [None]:
# Count votes using np.unique
unique_labels, vote_counts = np.unique(neighbor_labels, return_counts=True)

print("Unique labels found:", unique_labels)
print("Vote counts:", vote_counts)
print()

# Show the mapping
print("Vote summary:")
for label, count in zip(unique_labels, vote_counts):
    print(f"  Class {label}: {count} votes")
print()

# Find the class with most votes using argmax
winner_index = np.argmax(vote_counts)
prediction_numpy = unique_labels[winner_index]

print(f"Winner index: {winner_index}")
print(f"Predicted class: {prediction_numpy}")
print()

# Verify both approaches match
print(f"Manual and NumPy predictions match: {prediction_manual == prediction_numpy}")

## Step 9: Visualize the Result

Let's visualize the test point, its K nearest neighbors, and the prediction.

In [None]:
# Visualize the classification result
plt.figure(figsize=(10, 8))

# Plot all training points by class
plt.scatter(X_train[y_train==0,0], X_train[y_train==0,1],
           c='blue', s=100, alpha=0.3, label='Class 0 (training)', edgecolors='black')
plt.scatter(X_train[y_train==1,0], X_train[y_train==1,1],
           c='orange', s=100, alpha=0.3, label='Class 1 (training)', edgecolors='black')

# Highlight the K nearest neighbors
nearest_neighbors_X = X_train[nearest_indices]
plt.scatter(nearest_neighbors_X[:,0], nearest_neighbors_X[:,1],
           c='green', s=200, marker='s', alpha=0.7,
           edgecolors='darkgreen', linewidths=2, label=f'{K} Nearest Neighbors')

# Plot test point
test_color = 'blue' if prediction_numpy == 0 else 'orange'
plt.scatter(X_test[:,0], X_test[:,1],
           c=test_color, s=400, marker='*',
           edgecolors='black', linewidths=2,
           label=f'Test point (predicted: class {prediction_numpy})')

plt.xlabel('Feature 1 ($x_1$)', fontsize=12)
plt.ylabel('Feature 2 ($x_2$)', fontsize=12)
plt.title(f'KNN Classification Result (K={K})', fontsize=14)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.axis([0, 5, 0, 5])
plt.show()

print(f"Test point {X_test[0]} is classified as class {prediction_numpy}")
print(f"The winning class received {vote_counts[winner_index]} of {K} neighbor votes")

## Summary

We've walked through all the computational steps of KNN Classification:

1. ✅ **Visualized data** - saw training points in 2D space
2. ✅ **Calculated distances** - computed Euclidean distance from test point to each training point
3. ✅ **Found K nearest neighbors** - used `argsort()` to find indices of 5 closest points
4. ✅ **Made prediction** - used majority voting among the K neighbors
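The steps above can be gathered into one small function. This is a minimal sketch, not a replacement for a library implementation: `knn_predict` is a hypothetical helper name, and `X_demo` / `y_demo` simply repeat the training data from Step 2 so the block runs on its own.

```python
import numpy as np

def knn_predict(X, y, x_query, k=5):
    """Hypothetical helper: classify one query point by majority vote."""
    # Step 1: Euclidean distance from the query to every training point
    distances = np.sqrt(((X - x_query) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: count the labels of those neighbors and pick the most common
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)], nearest

# Same training data as in Step 2
X_demo = np.array([[1.536, 3.554], [1.771, 2.783], [2.506, 2.880],
                   [2.652, 4.545], [3.590, 3.784], [1.279, 1.443],
                   [2.000, 2.325], [2.096, 0.583], [2.539, 1.541],
                   [3.251, 0.080]])
y_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

pred, nearest = knn_predict(X_demo, y_demo, np.array([2.0, 2.0]), k=5)
print(pred, nearest)  # 1 [6 8 1 5 2]
```

Ties are resolved by `np.argmax`, which favors the smaller label.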

### Key NumPy and sklearn Operations Used:

- **`pairwise_distances(X_test, X_train)`** - efficiently calculates all distances
- **`.ravel()`** - flattens 2D array to 1D
- **`np.argsort(distances)`** - returns indices that would sort the array
- **`array[indices]`** - fancy indexing to select multiple elements
- **`np.unique(labels, return_counts=True)`** - counts occurrences of each value
- **`np.argmax(counts)`** - finds index of maximum value

### Why Both Approaches?

- **Loop versions** help you understand the logic step-by-step
- **Vectorized versions** are much faster for large datasets

In practice, use vectorized operations, but understanding loops helps you know what's happening under the hood!