<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/K-Nearest%20Neighbours%20Regression/KNN%20Regression%20Code%20Walk%20Through.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# K-Nearest Neighbors Regression: Code Walk Through

This notebook walks through the **computational steps** of the K-Nearest Neighbors (KNN) regression algorithm.

## What We'll Cover:
1. **Visualize the data** - understand the dataset
2. **Calculate distances** - measure similarity between points
3. **Find K nearest neighbors** - identify closest training points
4. **Make prediction** - use **averaging** (not voting!)

We'll show both **loop versions** (to understand the logic) and **vectorized NumPy versions** (for efficiency).

### Key Difference from Classification:
- **Classification:** Use majority **voting** among neighbors
- **Regression:** Use **averaging** of neighbors' values
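
To make the contrast concrete, here is a minimal sketch with made-up neighbor data. The arrays `neighbor_classes` and `neighbor_targets` are hypothetical and are not part of this notebook's dataset; they only illustrate the two aggregation rules.

```python
import numpy as np

# Hypothetical labels and values of 5 neighbors (illustration only)
neighbor_classes = np.array([0, 1, 1, 0, 1])
neighbor_targets = np.array([2.0, 3.0, 2.5, 1.5, 3.5])

# Classification: pick the class that occurs most often among the neighbors
labels, counts = np.unique(neighbor_classes, return_counts=True)
voted_class = labels[np.argmax(counts)]

# Regression: take the mean of the neighbors' continuous values
averaged_value = neighbor_targets.mean()

print("Voted class:", voted_class)        # class 1 appears 3 times out of 5
print("Averaged value:", averaged_value)  # 12.5 / 5 = 2.5
```

The neighbor search is identical in both cases; only this final aggregation step changes.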

## Step 1: Import Libraries

We need:
- **NumPy** for numerical operations
- **Matplotlib** for visualization
- **sklearn.metrics.pairwise_distances** for efficient distance calculation

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances

## Step 2: Create Training Data

We have:
- **10 training points** with **2 features** each
- **Continuous target values** (not classes!)
- Target values range from approximately 1.6 to 3.7

In [None]:
# Training data: 10 points with 2 features
X_train = np.array( [ [1.536, 3.554],   # Point 0
                      [1.771, 2.783],   # Point 1
                      [2.506, 2.880],   # Point 2
                      [2.652, 4.545],   # Point 3
                      [3.590, 3.784],   # Point 4
                      [1.279, 1.443],   # Point 5
                      [2.000, 2.325],   # Point 6
                      [2.096, 0.583],   # Point 7
                      [2.539, 1.541],   # Point 8
                      [3.251, 0.080] ] ) # Point 9

# Target values: continuous numbers (not discrete classes)
y_train = np.array( [2.728, 2.456, 2.641, 3.520, 3.667,
                     1.612, 2.136, 1.591, 2.143, 1.827] )

print("Training data shape:", X_train.shape)  # (10, 2) = 10 points, 2 features
print("Target values shape:", y_train.shape)   # (10,) = 10 target values
print("\nFirst few training points:")
print(X_train[:3])
print("\nCorresponding target values:")
print(y_train[:3])
print(f"\nTarget value range: [{y_train.min():.3f}, {y_train.max():.3f}]")

## Step 3: Visualize the Data

Let's plot our training data to see how it's distributed in 2D space.

**Note:** We start by just looking at the data points, **without worrying about their target values yet**.

In [None]:
# Simple scatter plot of all training points
plt.figure(figsize=(8, 6))
plt.scatter(X_train[:,0], X_train[:,1],
           c='steelblue', s=100, alpha=0.6,
           edgecolors='black', linewidths=1.5)
plt.xlabel('Feature 1 ($x_1$)', fontsize=12)
plt.ylabel('Feature 2 ($x_2$)', fontsize=12)
plt.title('Training Data Visualization', fontsize=14)
plt.grid(True, alpha=0.3)
plt.axis([0, 5, 0, 5])
plt.show()

print(f"We have {len(X_train)} training points in 2D space")
print("Each point has an associated target value (continuous)")

## Step 4: Define Test Point

Now we have a new point **[2.0, 2.0]** that we want to predict a value for.

**Question:** What target value should we predict for this point?

KNN will answer this by finding the K nearest training points and **averaging their target values**.

In [None]:
# Test point: a new point we want to predict for
X_test = np.array([[2.0, 2.0]])

print("Test point:", X_test[0])
print("Shape:", X_test.shape)  # (1, 2) = 1 point, 2 features

# Visualize test point with training data
plt.figure(figsize=(8, 6))
plt.scatter(X_train[:,0], X_train[:,1],
           c='steelblue', s=100, alpha=0.6,
           edgecolors='black', linewidths=1.5,
           label='Training points')
plt.scatter(X_test[:,0], X_test[:,1],
           c='red', s=300, marker='*',
           edgecolors='black', linewidths=2,
           label='Test point')
plt.xlabel('Feature 1 ($x_1$)', fontsize=12)
plt.ylabel('Feature 2 ($x_2$)', fontsize=12)
plt.title('Test Point to Predict', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.axis([0, 5, 0, 5])
plt.show()

## Step 5: Calculate Distances

To find nearest neighbors, we need to calculate the distance from the test point to each training point.

We'll use **Euclidean distance**:

$$d = \sqrt{(x_1 - x_1')^2 + (x_2 - x_2')^2}$$
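
As a quick sanity check of the formula, the sketch below evaluates it term by term and compares the result with `np.linalg.norm`, which computes the same Euclidean length of the difference vector. The names `a` and `b` are chosen here for illustration only.

```python
import numpy as np

a = np.array([2.0, 2.0])      # the test point used in this notebook
b = np.array([1.536, 3.554])  # the first training point

# Formula written out term by term
d_formula = np.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# Same quantity as the Euclidean norm of the difference vector
d_norm = np.linalg.norm(a - b)

print(f"Formula: {d_formula:.4f}")
print(f"Norm:    {d_norm:.4f}")
print("Match:", np.isclose(d_formula, d_norm))
```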

### Manual Calculation Example

Let's manually calculate the distance from test point **[2.0, 2.0]** to the **first training point [1.536, 3.554]**:

In [None]:
# Manual calculation for first training point
test_point = X_test[0]       # [2.0, 2.0]
first_train_point = X_train[0]  # [1.536, 3.554]

print("Test point:         ", test_point)
print("First training point:", first_train_point)
print()

# Step 1: Calculate differences
diff_1 = test_point[0] - first_train_point[0]
diff_2 = test_point[1] - first_train_point[1]
print("Step 1 - Differences:")
print(f"  Feature 1: {test_point[0]:.3f} - {first_train_point[0]:.3f} = {diff_1:.3f}")
print(f"  Feature 2: {test_point[1]:.3f} - {first_train_point[1]:.3f} = {diff_2:.3f}")
print()

# Step 2: Square the differences
squared_1 = diff_1 ** 2
squared_2 = diff_2 ** 2
print("Step 2 - Square the differences:")
print(f"  ({diff_1:.3f})² = {squared_1:.3f}")
print(f"  ({diff_2:.3f})² = {squared_2:.3f}")
print()

# Step 3: Sum the squared differences
sum_squared = squared_1 + squared_2
print("Step 3 - Sum:")
print(f"  {squared_1:.3f} + {squared_2:.3f} = {sum_squared:.3f}")
print()

# Step 4: Take square root
distance = np.sqrt(sum_squared)
print("Step 4 - Square root:")
print(f"  √{sum_squared:.3f} = {distance:.3f}")
print()
print(f"Distance from test point to first training point: {distance:.3f}")

### Approach 1: Using a Loop (Explicit Logic)

Now let's calculate distances to **all** training points using a loop.

This shows the logic clearly: we go through each training point one by one.

In [None]:
# Calculate distances using a loop
distances_loop = []

for i in range(len(X_train)):
    # Get the training point
    train_point = X_train[i]

    # Calculate difference for each feature
    diff = test_point - train_point

    # Square the differences
    squared_diff = diff ** 2

    # Sum and take square root
    distance = np.sqrt(np.sum(squared_diff))

    # Store the distance
    distances_loop.append(distance)

    print(f"Distance to point {i}: {distance:.4f}")

# Convert to numpy array
distances_loop = np.array(distances_loop)
print(f"\nDistances shape: {distances_loop.shape}")

### Approach 2: Using Vectorization (Efficient)

Instead of looping, we can use `pairwise_distances` from sklearn.

This computes **all distances in one call** using optimized, vectorized routines, which is typically much faster than a Python loop on large datasets.

**Note:** `pairwise_distances` returns a 2D array (matrix of distances), so we use `.ravel()` to flatten it to 1D.
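
For intuition about what a vectorized distance computation does, here is a NumPy-only sketch based on broadcasting. It is not how scikit-learn implements `pairwise_distances` internally; it is just one way to get the same 2D result shape. The names `X_query` and `X_ref` are illustrative, and `X_ref` holds only the first three training points.

```python
import numpy as np

X_query = np.array([[2.0, 2.0]])        # 1 query point, 2 features
X_ref = np.array([[1.536, 3.554],
                  [1.771, 2.783],
                  [2.506, 2.880]])      # first 3 training points

# Broadcasting: (1, 1, 2) - (1, 3, 2) -> (1, 3, 2) array of differences
diff = X_query[:, np.newaxis, :] - X_ref[np.newaxis, :, :]

# Sum squared differences over the feature axis, then take the square root
dist_2d = np.sqrt((diff ** 2).sum(axis=-1))   # shape (1, 3)

# Flatten to 1D, as .ravel() does for the pairwise_distances output
dist_1d = dist_2d.ravel()                     # shape (3,)

print(dist_2d.shape, dist_1d.shape)
print(dist_1d)
```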

In [None]:
# Calculate distances using pairwise_distances
distances_2d = pairwise_distances(X_test, X_train)
print("2D array shape:", distances_2d.shape)  # (1, 10) = 1 test point, 10 training points
print("2D array:")
print(distances_2d)
print()

# Flatten to 1D array using .ravel()
distances_vectorized = distances_2d.ravel()
print("1D array shape:", distances_vectorized.shape)  # (10,)
print("1D array:")
print(distances_vectorized)
print()

# Verify both approaches give same result
print("Results match:", np.allclose(distances_loop, distances_vectorized))

## Step 6: Find K Nearest Neighbors

Now we have distances to all training points. We need to find the **5 closest points** (K=5).

**How do we find them?**
We need to:
1. Sort the distances from smallest to largest
2. Get the **indices** (positions) of the 5 smallest distances

### What is `argsort()`?

`argsort()` returns the **indices** that would sort an array, not the sorted values themselves.

**Example:**
- Array: [4.5, 2.1, 7.3, 1.8, 3.2]
- `argsort()` returns: [3, 1, 4, 0, 2]
- This means: index 3 has the smallest value (1.8), then index 1 (2.1), then index 4 (3.2), etc.
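
The example above can be reproduced directly. The variable names `values` and `order` are chosen here for illustration.

```python
import numpy as np

values = np.array([4.5, 2.1, 7.3, 1.8, 3.2])

# Indices that would sort the array in ascending order
order = np.argsort(values)
print(order)           # [3 1 4 0 2]

# Indexing with those indices yields the sorted values
print(values[order])   # [1.8 2.1 3.2 4.5 7.3]

# Slicing the index array gives the positions of the smallest entries
print(order[:2])       # [3 1]
```

Slicing the index array, rather than the sorted values, is what lets us look up the matching target values later.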

In [None]:
# Let's see which indices argsort returns
sorted_indices = np.argsort(distances_vectorized)
print("All indices sorted by distance:")
print(sorted_indices)
print()

# Get the first 5 indices (K=5 nearest neighbors)
K = 5
nearest_indices = sorted_indices[:K]
print(f"Indices of {K} nearest neighbors:")
print(nearest_indices)
print()

# Show the actual distances
print(f"Distances to these {K} nearest neighbors:")
for i, idx in enumerate(nearest_indices):
    print(f"  Neighbor {i+1}: point {idx}, distance = {distances_vectorized[idx]:.4f}")

## Step 7: Get Target Values of Nearest Neighbors

Now we know **which** training points are closest.

Let's see what **target values** these neighbors have.

Remember: In regression, each training point has a continuous target value (not a class label).

In [None]:
# Get the target values of the K nearest neighbors
neighbor_values = y_train[nearest_indices]

print(f"Target values of {K} nearest neighbors:")
print(neighbor_values)
print()

# Show details
print("Detailed view:")
for i, idx in enumerate(nearest_indices):
    print(f"  Neighbor {i+1}: point {idx}, target = {y_train[idx]:.3f}, distance = {distances_vectorized[idx]:.4f}")

## Step 8: Make Prediction via Averaging

Now we have the target values of the K nearest neighbors.

**For regression, we predict by taking the average (mean) of these values.**

### Approach 1: Manual Averaging

In [None]:
# Calculate average manually
sum_values = 0
for value in neighbor_values:
    sum_values += value

average_manual = sum_values / len(neighbor_values)

print("Target values:", neighbor_values)
print(f"\nSum of values: {sum_values:.3f}")
print(f"Number of neighbors: {len(neighbor_values)}")
print(f"Average: {sum_values:.3f} / {len(neighbor_values)} = {average_manual:.3f}")
print(f"\nPredicted value (manual): {average_manual:.3f}")

### Approach 2: Using NumPy's `mean()`

`np.mean()` or `.mean()` calculates the average efficiently.

This is more concise than manual summation.

In [None]:
# Calculate average using NumPy
prediction_numpy = neighbor_values.mean()

print("Target values:", neighbor_values)
print(f"Mean: {prediction_numpy:.3f}")
print()

# Verify both approaches match
print(f"Manual and NumPy predictions match: {np.isclose(average_manual, prediction_numpy)}")
print(f"Difference: {abs(average_manual - prediction_numpy):.10f}")

## Step 9: Visualize the Result

Let's visualize the test point, its K nearest neighbors, and their target values.

We'll now color the training points by their target values to see the pattern.

In [None]:
# Visualize the regression result
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Left plot: Show all points and K nearest neighbors
ax1.scatter(X_train[:,0], X_train[:,1],
           c='steelblue', s=100, alpha=0.3,
           edgecolors='black', linewidths=1, label='Training points')

# Highlight the K nearest neighbors
nearest_neighbors_X = X_train[nearest_indices]
ax1.scatter(nearest_neighbors_X[:,0], nearest_neighbors_X[:,1],
           c='green', s=200, marker='s', alpha=0.7,
           edgecolors='darkgreen', linewidths=2, label=f'{K} Nearest Neighbors')

# Plot test point
ax1.scatter(X_test[:,0], X_test[:,1],
           c='red', s=400, marker='*',
           edgecolors='black', linewidths=2,
           label=f'Test point (predicted: {prediction_numpy:.3f})')

ax1.set_xlabel('Feature 1 ($x_1$)', fontsize=12)
ax1.set_ylabel('Feature 2 ($x_2$)', fontsize=12)
ax1.set_title(f'KNN Regression: Finding Neighbors (K={K})', fontsize=14)
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)
ax1.axis([0, 5, 0, 5])

# Right plot: Show points colored by target values
scatter = ax2.scatter(X_train[:,0], X_train[:,1],
                     c=y_train, cmap='viridis', s=100, alpha=0.6,
                     edgecolors='black', linewidths=1.5)
plt.colorbar(scatter, ax=ax2, label='Target Value')

# Highlight nearest neighbors
ax2.scatter(nearest_neighbors_X[:,0], nearest_neighbors_X[:,1],
           c='red', s=200, marker='s', alpha=0.5,
           edgecolors='darkred', linewidths=2, label=f'{K} Nearest Neighbors')

# Plot test point
ax2.scatter(X_test[:,0], X_test[:,1],
           c='red', s=400, marker='*',
           edgecolors='black', linewidths=2,
           label='Test point')

ax2.set_xlabel('Feature 1 ($x_1$)', fontsize=12)
ax2.set_ylabel('Feature 2 ($x_2$)', fontsize=12)
ax2.set_title('Target Values Distribution', fontsize=14)
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)
ax2.axis([0, 5, 0, 5])

plt.tight_layout()
plt.show()

print(f"\nTest point {X_test[0]} prediction: {prediction_numpy:.3f}")
print(f"Based on averaging {K} nearest neighbors:")
for i, val in enumerate(neighbor_values):
    print(f"  Neighbor {i+1}: target = {val:.3f}")
print(f"Average: {prediction_numpy:.3f}")

## Summary

We've walked through all the computational steps of KNN Regression:

1. ✅ **Visualized data** - saw training points in 2D space
2. ✅ **Calculated distances** - computed Euclidean distance from test point to each training point
3. ✅ **Found K nearest neighbors** - used `argsort()` to find indices of 5 closest points
4. ✅ **Made prediction** - used **averaging** of the K neighbors' target values

### Key Difference: Classification vs Regression

| Aspect | Classification | Regression |
|--------|---------------|-----------|
| **Target values** | Discrete classes (0, 1, 2, ...) | Continuous numbers (1.5, 2.3, ...) |
| **Prediction method** | Majority voting | Averaging |
| **Output** | Class label | Continuous value |

### Key NumPy and scikit-learn Operations Used:

- **`pairwise_distances(X_test, X_train)`** (scikit-learn) - efficiently calculates all distances
- **`.ravel()`** - flattens 2D array to 1D
- **`np.argsort(distances)`** - returns indices that would sort the array
- **`array[indices]`** - fancy indexing to select multiple elements
- **`.mean()`** or `np.mean()` - calculates average

### Why Both Approaches?

- **Loop versions** help you understand the logic step-by-step
- **Vectorized versions** are much faster for large datasets

In practice, use vectorized operations, but understanding loops helps you know what's happening under the hood!
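
The steps recapped above can be collected into one small function. This is a minimal sketch rather than a library API: `knn_regress` is a hypothetical helper name, and it assumes Euclidean distance and unweighted averaging, as used throughout this notebook.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5):
    """Predict a value for one query point by averaging its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Regression prediction: mean of the neighbors' target values
    return y_train[nearest].mean(), nearest

# Same training data and test point as in this notebook
X_train = np.array([[1.536, 3.554], [1.771, 2.783], [2.506, 2.880],
                    [2.652, 4.545], [3.590, 3.784], [1.279, 1.443],
                    [2.000, 2.325], [2.096, 0.583], [2.539, 1.541],
                    [3.251, 0.080]])
y_train = np.array([2.728, 2.456, 2.641, 3.520, 3.667,
                    1.612, 2.136, 1.591, 2.143, 1.827])

prediction, nearest = knn_regress(X_train, y_train, np.array([2.0, 2.0]), k=5)
print("Nearest indices:", nearest)
print(f"Prediction: {prediction:.3f}")
```

Changing `k` in the call is an easy way to see how the prediction shifts as more distant neighbors are included in the average.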