<a href="https://colab.research.google.com/github/rajdeepbanerjee-git/JNCLectures_Intro_to_ML/blob/main/Week9/2025/Week9_KNN_2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# k-Nearest Neighbour (k-NN) Classification

**Overview:**
The k-NN algorithm is a simple, non-parametric, lazy learning method for classification: it builds no explicit model and defers all computation to prediction time. It works by:

1. **Storing** all available cases and class labels.
2. **Computing distances** between a new data point and the stored cases (commonly using the Euclidean distance).
3. **Selecting the 'k' closest points** (neighbors) based on the computed distance.
4. **Assigning the class** to the new data point by majority voting among the k nearest neighbors.

**How It Works:**
- **Training Phase:** The algorithm simply stores the training data.
- **Prediction Phase:** For each test instance, compute the distance to every training instance, select the k nearest ones, and use their labels to determine the most common class (majority vote).
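
The four steps above can be traced by hand on a tiny example. The sketch below is only an illustration: the toy points, the labels `"A"`/`"B"`, and the helper name `toy_knn_predict` are hypothetical and not part of the notebook. It uses only the Python standard library.

```python
import math
from collections import Counter

# Hypothetical toy data: 2-D points labelled "A" or "B" (not from the notebook).
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

def toy_knn_predict(query, points, labels, k=3):
    """Classify one query point by majority vote among its k nearest points."""
    # Steps 1-2: the stored cases are `points`; compute the distance to each one.
    distances = [math.dist(query, p) for p in points]
    # Step 3: indices of the k smallest distances.
    nearest = sorted(range(len(points)), key=lambda i: distances[i])[:k]
    # Step 4: majority vote among the labels of those neighbours.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

print(toy_knn_predict((2.0, 2.0), train_points, train_labels))  # prints A
print(toy_knn_predict((6.0, 5.0), train_points, train_labels))  # prints B
```

Note that "training" here is nothing more than keeping `train_points` and `train_labels` around; all the work happens inside the prediction call.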

**Advantages:**
- Easy to implement and understand.
- Non-parametric: It makes no assumptions about the underlying data distribution.
- Flexible: Can work with any number of classes.

**Disadvantages:**
- Prediction is computationally expensive: a brute-force search computes the distance to every training sample, so each query costs O(n·d) for n training samples with d features.
- Sensitive to the choice of k and the distance metric.
- Performance may degrade with high-dimensional data (curse of dimensionality): distances between points become nearly uniform, so the nearest neighbors carry less information.
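
The last point can be made concrete with a small experiment. The sketch below is an illustration under stated assumptions, not part of the original notebook: it draws uniformly random points (a hypothetical setup, with the helper name `nearest_to_farthest_ratio` chosen here) and compares the nearest and farthest distance from a random query. As the number of dimensions grows, the ratio tends towards 1, meaning the "nearest" neighbour is barely closer than the farthest one.

```python
import numpy as np

def nearest_to_farthest_ratio(n_points, n_dims, rng):
    """Ratio of the smallest to the largest Euclidean distance from one random query."""
    points = rng.random((n_points, n_dims))  # uniform samples in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
ratios = {d: nearest_to_farthest_ratio(500, d, rng) for d in (2, 10, 100, 1000)}
for d, ratio in ratios.items():
    print(f"dims={d:5d}  nearest/farthest distance ratio = {ratio:.3f}")
```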


In [6]:
from sklearn import datasets
from sklearn.model_selection import train_test_split


iris = datasets.load_iris()
list(iris.keys())

['data',
 'target',
 'frame',
 'target_names',
 'DESCR',
 'feature_names',
 'filename',
 'data_module']

In [7]:
# get data
X = iris['data']
y = iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(100, 4) (50, 4) (100,) (50,)


In [16]:
import numpy as np
import scipy.stats as st

def euclidean_distance(x1, x2):
    """Compute the Euclidean distance between two vectors."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

def knn_predict(X_train, y_train, X_test, k=3):
    """
    Predict class labels for the test data using a function-based k-NN.

    Parameters:
      X_train : np.array, shape (n_train, n_features) - training features.
      y_train : np.array, shape (n_train,) - training labels.
      X_test  : np.array, shape (n_test, n_features) - test features.
      k       : int - number of nearest neighbors to use.

    Returns:
      predictions : np.array, shape (n_test,) - predicted class labels.
    """
    predictions = []
    for x in X_test:
        # Compute distances from x to every training sample
        distances = [euclidean_distance(x, x_train) for x_train in X_train]
        # Get indices of the k closest training samples
        k_indices = np.argsort(distances)[:k]
        # Get the corresponding labels for these indices and convert to a NumPy array
        k_nearest_labels = np.array([y_train[i] for i in k_indices])
        # Use scipy.stats.mode with axis=None to determine the most common label
        common_label = np.atleast_1d(st.mode(k_nearest_labels, axis=None).mode)[0]
        predictions.append(common_label)
    return np.array(predictions)

**Important Note:**  
Depending on the SciPy version, `st.mode(...).mode` is either a scalar (newer releases) or a length-1 array (older releases). Wrapping the result in `np.atleast_1d` before indexing with `[0]` works in both cases and avoids indexing errors.
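
If the SciPy dependency is not wanted, the vote can also be computed with NumPy alone. The following is a minimal sketch of that alternative, not the notebook's method; the helper name `majority_vote` is hypothetical. It also makes the tie-breaking rule explicit, which matters when k is even or there are more than two classes.

```python
import numpy as np

def majority_vote(labels):
    """Return the most frequent label; ties go to the smallest label value."""
    values, counts = np.unique(labels, return_counts=True)
    # np.unique returns sorted values and np.argmax returns the first maximum,
    # so a tie is resolved in favour of the smallest label.
    return values[np.argmax(counts)]

print(majority_vote(np.array([2, 1, 2])))     # prints 2
print(majority_vote(np.array([0, 1, 1, 0])))  # prints 0 (tie between 0 and 1)
```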


In [17]:
# Predict class labels using functional k-NN
predictions = knn_predict(X_train, y_train, X_test, k=3)

# Evaluate the classifier's accuracy
accuracy = np.sum(predictions == y_test) / len(y_test)
print("Functional k-NN Accuracy:", accuracy)

Functional k-NN Accuracy: 0.98
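
The functional implementation above loops over every test and training sample in Python. As a rough sketch of how the same computation could be vectorised, the version below computes all pairwise distances with NumPy broadcasting. The name `knn_predict_vectorized` and the synthetic demo data are hypothetical, and the vote assumes non-negative integer class labels (as in Iris).

```python
import numpy as np

def knn_predict_vectorized(X_train, y_train, X_test, k=3):
    """k-NN prediction with all pairwise distances computed in one NumPy expression.

    Assumes y_train holds non-negative integer class labels.
    """
    # Broadcasting gives diffs the shape (n_test, n_train, n_features).
    diffs = X_test[:, None, :] - X_train[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))  # shape (n_test, n_train)
    # Indices of the k smallest distances in each row (stable sort keeps ties in index order).
    k_indices = np.argsort(distances, axis=1, kind="stable")[:, :k]
    k_labels = y_train[k_indices]  # shape (n_test, k)
    # Majority vote per row; bincount(...).argmax() picks the smallest label on ties.
    return np.array([np.bincount(row).argmax() for row in k_labels])

# Hypothetical synthetic data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X_demo_train = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(5.0, 0.5, (20, 2))])
y_demo_train = np.array([0] * 20 + [1] * 20)
X_demo_test = np.array([[0.2, -0.1], [4.8, 5.3]])
print(knn_predict_vectorized(X_demo_train, y_demo_train, X_demo_test, k=3))  # prints [0 1]
```

The trade-off is memory: the intermediate `diffs` array holds n_test × n_train × n_features values, so for large datasets the test set would need to be processed in batches.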


#### Object-Oriented Implementation

In [18]:
import numpy as np
import scipy.stats as st

def euclidean_distance(x1, x2):
    """Compute the Euclidean distance between two vectors."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

class KNN:
    def __init__(self, k=3):
        """
        Initialize the k-NN classifier.

        Parameters:
          k : int - number of nearest neighbors to use.
        """
        self.k = k

    def fit(self, X, y):
        """
        Store the training data.

        Parameters:
          X : np.array, shape (n_samples, n_features) - training features.
          y : np.array, shape (n_samples,) - training labels.
        """
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        """
        Predict class labels for the test data.

        Parameters:
          X : np.array, shape (n_samples, n_features) - test features.

        Returns:
          predictions : np.array, shape (n_samples,) - predicted class labels.
        """
        predictions = [self._predict(x) for x in X]
        return np.array(predictions)

    def _predict(self, x):
        # Compute distances from x to all training samples
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Get indices of the k nearest training samples
        k_indices = np.argsort(distances)[:self.k]
        # Get the labels of these k nearest samples as a NumPy array
        k_nearest_labels = np.array([self.y_train[i] for i in k_indices])
        # Use scipy.stats.mode with axis=None to determine the most common label
        return np.atleast_1d(st.mode(k_nearest_labels, axis=None).mode)[0]




In [19]:
# Create and train the k-NN classifier with k=3
knn = KNN(k=3)
knn.fit(X_train, y_train)

# Predict the labels for the test set
predictions = knn.predict(X_test)

# Evaluate the classifier's accuracy
accuracy = np.sum(predictions == y_test) / len(y_test)
print("Class-Based k-NN Accuracy:", accuracy)


Class-Based k-NN Accuracy: 0.98
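
As a sanity check, the hand-written classifier can be compared against scikit-learn's `KNeighborsClassifier`. The sketch below assumes the same split as above (`test_size=0.33`, `random_state=42`) and k=3; the variable names `sk_knn` and `sk_accuracy` are chosen here for illustration. Small differences from the custom implementation are possible because tie-breaking rules may differ.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# n_neighbors plays the role of k; the default metric is Euclidean distance.
sk_knn = KNeighborsClassifier(n_neighbors=3)
sk_knn.fit(X_train, y_train)
sk_accuracy = sk_knn.score(X_test, y_test)  # fraction of correct test predictions
print("scikit-learn k-NN accuracy:", sk_accuracy)
```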
