# L4c: K-Nearest Neighbors Approach for Classification
In this lecture, we explore the K-Nearest Neighbors (KNN) algorithm, a non-parametric method for classification tasks. KNN classifies new data points based on the majority class of their $K$ closest neighbors in the feature space.

> __Learning Objectives:__
>
> By the end of this lecture, you will be able to:
> * __Explain the KNN classification algorithm:__ Describe how KNN uses distance metrics and majority voting to classify new data points based on their nearest neighbors in the feature space.
> * __Apply KNN to binary and multiclass problems:__ Implement the KNN algorithm for classification tasks with two or more classes, including proper handling of ties and distance metric selection.
> * __Analyze KNN computational requirements:__ Evaluate the computational cost of KNN classification and understand how dimensionality affects distance-based classification methods.

Let's get started!
___

## Example
Today, we will use the following example to illustrate key concepts:
 
> [▶ K-Nearest Neighbors Binary Classification of Synthetic Data](CHEME-5820-L4a-Example-MeasureFirmSimilarityScores-Spring-2026.ipynb). In this example, let's implement KNN for binary classification on a synthetic dataset. We'll explore how varying the overlap between classes, the choice of $K$, the distance metric, and other parameters affects classification accuracy.
___

## K-Nearest Neighbor Classification
K-nearest neighbors (KNN) is a machine learning algorithm that can be used for both classification and regression tasks. This lecture focuses on classification.

> __Background__ 
> 
> KNN, developed by [Fix and Hodges in 1951](https://www.jstor.org/stable/1403797?socuuid=e7bcd649-778e-473f-8b7b-92805b5fab5f) and later refined by [Cover and Hart in 1967](https://ieeexplore.ieee.org/document/1053964), is a _supervised_ classification algorithm in which an object is classified by a plurality vote of its neighbors: the object is assigned to the class most common among its $K$ nearest neighbors, where $1 \leq K < n$ and $n$ is the number of labeled reference points.

The algorithm finds the $K$ closest data points to a new instance in the feature space and then classifies the new instance based on the majority class among these neighbors.

> __Key assumption__ 
> 
> The key assumption of a [K-nearest neighbor classifier](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) is that _similar_ inputs have _similar_ labels (in classification tasks) or _similar_ outputs (in regression tasks). The critical question is how we define and measure _similarity_.

Let's look at the pseudo-code for KNN classification.

__Initialization__: Provide a reference dataset $\mathcal{D} = \{(\mathbf{x}_{i},y_{i}) \mid i = 1,2,\dots,n\}$, where $\mathbf{x}_i \in \mathbb{R}^{m}$ are $m$-dimensional feature vectors and $y_i \in \{-1,1\}$ are discrete binary labels. Specify the number of neighbors $K$, where $1 \leq K < n$. Choose a distance metric $d(\cdot,\cdot)$ (typically Euclidean distance: $d(\mathbf{x}, \mathbf{x}^*) = \|\mathbf{x} - \mathbf{x}^*\|_2$).

For each new point $\mathbf{x}^*$ to classify __do__:

1. Compute the distance from $\mathbf{x}^*$ to every reference point in $\mathcal{D}$: $d_i \gets d(\mathbf{x}_i, \mathbf{x}^*)$ for $i = 1, 2, \dots, n$, where $d_i$ is the distance between $\mathbf{x}^*$ and the $i$-th reference point $\mathbf{x}_i$, and $d(\cdot,\cdot)$ is the chosen distance metric.

2. Find the $K$ nearest neighbors of $\mathbf{x}^*$ among the reference points in $\mathcal{D}$: Identify the set $\mathcal{N}_K(\mathbf{x}^*) = \{\mathbf{x}_{i_1}, \mathbf{x}_{i_2}, \dots, \mathbf{x}_{i_K}\}$ of the $K$ reference points with the smallest distances to $\mathbf{x}^*$, where $i_1, i_2, \dots, i_K$ are the indices of the $K$ nearest neighbors.

3. Assign a class by majority vote: Count the class labels among the $K$ nearest neighbors in $\mathcal{N}_K(\mathbf{x}^*)$ and assign the most frequent class to $\mathbf{x}^*$:
$$
\hat{y}^* = \arg\max_{c \in \{-1,1\}} \sum_{j=1}^{K} \mathbb{1}(y_{i_j} = c)
$$ 
where $\mathbb{1}(\cdot)$ is the indicator function that returns 1 if the condition is true and 0 otherwise. The variable $y_{i_j}$ is the label of the $j$-th nearest neighbor.

__Return__ the predicted label $\hat{y}^*$ for the new point $\mathbf{x}^*$.
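
The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `euclidean` and `knn_classify` and the toy dataset are assumptions made for this example, and ties in the vote are resolved arbitrarily here (tie handling is discussed separately).

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(X, y, x_star, K, distance=euclidean):
    """Predict a label for x_star by majority vote of its K nearest neighbors."""
    n = len(X)
    if not 1 <= K < n:
        raise ValueError("K must satisfy 1 <= K < n")
    # Step 1: distance from x_star to every reference point
    d = [distance(x_i, x_star) for x_i in X]
    # Step 2: indices of the K smallest distances (equal distances keep index order)
    nearest = sorted(range(n), key=lambda i: d[i])[:K]
    # Step 3: majority vote over the neighbor labels
    votes = Counter(y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy reference dataset (made up for illustration): two well-separated clusters
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.0, 1.0), (0.9, 1.2), (1.1, 0.8)]
y = [-1, -1, -1, 1, 1, 1]

print(knn_classify(X, y, (0.15, 0.2), K=3))  # -1
print(knn_classify(X, y, (1.0, 0.9), K=3))   # 1
```

With an odd $K$ and two classes, the vote in step 3 cannot tie, which is why $K=3$ is used in this toy example.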

The algorithm extends naturally to handle various implementation details and edge cases.

> __Multiclass adaptation__
> 
> The same KNN procedure works for any number of classes $C\ge 2$. Replace the binary label set $\{-1,1\}$ with $\{c_1,\dots,c_C\}$ and keep the majority vote step:
> $$\hat{y}^* = \arg\max_{c \in \{c_1,\dots,c_C\}} \sum_{j=1}^{K} \mathbb{1}(y_{i_j}=c)$$
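
As a small illustration of the multiclass vote, the sketch below counts labels among a hypothetical set of $K=5$ neighbors. The helper names `vote_counts` and `majority_label` and the labels `"c1"`, `"c2"`, `"c3"` are assumptions made for this example.

```python
from collections import Counter

def vote_counts(neighbor_labels):
    """Number of neighbors carrying each class label."""
    return Counter(neighbor_labels)

def majority_label(neighbor_labels):
    """Label with the largest count (arg max over classes)."""
    counts = vote_counts(neighbor_labels)
    return max(counts, key=counts.get)

# Labels of the K = 5 nearest neighbors of some query point (made up)
neighbor_labels = ["c2", "c1", "c2", "c3", "c2"]

print(vote_counts(neighbor_labels)["c2"])  # 3
print(majority_label(neighbor_labels))     # c2
```

Nothing in the vote depends on the number of classes, so the binary case is simply the special case of two distinct labels.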

When implementing KNN, we also need to consider how to handle ties between classes and which distance metric to use.

> __Tie handling__
> 
> If multiple classes obtain the same maximal count, break ties by selecting the class of the nearest neighbor among the tied classes, or by using a distance-weighted vote where closer neighbors contribute more weight.
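
The two tie-breaking strategies can be sketched as follows. The function names are hypothetical, and the inverse-distance weight $1/(d + \epsilon)$ is only one possible weighting, assumed here for illustration.

```python
from collections import defaultdict

def vote_with_nearest_tiebreak(neighbors):
    """Majority vote; ties go to the tied class of the closest neighbor.

    neighbors: list of (distance, label) pairs for the K nearest neighbors.
    """
    counts = defaultdict(int)
    for _, label in neighbors:
        counts[label] += 1
    top = max(counts.values())
    tied = {label for label, c in counts.items() if c == top}
    # Walk the neighbors from closest to farthest; return the first tied class
    for _, label in sorted(neighbors, key=lambda pair: pair[0]):
        if label in tied:
            return label

def distance_weighted_vote(neighbors, eps=1e-12):
    """Each neighbor votes with weight 1 / (distance + eps) (assumed weighting)."""
    weights = defaultdict(float)
    for dist, label in neighbors:
        weights[label] += 1.0 / (dist + eps)
    return max(weights, key=weights.get)

# K = 4 neighbors with a 2-2 tie between classes "A" and "B" (made-up values)
neighbors = [(0.30, "B"), (0.31, "A"), (0.32, "A"), (5.0, "B")]

print(vote_with_nearest_tiebreak(neighbors))  # B
print(distance_weighted_vote(neighbors))      # A
```

The two rules can disagree, as they do here: the single closest neighbor belongs to class `"B"`, but the two `"A"` neighbors are both close while the second `"B"` neighbor is far away, so the weighted vote favors `"A"`.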

The choice of distance metric determines how we measure similarity between data points.

> __Distance metric options__
> 
> The choice of distance metric $d(\cdot,\cdot)$ affects classification results:
> * __Euclidean__: $d(\mathbf{x},\mathbf{x}^*) = \|\mathbf{x}-\mathbf{x}^*\|_2$, the L2 norm or standard geometric distance in $m$ dimensions. In practice, the squared Euclidean distance $\|\mathbf{x}-\mathbf{x}^*\|_2^2$ is often used since it avoids the square root computation and preserves neighbor ordering (the square root is monotonic, so ranking neighbors by distance or squared distance gives the same result).
> * __Manhattan__: $d(\mathbf{x},\mathbf{x}^*) = \|\mathbf{x}-\mathbf{x}^*\|_1$, the L1 norm, which sums absolute differences across dimensions. Because it does not square the differences, a large difference in a single feature influences the Manhattan distance less than it influences the Euclidean distance. This makes L1 a reasonable choice when the data contains outliers or when features have different scales that should not be geometrically combined. Manhattan distance is also often used in high-dimensional spaces where many features may be sparse or irrelevant.
> * __Cosine similarity (converted to distance)__: $d(\mathbf{x},\mathbf{x}^*) = 1 - \frac{\mathbf{x}\cdot\mathbf{x}^*}{\|\mathbf{x}\|_2\|\mathbf{x}^*\|_2}$. This is one minus the cosine of the angle between the two vectors (an inner product normalized by magnitudes), so it depends on orientation rather than length. Use cosine distance when direction matters more than magnitude, for example in high-dimensional sparse settings where Euclidean distances are dominated by vector length instead of feature direction. For length-normalized vectors, cosine and Euclidean distances rank neighbors identically. Cosine distance is undefined when either vector is zero.
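
A minimal sketch of the three distance options, plus the squared Euclidean shortcut, is given below. The function names and the sample vectors are assumptions made for this example.

```python
import math

def squared_euclidean(a, b):
    """Squared L2 distance: no square root, same neighbor ordering as L2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def euclidean(a, b):
    """L2 distance."""
    return math.sqrt(squared_euclidean(a, b))

def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)

x, x_star = (1.0, 2.0), (4.0, 6.0)
print(euclidean(x, x_star))          # 5.0
print(manhattan(x, x_star))          # 7.0
print(squared_euclidean(x, x_star))  # 25.0

# Ranking reference points by L2 or by squared L2 gives the same order
points = [(0.0, 0.0), (3.0, 3.0), (1.0, 1.5), (5.0, 7.0)]
by_l2 = sorted(points, key=lambda p: euclidean(p, x_star))
by_sq = sorted(points, key=lambda p: squared_euclidean(p, x_star))
print(by_l2 == by_sq)                # True
```

Note that `cosine_distance` is zero for any two vectors pointing in the same direction, regardless of their lengths, which is the sense in which it ignores magnitude.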

A key limitation of KNN is its computational cost, which scales with the reference dataset size.

> __Computational cost__
> 
> The naïve implementation requires $O(n\,m)$ operations per query to compute $n$ distances in $m$-dimensional space, plus the work of selecting the $K$ smallest distances (for example, $O(n \log n)$ with a full sort). The cost therefore grows with both the number of reference points $n$ and the feature dimension $m$. For high-dimensional data, KNN can also suffer from the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality): as $m$ increases, the distances from a query to its nearest and farthest reference points tend to become nearly equal, so distance metrics become less meaningful.
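
To make the effect of dimensionality concrete, the sketch below measures how much farther the farthest reference point is than the nearest one, relative to the nearest distance, for uniformly random points in the unit hypercube. The function name `relative_contrast`, the uniform sampling, and the sample sizes are assumptions chosen for this illustration; real datasets may behave differently.

```python
import math
import random

def relative_contrast(m, n=200, seed=0):
    """(d_max - d_min) / d_min for n uniform random points in [0, 1]^m.

    Distances are Euclidean distances to one random query point.
    Computing them costs O(n * m) operations, as in naive KNN.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(m)]
    dists = []
    for _ in range(n):
        point = [rng.random() for _ in range(m)]
        dists.append(math.sqrt(sum((p - q) ** 2 for p, q in zip(point, query))))
    return (max(dists) - min(dists)) / min(dists)

for m in (2, 10, 100, 1000):
    print(m, round(relative_contrast(m), 3))
```

Under these assumptions the printed contrast should shrink as $m$ grows: in low dimensions the nearest neighbor is much closer than the farthest point, while in high dimensions all reference points sit at nearly the same distance from the query.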

Let's look at an example of KNN classification for a synthetic dataset.

> __Example__
> 
> [▶ K-Nearest Neighbors Binary Classification of Synthetic Data](CHEME-5820-L4a-Example-MeasureFirmSimilarityScores-Spring-2026.ipynb). In this example, let's implement KNN for binary classification on a synthetic dataset. We'll explore how varying the overlap between classes, the choice of $K$, the distance metric, and other parameters affects classification accuracy.
___

## Summary
K-Nearest Neighbors classifies new data points by finding their $K$ closest neighbors in the reference dataset and assigning the majority class through voting.

> __Key Takeaways:__
>
> * **KNN is a non-parametric classifier**: KNN makes no assumptions about the underlying data distribution and classifies new points based solely on local neighborhood structure, making it flexible but dependent on the choice of distance metric and $K$.
> * **Computational cost scales with dataset size**: The naïve KNN implementation requires computing distances to all $n$ reference points for each query, resulting in $O(n\,m)$ cost per classification. For high-dimensional data, the curse of dimensionality can reduce the effectiveness of distance-based methods.
> * **Algorithm extends naturally to multiclass problems**: The same majority-voting procedure works for any number of classes by counting occurrences of each class label among the $K$ nearest neighbors and selecting the most frequent class.
___