<div style="text-align:center;">
    <img src="http://www.cs.wm.edu/~rml/images/wm_horizontal_single_line_full_color.png">
    <h1>CSCI 416: Introduction to Machine Learning</h1>
    <h1>Fall 2025</h1>
    <h1>Nearest neighbors</h1>
</div>

# Contents

- [Nearest neighbors classifiers](#Nearest-neighbors-classifiers)
- [The choice of $k$](#The-choice-of-$k$)
- [The choice of distance](#The-choice-of-distance)
- [The importance of scaling](#The-importance-of-scaling)
- [Training a kNN classifier](#Training-a-kNN-classifier)
- [kNN and the Bayes error](#kNN-and-the-Bayes-error)
- [Building a kNN classifier in Scikit-Learn](#Building-a-kNN-classifier-in-Scikit-Learn)
  * [Fisher's iris data set](#Fisher's-iris-data-set)
  * [Training and test sets](#Training-and-test-sets)
  * [Construct the classifier](#Construct-the-classifier)
  * [Plot the decision regions](#Plot-the-decision-regions)
  * [Evaluate the model](#Evaluate-the-model)
- [Feature scaling](#Feature-scaling)
- [Pipelines in Scikit-Learn](#Pipelines-in-Scikit-Learn)
- [Saving models](#Saving-models)

# Nearest neighbors classifiers

The $k$-nearest neighbors (kNN) classifier is very simple.

Choose $k \geq 1$.  Given  a training set $T$ and a new case $x$,
1. Find the $k$ nearest neighbors of $x$ in $T$.
2. Classify $x$ to be the majority class amongst the $k$ nearest neighbors.
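
The two steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not how Scikit-Learn implements kNN: the function name `knn_predict`, the toy data, the use of Euclidean distance, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=3):
    """Classify a single point x by majority vote among its k nearest neighbors."""
    # Step 1: Euclidean distance from x to every training point,
    # then the indices of the k smallest distances.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # Step 2: majority vote.  On a tie, the label seen first among the
    # neighbors (ordered by distance) wins.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# A tiny made-up training set with two well-separated classes.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3))
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3))
```

Note that there is no training step at all in this sketch: the "model" is just the stored arrays `X_toy` and `y_toy`.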

kNN is an example of a **prototype** learning method.

Some methods use the training data to build a mathematical model.  The mathematical model is then applied to new cases.

Prototype methods, on the other hand, simply store the training data as prototypes of general cases.  When a new case is encountered, only then is the relationship between the new case and the training data explored and a decision made.

Despite the simplicity of kNN, it can be a surprisingly effective classifier.  It can handle situations where the decision regions are gnarly and a class can have more than one prototype.

Even though kNN is simple, there are still some hyperparameters to tune:
- How should we choose $k$?
- What is meant by *nearest*?  I.e., how do we measure distance or similarity/dissimilarity?

# The choice of $k$

The "best" choice of $k$ is highly problem-dependent.

Large values of $k$ suppress the effects of noise and random variation in the data.
However, large values of $k$ also blur the decision boundaries.

Small values of $k$, on the other hand, may be more likely to fit noise in the training data.

We can tune the hyperparameter $k$.
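
One common way to tune $k$ is cross-validation. The sketch below compares a handful of candidate values on the iris data using 5-fold cross-validation; the candidate list, the number of folds, and the variable names (`X_cv`, `candidate_ks`, and so on) are choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_cv, y_cv = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for several candidate values of k.
candidate_ks = [1, 3, 5, 7, 9, 15, 25]
cv_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, X_cv, y_cv, cv=5).mean()

# Pick the k with the highest mean accuracy (the first such k on a tie).
best_k = max(cv_scores, key=cv_scores.get)
for k, score in cv_scores.items():
    print(f"k = {k:2d}: mean CV accuracy = {score:.3f}")
print("best k:", best_k)
```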

# The choice of distance

Any type of norm or metric can be used in kNN.

For instance, Scikit-Learn's [<code>DistanceMetric</code> class](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.DistanceMetric.html#distancemetric) provides a number of common metrics.

We can tune the choice of distance as a hyperparameter.
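
As a sketch of what tuning the distance might look like, the example below uses `GridSearchCV` to search jointly over $k$ and three of the metric names accepted by `KNeighborsClassifier`. The particular grid and the variable names are assumptions made for the example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_gs, y_gs = load_iris(return_X_y=True)

# Every combination of k and metric is scored by 5-fold cross-validation.
param_grid = {
    "n_neighbors": [1, 5, 9],
    "metric": ["euclidean", "manhattan", "chebyshev"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_gs, y_gs)

print("best parameters:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```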

# The importance of scaling

When applying kNN, make sure your data are suitably scaled.

For instance, suppose our data are (width, height) for boxes.  If width is measured in barleycorns while height is measured in (English) ells, then the width values will typically be much larger numbers than the height values:
$$
  (185, 24), (167, 24.4), (195, 24.2).
$$
If we use the Euclidean norm to compute distance, the width values will have a much greater impact on the value of the distance between points.

This is why standardizing the data (mean 0, variance 1) or scaling to a given range (e.g., [0,1]) may be a good idea.  In addition to bringing the features into a similar range, scaling also removes units and makes the data dimensionless.

Otherwise, you may encounter the curious situation that your norm is combining quantities with different physical units.
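
To make this concrete, the sketch below takes the three boxes from the example above and measures what fraction of the squared Euclidean distance between the first two boxes is due to width, before and after standardizing. The helper name `squared_contributions` and the choice of which pair of boxes to compare are made up for the illustration.

```python
import numpy as np

# The three (width, height) boxes from the example above.
boxes = np.array([[185.0, 24.0], [167.0, 24.4], [195.0, 24.2]])


def squared_contributions(a, b):
    """Per-feature squared differences; their sum is the squared Euclidean distance."""
    return (a - b) ** 2


# Raw data: share of the squared distance between boxes 0 and 1 due to width.
raw = squared_contributions(boxes[0], boxes[1])
raw_width_share = raw[0] / raw.sum()

# Standardize each feature to mean 0 and variance 1, then repeat.
boxes_std = (boxes - boxes.mean(axis=0)) / boxes.std(axis=0)
std = squared_contributions(boxes_std[0], boxes_std[1])
std_width_share = std[0] / std.sum()

print(f"width share of squared distance, raw:          {raw_width_share:.4f}")
print(f"width share of squared distance, standardized: {std_width_share:.4f}")
```

With the raw values, width accounts for almost all of the distance; after standardizing, both features contribute on a comparable scale.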

# Training a kNN classifier

Since kNN classifiers store their training sets, they can be rather large.

In addition, a naive implementation can be slow to apply.  Suppose 
* we are using a $p$-norm for the distance,
* there are $n$ training cases, and
* there are $d$ features (components) in our feature vectors.

Comparing a new instance against the entire training set requires $n$ distance calculations, each costing $\Theta(d)$ operations, so a brute-force prediction for a single instance requires $\Theta(nd)$ work.

Part of the training of a kNN classifier is the construction of special data structures for storing the training set (k-d trees or ball trees).  These data structures make the calculation of the nearest neighbors much more efficient than brute-force comparison against the entire training set.
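
The sketch below contrasts the two approaches on synthetic data: a brute-force scan over all training points, and a query against Scikit-Learn's `KDTree`. The data sizes and variable names are arbitrary choices for the illustration. In `KNeighborsClassifier` the search strategy is selected by the `algorithm` parameter (`"auto"`, `"ball_tree"`, `"kd_tree"`, or `"brute"`).

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 3))   # n = 1000 synthetic training cases, d = 3
queries = rng.random((5, 3))     # 5 new cases

# Brute force: compute all n distances for each query.
diffs = queries[:, None, :] - points[None, :, :]
brute_idx = np.argmin(np.linalg.norm(diffs, axis=2), axis=1)

# k-d tree: build the tree once, then query it for the single nearest neighbor.
tree = KDTree(points)
tree_dist, tree_idx = tree.query(queries, k=1)

print("brute force:", brute_idx)
print("k-d tree:   ", tree_idx.ravel())
```

Both approaches return the same neighbors; the tree simply avoids examining every training point for each query.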

# kNN and the Bayes error

Assume a probabilistic relationship between the values of the input features, denoted by $x$, and the class label, denoted by $y$.  The probability that the class label is $y$ given that the input is $x$ is the conditional probability $P(y | x)$.

Suppose we knew this conditional probability (in practice we do not).  Then we would simply predict the most likely label:
$$ \DeclareMathOperator*{\argmax}{arg\,max}
  y_{*} = \argmax_{y}\ P(y | x).
$$
This ideal classifier is called the **Bayes optimal classifier**.  It gives an incorrect answer whenever a sample does not have the most likely label.  The probability of this occurring is called the **Bayes error**:
$$
\newcommand{\ebayes}{\varepsilon_{\mbox{\scriptsize Bayes}}}
  \ebayes = 1 - P(y_{*} | x).
$$
It can be shown that no classifier (using the same features) can have a lower expected error rate.

Cover and Hart (1967) showed that, under certain mild assumptions, as the amount of training data grows to $\infty$ the error $\varepsilon$ of a $1$-NN classifier in binary classification converges to a value no more than twice the Bayes error.

More precisely, the limiting error $\varepsilon$ satisfies
$$
  \ebayes \leq \varepsilon \leq 2 \ebayes (1-\ebayes) \leq 2 \ebayes.
$$
This also means that with enough data we can use the $1$-NN classifier error to bound the Bayes error.
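
As a small worked example, we can invert the inequalities above. The left-hand inequality gives $\varepsilon_{\mathrm{Bayes}} \leq \varepsilon$, and solving $\varepsilon \leq 2b(1-b)$ for $b = \varepsilon_{\mathrm{Bayes}} \in [0, 1/2]$ gives $b \geq (1 - \sqrt{1 - 2\varepsilon})/2$. The function name `bayes_error_bounds` and the sample value $\varepsilon = 0.18$ are made up for the illustration, and the bounds only apply to the limiting (infinite-data) binary $1$-NN error.

```python
import math


def bayes_error_bounds(eps_1nn):
    """Bounds on the Bayes error implied by the limiting binary 1-NN error."""
    if not 0.0 <= eps_1nn <= 0.5:
        raise ValueError("the limiting binary 1-NN error must lie in [0, 0.5]")
    # Solve eps <= 2 b (1 - b) for b in [0, 1/2]:  b >= (1 - sqrt(1 - 2 eps)) / 2.
    lower = (1.0 - math.sqrt(1.0 - 2.0 * eps_1nn)) / 2.0
    # The inequality eps_Bayes <= eps gives the upper bound directly.
    upper = eps_1nn
    return lower, upper


lower, upper = bayes_error_bounds(0.18)
print(f"{lower:.3f} <= Bayes error <= {upper:.3f}")
```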

Unfortunately, we are unlikely ever to have enough data. 😦

# Building a kNN classifier in Scikit-Learn

The general documentation for kNN in scikit-learn is [here](http://scikit-learn.org/stable/modules/neighbors.html).

The documentation for the <code>KNeighborsClassifier</code> classifier in scikit-learn is [here](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier).

## Fisher's iris data set

We will use Fisher's iris data set again.

The data set consists of 50 samples from each of three species of iris, 
* [*I. setosa*](https://en.wikipedia.org/wiki/Iris_setosa),
* [*I. versicolor*](https://en.wikipedia.org/wiki/Iris_versicolor), and 
* [*I. virginica*](https://en.wikipedia.org/wiki/Iris_virginica).
  
for a total of 150 instances.

Four **features** are measured for each sample: 
1. the length of the sepal,
2. the width of the sepal,
3. the length of the petal, and
4. the width of the petal.

The measurements are in centimeters.

The **class labels** are
1. Iris setosa,
2. Iris versicolor,
3. Iris virginica.

In [None]:
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

In [None]:
print("Class labels:", np.unique(y))
print("Class names:", iris.target_names)

In [None]:
print(iris.feature_names)
print(X[0:10,:])  # Just the first 10 rows.

As in the decision tree example, we will use only the petal length and width so we will be able to plot the decision regions:

In [None]:
X = X[:,2:]

## Training and test sets

Split the data into 70% training and 30% test data:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
  train_test_split(X, y, test_size=0.3, random_state=0)  # Be sure to set the random seed so the results are reproducible!

## Construct the classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(X_train, y_train)

## Plot the decision regions

In [None]:
# A hacked up version of https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html.

import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

# Parameters
num_classes = 3
plot_colors = "ryb"

# Plot the decision boundaries.
DecisionBoundaryDisplay.from_estimator(
    knn,
    X,
    cmap=plt.cm.RdYlBu,
    response_method="predict",
    xlabel=iris.feature_names[2],  # petal length
    ylabel=iris.feature_names[3],  # petal width
)
# Adjust the layout once the figure exists; calling this first creates an empty figure.
plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)

# Plot the training points
for i, color in zip(range(num_classes), plot_colors):
    idx = np.where(y_train == i)
    plt.scatter(
        X_train[idx, 0],
        X_train[idx, 1],
        c=color,
        label=iris.target_names[i],
        edgecolor="black",
        s=15,
    )

plt.suptitle("Decision boundaries of the kNN classifier showing the training data")
plt.legend(loc="lower right", borderpad=0, handletextpad=0)
_ = plt.axis("tight")

Observe that kNN can produce complex shapes in its decision boundaries.

## Evaluate the model

In [None]:
from sklearn.metrics import accuracy_score
y_pred = knn.predict(X_train)
print('Misclassified training samples: %d' % (y_train != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_train, y_pred))

In [None]:
y_pred = knn.predict(X_test)
print('Misclassified test samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

# The confusion matrix for the test data.
y_pred = knn.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)

class_names = ("I. setosa", "I. versicolor", "I. virginica")
ConfusionMatrixDisplay.from_estimator(knn, X_test, y_test, display_labels=class_names)

In [None]:
# Plot the decision boundaries.
DecisionBoundaryDisplay.from_estimator(
    knn,
    X,
    cmap=plt.cm.RdYlBu,
    response_method="predict",
    xlabel=iris.feature_names[2],  # petal length
    ylabel=iris.feature_names[3],  # petal width
)
# Adjust the layout once the figure exists; calling this first creates an empty figure.
plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)

# Plot the test points
for i, color in zip(range(num_classes), plot_colors):
    idx = np.where(y_test == i)
    plt.scatter(
        X_test[idx, 0],
        X_test[idx, 1],
        c=color,
        label=iris.target_names[i],
        edgecolor="black",
        s=15,
    )

plt.suptitle("Decision boundaries of the kNN classifier showing the test data")
plt.legend(loc="lower right", borderpad=0, handletextpad=0)
_ = plt.axis("tight")

# Feature scaling

The relative scaling of the data affects the performance of many training algorithms as well as the performance of the resulting ML model.

The iris data is already pretty well scaled: across all values there is only a spread of two orders of magnitude:

In [None]:
print(f"mean:      {np.mean(X, axis=0)}")
print(f"std. dev.: {np.std(X, axis=0)}")

q = np.quantile(X, (0, 0.25, 0.5, 0.75, 1), axis=0)
iqr = q[3] - q[1]
print(f"median:    {np.median(X, axis=0)}")
print(f"IQR:       {iqr}")
print(f"quantiles:")
print(f"{q}")

Let's rescale the iris training and test data to have mean 0 and variance 1 using [<code>StandardScaler</code>](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#standardscaler):

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X_train)  # Compute the mean and std to be used for later scaling.
X_train_std = sc.transform(X_train)
X_test_std  = sc.transform(X_test)

In [None]:
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)

Now repeat the evaluation steps.

In [None]:
from sklearn.metrics import accuracy_score
y_pred = knn.predict(X_train_std)
print('Misclassified training samples: %d' % (y_train != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_train, y_pred))

In [None]:
y_pred = knn.predict(X_test_std)
print('Misclassified test samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

class_names=("I. setosa", "I. versicolor", "I. virginica")
ConfusionMatrixDisplay.from_estimator(knn, X_test_std, y_test, display_labels=class_names)

Let's look at the decision regions:

In [None]:
# Parameters
num_classes = 3
plot_colors = "ryb"

X_std = sc.transform(X)

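# Plot the decision boundaries.
DecisionBoundaryDisplay.from_estimator(
    knn,
    X_std,
    cmap=plt.cm.RdYlBu,
    response_method="predict",
    xlabel=f"{iris.feature_names[2]} (standardized)",  # petal length
    ylabel=f"{iris.feature_names[3]} (standardized)",  # petal width
)
# Adjust the layout once the figure exists; calling this first creates an empty figure.
plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)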

# Plot the training points
for i, color in zip(range(num_classes), plot_colors):
    idx = np.where(y_train == i)
    plt.scatter(
        X_train_std[idx, 0],
        X_train_std[idx, 1],
        c=color,
        label=iris.target_names[i],
        edgecolor="black",
        s=15,
    )

plt.suptitle("Decision boundaries of the kNN classifier showing the training data")
plt.legend(loc="lower right", borderpad=0, handletextpad=0)
_ = plt.axis("tight")

In [None]:
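# Plot the decision boundaries.
DecisionBoundaryDisplay.from_estimator(
    knn,
    X_std,
    cmap=plt.cm.RdYlBu,
    response_method="predict",
    xlabel=f"{iris.feature_names[2]} (standardized)",  # petal length
    ylabel=f"{iris.feature_names[3]} (standardized)",  # petal width
)
# Adjust the layout once the figure exists; calling this first creates an empty figure.
plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)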

# Plot the test points
for i, color in zip(range(num_classes), plot_colors):
    idx = np.where(y_test == i)
    plt.scatter(
        X_test_std[idx, 0],
        X_test_std[idx, 1],
        c=color,
        label=iris.target_names[i],
        edgecolor="black",
        s=15,
    )

plt.suptitle("Decision boundaries of the kNN classifier showing the test data")
plt.legend(loc="lower right", borderpad=0, handletextpad=0)
_ = plt.axis("tight")

# Pipelines in Scikit-Learn

If we add scaling to our process we will need to make sure that the scaling is applied before the model is applied.  Rather than require the user to understand and implement all necessary preprocessing, Scikit-Learn makes it simple to package a whole stream of computation using [pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).  Per the documentation:
<blockquote>
Pipeline allows you to sequentially apply a list of transformers to preprocess the data and, if desired, conclude the sequence with a final predictor for predictive modeling.

Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using memory argument.
</blockquote>

Let's roll up the scaling and kNN classifier we just did into a pipeline.

In [None]:
from sklearn.pipeline import Pipeline, make_pipeline

sc = StandardScaler()
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')

clf = make_pipeline(sc, knn)

print(clf)
print(clf[0])
print(clf[1])

In [None]:
clf.fit(X_train, y_train)

Let's confirm the pipeline behaves like what we did above.

In [None]:
from sklearn.metrics import accuracy_score
y_pred = clf.predict(X_test)
print('Misclassified test samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))

In [None]:
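# Plot the decision boundaries.  Fitting the pipeline fitted `sc` and `knn` in place,
# so `knn` is the pipeline's final step and expects standardized inputs.
DecisionBoundaryDisplay.from_estimator(
    knn,
    X_std,
    cmap=plt.cm.RdYlBu,
    response_method="predict",
    xlabel=f"{iris.feature_names[2]} (standardized)",  # petal length
    ylabel=f"{iris.feature_names[3]} (standardized)",  # petal width
)
# Adjust the layout once the figure exists; calling this first creates an empty figure.
plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)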

# Plot the test points
for i, color in zip(range(num_classes), plot_colors):
    idx = np.where(y_test == i)
    plt.scatter(
        X_test_std[idx, 0],
        X_test_std[idx, 1],
        c=color,
        label=iris.target_names[i],
        edgecolor="black",
        s=15,
    )

plt.suptitle("Decision boundaries of the kNN classifier showing the test data")
plt.legend(loc="lower right", borderpad=0, handletextpad=0)
_ = plt.axis("tight")
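
Another benefit of a pipeline is that it can be tuned as a single estimator. The sketch below, which is self-contained and uses its own variable names (`pipe`, `grid`, `X_tr`, and so on), runs a cross-validated search over $k$ on a scaler-plus-kNN pipeline. Because the scaler is part of the pipeline, it is refit on the training portion of each fold, so the held-out fold never influences the scaling. The grid of candidate values is an arbitrary choice for the illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_petal = X_all[:, 2:]  # petal length and width, as above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_petal, y_all, test_size=0.3, random_state=0
)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name, so the
# kNN parameter is addressed as "kneighborsclassifier__n_neighbors".
grid = GridSearchCV(
    pipe, {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9]}, cv=5
)
grid.fit(X_tr, y_tr)

print("best k:", grid.best_params_["kneighborsclassifier__n_neighbors"])
print(f"test accuracy: {grid.score(X_te, y_te):.2f}")
```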

# Saving models

The simplest way to save a Scikit-Learn model is to use the Python built-in [pickle](https://docs.python.org/3/library/pickle.html) module.

Other solutions are discussed in the [Scikit-Learn User Guide](https://scikit-learn.org/stable/model_persistence.html#model-persistence).

In [None]:
from pickle import dump, load

with open("model.pkl", "wb") as f:
    dump(clf, f)

with open("model.pkl", "rb") as f:
    model = load(f)

y_pred = model.predict(X_test)
print(y_pred)
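
One of the alternatives is `joblib`, which is installed alongside Scikit-Learn. The sketch below is a self-contained round trip: it fits a small pipeline, saves it to a temporary directory, reloads it, and checks that the restored model reproduces the original predictions. The file name `model.joblib` and the variable names are made up for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_j, y_j = load_iris(return_X_y=True)
pipe_j = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe_j.fit(X_j, y_j)

# Save to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "model.joblib")  # hypothetical file name
    joblib.dump(pipe_j, path)
    restored = joblib.load(path)

same = np.array_equal(pipe_j.predict(X_j), restored.predict(X_j))
print("restored model reproduces the predictions:", same)
```

As with `pickle`, only load model files from sources you trust, since loading can execute arbitrary code.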