# Nearest Neighbour

The `sklearn.neighbors` module provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods, notably manifold learning and spectral clustering. Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels.

The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice.
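
As a rough illustration of this principle, the sketch below predicts a label by majority vote among the k closest training points under Euclidean distance. The helper name `knn_predict` and the toy data are made up for this example; it is not how scikit-learn implements the search.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters (hypothetical data)
X_train = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
)
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3)
print(pred_a, pred_b)
```

Each query point sits inside one cluster, so its three nearest neighbours all carry that cluster's label.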

Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.

---

## Math Principle

Classification is based on the classes of a point's nearest neighbours: the query point looks up the labels of its closest training samples and takes the most common one (a majority vote).

The most commonly used algorithm is k-nearest neighbors (KNN), where k is the number of neighbors the point checks.

scikit-learn also provides radius-based neighbors (RNN), where the algorithm checks the classes of all points within a radius specified by the user.

For KNN, which is more commonly used, the best value of k depends heavily on the data. A larger k makes the model more robust to noise, but it also blurs the class boundaries. A suitable k is usually chosen by cross-validation rather than fixed in advance.
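
One common way to pick k is to compare cross-validated accuracy over a few candidate values. The sketch below does this on the iris dataset; the candidate list and the 5-fold setting are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values of k (an arbitrary selection for this sketch)
candidate_ks = [1, 3, 5, 7, 9, 11, 15]

mean_scores = {}
for k in candidate_ks:
    clf = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    mean_scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print(f"k={k:2d}  mean accuracy={score:.3f}")
print("best k:", best_k)
```

`GridSearchCV` can perform the same search when k has to be tuned together with other parameters.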

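RNN can be the better choice when the data is not sampled uniformly, because the number of neighbors then adapts to the local density. In high-dimensional spaces, however, it performs poorly because of the curse of dimensionality.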
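
A minimal sketch of radius-based classification with `RadiusNeighborsClassifier` follows. The one-dimensional toy data and the radius are assumptions chosen so that one query has no neighbors at all; `outlier_label` decides what is returned in that case.

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

# Two small clusters on a line (hypothetical data)
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])

# outlier_label is returned for queries with no training sample within `radius`
clf = RadiusNeighborsClassifier(radius=0.3, outlier_label=-1)
clf.fit(X, y)

# The last query is far from every training point
preds = clf.predict([[0.05], [1.15], [5.0]])
print(preds)
```

Without `outlier_label`, a query with no neighbors inside the radius raises an error, so setting it is worth considering whenever the radius is small relative to the spread of the data.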

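By default, every neighbor contributes equally to the vote (`weights='uniform'`). Setting `weights='distance'` weights each neighbor by the inverse of its distance, so closer neighbors count more.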
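
The effect of `weights` can be seen on a tiny example where one nearby point is outvoted by two distant ones. The data below is made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One class-1 point close to the query, two class-0 points farther away
X = np.array([[0.0], [2.0], [2.1]])
y = np.array([1, 0, 0])
query = [[0.5]]

uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y)
distance = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)

# Uniform vote: two votes for class 0 against one for class 1
pred_uniform = uniform.predict(query)[0]
# Inverse-distance vote: 1/0.5 for class 1 against 1/1.5 + 1/1.6 for class 0
pred_distance = distance.predict(query)[0]
print(pred_uniform, pred_distance)
```

With uniform weights the majority class wins; with distance weights the single close neighbor outweighs the two distant ones.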

---

## Implementation

1. Brute Force
   
   This algorithm computes the distance from the query to every training point, so it is only practical for small datasets. Select it with the parameter `algorithm='brute'`.

2. K-D Tree
   
   A K-D tree is more efficient than brute force. It is a binary tree that recursively partitions the parameter space along the data axes, dividing it into nested orthotropic (axis-aligned) regions into which data points are filed.

   Once the K-D tree is built, nearest-neighbor queries are very fast in low dimensions. As the number of dimensions grows, however, queries become slow.

3. Ball Tree
   
   To address the inefficiencies of KD Trees in higher dimensions, the ball tree data structure was developed. Where KD trees partition data along Cartesian axes, ball trees partition data in a series of nesting hyper-spheres. This makes tree construction more costly than that of the KD tree, but results in a data structure which can be very efficient on highly structured data, even in very high dimensions.

   A ball tree recursively divides the data into nodes defined by a centroid $C$ and radius $r$ , such that each point in the node lies within the hyper-sphere defined by $r$ and $C$. 
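
The three search strategies above can be selected through the `algorithm` parameter of `NearestNeighbors`. The sketch below, on random data generated only for this example, checks that they return the same neighbors; they differ in speed, not in the result.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((200, 3))  # 200 random points in 3 dimensions

results = {}
for algorithm in ["brute", "kd_tree", "ball_tree"]:
    nn = NearestNeighbors(n_neighbors=4, algorithm=algorithm).fit(X)
    # Neighbours of the first five points (each point is its own first neighbour)
    distances, indices = nn.kneighbors(X[:5])
    results[algorithm] = (distances, indices)

# All three strategies should agree on which points are nearest
same_indices = all(
    np.array_equal(results["brute"][1], results[name][1])
    for name in ["kd_tree", "ball_tree"]
)
print("same neighbours:", same_indices)
```

On data this small the timing difference is negligible; the tree-based methods start to pay off as the number of samples grows.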

In [None]:
# Sample code for a K-D tree
# For illustration only; in practice use sklearn.neighbors.KDTree

class Node:
    def __init__(self, point, left=None, right=None):
        self.point = point
        self.left = left
        self.right = right

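def build_kd_tree(points, depth=0):
    # An empty point list yields an empty subtree
    if not points:
        return None

    k = len(points[0])  # dimensionality of the data
    axis = depth % k  # cycle through the axes level by level
    # Sort a copy so the caller's list is not reordered in place
    points = sorted(points, key=lambda point: point[axis])
    median_idx = len(points) // 2
    median = points[median_idx]  # the median point becomes this node

    # Points before the median go left, points after it go right
    left_points = points[:median_idx]
    right_points = points[median_idx + 1:]

    left_subtree = build_kd_tree(left_points, depth + 1)
    right_subtree = build_kd_tree(right_points, depth + 1)

    return Node(median, left_subtree, right_subtree)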

def print_tree(node, level=0):
    if node:
        print(level, node.point)
        print_tree(node.left, level + 1)
        print_tree(node.right, level + 1)

data = [(2,3,4), (5,4,6), (9,6,8), (4,7,2), (8,1,5), (7,2,3)]
root = build_kd_tree(data)

print_tree(root)

In [None]:
# Sample code for a ball tree
# For illustration only; in practice use sklearn.neighbors.BallTree
import numpy as np

class BallNode:
    def __init__(self, center, radius, left=None, right=None, points=None):
        self.center = center  # centre of the ball
        self.radius = radius  # radius of the ball
        self.left = left  # left subtree
        self.right = right  # right subtree
        self.points = points  # data points stored in a leaf node

def build_ball_tree(points, leaf_size=5):
    if len(points) == 0:
        return None

    center = np.mean(points, axis=0)  # centroid of the points
    radius = np.max(np.linalg.norm(points - center, axis=1))  # distance to the farthest point

    if len(points) <= leaf_size:  # leaf node
        return BallNode(center, radius, points=points)

    # Simplified split: sort along the dimension of greatest spread and cut at
    # the median, so that each child ball covers a compact region of space
    spread_dim = np.argmax(points.max(axis=0) - points.min(axis=0))
    order = np.argsort(points[:, spread_dim])
    mid = len(points) // 2

    left_child = build_ball_tree(points[order[:mid]], leaf_size)
    right_child = build_ball_tree(points[order[mid:]], leaf_size)

    return BallNode(center, radius, left=left_child, right=right_child)

def print_ball_tree(node, depth=0):
    if node is None:
        return

    print("  " * depth, f"Center: {node.center}, Radius: {node.radius}")

    if node.points is not None:
        print("  " * (depth + 1), "Leaf Points:", node.points)
    else:
        print("  " * (depth + 1), "Left Child:")
        print_ball_tree(node.left, depth + 2)
        print("  " * (depth + 1), "Right Child:")
        print_ball_tree(node.right, depth + 2)

np.random.seed(0)
points = np.random.rand(100, 2)  # generate random data points

leaf_size = 5  # maximum number of points in a leaf node

ball_tree = build_ball_tree(points, leaf_size=leaf_size)
print("Ball Tree construction completed.")
print("Printing Ball Tree Structure:")
print_ball_tree(ball_tree)

## Method, Parameter and Attribute

### NearestNeighbors

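- Method
  - fit
  - kneighbors
    
    Return the distances to and indices of the nearest neighbors of each query point, as two arrays of shape (n_queries, n_neighbors). If the training data itself is passed as the query, each sample is returned as its own nearest neighbor; calling `kneighbors()` with no argument excludes it.

  - kneighbors_graph()

    Get a sparse graph showing the connections between neighboring points.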

- Parameter
  - n_neighbors
    
    Number of neighbors to use by default for `kneighbors` queries.

  - radius
     
    Range of parameter space to use by default for radius_neighbors queries.

  - leaf_size

    Leaf size passed to BallTree or KDTree. 

  - metric

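    Distance metric to use. The default is `'minkowski'`, which with `p=2` gives the standard Euclidean distance. See the list of [valid distance metrics](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics).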

  - p

    Power parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances: `p=1` is the Manhattan distance and `p=2` is the Euclidean distance.

  - metric_params

    Additional keyword arguments for the metric function.

  - n_jobs

    The number of parallel jobs to run for neighbors search. 
    
    This parameter is used to specify how many concurrent processes or threads should be used for routines that are parallelized with joblib.

  - algorithm
    - ball_tree
    - kd_tree
    - brute
    - auto

- Attributes
  - effective_metric_
  - effective_metric_params_
  - n_features_in_
  - feature_names_in_
  - n_samples_fit_
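
The methods listed above can be tried on a handful of points. The sketch below uses made-up 2-D data and shows the difference between querying with the training data (each point finds itself first) and calling `kneighbors()` with no argument (each point is excluded from its own neighbors).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four toy points (hypothetical data)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
nn = NearestNeighbors(n_neighbors=2).fit(X)

# Explicit query: each training point is returned as its own nearest neighbour
dist_with_self, idx_with_self = nn.kneighbors(X)

# No argument: neighbours of the training points, excluding each point itself
dist_no_self, idx_no_self = nn.kneighbors()

# Sparse connectivity graph: entry (i, j) is 1 if j is among the neighbours of i
graph = nn.kneighbors_graph(X, mode="connectivity")

print(idx_with_self)
print(idx_no_self)
print(graph.toarray())
```

Passing `mode="distance"` to `kneighbors_graph` stores the distances in the graph instead of ones.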

### KDTree

- Parameter
  - leaf_size
  - metric
- Attributes
  - data: The training data

For more information, see the [API reference](https://scikit-learn.org/stable/modules/classes.html)
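
A short sketch of using `KDTree` directly, on random data generated for this example: the tree is built once in the constructor and then queried for the k nearest points.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((50, 3))  # 50 random points in 3 dimensions

# Build the tree; leaf_size trades construction time against query time
tree = KDTree(X, leaf_size=10, metric="euclidean")

# Three nearest neighbours of the first point (the point itself comes first)
dist, ind = tree.query(X[:1], k=3)

# The `data` attribute exposes the training data held by the tree
stored = np.asarray(tree.data)

print(ind, dist)
print(stored.shape)
```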

### BallTree

The parameters and attributes are the same as for KDTree.
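
`BallTree` is queried in the same way. The sketch below, on four made-up points, also shows a radius query, which returns every point within a given distance instead of a fixed number of neighbors.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Three points near the origin and one far away (hypothetical data)
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [3.0, 3.0]])
tree = BallTree(X, leaf_size=2, metric="euclidean")

# k-nearest query: two closest points to (0.2, 0.1)
dist, ind = tree.query([[0.2, 0.1]], k=2)

# Radius query: indices of all points within r of the origin
within = tree.query_radius([[0.0, 0.0]], r=0.6)

# With count_only=True only the number of points is returned
counts = tree.query_radius([[0.0, 0.0]], r=0.6, count_only=True)

print(ind, dist)
print(sorted(within[0].tolist()), counts)
```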

### Nearest Centroid Classifier

The NearestCentroid classifier is a simple algorithm that represents each class by the centroid of its members. In effect, this makes it similar to the label updating phase of the KMeans algorithm. It also has no parameters to choose, making it **a good baseline classifier**. It does, however, **suffer on non-convex classes**, as well as when classes have **drastically different variances**, as **equal variance in all dimensions is assumed**.

In [None]:
from sklearn.neighbors import NearestCentroid
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
clf = NearestCentroid()
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))

### Nearest Shrunken Centroid

The NearestCentroid classifier has a shrink_threshold parameter, which implements the nearest shrunken centroid classifier. In effect, the value of each feature for each centroid is divided by the within-class variance of that feature. The feature values are then reduced by shrink_threshold. Most notably, if a particular feature value crosses zero, it is set to zero. In effect, this removes the feature from affecting the classification. This is useful, for example, for **removing noisy features**.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap

from sklearn import datasets
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.neighbors import NearestCentroid

# import some data to play with
iris = datasets.load_iris()
# we only take the first two features. We could avoid this ugly
# slicing by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target

# Create color maps
cmap_light = ListedColormap(["orange", "cyan", "cornflowerblue"])
cmap_bold = ListedColormap(["darkorange", "c", "darkblue"])

for shrinkage in [None, 0.2]:
    # we create an instance of Nearest Centroid Classifier and fit the data.
    clf = NearestCentroid(shrink_threshold=shrinkage)
    clf.fit(X, y)
    y_pred = clf.predict(X)
    print(shrinkage, np.mean(y == y_pred))

    _, ax = plt.subplots()
    DecisionBoundaryDisplay.from_estimator(
        clf, X, cmap=cmap_light, ax=ax, response_method="predict"
    )

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, edgecolor="k", s=20)
    plt.title("3-Class classification (shrink_threshold=%r)" % shrinkage)
    plt.axis("tight")

plt.show()

### Nearest Neighbors Transformer

See [guide](https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-transformer)
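
As a rough sketch of the idea, `KNeighborsTransformer` precomputes a sparse neighbors graph that a later estimator can consume with `metric="precomputed"`. The dataset, split, and neighbor counts below are arbitrary choices for illustration; the transformer should compute at least as many neighbors as the classifier needs.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsTransformer
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Standalone use: a sparse (n_samples, n_samples) graph of neighbour distances
graph = KNeighborsTransformer(n_neighbors=10, mode="distance").fit_transform(X_train)

# In a pipeline: the classifier reads the precomputed graph instead of raw features
pipe = make_pipeline(
    KNeighborsTransformer(n_neighbors=10, mode="distance"),
    KNeighborsClassifier(n_neighbors=5, metric="precomputed"),
)
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)

print(graph.shape, round(score, 3))
```

Separating the graph computation from the classifier makes it possible to cache the graph and reuse it while trying different values of `n_neighbors` downstream.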

### Neighborhood Components Analysis

Neighborhood Components Analysis (NCA, NeighborhoodComponentsAnalysis) is a distance metric learning algorithm which aims to **improve the accuracy of nearest neighbors classification** compared to the standard Euclidean distance. The algorithm directly maximizes a stochastic variant of the leave-one-out k-nearest neighbors (KNN) score on the training set. It **can also learn a low-dimensional linear projection of data** that can be used for **data visualization** and **fast classification**.

In [None]:
from sklearn.neighbors import NeighborhoodComponentsAnalysis, KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.7, random_state=42
)
nca = NeighborhoodComponentsAnalysis(random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
nca_pipe = Pipeline([('nca', nca), ('knn', knn)])
nca_pipe.fit(X_train, y_train)
print(nca_pipe.score(X_test, y_test))

In [None]:
# License: BSD 3 clause

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

from sklearn import datasets
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

n_neighbors = 1

dataset = datasets.load_iris()
X, y = dataset.data, dataset.target

# we only take two features. We could avoid this ugly
# slicing by using a two-dim dataset
X = X[:, [0, 2]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.7, random_state=42
)

h = 0.05  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(["#FFAAAA", "#AAFFAA", "#AAAAFF"])
cmap_bold = ListedColormap(["#FF0000", "#00FF00", "#0000FF"])

names = ["KNN", "NCA, KNN"]

classifiers = [
    Pipeline(
        [
            ("scaler", StandardScaler()),
            ("knn", KNeighborsClassifier(n_neighbors=n_neighbors)),
        ]
    ),
    Pipeline(
        [
            ("scaler", StandardScaler()),
            ("nca", NeighborhoodComponentsAnalysis()),
            ("knn", KNeighborsClassifier(n_neighbors=n_neighbors)),
        ]
    ),
]

for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)

    _, ax = plt.subplots()
    DecisionBoundaryDisplay.from_estimator(
        clf,
        X,
        cmap=cmap_light,
        alpha=0.8,
        ax=ax,
        response_method="predict",
        plot_method="pcolormesh",
        shading="auto",
    )

    # Plot also the training and testing points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, edgecolor="k", s=20)
    plt.title("{} (k = {})".format(name, n_neighbors))
    plt.text(
        0.9,
        0.1,
        "{:.2f}".format(score),
        size=15,
        ha="center",
        va="center",
        transform=plt.gca().transAxes,
    )

plt.show()

In [None]:
# License: BSD 3 clause

import matplotlib.pyplot as plt
import numpy as np

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

n_neighbors = 3
random_state = 0

# Load Digits dataset
X, y = datasets.load_digits(return_X_y=True)

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=random_state
)

dim = len(X[0])
n_classes = len(np.unique(y))

# Reduce dimension to 2 with PCA
pca = make_pipeline(StandardScaler(), PCA(n_components=2, random_state=random_state))

# Reduce dimension to 2 with LinearDiscriminantAnalysis
lda = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=2))

# Reduce dimension to 2 with NeighborhoodComponentAnalysis
nca = make_pipeline(
    StandardScaler(),
    NeighborhoodComponentsAnalysis(n_components=2, random_state=random_state),
)

# Use a nearest neighbor classifier to evaluate the methods
knn = KNeighborsClassifier(n_neighbors=n_neighbors)

# Make a list of the methods to be compared
dim_reduction_methods = [("PCA", pca), ("LDA", lda), ("NCA", nca)]

# plt.figure()
for i, (name, model) in enumerate(dim_reduction_methods):
    plt.figure()
    # plt.subplot(1, 3, i + 1, aspect=1)

    # Fit the method's model
    model.fit(X_train, y_train)

    # Fit a nearest neighbor classifier on the embedded training set
    knn.fit(model.transform(X_train), y_train)

    # Compute the nearest neighbor accuracy on the embedded test set
    acc_knn = knn.score(model.transform(X_test), y_test)

    # Embed the data set in 2 dimensions using the fitted model
    X_embedded = model.transform(X)

    # Plot the projected points and show the evaluation score
    plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, s=30, cmap="Set1")
    plt.title(
        "{}, KNN (k={})\nTest accuracy = {:.2f}".format(name, n_neighbors, acc_knn)
    )
plt.show()

For a more advanced example, see [here](https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html#sphx-glr-auto-examples-manifold-plot-lle-digits-py)