# K-NEAREST NEIGHBORS (KNN)

## GENERAL

KNN is a supervised learning algorithm used for both classification and regression tasks. It is a non-parametric, instance-based (lazy) learner: it stores the training data and makes each prediction from the labels or values of the most similar stored points.
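
To make the "similarity between data points" idea concrete, here is a minimal from-scratch sketch of KNN classification. The function name `knn_predict` and the toy data are made up for illustration; this is not how scikit-learn implements it internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0])))  # query near the first cluster
print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0])))  # query near the second cluster
```

There is no training step beyond storing the data, which is why KNN is called a lazy learner: all the work happens at prediction time.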

## LIBRARIES

1. ***scikit-learn***: a general-purpose KNN implementation that is easy to use, well-optimized, and widely adopted in both industry and academia.

    * KNeighborsClassifier - for classification
    * KNeighborsRegressor - for regression
    * KNNImputer - for handling missing values

In [1]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.impute import KNNImputer
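
As a quick illustration of the two estimators not used elsewhere in these notes, the sketch below runs KNeighborsRegressor and KNNImputer on tiny made-up arrays. The data values are assumptions chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsRegressor

# Regression: the prediction is the mean target of the k nearest neighbors
X_reg = np.array([[0.0], [1.0], [2.0], [3.0]])
y_reg = np.array([0.0, 0.0, 1.0, 1.0])
reg = KNeighborsRegressor(n_neighbors=2).fit(X_reg, y_reg)
pred = reg.predict([[1.5]])  # neighbors are x=1 and x=2, so the mean target is 0.5
print('Regression prediction:', pred)

# Imputation: each missing value is filled with the mean of that feature
# over the nearest rows in which the feature is present
X_missing = np.array([[1.0, 2.0, np.nan],
                      [3.0, 4.0, 3.0],
                      [np.nan, 6.0, 5.0],
                      [8.0, 8.0, 7.0]])
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_missing)
print(X_filled)
```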


2. ***faiss***: a library for large-scale similarity search, developed by Facebook AI Research (now Meta AI), that is extremely fast on large, high-dimensional datasets (millions of points).

    * faiss.IndexFlatL2 - exact (brute-force) nearest-neighbor search using L2 (Euclidean) distance

In [2]:
# pip install faiss-cpu (for CPU version)
# pip install faiss-gpu (for GPU version)

import faiss

FAISS is particularly popular for its efficiency and scalability in handling high-dimensional vectors, which makes it useful in applications such as image and text retrieval and recommendation systems.

## HYPERPARAMETERS

1. ***n_neighbors (K)*** - Number of Neighbors: This parameter specifies the number of closest data points to consider for classification or regression. Selecting an appropriate value for *K* is essential: if *K* is too small, the model may have high variance, resulting in *overfitting*, while a very large *K* can lead to high bias, causing *underfitting*.

    Best Practice: Use cross-validation (e.g., GridSearchCV) to find the optimal *K*.

In [3]:
# EXAMPLE
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Define the parameter grid
param_grid = {'n_neighbors': range(1, 20)}

# Perform Grid Search
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

# Output the best value of K
print('Best K:', grid.best_params_['n_neighbors'])

Best K: 5
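
The trade-off between small and large *K* can also be inspected directly by comparing training and test accuracy for a few values. The sketch below reuses the same iris split; the values 1, 5, and 50 are arbitrary choices for illustration, and the exact accuracies depend on the split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

scores = {}
for k in (1, 5, 50):  # very small, moderate, and very large K
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    scores[k] = (train_acc, test_acc)
    print(f'K={k:>2}: train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}')
```

A large gap between training and test accuracy at small *K* points toward overfitting, while low accuracy on both at large *K* points toward underfitting.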


2. ***metric*** - Distance Metric: This parameter specifies how the algorithm measures the distance between points. Common options:

    * *Euclidean Distance* - best for continuous data; equivalent to the scikit-learn default (*'minkowski'* with *p=2*)
    * *Manhattan Distance* - better for grid-like data, e.g., city blocks
    * *Minkowski Distance* - generalized form (*p=1* gives Manhattan, *p=2* gives Euclidean)
    * *Hamming Distance* - for categorical data

In [4]:
# EXAMPLE
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)
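
To show how the metrics listed above differ, the following sketch computes each of them by hand with NumPy for one pair of vectors. The vectors and the helper name `minkowski` are made up for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])


def minkowski(u, v, p):
    """Minkowski distance: (sum_i |u_i - v_i|^p)^(1/p)."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)


euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2 + 0^2) = 5.0
manhattan = minkowski(a, b, p=1)  # 3 + 4 + 0 = 7.0
hamming = np.mean(a != b)         # fraction of positions that differ: 2 of 3

print('Euclidean:', euclidean)
print('Manhattan:', manhattan)
print('Hamming  :', hamming)
```

The Hamming value here follows the convention of reporting the proportion of differing components, which is why it is mainly meaningful for categorical or binary features.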

3. ***weights*** - Weighting of Neighbors: This parameter specifies how much influence each neighbor has. Options:
    * *'uniform'* (default) - all neighbors contribute equally
    * *'distance'* - closer neighbors have more influence (each neighbor is weighted by the inverse of its distance)

    Best Practice: *'distance'* is better for datasets where closer points are more relevant.

In [5]:
# EXAMPLE
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean', weights='distance')
knn.fit(X_train, y_train)
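
The difference between the two weighting options shows up when the nearest neighbor disagrees with the majority. The sketch below uses three made-up one-dimensional points, chosen so that the two settings give different answers.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one very close point of class 1, two farther points of class 0
X_toy = np.array([[0.9], [2.0], [2.1]])
y_toy = np.array([1, 0, 0])
query = np.array([[1.0]])

uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X_toy, y_toy)
weighted = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_toy, y_toy)

# Uniform vote: two neighbors of class 0 outvote one neighbor of class 1
uniform_pred = uniform.predict(query)[0]
# Inverse-distance vote: class 1 gets about 1/0.1 = 10, class 0 gets 1/1.0 + 1/1.1
weighted_pred = weighted.predict(query)[0]

print('uniform :', uniform_pred)
print('distance:', weighted_pred)
```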


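4. ***algorithm*** - Search Algorithm for Nearest Neighbors: This parameter specifies how neighbors are searched. Options:
    * *'auto'* (default) - automatically selects the best method based on the training data
    * *'ball_tree'* - good for medium-sized data; often holds up better than *'kd_tree'* as dimensionality grows
    * *'kd_tree'* - efficient for low-dimensional data
    * *'brute'* - computes the distance to every training point; slower on large datasets but works for all cases

    Best Practice: Leave it as *'auto'*, unless you have a large dataset and measurements show that a specific method is faster.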

In [6]:
# EXAMPLE
knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
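
The search algorithm affects speed, not the result: for an exact search, each option should return the same neighbors. The sketch below checks this on random data using scikit-learn's NearestNeighbors class; the data sizes and the random seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))   # 500 random points in 3 dimensions
queries = rng.normal(size=(5, 3))    # 5 query points

results = {}
for algorithm in ('brute', 'kd_tree', 'ball_tree'):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X_demo)
    distances, indices = nn.kneighbors(queries)
    results[algorithm] = (distances, indices)

# Compare the tree-based results against the brute-force reference
same = all(
    np.allclose(results['brute'][0], results[name][0])
    for name in ('kd_tree', 'ball_tree')
)
print('All algorithms return the same neighbor distances:', same)
```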

## KNN HYPERPARAMETER TUNING EXAMPLE

In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    'n_neighbors': range(1, 20),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'hamming']
}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

print('Best Parameters:', grid.best_params_)

Best Parameters: {'metric': 'manhattan', 'n_neighbors': 8, 'weights': 'uniform'}
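
The grid search only reports cross-validated performance on the training split. A natural follow-up, sketched below under the same iris split, is to score the refit best model on the held-out test set. The reduced parameter grid is an assumption made to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

param_grid = {
    'n_neighbors': range(1, 20),
    'weights': ['uniform', 'distance'],
}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

# GridSearchCV refits the best configuration on the full training split by default,
# so grid.score evaluates that refit model on unseen data
test_accuracy = grid.score(X_test, y_test)

print('Best Parameters:', grid.best_params_)
print('Best CV accuracy:', round(grid.best_score_, 3))
print('Test accuracy:', round(test_accuracy, 3))
```

Comparing the cross-validated score with the test score gives a rough check that the selected hyperparameters generalize beyond the folds used for tuning.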
