<a href="https://colab.research.google.com/github/sakeefkarim/miscellaneous/blob/main/code/knn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A *Very* Basic Introduction to $k$-Nearest Neighbours Algorithms

[Sakeef M. Karim](https://www.sakeefkarim.com/)

sakeef.karim@nyu.edu

## Preliminaries

Let's import a few essential libraries (e.g., `pandas` for data wrangling) and submodules (e.g., `sklearn.neighbors` from `scikit-learn`) to develop our $k$-nearest neighbours algorithm.

In [1]:
# For data manipulation:

import pandas as pd

# From scikit-learn, we import modules to pre-process data, fit KNN classifier:

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV

from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier

# numpy will help us set up our Grid Search

import numpy as np

# To save, load our model:

from joblib import dump, load

Once again, we can programmatically mount our Google Drive folders onto a Colab session as follows:

In [None]:
from google.colab import drive
drive.mount('/drive')

Mounted at /drive


To fit a supervised machine learning algorithm, we'll need some labeled data. To this end, we'll once again work with a dataset that we first encountered during the [PopAgingDataViz](https://popagingdataviz.com/) workshop: [gapminder](https://jennybc.github.io/gapminder/).

In [2]:
# Loading gapminder dataset:

gapminder = pd.read_excel("https://github.com/sakeefkarim/intro.python.24/raw/main/data/gapminder.xlsx")

In the spirit of simplicity, we'll do some ***pre-processing*** by:

+ Isolating the latest year in `gapminder` (2007) and dropping the `year` column.

+ Generating a dummy indicator (`asia`) of whether a country is in Asia.

+ Isolating our feature vector and target variable in separate objects.

+ Standardizing our input variables (features).

In [6]:
# Homing-in on observations in the latest year (2007)

gapminder = gapminder.query("year == 2007").reset_index(drop=True).drop(columns='year')

# Generating dummy indicator indexing whether a country is in Asia:

gapminder['asia'] = pd.get_dummies(gapminder['continent'])['Asia']

# Dropping observations with missing values (not necessary for gapminder):

# gapminder.dropna(inplace = True)

# Removing target variable and categorical indicators from feature vector:

X = gapminder.drop(columns = ['asia', 'continent', 'country'])

# Isolating target variable:

y = gapminder['asia']

# Standardizing feature vector

scaler = StandardScaler()

X = scaler.fit_transform(X)
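
Standardization matters for KNN because the algorithm relies on distances between observations, so features measured on large scales (such as population) would otherwise dominate. As a quick sanity check, here is a minimal sketch of what `StandardScaler` does. It uses a made-up array (`toy`, which is an assumption for illustration and not part of `gapminder`): each column is centred at 0 and rescaled to unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
toy = np.array([[50.0, 1_000.0],
                [60.0, 5_000.0],
                [70.0, 9_000.0],
                [80.0, 13_000.0]])

# StandardScaler subtracts each column's mean and divides by its
# (population) standard deviation.
scaled = StandardScaler().fit_transform(toy)

# The same transformation written out by hand:
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))   # do the two approaches agree?
print(scaled.mean(axis=0).round(6))  # column means after scaling
print(scaled.std(axis=0).round(6))   # column standard deviations after scaling
```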

## Train-Test Split, Cross-Validation

Next, we'll split our sample into two disjoint sets: a **training set** featuring 85% of our observations, and a **testing set** (or *hold-out sample*) comprising the remaining 15%, which will not be involved in the training or validation process.

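Then, we'll initialize a KNN classifier, fit it to the training set, and use (stratified) $k$-fold cross-validation to estimate how well it generalizes. Note that the $k$ in $k$-fold refers to the number of folds, not the number of neighbours.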
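
To illustrate what stratification does, here is a small sketch with made-up labels (`X_toy` and `y_toy` are assumptions for illustration, with 20% of observations in the positive class). It checks that the class share is roughly preserved in the training set, the testing set, and each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical imbalanced labels: 20 positives, 80 negatives.
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

# Stratified train-test split (same settings as in this notebook).
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy,
                                          test_size=0.15, random_state=905)

print("Share of positives (train):", round(y_tr.mean(), 3))
print("Share of positives (test): ", round(y_te.mean(), 3))

# Stratified 5-fold cross-validation on the training set.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=905)

# Share of positives in each validation fold.
fold_shares = [y_tr[val_idx].mean() for _, val_idx in skf.split(X_tr, y_tr)]

print("Share of positives per fold:", np.round(fold_shares, 3))
```

Without `stratify` (or with a plain `KFold`), small samples can easily produce folds in which one class is badly under-represented.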

In [None]:
# Perform train-test split:

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y,
                                                    test_size = 0.15,
                                                    random_state = 905)

# Initializing KNN classifier with k = 5, fitting model:

knn = KNeighborsClassifier(n_neighbors = 5)

knn.fit(X_train, y_train)

# Stratified k-fold cross-validation:

skfold = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 905)

# Cross-validation score:

cross_val_score(knn, X_train, y_train, cv = skfold).mean()

# Out-of-sample accuracy on the held-out test set:

knn.score(X_test, y_test)
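
To make the role of `n_neighbors` concrete, the sketch below hand-rolls a KNN prediction: compute the Euclidean distance from a new point to every training point, keep the $k$ closest, and take a majority vote. The helper `knn_predict_one` and the toy data are hypothetical and exist only for illustration; this is a simplified picture, not how `scikit-learn` implements the search internally.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def knn_predict_one(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest points."""
    # Euclidean distance from x_new to every training observation.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbours' labels.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
                  [5, 5], [5, 6], [6, 5], [6, 6], [5.5, 5.5]])
y_toy = np.array([0] * 5 + [1] * 5)

x_new = np.array([4.5, 4.5])

manual_pred = knn_predict_one(X_toy, y_toy, x_new, k=3)

sk_pred = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy).predict(
    x_new.reshape(1, -1))[0]

print(manual_pred, sk_pred)
```

Using an odd $k$ with a binary outcome avoids tied votes.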

## Hyperparameter Optimization

Next, we'll use `GridSearchCV` to select the optimal value of $k$ by searching over a grid of candidate hyperparameter values (the odd numbers from 1 to 13), scoring each candidate via cross-validation on the training set.

In [None]:
# Creating a grid of potential hyperparameter values (odd numbers from 1 to 13):

k_grid = {'n_neighbors': np.arange(start = 1, stop = 14, step = 2) }

# Setting up a grid search to home-in on best value of k:

grid = GridSearchCV(KNeighborsClassifier(), param_grid = k_grid, cv = skfold)

grid.fit(X_train, y_train)

# Extract best score and hyperparameter value:

print("Best Mean Cross-Validation Score: {:.3f}".format(grid.best_score_))

print("Best Parameters (Value of k): {}".format(grid.best_params_))

print("Test Set Score: {:.3f}".format(grid.score(X_test, y_test)))

# pd.DataFrame(grid.cv_results_)

Best Mean Cross-Validation Score: 0.817
Best Parameters (Value of k): {'n_neighbors': 3}
Test Set Score: 0.773
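
One caveat about the workflow above: the scaler was fit on the full dataset before the train-test split, so the held-out observations had some influence on the scaling. A common alternative is to wrap the scaler and the classifier in a `Pipeline`, so that the scaler is re-fit on the training portion of each fold. The sketch below illustrates this on synthetic data from `make_classification` (an assumption, so that the example runs on its own). The names `pipe` and `grid_demo` and the `knn__n_neighbors` key follow from the step labels chosen here, not from the original cells.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not gapminder).
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     random_state=905)

# Split the raw (unscaled) features.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          test_size=0.15, random_state=905)

# Scaling happens inside the pipeline, so it is learned from training data only.
pipe = Pipeline([("scaler", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"knn__n_neighbors": np.arange(1, 14, 2)}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=905)

grid_demo = GridSearchCV(pipe, param_grid=param_grid, cv=cv)
grid_demo.fit(X_tr, y_tr)

best_k = grid_demo.best_params_["knn__n_neighbors"]
test_acc = grid_demo.score(X_te, y_te)

print("Best k:", best_k)
print("Test set accuracy: {:.3f}".format(test_acc))
```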


## Storing Model

Finally, we'll refit our preferred model ($k = 3$) and store it for future use.

In [None]:
# Fitting our model of choice:

knn_3 = KNeighborsClassifier(n_neighbors = 3)

knn_classifier = knn_3.fit(X_train, y_train)

# Saving model in Google Drive folder:

dump(knn_classifier, '/drive/My Drive/Colab/knn_classifier.joblib')

# Using it in the future:

# loaded_knn = load('/drive/My Drive/Colab/knn_classifier.joblib')

# loaded_knn.score(X_test, y_test)
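
For readers who are not working in Colab, here is a minimal sketch of the same save-and-reload round trip using a temporary directory instead of Google Drive. The toy data and the file name `knn_toy.joblib` are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data.
rng = np.random.default_rng(905)
X_toy = rng.normal(size=(30, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

model = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# Save the fitted model to a temporary folder, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "knn_toy.joblib")
    dump(model, path)
    restored = load(path)

# The restored model should reproduce the original model's predictions.
same = np.array_equal(model.predict(X_toy), restored.predict(X_toy))

print(same)
```

If the stored classifier will be applied to new data, the fitted `StandardScaler` needs to be saved as well (it can be passed to `dump` in the same way), since new observations must be transformed with the scaling learned here.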

## Exercises

1. Import the `penguins` data frame from the [`{palmerpenguins}`](https://allisonhorst.github.io/palmerpenguins/) package into Python.

2. Isolate observations from the latest `year` in `penguins`.

3. Develop a $k$-nearest neighbours **regressor** to predict a numeric outcome of interest. Report your algorithm’s cross-validation score and out-of-sample performance.

4. Develop a $k$-nearest neighbours **classifier** to predict a categorical outcome of interest. Report your algorithm’s cross-validation score and out-of-sample performance.
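
Since the walkthrough above only covers classification, here is a minimal starting point for the regression exercise. It uses synthetic data (an assumption; swap in the `penguins` features and outcome of your choice) and `KNeighborsRegressor`, whose `score` method reports $R^2$ rather than accuracy. A plain `KFold` is used because stratification applies to class labels, not to a numeric outcome.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: a numeric outcome driven by two features plus noise.
rng = np.random.default_rng(905)
X_syn = rng.normal(size=(150, 2))
y_syn = 3 * X_syn[:, 0] - 2 * X_syn[:, 1] + rng.normal(scale=0.5, size=150)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.15,
                                          random_state=905)

knn_reg = KNeighborsRegressor(n_neighbors=5)

kfold = KFold(n_splits=5, shuffle=True, random_state=905)

# Mean cross-validated R-squared on the training set.
cv_r2 = cross_val_score(knn_reg, X_tr, y_tr, cv=kfold).mean()

# Out-of-sample R-squared on the test set.
knn_reg.fit(X_tr, y_tr)
test_r2 = knn_reg.score(X_te, y_te)

print("Cross-validation R^2: {:.3f}".format(cv_r2))
print("Test set R^2: {:.3f}".format(test_r2))
```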

