<a href="https://colab.research.google.com/github/sakeefkarim/intro_quantitative_sociology/blob/main/data/week%2011/code/SOCI269_Week_11.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://collegeaim.org/wp-content/uploads/2021/09/Amherst.png" alt="Amherst Logo" width="200"/>


# A *Very* Basic Introduction to $k$-Nearest Neighbours Algorithms in `Python` <img src="https://s3.dualstack.us-east-2.amazonaws.com/pythondotorg-assets/media/community/logos/python-logo-only.png" alt="Python logo" width="30">

[Sakeef M. Karim](https://www.sakeefkarim.com/)

skarim@amherst.edu

## Preliminaries

Let's import a few essential libraries (e.g., `pandas` for data wrangling) and submodules (e.g., `sklearn.neighbors` from `scikit-learn`) to develop our $k$-nearest neighbours algorithm.

In [None]:
# For data manipulation:

import pandas as pd
import numpy as np

# From scikit-learn, we import modules to pre-process data, fit KNN classifier:

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV

from sklearn.metrics import classification_report

from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier

# To save, load our model:

from joblib import dump, load

Once again, we can programmatically mount our Google Drive folders onto a Colab session as follows:

In [None]:
from google.colab import drive
drive.mount('/drive')

## Data

To fit a supervised machine learning algorithm, we'll need some observed data. To this end, let's work with [gapminder](https://jennybc.github.io/gapminder/) once again.

In [None]:
# Loading gapminder dataset:

gapminder = pd.read_excel('https://github.com/sakeefkarim/intro_quantitative_sociology/raw/refs/heads/main/data/week%2010/data/gapminder.xlsx')

## "Pre-Processing" I

In the spirit of simplicity, we'll do some ***pre-processing*** by:

+ Isolating the latest year in `gapminder` (2007) and dropping the `year` column.

+ Generating a dummy indicator (`europe`) of whether a country is in Europe.

+ Isolating our feature vector and target variable in separate objects.

In [None]:
# Homing in on observations in the latest year (2007):

gapminder = gapminder.query('year == 2007').reset_index(drop = True).drop(columns='year')

# Generating dummy indicator indexing whether a country is in Europe:

gapminder['europe'] = pd.get_dummies(gapminder['continent'])['Europe']

# Dropping observations with missing values (not necessary for gapminder):

# gapminder.dropna(inplace = True)

# Removing target variable and categorical indicators from feature vector:

X = gapminder.drop(columns = ['europe', 'continent', 'country'])

# Isolating target variable:

y = gapminder['europe']
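To see what the dummy indicator is doing, here is a minimal sketch on a tiny hand-made data frame (`toy` and its four rows are hypothetical stand-ins, not the gapminder data). It suggests that picking the `'Europe'` column from `pd.get_dummies` is equivalent to a direct comparison against the string `'Europe'`.

```python
import pandas as pd

# Hypothetical toy data frame (illustration only, not gapminder):
toy = pd.DataFrame({'country': ['France', 'Japan', 'Kenya', 'Spain'],
                    'continent': ['Europe', 'Asia', 'Africa', 'Europe']})

# One indicator column per continent; we keep the one for Europe:
via_dummies = pd.get_dummies(toy['continent'])['Europe']

# The same indicator via a direct comparison:
via_comparison = toy['continent'] == 'Europe'

# Casting to integers sidesteps dtype differences across pandas versions:
indicators_match = (via_dummies.astype(int) == via_comparison.astype(int)).all()

print(pd.get_dummies(toy['continent']))
print('Indicators match:', indicators_match)
```

Either route yields a column that flags European countries; the comparison version simply skips building indicators for the other continents.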

## "Pre-Processing" II: A Train-Test Split

Next, we'll split our sample into two disjoint sets: a **training set** featuring 80% of our observations, and a **testing set** (or *hold-out sample*) comprising the remaining 20%, which will not be involved in the training or validation process. We'll also standardize our features, using means and standard deviations estimated from the training set alone.


In [None]:
# Perform train-test split:

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y,
                                                    test_size = 0.2,
                                                    random_state = 905)

# Standardizing feature data

scaler = StandardScaler()

# Applied to training data

X_train = scaler.fit_transform(X_train)

# Applied to test data

X_test = scaler.transform(X_test)
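As a quick sanity check on these two steps, the sketch below uses synthetic data (`X_demo` and `y_demo` are made-up names and values, not gapminder) to illustrate that stratification keeps the share of each class roughly equal across the two subsets, and that a scaler fit on the training set gives training features a mean of 0 and a standard deviation of 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 100 observations, 3 features, 30% in the positive class
rng = np.random.default_rng(905)
X_demo = rng.normal(loc = 50, scale = 10, size = (100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify = y_demo,
                                          test_size = 0.2, random_state = 905)

# Share of positive cases in each subset:
train_share = y_tr.mean()
test_share = y_te.mean()

# The scaler learns its means and standard deviations from the training set only:
demo_scaler = StandardScaler()
X_tr_std = demo_scaler.fit_transform(X_tr)

# The test set is transformed with the training-set parameters:
X_te_std = demo_scaler.transform(X_te)

train_means = X_tr_std.mean(axis = 0)
train_sds = X_tr_std.std(axis = 0)

print('Share of positives (train, test):', train_share, test_share)
print('Training means:', train_means.round(3))
print('Training SDs:', train_sds.round(3))
```

Note that the standardized *test* features will generally not have a mean of exactly 0, since the scaler never saw them during fitting.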


## Initializing KNN, Performing Cross-Validation

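Now, let's initialize our KNN classifier, fit it to the training data, and use (stratified) $k$-fold cross-validation to evaluate its performance. Note that the $k$ in $k$-fold (the number of folds) is unrelated to the $k$ in $k$-nearest neighbours (the number of neighbours).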
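Before turning to `scikit-learn`, it may help to see what a KNN classifier does under the hood. The sketch below is a bare-bones illustration rather than how `scikit-learn` implements the algorithm: the helper `knn_vote` and the toy arrays are hypothetical, and the function assumes a binary 0/1 target, Euclidean distance and an odd value of $k$.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_vote(X_known, y_known, X_new, k = 1):
    """Classify each row of X_new by majority vote among its k nearest known points."""
    predictions = []
    for point in X_new:
        # Euclidean distance from the new point to every known observation:
        distances = np.sqrt(((X_known - point) ** 2).sum(axis = 1))
        # Positions of the k closest known observations:
        nearest = np.argsort(distances)[:k]
        # Majority vote (with an odd k and two classes, there are no tied votes):
        predictions.append(int(y_known[nearest].mean() > 0.5))
    return np.array(predictions)

# Hypothetical toy data: two loosely separated clusters
rng = np.random.default_rng(905)
y_known = np.array([0] * 20 + [1] * 20)
X_known = rng.normal(size = (40, 2)) + 3 * y_known[:, None]
X_new = rng.normal(size = (10, 2)) + 1.5

manual = knn_vote(X_known, y_known, X_new, k = 3)

reference = KNeighborsClassifier(n_neighbors = 3).fit(X_known, y_known).predict(X_new)

print('Manual and scikit-learn predictions agree:', np.array_equal(manual, reference))
```

Because every prediction depends on distances, features measured on large scales can dominate the vote, which is why we standardized our features beforehand.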

In [None]:
# Initializing KNN classifier with k = 1, fitting model:

knn = KNeighborsClassifier(n_neighbors = 1)

knn.fit(X_train, y_train)

# Stratified k-fold cross-validation:

skfold = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 905)

# Cross-validation score:

print('Cross-Validation Score (Accuracy):', '{:.3f}'.format(cross_val_score(knn, X_train, y_train, cv = skfold).mean()))

# Measure of predictive performance:

print('Test Score (Accuracy):', '{:.3f}'.format(knn.score(X_test, y_test)))

# Predictions on hold-out subsample

europe_pred = knn.predict(X_test)

# More performance metrics

print(classification_report(y_test, europe_pred))
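The cross-validation score printed above is a mean over five folds. To peek inside that average, the sketch below (on hypothetical synthetic data, with made-up names like `X_toy` and `y_toy`) loops over the splits produced by `StratifiedKFold` and records the share of positive cases in each validation fold, alongside the fold-by-fold accuracy scores.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: 100 observations, 20 of them in the positive class
rng = np.random.default_rng(905)
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = rng.normal(size = (100, 2)) + 2 * y_toy[:, None]

skf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 905)

# Share of positive cases in each validation fold:
fold_shares = [y_toy[val_idx].mean() for train_idx, val_idx in skf.split(X_toy, y_toy)]

# One accuracy score per fold; cross_val_score refits a fresh copy of the model each time:
fold_scores = cross_val_score(KNeighborsClassifier(n_neighbors = 1), X_toy, y_toy, cv = skf)

print('Share of positives per fold:', fold_shares)
print('Accuracy per fold:', fold_scores.round(3))
print('Mean accuracy:', round(fold_scores.mean(), 3))
```

Stratification is especially useful when one class is rare, since an unstratified fold could end up with very few (or no) positive cases.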


## Hyperparameter Optimization

Next, we'll use the `GridSearchCV` class to select the optimal value of $k$ by searching over a grid of candidate hyperparameter values (odd numbers between 1 and 13) and comparing their mean cross-validation scores.

In [None]:
# Creating a grid of potential hyperparameter values (odd numbers from 1 to 13):

k_grid = {'n_neighbors': np.arange(start = 1, stop = 14, step = 2) }

# Setting up a grid search to home in on the best value of k:

grid = GridSearchCV(KNeighborsClassifier(), param_grid = k_grid, cv = skfold)

grid.fit(X_train, y_train)

# Extract best score and hyperparameter value:

print(f'Best Mean Cross-Validation Score: {grid.best_score_:.3f}')

# Double quotes inside the f-string avoid a syntax error on Python versions before 3.12:

print(f'Best Parameter (Value of k): {grid.best_params_["n_neighbors"]}')

print(f'Test Set Score: {grid.score(X_test, y_test):.3f}')


# pd.DataFrame(grid.cv_results_)
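One caveat worth flagging: because our features were standardized before cross-validation, each validation fold has already contributed to the scaler's means and standard deviations. A common alternative, which this notebook does not use, is to wrap the scaler and classifier in a `Pipeline` so that scaling is re-estimated within each training fold. The sketch below illustrates the idea on hypothetical synthetic data; the step names `'scaler'` and `'knn'` are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 120 observations, 3 features on very different scales
rng = np.random.default_rng(905)
y_toy = np.array([0] * 80 + [1] * 40)
X_toy = rng.normal(size = (120, 3)) + y_toy[:, None]
X_toy[:, 0] *= 1000

# Scaling and classification bundled into a single estimator:
pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# Hyperparameters of a pipeline step are addressed as <step name>__<parameter>:
pipe_grid = {'knn__n_neighbors': np.arange(start = 1, stop = 14, step = 2)}

skf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 905)

pipe_search = GridSearchCV(pipe, param_grid = pipe_grid, cv = skf)

# Raw (unscaled) features go in; the scaler is re-fit on each training fold:
pipe_search.fit(X_toy, y_toy)

best_k = int(pipe_search.best_params_['knn__n_neighbors'])

print(f'Best Mean Cross-Validation Score: {pipe_search.best_score_:.3f}')
print(f'Best Parameter (Value of k): {best_k}')
```

With a pipeline, the fitted `best_estimator_` also carries its own scaler, so it can be applied to unscaled data directly.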

## Storing Model for Future Use

Finally, we'll take the best estimator from our grid search (our $k = 9$ model) and store it for future use.

In [None]:
# Saving model of choice in Google Drive folder:

dump(grid.best_estimator_, '/drive/My Drive/Colab Notebooks/knn_classifier.joblib')

# Using it in the future:

# loaded_knn = load('/drive/My Drive/Colab Notebooks/knn_classifier.joblib')

# loaded_knn.score(X_test, y_test)
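To check that a stored model survives the round trip, here is a minimal sketch that saves a small classifier to a temporary folder (standing in for the Google Drive path above), loads it back and compares predictions. The toy data and the file name `knn_demo.joblib` are hypothetical.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data and model:
rng = np.random.default_rng(905)
y_toy = np.array([0] * 25 + [1] * 25)
X_toy = rng.normal(size = (50, 2)) + 2 * y_toy[:, None]

model = KNeighborsClassifier(n_neighbors = 3).fit(X_toy, y_toy)

# Saving to (and loading from) a temporary folder that is deleted afterwards:
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'knn_demo.joblib')
    dump(model, path)
    restored = load(path)

# The restored model should reproduce the original model's predictions:
same_predictions = np.array_equal(model.predict(X_toy), restored.predict(X_toy))

print('Restored model reproduces predictions:', same_predictions)
```

Keep in mind that our stored classifier expects standardized inputs, so applying it to new data later on would also require the fitted `scaler` (which can be saved with `dump` in the same way).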

## Exercises

1. Import the `penguins` data frame from the [`{palmerpenguins}`](https://allisonhorst.github.io/palmerpenguins/) package into `Python`.

2. Isolate observations from the latest `year` in `penguins`.

3. Develop a $k$-nearest neighbours **regressor** to predict a numeric outcome of interest. Report your algorithm's cross-validation score and out-of-sample performance.

4. Develop a $k$-nearest neighbours **classifier** to predict a categorical outcome of interest. Report your algorithm's cross-validation score and out-of-sample performance.

