# kNN example with bike frame geometries 

## Clean data on bicycle types and frame geometry to fit kNN

General reminders about kNN:

* A type of pattern recognition algorithm 
* Non-parametric 
* A type of **instance-based learning**:
    * Fitting is cheap: no parameters are estimated, and the training data is simply held in memory
    * Prediction is computationally expensive, because each query has to be compared against the stored training points
* Simple, but has been successful in several areas, e.g. handwritten digit recognition
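
To make the "no model fitting" point concrete, here is a minimal from-scratch sketch using only the standard library. The class name `TinyKNN`, the toy measurements, and the pairing of values with the `crit` and `tour` labels are all made up for illustration; this is not how scikit-learn implements kNN.

```python
import math
from collections import Counter


class TinyKNN:
    """Illustrative kNN classifier (hypothetical, not the scikit-learn one)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Fitting" only stores the training data in memory.
        self.X_ = list(X)
        self.y_ = list(y)
        return self

    def predict_one(self, x):
        # Prediction does the real work: one distance per stored point.
        distances = [(math.dist(x, xi), yi) for xi, yi in zip(self.X_, self.y_)]
        nearest = sorted(distances, key=lambda pair: pair[0])[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Made-up points: (head angle in degrees, trail in mm).
X_toy = [(73.5, 55.0), (73.0, 57.0), (74.0, 53.0),
         (71.5, 65.0), (71.0, 68.0), (72.0, 62.0)]
y_toy = ["crit", "crit", "crit", "tour", "tour", "tour"]

model = TinyKNN(k=3).fit(X_toy, y_toy)
prediction = model.predict_one((73.2, 56.0))
print(prediction)
```

The cost profile is visible in the code: `fit` is two assignments, while `predict_one` loops over every stored observation.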

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Frame-geometry measurements for bikes in the 56 cm size.
# (All observations must share one frame size; otherwise differences in scale get conflated with differences between bikes.)
bike_geo = pd.read_csv("./data/bike_geometries_56cm.csv")

In [None]:
bike_geo.columns

In [None]:
bike_geo.rename(columns=lambda x: x.strip(), inplace=True)

In [None]:
bike_geo['Steer Cat'].value_counts()

In [None]:
bike_geo.info()

In [None]:
# Create a new dataframe that includes only our modeling columns

model_cols = ['Steer Cat', 'Head Angle', 'Fork Offset', 'Seat Angle', 'Chain Stay',
              'Wheelbase', 'Top Tube', 'BB Drop', 'Trail', 'Flop']
bike_geo_model = bike_geo[model_cols]

In [None]:
bike_geo_model.shape

In [None]:
bike_geo_model = bike_geo_model.dropna(subset=['Steer Cat'])

In [None]:
bike_geo_model.shape

In [None]:
bike_geo_model.head(5)

In [None]:
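# Replace missing measurements with a numeric 0.0 (not the string '0.0'),
# and reassign instead of modifying in place to avoid chained-assignment warnings
bike_geo_model = bike_geo_model.fillna(0.0)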

In [None]:
bike_geo_model.head(6)

In [None]:
# Exclude the 'crit' and 'tour' steering categories from the modeling data
bike_geo_model = bike_geo_model.loc[(bike_geo_model['Steer Cat'] != 'crit') & (bike_geo_model['Steer Cat'] != 'tour'),:]

In [None]:
bike_geo_model.shape

In [None]:
bike_geo_model['Steer Cat'].value_counts()

# kNN classification

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split

In [None]:
# create X matrix and y vector from columns 

X = bike_geo_model.drop(columns='Steer Cat')
y = bike_geo_model['Steer Cat']
print(X.shape , y.shape)

In [None]:
X.info()

In [None]:
# Earlier testing showed that .astype() fails because 'Wheelbase' contains a single-space string (' ').
# Replace that ' ' with 0.0 so the column can be cast to numeric.

X.loc[X['Wheelbase'] == ' ', 'Wheelbase'] = 0.0
# Sanity check: this should now return no rows
X[(X['Wheelbase']==' ')]

In [None]:
X = X.astype(float)

In [None]:
X.info()
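
Cells like the blank `' '` in `Wheelbase` can also be located without trial and error. The sketch below uses `pd.to_numeric` with `errors="coerce"` on a small made-up frame; only the column names are borrowed from the notebook, and the values (including the `"n/a"` entry) are invented for illustration.

```python
import pandas as pd

# Made-up values; only the column names come from the notebook.
raw = pd.DataFrame({
    "Wheelbase": ["985", " ", "1010.5"],
    "Trail": ["57", "62", "n/a"],
})

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric = raw.apply(pd.to_numeric, errors="coerce")

# Cells that held a value in the raw data but could not be parsed as a number.
bad_cells = numeric.isna() & raw.notna()

print(raw[bad_cells.any(axis=1)])  # rows containing at least one bad cell
print(numeric.dtypes)
```

Coercing to `NaN` first keeps the decision about how to fill the gaps separate from the step of finding them.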

## kNN with train/test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# fit on train
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# predict on the held-out test set
y_pred = knn.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test,y_pred)
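
The raw array returned by `confusion_matrix` has no row or column names. In scikit-learn's convention, rows are true classes and columns are predicted classes, in sorted label order. A small sketch of labelling the output follows; the labels `a`, `b`, `c` and the two lists are made up and stand in for `y_test` and `y_pred`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels standing in for y_test and y_pred.
y_true_demo = ["a", "a", "b", "b", "b", "c"]
y_pred_demo = ["a", "b", "b", "b", "c", "c"]

labels = sorted(set(y_true_demo))
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# Rows are true classes, columns are predicted classes.
cm_df = pd.DataFrame(
    cm,
    index=[f"true {c}" for c in labels],
    columns=[f"pred {c}" for c in labels],
)
print(cm_df)
```

The diagonal counts the correct predictions, so its sum divided by the total equals the accuracy score.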

## kNN with cross-validation: tuning for value of k
* What value of k gives us the best kNN model?

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
#######################
# Student exercise
#######################

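# (1) Run a 10-fold cross-validation with k=7 for kNN (the n_neighbors parameter)

knn = ...     # TODO: instantiate KNeighborsClassifier with n_neighbors=7
scores = ...  # TODO: call cross_val_score on X and y with cv=10 and scoring='accuracy'
print(scores)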

In [None]:
# (2) Get average accuracy as an estimate of out-of-sample accuracy
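
One possible way to approach steps (1) and (2), sketched on stand-in data. The iris dataset bundled with scikit-learn replaces the bike data here so that the block runs on its own, and the names `X_demo`, `y_demo`, `knn_demo`, and `scores_demo` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: iris, not the bike geometry data.
X_demo, y_demo = load_iris(return_X_y=True)

knn_demo = KNeighborsClassifier(n_neighbors=7)

# Returns one accuracy value per fold, so 10 values for cv=10.
scores_demo = cross_val_score(knn_demo, X_demo, y_demo, cv=10, scoring="accuracy")
print(scores_demo)

# The mean across folds serves as the estimate of out-of-sample accuracy.
print(scores_demo.mean())
```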



In [None]:
# (3) Search for an optimal value of k from 1-30 (write a loop)

k_range = list(range(1, 31))
k_scores = []

for k in k_range:
    ##### write some code here
    k_scores.append(scores.mean())
    
print(k_scores)
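
A sketch of what the search loop could look like, again on the bundled iris data rather than the bike data. The names `k_values`, `mean_scores`, and `best_k` are hypothetical and chosen so they do not overwrite `k_range` or `k_scores`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: iris, not the bike geometry data.
X_demo, y_demo = load_iris(return_X_y=True)

k_values = list(range(1, 31))
mean_scores = []

for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=10, scoring="accuracy")
    mean_scores.append(fold_scores.mean())

# np.argmax returns the first maximum, so ties go to the smallest k.
best_k = k_values[int(np.argmax(mean_scores))]
print(best_k, max(mean_scores))
```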

In [None]:
# plot the value of K for kNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.figure(dpi=150)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')

In [None]:
# (4) now try with k-fold = 3 (cv parameter)

k_range = list(range(1, 100))
k_scores = []

for k in k_range:
    ## same code as above, just change cv=3
    k_scores.append(scores.mean())

plt.figure(dpi=150)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')

## kNN with cross-validation: model selection of kNN vs. logistic

We know our optimal value of k, so let's use cross-validation to see how kNN compares against logistic regression. This is model selection rather than parameter tuning.

In [None]:
# 10-fold cross-validation with the best KNN model
knn = KNeighborsClassifier(n_neighbors=2)
print(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())

In [None]:
# 10-fold cross-validation with logistic regression
from sklearn.linear_model import LogisticRegression
# multi_class='auto' is the default and the parameter is deprecated in recent scikit-learn releases, so it is omitted
logreg = LogisticRegression(solver='liblinear')
print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())

## Parameter tuning using `GridSearchCV`

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# define the parameter values that should be searched
k_range = list(range(1, 100))

In [None]:
# create a parameter grid: map the parameter names to the values that should be searched
param_grid = dict(n_neighbors=k_range)
print(param_grid)

In [None]:
# instantiate the grid
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')

In [None]:
# fit the grid with data
grid.fit(X, y);

In [None]:
# examine the best model
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
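
Beyond the single best model, a fitted `GridSearchCV` keeps the scores for every candidate in its `cv_results_` attribute. The sketch below loads them into a DataFrame, using the bundled iris data and a shorter range of k as stand-ins; `grid_demo` and `results` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: iris, not the bike geometry data.
X_demo, y_demo = load_iris(return_X_y=True)

grid_demo = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": list(range(1, 31))},
    cv=10,
    scoring="accuracy",
)
grid_demo.fit(X_demo, y_demo)

# One row per candidate value of n_neighbors.
results = pd.DataFrame(grid_demo.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score").head())
```

Looking at `std_test_score` next to the mean shows whether the top candidates are meaningfully different or within fold-to-fold noise.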

## Reducing computational expense using `RandomizedSearchCV`

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# specify "parameter distributions" rather than a "parameter grid"

k_range = list(range(1, 100))

param_dist = dict(n_neighbors=k_range, weights=['uniform','distance'])

In [None]:
# n_iter controls how many parameter combinations are sampled,
# that is, how much of the potential grid we search (default: n_iter=10)
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy',
                          n_iter=10, random_state=42)
rand.fit(X, y)
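
To see how much of the grid `n_iter=10` actually covers, the number of combinations can be counted with `ParameterGrid`. This is a small sketch; `param_dist_demo` simply mirrors the values of `param_dist`, and `n_combinations` is a hypothetical name.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as param_dist: 99 values of k times 2 weighting schemes.
param_dist_demo = {
    "n_neighbors": list(range(1, 100)),
    "weights": ["uniform", "distance"],
}

n_combinations = len(ParameterGrid(param_dist_demo))
n_iter = 10

print(n_combinations)           # size of the full grid
print(n_iter / n_combinations)  # fraction sampled by the randomized search
```

With 198 combinations in total, ten iterations evaluate roughly 5% of the grid, which is where the saving in computation comes from.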

In [None]:
# examine the best model
print(rand.best_score_)
print(rand.best_params_)

In [None]:
# run RandomizedSearchCV 20 times (with n_iter=10) and record the best score
best_scores = []
for _ in range(20):
    rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10)
    rand.fit(X, y)
    best_scores.append(round(rand.best_score_, 3))
print(best_scores)