K-nearest neighbours (KNN) is a supervised machine learning algorithm used for both classification and regression tasks. KNN takes the K data points closest to a query and makes a prediction from their labels: the most frequent label in the case of classification, or the average label in the case of regression.
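
As a minimal sketch of the two prediction rules described above, the snippet below takes the labels of the K nearest neighbours and returns either the most frequent label or the mean. The labels are made up (not taken from the abalone data), and the function names `knn_classify` and `knn_regress` are hypothetical, used only for illustration.

```python
from collections import Counter
from statistics import mean

def knn_classify(neighbor_labels):
    # classification: most frequent label among the K nearest neighbours
    return Counter(neighbor_labels).most_common(1)[0][0]

def knn_regress(neighbor_labels):
    # regression: average label among the K nearest neighbours
    return mean(neighbor_labels)

# hypothetical labels of the K = 5 nearest neighbours of some query
labels = [9, 10, 9, 11, 9]
print(knn_classify(labels))
print(knn_regress(labels))
```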


I could use scikit-learn's KNeighborsClassifier, but I am going to implement KNN from scratch.
This KNN model will predict the age (number of rings) of an abalone (a mollusk) based on its physical properties.

https://archive.ics.uci.edu/ml/datasets/abalone

https://realpython.com/knn-python/#use-knn-to-predict-the-age-of-sea-slugs


In [2]:
import numpy as np
import pandas as pd
import math
from collections import Counter


In [3]:
# load dataset
df = pd.read_csv('./data/abalone.csv')
df.columns = ["Sex", "Length", "Diameter", "Height", "Whole weight",
              "Shucked weight", "Viscera weight", "Shell weight", "Rings"]
df = df.drop("Sex", axis=1)
df.head(3)

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
1,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10


In [68]:
# hold out the last sample from df to test the KNN model with
sample = df.iloc[[-1]]
df.drop(df.tail(1).index, inplace=True)

sample_x = sample.drop('Rings', axis=1).values.squeeze()
sample_y = sample[['Rings']].values.squeeze()

print(sample_x)
print(sample_y)


[0.71   0.555  0.195  1.9485 0.9455 0.3765 0.495 ]
12


In [69]:
# split data into sets of independent and dependent variables
x = df.drop('Rings', axis=1)
y = df[['Rings']]


In [70]:
x.head(3)


Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight
0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07
1,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21
2,0.44,0.365,0.125,0.516,0.2155,0.114,0.155


In [71]:
y.head(3)


Unnamed: 0,Rings
0,7
1,9
2,10


In [72]:
# convert x and y to np arrays
x = x.values.squeeze()
y = y.values.squeeze()


In [73]:
# is there a correlation between the physical measurements of an abalone and its age?
# values in the correlation matrix closer to 1 than to 0 indicate a positive correlation between that measurement and the number of rings
# the measurements carry information about the ring count, so a KNN model is reasonable to use
c_m = df.corr()
c_m['Rings']


Length            0.557072
Diameter          0.574957
Height            0.558050
Whole weight      0.540831
Shucked weight    0.421222
Viscera weight    0.504217
Shell weight      0.628034
Rings             1.000000
Name: Rings, dtype: float64
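
By default, `df.corr()` reports the Pearson correlation coefficient. To make the numbers above easier to interpret, here is a rough, hand-written version of that coefficient applied to made-up measurements (not abalone data). The helper name `pearson` and both toy series are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of standard deviations
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# made-up measurements, not abalone data
lengths = [1.0, 2.0, 3.0, 4.0]
rings_linear = [3.0, 5.0, 7.0, 9.0]  # exactly linear in lengths
rings_noisy = [3.0, 6.0, 5.0, 9.0]   # increasing trend with some noise

print(pearson(lengths, rings_linear))
print(pearson(lengths, rings_noisy))
```

An exactly linear relationship gives a coefficient of 1, while a noisy increasing trend lands somewhere between 0 and 1, which is the range the abalone measurements fall in.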

In [81]:
# distance function - euclidean distance
# could also use numpy's linalg.norm
def get_dist(a, b):
    summed_squared_dist = 0
    for i in range(len(a)):
        summed_squared_dist += (a[i] - b[i]) ** 2
    dist = math.sqrt(summed_squared_dist)
    return dist
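
The loop above computes the Euclidean distance by hand. As a quick cross-check, the sketch below re-declares the same logic and compares it with `math.dist` from the standard library (Python 3.8+), using two made-up points rather than abalone rows. For numpy arrays, `np.linalg.norm(a - b)` should give the same value.

```python
import math

def get_dist(a, b):
    # same logic as the notebook's get_dist
    summed_squared_dist = 0
    for i in range(len(a)):
        summed_squared_dist += math.pow(a[i] - b[i], 2)
    return math.sqrt(summed_squared_dist)

# made-up points forming a 3-4-5 right triangle
a = [0.0, 3.0, 1.0]
b = [4.0, 0.0, 1.0]

print(get_dist(a, b))   # hand-written version
print(math.dist(a, b))  # standard-library equivalent
```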


In [96]:
nearest_neighbors = []

# loop through all data points
for i, xx in enumerate(x):
    # find the distance between the data point and sample_x, then store it with the data point's age (# of rings) in nearest_neighbors
    dist = get_dist(xx, sample_x)
    label = y[i]
    nearest_neighbors.append((label, dist))

# sort nearest neighbors by distance, closest first
nearest_neighbors = sorted(nearest_neighbors, key=lambda pair: pair[1])
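
Sorting every (label, distance) pair works, but only the first k entries are used afterwards. One possible alternative, sketched here on made-up pairs, is `heapq.nsmallest` from the standard library, which returns the k smallest items without sorting the full list. The pairs and the value of k are assumptions for illustration.

```python
import heapq

# hypothetical (label, distance) pairs in the same layout as nearest_neighbors
pairs = [(7, 0.90), (12, 0.10), (9, 0.45), (13, 0.05), (11, 0.20), (10, 0.60)]

k = 4
# keep only the k pairs with the smallest distance, closest first
k_nearest = heapq.nsmallest(k, pairs, key=lambda pair: pair[1])
print(k_nearest)

# same result as sorting everything and slicing
print(k_nearest == sorted(pairs, key=lambda pair: pair[1])[:k])
```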


In [110]:
# k should be chosen with an optimization method such as scikit-learn's GridSearchCV rather than fixed by hand
# such a search could also show how neighbours should be weighted by distance when making predictions
k = 4

# choose the k nearest neighbors of sample_x and get their labels
nearest_neighbors_labels = [i[0] for i in nearest_neighbors[:k]]
nearest_neighbors_labels


[13, 12, 11, 12]
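
The comment in the cell above suggests tuning k with something like GridSearchCV. As a rough, library-free sketch of the same idea, the snippet below scores a few candidate values of k with leave-one-out validation on a tiny made-up one-dimensional dataset. The helper names `predict` and `loo_accuracy` and the data are assumptions for illustration; the notebook itself does not run this search.

```python
from collections import Counter

# made-up 1-D data: (feature, label)
data = [(1.0, "a"), (1.2, "a"), (1.4, "a"), (3.0, "b"), (3.2, "b"), (3.4, "b")]

def predict(train, query, k):
    # majority label among the k training points closest to query
    nearest = sorted(train, key=lambda row: abs(row[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(rows, k):
    # leave-one-out: predict each row from all the other rows
    hits = 0
    for i, (feature, label) in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        hits += predict(train, feature, k) == label
    return hits / len(rows)

# score each candidate k and keep the first one with the highest accuracy
scores = {k: loo_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data, small k classifies every held-out point correctly, while k = 5 lets the opposite cluster outvote the correct one, which is the kind of trade-off a search over k is meant to expose.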

In [116]:
# find the mode of nearest_neighbors_labels (could also use SciPy's stats.mode)
sample_y_pred = Counter(nearest_neighbors_labels).most_common(1)[0][0]
print(f'Predicted sample age is {sample_y_pred}')
print(f'Actual sample age is {sample_y}')


Predicted sample age is 12
Actual sample age is 12
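
For the four neighbour labels shown earlier, [13, 12, 11, 12], the regression-style average described at the top gives the same answer as the mode. The sketch below computes that average and also tries a distance-weighted vote, one way of weighting neighbours by distance. The distances are invented for illustration (the notebook does not print them), and `weighted_vote` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

labels = [13, 12, 11, 12]             # neighbour labels from the notebook output
distances = [0.02, 0.04, 0.05, 0.08]  # assumed distances, NOT from the notebook

# regression-style prediction: average of the neighbour labels
print(mean(labels))

def weighted_vote(labels, distances, eps=1e-9):
    # each neighbour votes with weight 1/distance, so closer neighbours count more
    votes = defaultdict(float)
    for label, dist in zip(labels, distances):
        votes[label] += 1.0 / (dist + eps)
    return max(votes, key=votes.get)

print(weighted_vote(labels, distances))
```

With these assumed distances the single closest neighbour outweighs the two neighbours labelled 12, so the weighted vote differs from the plain mode. Whether weighting helps on the real data would need to be checked on held-out samples.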
