**KNN (K-nearest neighbors)** represents instances as points in a metric (feature) space, meaning the distance between any two members of the set is defined. To classify a test instance, KNN takes the mode of the labels of its K nearest training instances; for regression, it averages their target values. *K is often set to an odd number to avoid ties in binary classification.* Look up KNN vs. simple linear regression for more information.
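
The voting rule described above can be sketched from scratch. This is a minimal illustration, not how scikit-learn implements it: the function `knn_classify` and the toy `[height_cm, weight_kg]` data are hypothetical, and plain Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_classify(X_train, y_train, x, k=3):
    """Label x with the most common label among its k nearest training points."""
    # Euclidean distance from x to every training point
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Mode of the neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: [height_cm, weight_kg] -> label
X_toy = np.array([
    [158, 64], [170, 86], [183, 84],
    [155, 49], [163, 59], [158, 54],
], dtype=float)
y_toy = ['male', 'male', 'male', 'female', 'female', 'female']

print(knn_classify(X_toy, y_toy, np.array([160.0, 55.0]), k=3))
```

With an odd `k` and two classes, the vote can never split evenly, which is the reason for the rule of thumb above.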

#### Regression Problem: Use a person's height and sex to predict weight
* Mean Absolute Error (MAE) is the average of the absolute values of the prediction errors
* Mean Squared Error (MSE) is the average of the squares of the prediction errors

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Define training data. Male is denoted as 1 while female is 0
X_train = np.array([
    [158, 1],
    [170, 1],
    [183, 1],
    [191, 1],
    [155, 0],
    [163, 0],
    [180, 0],
    [158, 0],
    [170, 0]
])
y_train = [64, 86, 84, 80, 49, 59, 67, 54, 67]

# Define test data
X_test = np.array([
    [168, 1],
    [180, 1],
    [160, 0],
    [169, 0]
])
y_test = [65, 96, 52, 67]

K = 3
clf = KNeighborsRegressor(n_neighbors=K)
clf.fit(X_train, y_train)
predictions = clf.predict(np.array(X_test))
print('Predicted weights: %s' % predictions)
print('Actual weights: %s' % y_test)
print('Coefficient of determination: %s' % r2_score(y_test, predictions))
print('Mean absolute error: %s' % mean_absolute_error(y_test, predictions))
print('Mean squared error: %s' % mean_squared_error(y_test, predictions))

Predicted weights: [70.66666667 79.         59.         70.66666667]
Actual weights: [65, 96, 52, 67]
Coefficient of determination: 0.6290565226735438
Mean absolute error: 8.333333333333336
Mean squared error: 95.8888888888889
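
As a sanity check, the three metrics above can be recomputed by hand with NumPy. This is a sketch: `y_pred` is copied from the printed predictions, so the results only match the scikit-learn values up to rounding.

```python
import numpy as np

y_true = np.array([65, 96, 52, 67], dtype=float)
# Predictions copied from the printed output of the K=3 run
y_pred = np.array([70.66666667, 79.0, 59.0, 70.66666667])

errors = y_true - y_pred

# MAE: average of the absolute errors
mae = np.mean(np.abs(errors))
# MSE: average of the squared errors
mse = np.mean(errors ** 2)
# R^2: 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print('MAE: %.4f' % mae)
print('MSE: %.4f' % mse)
print('R^2: %.4f' % r2)
```

Because MSE squares each error, the single large miss on the second test instance (79 predicted vs. 96 actual) contributes most of the MSE.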


In [2]:
# Demonstration of feature scale affecting distances, and therefore predictions
# This shows the need for data normalisation
from scipy.spatial.distance import euclidean

# Heights in millimeters
X_train = np.array([
    [1700, 1],
    [1600, 0]
])

# Keep the query 1-D: some SciPy releases reject 2-D input to euclidean()
x_test = np.array([1640, 1])
print(round(euclidean(X_train[0, :], x_test), 4))
print(round(euclidean(X_train[1, :], x_test), 4))

# Heights in meters (the query must use the same unit as the training data)
X_train = np.array([
    [1.7, 1],
    [1.6, 0]
])
x_test = np.array([1.64, 1])
print(round(euclidean(X_train[0, :], x_test), 4))
print(round(euclidean(X_train[1, :], x_test), 4))

60.0
40.0125
0.06
1.0008


In [4]:
# scikit-learn's StandardScaler is a transformer that standardizes each feature
# (zero mean, unit variance) so that features with large values do not dominate
# the distance calculation.
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

X_train = np.array([
    [158, 1],
    [170, 1],
    [183, 1],
    [191, 1],
    [155, 0],
    [163, 0],
    [180, 0],
    [158, 0],
    [170, 0]
])
y_train = [64, 86, 84, 80, 49, 59, 67, 54, 67]

# Scale and compare the values pre- and post- scaling
X_train_scaled = ss.fit_transform(X_train)
print(X_train)
print(X_train_scaled)

X_test_scaled = ss.transform(X_test)
clf.fit(X_train_scaled, y_train)

predictions = clf.predict(X_test_scaled)
print('Predicted weights: %s' % predictions)
print('Coefficient of determination: %s' % r2_score(y_test, predictions))
print('Mean absolute error: %s' % mean_absolute_error(y_test, predictions))
print('Mean squared error: %s' % mean_squared_error(y_test, predictions))

[[158   1]
 [170   1]
 [183   1]
 [191   1]
 [155   0]
 [163   0]
 [180   0]
 [158   0]
 [170   0]]
[[-0.9908706   1.11803399]
 [ 0.01869567  1.11803399]
 [ 1.11239246  1.11803399]
 [ 1.78543664  1.11803399]
 [-1.24326216 -0.89442719]
 [-0.57021798 -0.89442719]
 [ 0.86000089 -0.89442719]
 [-0.9908706  -0.89442719]
 [ 0.01869567 -0.89442719]]
Predicted weights: [78.         83.33333333 54.         64.33333333]
Coefficient of determination: 0.6706425961745109
Mean absolute error: 7.583333333333336
Mean squared error: 85.13888888888893
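
The scaled matrix printed above can be reproduced by hand: each column has its mean subtracted and is divided by its population standard deviation (`ddof=0`, NumPy's default). The sketch below uses only NumPy; the name `X_manual` is introduced here for illustration.

```python
import numpy as np

X_train = np.array([
    [158, 1], [170, 1], [183, 1], [191, 1],
    [155, 0], [163, 0], [180, 0], [158, 0], [170, 0],
], dtype=float)

# Column-wise standardization: (x - mean) / std, with population std (ddof=0)
X_manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

# First row should match the first row of X_train_scaled printed above
print(X_manual[0])
```

After this transform, a difference of one standard deviation in height counts the same as a difference of one standard deviation in the sex indicator, so neither feature dominates the Euclidean distance.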


