# Predicting Yacht Resistance with K Nearest Neighbors

## Introduction

This notebook is a simple demonstration of how to use scikit-learn to build a K-nearest neighbor (KNN) model for regression. It uses a dataset of 308 experiments and their various attributes. The goal is to predict the residuary resistance per unit weight of displacement based upon the attributes.

## The Data

The data has been taken from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml); the raw data and further information can be found [here](https://archive.ics.uci.edu/ml/datasets/Yacht+Hydrodynamics).

The columns are as follows:

1. Longitudinal position of the center of buoyancy, adimensional.
2. Prismatic coefficient, adimensional.
3. Length-displacement ratio, adimensional.
4. Beam-draught ratio, adimensional.
5. Length-beam ratio, adimensional.
6. Froude number, adimensional.
7. Residuary resistance per unit weight of displacement, adimensional. 

Column 7 is the target variable we are looking to predict.

We import the Python libraries we need

In [1]:
import pandas as pd
import numpy as np

We read in the data we've saved, passing the column names

In [2]:
yacht = pd.read_csv("data/yacht_hydrodynamics.csv", names=["longitudinal_pos", "presmatic_coef", "length_disp", "beam-draught_rt", 
                                                           "length-beam_rt", "froude_num", "resid_resist"], sep=" ")
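The call above splits fields on a single space. If the raw file happens to contain runs of two or more spaces between some values (an assumption about the file, not something verified here), a single-space separator would produce empty fields that pandas reads as NaN. A minimal sketch of a more tolerant read, using a regular-expression separator on a small in-memory sample (the doubled space in the second row is added on purpose for illustration):

```python
import io
import pandas as pd

# Column names copied from the cell above.
columns = ["longitudinal_pos", "presmatic_coef", "length_disp", "beam-draught_rt",
           "length-beam_rt", "froude_num", "resid_resist"]

# Illustrative input only: the second row has a doubled space after the first value.
sample = ("-2.3 0.568 4.78 3.99 3.17 0.125 0.11\n"
          "-2.3  0.568 4.78 3.99 3.17 0.15 0.27\n")

# sep=r"\s+" treats any run of whitespace as a single delimiter.
demo = pd.read_csv(io.StringIO(sample), names=columns, sep=r"\s+")
print(demo.isnull().values.any())
```

Whether this applies to the saved file depends on how it was written, so it is worth inspecting a few raw lines before changing the separator.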

Let's check out the first few rows of data

In [3]:
yacht.head()

| | longitudinal_pos | presmatic_coef | length_disp | beam-draught_rt | length-beam_rt | froude_num | resid_resist |
|---|---|---|---|---|---|---|---|
| 0 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.125 | 0.11 |
| 1 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.15 | 0.27 |
| 2 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.175 | 0.47 |
| 3 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.2 | 0.78 |
| 4 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.225 | 1.18 |


We can quickly check if we have any null values in our data

In [4]:
yacht.isnull().values.any()

True

We do! Let's use the "describe" method to find them, amongst other interesting information

In [5]:
yacht.describe()

| | longitudinal_pos | presmatic_coef | length_disp | beam-draught_rt | length-beam_rt | froude_num | resid_resist |
|---|---|---|---|---|---|---|---|
| count | 308.0 | 252.0 | 308.0 | 308.0 | 308.0 | 308.0 | 308.0 |
| mean | -2.381818 | 0.563944 | 4.008182 | 4.096364 | 3.341364 | 0.824318 | 8.476461 |
| std | 1.513219 | 0.022947 | 1.643974 | 0.653655 | 0.391571 | 1.1462 | 14.052367 |
| min | -5.0 | 0.53 | 0.53 | 2.81 | 2.73 | 0.125 | 0.01 |
| 25% | -2.4 | 0.546 | 4.34 | 3.75 | 3.15 | 0.225 | 0.3675 |
| 50% | -2.3 | 0.565 | 4.78 | 3.99 | 3.17 | 0.325 | 1.79 |
| 75% | -2.3 | 0.574 | 4.78 | 4.77 | 3.53 | 0.425 | 8.0925 |
| max | 0.0 | 0.6 | 5.14 | 5.35 | 4.24 | 3.51 | 62.42 |


The column *presmatic_coef* has only 252 values out of 308, so 56 are missing. We can deal with this in a few different ways. The simplest solution is to remove the affected rows, though we lose many examples in doing so. Alternatively, we could impute the values, replacing each NaN with an average (mean or median). For the purpose of this simple notebook, we will simply remove them.

In [6]:
yacht = yacht.dropna()
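For comparison, the imputation alternative mentioned above could look roughly like this. It is a minimal sketch on a hypothetical toy frame (the values are made up for illustration, not taken from the yacht data), using the column median as the replacement:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the yacht data.
toy = pd.DataFrame({"presmatic_coef": [0.53, np.nan, 0.57, 0.60],
                    "froude_num": [0.125, 0.150, 0.175, 0.200]})

# Median of the observed (non-NaN) values in the column.
median = toy["presmatic_coef"].median()

# Replace each NaN in that column with the median; other columns are untouched.
imputed = toy.fillna({"presmatic_coef": median})
print(imputed["presmatic_coef"].tolist())
```

If imputation is used together with a train/test split, the replacement value is normally computed from the training rows only, so that no information from the test set leaks into training.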

## Train & Test Data

The purpose of splitting the data is to be able to assess the quality of a predictive model when it is used on unseen data. When training, you will try to build a model that fits to the data as closely as possible, to be able to most accurately make a prediction. However, without a test set you run the risk of overfitting - the model works very well for the data it has seen but not for new data.

The split ratio is often debated, and in practice you might split your data into three sets: train, validation and test. You would use the training data to fit the candidate models; the validation set to evaluate them whilst tweaking parameters; and the test set to get an understanding of how your final model would work in practice. Furthermore, there are techniques such as K-Fold cross validation that make the performance estimate less dependent on any single split.
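A three-way split can be sketched by calling scikit-learn's `train_test_split` twice. The data here is synthetic and the 60/20/20 proportions are an illustrative assumption, not a recommendation:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 6 features.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 6)
y_demo = rng.rand(100)

# First hold out 20% of the rows as the final test set...
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# ...then take 25% of the remainder (20% of the total) for validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_tr), len(X_val), len(X_te))
```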

For the purpose of this demonstration, we will only be randomly splitting our data into train and test sets, with an 80/20 split.

We import the required library from scikit-learn, [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [7]:
from sklearn.model_selection import train_test_split

We wish for all features to be used for training, therefore we take all columns except the target, "resid_resist"

In [8]:
X = yacht.drop(["resid_resist"], axis=1)

The column "resid_resist" is our target variable, so we set y as this column

In [9]:
y = yacht["resid_resist"]

We use the *train_test_split* function to create the appropriate train and test data for our features ("X_train" and "X_test" respectively) and target data ("y_train" and "y_test"). We are specifying our test data to be 20% of the total data. We are also providing a seed to be able to reproduce this split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

We can check the number of examples we have in each of our train and test data sets using "shape"

In [11]:
X_train.shape

(201, 6)

In [12]:
X_test.shape

(51, 6)

## Standardisation

All features are numeric, so we do not need to worry about converting categorical data with techniques such as one-hot encoding. However, KNN relies on distances between observations, so features measured on larger scales would dominate the neighbor search; we therefore standardise our data. Standardisation rescales each attribute so it has a mean of 0 and a standard deviation of 1. It does not strictly require the distribution to be Gaussian, though it tends to work better if it is. Alternatively, normalisation can be used to rescale each attribute to the range 0 to 1.
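Standardisation computes $z = (x - \mu) / \sigma$ for each column, where $\mu$ and $\sigma$ are that column's mean and standard deviation. A small sketch on a toy column (illustrative values only) checks that doing this by hand matches `StandardScaler` with default settings:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-column data (illustrative only).
x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Manual standardisation: subtract the mean, divide by the population standard deviation.
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# StandardScaler with default parameters performs the same rescaling.
z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))
```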

We use scikit-learn's [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [13]:
from sklearn.preprocessing import StandardScaler

We create the scaler, leaving parameters as default

In [14]:
scaler = StandardScaler()

We fit the scaler on the training data and, in the same call, transform that data, storing the result in a variable named "train_scaled"

In [15]:
train_scaled = scaler.fit_transform(X_train)

We then transform our test data with the same fitted scaler

In [16]:
test_scaled = scaler.transform(X_test)

## K Nearest Neighbors

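K-Nearest Neighbor (KNN) makes a prediction for a new observation by searching for the most similar training observations and pooling their values. For regression, the prediction is the average of the target values of the K nearest neighbors (an unweighted mean with the default settings).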
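To make the idea concrete, here is a minimal from-scratch sketch of KNN regression with Euclidean distance and an unweighted mean. The helper name `knn_predict` and the toy data are hypothetical; scikit-learn's implementation is more general and more efficient:

```python
import numpy as np

def knn_predict(X_ref, y_ref, x_new, k=3):
    """Predict by averaging the targets of the k closest reference points."""
    # Euclidean distance from x_new to every reference row.
    distances = np.sqrt(((X_ref - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    return float(y_ref[nearest].mean())

# Toy one-feature data (illustrative only).
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 10.0])

# The three nearest points to 1.2 are 1.0, 2.0 and 0.0, so the prediction is their mean.
print(knn_predict(X_toy, y_toy, np.array([1.2]), k=3))
```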

We are using scikit-learn's [K Neighbors Regressor](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)

In [17]:
from sklearn.neighbors import KNeighborsRegressor

We create a KNN model

In [18]:
model = KNeighborsRegressor()

We train it with our scaled training data and target values

In [19]:
model.fit(train_scaled, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform')

## Model Evaluation

We wish to understand how good our model is; there are a few different metrics we can use. We will evaluate mean squared error (MSE) and mean absolute error (MAE)

We import [scikit-learn's mean squared error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error) and [scikit-learn's mean absolute error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error)

In [20]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

We calculate the errors for our training data

In [21]:
mse = mean_squared_error(y_train, model.predict(train_scaled))
mae = mean_absolute_error(y_train, model.predict(train_scaled))

In [22]:
from math import sqrt

In [23]:
print("mse = ",mse," & mae = ",mae," & rmse = ", sqrt(mse))

mse =  83.07649245771145  & mae =  4.645452736318409  & rmse =  9.114630681366714


The easier metric to understand is the mean absolute error: on average, our prediction was about 4.6 away from the true value. Mean squared error, and consequently root mean squared error (RMSE), squares each error before averaging, so predictions further from the true value are punished disproportionately more.
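A small worked example may help show the difference between the metrics. The two sets of predictions below are made up for illustration: both have the same mean absolute error, but the one with a single large miss has a much higher MSE and RMSE.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
pred_even = np.array([2.0, 3.0, 4.0, 5.0])   # every prediction is off by 1
pred_spike = np.array([1.0, 2.0, 3.0, 8.0])  # one prediction is off by 4

def report(y, y_hat):
    """Return (MAE, MSE, RMSE) as plain floats."""
    err = y - y_hat
    mae = float(np.abs(err).mean())
    mse = float((err ** 2).mean())
    return mae, mse, float(np.sqrt(mse))

print(report(y_true, pred_even))   # MAE 1.0, MSE 1.0, RMSE 1.0
print(report(y_true, pred_spike))  # MAE 1.0, MSE 4.0, RMSE 2.0
```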

We can calculate the same metrics on the test data to understand how well the model generalises.

In [24]:
test_mse = mean_squared_error(y_test, model.predict(test_scaled))
test_mae = mean_absolute_error(y_test, model.predict(test_scaled))
print("mse = ",test_mse," & mae = ",test_mae," & rmse = ", sqrt(test_mse))

mse =  30.68350007843137  & mae =  3.08164705882353  & rmse =  5.539268911908084


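We are actually seeing better results on our test data! This is unusual, and with only 51 test examples it may simply reflect this particular random split rather than better performance on unseen data in general.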
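One way to see how much an error estimate depends on the split is the K-Fold cross validation mentioned earlier. The sketch below uses synthetic data from `make_regression` as a stand-in (an assumption for self-containment; the notebook's own `X` and `y` could be substituted) and reports the mean absolute error across five folds:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 6 features.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# The pipeline refits the scaler on the training part of each fold.
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

folds = KFold(n_splits=5, shuffle=True, random_state=123)

# scikit-learn returns negated errors so that higher scores are always better.
scores = cross_val_score(pipeline, X_demo, y_demo, cv=folds,
                         scoring="neg_mean_absolute_error")

fold_mae = -scores
print(fold_mae.mean(), fold_mae.std())
```

A large spread across folds suggests that a single train/test comparison should be read with caution.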

## K Nearest Neighbors Parameters

More information on Nearest Neighbors can be found in the scikit-learn documentation [here](http://scikit-learn.org/stable/modules/neighbors.html)

There are a number of parameters that should be explored when trying to improve a K Nearest Neighbor model. A common approach is to try many different parameter values, building multiple models and comparing their errors to find the best combination.

### Parameters
For KNN, the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) provides parameters that can be passed by the user; changing these is likely to have an impact on the performance of the model.

Here is high-level information on the parameters; the documentation has more details:
- n_neighbors : default = 5
    - Number of neighbors to use by default for kneighbors queries.

- weights : default='uniform'
    - weight function used in prediction

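- algorithm : default='auto'
    - Algorithm used to compute the nearest neighbors: 'ball_tree', 'kd_tree', 'brute', or 'auto' (which attempts to choose the most appropriate algorithm based on the data passed to fit).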

- leaf_size : default = 30
    - Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

- p : default = 2
    - Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

- metric : default='minkowski'
    - The distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of the DistanceMetric class for a list of available metrics.

- metric_params : default = None
    - Additional keyword arguments for the metric function.

- n_jobs : default = 1
    - The number of parallel jobs to run for neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Doesn’t affect fit method.
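As a small worked example of the `p` parameter, the sketch below computes the Minkowski distance between two points by hand. The helper `minkowski` is written here for illustration and is not a scikit-learn function:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(minkowski(a, b, p=1))  # Manhattan distance: 3 + 4
print(minkowski(a, b, p=2))  # Euclidean distance: sqrt(9 + 16)
```

Because the choice of `p` changes which training points count as nearest, it can change the predictions as well as the distances.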


### Grid Search

To search for the best hyper-parameters for your algorithm and data, grid search cross validation is commonly used. The [scikit-learn documentation](http://scikit-learn.org/stable/modules/grid_search.html) provides more thorough information on how to use this. 
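A minimal sketch of a grid search over a few KNN parameters is shown below. It uses synthetic data as a stand-in, and the candidate values in `param_grid` are arbitrary choices for illustration rather than recommended settings:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 6 features.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipeline = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsRegressor())])

# Parameter names are the pipeline step name, a double underscore, then the parameter.
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated MAE of the best combination
```

The selected combination should still be checked once on a held-out test set, since the cross-validated score was used to choose it.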

#### Data Citation

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 