# k Nearest Neighbours and cross-validation

In [1]:
import math
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn.metrics as metrics
from sklearn.model_selection import ParameterGrid
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('house-prices-train.csv')
data.SalePrice = np.log1p(data.SalePrice)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

## data cleaning (copy/pasted from the previous tutorial)

In [3]:
from sklearn.preprocessing import LabelEncoder
def encode_categories(df, mappers, dummies=False):
    le = LabelEncoder()
    for col in df.select_dtypes('object').columns:
        if col not in mappers and df[col].nunique() < 30:
            df[col] = df[col].fillna('NaN')
            df[col] = le.fit_transform(df[col])
            if dummies:
                prefix = 'd_' + col
                df = pd.concat([df.drop(columns=[col]), pd.get_dummies(df[col], prefix=prefix)], axis=1)
        elif col in mappers:
            df[col] = df[col].replace(mappers[col])
    return df

In [4]:
data = pd.read_csv('house-prices-train.csv')
data.SalePrice = np.log1p(data.SalePrice)
ordinal_cols_mappers = {
    'KitchenQual': {'Po' : 0, 'Fa' : 1, 'TA' : 2, 'Gd' : 3, 'Ex' : 4}
}
data = encode_categories(data, ordinal_cols_mappers, True)
data.shape

(1460, 303)

  * kNN works with distances between data points, so nominal features are troublesome: their values have no natural distance.
  * To overcome this, one can adopt one of these strategies:
    * Drop nominal features (and possibly keep the ordinal ones if measuring a distance between their values is meaningful).
    * Replace nominal features with dummies using one-hot encoding.
    * Use some [more sophisticated metrics](https://www.researchgate.net/publication/220907006_Similarity_Measures_for_Categorical_Data_A_Comparative_Evaluation) capable of measuring the similarity of nominal features.
  * We will try the first two approaches.
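
To make the difference between integer codes and dummies concrete, here is a minimal sketch on a made-up three-row column (the values are illustrative, not taken from the dataset). It compares the pairwise distances produced by the two encodings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the category values are made up for illustration.
toy = pd.DataFrame({'RoofStyle': ['Flat', 'Gable', 'Hip']})

# Integer codes impose an arbitrary order on the categories.
codes = toy['RoofStyle'].astype('category').cat.codes.to_numpy(dtype=float)
dist_codes = np.abs(codes[:, None] - codes[None, :])

# One-hot encoding puts every pair of distinct categories at the same distance.
onehot = pd.get_dummies(toy['RoofStyle']).to_numpy(dtype=float)
dist_onehot = np.linalg.norm(onehot[:, None, :] - onehot[None, :, :], axis=2)

print(dist_codes)    # pairwise distances between integer codes
print(dist_onehot)   # pairwise distances between one-hot rows
```

With integer codes, 'Flat' appears twice as far from 'Hip' as from 'Gable', which is an artefact of alphabetical ordering. With dummies, all distinct categories are equally far apart.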

## First attempt

In [5]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5)

In [6]:
knn.fit(data.drop(columns=['SalePrice']), data.SalePrice)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

  * There is a problem with missing values of numeric features.

In [7]:
data.loc[:,data.isnull().sum() > 0].isnull().sum()

LotFrontage    259
MasVnrArea       8
GarageYrBlt     81
dtype: int64

What we can do:
  * Drop the data points with missing values. But we do not have enough data for this.
  * We can replace the missing values with the respective column means. But it is too simple, and we have some dignity!
  * We can predict the missing values from the rest of the data! That's it! We will use the kNN algorithm.

### Task: predict the missing values using kNN

The idea is this (assume we want to fill missing values in the `LotFrontage` column):
  * Split the dataset into two parts:
    * `D1` = the rows with missing values in the `LotFrontage` column,
    * `D2` = the rest of the data.
  * Save the column `D2.LotFrontage` to `Y` and the remaining columns to `X` (exclude some columns if needed). Save the same columns of `D1` to `X2`.
  * Fit a model (we use the kNN) to predict `Y` using `X`.
  * Use this model to predict the missing values of `LotFrontage` using the `X2` data.

In [8]:
def replace_nans(df, cols_nan, params):
    ### your code goes here
    # Never use the columns with NaNs (or the Id) as predictors.
    drop_cols = list(cols_nan) + ['Id']
    for col in cols_nan:
        is_missing = df[col].isnull()
        missing = df[is_missing]   # D1: rows where `col` is missing
        filled = df[~is_missing]   # D2: rows where `col` is known
        
        filledX = filled.drop(columns=drop_cols)    # X
        missingX = missing.drop(columns=drop_cols)  # X2
        filledY = filled[col]                       # Y
        
        model = KNeighborsRegressor(**params)
        model.fit(filledX, filledY)
        
        # Write the predictions back with a single .loc call.
        df.loc[is_missing, col] = model.predict(missingX)
    
        # Not like this (chained indexing assigns to a temporary copy):
        # df[df[col].isnull()][col] = model.predict(missingX)
    ###
    return df

Let us check that we have some meaningful results:

In [9]:
df = data.copy()
cols_nan = df.loc[:,data.isnull().sum() > 0].columns
params = {
        'n_neighbors': 5
}
dataNoNan = replace_nans(df, cols_nan, params)
display(data[cols_nan].describe())
display(dataNoNan[cols_nan].describe())

| | LotFrontage | MasVnrArea | GarageYrBlt |
|---|---|---|---|
| count | 1201.0 | 1452.0 | 1379.0 |
| mean | 70.049958 | 103.685262 | 1978.506164 |
| std | 24.284752 | 181.066207 | 24.689725 |
| min | 21.0 | 0.0 | 1900.0 |
| 25% | 59.0 | 0.0 | 1961.0 |
| 50% | 69.0 | 0.0 | 1980.0 |
| 75% | 80.0 | 166.0 | 2002.0 |
| max | 313.0 | 1600.0 | 2010.0 |


| | LotFrontage | MasVnrArea | GarageYrBlt |
|---|---|---|---|
| count | 1460.0 | 1460.0 | 1460.0 |
| mean | 70.970137 | 104.274795 | 1977.566986 |
| std | 23.74526 | 181.315123 | 24.575473 |
| min | 21.0 | 0.0 | 1900.0 |
| 25% | 60.0 | 0.0 | 1960.0 |
| 50% | 70.0 | 0.0 | 1978.0 |
| 75% | 81.6 | 166.0 | 2001.0 |
| max | 313.0 | 1600.0 | 2010.0 |
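
For comparison, `sklearn` also ships `KNNImputer`. It is related to, but not the same as, the hand-written `replace_nans`: it measures distances with a NaN-aware Euclidean metric over all columns and fills each gap with the mean of the neighbours' values. The sketch below runs it on made-up numbers; only the column names mirror the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up toy data; the values are assumptions chosen for illustration.
toy = pd.DataFrame({
    'LotArea':     [8000.0, 8200.0, 12000.0, 12500.0, 8100.0],
    'LotFrontage': [60.0,   62.0,   90.0,    95.0,    np.nan],
})

# Fill each missing value with the mean over the 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

The last row is closest to the two rows with `LotArea` 8000 and 8200, so its `LotFrontage` is filled with the mean of their values. Note that `KNNImputer` does not scale the features itself.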


## Cross-validation and hyperparameter tuning

  * Assume we want to search over a grid of kNN hyperparameter values (`n_neighbors`, `p`, `weights`).
  * Besides this, we also want to see the effect of different strategies for
    * dealing with nominal features (ignoring them vs using dummies),
    * normalising the data (no normalisation vs normalisation).
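
Why normalisation can matter for kNN is easiest to see on a tiny example. The sketch below uses four made-up points with one large-scale feature and one 0/1 feature (the numbers are assumptions, not taken from the dataset) and compares Euclidean distances before and after min-max scaling.

```python
import numpy as np

# Made-up points: column 0 mimics an area in square feet, column 1 a 0/1 indicator.
X = np.array([[8000.0, 0.0],
              [8100.0, 1.0],
              [8600.0, 0.0],
              [12000.0, 1.0]])

def pairwise_euclidean(A):
    """Return the matrix of Euclidean distances between all rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

raw = pairwise_euclidean(X)

# Min-max scaling maps every column to [0, 1].
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled = pairwise_euclidean(X_scaled)

print(raw.round(2))
print(scaled.round(2))
```

On the raw data the large-scale column dominates, so the nearest neighbour of the first point is the second one. After scaling, the indicator column carries comparable weight and the third point becomes the nearest.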

In [10]:
data = df.copy()
data = data.drop(columns=['Id'])

### Task: implement cross validation

In [11]:
def cross_val(X, # design matrix
              y, # vector of the target variable
              folds, # number of cross-validation folds
              model, # model to evaluate
              dummies = False # whether to keep the dummy features (one-hot encoding)
             ):
    averageRMSLE = 0
    np.random.seed(seed=654) # this must be here, explain WHY!
    ### Your code goes here
    
    if not dummies:
        # drop columns with at most two distinct values, which removes the dummies
        X = X.loc[:, X.nunique() > 2]
  
    # assign every row to a random fold
    fold_idx = np.random.randint(folds, size=X.shape[0])
    
    for fold in range(folds):
        # split the data
        Xtrain = X[fold_idx != fold]
        Xval = X[fold_idx == fold]
        ytrain = y[fold_idx != fold]
        yval = y[fold_idx == fold]
        
        # fit and evaluate the model
        model.fit(Xtrain, ytrain)
        ypred = model.predict(Xval)
        # y is log1p(SalePrice), so the RMSE of y is the RMSLE of the price
        averageRMSLE += math.sqrt(metrics.mean_squared_error(yval, ypred))
    
    ###
    return averageRMSLE / folds
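
A small sketch of what the fixed seed does. `cross_val` is called once per hyperparameter combination, so re-seeding at the start of every call means each combination is scored on exactly the same folds. The helper `assign_folds` below is a hypothetical extraction of the fold-assignment step, written only for this illustration.

```python
import numpy as np

def assign_folds(n_rows, folds, seed):
    # Same idea as in cross_val: draw a random fold label for every row.
    np.random.seed(seed)
    return np.random.randint(folds, size=n_rows)

a = assign_folds(10, 3, seed=654)
b = assign_folds(10, 3, seed=654)

print(a)
print((a == b).all())               # the same seed reproduces the same split
print(np.bincount(a, minlength=3))  # fold sizes
```

Because the labels are drawn independently, the fold sizes are only roughly equal. This is a property of the approach, not a bug.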

### Task: try kNN with and without normalisation/dummies

In [None]:
from sklearn.model_selection import ParameterGrid, train_test_split
param_grid = {
    'n_neighbors' : range(1,20),
    'p': range(1,6),
    'weights': ['uniform', 'distance']
}
dummies = True
param_comb = ParameterGrid(param_grid)
Xtrain, Xtest, ytrain, ytest = train_test_split(data.drop(columns=['SalePrice']), 
                                                data.SalePrice, 
                                                test_size=0.25, 
                                                random_state=6548)
### your code doing normalisation goes here:

# drop columns that are constant on the training set (they would cause division by zero)
train_min = Xtrain.min(axis = 0)
train_max = Xtrain.max(axis = 0)

one_val_cols = Xtrain.loc[:, train_max - train_min == 0].columns
Xtrain = Xtrain.drop(columns = one_val_cols)
Xtest = Xtest.drop(columns = one_val_cols)

# min-max scaling using statistics from the training set only
train_min = Xtrain.min(axis = 0)
train_max = Xtrain.max(axis = 0)

Xtrain = (Xtrain - train_min) / (train_max - train_min) 
Xtest = (Xtest - train_min) / (train_max - train_min) 

###
crossval_err = []
for params in param_comb:
    kNN = KNeighborsRegressor(**params)
    averageRMSLE = cross_val(Xtrain.copy(), ytrain, 12, kNN, dummies)
    crossval_err.append(averageRMSLE)
crossval_err

In [None]:
%%time
best_params = param_comb[np.argmin(crossval_err)]
kNN = KNeighborsRegressor(**best_params)
if not dummies:
    Xtrain = Xtrain.loc[:, Xtrain.nunique() > 2]
    Xtest = Xtest.loc[:, Xtrain.columns]
Xtest.fillna(0, inplace=True)
print(Xtrain.shape, Xtest.shape)

kNN.fit(Xtrain, ytrain)
ypred = kNN.predict(Xtest)
best_RMSLE = math.sqrt(metrics.mean_squared_error(ytest, ypred))
print('RMSLE (test): {0:.6f}'.format(best_RMSLE))
print('best parameters:', best_params)

Of course, `sklearn` provides ready-made tools for cross-validation and normalisation:
  * [MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)
  * [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)
  * [cross_validate](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)
  * [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
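
A minimal sketch of how these tools could be combined, run on synthetic data rather than the housing dataset (the data, the reduced grid and the step names `scale` and `knn` are assumptions made for this example). Putting `MinMaxScaler` inside a `Pipeline` means it is refitted on the training part of every fold.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the housing data (an assumption made for this sketch).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 5))
y = X[:, 0] * 0.1 + rng.normal(0, 1, size=200)

# Scaling and the regressor are chained so both are fitted per fold.
pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('knn', KNeighborsRegressor()),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
grid = {
    'knn__n_neighbors': [1, 5, 10],
    'knn__weights': ['uniform', 'distance'],
}

search = GridSearchCV(pipe, grid,
                      scoring='neg_root_mean_squared_error',
                      cv=KFold(n_splits=5, shuffle=True, random_state=654))
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)   # mean RMSE over the folds
```

The scorer is negated because `GridSearchCV` always maximises its score. Applied to the log-transformed `SalePrice`, the RMSE would correspond to the RMSLE used above.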

### Curse of dimensionality

  * Normalised data all lie in the $n$-dimensional unit cube (sides of length one).
  * The diagonal of this cube has length $\sqrt{n}$.
  * The curse of dimensionality says that the higher the dimension, the further away the nearest neighbours get.
  * To measure this effect, we will increase the dimension and observe the ratio of the mean distance to the nearest neighbours to the length of the diagonal.

**Try to experiment with the `n_neighbors` parameter!** What is the influence of the number of neighbours on the mean distance?

In [None]:
# Xtrain and Xtest should be normalized here
mean_dist_ratio = []
for k in range(1,30):
    kNN = KNeighborsRegressor(n_neighbors=150, p=2)
    kNN.fit(Xtrain.iloc[:,0:k], ytrain)
    dist, nn = kNN.kneighbors(Xtest.iloc[:,0:k])
    mean_dist_ratio.append(np.mean(dist)/math.sqrt(k))

In [None]:
plt.figure(figsize=(12,5))
plt.xlabel('dimensions')
plt.ylabel('mean distance / diagonal')
plt.plot(range(1,len(mean_dist_ratio)+1),mean_dist_ratio,'bo-')
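
The same effect can be reproduced without the housing data. The sketch below draws uniform random points in the unit cube (synthetic data; the sample sizes and the list of dimensions are arbitrary choices) and reports the mean distance to the single nearest neighbour relative to the diagonal $\sqrt{n}$.

```python
import math
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n_points, n_queries = 500, 100

dims = [1, 2, 5, 10, 20, 50]
ratios = []
for d in dims:
    # Uniform points in the unit cube [0, 1]^d (synthetic data).
    train = rng.uniform(size=(n_points, d))
    query = rng.uniform(size=(n_queries, d))
    nn = NearestNeighbors(n_neighbors=1).fit(train)
    dist, _ = nn.kneighbors(query)
    # Mean nearest-neighbour distance relative to the cube diagonal sqrt(d).
    ratios.append(dist.mean() / math.sqrt(d))

for d, r in zip(dims, ratios):
    print(d, round(r, 3))
```

With the number of points held fixed, the ratio is much larger in the highest dimension than in one dimension: the "nearest" neighbour ends up a sizeable fraction of the diagonal away.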