## KNN imputation

Each missing value is estimated as the average (optionally weighted by distance) of that variable across the K closest neighbours in which it is observed.

[KNNImputer from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer)

- The same K is used to impute all variables
- K cannot easily be optimised to better predict the missing values themselves
- K can be optimised to better predict the target
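To make the averaging concrete, here is a minimal sketch on a toy array (the data and the name `toy_imputer` are made up for illustration, they are not part of the house price dataset). With `n_neighbors=2` and uniform weights, each missing entry should become the plain mean of that column over the two nearest rows that have it observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# toy data: 4 rows, 3 features, two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# the same K (here 2) is used for every column
toy_imputer = KNNImputer(n_neighbors=2, weights='uniform')
X_imputed = toy_imputer.fit_transform(X)

print(X_imputed)
# row 0, column 2 -> mean of 3.0 and 5.0 = 4.0
# row 2, column 0 -> mean of 3.0 and 8.0 = 5.5
```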

**Note**

If the goal is to predict the missing values as accurately as possible, we would not use the KNN imputer. Instead, we would build an individual KNN model for each variable with missing data, predicting it from the remaining variables. This is a standard regression problem.
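The idea in the note could be sketched roughly as follows, on synthetic data. The column names `a`, `b`, `c` are hypothetical, and the sketch assumes the predictor columns are complete; it tunes K for one variable only, using the rows where that variable is observed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# synthetic data: c depends on a and b (illustrative only)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({'a': rng.normal(size=n), 'b': rng.normal(size=n)})
df['c'] = 2 * df['a'] - df['b'] + rng.normal(scale=0.1, size=n)

# remove 20 values of c to simulate missing data
missing_idx = rng.choice(n, size=20, replace=False)
df.loc[missing_idx, 'c'] = np.nan

observed = df[df['c'].notna()]
to_fill = df[df['c'].isna()]

# tune K for this variable alone, as a regression problem
search = GridSearchCV(
    KNeighborsRegressor(),
    {'n_neighbors': [3, 5, 10]},
    cv=5,
    scoring='neg_mean_squared_error',
)
search.fit(observed[['a', 'b']], observed['c'])

# fill the missing entries with the model predictions
df_filled = df.copy()
df_filled.loc[to_fill.index, 'c'] = search.predict(to_fill[['a', 'b']])

print(search.best_params_)
```

Repeating this for every variable with NA gives a separate, individually tuned K per variable, which the KNN imputer does not offer.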

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# to split the datasets
from sklearn.model_selection import train_test_split

# multivariate imputation
from sklearn.impute import KNNImputer

## Load data

In [2]:
# list with numerical variables

cols_to_use = [
    'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
    'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
    'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
    '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
    'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
    'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
    'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea',
    'WoodDeckSF',  'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
    'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold',
    'SalePrice'
]

In [3]:
# let's load the dataset with the selected variables

data = pd.read_csv('../houseprice.csv', usecols=cols_to_use)

# find variables with missing data
for var in data.columns:
    if data[var].isnull().sum() > 0:
        print(var, data[var].isnull().sum())

LotFrontage 259
MasVnrArea 8
GarageYrBlt 81


In [4]:
# let's separate into training and testing set

# first drop the target from the feature list
cols_to_use.remove('SalePrice')

X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use],
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 36), (438, 36))

In [5]:
# reset index, so we can compare values later on
# in the demo

X_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)

## KNN imputation

In [6]:
imputer = KNNImputer(
    n_neighbors=5, # the number of neighbours K
    weights='distance', # the weighting factor
    metric='nan_euclidean', # the metric to find the neighbours
    add_indicator=False, # whether to add a missing indicator
)
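The `nan_euclidean` metric measures distance using only the coordinates that are present in both rows, and scales the result up by the ratio of total to shared coordinates. A small sketch with made-up vectors, comparing sklearn's `nan_euclidean_distances` against the same calculation by hand:

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# two rows with missing coordinates (illustrative values)
a = np.array([[3.0, np.nan, np.nan, 6.0]])
b = np.array([[1.0, np.nan, 4.0, 5.0]])

d = nan_euclidean_distances(a, b)[0, 0]

# by hand: only coordinates 0 and 3 are present in both rows,
# so the squared differences are scaled by 4 / 2
manual = np.sqrt(4 / 2 * ((3 - 1) ** 2 + (6 - 5) ** 2))

print(d, manual)
```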

In [7]:
imputer.fit(X_train)

In [8]:
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)

# sklearn returns a NumPy array
# let's make a dataframe
train_t = pd.DataFrame(train_t, columns=X_train.columns)
test_t = pd.DataFrame(test_t, columns=X_test.columns)

train_t.head()

| | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60.0 | 70.115142 | 9375.0 | 7.0 | 5.0 | 1997.0 | 1998.0 | 573.0 | 739.0 | 0.0 | ... | 645.0 | 576.0 | 36.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2009.0 |
| 1 | 120.0 | 42.533053 | 2887.0 | 6.0 | 5.0 | 1996.0 | 1997.0 | 0.0 | 1003.0 | 0.0 | ... | 431.0 | 307.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 2008.0 |
| 2 | 20.0 | 50.0 | 7207.0 | 5.0 | 7.0 | 1958.0 | 2008.0 | 0.0 | 696.0 | 0.0 | ... | 0.0 | 117.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2010.0 |
| 3 | 50.0 | 60.0 | 9060.0 | 6.0 | 5.0 | 1939.0 | 1950.0 | 0.0 | 204.0 | 0.0 | ... | 280.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 2009.0 |
| 4 | 30.0 | 60.0 | 8400.0 | 2.0 | 5.0 | 1920.0 | 1950.0 | 0.0 | 290.0 | 0.0 | ... | 246.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2009.0 |


In [9]:
# variables without NA after the imputation

train_t[['LotFrontage', 'MasVnrArea', 'GarageYrBlt']].isnull().sum()

LotFrontage    0
MasVnrArea     0
GarageYrBlt    0
dtype: int64

In [10]:
# the observations with NA in the original train set

X_train[X_train['MasVnrArea'].isnull()]['MasVnrArea']

420   NaN
490   NaN
642   NaN
824   NaN
921   NaN
Name: MasVnrArea, dtype: float64

In [11]:
# the replacement values in the transformed dataset

train_t[X_train['MasVnrArea'].isnull()]['MasVnrArea']

420     99.765717
490     34.106592
642      0.000000
824    375.749332
921     85.817715
Name: MasVnrArea, dtype: float64

In [12]:
# the mean value of the variable (i.e., for mean imputation)

X_train['MasVnrArea'].mean()

103.55358898721731

In some cases, the imputed values are very different from the mean (103.55) that mean imputation would have used: here they range from 0 to about 375.7.
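Why the two approaches can disagree so much is easier to see on a tiny example. In this sketch (made-up data with two well-separated groups of rows), mean imputation fills in the global column mean, whereas KNN imputation uses only the rows closest to the incomplete one.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# two groups of rows; the second feature tracks the first
X = np.array([
    [1.0, 10.0],
    [1.1, 11.0],
    [0.9, 9.0],
    [5.0, 50.0],
    [5.1, 51.0],
    [4.9, np.nan],  # this row belongs to the second group
])

mean_filled = SimpleImputer(strategy='mean').fit_transform(X)
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

# global mean of the observed values: (10 + 11 + 9 + 50 + 51) / 5
print(mean_filled[-1, 1])
# average over the two nearest rows: (50 + 51) / 2
print(knn_filled[-1, 1])
```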

## Imputing a slice of the dataframe

We can use Feature-engine to apply the KNNImputer to a slice of the dataframe.
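Conceptually, imputing a slice means fitting the imputer on a subset of columns and writing the result back into the dataframe, leaving the other columns alone. The sketch below does this with plain pandas on a made-up dataframe (the column names are hypothetical); it is an approximation of the idea, not Feature-engine's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# toy dataframe with numerical and categorical columns
df = pd.DataFrame({
    'num_a': [1.0, 2.0, np.nan, 4.0],
    'num_b': [10.0, np.nan, 30.0, 40.0],
    'cat': ['x', 'y', None, 'x'],  # not passed to the imputer
})
num_cols = ['num_a', 'num_b']  # hypothetical slice to impute

knn = KNNImputer(n_neighbors=2)

# impute only the selected columns and write them back in place
df_t = df.copy()
df_t[num_cols] = knn.fit_transform(df[num_cols])

print(df_t)
```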

In [14]:
from feature_engine.wrappers import SklearnTransformerWrapper

In [13]:
data = pd.read_csv('../houseprice.csv')

X_train, X_test, y_train, y_test = train_test_split(
    data.drop('SalePrice', axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 80), (438, 80))

In [15]:
# set up the KNNImputer inside the SklearnTransformerWrapper

imputer = SklearnTransformerWrapper(
    transformer = KNNImputer(weights='distance'),
    variables = cols_to_use,
)

In [16]:
# fit the wrapper + KNNImputer
imputer.fit(X_train)

# transform the data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)

# feature-engine returns a dataframe
train_t.head()

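| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | 65 | 60.0 | RL | 70.115142 | 9375.0 | Pave | | Reg | Lvl | AllPub | ... | 0.0 | 0.0 | | GdPrv | | 0.0 | 2.0 | 2009.0 | WD | Normal |
| 682 | 683 | 120.0 | RL | 42.533053 | 2887.0 | Pave | | Reg | HLS | AllPub | ... | 0.0 | 0.0 | | | | 0.0 | 11.0 | 2008.0 | WD | Normal |
| 960 | 961 | 20.0 | RL | 50.0 | 7207.0 | Pave | | IR1 | Lvl | AllPub | ... | 0.0 | 0.0 | | | | 0.0 | 2.0 | 2010.0 | WD | Normal |
| 1384 | 1385 | 50.0 | RL | 60.0 | 9060.0 | Pave | | Reg | Lvl | AllPub | ... | 0.0 | 0.0 | | MnPrv | | 0.0 | 10.0 | 2009.0 | WD | Normal |
| 1100 | 1101 | 30.0 | RL | 60.0 | 8400.0 | Pave | | Reg | Bnk | AllPub | ... | 0.0 | 0.0 | | | | 0.0 | 1.0 | 2009.0 | WD | Normal |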


In [17]:
# no NA after the imputation

train_t['MasVnrArea'].isnull().sum()

0

In [18]:
# same imputation values as previously

train_t[X_train['MasVnrArea'].isnull()]['MasVnrArea']

1278     99.765717
936      34.106592
650       0.000000
234     375.749332
973      85.817715
Name: MasVnrArea, dtype: float64

## Automatically find best imputation parameters

We can optimise the parameters of the KNN imputation to better predict our outcome.

In [19]:
# import extra classes for modelling
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [20]:
# separate into train and test set

X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use],  # just the features
    data['SalePrice'],  # the target
    test_size=0.3,  # the percentage of obs in the test set
    random_state=0)  # for reproducibility

X_train.shape, X_test.shape

((1022, 36), (438, 36))

In [21]:
pipe = Pipeline(steps=[
    ('imputer', KNNImputer(
        n_neighbors=5,
        weights='distance',
        add_indicator=False)),
    
    ('scaler', StandardScaler()),
    ('regressor', Lasso(max_iter=2000)),
])

In [34]:
# now we create the grid with all the parameters that we would like to test

param_grid = {
    'imputer__n_neighbors': [3,5,10,20],
    'imputer__weights': ['uniform', 'distance'],
    'imputer__add_indicator': [True, False],
    'regressor__alpha': [10, 100, 200, 300, 500, 1000, 3000, 5000, 10000],
}

grid_search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='r2')

# cv=5 is the number of cross-validation folds
# n_jobs=-1 indicates to use all available CPUs
# scoring='r2' indicates to evaluate using the R squared

# for more details in the grid parameters visit:
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
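One of the parameters in the grid, `add_indicator`, changes the shape of the imputer's output: when it is `True`, a binary column is appended for each variable that had missing values during fit. A small sketch on toy data (made up for illustration) shows the effect:

```python
import numpy as np
from sklearn.impute import KNNImputer

# toy data: columns 0 and 2 contain missing values
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

plain = KNNImputer(n_neighbors=2, add_indicator=False).fit_transform(X)
flagged = KNNImputer(n_neighbors=2, add_indicator=True).fit_transform(X)

# the first 3 columns are the imputed data in both cases;
# with add_indicator=True, 2 extra columns flag where values were missing
print(plain.shape, flagged.shape)
print(flagged[:, 3:])
```

The downstream model therefore sees a different number of features depending on this setting, which is one reason to tune it inside the pipeline rather than separately.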

In [35]:
# and now we train over all the possible combinations 
# of the parameters above
grid_search.fit(X_train, y_train)

# and we print the best score over the train set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_train, y_train)))


best linear regression from grid search: 0.841


In [36]:
# let's check the performance over the test set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_test, y_test)))

best linear regression from grid search: 0.731


In [37]:
# and find the best parameters

grid_search.best_params_

{'imputer__add_indicator': True,
 'imputer__n_neighbors': 3,
 'imputer__weights': 'distance',
 'regressor__alpha': 1000}
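Beyond `best_params_`, it can help to look at how close the other parameter combinations came. The sketch below runs a similar but much smaller search on synthetic data (the names `toy_pipe`, `toy_grid` and `toy_search` are hypothetical, and the data is generated with `make_regression`, not the house price dataset) and tabulates the cross-validation results.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic regression data with roughly 10% of values set to NaN
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

toy_pipe = Pipeline(steps=[
    ('imputer', KNNImputer()),
    ('scaler', StandardScaler()),
    ('regressor', Lasso()),
])

# step name + double underscore + parameter name
toy_grid = {
    'imputer__n_neighbors': [3, 5],
    'imputer__weights': ['uniform', 'distance'],
    'regressor__alpha': [0.1, 1.0],
}

toy_search = GridSearchCV(toy_pipe, toy_grid, cv=3, scoring='r2')
toy_search.fit(X, y)

# one row per parameter combination, best first
results = pd.DataFrame(toy_search.cv_results_)
cols = [
    'param_imputer__n_neighbors', 'param_imputer__weights',
    'param_regressor__alpha', 'mean_test_score', 'std_test_score',
    'rank_test_score',
]
print(results[cols].sort_values('rank_test_score'))
```

If the top-ranked rows differ only slightly in `mean_test_score` relative to `std_test_score`, the choice of imputation parameters probably matters little for the target.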

## Compare with univariate imputation

In [38]:
from sklearn.impute import SimpleImputer

In [39]:
# separate into train and test set

X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use],  # just the features
    data['SalePrice'],  # the target
    test_size=0.3,  # the percentage of obs in the test set
    random_state=0)  # for reproducibility

X_train.shape, X_test.shape

((1022, 36), (438, 36))

In [40]:
pipe = Pipeline(steps=[
    # fill_value is only used when strategy='constant'
    ('imputer', SimpleImputer(strategy='mean', fill_value=-1)),
    ('scaler', StandardScaler()),
    ('regressor', Lasso(max_iter=2000)),
])

param_grid = {
    'imputer__strategy': ['mean', 'median', 'constant'],
    'imputer__add_indicator': [True, False],
    'regressor__alpha': [10, 100, 200],
}

grid_search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='r2')

# and now we train over all the possible combinations of the parameters above
grid_search.fit(X_train, y_train)

# and we print the best score over the train set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_train, y_train)))


best linear regression from grid search: 0.845



In [41]:
# and finally let's check the performance over the test set
print(("best linear regression from grid search: %.3f"
       % grid_search.score(X_test, y_test)))

best linear regression from grid search: 0.729


In [42]:
# and find the best fit parameters like this
grid_search.best_params_

{'imputer__add_indicator': False,
 'imputer__strategy': 'constant',
 'regressor__alpha': 200}

We see that imputing with an arbitrary value of -1 gives approximately the same test-set performance as KNN imputation (R squared of 0.729 versus 0.731). We might therefore not want the additional complexity of training models to impute NA before going on to predict the target we are actually interested in. Note that the two searches use different grids for `regressor__alpha`, so the comparison is only indicative.