## Day 32 Lecture 1 Assignment

In this assignment, we will learn about K nearest neighbor regression. We will use the absenteeism at work dataset loaded below and analyze the model generated for this dataset.

In [37]:
import math

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

In [2]:
absent = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Absenteeism_at_work.csv', sep=';')

In [3]:
absent.head()

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,...,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,11,26,7,3,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,4
1,36,0,7,3,1,118,13,18,50,239.554,...,1,1,1,1,0,0,98,178,31,0
2,3,23,7,4,1,179,51,18,38,239.554,...,0,1,0,1,0,0,89,170,31,2
3,7,7,7,5,1,279,5,14,39,239.554,...,0,1,2,1,1,0,68,168,24,4
4,11,23,7,5,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,2


Find which variables have the highest pairwise correlation and remove them from our dataset. Additionally, try to think of which variables are correlated by looking at the column names and remove those columns as well.

Note: When choosing between two categorical variables that are correlated, keep the one with fewer unique categories.

In [4]:
absent.corr().replace(1, np.nan).max().sort_values(ascending=False).head(10)

Weight                             0.904117
Body mass index                    0.904117
Age                                0.670979
Service time                       0.670979
Social drinker                     0.452196
Distance from Residence to Work    0.452196
Seasons                            0.407770
Month of absence                   0.407770
Pet                                0.400080
Transportation expense             0.400080
dtype: float64

In [5]:
rem_cols = ['ID', 'Weight', 'Age', 'Height', 'Seasons']
absent = absent.drop(columns=rem_cols)

Find out which columns contain categorical variables and turn those into dummy variables.

In [6]:
absent['Work load bins'] = pd.cut(absent['Work load Average/day '], 10, labels=range(1,11))
absent = absent.drop(columns='Work load Average/day ')

In [47]:
cat_cols = ['Reason for absence', 'Month of absence', 'Day of the week', 'Hit target', 'Disciplinary failure', 'Education', 'Son', 'Social drinker', 'Social smoker', 'Pet', 'Work load bins']
absent[cat_cols] = absent[cat_cols].astype(str)

Split the data into train and test with test containing 20% of the data.

In [49]:
X = absent.drop(columns='Absenteeism time in hours')
y = absent['Absenteeism time in hours']
num_cols = [c for c in X if c not in cat_cols]

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=36
)


Train a KNN regression model using k=15 and compute the MSE for the training and test subsamples.

In [50]:
preprocess = ColumnTransformer([
    ('encoding cats', OneHotEncoder(drop='first', categories='auto'), cat_cols),
    ('scale numeric', MinMaxScaler(), num_cols)
], remainder='passthrough'
)

pipeline = Pipeline(
    [
        ("preprocess", preprocess),
        ("feat_select", SelectKBest(f_regression)),
        ("knn", KNeighborsRegressor()),
    ]
)

grid = {
    "feat_select__k": range(1, X.shape[1] + 1),
    "knn__n_neighbors": range(1, 21),
    "knn__weights": ["uniform", "distance"],
}

pipeline_cv = GridSearchCV(pipeline, grid, verbose=1)
pipeline_cv.fit(X_train, y_train)

pipeline_cv.best_params_


Fitting 3 folds for each of 600 candidates, totalling 1800 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


ValueError: Found unknown categories ['2'] in column 0 during transform

In [38]:
pipeline_cv.score(X_train, y_train)

0.17710732275076746

In [39]:
pipeline_cv.score(X_test, y_test)

0.29594356939876065