# Absenteeism at work dataset

#### Linear regression is used here to estimate the number of hours a person will be absent from work, given the information available about them. We use the "Absenteeism at work" dataset obtained from the UCI repository, which can be found in the repository. The dataset describes individuals (age, education, reason for absence, etc.) and includes the target variable, absenteeism time in hours.

First, we check how many data points the dataset contains.


In [1]:
import pandas as pd
data = pd.read_csv('Absenteeism_at_work.csv', sep=',')
print(len(data))

740


Now we randomly split the data into training and test sets with an 80/20 ratio: 80% of the data is used to fit the model and the remaining 20% is held out for testing, with a pre-specified random seed. We train a linear regression model on the training data, use the trained model to estimate hours of absence for the test data, and finally report the root mean squared error (RMSE) on the test data.

In [24]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [25]:
# setting independent variables
X = data.iloc[:, :-1].values
# setting the dependent variable
Y = data.iloc[:, 20].values
# splitting the data into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state= 1234)

In [26]:
# an object of the LinearRegression class
regressor = LinearRegression()
# now fitting the regressor to the training set (training a linear regression model on the training data)
regressor.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [27]:
# predicting the hours of absence using our test set
Y_pred = regressor.predict(X_test)

In [28]:
from sklearn.metrics import mean_squared_error
from math import sqrt
# calculating the RMSE
rmse = sqrt(mean_squared_error(Y_test, Y_pred))
print(rmse)

8.88834873358575


This number indicates how far the predicted Y deviates from the actual Y, measured in hours. An RMSE of about 8.9 hours is a large deviation, which suggests the trained model is not good enough.
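RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same unit as the target. One way to judge whether a value is high is to compare it with a baseline that always predicts the mean of the training targets. The following is a minimal sketch on made-up numbers; the helper name `rmse_by_hand` and the toy lists are illustrative assumptions, not values from the dataset.

```python
from math import sqrt

def rmse_by_hand(y_true, y_pred):
    """Square root of the mean squared difference between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Toy values for illustration only (not taken from the dataset)
y_train_toy = [2, 4, 8, 3, 8]
y_test_toy = [4, 8, 2, 6]
y_model_toy = [5, 6, 3, 6]  # hypothetical model predictions

# Baseline: always predict the mean of the training targets
baseline = sum(y_train_toy) / len(y_train_toy)
y_baseline_toy = [baseline] * len(y_test_toy)

print('model RMSE   :', rmse_by_hand(y_test_toy, y_model_toy))
print('baseline RMSE:', rmse_by_hand(y_test_toy, y_baseline_toy))
```

If a model's RMSE is not clearly below the baseline's, the features add little beyond what the mean already provides.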

Now we perform 10-fold cross validation and report the RMSE obtained from each fold as well as their average.

In [29]:
from sklearn.model_selection import KFold
# folds are consecutive blocks of rows (shuffle is off by default), so no random_state is needed
kf = KFold(n_splits=10)

regressor = LinearRegression()
# RMSE obtained in each fold
fold_rmse = []
for j, (train_index, test_index) in enumerate(kf.split(X, Y), start=1):
    print ('Round ', j)
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    regressor.fit(X_train, Y_train)
    Y_pred = regressor.predict(X_test)
    rmse = sqrt(mean_squared_error(Y_test, Y_pred))
    print('RMSE:', rmse)
    fold_rmse.append(rmse)

# average RMSE over the 10 folds
average = sum(fold_rmse)/len(fold_rmse)
print(average)

Round  1
RMSE: 7.25627462941752
Round  2
RMSE: 10.343899074717738
Round  3
RMSE: 10.021177715979572
Round  4
RMSE: 10.978156066137798
Round  5
RMSE: 14.572395451600453
Round  6
RMSE: 13.642038915293057
Round  7
RMSE: 13.280048577940184
Round  8
RMSE: 9.662220360722165
Round  9
RMSE: 15.86624505271139
Round  10
RMSE: 17.18018304150529
12.280263888602516


The average RMSE over the 10 folds (12.28) is higher than the RMSE of the single train-test split (8.89), and the per-fold values range from about 7.3 to 17.2. The error estimate therefore depends strongly on which points end up in the test set, and the single random split gave an optimistic picture. Based on this analysis, it is better to use cross-validation than to simply train and test on one random split of the data.
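The spread between folds is easier to interpret once the fold layout is explicit. Without shuffling, k-fold splitting cuts the rows into consecutive blocks in file order, so each round tests on a different stretch of the file. The following is a minimal sketch of that partitioning for 740 rows and 10 folds; `contiguous_folds` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def contiguous_folds(n_samples, n_splits):
    """Return (start, stop) index ranges of consecutive test folds.

    The first n_samples % n_splits folds get one extra row, which is the
    convention scikit-learn's KFold documents for its fold sizes.
    """
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        folds.append((start, start + size))
        start += size
    return folds

for number, (start, stop) in enumerate(contiguous_folds(740, 10), start=1):
    print('Round', number, 'tests on rows', start, 'to', stop - 1)
```

If the file is ordered in some way, passing `shuffle=True` together with a `random_state` to `KFold` would mix the rows before splitting; the results above were produced without shuffling.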

#### This time, we are interested in k-nearest-neighbors regression instead of a regression on the whole dataset, and we would like to find a reasonable number of neighbors and a reasonable distance for this data. To do so, we perform 10-fold cross-validation. In each fold, we fit a "weighted" regression in the following manner: for a given test data point, we estimate its outcome from its k ∈ {1, . . . , 10} nearest neighbors, weighting each neighbor by the inverse of its distance to the test point. We use the Minkowski distance with degree p ∈ {1, . . . , 10}. For each fold, the k and p at which we obtain the lowest RMSE are reported, as well as the average RMSE across all folds.
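To make the weighting concrete, here is a minimal pure-Python sketch of an inverse-distance-weighted k-nearest-neighbors prediction with the Minkowski distance. It is meant to illustrate the idea behind `weights='distance'` (a weighted average of the neighbors' targets rather than a fitted line); the function names, the handling of exact matches, and the toy points are assumptions made for illustration.

```python
def minkowski(a, b, p):
    """Minkowski distance of degree p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(train_X, train_y, query, k, p):
    """Average the targets of the k nearest points, weighted by 1/distance."""
    nearest = sorted(
        (minkowski(x, query, p), y) for x, y in zip(train_X, train_y)
    )[:k]
    # A neighbor at distance zero would get infinite weight; in that case
    # this sketch returns the mean target of the exact matches instead.
    exact = [y for d, y in nearest if d == 0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Toy data for illustration only (not taken from the dataset)
train_X = [(0, 0), (1, 0), (0, 2), (5, 5)]
train_y = [2.0, 4.0, 8.0, 40.0]

print(knn_predict(train_X, train_y, query=(0, 1), k=3, p=1))
print(knn_predict(train_X, train_y, query=(0, 1), k=3, p=2))
```

Changing p changes which points count as close and how strongly each neighbor is weighted, which is why k and p are searched jointly.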

In [19]:
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from math import sqrt


data = pd.read_csv('Absenteeism_at_work.csv', sep=',')
# setting independent variables
X = data.iloc[:, :-1].values
# setting the dependent variable
Y = data.iloc[:, 20].values

# candidate numbers of neighbors (k) and Minkowski degrees (p)
neighbors = list(range(1,11))
degree = list(range(1,11))

# folds are consecutive blocks of rows (shuffle is off by default), so no random_state is needed
kf = KFold(n_splits=10)

# flattens a list of lists; adapted from https://stackoverflow.com/questions/8189169/nested-lists-python
def flatten(lists):
    results = []
    for numbers in lists:
        for numbers2 in numbers:
            results.append(numbers2)
    return results


for j, (train_index, test_index) in enumerate(kf.split(X, Y), start=1):
    print ('FOLD ', j)
    x_train, x_test, y_train, y_test = X[train_index], X[test_index], Y[train_index], Y[test_index]
    # ALL[a][b] holds the RMSE for k = neighbors[a] and p = degree[b] in the current fold
    ALL = []
    for k in neighbors:
        RMSE = []
        for p in degree:
            knn = KNeighborsRegressor(n_neighbors=k, weights='distance', p=p, metric='minkowski')
            knn.fit(x_train, y_train)
            y_pred = knn.predict(x_test)
            # calculating the RMSE
            rmse = sqrt(mean_squared_error(y_test, y_pred))
            RMSE.append(rmse)
        ALL.append(RMSE)

    # smallest RMSE over the whole (k, p) grid of this fold
    MIN_RMSE = min(flatten(ALL))
    print('Minimum RMSE at fold number ', j,' equals to ', MIN_RMSE)
    # zero-based positions of MIN_RMSE in ALL, i.e. (k - 1, p - 1)
    print([(k_index, row.index(MIN_RMSE))
           for k_index, row in enumerate(ALL)
           if MIN_RMSE in row], " showing number of neighbors and Minkowski degree respectively")
    print(' ')


FOLD  1
Minimum RMSE at fold number  1  equals to  7.161269704723024
[(0, 0)]  showing number of neighbors and Minkowski degree respectively
 
FOLD  2
Minimum RMSE at fold number  2  equals to  11.319345593668793
[(9, 0)]  showing number of neighbors and Minkowski degree respectively
 
FOLD  3
Minimum RMSE at fold number  3  equals to  10.332662817687131
[(9, 0)]  showing number of neighbors and Minkowski degree respectively
 
FOLD  4
Minimum RMSE at fold number  4  equals to  11.887976592506996
[(9, 1)]  showing number of neighbors and Minkowski degree respectively
 
FOLD  5
Minimum RMSE at fold number  5  equals to  9.815872419268262
[(0, 0)]  showing number of neighbors and Minkowski degree respectively
 
FOLD  6
Minimum RMSE at fold number  6  equals to  14.849271244694119
[(9, 5)]  showing number of neighbors and Minkowski degree respectively
 
FOLD  7
Minimum RMSE at fold number  7  equals to  13.83743786869479
[(3, 0)]  showing number of neighbors and Minkowski degree respectively

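In [20]:
# ALL is rebuilt in every fold, so at this point it holds the RMSE grid of the last fold only
print('AVERAGE RMSE OVER ALL (k, p) IN THE LAST FOLD: ', sum(flatten(ALL))/len(flatten(ALL)))

AVERAGE RMSE OVER ALL (k, p) IN THE LAST FOLD:  18.871772708418366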
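An average that spans every fold requires keeping each fold's grid, since `ALL` is overwritten on every iteration of the loop. One possible approach is to store the grids, average them cell by cell, and then pick the (k, p) with the lowest mean RMSE. The following is a minimal sketch on made-up 2x2 grids; the names `fold_grids`, `mean_grid`, and `best_cell` are hypothetical, and the numbers are not results from the dataset.

```python
def mean_grid(fold_grids):
    """Average several equally shaped RMSE grids cell by cell."""
    n_folds = len(fold_grids)
    rows, cols = len(fold_grids[0]), len(fold_grids[0][0])
    return [
        [sum(grid[r][c] for grid in fold_grids) / n_folds for c in range(cols)]
        for r in range(rows)
    ]

def best_cell(grid):
    """Return (row, column, value) of the smallest entry in a grid."""
    value, r, c = min(
        (value, r, c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
    )
    return r, c, value

# Toy grids for illustration only: rows stand for k = 1, 2 and columns for p = 1, 2
fold_grids = [
    [[7.0, 8.0], [6.0, 9.0]],     # hypothetical fold 1
    [[11.0, 9.0], [13.0, 11.0]],  # hypothetical fold 2
]

averaged = mean_grid(fold_grids)
row, col, value = best_cell(averaged)
print('mean RMSE grid:', averaged)
print('best k =', row + 1, ', best p =', col + 1, ', mean RMSE =', value)
```

In the notebook's loop, this would amount to appending each fold's `ALL` to a list before the next iteration starts. Note that the (k, p) with the best average need not be the best in any single fold, as the toy grids show.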
