## Absenteeism project - Machine Learning on preprocessed data
### Logistic regression to predict absenteeism

### Logistic regression is a classification method, so we will assign each person to one of a small number of classes

In [2]:
# Importing libraries
import pandas as pd
import numpy as np

In [101]:
data_preprocessed = pd.read_csv('data/absenteeism_preprocessed.csv')
data_preprocessed

Unnamed: 0,Rfa_group_1,Rfa_group_2,Rfa_group_3,Rfa_group_4,Month,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0,8
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2,3
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0,8
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0,2


### Creating targets

The approach used here is to create two classes: the median of absenteeism time in hours serves as the cut-off between them.

In [102]:
data_preprocessed['Absenteeism Time in Hours'].median()

3.0

The median in this case is 3.0, so every value at or below the median falls into the first group (class 0) and every value strictly above it falls into the second group (class 1).

In [103]:
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > 
                   data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)
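The comparison uses `>`, so rows whose value equals the median land in class 0. A minimal sketch on made-up hours (the `hours` array is hypothetical, not taken from the dataset) shows how ties are handled:

```python
import numpy as np

# Hypothetical absence hours, chosen so that two values equal the median
hours = np.array([0, 2, 3, 3, 4, 8])
median = np.median(hours)

# Strictly greater than the median -> class 1; the median itself -> class 0
toy_targets = np.where(hours > median, 1, 0)
print(median, toy_targets)
```

Because ties go to class 0, a column with many values equal to the median gives a split that is not exactly 50/50.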

In [104]:
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [105]:
data_preprocessed['Targets'] = targets
data_preprocessed

Unnamed: 0,Rfa_group_1,Rfa_group_2,Rfa_group_3,Rfa_group_4,Month,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Targets
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0,8,1
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2,3,0
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0,8,1
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0,2,0


The method used here (splitting into two groups at the median) is a "naive" one. It gives a nearly balanced split of the data (almost 50/50), which is convenient numerically. The aim of this part is to demonstrate the ML approach, so we won't look for the most appropriate split.
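The balance of a 0/1 target array can be checked directly. The sketch below uses a small hypothetical array (`toy_targets`) rather than the notebook's `targets`; the same two calls would apply to the real array:

```python
import numpy as np

# Hypothetical 0/1 targets standing in for the notebook's `targets`
toy_targets = np.array([1, 0, 0, 1, 0, 1, 0, 0, 0, 1])

class_counts = np.bincount(toy_targets)   # number of 0s and 1s
share_of_ones = toy_targets.mean()        # fraction of rows in class 1
print(class_counts, share_of_ones)
```

A share close to 0.5 means neither class dominates, so accuracy is a meaningful first metric.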

In [106]:
data_targets = data_preprocessed.drop('Absenteeism Time in Hours', axis = 1)
data_targets.head()

Unnamed: 0,Rfa_group_1,Rfa_group_2,Rfa_group_3,Rfa_group_4,Month,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Targets
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,0


### Selecting inputs

In [107]:
unscaled_inputs = data_targets.iloc[:,:-1]

### Standardizing the data

#### We shouldn't scale the dummy variables we created, so we will write a class that lets us choose which columns to scale. Scaling the data before creating the dummies would also have avoided the problem.

In [108]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator, TransformerMixin):
    """Standardize only the selected columns; leave the others unchanged."""
    
    def __init__(self, columns):
        self.scaler = StandardScaler()
        self.columns = columns
        self.mean_ = None
        self.var_ = None
    
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        # Per-column mean and variance, taken from the fitted StandardScaler
        self.mean_ = self.scaler.mean_
        self.var_ = self.scaler.var_
        return self
    
    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # Reuse X's index so that concat aligns rows even if the index is not 0..n-1
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        # Put the columns back in their original order
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]
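What the scaler does to each selected column can be sketched with plain NumPy: subtract the column mean and divide by the population standard deviation (`ddof=0`, which is what `StandardScaler` uses). The array and the column positions below are made up for illustration:

```python
import numpy as np

# Hypothetical data: two numeric columns and one dummy (0/1) column
X = np.array([[1.0, 10.0, 0.0],
              [3.0, 20.0, 1.0],
              [5.0, 30.0, 0.0]])
cols_to_scale = [0, 1]   # positions of the numeric columns

mean = X[:, cols_to_scale].mean(axis=0)
std = X[:, cols_to_scale].std(axis=0)   # ddof=0, population standard deviation

X_scaled = X.copy()
X_scaled[:, cols_to_scale] = (X[:, cols_to_scale] - mean) / std
print(X_scaled)
```

The dummy column keeps its 0/1 values, so its coefficient can still be read as "this category versus the baseline".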

In [109]:
unscaled_inputs.columns.values

array(['Rfa_group_1', 'Rfa_group_2', 'Rfa_group_3', 'Rfa_group_4',
       'Month', 'Day of the week', 'Transportation Expense',
       'Distance to Work', 'Age', 'Daily Work Load Average',
       'Body Mass Index', 'Education', 'Children', 'Pets'], dtype=object)

In [110]:
columns_to_scale = ['Month', 'Day of the week', 'Transportation Expense',
       'Distance to Work', 'Age', 'Daily Work Load Average',
       'Body Mass Index', 'Children', 'Pets']

In [111]:
scaler = CustomScaler(columns_to_scale)

In [112]:
scaler.fit(unscaled_inputs)

CustomScaler(columns=['Month', 'Day of the week', 'Transportation Expense',
                      'Distance to Work', 'Age', 'Daily Work Load Average',
                      'Body Mass Index', 'Children', 'Pets'])

In [113]:
scaled_inputs = scaler.transform(unscaled_inputs)

In [114]:
scaled_inputs

Unnamed: 0,Rfa_group_1,Rfa_group_2,Rfa_group_3,Rfa_group_4,Month,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.182726,-0.683704,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.182726,-0.683704,-1.574681,-1.141882,2.130803,-0.806331,1.002633,0,-0.019280,-0.589690
2,0,0,0,1,0.182726,-0.007725,-0.654143,1.426749,0.248310,-0.806331,1.002633,0,-0.919030,-0.589690
3,1,0,0,0,0.182726,0.668253,0.854936,-1.682647,0.405184,-0.806331,-0.643782,0,0.880469,-0.589690
4,0,0,0,1,0.182726,0.668253,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,-0.388293,-0.007725,-0.654143,-0.533522,0.562059,-0.853789,-1.114186,1,0.880469,-0.589690
696,1,0,0,0,-0.388293,-0.007725,0.040034,-0.263140,-1.320435,-0.853789,-0.643782,0,-0.019280,1.126663
697,1,0,0,0,-0.388293,0.668253,1.624567,-0.939096,-1.320435,-0.853789,-0.408580,1,-0.919030,-0.589690
698,0,0,0,1,-0.388293,0.668253,0.190942,-0.939096,-0.692937,-0.853789,-0.408580,1,-0.919030,-0.589690


In [115]:
scaled_inputs.shape

(700, 14)

### Splitting data

In [116]:
from sklearn.model_selection import train_test_split

In [117]:
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, test_size = 0.2, random_state = 20)

In [118]:
x_train.shape, y_train.shape

((560, 14), (560,))

In [119]:
x_test.shape, y_test.shape

((140, 14), (140,))
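The 560/140 shapes follow from `test_size = 0.2` applied to 700 rows. The sketch below is a simplified stand-in for a shuffled split, written with NumPy only. It is not scikit-learn's implementation and will not select the same rows as `train_test_split` with `random_state = 20`:

```python
import numpy as np

n_samples = 700     # rows in scaled_inputs
test_size = 0.2

rng = np.random.default_rng(20)         # the seed value is arbitrary here
shuffled = rng.permutation(n_samples)   # row positions in random order

n_test = int(round(n_samples * test_size))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
print(len(train_idx), len(test_idx))
```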

## Modeling

In [120]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [121]:
model = LogisticRegression()

In [122]:
model.fit(x_train, y_train)

LogisticRegression()

In [123]:
model.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [124]:
model.score(x_train, y_train)

0.775

#### Accuracy - manual check

Accuracy is the share of model outputs that match the targets. In this case 77.5% of the outputs match the targets on the training data.


In [125]:
model_outputs = model.predict(x_train)
model_outputs

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,

In [126]:
y_train

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0,

In [127]:
model_outputs == y_train

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False, False, False,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True, False,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
       False,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,

In [128]:
np.sum(model_outputs == y_train)

434

In [129]:
model_outputs.shape[0]

560

In [130]:
np.sum(model_outputs == y_train) / model_outputs.shape[0]

0.775

#### Intercept and coefficients

In [131]:
model.intercept_

array([-1.6561092])

In [132]:
model.coef_

array([[ 2.80096498e+00,  9.34857518e-01,  3.09561645e+00,
         8.56587468e-01,  1.66248119e-01, -8.43703301e-02,
         6.12732578e-01, -7.79685996e-03, -1.65922708e-01,
        -1.47005123e-04,  2.71811477e-01, -2.05738037e-01,
         3.61989880e-01, -2.85510745e-01]])

In [133]:
unscaled_inputs.columns.values

array(['Rfa_group_1', 'Rfa_group_2', 'Rfa_group_3', 'Rfa_group_4',
       'Month', 'Day of the week', 'Transportation Expense',
       'Distance to Work', 'Age', 'Daily Work Load Average',
       'Body Mass Index', 'Education', 'Children', 'Pets'], dtype=object)

In [134]:
feature_name = unscaled_inputs.columns.values 

In [135]:
summary_table = pd.DataFrame(columns = ['Feature name'], data=feature_name )
summary_table['Coefficient'] = np.transpose(model.coef_) # coef_ has shape (1, 14), a single row, so we transpose it into a column
summary_table

Unnamed: 0,Feature name,Coefficient
0,Rfa_group_1,2.800965
1,Rfa_group_2,0.934858
2,Rfa_group_3,3.095616
3,Rfa_group_4,0.856587
4,Month,0.166248
5,Day of the week,-0.08437
6,Transportation Expense,0.612733
7,Distance to Work,-0.007797
8,Age,-0.165923
9,Daily Work Load Average,-0.000147


#### Adding the intercept to the table

In [136]:
# We want to add the intercept at the beginning of the table, as if it were one of the features.

# Make room for the intercept: shifting every index by 1 leaves index 0 unused
summary_table.index = summary_table.index + 1 

# Add the intercept at index 0
summary_table.loc[0] = ['Intercept', model.intercept_[0]]

# Sort by index (the new row is appended at the end even though its index is 0)
summary_table = summary_table.sort_index()

summary_table

Unnamed: 0,Feature name,Coefficient
0,Intercept,-1.656109
1,Rfa_group_1,2.800965
2,Rfa_group_2,0.934858
3,Rfa_group_3,3.095616
4,Rfa_group_4,0.856587
5,Month,0.166248
6,Day of the week,-0.08437
7,Transportation Expense,0.612733
8,Distance to Work,-0.007797
9,Age,-0.165923


#### Coefficients = weights: the closer a coefficient is to 0, the smaller its weight (for models whose features are on the same scale, like this one)
#### Intercept = bias

In [137]:
# The logistic regression equation in our case is:
# log(odds) = intercept + b1*x1 + b2*x2 + ... + b14*x14
# so, using the values from the table:
# log(odds) = -1.66 + 2.80 * Rfa_group_1 + 0.93 * Rfa_group_2 + ... + (-0.29) * Pets

# log(odds) is the natural logarithm of the odds. The odds are p / (1 - p), where p is the probability of class 1 (excessively absent); class 0 means not excessively absent
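Log-odds turn into a probability through the logistic (sigmoid) function. As a rough illustration, the sketch below plugs in the intercept and the Rfa_group_3 coefficient reported by the model. It assumes that every other input is 0, which for the standardized features means "at the average". The helper name `sigmoid` is introduced here for illustration only:

```python
import math

def sigmoid(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

intercept = -1.6561092      # model.intercept_[0]
b_rfa_group_3 = 3.09561645  # coefficient of Rfa_group_3 from model.coef_

# Assumption: all other inputs are 0 (no dummy set, standardized features at their mean)
p_baseline = sigmoid(intercept)
p_group_3 = sigmoid(intercept + b_rfa_group_3 * 1)
print(round(p_baseline, 3), round(p_group_3, 3))
```

Under these assumptions the predicted probability of class 1 rises from roughly 0.16 for the baseline to roughly 0.81 when the reason for absence is in group 3.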

In [138]:
summary_table['Odds_ratio'] = np.exp(summary_table['Coefficient'])
summary_table

Unnamed: 0,Feature name,Coefficient,Odds_ratio
0,Intercept,-1.656109,0.19088
1,Rfa_group_1,2.800965,16.460523
2,Rfa_group_2,0.934858,2.546851
3,Rfa_group_3,3.095616,22.100858
4,Rfa_group_4,0.856587,2.35511
5,Month,0.166248,1.180866
6,Day of the week,-0.08437,0.919091
7,Transportation Expense,0.612733,1.845467
8,Distance to Work,-0.007797,0.992233
9,Age,-0.165923,0.847112


In [139]:
summary_table.sort_values('Odds_ratio', ascending = False)

Unnamed: 0,Feature name,Coefficient,Odds_ratio
3,Rfa_group_3,3.095616,22.100858
1,Rfa_group_1,2.800965,16.460523
2,Rfa_group_2,0.934858,2.546851
4,Rfa_group_4,0.856587,2.35511
7,Transportation Expense,0.612733,1.845467
13,Children,0.36199,1.436184
11,Body Mass Index,0.271811,1.31234
5,Month,0.166248,1.180866
10,Daily Work Load Average,-0.000147,0.999853
8,Distance to Work,-0.007797,0.992233


#### If a coefficient is close to 0 (equivalently, its odds ratio is close to 1), the feature has little influence on the prediction

Based on the above, we can say that these features are not really important:
* Daily Work Load Average
* Distance to Work
* Day of the week
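The same selection can be made programmatically. The sketch below copies the coefficients from `model.coef_` (rounded to six decimals) and flags those whose absolute value is below a cut-off. The cut-off of 0.1 is a hypothetical choice made for this illustration, not something the analysis prescribes:

```python
# Coefficients copied from model.coef_, rounded to 6 decimals
coefficients = {
    'Rfa_group_1': 2.800965,
    'Rfa_group_2': 0.934858,
    'Rfa_group_3': 3.095616,
    'Rfa_group_4': 0.856587,
    'Month': 0.166248,
    'Day of the week': -0.084370,
    'Transportation Expense': 0.612733,
    'Distance to Work': -0.007797,
    'Age': -0.165923,
    'Daily Work Load Average': -0.000147,
    'Body Mass Index': 0.271811,
    'Education': -0.205738,
    'Children': 0.361990,
    'Pets': -0.285511,
}

threshold = 0.1  # hypothetical cut-off on |coefficient|

weak_features = [name for name, coef in coefficients.items() if abs(coef) < threshold]
print(weak_features)
```

With this cut-off the flagged features are exactly the three listed above. A different threshold would give a different list, so the value should be treated as a judgment call.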

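The important features are those at the top and the bottom of the sorted table. The further an odds ratio is from 1 (equivalently, the larger the absolute value of the coefficient), the more important the feature.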

`The Reason for Absence interpretation`
Our baseline is an absence for which no reason is given (group 0, which was dropped). Every RFA coefficient is interpreted relative to group 0. For example, for Rfa_group_1 the odds of excessive absence are about 16 times higher than for an absence with no reason given.
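An odds ratio multiplies the odds, which is not the same as multiplying the probability. The sketch below works this through for Rfa_group_1 using the fitted intercept and coefficient, assuming all other inputs are 0 (standardized features at their mean):

```python
import math

intercept = -1.6561092       # model.intercept_[0]
b_rfa_group_1 = 2.80096498   # coefficient of Rfa_group_1 from model.coef_

odds_baseline = math.exp(intercept)       # odds for group 0, other inputs at 0
odds_ratio = math.exp(b_rfa_group_1)      # about 16.46
odds_group_1 = odds_baseline * odds_ratio

# Convert odds back to probabilities: p = odds / (1 + odds)
p_baseline = odds_baseline / (1 + odds_baseline)
p_group_1 = odds_group_1 / (1 + odds_group_1)

probability_ratio = p_group_1 / p_baseline
print(round(odds_ratio, 2), round(probability_ratio, 2))
```

Under these assumptions the odds are about 16 times higher, while the probability itself is only about 4.7 times higher, because a probability cannot exceed 1.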

`Transportation Expense`
We cannot interpret it directly in its original units because it is one of the standardized features. All we can say is that a one standard deviation increase in transportation expense (one standardized unit) multiplies the odds of excessive absence by about 1.85, so the odds almost double.

`Pets`
Just like Transportation Expense, this is a standardized feature. Its odds ratio is about 0.75, so a one standard deviation increase in the number of pets multiplies the odds of excessive absence by about 0.75, which makes excessive absence less likely.
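A standardized coefficient can be converted back to "per one pet" if the standard deviation of the column is known. The sketch below recovers that standard deviation from two values in the `scaled_inputs` output (Pets = 0 maps to -0.589690 and Pets = 1 maps to 0.268487). The conversion is an illustration added here, not a step performed in the notebook:

```python
import math

coef_pets = -0.285510745   # coefficient of Pets from model.coef_

# Standardized values of Pets = 0 and Pets = 1, read from scaled_inputs
z_zero_pets = -0.589690
z_one_pet = 0.268487

# One extra pet moves the standardized value by 1 / std, so std = 1 / difference
std_pets = 1.0 / (z_one_pet - z_zero_pets)

odds_ratio_per_sd = math.exp(coef_pets)              # per one standard deviation
odds_ratio_per_pet = math.exp(coef_pets / std_pets)  # per one additional pet
print(round(std_pets, 3), round(odds_ratio_per_sd, 3), round(odds_ratio_per_pet, 3))
```

Read this way, each additional pet multiplies the odds of excessive absence by roughly 0.78, holding the other inputs fixed.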

### Backward elimination

Backward elimination simplifies a model by removing features that have no meaningful impact on it (features that are not important).

### I will create a new notebook (a copy of this one) and drop the 3 variables that are not important:
* Daily Work Load Average
* Distance to Work
* Day of the week

### I will not do this in this notebook, so that its results are kept and can be compared with the new ones. The new file will be absenteeism_project_ML_backward_elimination.ipynb