# Logistic Regression to Predict Absenteeism

##### Motivation for Model

The goal is to predict absenteeism behavior among the firm's employees.

##### Motivation for Logistic Regression

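This is a classification problem: each observation is labelled as either excessive or moderate absenteeism. Linear regression predicts an unbounded continuous value, so it is poorly suited to a binary target; logistic regression models the probability of class membership instead.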

In [1]:
# Import relevant libraries
import pandas as pd
import numpy as np

In [2]:
# Load the data
data = pd.read_csv('Absenteeism_preprocessed.csv')

In [3]:
data.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month,Day,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,1,2,1,2


In [4]:
# Creating the targets for the model

# The median is the cut-off value for classification. Values above the median are
# classified as 'excessive'; values at or below it are classified as 'moderate'.
# The median is chosen because it is robust to outliers and gives a balanced result
# [i.e. a roughly 50-50 split between target classes].
data['Absenteeism Time in Hours'].median()

3.0

In [5]:
# For targets, members of the excessive class are labelled 1, members of the moderate class 0.
targets = np.where(data['Absenteeism Time in Hours'] > data['Absenteeism Time in Hours'].median(), 1, 0)
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [6]:
# Add targets to dataframe
data['Excessive Absenteeism'] = targets
data.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month,Day,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,1,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,1,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,1,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,1,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,1,2,1,2,0


In [7]:
data_with_targets = data.drop(['Absenteeism Time in Hours'],axis=1) # Absenteeism Time in Hours is no longer required
data_with_targets.head() 

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month,Day,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,1,2,1,1
1,0,0,0,0,7,1,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,1,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,1,2,0,1
4,0,0,0,1,7,3,289,36,33,239.554,30,1,2,1,0


In [8]:
targets.sum() / targets.shape[0] # Share of 1s among the targets: a roughly 45-55 split between 1s and 0s, which is balanced enough.

0.45571428571428574
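
The share of 1s falls below 50% because the comparison in `np.where` is strict: every row whose value equals the median is labelled 0. A minimal sketch of this effect, using made-up hours (`toy_hours` is illustrative and not taken from the dataset):

```python
import numpy as np

# Hypothetical absenteeism hours; three rows are tied at the median
toy_hours = np.array([4, 0, 2, 4, 2, 3, 3, 8, 1, 3])

toy_median = np.median(toy_hours)                      # cut-off value
toy_targets = np.where(toy_hours > toy_median, 1, 0)   # strict '>' sends ties to class 0
toy_share = toy_targets.mean()                         # fraction of rows labelled 1

print(toy_median, toy_targets, toy_share)
```

With many ties at the cut-off, the class labelled 1 can be noticeably smaller than half of the data, so it is worth checking the split as the notebook does.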

In [9]:
# Selecting inputs for regression

unscaled_inputs=data_with_targets.iloc[:,:-1]
unscaled_inputs

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month,Day,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,1,289,36,33,239.554,30,1,2,1
1,0,0,0,0,7,1,118,13,50,239.554,31,1,1,0
2,0,0,0,1,7,2,179,51,38,239.554,31,1,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,1,2,0
4,0,0,0,1,7,3,289,36,33,239.554,30,1,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,2,2,0
696,1,0,0,0,5,2,225,26,28,237.656,24,1,1,2
697,1,0,0,0,5,3,330,16,28,237.656,25,2,0,0
698,0,0,0,1,5,3,235,16,32,237.656,25,3,0,0


In [34]:
# Standardization
# A custom scaler is used so that only selected columns are standardized,
# instead of applying StandardScaler to every column.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
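class CustomScaler(BaseEstimator, TransformerMixin):
    """Standardize only the given columns of a DataFrame; leave the others unchanged."""

    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # StandardScaler accepts these options as keyword arguments
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None

    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        # Per-column mean and variance of the columns being scaled
        self.mean_ = np.mean(X[self.columns], axis=0)
        self.var_ = np.var(X[self.columns], axis=0)
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # Scale the selected columns, keeping X's index so the concat aligns row by row
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]),
                                columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        # Reassemble the columns in their original order
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]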

In [35]:
unscaled_inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month', 'Day',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [36]:
columns_to_scale = ['Month', 'Day',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets']
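
The cell that builds `scaled_inputs` is not shown here; presumably it fits the scaler on `unscaled_inputs` and transforms the listed columns. The sketch below illustrates the same idea of selective standardization using plain pandas arithmetic rather than `CustomScaler`. The names `standardize_columns` and `toy_inputs` are hypothetical and introduced only for illustration.

```python
import pandas as pd

def standardize_columns(df, columns):
    """Return a copy of df in which only `columns` have zero mean and unit variance."""
    # ddof=0 gives the population standard deviation, which is what StandardScaler uses
    scaled = (df[columns] - df[columns].mean()) / df[columns].std(ddof=0)
    untouched = df.drop(columns=columns)
    # Reassemble in the original column order
    return pd.concat([untouched, scaled], axis=1)[df.columns]

# Hypothetical miniature version of the inputs
toy_inputs = pd.DataFrame({
    'Reason_1': [0, 0, 1, 0],
    'Age': [33, 50, 38, 39],
    'Pets': [1, 0, 0, 0],
})

toy_scaled = standardize_columns(toy_inputs, ['Age', 'Pets'])
print(toy_scaled)
```

The dummy column `Reason_1` passes through unchanged, which keeps its coefficient interpretable as the effect of the flag being set.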

In [11]:
scaled_inputs.shape

(700, 14)

In [12]:
# Split into train and test data
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, train_size=0.8, random_state=20)

In [13]:
print(x_train.shape, y_train.shape)

(560, 14) (560,)


In [14]:
print(x_test.shape, y_test.shape)

(140, 14) (140,)


In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [16]:
reg = LogisticRegression()
reg.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [17]:
reg.score(x_train, y_train)

0.7857142857142857

##### Manual Accuracy Check

In [18]:
# Manually compare the model's predictions on the training data with the true training targets
model_outputs = reg.predict(x_train)
model_outputs == y_train

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True, False,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,

In [19]:
true_predictions = np.sum(model_outputs==y_train)
true_predictions

440

In [20]:
true_predictions / model_outputs.shape[0] # Accuracy check gives us same result as the reg.score() method

0.7857142857142857

##### Find intercepts and coefficients

In [21]:
reg.intercept_

array([-0.22294719])

In [22]:
reg.coef_

array([[ 2.07497405,  0.32866118,  1.55748421,  1.32926089,  0.18900507,
        -0.06599819,  0.70101952, -0.04027595, -0.20272797, -0.0063544 ,
         0.32452205, -0.13449813,  0.38128906, -0.33133716]])

In [23]:
feature_name = unscaled_inputs.columns.values # Feature names are the column names of unscaled_inputs

In [24]:
summary_table = pd.DataFrame(columns=['Feature name'], data = feature_name)
summary_table['Coefficient'] = np.transpose(reg.coef_) # transpose to get column data
summary_table

Unnamed: 0,Feature name,Coefficient
0,Reason_1,2.074974
1,Reason_2,0.328661
2,Reason_3,1.557484
3,Reason_4,1.329261
4,Month,0.189005
5,Day,-0.065998
6,Transportation Expense,0.70102
7,Distance to Work,-0.040276
8,Age,-0.202728
9,Daily Work Load Average,-0.006354


In [25]:
summary_table.index = summary_table.index + 1 # Shift the index by 1 to free index 0 for the intercept row
summary_table.loc[0] = ['Intercept', reg.intercept_[0]] # Fill the empty 0th index with the intercept
summary_table = summary_table.sort_index() # Sort so the intercept comes first, followed by the feature coefficients
summary_table

Unnamed: 0,Feature name,Coefficient
0,Intercept,-0.222947
1,Reason_1,2.074974
2,Reason_2,0.328661
3,Reason_3,1.557484
4,Reason_4,1.329261
5,Month,0.189005
6,Day,-0.065998
7,Transportation Expense,0.70102
8,Distance to Work,-0.040276
9,Age,-0.202728


##### Logistic Regression Model

The equation of the logistic regression model is: $\log(odds) = \beta_0 + \beta_1x_1 + \beta_2x_2 +\ldots +\beta_kx_k$, where $odds = p/(1-p)$ and $p$ is the probability of excessive absenteeism. The coefficients $\beta_i$ are therefore expressed on the log-odds scale. Exponentiating a coefficient, $e^{\beta_i}$, gives a metric called the odds ratio.
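
To make the equation concrete, the log-odds can be converted to a probability with the logistic (sigmoid) function, $p = 1/(1+e^{-\log(odds)})$. The sketch below uses the intercept and the Reason_1 coefficient from the outputs above. It assumes, purely for illustration, a hypothetical observation with Reason_1 = 1 and every other input equal to 0. The helper `sigmoid` is defined here and is not part of the notebook.

```python
import math

def sigmoid(log_odds):
    """Map log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

intercept = -0.22294719      # reg.intercept_[0] from the output above
beta_reason_1 = 2.07497405   # first entry of reg.coef_ (Reason_1)

# Assumption: Reason_1 = 1 and all other inputs are 0, so their terms vanish
log_odds = intercept + beta_reason_1 * 1
probability = sigmoid(log_odds)

print(round(log_odds, 4), round(probability, 3))
```

For real rows, `reg.predict_proba` performs this computation over all fourteen inputs at once.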

In [26]:
# Calculate odds ratio 
summary_table['Odds ratio'] = np.exp(summary_table.Coefficient)

In [33]:
summary_table = summary_table.sort_values(['Odds ratio'], ascending=False)
summary_table

Unnamed: 0,Feature name,Coefficient,Odds ratio
1,Reason_1,2.074974,7.96434
3,Reason_3,1.557484,4.746864
4,Reason_4,1.329261,3.77825
7,Transportation Expense,0.70102,2.015807
13,Children,0.381289,1.464171
2,Reason_2,0.328661,1.389107
11,Body Mass Index,0.324522,1.383369
5,Month,0.189005,1.208047
10,Daily Work Load Average,-0.006354,0.993666
8,Distance to Work,-0.040276,0.960524


##### Interpretation of Coefficient data

A feature has little influence on the prediction if its coefficient is close to zero or, equivalently, its odds ratio is close to 1. Features with large coefficients, such as those near the top of the table, are the most influential. In terms of weights, this is intuitive: according to the equation of the model, a coefficient close to zero multiplies the feature by nearly zero, so the term contributes almost nothing regardless of the feature's value.


In terms of the odds ratio, a one-unit change in a standardized feature multiplies the odds by that feature's odds ratio (an odds ratio of 1 means no change). From the coefficients, Daily Work Load Average, Distance to Work and Day of the week have odds ratios closest to 1 and thus seem to be the least important.
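
The multiplicative reading can be checked numerically: raising a feature by one unit adds its coefficient to the log-odds, which multiplies the odds by $e^{\beta}$. A small sketch using the Transportation Expense coefficient from the table, with an arbitrary starting log-odds chosen only for illustration:

```python
import math

beta = 0.70101952        # Transportation Expense coefficient from reg.coef_
base_log_odds = -0.5     # hypothetical starting point, not taken from the data

odds_before = math.exp(base_log_odds)
# One unit of a standardized feature corresponds to one standard deviation
odds_after = math.exp(base_log_odds + beta)

odds_ratio = odds_after / odds_before
print(odds_ratio, math.exp(beta))  # the two values agree up to floating-point error
```

The result does not depend on `base_log_odds`, which is why a single odds ratio per feature summarizes its effect on the odds.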

Furthermore, Reason_0, i.e. the column we dropped earlier, is the baseline case in which no reason is specified. Relative to this baseline, all four reason groups have odds ratios above 1. Reason_1, Reason_3 and Reason_4 have the highest odds ratios in the table, which marks them as the strongest predictors.