Table of Contents
- 1 Import Dataset
- 2 Data Preprocessing
    - 2.1 Remove Irrelevant Data
    - 2.2 Convert Variables to Dummies
        - 2.2.1 Group the Reasons for Absence
        - 2.2.2 Concatenate Column Values
    - 2.3 Convert Timestamp
        - 2.3.1 Extract the Month Value
        - 2.3.2 Extract the Day of the Week
    - 2.4 Convert Categorical Variable
- 3 Load the Processed Data
    - 3.1 Create the Target
    - 3.2 Standardize the Data
    - 3.3 Train Test Split
- 4 Applying Logistic Regression
    - 4.1 Find Intercept and Coefficients
    - 4.2 Test the Model
- 5 Save the Model

# 1 Import Dataset

In [4]:
import pandas as pd
import numpy as np

In [5]:
raw_csv_data = pd.read_csv('Absenteeism_data.csv')

In [6]:
# Copy to a new df
df = raw_csv_data.copy()

In [7]:
pd.options.display.max_columns = None
pd.options.display.max_rows = None

In [8]:
df.head(1)

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,11,26,07/07/2015,289,36,33,239.554,30,1,2,1,4


# 2 Data Preprocessing

## 2.1 Remove Irrelevant Data

In [9]:
# Check Null data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 12 columns):
ID                           700 non-null int64
Reason for Absence           700 non-null int64
Date                         700 non-null object
Transportation Expense       700 non-null int64
Distance to Work             700 non-null int64
Age                          700 non-null int64
Daily Work Load Average      700 non-null float64
Body Mass Index              700 non-null int64
Education                    700 non-null int64
Children                     700 non-null int64
Pets                         700 non-null int64
Absenteeism Time in Hours    700 non-null int64
dtypes: float64(1), int64(10), object(1)
memory usage: 65.7+ KB


In [10]:
# Drop Column 'ID'
df = df.drop(['ID'], axis = 1)

## 2.2 Convert Variables to Dummies

In [11]:
# Check Column 'Reason for Absence'
df['Reason for Absence'].max()

28

In [12]:
# Check Column 'Reason for Absence'
df['Reason for Absence'].min()

0

In [13]:
# Check Column 'Reason for Absence'
df['Reason for Absence'].unique()
# or pd.unique(df['Reason for Absence'])

array([26,  0, 23,  7, 22, 19,  1, 11, 14, 21, 10, 13, 28, 18, 25, 24,  6,
       27, 17,  8, 12,  5,  9, 15,  4,  3,  2, 16], dtype=int64)

In [14]:
# Check Column 'Reason for Absence'
len(df['Reason for Absence'].unique())

28

In [15]:
sorted(df['Reason for Absence'].unique())

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28]

In [16]:
# Convert into dummies
reason_columns = pd.get_dummies(df['Reason for Absence'])

In [17]:
# Check that each row has exactly one reason
reason_columns['check'] = reason_columns.sum(axis=1)

In [18]:
# Check that the total is 700, i.e. one reason per record
reason_columns['check'].sum(axis=0)

700

In [19]:
# Check if only one unique value in Column 'check'
reason_columns['check'].unique()

array([1], dtype=int64)

In [20]:
# Clean up by removing the helper column 'check'
reason_columns = reason_columns.drop(['check'], axis =1)

In [21]:
# Drop the first column '0' to avoid potential multicollinearity issues
reason_columns = pd.get_dummies(df['Reason for Absence'], drop_first=True)

### 2.2.1 Group the Reasons for Absence

In [22]:
df.columns.values

array(['Reason for Absence', 'Date', 'Transportation Expense',
       'Distance to Work', 'Age', 'Daily Work Load Average',
       'Body Mass Index', 'Education', 'Children', 'Pets',
       'Absenteeism Time in Hours'], dtype=object)

In [23]:
reason_columns.columns.values

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 21, 22, 23, 24, 25, 26, 27, 28], dtype=int64)

In [24]:
# Drop Column 'Reason for Absence'
df = df.drop(['Reason for Absence'], axis = 1)

In [25]:
reason_type_1 = reason_columns.loc[:, 1:14].max(axis=1)
reason_type_2 = reason_columns.loc[:, 15:17].max(axis=1)
reason_type_3 = reason_columns.loc[:, 18:21].max(axis=1)
reason_type_4 = reason_columns.loc[:, 22:].max(axis=1)
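
The grouping above relies on two pandas behaviours: `.loc` slices columns by label and includes both endpoints, and `max(axis=1)` returns 1 whenever any dummy in the group is 1. A minimal sketch on a toy column (the values and the names `toy`, `group_a`, `group_b` are hypothetical, not taken from the Absenteeism data):

```python
import pandas as pd

# Toy reasons column (hypothetical values)
toy = pd.DataFrame({'Reason': [0, 1, 3, 2, 3]})

# One dummy column per reason; drop_first removes the column for reason 0
dummies = pd.get_dummies(toy['Reason'], drop_first=True).astype(int)

# .loc slices by column label and includes both endpoints
group_a = dummies.loc[:, 1:2].max(axis=1)  # 1 if the reason is 1 or 2
group_b = dummies.loc[:, 3:].max(axis=1)   # 1 if the reason is 3

print(group_a.tolist())
print(group_b.tolist())
```

A row with reason 0 ends up with 0 in every group, which is why the dropped first column acts as the baseline.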

### 2.2.2 Concatenate Column Values

In [26]:
#Concat
df = pd.concat([df, reason_type_1, reason_type_2, reason_type_3, reason_type_4], axis=1)

In [27]:
df.columns.values

array(['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 0, 1, 2, 3],
      dtype=object)

In [28]:
column_names=['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 'Reason_1', 'Reason_2', 'Reason_3', 'Reason_4']

In [29]:
df.columns = column_names

In [30]:
#Reorder Columns
column_names_reordered = [ 'Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours']
df=df[column_names_reordered]
df.head(1)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,07/07/2015,289,36,33,239.554,30,1,2,1,4


## 2.3 Convert Timestamp

In [31]:
df_reason_mod = df.copy()
type(df_reason_mod['Date'][0])
# Each date is stored as string

str

In [32]:
# Convert datetime
df_reason_mod['Date'] = pd.to_datetime(df_reason_mod['Date'], format='%d/%m/%Y')
df_reason_mod.head(1)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,1,2,1,4


In [33]:
type(df_reason_mod['Date'][0])
# Now it is stored as Timestamp

pandas._libs.tslibs.timestamps.Timestamp

### 2.3.1 Extract the Month Value

In [34]:
df_reason_mod['Date'][0].month

7

In [35]:
# Create a list to store all month values
list_months=[]
for i in range (df_reason_mod.shape[0]):
    list_months.append(df_reason_mod['Date'][i].month)

In [36]:
# Add list months into current df
df_reason_mod['Month Value'] = list_months
df_reason_mod.head(1)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,1,2,1,4,7


### 2.3.2 Extract the Day of the Week

In [37]:
df_reason_mod['Date'][699].weekday()

3

In [38]:
def date_to_weekday(date_value):
    return date_value.weekday()
df_reason_mod['Day of the Week'] = df_reason_mod['Date'].apply(date_to_weekday)
df_reason_mod.head(1)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value,Day of the Week
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,1,2,1,4,7,1
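
As an alternative to the explicit loop in 2.3.1 and the `apply` call above, both values can probably be extracted in one vectorised step with the `.dt` accessor of a datetime column. A minimal sketch, assuming toy dates in the same day/month/year format (the dates and variable names are hypothetical):

```python
import pandas as pd

# Toy dates in day/month/year format (hypothetical values)
dates = pd.to_datetime(pd.Series(['07/07/2015', '14/07/2015', '23/05/2018']),
                       format='%d/%m/%Y')

months = dates.dt.month      # 1 = January ... 12 = December
weekdays = dates.dt.weekday  # 0 = Monday ... 6 = Sunday

print(months.tolist())
print(weekdays.tolist())
```

The first toy date matches row 0 of the data, so its weekday value of 1 (Tuesday) agrees with the output shown above.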


In [39]:
# Drop 'Date' now that month and weekday are extracted, then reorder the columns
df_reason_mod = df_reason_mod.drop(['Date'], axis = 1)
df_reason_mod.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 'Month Value',
       'Day of the Week'], dtype=object)

In [40]:
sorted_columns = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Day of the Week', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours']
df_reason_mod = df_reason_mod[sorted_columns]
df_reason_mod.head(1)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,1,2,1,4


In [41]:
# Checkpoint
df_reason_date_mod = df_reason_mod.copy()

## 2.4 Convert Categorical Variable

In [42]:
df_reason_date_mod['Education'].unique()

array([1, 3, 2, 4], dtype=int64)

In [43]:
df_reason_date_mod['Education'].value_counts()

1    583
3     73
2     40
4      4
Name: Education, dtype: int64

In [44]:
df_reason_date_mod['Education']=df_reason_date_mod['Education'].map({1:0,2:1,3:1,4:1})
df_reason_date_mod['Education'].unique()

array([0, 1], dtype=int64)

In [45]:
df_reason_date_mod['Education'].value_counts()

0    583
1    117
Name: Education, dtype: int64

In [46]:
# Final Checkpoint
df_preprocessed = df_reason_date_mod.copy()
df_preprocessed.head(1)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4


# 3 Load the Processed Data

In [47]:
data_preprocessed = pd.read_csv('Absenteeism_preprocessed.csv')

In [48]:
data_preprocessed.head(1)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pet,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4


## 3.1 Create the Target

In [49]:
# Find the median of the dependent variable, Y
data_preprocessed['Absenteeism Time in Hours'].median()

3.0

In [50]:
# Map to binary targets with np.where: 1 if above the median, otherwise 0
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] >
                  data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,
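
The median split can be checked on a small array. This is a sketch with hypothetical hours, not the real column. Because the comparison is strict (`>`), values equal to the median fall into class 0, so the two classes need not be exactly balanced:

```python
import numpy as np

# Hypothetical absence hours
hours = np.array([4, 0, 2, 8, 3, 3, 40, 1])

cutoff = np.median(hours)                     # 3.0 for this toy array
targets_toy = np.where(hours > cutoff, 1, 0)  # values equal to the cutoff map to 0
share_positive = targets_toy.mean()           # fraction labelled 1

print(cutoff, targets_toy, share_positive)
```

Checking the share of 1s in the real `targets` the same way is a quick way to see how balanced the classes are before training.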

In [51]:
# Add targets into df
data_preprocessed['Excessive Absenteeism'] = targets

In [52]:
data_preprocessed.head(1)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pet,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1


In [53]:
# Drop original column
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours'], axis=1)
data_with_targets.head(1)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pet,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,1


## 3.2 Standardize the Data

In [54]:
# Select the independent variables (every column except the target, which is last)
unscaled_inputs = data_with_targets.iloc[:,:-1]

In [55]:
from sklearn.preprocessing import StandardScaler

# Subtract the mean and divide by the standard deviation
absenteeism_scaler = StandardScaler()
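
The standardisation described in the comment above can be reproduced by hand. A minimal sketch with hypothetical values, using the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses:

```python
import numpy as np

# Hypothetical feature values
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Subtract the mean and divide by the standard deviation (np.std defaults to ddof=0)
z = (x - x.mean()) / x.std()

print(z)
```

After the transformation the values have mean 0 and standard deviation 1, so coefficients of scaled features are expressed per standard deviation.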

In [56]:
# Import the classes needed to create the custom scaler.
# All of them are part of scikit-learn; one of them is StandardScaler itself,
# because the custom scaler is built on top of it.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

# Create the CustomScaler class

class CustomScaler(BaseEstimator, TransformerMixin):

    # __init__ declares what is needed to create a CustomScaler object

    def __init__(self, columns, copy=True, with_mean=True, with_std=True):

        # Store the parameters so that get_params() and repr() can find them
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        # The scaler is an ordinary StandardScaler; pass its parameters by keyword,
        # since recent scikit-learn releases no longer accept them positionally
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        # The 'twist': only these columns are scaled
        self.columns = columns
        self.mean_ = None
        self.var_ = None

    # The fit method, which is again based on StandardScaler

    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        # Per-column mean and population variance (ddof=0, as np.var would use)
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self

    # The transform method, which does the actual scaling

    def transform(self, X, y=None, copy=None):

        # Record the initial order of the columns
        init_col_order = X.columns

        # Scale the columns chosen when the instance was created;
        # keep the original index so the concat below aligns row by row
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]),
                                columns=self.columns, index=X.index)

        # Keep everything that was not scaled
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]

        # Return a data frame with scaled and unscaled features
        # in the original column order
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

In [57]:
# Check what are all columns that we've got
unscaled_inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Day of the Week', 'Transportation Expense', 'Distance to Work',
       'Age', 'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pet'], dtype=object)

In [58]:
# Choose the columns to scale
# We later augmented this code and put it in comments
# Columns_to_scale = ['Month Value','Day of the Week', 'Transportation Expense', 'Distance to Work',
       #'Age', 'Daily Work Load Average', 'Body Mass Index', 'Children', 'Pet']
    
# Select the columns to omit
columns_to_omit = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4','Education']

In [59]:
# Create the columns to scale, based on the columns to omit
# Use list comprehension to iterate over the list
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]

In [60]:
# Declare a scaler object, specifying the columns you want to scale
absenteeism_scaler = CustomScaler(columns_to_scale)

In [61]:
# Fit the data (calculate mean and standard deviation); they are automatically stored inside the object 
absenteeism_scaler.fit(unscaled_inputs)


CustomScaler(columns=['Month Value', 'Day of the Week', 'Transportation Expense', 'Distance to Work', 'Age', 'Daily Work Load Average', 'Body Mass Index', 'Children', 'Pet'],
       copy=None, with_mean=None, with_std=None)

In [62]:
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)
scaled_inputs.head()



Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pet
0,0,0,0,1,0.030796,-0.80095,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.030796,-0.80095,-1.574681,-1.141882,2.130803,-0.806331,1.002633,0,-0.01928,-0.58969
2,0,0,0,1,0.030796,-0.2329,-0.654143,1.426749,0.24831,-0.806331,1.002633,0,-0.91903,-0.58969
3,1,0,0,0,0.030796,0.335149,0.854936,-1.682647,0.405184,-0.806331,-0.643782,0,0.880469,-0.58969
4,0,0,0,1,0.030796,0.335149,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487


In [63]:
scaled_inputs.shape

(700, 14)

## 3.3 Train Test Split

In [64]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, train_size=0.8, random_state=20)



In [65]:
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(560, 14) (140, 14) (560,) (140,)


# 4 Applying Logistic Regression

In [66]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Train the model
reg = LogisticRegression()
reg.fit(x_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [67]:
reg.score(x_train, y_train)
# About 77% accuracy on the training data

0.7660714285714286

## 4.1 Find Intercept and Coefficients

In [68]:
reg.intercept_

array([-1.43101781])

In [69]:
reg.coef_

array([[ 2.61893423,  0.83461948,  2.95258195,  0.64428488,  0.01123706,
        -0.0748093 ,  0.62180009, -0.02934223, -0.17585164, -0.02583315,
         0.27705024, -0.29385863,  0.3549178 , -0.27486307]])

In [70]:
unscaled_inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Day of the Week', 'Transportation Expense', 'Distance to Work',
       'Age', 'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pet'], dtype=object)

In [71]:
feature_name=unscaled_inputs.columns.values
summary_table=pd.DataFrame(columns=['Feature name'], data=feature_name)
summary_table['Coefficient']=np.transpose(reg.coef_)
summary_table

Unnamed: 0,Feature name,Coefficient
0,Reason_1,2.618934
1,Reason_2,0.834619
2,Reason_3,2.952582
3,Reason_4,0.644285
4,Month Value,0.011237
5,Day of the Week,-0.074809
6,Transportation Expense,0.6218
7,Distance to Work,-0.029342
8,Age,-0.175852
9,Daily Work Load Average,-0.025833


In [72]:
summary_table.index = summary_table.index + 1
summary_table.loc[0]=['Intercept', reg.intercept_[0]]
summary_table=summary_table.sort_index()
summary_table

Unnamed: 0,Feature name,Coefficient
0,Intercept,-1.431018
1,Reason_1,2.618934
2,Reason_2,0.834619
3,Reason_3,2.952582
4,Reason_4,0.644285
5,Month Value,0.011237
6,Day of the Week,-0.074809
7,Transportation Expense,0.6218
8,Distance to Work,-0.029342
9,Age,-0.175852


In [73]:
summary_table['Odds_ratio']=np.exp(summary_table.Coefficient)
summary_table.sort_values('Odds_ratio', ascending=False)

Unnamed: 0,Feature name,Coefficient,Odds_ratio
3,Reason_3,2.952582,19.155348
1,Reason_1,2.618934,13.721092
2,Reason_2,0.834619,2.303937
4,Reason_4,0.644285,1.904625
7,Transportation Expense,0.6218,1.862277
13,Children,0.354918,1.426063
11,Body Mass Index,0.27705,1.319233
5,Month Value,0.011237,1.0113
10,Daily Work Load Average,-0.025833,0.974498
8,Distance to Work,-0.029342,0.971084


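A feature contributes little to the model when:
- its coefficient is close to 0
- its odds ratio is close to 1

By this measure, Month Value (odds ratio 1.0113), Daily Work Load Average (0.974498) and Distance to Work (0.971084) have almost no effect.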
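
The link between the two criteria is that the odds ratio is the exponential of the coefficient: a one-unit increase in a feature (one standard deviation for the scaled features) multiplies the odds of excessive absenteeism by `exp(coefficient)`. A short worked example using two coefficients copied from the summary table (the variable names are illustrative):

```python
import numpy as np

coef_month = 0.011237     # 'Month Value' coefficient from the summary table
coef_reason_3 = 2.952582  # 'Reason_3' coefficient from the summary table

odds_month = np.exp(coef_month)        # close to 1: almost no effect
odds_reason_3 = np.exp(coef_reason_3)  # far from 1: strong effect

print(round(odds_month, 4), round(odds_reason_3, 2))
```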

## 4.2 Test the Model

In [74]:
reg.score(x_test, y_test)
# Slightly lower accuracy than on the training data (0.75 vs. 0.77)

0.75

# 5 Save the Model

In [75]:
import pickle
with open('model', 'wb') as file:  # 'model' is the file name; 'wb' means "write bytes"
    pickle.dump(reg, file)  # Save the fitted model 'reg'

In [76]:
# Pickle the fitted scaler
with open('scaler','wb') as file:
    pickle.dump(absenteeism_scaler, file)
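
Loading the saved objects back is the mirror image of the cells above: open the file in `'rb'` (read bytes) mode and call `pickle.load`. The sketch below round-trips a small model trained on synthetic data in a temporary directory. The data, the path and the variable names are hypothetical stand-ins for `reg` and the `'model'` file.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic data standing in for the scaled inputs (hypothetical values)
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model')
    with open(path, 'wb') as file:
        pickle.dump(model, file)
    with open(path, 'rb') as file:  # 'rb' means "read bytes"
        restored = pickle.load(file)

# The restored model should behave exactly like the original
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```

Pickle stores a class by reference, so unpickling the scaler requires the `CustomScaler` class definition to be available in the session that loads it.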