## Predicting Absenteeism using Machine Learning


This notebook will introduce some foundational machine learning and data science concepts by exploring the problem of absenteeism **classification**.

It is intended to be an end-to-end example of what a data science and machine learning **proof of concept** might look like.

## What is classification?

Classification involves deciding whether a sample is part of one class or another (**binary classification**). If there are more than two class options, it's referred to as **multi-class classification**.

More specifically, we'll look at the following topics.

* **Exploratory data analysis (EDA)** - the process of going through a dataset and finding out more about it.
* **Model training** - create model(s) to learn to predict a target variable based on other variables.
* **Model evaluation** - evaluating a model's predictions using problem-specific evaluation metrics.

To work through these topics, we'll use pandas, Matplotlib and NumPy for data analysis, as well as Scikit-Learn for machine learning and modelling tasks.

## Problem Definition
In our case, the problem we will be exploring is **binary classification** (a sample can only be one of two things).

This is because we're going to use a number of different **features** (pieces of information) about an employee to predict whether their absence from work is excessive (longer than the median number of hours) or not.

## Preparing the tools

At the start of any project, it's customary to see the required libraries imported in one big chunk, as you can see below.

However, in practice, your projects may import libraries as you go. After you've spent a couple of hours working on your problem, you'll probably want to do some tidying up. This is where you may want to consolidate every library you've used at the top of your notebook (like the cell below).

The libraries you use will differ from project to project. But there are a few which you'll likely take advantage of during almost every structured data project.

* [pandas](https://pandas.pydata.org/) for data analysis.
* [NumPy](https://numpy.org/) for numerical operations.
* [Matplotlib](https://matplotlib.org/)/[seaborn](https://seaborn.pydata.org/) for plotting or data visualization.
* [Scikit-Learn](https://scikit-learn.org/stable/) for machine learning modelling and evaluation.

In [1]:
# Data analysis and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Data preprocessing and model building libraries
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Evaluation libraries
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.metrics import precision_score,recall_score,f1_score

# Other libraries
import warnings
warnings.filterwarnings('ignore')

## Let's begin with loading the data


In [2]:
# We will use pandas built-in function to read .csv files
data = pd.read_csv('Absentism.csv')

In [3]:
# Let's check the top rows of our data
data.head()

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,11,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,36,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,3,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,11,23,23/07/2015,289,36,33,239.554,30,1,2,1,2


In [4]:
data.shape

(700, 12)

In [5]:
# data.describe: Generates descriptive statistics.
data.describe()

Unnamed: 0,ID,Reason for Absence,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
count,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0
mean,17.951429,19.411429,222.347143,29.892857,36.417143,271.801774,26.737143,1.282857,1.021429,0.687143,6.761429
std,11.028144,8.356292,66.31296,14.804446,6.379083,40.021804,4.254701,0.66809,1.112215,1.166095,12.670082
min,1.0,0.0,118.0,5.0,27.0,205.917,19.0,1.0,0.0,0.0,0.0
25%,9.0,13.0,179.0,16.0,31.0,241.476,24.0,1.0,0.0,0.0,2.0
50%,18.0,23.0,225.0,26.0,37.0,264.249,25.0,1.0,1.0,0.0,3.0
75%,28.0,27.0,260.0,50.0,40.0,294.217,31.0,1.0,2.0,1.0,8.0
max,36.0,28.0,388.0,52.0,58.0,378.884,38.0,4.0,4.0,8.0,120.0


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID                         700 non-null    int64  
 1   Reason for Absence         700 non-null    int64  
 2   Date                       700 non-null    object 
 3   Transportation Expense     700 non-null    int64  
 4   Distance to Work           700 non-null    int64  
 5   Age                        700 non-null    int64  
 6   Daily Work Load Average    700 non-null    float64
 7   Body Mass Index            700 non-null    int64  
 8   Education                  700 non-null    int64  
 9   Children                   700 non-null    int64  
 10  Pets                       700 non-null    int64  
 11  Absenteeism Time in Hours  700 non-null    int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 65.8+ KB


There are no missing values, and all the columns are numerical except Date.

We will drop the ID column since it's not important for our analysis or for building the model.
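The two observations above (no missing values, one non-numeric column) can also be checked directly rather than read off the `info()` output. A minimal sketch on a toy frame; the `toy` DataFrame and its values are made up for illustration and only mimic the shape of the real data:

```python
import pandas as pd

# Toy frame standing in for the absenteeism data (values invented for illustration)
toy = pd.DataFrame({
    "Age": [33, 50, 38],
    "Date": ["07/07/2015", "14/07/2015", "15/07/2015"],
    "Pets": [1, 0, 0],
})

# Number of missing values per column; all zeros means nothing is missing
missing_per_column = toy.isna().sum()

# Columns whose dtype is not numeric (here only the text-based Date column)
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(non_numeric)
```

On the real data the same two calls would be applied to `data` instead of `toy`.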

In [7]:
data = data.drop(['ID'],axis=1)

In [8]:
data.head()

Unnamed: 0,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,23,23/07/2015,289,36,33,239.554,30,1,2,1,2


Let's look at the distinct reasons for absence.

In [9]:
pd.unique(data['Reason for Absence'])

array([26,  0, 23,  7, 22, 19,  1, 11, 14, 21, 10, 13, 28, 18, 25, 24,  6,
       27, 17,  8, 12,  5,  9, 15,  4,  3,  2, 16], dtype=int64)

In [10]:
len(pd.unique(data['Reason for Absence']))

28

There are 28 distinct reason codes for absence from work. We will now create dummies for the reasons.

In [11]:
# We will drop the first column in order to avoid multicollinearity
reasons = pd.get_dummies(data['Reason for Absence'],drop_first=True)

In [12]:
reasons

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
696,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
697,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
698,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
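To see why dropping the first dummy column helps, here is a minimal sketch on a handful of reason codes (the `codes` Series is invented for illustration). With every dummy kept, each row sums to 1, so any one column can be reconstructed from the others; `drop_first=True` removes the column for the smallest code.

```python
import pandas as pd

# A few reason codes, invented for illustration
codes = pd.Series([26, 0, 23, 7, 23], name="Reason for Absence")

full = pd.get_dummies(codes)                      # one column per distinct code
reduced = pd.get_dummies(codes, drop_first=True)  # drops the column for the smallest code (0)

# With all columns kept, every row sums to 1, so one column is implied by the rest
print(full.sum(axis=1).unique())
print(list(full.columns), list(reduced.columns))
```

A row of all zeros in `reduced` therefore means "code 0", the dropped baseline.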


Using the same method, we will generate age dummies as well.

In [13]:
age_dummies = pd.get_dummies(data['Age'],drop_first=True)

In [14]:
age_dummies

Unnamed: 0,28,29,30,31,32,33,34,36,37,38,39,40,41,43,46,47,48,49,50,58
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
696,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
697,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
698,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [15]:
reasons

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
696,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
697,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
698,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


As you can see, the reasons are now spread across 27 dummy columns. It's recommended to shrink them, so we will group them into four categories based on their type.

According to the dataset, the reason codes fall into 4 groups:

- reason type 1: Disease (1 to 14)
- reason type 2: Pregnancy (15 to 17)
- reason type 3: External causes (18 to 21)
- reason type 4: Follow up (22 to 28)

In [16]:
reason_type1 = reasons.iloc[:,1:14].max(axis=1)

In [17]:
reason_type1

0      0
1      0
2      0
3      1
4      0
      ..
695    1
696    1
697    1
698    0
699    0
Length: 700, dtype: uint8

In [18]:
reason_type2 = reasons.iloc[:,15:17].max(axis=1)

In [19]:
reason_type3 = reasons.iloc[:,18:21].max(axis=1)

In [20]:
reason_type4 = reasons.iloc[:,22:].max(axis=1)
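Keep in mind that `iloc` selects columns by position, while the reason codes are column labels. Because the dummy for code 0 was dropped and code 20 never occurs in the data, positions and codes are offset from each other. A label-based selection with `.loc` is one way to make the groups follow the code ranges listed above. A minimal sketch on toy data, where the codes in `codes` are invented for illustration:

```python
import pandas as pd

# Toy reason codes; code 20 is deliberately absent, as in the data above
codes = pd.Series([1, 7, 15, 19, 21, 23, 26, 0])
dummies = pd.get_dummies(codes, drop_first=True)

# .loc slices by column label and includes both endpoints
group_1 = dummies.loc[:, 1:14].max(axis=1).astype(int)   # codes 1 to 14
group_2 = dummies.loc[:, 15:17].max(axis=1).astype(int)  # codes 15 to 17
group_3 = dummies.loc[:, 18:21].max(axis=1).astype(int)  # codes 18 to 21
group_4 = dummies.loc[:, 22:28].max(axis=1).astype(int)  # codes 22 to 28

print(pd.concat([codes, group_1, group_2, group_3, group_4], axis=1))
```

Label slices like `1:14` work here because the column labels are sorted integers, so a bound that is missing from the columns is still accepted.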

It's time to concatenate the dataset with the preprocessed reason columns from above.

In [21]:
data = pd.concat([data,reason_type1,reason_type2,reason_type3,reason_type4],axis=1)

In [22]:
data

Unnamed: 0,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,0,1,2,3
0,26,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,0,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,23,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,0,0
3,7,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,23,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,10,23/05/2018,179,22,40,237.656,22,2,2,0,8,1,0,0,0
696,6,23/05/2018,225,26,28,237.656,24,1,1,2,3,1,0,0,0
697,10,24/05/2018,330,16,28,237.656,25,2,0,0,8,1,0,0,0
698,23,24/05/2018,235,16,32,237.656,25,3,0,0,2,0,0,0,0


In [23]:
# Drop the 'Reason for Absence' column from the dataset
data.drop('Reason for Absence',axis=1,inplace=True)

In [24]:
data

Unnamed: 0,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,0,1,2,3
0,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,0,0
3,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,23/05/2018,179,22,40,237.656,22,2,2,0,8,1,0,0,0
696,23/05/2018,225,26,28,237.656,24,1,1,2,3,1,0,0,0
697,24/05/2018,330,16,28,237.656,25,2,0,0,8,1,0,0,0
698,24/05/2018,235,16,32,237.656,25,3,0,0,2,0,0,0,0


The concatenated reason columns came in with the default names 0 to 3, so let's rename them.

In [25]:
data.columns.values

array(['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 0, 1, 2, 3],
      dtype=object)

In [26]:
columns = ['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 'Reason_1', 'Reason_2', 'Reason_3', 'Reason_4']

In [27]:
data.columns = columns
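As an alternative to overwriting the full list of column names, the Series can be given names before they are concatenated, so the new columns arrive already labelled. A minimal sketch; `base`, `group_1` and `group_2` are small made-up stand-ins for the real objects:

```python
import pandas as pd

# Small stand-ins for the dataset and two of the grouped reason Series
base = pd.DataFrame({"Age": [33, 50]})
group_1 = pd.Series([0, 1], name="Reason_1")
group_2 = pd.Series([1, 0], name="Reason_2")

# A named Series becomes a column with that name when concatenated along axis=1
combined = pd.concat([base, group_1, group_2], axis=1)
print(combined.columns.tolist())
```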

In [28]:
data

Unnamed: 0,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Reason_1,Reason_2,Reason_3,Reason_4
0,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,0,0
3,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,23/05/2018,179,22,40,237.656,22,2,2,0,8,1,0,0,0
696,23/05/2018,225,26,28,237.656,24,1,1,2,3,1,0,0,0
697,24/05/2018,330,16,28,237.656,25,2,0,0,8,1,0,0,0
698,24/05/2018,235,16,32,237.656,25,3,0,0,2,0,0,0,0


In [29]:
columns_rearranged =  ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4','Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours' ]

In [30]:
data = data[columns_rearranged]

In [31]:
data

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,0,0,0,0,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,0,0,0,0,23/07/2015,289,36,33,239.554,30,1,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,23/05/2018,179,22,40,237.656,22,2,2,0,8
696,1,0,0,0,23/05/2018,225,26,28,237.656,24,1,1,2,3
697,1,0,0,0,24/05/2018,330,16,28,237.656,25,2,0,0,8
698,0,0,0,0,24/05/2018,235,16,32,237.656,25,3,0,0,2


In [32]:
# Parse the Date strings (day/month/year) into pandas datetime values
data['Date'] = pd.to_datetime(data['Date'],format='%d/%m/%Y')

In [33]:
type(data['Date'])

pandas.core.series.Series

In [34]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Reason_1                   700 non-null    uint8         
 1   Reason_2                   700 non-null    uint8         
 2   Reason_3                   700 non-null    uint8         
 3   Reason_4                   700 non-null    uint8         
 4   Date                       700 non-null    datetime64[ns]
 5   Transportation Expense     700 non-null    int64         
 6   Distance to Work           700 non-null    int64         
 7   Age                        700 non-null    int64         
 8   Daily Work Load Average    700 non-null    float64       
 9   Body Mass Index            700 non-null    int64         
 10  Education                  700 non-null    int64         
 11  Children                   700 non-null    int64         
 12  Pets    

In [35]:
data['Date'][0].month

7

In [36]:
list_month=[]
list_month

[]

In [37]:
for i in range(data.shape[0]):
    list_month.append(data['Date'][i].month)
    

In [38]:
list_month

[7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
 7,
 

In [39]:
len(list_month)

700

In [40]:
data['Month'] = list_month

In [41]:
data.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,1,2,1,4,7
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,1,1,0,0,7
2,0,0,0,0,2015-07-15,179,51,38,239.554,31,1,0,0,2,7
3,1,0,0,0,2015-07-16,279,5,39,239.554,24,1,2,0,4,7
4,0,0,0,0,2015-07-23,289,36,33,239.554,30,1,2,1,2,7


In [42]:
data['Date'][0].weekday()

1

In [43]:
def date_to_weekday(date_value):
    return date_value.weekday()

In [44]:
data['Day of the week'] = data['Date'].apply(date_to_weekday)
data.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month,Day of the week
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,1,2,1,4,7,1
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,1,1,0,0,7,1
2,0,0,0,0,2015-07-15,179,51,38,239.554,31,1,0,0,2,7,2
3,1,0,0,0,2015-07-16,279,5,39,239.554,24,1,2,0,4,7,3
4,0,0,0,0,2015-07-23,289,36,33,239.554,30,1,2,1,2,7,3
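The month and weekday can also be extracted without a loop or a helper function, using the `.dt` accessor that pandas provides on datetime columns. A minimal sketch using the first three dates shown above (the standalone `dates` Series is only for illustration):

```python
import pandas as pd

# First three dates from the table above, parsed as day/month/year
dates = pd.to_datetime(
    pd.Series(["07/07/2015", "14/07/2015", "15/07/2015"]),
    format="%d/%m/%Y",
)

months = dates.dt.month      # 1 = January ... 12 = December
weekdays = dates.dt.weekday  # 0 = Monday ... 6 = Sunday

print(months.tolist())
print(weekdays.tolist())
```

On the notebook's data this would correspond to `data['Date'].dt.month` and `data['Date'].dt.weekday`.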


In [45]:
data['Education'].unique()

array([1, 3, 2, 4], dtype=int64)

In [46]:
data['Education'].value_counts()

1    583
3     73
2     40
4      4
Name: Education, dtype: int64

In [47]:
data['Education'] = data['Education'].map({1:0,2:1,3:1,4:1})

In [48]:
data['Education'].value_counts()

0    583
1    117
Name: Education, dtype: int64

In [49]:
data

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month,Day of the week
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,0,2,1,4,7,1
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,0,1,0,0,7,1
2,0,0,0,0,2015-07-15,179,51,38,239.554,31,0,0,0,2,7,2
3,1,0,0,0,2015-07-16,279,5,39,239.554,24,0,2,0,4,7,3
4,0,0,0,0,2015-07-23,289,36,33,239.554,30,0,2,1,2,7,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,2018-05-23,179,22,40,237.656,22,1,2,0,8,5,2
696,1,0,0,0,2018-05-23,225,26,28,237.656,24,0,1,2,3,5,2
697,1,0,0,0,2018-05-24,330,16,28,237.656,25,1,0,0,8,5,3
698,0,0,0,0,2018-05-24,235,16,32,237.656,25,1,0,0,2,5,3


We have preprocessed our dataset and can now proceed to building a logistic regression model.

In [50]:
data['Absenteeism Time in Hours'].median()

3.0

In [51]:
targets = np.where(data['Absenteeism Time in Hours']>data['Absenteeism Time in Hours'].median(),1,0)

In [52]:
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [53]:
data['Eccessive Absentism'] = targets

In [54]:
data.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month,Day of the week,Eccessive Absentism
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,0,2,1,4,7,1,1
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,0,1,0,0,7,1,0
2,0,0,0,0,2015-07-15,179,51,38,239.554,31,0,0,0,2,7,2,0
3,1,0,0,0,2015-07-16,279,5,39,239.554,24,0,2,0,4,7,3,1
4,0,0,0,0,2015-07-23,289,36,33,239.554,30,0,2,1,2,7,3,0


In [55]:
# check if dataset is balanced (what % of targets are 1s)
# targets.sum() will give us the number of 1s that there are
# the shape[0] will give us the length of the targets array
targets.sum()/targets.shape[0]

0.45571428571428574
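The share of 1s is a little below 50% because the comparison is strict: rows whose hours are exactly equal to the median fall into class 0. A minimal sketch of the same median split on made-up hours (the `hours` values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up absence hours with several values tied at the median
hours = pd.Series([4, 0, 2, 4, 2, 8, 3, 3, 3])

cutoff = hours.median()                   # middle value of the sorted hours
targets = np.where(hours > cutoff, 1, 0)  # 1 only when strictly above the median
share_of_ones = targets.mean()            # same as targets.sum() / targets.shape[0]

print(cutoff, targets, share_of_ones)
```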

In [56]:
data.columns

Index(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Date',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education', 'Children',
       'Pets', 'Absenteeism Time in Hours', 'Month', 'Day of the week',
       'Eccessive Absentism'],
      dtype='object')

In [57]:
# create a checkpoint by dropping the raw hours column,
# which is now encoded in the 'Eccessive Absentism' target
data_with_targets = data.drop(['Absenteeism Time in Hours'],axis=1)

In [58]:
data_with_targets

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Month,Day of the week,Eccessive Absentism
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,0,2,1,7,1,1
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,0,1,0,7,1,0
2,0,0,0,0,2015-07-15,179,51,38,239.554,31,0,0,0,7,2,0
3,1,0,0,0,2015-07-16,279,5,39,239.554,24,0,2,0,7,3,1
4,0,0,0,0,2015-07-23,289,36,33,239.554,30,0,2,1,7,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,2018-05-23,179,22,40,237.656,22,1,2,0,5,2,1
696,1,0,0,0,2018-05-23,225,26,28,237.656,24,0,1,2,5,2,0
697,1,0,0,0,2018-05-24,330,16,28,237.656,25,1,0,0,5,3,1
698,0,0,0,0,2018-05-24,235,16,32,237.656,25,1,0,0,5,3,0


## Select the input for the regression

In [59]:
unscaled_inputs=data_with_targets.iloc[:,:-1]

In [60]:
unscaled_inputs.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Month,Day of the week
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,0,2,1,7,1
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,0,1,0,7,1
2,0,0,0,0,2015-07-15,179,51,38,239.554,31,0,0,0,7,2
3,1,0,0,0,2015-07-16,279,5,39,239.554,24,0,2,0,7,3
4,0,0,0,0,2015-07-23,289,36,33,239.554,30,0,2,1,7,3


In [61]:
# Date is not numeric, and its information is already captured by Month and Day of the week
unscaled_inputs = unscaled_inputs.drop('Date',axis=1)
unscaled_inputs.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Month,Day of the week
0,0,0,0,1,289,36,33,239.554,30,0,2,1,7,1
1,0,0,0,0,118,13,50,239.554,31,0,1,0,7,1
2,0,0,0,0,179,51,38,239.554,31,0,0,0,7,2
3,1,0,0,0,279,5,39,239.554,24,0,2,0,7,3
4,0,0,0,0,289,36,33,239.554,30,0,2,1,7,3


## Standardizing the data

In [62]:
# StandardScaler was already imported at the top of the notebook
scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation, then applies the scaling
scaled_inputs = scaler.fit_transform(unscaled_inputs)

In [63]:
scaled_inputs

array([[-0.54212562, -0.0758098 , -0.34381807, ...,  0.26848661,
         0.18272635, -0.68370352],
       [-0.54212562, -0.0758098 , -0.34381807, ..., -0.58968976,
         0.18272635, -0.68370352],
       [-0.54212562, -0.0758098 , -0.34381807, ..., -0.58968976,
         0.18272635, -0.00772546],
       ...,
       [ 1.84459094, -0.0758098 , -0.34381807, ..., -0.58968976,
        -0.3882935 ,  0.66825259],
       [-0.54212562, -0.0758098 , -0.34381807, ..., -0.58968976,
        -0.3882935 ,  0.66825259],
       [-0.54212562, -0.0758098 , -0.34381807, ...,  0.26848661,
        -0.3882935 ,  0.66825259]])

In [64]:
scaled_inputs.shape

(700, 14)

## Split the data for training and testing

In [65]:
x_train,x_test,y_train,y_test = train_test_split(scaled_inputs,targets,test_size=0.3,random_state=49)
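In the cells above, the scaler is fitted on all 700 rows before the split, so the test rows influence the means and standard deviations. One common alternative is to fit the scaler on the training portion only, for example with a scikit-learn pipeline. A minimal sketch on synthetic data; the arrays `X` and `y` are generated for illustration and are not the absenteeism data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic inputs and binary targets, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Split first, so the test rows stay unseen during scaling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=49
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when scoring the test rows
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(test_accuracy)
```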

## Create the regression model

In [66]:
reg = LogisticRegression()
reg.fit(x_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [67]:
reg.score(x_train,y_train)

0.7653061224489796

In [68]:
model_output = reg.predict(x_train)

In [69]:
# manual accuracy check: the share of training predictions that match the targets
np.sum(model_output==y_train)/model_output.shape[0]

0.7653061224489796

In [70]:
reg.intercept_

array([-0.19257193])

In [71]:
reg.coef_

array([[ 0.73661346, -0.02437763,  0.83564455,  0.00974289,  0.56757181,
         0.00487983, -0.05131194,  0.09232266,  0.13707803,  0.03548501,
         0.39021612, -0.38696859,  0.071148  , -0.33892258]])

In [72]:
feature_name=unscaled_inputs.columns.values

In [73]:
summary_table = pd.DataFrame(columns=['Features'],data=feature_name)

In [74]:
summary_table['Coefficient'] = np.transpose(reg.coef_)

In [75]:
summary_table

Unnamed: 0,Features,Coefficient
0,Reason_1,0.736613
1,Reason_2,-0.024378
2,Reason_3,0.835645
3,Reason_4,0.009743
4,Transportation Expense,0.567572
5,Distance to Work,0.00488
6,Age,-0.051312
7,Daily Work Load Average,0.092323
8,Body Mass Index,0.137078
9,Education,0.035485


In [76]:
# do a little Python trick to move the intercept to the top of the summary table
# move all indices by 1
summary_table.index = summary_table.index + 1

# add the intercept at index 0
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]

# sort the df by index
summary_table = summary_table.sort_index()
summary_table

Unnamed: 0,Features,Coefficient
0,Intercept,-0.192572
1,Reason_1,0.736613
2,Reason_2,-0.024378
3,Reason_3,0.835645
4,Reason_4,0.009743
5,Transportation Expense,0.567572
6,Distance to Work,0.00488
7,Age,-0.051312
8,Daily Work Load Average,0.092323
9,Body Mass Index,0.137078


## Interpreting the coefficients

In [77]:
summary_table['Odds_ratio'] = np.exp(summary_table.Coefficient)

In [78]:
summary_table

Unnamed: 0,Features,Coefficient,Odds_ratio
0,Intercept,-0.192572,0.824835
1,Reason_1,0.736613,2.08885
2,Reason_2,-0.024378,0.975917
3,Reason_3,0.835645,2.3063
4,Reason_4,0.009743,1.009791
5,Transportation Expense,0.567572,1.763979
6,Distance to Work,0.00488,1.004892
7,Age,-0.051312,0.949982
8,Daily Work Load Average,0.092323,1.096719
9,Body Mass Index,0.137078,1.146918


In [79]:
summary_table.sort_values('Odds_ratio', ascending=False)

Unnamed: 0,Features,Coefficient,Odds_ratio
3,Reason_3,0.835645,2.3063
1,Reason_1,0.736613,2.08885
5,Transportation Expense,0.567572,1.763979
11,Children,0.390216,1.4773
9,Body Mass Index,0.137078,1.146918
8,Daily Work Load Average,0.092323,1.096719
13,Month,0.071148,1.07374
10,Education,0.035485,1.036122
4,Reason_4,0.009743,1.009791
6,Distance to Work,0.00488,1.004892
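A coefficient is a change in log-odds, and its exponential is the factor by which the odds are multiplied. Since every input was standardized (including the dummy columns), "one unit" here means one standard deviation of that feature rather than a switch from 0 to 1. A minimal sketch of what the Reason_1 odds ratio means for a probability; the 30% starting probability is an arbitrary value chosen for illustration:

```python
import numpy as np

coefficient = 0.736613            # Reason_1 coefficient from the table above
odds_ratio = np.exp(coefficient)  # factor by which the odds are multiplied

# Effect on a probability, starting from an arbitrary 30% baseline
p_before = 0.30
odds_before = p_before / (1 - p_before)
odds_after = odds_before * odds_ratio
p_after = odds_after / (1 + odds_after)

print(round(odds_ratio, 5), round(p_after, 3))
```

An odds ratio close to 1 (such as the one for Distance to Work) means the feature barely moves the prediction.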


## Testing the model

In [80]:
# assess the test accuracy of the model
reg.score(x_test,y_test)

0.7666666666666667

In [81]:
# find the predicted probabilities of each class
# the first column is the probability that an observation belongs to class 0, the second that it belongs to class 1
predicted_proba = reg.predict_proba(x_test)

# let's check that out
predicted_proba

array([[0.09115308, 0.90884692],
       [0.06948741, 0.93051259],
       [0.85837255, 0.14162745],
       [0.29073226, 0.70926774],
       [0.79308621, 0.20691379],
       [0.88985349, 0.11014651],
       [0.66869188, 0.33130812],
       [0.71792184, 0.28207816],
       [0.84306338, 0.15693662],
       [0.84701017, 0.15298983],
       [0.68039287, 0.31960713],
       [0.78728142, 0.21271858],
       [0.35655413, 0.64344587],
       [0.08541165, 0.91458835],
       [0.44625695, 0.55374305],
       [0.31150078, 0.68849922],
       [0.8480045 , 0.1519955 ],
       [0.22452346, 0.77547654],
       [0.80112015, 0.19887985],
       [0.08103934, 0.91896066],
       [0.79174622, 0.20825378],
       [0.01728104, 0.98271896],
       [0.8401077 , 0.1598923 ],
       [0.08899379, 0.91100621],
       [0.82952513, 0.17047487],
       [0.40442731, 0.59557269],
       [0.43964201, 0.56035799],
       [0.70259503, 0.29740497],
       [0.30796633, 0.69203367],
       [0.4683974 , 0.5316026 ],
       [0.

In [82]:
predicted_proba.shape

(210, 2)

In [83]:
# select ONLY the probabilities referring to 1s
predicted_proba[:,1]

array([0.90884692, 0.93051259, 0.14162745, 0.70926774, 0.20691379,
       0.11014651, 0.33130812, 0.28207816, 0.15693662, 0.15298983,
       0.31960713, 0.21271858, 0.64344587, 0.91458835, 0.55374305,
       0.68849922, 0.1519955 , 0.77547654, 0.19887985, 0.91896066,
       0.20825378, 0.98271896, 0.1598923 , 0.91100621, 0.17047487,
       0.59557269, 0.56035799, 0.29740497, 0.69203367, 0.5316026 ,
       0.15458342, 0.70378729, 0.8421028 , 0.28421491, 0.63740925,
       0.75857127, 0.15115742, 0.51957474, 0.88008731, 0.14902198,
       0.55405879, 0.12973875, 0.62917702, 0.48002423, 0.58045221,
       0.50796327, 0.87919872, 0.28327471, 0.36339596, 0.86131321,
       0.07809243, 0.20535402, 0.31068216, 0.14815131, 0.28485807,
       0.23411838, 0.30865093, 0.06363876, 0.77013941, 0.24957454,
       0.11427081, 0.16538541, 0.2007108 , 0.33994464, 0.62619826,
       0.79202438, 0.93574947, 0.1291098 , 0.41395175, 0.16150274,
       0.23851383, 0.24462256, 0.15973416, 0.16092049, 0.34953
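Accuracy is only one view of the model. The probabilities for class 1 can be turned into predictions with a threshold and then scored with the metrics imported at the top of the notebook. A minimal sketch; the `y_true` labels and `proba_of_one` values are made up for illustration, and the 0.5 threshold is an assumption that could be adjusted:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up true labels and class-1 probabilities, for illustration only
y_true = np.array([1, 1, 0, 1, 0, 0, 0, 1])
proba_of_one = np.array([0.91, 0.93, 0.14, 0.45, 0.21, 0.60, 0.33, 0.71])

# Predict class 1 when its probability is above the threshold
threshold = 0.5
y_pred = (proba_of_one > threshold).astype(int)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # share of predicted 1s that are correct
recall = recall_score(y_true, y_pred)        # share of true 1s that were found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(cm)
print(precision, recall, f1)
```

Raising the threshold would typically trade recall for precision, which may matter if false alarms about excessive absence are costly.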

## Save the model

In [84]:
import pickle

In [85]:
# pickle the model file
with open('model', 'wb') as file:
    pickle.dump(reg, file)
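A saved model can be loaded back with `pickle.load`. Because the model was trained on standardized inputs, new data would need to be scaled with the same fitted scaler before predicting, so it may be worth saving the scaler too. A minimal round-trip sketch on a tiny made-up model; the data, the temporary path and the name `clf` are assumptions for illustration:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set, for illustration only
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)

# Write the fitted model to a temporary file
path = os.path.join(tempfile.mkdtemp(), "model")
with open(path, "wb") as file:
    pickle.dump(clf, file)

# Read it back and confirm it predicts the same as the original
with open(path, "rb") as file:
    restored = pickle.load(file)

same_predictions = np.array_equal(clf.predict(X), restored.predict(X))
print(same_predictions)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.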