## Absenteeism project - Machine Learning on preprocessed data
### Logistic regression to predict absenteeism

### Logistic regression is a classification method, so we are going to assign each person to one of a small number of classes
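
To illustrate what "classification" means here, the following minimal sketch shows how logistic regression turns a linear score into a probability and then into a class. The scores are hypothetical values made up for illustration, and the 0.5 cut-off is the conventional default rather than anything specified in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (weights * features + bias) for four people
scores = np.array([-2.0, -0.5, 0.0, 3.0])
probabilities = sigmoid(scores)

# Assign class 1 when the predicted probability exceeds 0.5, else class 0
predicted_classes = (probabilities > 0.5).astype(int)

print(probabilities)
print(predicted_classes)
```

A score of exactly 0 maps to a probability of 0.5, so with a strict `>` comparison it falls into class 0.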

In [2]:
# Importing libraries
import pandas as pd
import numpy as np

In [3]:
data_preprocessed = pd.read_csv('data/absenteeism_preprocessed.csv')
data_preprocessed

Unnamed: 0,Rfa_group_1,Rfa_group_2,Rfa_group_3,Rfa_group_4,Month,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0,8
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2,3
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0,8
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0,2
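
The `Unnamed: 0` column in the output above typically appears when a DataFrame was saved with its index and then read back without `index_col`. That this is how the file was produced is an assumption. The sketch below uses a small hypothetical frame (not the real data) to show the effect and one possible way to avoid it.

```python
import io
import pandas as pd

# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({'Age': [33, 50], 'Pets': [1, 0]})

# to_csv() without a path returns the CSV text; the index is written
# as an unnamed first column by default
csv_text = toy.to_csv()

# Reading it back plainly turns the saved index into an 'Unnamed: 0' column
plain = pd.read_csv(io.StringIO(csv_text))

# Passing index_col=0 restores it as the index instead
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(plain.columns))
print(list(with_index.columns))
```

The rest of this notebook keeps the column as loaded, so the outputs below still show it.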


### Creating targets

The approach used here is to create two classes. We use the median of absenteeism time in hours to divide the observations into two groups.

In [4]:
data_preprocessed['Absenteeism Time in Hours'].median()

3.0

The median in this case is 3.0, so values at or below the median (3 hours or fewer) form the first group (class 0) and values above the median form the second group (class 1).

In [5]:
# Targets: 1 if absent for more than 3 hours (above the median), 0 otherwise
data_preprocessed['Targets'] = np.where(data_preprocessed['Absenteeism Time in Hours'] > 3, 1, 0)
data_preprocessed

Unnamed: 0,Rfa_group_1,Rfa_group_2,Rfa_group_3,Rfa_group_4,Month,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Targets
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0,8,1
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2,3,0
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0,8,1
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0,2,0
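
To make the treatment of values equal to the median explicit, here is a minimal sketch on a hypothetical series of hours (not the real column). It computes the threshold with `median()` instead of hard-coding it, which is an illustrative variation on the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical absenteeism hours, including two values equal to the median
hours = pd.Series([0, 2, 3, 3, 4, 8])

# Median of this toy series; the strict '>' sends ties to class 0
threshold = hours.median()
targets = np.where(hours > threshold, 1, 0)

print(threshold)
print(targets)
```

Because the comparison is strict, every observation equal to the median ends up in class 0, which is why the two classes are not guaranteed to be exactly equal in size.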


The method used here (splitting into two groups at the median) is a "naive" one. It gives a roughly balanced split of the data (almost 50/50), which keeps a classifier from simply favoring a majority class. Since the aim of this part is to demonstrate the ML approach, we won't look for the most appropriate split.
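
One way to check how balanced a split is would be to look at the share of each class. The sketch below does this on a hypothetical target vector; on the real data the same call would be `data_preprocessed['Targets'].value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical targets: four observations in class 1, six in class 0
targets = pd.Series([1, 0, 0, 1, 0, 1, 0, 1, 0, 0])

# normalize=True returns proportions instead of raw counts
balance = targets.value_counts(normalize=True)

print(balance)
```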

In [6]:
# Drop the original hours column: Targets was derived from it, so keeping it would leak the answer
data_targets = data_preprocessed.drop('Absenteeism Time in Hours', axis = 1)
data_targets.head()

Unnamed: 0,Rfa_group_1,Rfa_group_2,Rfa_group_3,Rfa_group_4,Month,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Targets
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,0
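
As a side note on the step above, `DataFrame.drop` returns a new DataFrame and leaves the original untouched, so `data_preprocessed` still holds the hours column as a checkpoint. A minimal sketch with a hypothetical three-column frame:

```python
import pandas as pd

# Toy frame with hypothetical values, mirroring the relevant columns
df = pd.DataFrame({
    'Age': [33, 50],
    'Absenteeism Time in Hours': [4, 0],
    'Targets': [1, 0],
})

# drop() returns a copy without the column; df itself is not modified
without_hours = df.drop('Absenteeism Time in Hours', axis=1)

print(list(df.columns))
print(list(without_hours.columns))
```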
