# Prediction of Absenteeism in Employees 


## Introduction to the Problem

In todays highly competitive business environment, there is increased pressure on employees of companies to constantly perform better, ultimately leading to increased stress levels. **High amounts of stress** may have a **negative** impact on the employees **health**, resulting in various short term to long term illnesses, such has Fatigue and Depression. This gives rise to **Absenteeism** in Employees.

Absenteeism brings down the productivity of any organization, and consistent efforts must be made to minimize it.

From the point of view of a **Productivity Incharge**, we would like to know whether an employee is likely to be away from work, for a specific time frame, on any given work day. Predicting the likelihood of Absenteeism, will benefit the organization, as it will allow reorganizing and restructuring the work process, in a way to avoid lack of productivity and increase the quality of work.

### Absenteeism
In this study, Absenteeism can be defined as **"Absence from work during normal working hours, resulting in temporary incapacity to execute regular working activity"**.

### Purpose of the Study
To explore whether a person presenting certain characterictics is expected to be away from work at some point in time or not.

### Data
The data was taken from the **UCI Machine Learning Repository**. The database was created with records of absenteeism at work from July 2007 to July 2010 at a courier company in Brazil.




	






## Importing Libraries and Loading the Data

In [1]:
import pandas as pd
pd.options.display.max_columns= None
pd.options.display.max_rows= None
import numpy as np

In [2]:
raw_csv_data=pd.read_csv('Desktop/Absenteeism_at_work_AAA/Absenteeism_at_work.csv',delimiter=';')
df=raw_csv_data.copy()
df.head()

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,11,26,7,3,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,4
1,36,0,7,3,1,118,13,18,50,239.554,97,1,1,1,1,0,0,98,178,31,0
2,3,23,7,4,1,179,51,18,38,239.554,97,0,1,0,1,0,0,89,170,31,2
3,7,7,7,5,1,279,5,14,39,239.554,97,0,1,2,1,1,0,68,168,24,4
4,11,23,7,5,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,2


## Attribute Information

The description of the various attributes as seen above, is:

1. Individual identification (ID)
2. Reason for absence (ICD).
Absences attested by the International Code of Diseases (ICD) stratified into 21 categories (I to XXI) as follows:

I Certain infectious and parasitic diseases  
II Neoplasms  
III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism  
IV Endocrine, nutritional and metabolic diseases  
V Mental and behavioural disorders  
VI Diseases of the nervous system  
VII Diseases of the eye and adnexa  
VIII Diseases of the ear and mastoid process  
IX Diseases of the circulatory system  
X Diseases of the respiratory system  
XI Diseases of the digestive system  
XII Diseases of the skin and subcutaneous tissue  
XIII Diseases of the musculoskeletal system and connective tissue  
XIV Diseases of the genitourinary system  
XV Pregnancy, childbirth and the puerperium  
XVI Certain conditions originating in the perinatal period  
XVII Congenital malformations, deformations and chromosomal abnormalities  
XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified  
XIX Injury, poisoning and certain other consequences of external causes  
XX External causes of morbidity and mortality  
XXI Factors influencing health status and contact with health services.

And 7 categories without (ICD) are patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).

3. Month of absence
4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
5. Seasons
6. Transportation expense
7. Distance from Residence to Work (kilometers)
8. Service time
9. Age
10. Work load Average/day 
11. Hit target
12. Disciplinary failure (yes=1; no=0)
13. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
14. Son (number of children)
15. Social drinker (yes=1; no=0)
16. Social smoker (yes=1; no=0)
17. Pet (number of pet)
18. Weight
19. Height
20. Body mass index
21. Absenteeism time in hours (target)



## Data Preprocessing

Let's clean the data for it to be suitable for the Machine Learning Algorithm.

In [3]:
df=df.drop(['ID'],axis=1)  #dropping id column

Let's check the data types and missing values, if any

In [4]:
df.info()  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 740 entries, 0 to 739
Data columns (total 20 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Reason for absence               740 non-null    int64  
 1   Month of absence                 740 non-null    int64  
 2   Day of the week                  740 non-null    int64  
 3   Seasons                          740 non-null    int64  
 4   Transportation expense           740 non-null    int64  
 5   Distance from Residence to Work  740 non-null    int64  
 6   Service time                     740 non-null    int64  
 7   Age                              740 non-null    int64  
 8   Work load Average/day            740 non-null    float64
 9   Hit target                       740 non-null    int64  
 10  Disciplinary failure             740 non-null    int64  
 11  Education                        740 non-null    int64  
 12  Son                   

There are no missing values and all the columns are numeric.

### Analyzing 'Reason for absence'  

In [5]:
df['Reason for absence']

0      26
1       0
2      23
3       7
4      23
5      23
6      22
7      23
8      19
9      22
10      1
11      1
12     11
13     11
14     23
15     14
16     23
17     21
18     11
19     23
20     10
21     11
22     13
23     28
24     18
25     25
26     23
27     28
28     18
29     23
30     18
31     18
32     23
33     18
34     23
35     23
36     24
37     11
38     28
39     23
40     23
41     23
42     23
43     19
44     23
45     23
46     23
47     23
48     22
49     14
50      0
51      0
52     23
53     23
54      0
55      0
56     18
57     23
58      0
59     23
60     23
61     23
62     23
63     23
64      0
65     23
66     23
67     23
68     23
69     23
70     23
71     23
72     23
73     23
74     19
75     14
76     28
77     26
78     23
79     28
80     23
81     23
82     13
83     21
84     23
85     10
86     22
87     14
88     23
89      6
90     23
91     21
92     13
93     28
94     28
95     28
96      7
97     23
98     23
99     19


In [6]:
df['Reason for absence'].describe()

count    740.000000
mean      19.216216
std        8.433406
min        0.000000
25%       13.000000
50%       23.000000
75%       26.000000
max       28.000000
Name: Reason for absence, dtype: float64

In [7]:
sorted(df['Reason for absence'].unique())

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28]

Reason 20 i.e.**"XX External causes of morbidity and mortality"** has never been the reason for Absence in employees in the period of 3 years.

Let's **one-hot encode** the "Reasons for absence" column

In [8]:
reason_columns=pd.get_dummies(df['Reason for absence'])
reason_columns.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0


Let's check if there is exactly one reason per employee

In [9]:
reason_columns['check']=reason_columns.sum(axis=1) 
reason_columns

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,check
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1


In [10]:
reason_columns['check'].sum(axis=0)

740

In [11]:
reason_columns['check'].unique()

array([1])

In [12]:
reason_columns=reason_columns.drop(['check'],axis=1)

Let's add the param **"drop_first=True"**. It removes the extra column created during dummy variable creation, thereby reducing the correlations created among dummy variables.

In [13]:
reason_columns=pd.get_dummies(df['Reason for absence'],drop_first=True)
reason_columns

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


Now let's group the dummy variables into similar reasons, add them to the dataframe, and remove the "Reason for absence" column.


In [14]:
# Reasons 1-14 can be grouped into one category, as all of them relate to serious illnesses and diseases
reason_type_1=reason_columns.loc[:,1:14].max(axis=1)

# Reasons 15-17 relate to Pregnancy and Childbirth
reason_type_2=reason_columns.loc[:,15:17].max(axis=1)

# Reasons 18-21 relate to Poisoning, Injury and unexplained reasons
reason_type_3=reason_columns.loc[:,18:21].max(axis=1)

# Reasons 22-28 are common reasons such as a Medical checkup, Dental consultation etc
reason_type_4=reason_columns.loc[:,21:].max(axis=1)

In [15]:
df=df.drop(['Reason for absence'],axis=1)
df=pd.concat([df,reason_type_1,reason_type_2,reason_type_3,reason_type_4],axis=1)
df.head()

Unnamed: 0,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours,0,1,2,3
0,7,3,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,4,0,0,0,1
1,7,3,1,118,13,18,50,239.554,97,1,1,1,1,0,0,98,178,31,0,0,0,0,0
2,7,4,1,179,51,18,38,239.554,97,0,1,0,1,0,0,89,170,31,2,0,0,0,1
3,7,5,1,279,5,14,39,239.554,97,0,1,2,1,1,0,68,168,24,4,1,0,0,0
4,7,5,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,2,0,0,0,1


In [16]:
df.columns.values

array(['Month of absence', 'Day of the week', 'Seasons',
       'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index',
       'Absenteeism time in hours', 0, 1, 2, 3], dtype=object)

In [17]:
column_names=['Month of absence', 'Day of the week', 'Seasons',
       'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index',
       'Absenteeism time in hours','Major Illness','Pregnancy/Childbirth','Injury/Poisoning/Others','General Reasons']

In [18]:
df.columns=column_names
df.head()

Unnamed: 0,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours,Major Illness,Pregnancy/Childbirth,Injury/Poisoning/Others,General Reasons
0,7,3,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,4,0,0,0,1
1,7,3,1,118,13,18,50,239.554,97,1,1,1,1,0,0,98,178,31,0,0,0,0,0
2,7,4,1,179,51,18,38,239.554,97,0,1,0,1,0,0,89,170,31,2,0,0,0,1
3,7,5,1,279,5,14,39,239.554,97,0,1,2,1,1,0,68,168,24,4,1,0,0,0
4,7,5,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,2,0,0,0,1


In [19]:
df_reason_mod=df.copy()

### Analyzing 'Education' 

In [20]:
df_reason_mod['Education'].unique() 

array([1, 3, 2, 4])

In the Education column, **High school** is **1**, **Graduate** is **2**, **Postgraduate** is **3**, & **Master and Doctor** is **4**.

Let's see the education of employees:

In [21]:
df_reason_mod['Education'].value_counts()

1    611
3     79
2     46
4      4
Name: Education, dtype: int64

As observed, a majority of employees are less educated i.e. 'High School' being their highest education.
So, we can group employees into 2 major groups, 'High school' and 'Graduate & above'.

In [22]:
df_reason_mod['Education']=df_reason_mod['Education'].map({1:0,3:1,2:1,4:1})

In [23]:
df_reason_mod['Education'].unique()

array([0, 1])

In [24]:
df_reason_mod['Education'].value_counts()

0    611
1    129
Name: Education, dtype: int64

In [25]:
df_preprocessed=df_reason_mod.copy()

In [26]:
df_preprocessed.head()

Unnamed: 0,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours,Major Illness,Pregnancy/Childbirth,Injury/Poisoning/Others,General Reasons
0,7,3,1,289,36,13,33,239.554,97,0,0,2,1,0,1,90,172,30,4,0,0,0,1
1,7,3,1,118,13,18,50,239.554,97,1,0,1,1,0,0,98,178,31,0,0,0,0,0
2,7,4,1,179,51,18,38,239.554,97,0,0,0,1,0,0,89,170,31,2,0,0,0,1
3,7,5,1,279,5,14,39,239.554,97,0,0,2,1,1,0,68,168,24,4,1,0,0,0
4,7,5,1,289,36,13,33,239.554,97,0,0,2,1,0,1,90,172,30,2,0,0,0,1


In [41]:
df_preprocessed.to_csv('Downloads/Absenteeism_prepped.csv',index= False)

## Data Modeling

Now that the data is preprocessed, let's create a Machine learning model for **classifying** our employees into **two** major groups, those who are **'Moderately Absent'** and those who are **'Excessively Absent'**.

The following classification algorithms will be used, and the one with the highest accuracy will be chosen:
* Logistic Regression
* Decision Trees
* K Nearest Neighbours (KNNs)
* SVM



Let's decide how to split the employees into 'Moderately Absent' and 'Excessively Absent' employees

In [27]:
data= df_preprocessed.copy()
data['Absenteeism time in hours'].median()

3.0

**3 hours** is the Median Absence Time of Employees.
So let's use this value as the cutoff point. 
Any employee absent less than 3 hours, is 'Moderately Absent' and more than 3 hours is 'Excessively Absent'.

### Creating Targets

Let's map all hours below 3 hours as 0, and above 3 as 1, in the targets column.

In [28]:
targets=np.where(data['Absenteeism time in hours']>data['Absenteeism time in hours'].median(),1,0)
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [29]:
data['Excessive Absenteeism']= targets #adding the mapped targets to the dataframe
data=data.drop(['Absenteeism time in hours','Work load Average/day ','Day of the week','Distance from Residence to Work','Hit target','Seasons','Disciplinary failure'],axis=1)
data.head()

Unnamed: 0,Month of absence,Transportation expense,Service time,Age,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Major Illness,Pregnancy/Childbirth,Injury/Poisoning/Others,General Reasons,Excessive Absenteeism
0,7,289,13,33,0,2,1,0,1,90,172,30,0,0,0,1,1
1,7,118,18,50,0,1,1,0,0,98,178,31,0,0,0,0,0
2,7,179,18,38,0,0,1,0,0,89,170,31,0,0,0,1,0
3,7,279,14,39,0,2,1,1,0,68,168,24,1,0,0,0,1
4,7,289,13,33,0,2,1,0,1,90,172,30,0,0,0,1,0


Let's check what percentage of employees are excessively absent

In [30]:
targets.sum()/targets.shape[0]

0.4581081081081081

Around **46%** of the employees are **excessively absent** from work.

### Creating Inputs

In [31]:
unscaled_inputs=data.iloc[:,:-1]
unscaled_inputs.head()

Unnamed: 0,Month of absence,Transportation expense,Service time,Age,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Major Illness,Pregnancy/Childbirth,Injury/Poisoning/Others,General Reasons
0,7,289,13,33,0,2,1,0,1,90,172,30,0,0,0,1
1,7,118,18,50,0,1,1,0,0,98,178,31,0,0,0,0
2,7,179,18,38,0,0,1,0,0,89,170,31,0,0,0,1
3,7,279,14,39,0,2,1,1,0,68,168,24,1,0,0,0
4,7,289,13,33,0,2,1,0,1,90,172,30,0,0,0,1


### Standardizing Inputs

We need to scale only selected columns from the dataframe, as all columns relating to "reasons" are categorical.
Let's create a custom scaler, based on sklearns StandardScaler object.

In [32]:
from sklearn.preprocessing import StandardScaler

In [33]:
from sklearn.base import BaseEstimator, TransformerMixin
# create the Custom Scaler class

class CustomScaler(BaseEstimator,TransformerMixin): 
    
    
    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        
        self.scaler = StandardScaler(copy,with_mean,with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None
        
    
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self
    
    
    def transform(self, X, y=None, copy=None):
        
        # record the initial order of the columns
        init_col_order = X.columns
        
        # scale all selected columns
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        
        # columns not scaled
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        
        
        # return all columns in the original order 
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

In [34]:
unscaled_inputs.columns

Index(['Month of absence', 'Transportation expense', 'Service time', 'Age',
       'Education', 'Son', 'Social drinker', 'Social smoker', 'Pet', 'Weight',
       'Height', 'Body mass index', 'Major Illness', 'Pregnancy/Childbirth',
       'Injury/Poisoning/Others', 'General Reasons'],
      dtype='object')

In [35]:
columns_to_omit=['Major Illness', 'Pregnancy/Childbirth', 'Injury/Poisoning/Others','General Reasons','Education']
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]

In [36]:
absenteeism_scaler = CustomScaler(columns_to_scale)



In [37]:
absenteeism_scaler.fit(unscaled_inputs)



CustomScaler(columns=['Month of absence', 'Transportation expense',
                      'Service time', 'Age', 'Son', 'Social drinker',
                      'Social smoker', 'Pet', 'Weight', 'Height',
                      'Body mass index'],
             copy=None, with_mean=None, with_std=None)

In [38]:
scaled_inputs=absenteeism_scaler.transform(unscaled_inputs)

In [39]:
scaled_inputs.head()

Unnamed: 0,Month of absence,Transportation expense,Service time,Age,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Major Illness,Pregnancy/Childbirth,Injury/Poisoning/Others,General Reasons
0,0.196763,1.011408,0.10177,-0.532868,0,0.893723,0.872872,-0.280566,0.19285,0.851673,-0.019046,0.775932,0,0,0,1
1,0.196763,-1.544379,1.242825,2.09286,0,-0.017234,0.872872,-0.280566,-0.56624,1.473056,0.975828,1.009438,0,0,0,0
2,0.196763,-0.632665,1.242825,0.239405,0,-0.928191,0.872872,-0.280566,-0.56624,0.774,-0.350671,1.009438,0,0,0,1
3,0.196763,0.861947,0.329981,0.393859,0,0.893723,0.872872,3.564226,-0.56624,-0.857131,-0.682295,-0.6251,1,0,0,0
4,0.196763,1.011408,0.10177,-0.532868,0,0.893723,0.872872,-0.280566,0.19285,0.851673,-0.019046,0.775932,0,0,0,1


### Train Test Split

Let' split the data into a 80/20 ratio.

In [40]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(scaled_inputs,targets,train_size=0.8,shuffle=True,random_state=20)

## Outlining the Model

### 1) Logistic Regression

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [42]:
reg=LogisticRegression()
reg.fit(x_train,y_train)

LogisticRegression()

In [43]:
reg.score(x_train,y_train) #train accuracy

0.768581081081081

Let's create a summary table for better understanding

In [47]:
feature_names=scaled_inputs.columns.values

summary_table=pd.DataFrame(columns=['Features'],data=feature_names)
summary_table['Coefficient']=np.transpose(reg.coef_)
summary_table

Unnamed: 0,Features,Coefficient
0,Month of absence,0.086401
1,Transportation expense,0.44622
2,Service time,-0.393747
3,Age,0.076426
4,Education,0.023121
5,Son,0.373397
6,Social drinker,0.312719
7,Social smoker,0.152802
8,Pet,-0.324468
9,Weight,0.517346


Adding the intercept to our summary table

In [48]:
summary_table.index=summary_table.index+1
summary_table.loc[0]=['Intercept',reg.intercept_[0]]
summary_table.sort_index()
summary_table.sort_index(ascending=True)

Unnamed: 0,Features,Coefficient
0,Intercept,-1.806932
1,Month of absence,0.086401
2,Transportation expense,0.44622
3,Service time,-0.393747
4,Age,0.076426
5,Education,0.023121
6,Son,0.373397
7,Social drinker,0.312719
8,Social smoker,0.152802
9,Pet,-0.324468


### Odds Ratio

For a unit change in the standardized feature, the odds increase by a multiple equal to the odds ratio.
An odd ratio of 1 means **No Change**.

In [49]:
summary_table['Odds_Ratio']=np.exp(summary_table.Coefficient)
summary_table.sort_values('Odds_Ratio',ascending=False)
summary_table

Unnamed: 0,Features,Coefficient,Odds_Ratio
1,Month of absence,0.086401,1.090243
2,Transportation expense,0.44622,1.562395
3,Service time,-0.393747,0.674525
4,Age,0.076426,1.079422
5,Education,0.023121,1.023391
6,Son,0.373397,1.452661
7,Social drinker,0.312719,1.367138
8,Social smoker,0.152802,1.165095
9,Pet,-0.324468,0.722912
10,Weight,0.517346,1.67757


From the above table, we observe that the features **'Major Illness'** and **'Injury/Poisoning/Others'** are the major cause of Absenteeism.

### Testing the Model

In [52]:
reg.score(x_test,y_test)

0.7837837837837838

#### The accuracy of Logistic Regression is:
* Train Accuracy: 0.768581081081081 
* Test Accuracy: 0.7837837837837838

In [53]:
predicted_probs=reg.predict_proba(x_test)
predicted_probs[:,1]

array([0.35755889, 0.90400826, 0.13728916, 0.15288027, 0.46458386,
       0.35180016, 0.50933484, 0.44586697, 0.22841809, 0.14804082,
       0.77347826, 0.14804082, 0.21967069, 0.249739  , 0.58469612,
       0.67413945, 0.24199311, 0.77539644, 0.15600407, 0.63259532,
       0.21967069, 0.249739  , 0.24199311, 0.18598553, 0.66445909,
       0.50226541, 0.8438185 , 0.39423964, 0.47084771, 0.59239233,
       0.18982486, 0.61979692, 0.92025731, 0.90828712, 0.8279616 ,
       0.13433621, 0.40499625, 0.27866688, 0.18598553, 0.92865167,
       0.95227658, 0.20578766, 0.8466119 , 0.71291145, 0.56526311,
       0.39293034, 0.38694506, 0.69991796, 0.249739  , 0.36335903,
       0.22710991, 0.68964212, 0.36542672, 0.63695456, 0.75321271,
       0.23187176, 0.13143711, 0.71138301, 0.70323908, 0.93570289,
       0.21967069, 0.23740784, 0.58092422, 0.18220639, 0.22272379,
       0.23740784, 0.65882644, 0.61644986, 0.91548766, 0.18598553,
       0.70773439, 0.32315252, 0.63413294, 0.42739243, 0.21967

### 2) Decision Tree

In [58]:
from sklearn.tree import DecisionTreeClassifier

dTree = DecisionTreeClassifier(criterion="entropy", max_depth = 6)
dTree 

DecisionTreeClassifier(criterion='entropy', max_depth=6)

In [59]:
dTree.fit(x_train,y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=6)

In [60]:
pred = dTree.predict(x_test)

In [61]:
from sklearn import metrics
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_test, pred))

DecisionTrees's Accuracy:  0.8175675675675675


#### The Decision Trees Accuracy is 0.8175675675675675.

### 3) K Nearest Neighbor (KNN)

In [77]:
from sklearn.neighbors import KNeighborsClassifier

k = 12
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(x_train,y_train)
neigh

KNeighborsClassifier(n_neighbors=12)

In [78]:
yhat = neigh.predict(x_test)

In [79]:
print("Train Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(x_train)))
print("Test Accuracy: ", metrics.accuracy_score(y_test, yhat))

Train Accuracy:  0.7297297297297297
Test Accuracy:  0.7297297297297297


#### The accuracy of KNN is:
* Train Accuracy: 0.7297297297297297
* Test Accuracy:  0.7297297297297297


### 4) Support Vector Machine (SVM)

In [80]:
from sklearn import svm
clf = svm.SVC(kernel='linear')
clf.fit(x_train, y_train) 

SVC(kernel='linear')

In [81]:
yhat1 = clf.predict(x_test)

In [82]:
from sklearn.metrics import f1_score
f1_score(y_test, yhat1, average='weighted') 

0.7914057056629339

In [83]:
from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat1)

0.5866666666666667

#### The accuracy of SVM is:
* F1 Score: 0.7914057056629339
* Jaccard Score: 0.5866666666666667


## Summarizing Results

**Logistic Regression:**
   * Train Accuracy: 0.768581081081081
   * Test Accuracy: 0.7837837837837838

**Decision Tree**: 0.8175675675675675.

**KNN**:
   * Train Accuracy: 0.7297297297297297
   * Test Accuracy: 0.7297297297297297

**SVM**:
   * F1 Score: 0.7914057056629339
   * Jaccard Score: 0.5866666666666667
    
#### Therefore, Decision Tree has the highest accuracy: 81.75%.

## Saving the Model

In [85]:
import pickle

with open('modelD','wb') as file:
    pickle.dump(dTree,file)
    
    

In [86]:
with open('scalerD','wb') as file:
    pickle.dump(absenteeism_scaler,file)

## Conclusion

The main reasons for the Absence of Employees are Major health issues and Injury/Poisoning. Other factors are responsible too, but not as much as the above. The odds of someone being excessively absent are 18 times higher when they have a Major Illness/Poisoning/Injury.
A good way an organization can deal with this, is to provide a flexible work schedule to such employees, and suitable health plans.