
**This analysis predicts absenteeism at a workplace during working hours.**

Attribute Information:

1. Individual identification (ID)
2. Reason for absence (ICD).
Absences attested by the International Classification of Diseases (ICD), stratified into 21 categories (I to XXI) as follows:

I Certain infectious and parasitic diseases

II Neoplasms

III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism

IV Endocrine, nutritional and metabolic diseases

V Mental and behavioural disorders

VI Diseases of the nervous system

VII Diseases of the eye and adnexa

VIII Diseases of the ear and mastoid process

IX Diseases of the circulatory system

X Diseases of the respiratory system

XI Diseases of the digestive system

XII Diseases of the skin and subcutaneous tissue

XIII Diseases of the musculoskeletal system and connective tissue

XIV Diseases of the genitourinary system

XV Pregnancy, childbirth and the puerperium

XVI Certain conditions originating in the perinatal period

XVII Congenital malformations, deformations and chromosomal abnormalities

XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified

XIX Injury, poisoning and certain other consequences of external causes

XX External causes of morbidity and mortality

XXI Factors influencing health status and contact with health services.

And 7 categories without an ICD code: patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).
3. Month of absence
4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
5. Seasons (summer (1), autumn (2), winter (3), spring (4))
6. Transportation expense
7. Distance from Residence to Work (kilometers)
8. Service time
9. Age
10. Work load Average/day
11. Hit target
12. Disciplinary failure (yes=1; no=0)
13. Education (high school (1), graduate (2), postgraduate (3), master's or doctorate (4))
14. Son (number of children)
15. Social drinker (yes=1; no=0)
16. Social smoker (yes=1; no=0)
17. Pet (number of pets)
18. Weight
19. Height
20. Body mass index
21. Absenteeism time in hours (target)

<h1>Importing Libraries</h1>



In [202]:
import pandas as pd
import numpy as np

<h1>Reading Data</h1>

In [203]:
data = pd.read_csv('Absenteeism_at_work.csv', delimiter=';')

In [204]:
data.head()

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,11,26,7,3,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,4
1,36,0,7,3,1,118,13,18,50,239.554,97,1,1,1,1,0,0,98,178,31,0
2,3,23,7,4,1,179,51,18,38,239.554,97,0,1,0,1,0,0,89,170,31,2
3,7,7,7,5,1,279,5,14,39,239.554,97,0,1,2,1,1,0,68,168,24,4
4,11,23,7,5,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,2


In [205]:
data.describe()

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
count,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0
mean,18.017568,19.216216,6.324324,3.914865,2.544595,221.32973,29.631081,12.554054,36.45,271.490235,94.587838,0.054054,1.291892,1.018919,0.567568,0.072973,0.745946,79.035135,172.114865,26.677027,6.924324
std,11.021247,8.433406,3.436287,1.421675,1.111831,66.952223,14.836788,4.384873,6.478772,39.058116,3.779313,0.226277,0.673238,1.098489,0.495749,0.260268,1.318258,12.883211,6.034995,4.285452,13.330998
min,1.0,0.0,0.0,2.0,1.0,118.0,5.0,1.0,27.0,205.917,81.0,0.0,1.0,0.0,0.0,0.0,0.0,56.0,163.0,19.0,0.0
25%,9.0,13.0,3.0,3.0,2.0,179.0,16.0,9.0,31.0,244.387,93.0,0.0,1.0,0.0,0.0,0.0,0.0,69.0,169.0,24.0,2.0
50%,18.0,23.0,6.0,4.0,3.0,225.0,26.0,13.0,37.0,264.249,95.0,0.0,1.0,1.0,1.0,0.0,0.0,83.0,170.0,25.0,3.0
75%,28.0,26.0,9.0,5.0,4.0,260.0,50.0,16.0,40.0,294.217,97.0,0.0,1.0,2.0,1.0,0.0,1.0,89.0,172.0,31.0,8.0
max,36.0,28.0,12.0,6.0,4.0,388.0,52.0,29.0,58.0,378.884,100.0,1.0,4.0,4.0,1.0,1.0,8.0,108.0,196.0,38.0,120.0


In [206]:
type(data)

pandas.core.frame.DataFrame

In [207]:
#checking for null values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 740 entries, 0 to 739
Data columns (total 21 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ID                               740 non-null    int64  
 1   Reason for absence               740 non-null    int64  
 2   Month of absence                 740 non-null    int64  
 3   Day of the week                  740 non-null    int64  
 4   Seasons                          740 non-null    int64  
 5   Transportation expense           740 non-null    int64  
 6   Distance from Residence to Work  740 non-null    int64  
 7   Service time                     740 non-null    int64  
 8   Age                              740 non-null    int64  
 9   Work load Average/day            740 non-null    float64
 10  Hit target                       740 non-null    int64  
 11  Disciplinary failure             740 non-null    int64  
 12  Education             

In [208]:
data.isnull().sum()

ID                                 0
Reason for absence                 0
Month of absence                   0
Day of the week                    0
Seasons                            0
Transportation expense             0
Distance from Residence to Work    0
Service time                       0
Age                                0
Work load Average/day              0
Hit target                         0
Disciplinary failure               0
Education                          0
Son                                0
Social drinker                     0
Social smoker                      0
Pet                                0
Weight                             0
Height                             0
Body mass index                    0
Absenteeism time in hours          0
dtype: int64

<h1>Data Preprocessing</h1>

In [209]:
#Drop ID column
data = data.drop(['ID'], axis=1)

In [210]:
data

Unnamed: 0,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,26,7,3,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,4
1,0,7,3,1,118,13,18,50,239.554,97,1,1,1,1,0,0,98,178,31,0
2,23,7,4,1,179,51,18,38,239.554,97,0,1,0,1,0,0,89,170,31,2
3,7,7,5,1,279,5,14,39,239.554,97,0,1,2,1,1,0,68,168,24,4
4,23,7,5,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
735,14,7,3,1,289,36,13,33,264.604,93,0,1,2,1,0,1,90,172,30,8
736,11,7,3,1,235,11,14,37,264.604,93,0,3,1,0,0,1,88,172,29,4
737,0,0,3,1,118,14,13,40,271.219,95,0,1,1,1,0,8,98,170,34,0
738,0,0,4,2,231,35,14,39,271.219,95,0,1,2,1,0,2,100,170,35,0


In [211]:
#grouping reasons for absence
sorted(data['Reason for absence'].unique())

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28]

In [212]:
data['Reason for absence'].value_counts()

23    149
28    112
27     69
13     55
0      43
19     40
22     38
26     33
25     31
11     26
10     25
18     21
14     19
1      16
7      15
6       8
12      8
8       6
21      6
9       4
5       3
24      3
16      3
4       2
15      2
3       1
2       1
17      1
Name: Reason for absence, dtype: int64

In [213]:
data.columns.values

array(['Reason for absence', 'Month of absence', 'Day of the week',
       'Seasons', 'Transportation expense',
       'Distance from Residence to Work', 'Service time', 'Age',
       'Work load Average/day ', 'Hit target', 'Disciplinary failure',
       'Education', 'Son', 'Social drinker', 'Social smoker', 'Pet',
       'Weight', 'Height', 'Body mass index', 'Absenteeism time in hours'],
      dtype=object)

In [214]:
#one-hot encode the reasons; drop_first=True drops the column for reason 0
reasons_dummies = pd.get_dummies(data['Reason for absence'], drop_first=True)
reasons_dummies

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
735,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
736,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
737,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
738,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [215]:
reasons_dummies.columns.values

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 21, 22, 23, 24, 25, 26, 27, 28])

In [216]:
data = data.drop(['Reason for absence'], axis=1)
data.head()

Unnamed: 0,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,7,3,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,4
1,7,3,1,118,13,18,50,239.554,97,1,1,1,1,0,0,98,178,31,0
2,7,4,1,179,51,18,38,239.554,97,0,1,0,1,0,0,89,170,31,2
3,7,5,1,279,5,14,39,239.554,97,0,1,2,1,1,0,68,168,24,4
4,7,5,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,2


In [217]:
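#collapse the 27 dummies into 4 groups; .loc label slices include both endpoints
reason_type_1 = reasons_dummies.loc[:, 1:14].max(axis=1)   #ICD categories I-XIV (diseases)
reason_type_2 = reasons_dummies.loc[:, 15:17].max(axis=1)  #XV-XVII (pregnancy, perinatal, congenital)
reason_type_3 = reasons_dummies.loc[:, 18:21].max(axis=1)  #XVIII-XXI (symptoms, injuries, external causes)
reason_type_4 = reasons_dummies.loc[:, 22:].max(axis=1)    #22-28 (reasons without an ICD code)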
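
The grouping above can be checked on a toy column. This is a minimal sketch: the `reasons` values and the `toy_groups` name are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the 'Reason for absence' column (illustrative values only).
reasons = pd.Series([0, 1, 14, 15, 17, 18, 21, 22, 28], name="Reason for absence")

# One-hot encode and drop the first level (reason 0), as in the cell above.
dummies = pd.get_dummies(reasons, drop_first=True)

# Label-based slices on the integer column names; both endpoints are included.
# The row-wise max is 1 when any reason inside the slice is set.
toy_groups = pd.DataFrame({
    "Reason_1": dummies.loc[:, 1:14].max(axis=1),
    "Reason_2": dummies.loc[:, 15:17].max(axis=1),
    "Reason_3": dummies.loc[:, 18:21].max(axis=1),
    "Reason_4": dummies.loc[:, 22:].max(axis=1),
}).astype(int)

print(toy_groups)
```

Each row ends up with at most one group flag set, and reason 0 (the dropped level) maps to all zeros.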

In [218]:
#concatenation
data = pd.concat([data, reason_type_1, reason_type_2, reason_type_3, reason_type_4], axis=1)

In [219]:
data.head()

Unnamed: 0,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours,0,1,2,3
0,7,3,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,4,0,0,0,1
1,7,3,1,118,13,18,50,239.554,97,1,1,1,1,0,0,98,178,31,0,0,0,0,0
2,7,4,1,179,51,18,38,239.554,97,0,1,0,1,0,0,89,170,31,2,0,0,0,1
3,7,5,1,279,5,14,39,239.554,97,0,1,2,1,1,0,68,168,24,4,1,0,0,0
4,7,5,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,2,0,0,0,1


In [220]:
data.columns.values

array(['Month of absence', 'Day of the week', 'Seasons',
       'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index',
       'Absenteeism time in hours', 0, 1, 2, 3], dtype=object)

In [221]:
data_columns = ['Month of absence', 'Day of the week', 'Seasons',
       'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index',
       'Absenteeism time in hours', 'Reason_1', 'Reason_2', 'Reason_3', 'Reason_4']

In [222]:
data.columns = data_columns
data.head()

Unnamed: 0,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours,Reason_1,Reason_2,Reason_3,Reason_4
0,7,3,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,4,0,0,0,1
1,7,3,1,118,13,18,50,239.554,97,1,1,1,1,0,0,98,178,31,0,0,0,0,0
2,7,4,1,179,51,18,38,239.554,97,0,1,0,1,0,0,89,170,31,2,0,0,0,1
3,7,5,1,279,5,14,39,239.554,97,0,1,2,1,1,0,68,168,24,4,1,0,0,0
4,7,5,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,2,0,0,0,1


In [223]:
#reordering columns
column_names = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month of absence', 'Day of the week', 'Seasons',
       'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index',
       'Absenteeism time in hours']

In [224]:
data = data[column_names]
data.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,0,0,0,1,7,3,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,4
1,0,0,0,0,7,3,1,118,13,18,50,239.554,97,1,1,1,1,0,0,98,178,31,0
2,0,0,0,1,7,4,1,179,51,18,38,239.554,97,0,1,0,1,0,0,89,170,31,2
3,1,0,0,0,7,5,1,279,5,14,39,239.554,97,0,1,2,1,1,0,68,168,24,4
4,0,0,0,1,7,5,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,2


In [225]:
#month of absence
data['Month of absence'].unique()
data['Month of absence'].value_counts()
#keep only the rows with a valid month (drops the rows recorded as month 0)
df = data.loc[data['Month of absence'] != 0]
df.shape

(737, 23)

In [226]:
df.tail()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
732,0,0,0,1,7,4,1,361,52,3,28,264.604,93,0,1,1,1,0,4,80,172,27,8
733,0,0,0,1,7,4,1,225,26,9,28,264.604,93,0,1,1,0,0,2,69,169,24,8
734,1,0,0,0,7,2,1,369,17,12,31,264.604,93,0,1,3,1,0,0,70,169,25,80
735,1,0,0,0,7,3,1,289,36,13,33,264.604,93,0,1,2,1,0,1,90,172,30,8
736,1,0,0,0,7,3,1,235,11,14,37,264.604,93,0,3,1,0,0,1,88,172,29,4


In [227]:
df['Day of the week'].unique()

array([3, 4, 5, 6, 2])

In [228]:
df["Seasons"].unique()

array([1, 4, 2, 3])

In [229]:
df['Seasons'].value_counts()

4    195
2    191
3    182
1    169
Name: Seasons, dtype: int64

In [230]:
df['Hit target'].unique()

array([ 97,  92,  93,  95,  99,  96,  94,  98,  81,  88, 100,  87,  91])

In [231]:
df['Hit target'].value_counts()

93     105
99     102
97      89
92      79
96      75
95      72
98      66
91      45
94      34
88      28
81      19
87      12
100     11
Name: Hit target, dtype: int64

In [232]:
df['Education'].value_counts()

1    608
3     79
2     46
4      4
Name: Education, dtype: int64

In [233]:
#classifying education to only 2 groups: high school (0) vs. any higher education (1)
#work on an explicit copy so the assignment does not raise SettingWithCopyWarning
df = df.copy()
df['Education'] = df['Education'].map({1:0, 2:1, 3:1, 4:1})


In [234]:
df['Education'].value_counts()

0    608
1    129
Name: Education, dtype: int64

In [235]:
df.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,0,0,0,1,7,3,1,289,36,13,33,239.554,97,0,0,2,1,0,1,90,172,30,4
1,0,0,0,0,7,3,1,118,13,18,50,239.554,97,1,0,1,1,0,0,98,178,31,0
2,0,0,0,1,7,4,1,179,51,18,38,239.554,97,0,0,0,1,0,0,89,170,31,2
3,1,0,0,0,7,5,1,279,5,14,39,239.554,97,0,0,2,1,1,0,68,168,24,4
4,0,0,0,1,7,5,1,289,36,13,33,239.554,97,0,0,2,1,0,1,90,172,30,2


In [236]:
df['Son'].unique()

array([2, 1, 0, 4, 3])

In [237]:
df['Son'].value_counts()

0    298
1    227
2    155
4     42
3     15
Name: Son, dtype: int64

In [238]:
df_preprocessed = df.copy()

In [239]:
df_preprocessed.to_csv('Absenteeism_preprocessed.csv', index=False)

<h1>Machine Learning</h1>

In [240]:
#creating the targets
df_preprocessed['Absenteeism time in hours'].median()

3.0

In [241]:
#dividing into two classes
targets = np.where(df_preprocessed['Absenteeism time in hours'] > 3, 1, 0)
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [242]:
df_preprocessed['Excessive Absenteeism'] = targets
df_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours,Excessive Absenteeism
0,0,0,0,1,7,3,1,289,36,13,33,239.554,97,0,0,2,1,0,1,90,172,30,4,1
1,0,0,0,0,7,3,1,118,13,18,50,239.554,97,1,0,1,1,0,0,98,178,31,0,0
2,0,0,0,1,7,4,1,179,51,18,38,239.554,97,0,0,0,1,0,0,89,170,31,2,0
3,1,0,0,0,7,5,1,279,5,14,39,239.554,97,0,0,2,1,1,0,68,168,24,4,1
4,0,0,0,1,7,5,1,289,36,13,33,239.554,97,0,0,2,1,0,1,90,172,30,2,0


In [243]:
targets.sum()/targets.shape[0]

0.4599728629579376
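
Cutting at the median keeps the two classes roughly balanced (about 46% of the rows are labelled 1 above). The same idea on made-up hours, as a small sketch; `hours`, `cutoff` and `toy_targets` are illustrative names, not notebook variables:

```python
import numpy as np
import pandas as pd

# Illustrative absence durations in hours (not taken from the dataset).
hours = pd.Series([0, 2, 2, 3, 4, 8, 8, 40])

# Use the median as the cut-off between "moderate" and "excessive".
cutoff = hours.median()

# 1 when the absence is strictly longer than the cut-off, else 0.
toy_targets = np.where(hours > cutoff, 1, 0)

# Share of rows in the positive class; values near 0.5 mean balanced classes.
share_positive = toy_targets.mean()
print(cutoff, toy_targets, share_positive)
```

Because the comparison is strict, rows equal to the median fall into class 0, which is why the share of positives in the notebook is somewhat below 50%.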

In [244]:
df_preprocessed = df_preprocessed.drop(['Absenteeism time in hours'], axis=1)
df_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Excessive Absenteeism
0,0,0,0,1,7,3,1,289,36,13,33,239.554,97,0,0,2,1,0,1,90,172,30,1
1,0,0,0,0,7,3,1,118,13,18,50,239.554,97,1,0,1,1,0,0,98,178,31,0
2,0,0,0,1,7,4,1,179,51,18,38,239.554,97,0,0,0,1,0,0,89,170,31,0
3,1,0,0,0,7,5,1,279,5,14,39,239.554,97,0,0,2,1,1,0,68,168,24,1
4,0,0,0,1,7,5,1,289,36,13,33,239.554,97,0,0,2,1,0,1,90,172,30,0


**Input Selection**

In [245]:
unscaled_inputs = df_preprocessed.iloc[:, :-1]

In [246]:
unscaled_inputs.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index
0,0,0,0,1,7,3,1,289,36,13,33,239.554,97,0,0,2,1,0,1,90,172,30
1,0,0,0,0,7,3,1,118,13,18,50,239.554,97,1,0,1,1,0,0,98,178,31
2,0,0,0,1,7,4,1,179,51,18,38,239.554,97,0,0,0,1,0,0,89,170,31
3,1,0,0,0,7,5,1,279,5,14,39,239.554,97,0,0,2,1,1,0,68,168,24
4,0,0,0,1,7,5,1,289,36,13,33,239.554,97,0,0,2,1,0,1,90,172,30


In [247]:
unscaled_inputs.shape

(737, 22)

**Data Scaling**

In [248]:
#Standardize the data
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

# create the Custom Scaler class

class CustomScaler(BaseEstimator,TransformerMixin): 
    
    # __init__ declares what is needed to create a CustomScaler object:
    # the columns to scale and the options passed on to StandardScaler
    
    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        
        # scaler is nothing but a Standard Scaler object
        # (recent scikit-learn releases accept these options as keyword arguments only)
        self.scaler = StandardScaler(copy=copy,with_mean=with_mean,with_std=with_std)
        # with some columns 'twist'
        self.columns = columns
        self.mean_ = None
        self.var_ = None
        
    
    # the fit method, which is again based on StandardScaler
    
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        # per-column mean and population variance of the columns being scaled
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self
    
    # the transform method which does the actual scaling

    def transform(self, X, y=None, copy=None):
        
        # record the initial order of the columns
        init_col_order = X.columns
        
        # scale all features that you chose when creating the instance of the class
        # (reuse X's index so the concat below aligns rows even if the index has gaps)
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)
        
        # declare a variable containing all information that was not scaled
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        
        # return a data frame which contains all scaled features and all 'not scaled' features
        # use the original order (that you recorded in the beginning)
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]
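
For comparison, scikit-learn's `ColumnTransformer` can also standardize a subset of columns. This is a different technique from the custom class above, sketched here on made-up data; `toy_inputs`, `to_scale` and `selective_scaler` are hypothetical names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy inputs (made up): one dummy, one numeric column, one binary column.
toy_inputs = pd.DataFrame({
    "Reason_1": [0, 1, 0, 1],
    "Age": [30.0, 40.0, 50.0, 60.0],
    "Education": [0, 0, 1, 1],
})
to_scale = ["Age"]  # hypothetical choice of columns to standardize

# Standardize only `to_scale`; pass every other column through unchanged.
selective_scaler = ColumnTransformer(
    transformers=[("scale", StandardScaler(), to_scale)],
    remainder="passthrough",
)
scaled_array = selective_scaler.fit_transform(toy_inputs)

# The output puts the scaled columns first, then the passthrough columns,
# so rebuild a DataFrame and restore the original column order.
passthrough = [c for c in toy_inputs.columns if c not in to_scale]
scaled_toy = pd.DataFrame(
    scaled_array, columns=to_scale + passthrough, index=toy_inputs.index
)[list(toy_inputs.columns)]

print(scaled_toy)
```

The dummy and binary columns keep their 0/1 values, which keeps their coefficients easy to read later on.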

In [249]:
columns_to_omit = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4','Education']

In [250]:
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]

In [251]:
absenteeism_scaler = CustomScaler(columns_to_scale)
absenteeism_scaler.fit(unscaled_inputs)



CustomScaler(columns=['Month of absence', 'Day of the week', 'Seasons',
                      'Transportation expense',
                      'Distance from Residence to Work', 'Service time', 'Age',
                      'Work load Average/day ', 'Hit target',
                      'Disciplinary failure', 'Son', 'Social drinker',
                      'Social smoker', 'Pet', 'Weight', 'Height',
                      'Body mass index'],
             copy=None, with_mean=None, with_std=None)

In [252]:
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)

In [253]:
scaled_inputs

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index
0,0,0,0,1,0.190199,-0.642562,-1.39155,1.008522,0.429824,0.102611,-0.529563,-0.816581,0.637849,-0.239560,0,0.893556,0.873589,-0.281181,0.205869,0.856747,-0.019315,0.782414
1,0,0,0,0,0.190199,-0.642562,-1.39155,-1.546940,-1.120707,1.241526,2.103332,-0.816581,0.637849,4.174326,0,-0.016045,0.873589,-0.281181,-0.568242,1.478916,0.973858,1.016535
2,0,0,0,1,0.190199,0.061105,-1.39155,-0.635342,1.441040,1.241526,0.244818,-0.816581,0.637849,-0.239560,0,-0.925645,0.873589,-0.281181,-0.568242,0.778976,-0.350373,1.016535
3,1,0,0,0,0.190199,0.764773,-1.39155,0.859080,-1.660022,0.330394,0.399694,-0.816581,0.637849,-0.239560,0,0.893556,0.873589,3.556424,-0.568242,-0.854215,-0.681431,-0.622310
4,0,0,0,1,0.190199,0.764773,-1.39155,1.008522,0.429824,0.102611,-0.529563,-0.816581,0.637849,-0.239560,0,0.893556,0.873589,-0.281181,0.205869,0.856747,-0.019315,0.782414
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
732,0,0,0,1,0.190199,0.061105,-1.39155,2.084506,1.508454,-2.175221,-1.303944,-0.176097,-0.419137,-0.239560,0,-0.016045,0.873589,-0.281181,2.528202,0.079037,-0.019315,0.080052
733,0,0,0,1,0.190199,0.061105,-1.39155,0.052092,-0.244320,-0.808522,-1.303944,-0.176097,-0.419137,-0.239560,0,-0.016045,-1.144703,-0.281181,0.979980,-0.776444,-0.515902,-0.622310
734,1,0,0,0,0.190199,-1.346230,-1.39155,2.204059,-0.851050,-0.125173,-0.839315,-0.176097,-0.419137,-0.239560,0,1.803157,0.873589,-0.281181,-0.568242,-0.698673,-0.515902,-0.388189
735,1,0,0,0,0.190199,-0.642562,-1.39155,1.008522,0.429824,0.102611,-0.529563,-0.176097,-0.419137,-0.239560,0,0.893556,0.873589,-0.281181,0.205869,0.856747,-0.019315,0.782414


In [254]:
scaled_inputs.shape

(737, 22)

**Data Splitting**

In [255]:
from sklearn.model_selection import train_test_split

In [256]:
xtrain, xtest, ytrain, ytest = train_test_split(scaled_inputs, targets, test_size=0.20, random_state=20)

In [257]:
xtrain.shape

(589, 22)

In [258]:
xtest.shape

(148, 22)

In [259]:
ytest.shape

(148,)

In [260]:
ytrain.shape

(589,)

**Logistic Regression**

In [261]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

**Training**

In [262]:
reg = LogisticRegression()

In [263]:

reg.fit(xtrain, ytrain)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [264]:
reg.score(xtrain, ytrain)

0.7538200339558574

In [265]:
ypredict = reg.predict(xtest)

In [266]:
from sklearn.metrics import accuracy_score

#accuracy on the held-out test set
log_reg = accuracy_score(ytest, ypredict)
log_reg

0.8175675675675675

**Finding Coefficients and intercept**

In [267]:
reg.intercept_

array([0.0773481])

In [268]:
reg.coef_

array([[ 0.68473232, -0.1644601 ,  1.00419148, -1.21070252,  0.22787066,
        -0.18972768, -0.26770267,  0.63532522, -0.02424812, -0.14845702,
        -0.0868686 ,  0.0569517 , -0.01121035, -1.35465521,  0.08163765,
         0.38075204,  0.31000209,  0.08764499, -0.32668696,  0.41803048,
        -0.11897767, -0.21072596]])

In [269]:
feature_name = unscaled_inputs.columns.values

In [270]:
summary_table = pd.DataFrame(columns=['Feature_name'], data=feature_name)
summary_table['Coefficient'] = np.transpose(reg.coef_)
summary_table

Unnamed: 0,Feature_name,Coefficient
0,Reason_1,0.684732
1,Reason_2,-0.16446
2,Reason_3,1.004191
3,Reason_4,-1.210703
4,Month of absence,0.227871
5,Day of the week,-0.189728
6,Seasons,-0.267703
7,Transportation expense,0.635325
8,Distance from Residence to Work,-0.024248
9,Service time,-0.148457


In [271]:
summary_table.index = summary_table.index+1
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]
summary_table = summary_table.sort_index()
summary_table

Unnamed: 0,Feature_name,Coefficient
0,Intercept,0.077348
1,Reason_1,0.684732
2,Reason_2,-0.16446
3,Reason_3,1.004191
4,Reason_4,-1.210703
5,Month of absence,0.227871
6,Day of the week,-0.189728
7,Seasons,-0.267703
8,Transportation expense,0.635325
9,Distance from Residence to Work,-0.024248


In [272]:
summary_table['odds_ratio'] = np.exp(summary_table.Coefficient)
summary_table

Unnamed: 0,Feature_name,Coefficient,odds_ratio
0,Intercept,0.077348,1.080418
1,Reason_1,0.684732,1.983241
2,Reason_2,-0.16446,0.848352
3,Reason_3,1.004191,2.729699
4,Reason_4,-1.210703,0.297988
5,Month of absence,0.227871,1.255923
6,Day of the week,-0.189728,0.827184
7,Seasons,-0.267703,0.765135
8,Transportation expense,0.635325,1.887636
9,Distance from Residence to Work,-0.024248,0.976044


In [273]:
summary_table.sort_values('odds_ratio', ascending=False)

Unnamed: 0,Feature_name,Coefficient,odds_ratio
3,Reason_3,1.004191,2.729699
1,Reason_1,0.684732,1.983241
8,Transportation expense,0.635325,1.887636
20,Weight,0.41803,1.518967
16,Son,0.380752,1.463385
17,Social drinker,0.310002,1.363428
5,Month of absence,0.227871,1.255923
18,Social smoker,0.087645,1.091601
15,Education,0.081638,1.085063
0,Intercept,0.077348,1.080418
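
How to read an odds ratio, as a small worked sketch. The coefficient is the `Reason_3` value from the table above; the 50% baseline probability is a hypothetical number chosen only to make the arithmetic concrete.

```python
import math

# Coefficient for Reason_3 as reported in the summary table above.
coef_reason_3 = 1.004191

# Odds ratio = exp(coefficient): the factor by which the odds are multiplied.
odds_ratio = math.exp(coef_reason_3)

# Hypothetical baseline: a 50% chance of excessive absenteeism (odds of 1).
baseline_prob = 0.5
baseline_odds = baseline_prob / (1 - baseline_prob)

# Apply the odds ratio, then convert the new odds back to a probability.
new_odds = baseline_odds * odds_ratio
new_prob = new_odds / (1 + new_odds)

print(round(odds_ratio, 4), round(new_prob, 3))
```

With the other inputs held fixed, an absence in the `Reason_3` group multiplies the odds of excessive absenteeism by about 2.73 relative to the dropped reason 0. For the standardized inputs, the coefficient refers to a change of one standard deviation instead.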


**Save the model**

In [277]:
import pickle

In [279]:
with open('model', 'wb') as file:
  pickle.dump(reg, file)

In [280]:
with open('scaler', 'wb') as file:
  pickle.dump(absenteeism_scaler, file)
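
Loading a pickled model back works the same way in reverse. A minimal round-trip sketch on a toy model, written to a temporary directory rather than to the `model` file above; `toy_model` and `restored` are illustrative names.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set, used only to have a fitted model to save.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model")

    # Save the fitted model in binary mode.
    with open(path, "wb") as file:
        pickle.dump(toy_model, file)

    # Load it back in binary mode.
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored model should reproduce the original predictions.
same_predictions = np.array_equal(toy_model.predict(X), restored.predict(X))
print(same_predictions)
```

Pickle stores a reference to the class rather than its code, so the `CustomScaler` definition must be importable wherever the `scaler` file is loaded.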