# Project: Absenteeism_at_work

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#Machine learning">Machine learning Models</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> **Description:** This dataset contains records of absence from work collected from July 2007 to July 2010 at a courier company in Brazil.


 **Dataset columns description:**

- **Individual identification (ID)**: the unique identifier for each employee.
- **Reason for absence (ICD)**: 
> **Absences attested by the International Code of Diseases (ICD) stratified into 21 categories (I to XXI) as follows:**
   > * (0) There is no absence.  
   > * (1) Certain infectious and parasitic diseases  
   > * (2) Neoplasms  
   > * (3) Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism  
   > * (4) Endocrine, nutritional and metabolic diseases  
   > * (5) Mental and behavioural disorders  
   > * (6) Diseases of the nervous system  
   > * (7) Diseases of the eye and adnexa  
   > * (8) Diseases of the ear and mastoid process  
   > * (9) Diseases of the circulatory system  
   > * (10) Diseases of the respiratory system  
   > * (11) Diseases of the digestive system  
   > * (12) Diseases of the skin and subcutaneous tissue  
   > * (13) Diseases of the musculoskeletal system and connective tissue  
   > * (14) Diseases of the genitourinary system  
   > * (15) Pregnancy, childbirth and the puerperium   
   > * (16) Certain conditions originating in the perinatal period  
   > * (17) Congenital malformations, deformations and chromosomal abnormalities  
   > * (18) Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified  
   > * (19) Injury, poisoning and certain other consequences of external causes  
   > * (20) External causes of morbidity and mortality  
   > * (21) Factors influencing health status and contact with health services.
   > * (22) Factors influencing health status and contact with health services.
   
    > **And 7 categories without (CID) patient follow-up as follows:**

   > * (23) blood donation.
   > * (24) laboratory examination.
   > * (25) unjustified absence.
   > * (26) physiotherapy.
   > * (27) dental consultation.
   > * (28) Month of absence.

    
    
- **Month of absence**: (Jan (1), Feb (2), Mar (3), etc.)
- **Day of the week**: (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
- **Seasons**: (summer (1), autumn (2), winter (3), spring (4)) 
- **Transportation expense**.
- **Distance from Residence to Work**: (kilometers)
- **Service time**.
- **Age**.
- **Work load Average/day**.
- **Hit target**.
- **Disciplinary failure**: (yes=1; no=0)
- **Education**: (high school (1), graduate (2), postgraduate (3), master and doctor (4))
- **Son**: (number of children)
- **Social drinker**: (yes=1; no=0)
- **Social smoker**: (yes=1; no=0)
- **Pet**: (number of pets)
- **Weight**.
- **Height**.
- **Body mass index**.
- **Absenteeism time in hours**.


 **Source:**

Creators original owner and donors: Andrea Martiniano (1), Ricardo Pinto Ferreira (2), and Renato Jose Sassi (3).

E-mail address: 
andrea.martiniano'@'gmail.com (1) - PhD student;
log.kasparov'@'gmail.com (2) - PhD student;
sassi'@'uni9.pro.br (3) - Prof. Doctor.

Universidade Nove de Julho - Postgraduate Program in Informatics and Knowledge Management.

Address: Rua Vergueiro, 235/249 Liberdade, Sao Paulo, SP, Brazil. Zip code: 01504-001.

Website: http://www.uninove.br/curso/informatica-e-gestao-do-conhecimento/




## **Target:**

**The main goal of this project is to build a machine learning model that predicts employee absence with high accuracy, precision and recall.**

### Importing the libraries used to investigate the dataset

In [2]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import datetime
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split 
np.random.seed(42)

<a id='wrangling'></a>
## Data Wrangling

 **This is a three-step process:**

*  Gathering the data from the dataset and investigating it to understand it in more detail.


*  Assessing data to identify any issues with data types, structure, or quality.


*  Cleaning data by changing data types, replacing values, removing unnecessary data and modifying Dataset for easy and fast analysis.


### Gathering Data

In [3]:
# load the CSV file into a single DataFrame (df)

df = pd.read_csv("Absenteeism_at_work.csv", sep=";")

In [4]:
#checking 5 rows sample from Dataframes

df.head()

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,...,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,11,26,7,3,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,4
1,36,0,7,3,1,118,13,18,50,239.554,...,1,1,1,1,0,0,98,178,31,0
2,3,23,7,4,1,179,51,18,38,239.554,...,0,1,0,1,0,0,89,170,31,2
3,7,7,7,5,1,279,5,14,39,239.554,...,0,1,2,1,1,0,68,168,24,4
4,11,23,7,5,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,2


### Assessing Data

In [5]:
#checking DataFrame basic information (column names, number of values, data types, ...)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 740 entries, 0 to 739
Data columns (total 21 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ID                               740 non-null    int64  
 1   Reason for absence               740 non-null    int64  
 2   Month of absence                 740 non-null    int64  
 3   Day of the week                  740 non-null    int64  
 4   Seasons                          740 non-null    int64  
 5   Transportation expense           740 non-null    int64  
 6   Distance from Residence to Work  740 non-null    int64  
 7   Service time                     740 non-null    int64  
 8   Age                              740 non-null    int64  
 9   Work load Average/day            740 non-null    float64
 10  Hit target                       740 non-null    int64  
 11  Disciplinary failure             740 non-null    int64  
 12  Education             

In [6]:
#checking Dataframe shape (number of rows and columns)

df.shape

(740, 21)

In [7]:
#checking more information and descriptive statistics

df.describe()

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,...,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
count,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,...,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0
mean,18.017568,19.216216,6.324324,3.914865,2.544595,221.32973,29.631081,12.554054,36.45,271.490235,...,0.054054,1.291892,1.018919,0.567568,0.072973,0.745946,79.035135,172.114865,26.677027,6.924324
std,11.021247,8.433406,3.436287,1.421675,1.111831,66.952223,14.836788,4.384873,6.478772,39.058116,...,0.226277,0.673238,1.098489,0.495749,0.260268,1.318258,12.883211,6.034995,4.285452,13.330998
min,1.0,0.0,0.0,2.0,1.0,118.0,5.0,1.0,27.0,205.917,...,0.0,1.0,0.0,0.0,0.0,0.0,56.0,163.0,19.0,0.0
25%,9.0,13.0,3.0,3.0,2.0,179.0,16.0,9.0,31.0,244.387,...,0.0,1.0,0.0,0.0,0.0,0.0,69.0,169.0,24.0,2.0
50%,18.0,23.0,6.0,4.0,3.0,225.0,26.0,13.0,37.0,264.249,...,0.0,1.0,1.0,1.0,0.0,0.0,83.0,170.0,25.0,3.0
75%,28.0,26.0,9.0,5.0,4.0,260.0,50.0,16.0,40.0,294.217,...,0.0,1.0,2.0,1.0,0.0,1.0,89.0,172.0,31.0,8.0
max,36.0,28.0,12.0,6.0,4.0,388.0,52.0,29.0,58.0,378.884,...,1.0,4.0,4.0,1.0,1.0,8.0,108.0,196.0,38.0,120.0


In [8]:
# checking for NaN values

df.isna().sum()

ID                                 0
Reason for absence                 0
Month of absence                   0
Day of the week                    0
Seasons                            0
Transportation expense             0
Distance from Residence to Work    0
Service time                       0
Age                                0
Work load Average/day              0
Hit target                         0
Disciplinary failure               0
Education                          0
Son                                0
Social drinker                     0
Social smoker                      0
Pet                                0
Weight                             0
Height                             0
Body mass index                    0
Absenteeism time in hours          0
dtype: int64

In [9]:
#checking for duplicated rows 

df.duplicated().sum()

34
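
Before deciding how to treat these rows, it can help to look at every copy of a repeated record rather than only the count. The following is a minimal sketch, assuming a small hand-made frame (`toy`, hypothetical) that reuses three of the dataset's column names; the values are invented for illustration and are not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy records; values are illustrative only.
toy = pd.DataFrame({
    "ID": [3, 3, 11, 3],
    "Reason for absence": [23, 23, 26, 23],
    "Absenteeism time in hours": [2, 2, 4, 2],
})

# duplicated() marks repeats after the first occurrence, like the count above.
n_repeats = toy.duplicated().sum()

# keep=False marks every copy, which makes the repeated groups visible.
all_copies = toy[toy.duplicated(keep=False)]

print(n_repeats)
print(all_copies)
```

On the real data the same call would be `df[df.duplicated(keep=False)]`, sorted by `ID` if needed.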

In [10]:
# check month of absence value_counts 
df["Month of absence"].value_counts().sort_index()

0      3
1     50
2     72
3     87
4     53
5     64
6     54
7     67
8     54
9     53
10    71
11    63
12    49
Name: Month of absence, dtype: int64

In [11]:
# check day of week value_counts 
df["Day of the week"].value_counts().sort_index()

2    161
3    154
4    156
5    125
6    144
Name: Day of the week, dtype: int64

In [12]:
# check Seasons value_counts 
df["Seasons"].value_counts().sort_index()

1    170
2    192
3    183
4    195
Name: Seasons, dtype: int64

In [13]:
# check Pets value_counts 
df.Pet.value_counts().sort_index()

0    460
1    138
2     96
4     32
5      6
8      8
Name: Pet, dtype: int64

**Two columns indicate whether an employee was absent (Reason for absence and Absenteeism time in hours), so we need to make sure they do not conflict.**

In [14]:
df[(df["Reason for absence"] == 0) &(df["Absenteeism time in hours"] != 0)]

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,...,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours


In [15]:
df[(df["Reason for absence"] != 0) &(df["Absenteeism time in hours"] == 0)]

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,...,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
134,34,27,1,2,2,118,10,10,37,308.593,...,0,1,0,0,0,0,83,172,28,0
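
The two filters above can also be summarized in a single cross-tabulation. This is a minimal sketch, assuming a small hand-made frame (`toy`, hypothetical) with the same two column names; the values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data reusing the dataset's column names; values are illustrative only.
toy = pd.DataFrame({
    "Reason for absence": [0, 0, 23, 27, 7],
    "Absenteeism time in hours": [0, 0, 2, 0, 4],
})

# Flag whether a reason was recorded and whether any hours were lost.
has_reason = toy["Reason for absence"] != 0
has_hours = toy["Absenteeism time in hours"] != 0

# Cross-tabulate the two flags; off-diagonal cells are conflicting records.
conflicts = pd.crosstab(has_reason, has_hours,
                        rownames=["has_reason"], colnames=["has_hours"])
print(conflicts)

# Rows where the two columns disagree.
mismatch = toy[has_reason != has_hours]
print(mismatch)
```

Any non-zero off-diagonal cell points to records like the one found above, where a reason is recorded but no hours were lost.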


## From above data we can observe that: 

> **1. There are 34 duplicated rows, but we cannot treat them as duplicate records because the same employee can be absent more than once for the same reason, especially for medical reasons.**

> **2. The dataset has 740 records without any missing or empty data.**

> **3. There are 8 records listing 8 pets and 6 records listing 5 pets, which are large numbers of pets to have. We need to ensure this information is correct.**

> **4. Three records in the Month of absence column have a value of 0, which is not a valid month (49 records have a value of 12). I assume this is a typo where 0 was entered instead of 1.**

> **5. The employee with ID 34 has dental consultation as the reason for absence, but the Absenteeism time in hours is 0, so we can consider this record as not absent. I will use Absenteeism time in hours to decide whether there was an absence.**

##  Cleaning Data

### <font color='blue'>Missing Data</font>

 * **(There are no missing records in this dataset)**
 

### <font color='blue'>Tidiness issues</font>

 * **(This dataset is tidy)**
 
 
### <font color='blue'>Quality issues</font>

##### 1. **Change column names for readability and easy access (remove spaces and make names shorter).**


##### 2. **Make a new column for absence and no absence (absence = 1 and no absence = 0).**


##### 3. **Make new columns for high, medium and low absence levels.**


##### 4. **Make one new column for medical reasons and another for non-medical reasons.**

#####  5. Drop reason_for_absence column: 

     - Data already extracted to medical_reasons and not_medical_reasons columns.

In [77]:
df_clean

Unnamed: 0,id,reason_for_absence,month_of_absence,day_of_the_week,seasons,transportation_expense,distance_from_residence_to_work,service_time,age,work_load_average_on_day,...,weight,height,body_mass_index,absenteeism_time_in_hours,absence_status,low_absence_level,medium_absence_level,high_absence_level,medical_reasons,not_medical_reasons
0,11,26,7,3,1,289,36,13,33,239.554,...,90,172,30,4,1,0,1,0,0,1
1,36,0,7,3,1,118,13,18,50,239.554,...,98,178,31,0,0,1,0,0,0,0
2,3,23,7,4,1,179,51,18,38,239.554,...,89,170,31,2,1,1,0,0,0,1
3,7,7,7,5,1,279,5,14,39,239.554,...,68,168,24,4,1,0,1,0,1,0
4,11,23,7,5,1,289,36,13,33,239.554,...,90,172,30,2,1,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
735,11,14,7,3,1,289,36,13,33,264.604,...,90,172,30,8,1,0,1,0,1,0
736,1,11,7,3,1,235,11,14,37,264.604,...,88,172,29,4,1,0,1,0,1,0
737,4,0,0,3,1,118,14,13,40,271.219,...,98,170,34,0,0,1,0,0,0,0
738,8,0,0,4,2,231,35,14,39,271.219,...,100,170,35,0,0,1,0,0,0,0


In [56]:
# make a copy of the original DataFrame to clean it

df_clean = df.copy()

### <font color='blue'>Fixing Quality issues</font>

#### 1. Change column names for readability and easy access (remove spaces and make names shorter).

##### Solution
* Change names using a `for` loop and the `df.columns` attribute.

##### Code

In [57]:
# create a new list with the current column names

col_names = list(df_clean.columns)
col_names

['ID',
 'Reason for absence',
 'Month of absence',
 'Day of the week',
 'Seasons',
 'Transportation expense',
 'Distance from Residence to Work',
 'Service time',
 'Age',
 'Work load Average/day ',
 'Hit target',
 'Disciplinary failure',
 'Education',
 'Son',
 'Social drinker',
 'Social smoker',
 'Pet',
 'Weight',
 'Height',
 'Body mass index',
 'Absenteeism time in hours']

In [58]:
# replace "/" with "_on_", trim stray whitespace, replace spaces with "_" and lower-case

new_col_names = []
for name in col_names:
    name = name.replace("/", "_on_").strip()
    new_col_names.append(name.replace(" ", "_").lower())
new_col_names

['id',
 'reason_for_absence',
 'month_of_absence',
 'day_of_the_week',
 'seasons',
 'transportation_expense',
 'distance_from_residence_to_work',
 'service_time',
 'age',
 'work_load_average_on_day',
 'hit_target',
 'disciplinary_failure',
 'education',
 'son',
 'social_drinker',
 'social_smoker',
 'pet',
 'weight',
 'height',
 'body_mass_index',
 'absenteeism_time_in_hours']

In [59]:
# assign new names to df_clean

df_clean.columns = new_col_names

##### Test

In [60]:
# confirm changes 

df_clean.columns

Index(['id', 'reason_for_absence', 'month_of_absence', 'day_of_the_week',
       'seasons', 'transportation_expense', 'distance_from_residence_to_work',
       'service_time', 'age', 'work_load_average_on_day', 'hit_target',
       'disciplinary_failure', 'education', 'son', 'social_drinker',
       'social_smoker', 'pet', 'weight', 'height', 'body_mass_index',
       'absenteeism_time_in_hours'],
      dtype='object')
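
The same renaming can also be written without an explicit loop by using the vectorized string methods on a pandas `Index`. This is a sketch of an alternative, not the approach used in this notebook; `cols` and `clean` are hypothetical names, and only three of the original column names are used for brevity.

```python
import pandas as pd

# Three of the original column names, including the trailing space
# in 'Work load Average/day '.
cols = pd.Index(["Reason for absence", "Work load Average/day ", "ID"])

# Same steps as the loop: '/' -> '_on_', trim whitespace, spaces -> '_', lower-case.
clean = (
    cols.str.replace("/", "_on_", regex=False)
        .str.strip()
        .str.replace(" ", "_", regex=False)
        .str.lower()
)
print(list(clean))
```

Applied to the real frame, the result of the same chain on `df_clean.columns` would be assigned back to `df_clean.columns`.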

#### 2. Make a new column for absence and no absence (absence = 1 and no absence = 0)

##### Solution
* make new column using `np.where` depending on absenteeism_time_in_hours column

##### Code

In [61]:
df_clean["absence_status"] = np.where(df_clean.absenteeism_time_in_hours == 0 , 0, 1)

##### Test

In [62]:
# Confirm changes

df_clean.absence_status.value_counts(normalize=True)

1    0.940541
0    0.059459
Name: absence_status, dtype: float64

#### 3. Make new columns for high, medium and low absence levels.

##### Solution
* make new columns using the `np.where()` function, with cut-offs at the second quartile (3 hours) and the third quartile (8 hours) of absenteeism_time_in_hours.

##### Code

In [63]:
# checking absenteeism_time_in_hours description

df_clean.absenteeism_time_in_hours.describe()

count    740.000000
mean       6.924324
std       13.330998
min        0.000000
25%        2.000000
50%        3.000000
75%        8.000000
max      120.000000
Name: absenteeism_time_in_hours, dtype: float64

In [64]:
# creating absence levels: low (<= 3 hours, the median), medium (> 3 and <= 8 hours, up to the third quartile), high (> 8 hours)

df_clean["low_absence_level"] = np.where(df_clean.absenteeism_time_in_hours <= 3, 1, 0)


df_clean["medium_absence_level"] = np.where(np.logical_and(df_clean.absenteeism_time_in_hours > 3,
                                                           df_clean.absenteeism_time_in_hours <= 8) , 1, 0)


df_clean["high_absence_level"] = np.where(df_clean.absenteeism_time_in_hours > 8 , 1, 0)
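
The same three levels can be produced in one step with `pd.cut` followed by `pd.get_dummies`. This is a sketch of an alternative under the same cut-offs (3 and 8 hours); the `hours` values are invented for illustration, and the resulting column names (`absence_level_low`, etc.) are an assumption that differs from the names used in this notebook.

```python
import numpy as np
import pandas as pd

# Illustrative hours only; not rows from the dataset.
hours = pd.Series([0, 2, 3, 4, 8, 40], name="absenteeism_time_in_hours")

# Bins are right-inclusive: (-inf, 3], (3, 8], (8, inf).
level = pd.cut(hours, bins=[-np.inf, 3, 8, np.inf],
               labels=["low", "medium", "high"])

# One 0/1 indicator column per level.
dummies = pd.get_dummies(level, prefix="absence_level").astype(int)

print(pd.concat([hours, level.rename("level"), dummies], axis=1))
```

Because every value falls into exactly one bin, the three indicator columns always sum to 1 per row.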

##### Test

In [65]:
df_clean.head(20)

Unnamed: 0,id,reason_for_absence,month_of_absence,day_of_the_week,seasons,transportation_expense,distance_from_residence_to_work,service_time,age,work_load_average_on_day,...,social_smoker,pet,weight,height,body_mass_index,absenteeism_time_in_hours,absence_status,low_absence_level,medium_absence_level,high_absence_level
0,11,26,7,3,1,289,36,13,33,239.554,...,0,1,90,172,30,4,1,0,1,0
1,36,0,7,3,1,118,13,18,50,239.554,...,0,0,98,178,31,0,0,1,0,0
2,3,23,7,4,1,179,51,18,38,239.554,...,0,0,89,170,31,2,1,1,0,0
3,7,7,7,5,1,279,5,14,39,239.554,...,1,0,68,168,24,4,1,0,1,0
4,11,23,7,5,1,289,36,13,33,239.554,...,0,1,90,172,30,2,1,1,0,0
5,3,23,7,6,1,179,51,18,38,239.554,...,0,0,89,170,31,2,1,1,0,0
6,10,22,7,6,1,361,52,3,28,239.554,...,0,4,80,172,27,8,1,0,1,0
7,20,23,7,6,1,260,50,11,36,239.554,...,0,0,65,168,23,4,1,0,1,0
8,14,19,7,2,1,155,12,14,34,239.554,...,0,0,95,196,25,40,1,0,0,1
9,1,22,7,2,1,235,11,14,37,239.554,...,0,1,88,172,29,8,1,0,1,0


In [69]:
# Confirm changes 

df_clean[["absenteeism_time_in_hours","low_absence_level","medium_absence_level","high_absence_level"]].head()

|   | absenteeism_time_in_hours | low_absence_level | medium_absence_level | high_absence_level |
|---|---|---|---|---|
| 0 | 4 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 2 | 1 | 0 | 0 |
| 3 | 4 | 0 | 1 | 0 |
| 4 | 2 | 1 | 0 | 0 |


#### 4. Make one new column for medical reasons and another for non-medical reasons.

##### Solution
* make new columns using the `np.where()` function depending on the reason_for_absence column

##### Code

In [73]:
# codes 1 to 22 are treated as medical reasons, codes above 22 as non-medical; 0 means no absence
df_clean["medical_reasons"] = np.where(np.logical_and(df_clean.reason_for_absence > 0, df_clean.reason_for_absence < 23), 1, 0)
df_clean["not_medical_reasons"] = np.where(df_clean.reason_for_absence > 22, 1, 0)

##### Test

In [75]:
# Confirm changes 

df_clean[["reason_for_absence","medical_reasons","not_medical_reasons"]].head()

|   | reason_for_absence | medical_reasons | not_medical_reasons |
|---|---|---|---|
| 0 | 26 | 0 | 1 |
| 1 | 0 | 0 | 0 |
| 2 | 23 | 0 | 1 |
| 3 | 7 | 1 | 0 |
| 4 | 23 | 0 | 1 |
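
The two range checks can also be expressed with `Series.between`, which is inclusive on both ends by default. This is a minimal sketch on invented reason codes; the upper bound of 28 for non-medical reasons is an assumption based on the maximum code seen in this dataset, whereas the cell above uses an open-ended `> 22`.

```python
import pandas as pd

# Illustrative reason codes only.
reason = pd.Series([26, 0, 23, 7, 22], name="reason_for_absence")

# Codes 1..22 are treated as medical and 23..28 as non-medical,
# mirroring the thresholds used above; 0 sets neither flag.
medical = reason.between(1, 22).astype(int)
not_medical = reason.between(23, 28).astype(int)

print(pd.concat([reason, medical.rename("medical_reasons"),
                 not_medical.rename("not_medical_reasons")], axis=1))
```

A quick sanity check is that no row has both flags set, and that rows with code 0 have neither.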


#### 5. Drop reason_for_absence column.

##### Solution
* Drop column using `.drop()` method.

##### Code

In [78]:
df_clean = df_clean.drop(["reason_for_absence"], axis=1)

##### Test

In [84]:
"reason_for_absence"  in (df_clean.columns)

False

### Now this dataset is tidy and clean so let's explore it.

In [85]:
df_clean.head()

Unnamed: 0,id,month_of_absence,day_of_the_week,seasons,transportation_expense,distance_from_residence_to_work,service_time,age,work_load_average_on_day,hit_target,...,weight,height,body_mass_index,absenteeism_time_in_hours,absence_status,low_absence_level,medium_absence_level,high_absence_level,medical_reasons,not_medical_reasons
0,11,7,3,1,289,36,13,33,239.554,97,...,90,172,30,4,1,0,1,0,0,1
1,36,7,3,1,118,13,18,50,239.554,97,...,98,178,31,0,0,1,0,0,0,0
2,3,7,4,1,179,51,18,38,239.554,97,...,89,170,31,2,1,1,0,0,0,1
3,7,7,5,1,279,5,14,39,239.554,97,...,68,168,24,4,1,0,1,0,1,0
4,11,7,5,1,289,36,13,33,239.554,97,...,90,172,30,2,1,1,0,0,0,1


<a id='Machine learning'></a>
## Machine learning model

> Now I'm going to split the data into training and testing sets, train a model to predict employee absence from the available features, and finally test it to check precision, recall and accuracy.

In [86]:
# prepare X and y variables for training the model

# Drop absenteeism_time_in_hours because we already made dummy variables from it:
# low_absence_level, medium_absence_level and high_absence_level.

# Drop low_absence_level so it acts as the baseline of the three dummy variables.

X = df_clean.drop(["absence_status", "absenteeism_time_in_hours", "low_absence_level"], axis=1)
y = df_clean.absence_status

In [93]:
# Split the data: 75% for training and 25% for testing

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
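
Since `absence_status` is imbalanced (about 94% of records are class 1), one option to consider is a stratified split, which keeps the class ratio the same in both parts. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo`, both hypothetical) rather than the dataset, and is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same imbalance as absence_status (6% zeros).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 12 + [1] * 188)

# stratify=y_demo keeps the 6% / 94% ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print((y_tr == 0).sum(), (y_te == 0).sum())
```

Without `stratify`, the number of minority-class records in the test set depends on the random seed, which makes precision and recall for that class less stable.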

In [97]:
log_mod = LogisticRegression()
log_mod.fit(X_train, y_train)
preds = log_mod.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
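
The warning above suggests either raising `max_iter` or scaling the features. A minimal sketch of both, assuming synthetic stand-in data from `make_classification` (not the absenteeism dataset) and a hypothetical name `scaled_log_mod`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; feature 0 is put on a much larger scale to mimic mixed units.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Standardize the features, then fit logistic regression with a larger iteration budget.
scaled_log_mod = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_mod.fit(X_tr, y_tr)

score = scaled_log_mod.score(X_te, y_te)
print(score)
```

Putting the scaler inside the pipeline means it is fitted on the training part only, so no information from the test part leaks into the scaling step.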


In [98]:
confusion_matrix(y_test, preds) 

array([[ 11,   1],
       [  0, 173]])

In [99]:
precision_score(y_test, preds)

0.9942528735632183

In [100]:
recall_score(y_test, preds)

1.0

In [101]:
accuracy_score(y_test, preds)

0.9945945945945946
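
The three scores can be re-derived by hand from the confusion matrix reported above, which is a useful check on how they relate. In scikit-learn's layout, rows are true classes and columns are predicted classes, so the matrix flattens to true negatives, false positives, false negatives, true positives.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes (0, 1),
# columns are predicted classes (0, 1).
cm = np.array([[11, 1],
               [0, 173]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # 173 / 174
recall = tp / (tp + fn)           # 173 / 173
accuracy = (tp + tn) / cm.sum()   # 184 / 185

print(round(precision, 4), round(recall, 4), round(accuracy, 4))
# prints: 0.9943 1.0 0.9946
```

The single error is one record that was not absent but was predicted as absent.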

<a id='conclusions'></a>
## Conclusions


> **1. The model's accuracy in predicting absence on the test set is 99.46%.**

> **2. The precision score is 99.4%, which means:**

        - of all the records that the model predicted as absent, 99.4% were actually absent.

> **3. The recall is 100%, which means:**

        - of all the records that were actually absent, the model identified 100% as absent.
        
        
        
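> **4. On this test split the model predicts employee absence almost perfectly, with a single misclassified record. Because several features (the absence levels and the medical / non-medical reason flags) are derived from the same absence records, these scores alone do not show that absence can be predicted in advance.**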