## Data Manipulation on the Absenteeism at Work Dataset

* The commands that I am using in this project
    * df.head()
    * df.columns
    * df.shape
    * df['col_name'].value_counts()
    * df.dtypes
* Check for missing values
    * Solution 1 - df.isna().any()
    * Solution 2 - df.isnull().sum()
    * Solution 3 - Loop through each column of the dataframe and count its missing values
        * for col in df.columns: 
            * print(col, df[col].isnull().sum())
* Drop a column
    * df_new = df.drop('col_name', axis =1)
* Rename a column 
    * df.rename(columns={'col_name': 'new_col_name'}, inplace=True)
* Encode or map a column's values to other values using a function and df['col_name'].apply(func_name)

* Filtering data
    * new_variable = df['col_name'] >= 50 (define the criterion as a boolean mask)
    * df.loc[new_variable, 'col_name']

* Answering business questions
    * Q1. What is the average absenteeism time per day (Absenteeism time in hours by Day of Week)?
    * Q2. General stats (count, avg, max, std) of Absenteeism time in hours by Day of Week
    * Q3. Average absenteeism for everyone aged 50+
    * Q4. Most frequently occurring reason for absence
    * Q5. Average workload for (absenteeism time in hours > 50 and age of 50+)
    
    

### Import Libraries

In [29]:
import pandas as pd

In [30]:
### Download Dataset
# https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work
 
# !curl -O "https://archive.ics.uci.edu/ml/machine-learning-databases/00445/Absenteeism_at_work_AAA.zip"
# !unzip Absenteeism_at_work_AAA.zip

### Load data

In [31]:
data = pd.read_csv('dataset/Absenteeism_at_work.csv', sep=';')


In [32]:
data.head(2)

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,...,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,11,26,7,3,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,4
1,36,0,7,3,1,118,13,18,50,239.554,...,1,1,1,1,0,0,98,178,31,0


In [33]:
data.columns

Index(['ID', 'Reason for absence', 'Month of absence', 'Day of the week',
       'Seasons', 'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index',
       'Absenteeism time in hours'],
      dtype='object')

In [35]:
data.shape

(740, 21)

In [36]:
data['Reason for absence'].value_counts()

23    149
28    112
27     69
13     55
0      43
19     40
22     38
26     33
25     31
11     26
10     25
18     21
14     19
1      16
7      15
6       8
12      8
8       6
21      6
9       4
5       3
24      3
16      3
4       2
15      2
3       1
2       1
17      1
Name: Reason for absence, dtype: int64

In [39]:
data.dtypes #Check for data types of each column

ID                                   int64
Reason for absence                   int64
Month of absence                     int64
Day of the week                      int64
Seasons                              int64
Transportation expense               int64
Distance from Residence to Work      int64
Service time                         int64
Age                                  int64
Work load Average/day              float64
Hit target                           int64
Disciplinary failure                 int64
Education                            int64
Son                                  int64
Social drinker                       int64
Social smoker                        int64
Pet                                  int64
Weight                               int64
Height                               int64
Body mass index                      int64
Absenteeism time in hours            int64
dtype: object

### Check for Missing Values

In [40]:
#Solution1
data.isna().any() #Check for missing values

ID                                 False
Reason for absence                 False
Month of absence                   False
Day of the week                    False
Seasons                            False
Transportation expense             False
Distance from Residence to Work    False
Service time                       False
Age                                False
Work load Average/day              False
Hit target                         False
Disciplinary failure               False
Education                          False
Son                                False
Social drinker                     False
Social smoker                      False
Pet                                False
Weight                             False
Height                             False
Body mass index                    False
Absenteeism time in hours          False
dtype: bool

In [41]:
#Solution2
data.isnull().sum()

ID                                 0
Reason for absence                 0
Month of absence                   0
Day of the week                    0
Seasons                            0
Transportation expense             0
Distance from Residence to Work    0
Service time                       0
Age                                0
Work load Average/day              0
Hit target                         0
Disciplinary failure               0
Education                          0
Son                                0
Social drinker                     0
Social smoker                      0
Pet                                0
Weight                             0
Height                             0
Body mass index                    0
Absenteeism time in hours          0
dtype: int64

In [42]:
#Solution3
#Loop through each column of the dataframe and count its missing values
for col in data.columns: 
    print(col, data[col].isnull().sum())

ID 0
Reason for absence 0
Month of absence 0
Day of the week 0
Seasons 0
Transportation expense 0
Distance from Residence to Work 0
Service time 0
Age 0
Work load Average/day  0
Hit target 0
Disciplinary failure 0
Education 0
Son 0
Social drinker 0
Social smoker 0
Pet 0
Weight 0
Height 0
Body mass index 0
Absenteeism time in hours 0
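
Because this dataset has no missing values, all three checks above report only `False` or `0`. The sketch below runs the same calls on a small made-up frame so the difference between them is visible. The `toy` DataFrame and its column names are illustrative assumptions, not part of the dataset. Note that `isna()` and `isnull()` are aliases in pandas, so Solution 1 and Solution 2 differ only in `any()` versus `sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing value in "age", two in "hours"
toy = pd.DataFrame({
    "age": [33, np.nan, 38, 39],
    "hours": [4.0, np.nan, 2.0, np.nan],
    "reason": [26, 0, 23, 7],
})

has_missing = toy.isna().any()     # one True/False flag per column
missing_counts = toy.isna().sum()  # number of missing cells per column

print(has_missing)
print(missing_counts)
```

`any()` only answers whether a column needs attention, while `sum()` tells how many rows are affected, which is usually the more useful starting point for cleaning.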


### Drop a column

In [43]:
data1 = data.drop('ID', axis=1) # axis=1 means drop a column; drop() returns a new DataFrame and leaves `data` unchanged (pass inplace=True to modify it in place)

In [44]:
data1.head()

Unnamed: 0,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,26,7,3,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,4
1,0,7,3,1,118,13,18,50,239.554,97,1,1,1,1,0,0,98,178,31,0
2,23,7,4,1,179,51,18,38,239.554,97,0,1,0,1,0,0,89,170,31,2
3,7,7,5,1,279,5,14,39,239.554,97,0,1,2,1,1,0,68,168,24,4
4,23,7,5,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,2
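
A short sketch of how `drop` behaves when `inplace` is not set, using a made-up two-column frame (the `toy` name and its values are assumptions for illustration only). It also shows the `columns=` keyword, which does the same thing as `axis=1`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({"ID": [11, 36], "Age": [33, 50]})

toy_no_id = toy.drop("ID", axis=1)    # returns a new DataFrame without "ID"
same_result = toy.drop(columns="ID")  # equivalent spelling of the same operation

print(list(toy.columns))        # the original still has both columns
print(list(toy_no_id.columns))  # the copy has only "Age"
```

Keeping the original frame untouched, as this notebook does with `data` and `data1`, makes it easy to go back to the raw data later.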


In [45]:
data1.describe()

Unnamed: 0,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
count,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0
mean,19.216216,6.324324,3.914865,2.544595,221.32973,29.631081,12.554054,36.45,271.490235,94.587838,0.054054,1.291892,1.018919,0.567568,0.072973,0.745946,79.035135,172.114865,26.677027,6.924324
std,8.433406,3.436287,1.421675,1.111831,66.952223,14.836788,4.384873,6.478772,39.058116,3.779313,0.226277,0.673238,1.098489,0.495749,0.260268,1.318258,12.883211,6.034995,4.285452,13.330998
min,0.0,0.0,2.0,1.0,118.0,5.0,1.0,27.0,205.917,81.0,0.0,1.0,0.0,0.0,0.0,0.0,56.0,163.0,19.0,0.0
25%,13.0,3.0,3.0,2.0,179.0,16.0,9.0,31.0,244.387,93.0,0.0,1.0,0.0,0.0,0.0,0.0,69.0,169.0,24.0,2.0
50%,23.0,6.0,4.0,3.0,225.0,26.0,13.0,37.0,264.249,95.0,0.0,1.0,1.0,1.0,0.0,0.0,83.0,170.0,25.0,3.0
75%,26.0,9.0,5.0,4.0,260.0,50.0,16.0,40.0,294.217,97.0,0.0,1.0,2.0,1.0,0.0,1.0,89.0,172.0,31.0,8.0
max,28.0,12.0,6.0,4.0,388.0,52.0,29.0,58.0,378.884,100.0,1.0,4.0,4.0,1.0,1.0,8.0,108.0,196.0,38.0,120.0


In [46]:
data1['Day of the week'].value_counts()

2    161
4    156
3    154
6    144
5    125
Name: Day of the week, dtype: int64

### Encoding or Mapping Column Values with a Function and apply()

In [47]:
def encode_day(x):
    day = {
        2: 'Monday',
        3: 'Tuesday',
        4: 'Wednesday',
        5: 'Thursday',
        6: 'Friday'
    }
    return day[x]
data1['Day of the week'].apply(encode_day)

0        Tuesday
1        Tuesday
2      Wednesday
3       Thursday
4       Thursday
         ...    
735      Tuesday
736      Tuesday
737      Tuesday
738    Wednesday
739       Friday
Name: Day of the week, Length: 740, dtype: object

In [48]:
data1['Day of the week'] = data1['Day of the week'].apply(encode_day)

In [49]:
data1.head()

Unnamed: 0,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,26,7,Tuesday,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,4
1,0,7,Tuesday,1,118,13,18,50,239.554,97,1,1,1,1,0,0,98,178,31,0
2,23,7,Wednesday,1,179,51,18,38,239.554,97,0,1,0,1,0,0,89,170,31,2
3,7,7,Thursday,1,279,5,14,39,239.554,97,0,1,2,1,1,0,68,168,24,4
4,23,7,Thursday,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,2
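
As an alternative to `apply` with a lookup function, the same encoding could be sketched with `Series.map` and a dictionary. This is not what the notebook uses; it is shown only for comparison, on a made-up Series that deliberately contains a code (7) missing from the dictionary.

```python
import pandas as pd

# Same code-to-name mapping as encode_day above
day_names = {2: "Monday", 3: "Tuesday", 4: "Wednesday", 5: "Thursday", 6: "Friday"}

# Hypothetical sample of day codes; 7 has no entry in day_names
days = pd.Series([3, 3, 4, 5, 7], name="Day of the week")

encoded = days.map(day_names)  # codes without a match become NaN
print(encoded)
```

One difference worth knowing: `encode_day` raises a `KeyError` on an unknown code, whereas `map` silently produces `NaN`. Which behaviour is preferable depends on whether unknown codes should stop the analysis or be handled later.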


In [50]:
data1['Month of absence'].unique()

array([ 7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  0], dtype=int64)

### Renaming a Column

In [51]:
data1.rename(columns={"Son" : "Number of Sons"}, inplace=True)

In [52]:
data1

Unnamed: 0,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Number of Sons,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,26,7,Tuesday,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,4
1,0,7,Tuesday,1,118,13,18,50,239.554,97,1,1,1,1,0,0,98,178,31,0
2,23,7,Wednesday,1,179,51,18,38,239.554,97,0,1,0,1,0,0,89,170,31,2
3,7,7,Thursday,1,279,5,14,39,239.554,97,0,1,2,1,1,0,68,168,24,4
4,23,7,Thursday,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
735,14,7,Tuesday,1,289,36,13,33,264.604,93,0,1,2,1,0,1,90,172,30,8
736,11,7,Tuesday,1,235,11,14,37,264.604,93,0,3,1,0,0,1,88,172,29,4
737,0,0,Tuesday,1,118,14,13,40,271.219,95,0,1,1,1,0,8,98,170,34,0
738,0,0,Wednesday,2,231,35,14,39,271.219,95,0,1,2,1,0,2,100,170,35,0


### Filtering data

In [53]:
age_filter = data1['Age'] >= 50  # boolean mask: True for rows with Age of 50 or more
data1.loc[age_filter, 'Absenteeism time in hours']

1        0
55       0
64       0
100      2
144      8
145      8
192      1
216      0
218     24
228      8
230      1
231     80
237      8
255      8
263      3
266      1
284      1
293      0
306      1
309      1
316      1
343      1
366      4
402      2
403      3
404      8
407      0
419      1
420    120
434      8
454      2
473      1
521      1
620      3
622    112
640      2
662      3
672      2
685      2
686      3
688      0
711     24
716      3
727      8
729    120
739      0
Name: Absenteeism time in hours, dtype: int64

## Answering Business Questions

### Q1. What is the average absenteeism time per day (Absenteeism time in hours by Day of Week)?

In [54]:
#Solution1

In [55]:
day_of_week_group = data1[['Day of the week','Absenteeism time in hours']].groupby('Day of the week')

In [56]:
day_of_week_group.agg('mean')

Day of the week,Absenteeism time in hours
Friday,5.125
Monday,9.248447
Thursday,4.424
Tuesday,7.980519
Wednesday,7.147436


In [57]:
#Solution2
data1.groupby('Day of the week')['Absenteeism time in hours'].mean()

Day of the week
Friday       5.125000
Monday       9.248447
Thursday     4.424000
Tuesday      7.980519
Wednesday    7.147436
Name: Absenteeism time in hours, dtype: float64
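
Since the days are now strings, `groupby` sorts them alphabetically (Friday, Monday, ...). A minimal sketch of putting the result back into weekday order with `reindex`, using made-up data (the `toy` frame and its values are assumptions, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data with one or two records per weekday
toy = pd.DataFrame({
    "Day of the week": ["Tuesday", "Monday", "Friday", "Monday", "Thursday", "Wednesday"],
    "Absenteeism time in hours": [4, 8, 2, 2, 6, 0],
})

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

mean_by_day = toy.groupby("Day of the week")["Absenteeism time in hours"].mean()
ordered = mean_by_day.reindex(weekday_order)  # rows rearranged to follow weekday_order

print(ordered)
```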

### Q2. General stats (count, avg, max, std) of Absenteeism time in hours by Day of Week?

In [58]:
data1.groupby('Day of the week')['Absenteeism time in hours'].count()

Day of the week
Friday       144
Monday       161
Thursday     125
Tuesday      154
Wednesday    156
Name: Absenteeism time in hours, dtype: int64

In [59]:
#Solution1

In [60]:
agg_of_week_group = data1[['Day of the week','Absenteeism time in hours']].groupby('Day of the week')

In [61]:
agg_of_week_group.agg(['count','mean','max','std']) 

,Absenteeism time in hours,Absenteeism time in hours,Absenteeism time in hours,Absenteeism time in hours
Day of the week,count,mean,max,std
Friday,144,5.125,64,7.91111
Monday,161,9.248447,120,15.972645
Thursday,125,4.424,24,4.265889
Tuesday,154,7.980519,120,18.027383
Wednesday,156,7.147436,120,13.267863
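
Passing a list to `agg` gives a two-level column header. If flat, descriptive column names are preferred, pandas named aggregation is one option. The sketch below uses made-up data, and the output column names (`n_records`, `avg_hours`, and so on) are my own choices for illustration.

```python
import pandas as pd

# Hypothetical toy data: two Monday records and three Friday records
toy = pd.DataFrame({
    "Day of the week": ["Monday", "Monday", "Friday", "Friday", "Friday"],
    "Absenteeism time in hours": [8, 2, 2, 4, 0],
})

# keyword = output column name, value = aggregation to apply
stats = toy.groupby("Day of the week")["Absenteeism time in hours"].agg(
    n_records="count",
    avg_hours="mean",
    max_hours="max",
    std_hours="std",
)

print(stats)
```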


### Q3. Average absenteeism for everyone aged 50+?

In [62]:
data1.head(2)

Unnamed: 0,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Number of Sons,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,26,7,Tuesday,1,289,36,13,33,239.554,97,0,1,2,1,0,1,90,172,30,4
1,0,7,Tuesday,1,118,13,18,50,239.554,97,1,1,1,1,0,0,98,178,31,0


In [63]:
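# Note: `> 50` excludes employees who are exactly 50; use `>= 50` to include them (the mean will differ)
data1[data1.Age > 50]['Absenteeism time in hours'].mean()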

29.11111111111111

### Q4. Most Frequently Occurring Reason for Absence?

In [64]:
#Most frequently occurring reason for absence (value_counts sorts by count, descending)
data1['Reason for absence'].value_counts()[:1]

23    149
Name: Reason for absence, dtype: int64
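
If only the reason code and its count are needed as plain values, rather than a one-row Series, `idxmax()` and `max()` on the counts are one way to get them. A small sketch on made-up reason codes (the `reasons` Series is an assumption for illustration):

```python
import pandas as pd

# Hypothetical sample of reason codes; 23 appears most often
reasons = pd.Series([23, 28, 23, 27, 23, 28], name="Reason for absence")

counts = reasons.value_counts()
top_reason = counts.idxmax()  # index label (reason code) with the highest count
top_count = counts.max()      # how many times that code appears

print(top_reason, top_count)
```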

### Q5. Average workload for (absenteeism time in hours > 50 and age of 50+)?

In [65]:
#Average workload for (absenteeism time in hours > 50 and age of 50+)
#Solution1
#Note: the column name 'Work load Average/day ' ends with a trailing space in this dataset
data1[(data1['Absenteeism time in hours'] > 50) & (data1.Age >= 50)]['Work load Average/day '].mean()


275.93975

In [66]:
data1.columns

Index(['Reason for absence', 'Month of absence', 'Day of the week', 'Seasons',
       'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Number of Sons', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index',
       'Absenteeism time in hours'],
      dtype='object')

In [67]:
#Solution2
filter1 = (data1['Age']>=50)
filter2 = (data1['Absenteeism time in hours']>50)
data1.loc[filter1 & filter2, 'Work load Average/day '].mean()


275.93975
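
Both solutions combine two boolean masks with `&`; each comparison needs its own parentheses (or its own variable) because `&` binds more tightly than `>` and `>=`. A minimal sketch on made-up data, with `DataFrame.query` as a third way to write the same filter. The `toy` frame and the simplified `Workload` column name are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; "Workload" stands in for the real workload column
toy = pd.DataFrame({
    "Age": [33, 50, 58, 52],
    "Absenteeism time in hours": [120, 80, 112, 8],
    "Workload": [239.5, 284.0, 265.0, 300.0],
})

is_senior = toy["Age"] >= 50
long_absence = toy["Absenteeism time in hours"] > 50

# Rows must satisfy both conditions
via_masks = toy.loc[is_senior & long_absence, "Workload"].mean()

# Same filter as a query string; backticks quote column names containing spaces
via_query = toy.query("Age >= 50 and `Absenteeism time in hours` > 50")["Workload"].mean()

print(via_masks, via_query)
```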