<a href="https://colab.research.google.com/github/vard-uhi/Absenteeism-at-Work/blob/master/Absenteeism_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Data Preprocessing**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()



In [2]:
#load the data
from google.colab import files
data_to_load = files.upload()

Saving Absenteeism_data.csv to Absenteeism_data.csv


In [3]:
import io
# .read_csv() assigns the information from the initial *.csv file to this variable
raw_data = pd.read_csv(io.BytesIO(data_to_load['Absenteeism_data.csv']))

***Checking the content of the dataset***

In [4]:
raw_data.head()

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,11,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,36,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,3,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,11,23,23/07/2015,289,36,33,239.554,30,1,2,1,2


*Making a copy of the raw dataset so that we can refer back to the original at a later stage of the analysis, if needed.*

In [5]:
df = raw_data.copy()

*Setting the preferred display options to see the dataframe.*

In [6]:
#we want every row and column to be visible; None removes the display limit
pd.options.display.max_columns = None
pd.options.display.max_rows = None
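
Setting the options globally affects every later cell. As a minimal sketch of an alternative, `pd.option_context` lifts the limits only inside a `with` block. The toy frame `wide` is an assumption made for illustration, not part of the absenteeism data.

```python
import pandas as pd

# Toy frame with more columns than pandas shows by default (illustrative only)
wide = pd.DataFrame([range(30)], columns=[f"col_{i}" for i in range(30)])

default_max_columns = pd.get_option("display.max_columns")

# Inside the block the limits are lifted; on exit the previous settings return
with pd.option_context("display.max_columns", None, "display.max_rows", None):
    inside_setting = pd.get_option("display.max_columns")
    print(wide)

restored_setting = pd.get_option("display.max_columns")
```

This keeps the rest of the notebook on the default display settings, which can be easier on the browser when a large frame is printed by accident.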

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID                         700 non-null    int64  
 1   Reason for Absence         700 non-null    int64  
 2   Date                       700 non-null    object 
 3   Transportation Expense     700 non-null    int64  
 4   Distance to Work           700 non-null    int64  
 5   Age                        700 non-null    int64  
 6   Daily Work Load Average    700 non-null    float64
 7   Body Mass Index            700 non-null    int64  
 8   Education                  700 non-null    int64  
 9   Children                   700 non-null    int64  
 10  Pets                       700 non-null    int64  
 11  Absenteeism Time in Hours  700 non-null    int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 65.8+ KB


In [8]:
df.describe()

Unnamed: 0,ID,Reason for Absence,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
count,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0
mean,17.951429,19.411429,222.347143,29.892857,36.417143,271.801774,26.737143,1.282857,1.021429,0.687143,6.761429
std,11.028144,8.356292,66.31296,14.804446,6.379083,40.021804,4.254701,0.66809,1.112215,1.166095,12.670082
min,1.0,0.0,118.0,5.0,27.0,205.917,19.0,1.0,0.0,0.0,0.0
25%,9.0,13.0,179.0,16.0,31.0,241.476,24.0,1.0,0.0,0.0,2.0
50%,18.0,23.0,225.0,26.0,37.0,264.249,25.0,1.0,1.0,0.0,3.0
75%,28.0,27.0,260.0,50.0,40.0,294.217,31.0,1.0,2.0,1.0,8.0
max,36.0,28.0,388.0,52.0,58.0,378.884,38.0,4.0,4.0,8.0,120.0


In [12]:
df.shape

(700, 12)

**Summary of the Dataframe**
 
> Our dataset doesn't have missing values. Its shape is 700 rows and 12 columns.
 
> Dependent variable: the 'Absenteeism Time in Hours' column represents the phenomenon behind our research question, so it will be our dependent variable. It is stored as an integer number of hours (0 to 120 in this dataset), not as labeled categories. To predict it with Logistic Regression, the hours must first be converted into classes, for example 'absent longer than some cutoff' versus 'not'.
 
> Independent variables: All other columns represent independent variables which could potentially be used in our equation with the hope that they will help us predict whether an individual with particular characteristics is expected to be absent from work for a certain amount of time or not.
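
Logistic regression predicts classes, while `df.info()` shows the hours column stored as `int64`. One possible way to turn the hours into a binary target is sketched below. The median cutoff and the names `hours`, `cutoff`, and `excessive` are assumptions for illustration, and this notebook does not perform the step itself.

```python
import pandas as pd

# Toy values standing in for 'Absenteeism Time in Hours' (illustrative only)
hours = pd.Series([4, 0, 2, 4, 2, 8, 40, 1], name="Absenteeism Time in Hours")

# Assumption: use the median as the boundary between the two classes
cutoff = hours.median()

# 1 = absent for longer than the cutoff, 0 = otherwise
excessive = (hours > cutoff).astype(int)
print(cutoff, excessive.tolist())
```

A median cutoff gives two classes of roughly equal size, but any threshold that matches the research question could be used instead.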

# **Data Preprocessing**


***'ID'*** column

In [13]:
df['ID'].unique()

array([11, 36,  3,  7, 10, 20, 14,  1, 24,  6, 33, 18, 30,  2, 19, 27, 34,
        5, 15, 29, 28, 13, 22, 17, 31, 23, 32,  9, 26, 21,  8, 25, 12, 16])

In [14]:
sorted(df['ID'].unique())

[1,
 2,
 3,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 36]

*The ID column is a label that identifies each employee. Its numbers carry no numeric meaning, so it adds nothing to our analysis. Moreover, treating the IDs as numbers could harm the precision of our estimates, so we will drop this column.*

In [15]:
df = df.drop(['ID'], axis=1)

In [16]:
df.head()

Unnamed: 0,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,23,23/07/2015,289,36,33,239.554,30,1,2,1,2


***'Reason for Absence'*** column

In [None]:
#I would like to explore this column in detail, make visible all values
#df['Reason for Absence']

In [17]:
#checking for the lowest and highest values
df['Reason for Absence'].min()

0

In [18]:
df['Reason for Absence'].max()

28

In [19]:
#extract a list containing distinct values only
df['Reason for Absence'].unique()

array([26,  0, 23,  7, 22, 19,  1, 11, 14, 21, 10, 13, 28, 18, 25, 24,  6,
       27, 17,  8, 12,  5,  9, 15,  4,  3,  2, 16])

In [20]:
#length of the unique numbers
len(df['Reason for Absence'].unique())

28

In [21]:
#Data type
df['Reason for Absence'].dtype

dtype('int64')

In [22]:
#sort the unique values to spot any missing code
sorted(df['Reason for Absence'].unique())

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28]

We have a description chart for this feature, where each number represents a certain reason.
 
The codes run from 0 to 28, and the only one that never appears is 20. This might be an interesting insight for further analysis.
 
As this column is informative, we will create dummy variables from it.
 
 
> This data collection and study is conducted in a way that we can be certain that an individual has been absent from work because of one and only one particular reason.

***.get_dummies***

In [23]:
reason_columns = pd.get_dummies(df['Reason for Absence'])

In [24]:
reason_columns.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0


In [25]:
reason_columns['check'] = reason_columns.sum(axis=1)

In [26]:
reason_columns.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,check
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1


In [27]:
reason_columns['check'].sum(axis=0)

700

The total equals the number of rows (700), which is consistent with each record having exactly one reason for absence.

In [28]:
reason_columns = reason_columns.drop(['check'], axis=1)
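
A total of 700 alone would not catch the unlikely case of one row summing to 2 while another sums to 0. A stricter version of the check, sketched here on a toy series (the names `reasons`, `dummies`, and `row_totals` are illustrative assumptions), tests every row separately.

```python
import pandas as pd

# Toy reason codes standing in for df['Reason for Absence']
reasons = pd.Series([26, 0, 23, 7, 23], name="Reason for Absence")

# dtype=int keeps the dummies as 0/1 regardless of the pandas version
dummies = pd.get_dummies(reasons, dtype=int)

# Each row should contain exactly one 1
row_totals = dummies.sum(axis=1)
one_reason_per_row = bool((row_totals == 1).all())
print(one_reason_per_row)
```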

To avoid potential multicollinearity issues, we will also drop the column for reason 0 from the dummy variables. We kept it until now only so that the check above could be run.

In [29]:
reason_columns = pd.get_dummies(df['Reason for Absence'], drop_first=True)

In [30]:
reason_columns.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
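
Dropping the first dummy column matters because a full set of dummies always sums to 1 in every row, so any one column can be reconstructed from the others. The sketch below shows this on a toy series; `reasons`, `full`, and `reduced` are illustrative names, not variables from this notebook.

```python
import pandas as pd

# Toy reason codes (illustrative only)
reasons = pd.Series([26, 0, 23, 7, 23])

full = pd.get_dummies(reasons, dtype=int)
reduced = pd.get_dummies(reasons, drop_first=True, dtype=int)

# In the full set every row sums to 1, so the columns are linearly dependent
print(full.sum(axis=1).tolist())

# After drop_first the lowest code (0) has no column of its own;
# a row of all zeros now means "reason 0"
print(list(reduced.columns))
print(reduced.iloc[1].tolist())
```

The dropped category becomes the baseline against which the remaining dummy coefficients are interpreted.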


27 columns, one for each remaining reason, are too many, so we will group them according to common characteristics.

***Group the Reason of Absence***

In [31]:
#look at the features in our DataFrame once again
df.columns.values

array(['Reason for Absence', 'Date', 'Transportation Expense',
       'Distance to Work', 'Age', 'Daily Work Load Average',
       'Body Mass Index', 'Education', 'Children', 'Pets',
       'Absenteeism Time in Hours'], dtype=object)

As we see, we still have 'Reason for Absence' there

In [32]:
reason_columns.columns.values

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 21, 22, 23, 24, 25, 26, 27, 28])

Next we should do the following 3 steps:
 
1.  Drop 'Reason for Absence' to avoid multicollinearity: it carries the same information as the dummy columns.
2.  Group the dummy variables into a small number of classes:
    * Group 1: Columns 1 to 14
    * Group 2: Columns 15, 16, and 17
    * Group 3: Columns 18, 19, 20, and 21 (20 does not occur in this dataset)
    * Group 4: Columns 22 to 28
3.  Concatenate the groups into our main 'df' DataFrame.

In [33]:
#Drop 
df = df.drop(['Reason for Absence'], axis=1)

In [34]:
df.head()

Unnamed: 0,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,23/07/2015,289,36,33,239.554,30,1,2,1,2


In [35]:
#Group columns into common feature classes
reason_type1 = reason_columns.loc[:, 1:14].max(axis=1)
reason_type2 = reason_columns.loc[:, 15:17].max(axis=1)
reason_type3 = reason_columns.loc[:, 18:21].max(axis=1)
reason_type4 = reason_columns.loc[:, 22:].max(axis=1)
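
Two details make the grouping work. The `.loc[:, a:b]` slice is label-based and includes both ends, and `.max(axis=1)` returns 1 whenever any dummy in the group is 1. A minimal sketch on a toy frame with integer column labels (the frame and the names `group_1` and `group_2` are assumptions for illustration):

```python
import pandas as pd

# Toy dummy frame with integer column labels, mimicking reason_columns
dummies = pd.DataFrame(
    {1: [0, 0, 1], 14: [1, 0, 0], 15: [0, 0, 0], 17: [0, 1, 0]}
)

# Label-based slice: columns 1 through 14 inclusive (here: 1 and 14)
group_1 = dummies.loc[:, 1:14].max(axis=1)

# Columns 15 through 17 inclusive (here: 15 and 17)
group_2 = dummies.loc[:, 15:17].max(axis=1)

print(group_1.tolist(), group_2.tolist())
```

Because each record has exactly one reason, at most one of the grouped columns is 1 in any row.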

In [36]:
#Concatenate the grouped reason columns to our DataFrame
df = pd.concat([df, reason_type1, reason_type2, reason_type3, reason_type4], axis=1)

In [37]:
df.head()

Unnamed: 0,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,0,1,2,3
0,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,0,1
3,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,0,1


To change the column names 0 to 3 into meaningful names, we do the following:

In [38]:
df.columns.values

array(['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 0, 1, 2, 3],
      dtype=object)

In [39]:
#creating new column names
column_names = ['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 'Reason_1', 'Reason_2', 'Reason_3', 'Reason_4']

In [40]:
#assigning new column names to our DataFrame
df.columns = column_names
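
Assigning a full list to `df.columns` requires retyping every name in the right order. As a hedged alternative, `DataFrame.rename` with a mapping touches only the columns that need new names. The toy frame `df_toy` and the dictionary `new_names` are illustrative assumptions.

```python
import pandas as pd

# Toy frame with one named column and four integer-labelled columns
df_toy = pd.DataFrame({"Pets": [1, 0], 0: [0, 1], 1: [0, 0], 2: [0, 0], 3: [1, 0]})

# Only the integer labels are renamed; 'Pets' is left untouched
new_names = {0: "Reason_1", 1: "Reason_2", 2: "Reason_3", 3: "Reason_4"}
df_toy = df_toy.rename(columns=new_names)

print(list(df_toy.columns))
```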

In [41]:
df.head()

Unnamed: 0,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Reason_1,Reason_2,Reason_3,Reason_4
0,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,0,1
3,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,0,1


Reorder columns in the DataFrame

In [42]:
column_names_reordered = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours']

In [43]:
df = df[column_names_reordered]
df.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,23/07/2015,289,36,33,239.554,30,1,2,1,2


***Create a Checkpoint***

We will create a copy of the current state of the df DataFrame to save our preprocessed data.

In [44]:
df_reason_mod = df.copy()

***'Date'*** column

In [45]:
df_reason_mod['Date'].head()

0    07/07/2015
1    14/07/2015
2    15/07/2015
3    16/07/2015
4    23/07/2015
Name: Date, dtype: object

In [46]:
#checking the data type of a single value
type(df_reason_mod['Date'][0])

str

We can conclude that our date values are stored as text.
To work with them as dates, we convert all values to the pandas Timestamp data type.

In [47]:
#convert the text values into timestamps; the format string means day/month/year
df_reason_mod['Date'] = pd.to_datetime(df_reason_mod['Date'], format= '%d/%m/%Y')
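
The explicit `format` matters because a string such as `07/08/2015` is ambiguous: it could be 7 August or 8 July. A small sketch, using made-up dates, of how `'%d/%m/%Y'` fixes the reading to day first:

```python
import pandas as pd

# Made-up dates in day/month/year order; the first one is ambiguous on its own
dates = pd.Series(["07/08/2015", "14/07/2015"])

# %d = day, %m = month, %Y = four-digit year
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

print(parsed[0].day, parsed[0].month)  # day 7, month 8
print(parsed[1].day, parsed[1].month)  # day 14, month 7
```

With the format given, a string that does not match it raises an error instead of being silently misread.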

In [48]:
df_reason_mod['Date'].head()

0   2015-07-07
1   2015-07-14
2   2015-07-15
3   2015-07-16
4   2015-07-23
Name: Date, dtype: datetime64[ns]

In [49]:
type(df_reason_mod['Date'][0])

pandas._libs.tslibs.timestamps.Timestamp

***Extract the Month Value***

In [50]:
df_reason_mod['Date'][0]

Timestamp('2015-07-07 00:00:00')

In [51]:
#extract only the month value
df_reason_mod['Date'][0].month

7

In [52]:
#creating an empty list
list_months = []
list_months

[]

In [53]:
#shape
df_reason_mod.shape

(700, 14)

In [54]:
#we will use a loop that iteratively extracts the month value of every date in the 'Date' column
for i in range(df_reason_mod.shape[0]):
  list_months.append(df_reason_mod['Date'][i].month)

In [None]:
#list_months

In [57]:
#checking the number of elements
len(list_months)

700

In [58]:
#creating new column to store Months
df_reason_mod['Month Value'] = list_months
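
The loop works, but pandas also offers the `.dt` accessor, which extracts the month from a whole datetime column at once. The sketch below compares both approaches on toy dates; `dates`, `months_loop`, and `months_vectorized` are illustrative names.

```python
import pandas as pd

# Toy dates parsed the same way as the 'Date' column
dates = pd.to_datetime(
    pd.Series(["07/07/2015", "14/07/2015", "03/12/2015"]), format="%d/%m/%Y"
)

# Loop version, equivalent to the cell above
months_loop = [dates[i].month for i in range(dates.shape[0])]

# Vectorized version using the .dt accessor
months_vectorized = dates.dt.month

print(months_loop, months_vectorized.tolist())
```

Both give the same values; the vectorized form needs no intermediate list and no length check.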

In [59]:
df_reason_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,1,2,1,4,7
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,1,1,0,0,7
2,0,0,0,1,2015-07-15,179,51,38,239.554,31,1,0,0,2,7
3,1,0,0,0,2015-07-16,279,5,39,239.554,24,1,2,0,4,7
4,0,0,0,1,2015-07-23,289,36,33,239.554,30,1,2,1,2,7


From the perspective of our analysis, this new 'Month Value' column will allow us to check whether employees tend to be absent more often in specific months of the year.
Following the same logic, it may turn out that workers are more likely to be away from their desks on certain days of the week than on others.

***Extract the Day of the week***

In [60]:
#day of the week; the result is labeled 0-6, where 0 is Monday and 6 is Sunday
df_reason_mod['Date'][699].weekday()

3

In [61]:
def date_to_weekday(date_value):
  return date_value.weekday()

In [62]:
#new column
df_reason_mod['Day of the Week'] = df_reason_mod['Date'].apply(date_to_weekday)
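
As with the month, the weekday can also be read through the `.dt` accessor, and `.dt.day_name()` makes it easy to confirm which number maps to which day. A short sketch using three dates that appear in the dataset (the variable names are illustrative):

```python
import pandas as pd

# Three dates from the first rows of the dataset
dates = pd.to_datetime(
    pd.Series(["07/07/2015", "15/07/2015", "16/07/2015"]), format="%d/%m/%Y"
)

# Monday is 0 and Sunday is 6
weekday_numbers = dates.dt.weekday

# Human-readable names, handy for checking the encoding
weekday_names = dates.dt.day_name()

print(weekday_numbers.tolist(), weekday_names.tolist())
```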

In [63]:
df_reason_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value,Day of the Week
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,1,2,1,4,7,1
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,1,1,0,0,7,1
2,0,0,0,1,2015-07-15,179,51,38,239.554,31,1,0,0,2,7,2
3,1,0,0,0,2015-07-16,279,5,39,239.554,24,1,2,0,4,7,3
4,0,0,0,1,2015-07-23,289,36,33,239.554,30,1,2,1,2,7,3


Drop 'Date' column

In [64]:
df_reason_mod = df_reason_mod.drop(['Date'], axis=1)

In [65]:
df_reason_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value,Day of the Week
0,0,0,0,1,289,36,33,239.554,30,1,2,1,4,7,1
1,0,0,0,0,118,13,50,239.554,31,1,1,0,0,7,1
2,0,0,0,1,179,51,38,239.554,31,1,0,0,2,7,2
3,1,0,0,0,279,5,39,239.554,24,1,2,0,4,7,3
4,0,0,0,1,289,36,33,239.554,30,1,2,1,2,7,3


In [66]:
df_reason_mod.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 'Month Value',
       'Day of the Week'], dtype=object)

In [67]:
col_reorder = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Day of the Week', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours']

In [68]:
df_reason_mod['Education'].head()

0    1
1    1
2    1
3    1
4    1
Name: Education, dtype: int64

In [69]:
df_reason_mod = df_reason_mod[col_reorder]
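
Selecting with a list reorders the columns, but any column left out of the list is silently dropped. A minimal safeguard, sketched on a toy frame (`df_toy` and `new_order` are assumptions for illustration), is to compare the two sets of names before reordering.

```python
import pandas as pd

# Toy frame with three of the columns (illustrative only)
df_toy = pd.DataFrame({"Age": [33], "Reason_1": [0], "Month Value": [7]})

new_order = ["Reason_1", "Month Value", "Age"]

# Fail loudly if a column was forgotten or misspelled in the new order
assert sorted(new_order) == sorted(df_toy.columns), "column lists differ"

df_toy = df_toy[new_order]
print(list(df_toy.columns))
```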

In [70]:
df_reason_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,1,2,1,2


New Checkpoint

In [71]:
df_reason_date_mod = df_reason_mod.copy()

***Next, we will explore following 5 columns and their importance for our analysis.***
* Transportation Expense
* Distance to Work
* Age
* Daily Work Load Average
* Body Mass Index



In [72]:
type(df_reason_date_mod['Transportation Expense'][0])

numpy.int64

In [73]:
type(df_reason_date_mod['Distance to Work'][0])

numpy.int64

In [74]:
type(df_reason_date_mod['Age'][0])

numpy.int64

In [75]:
type(df_reason_date_mod['Daily Work Load Average'][0])

numpy.float64

In [76]:
type(df_reason_date_mod['Body Mass Index'][0])

numpy.int64
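
The five separate `type(...)` checks can be condensed into a single `dtypes` lookup. The sketch below uses a toy frame with the same column names and the first two rows of values shown earlier; `numeric_columns`, `column_types`, and `all_numeric` are illustrative names.

```python
import pandas as pd

# Toy frame with the five columns under review
df_toy = pd.DataFrame(
    {
        "Transportation Expense": [289, 118],
        "Distance to Work": [36, 13],
        "Age": [33, 50],
        "Daily Work Load Average": [239.554, 239.554],
        "Body Mass Index": [30, 31],
    }
)

numeric_columns = list(df_toy.columns)

# One lookup returns the dtype of every selected column
column_types = df_toy[numeric_columns].dtypes
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in column_types)

print(column_types)
print(all_numeric)
```

Since all five columns are already numeric, they need no further preprocessing at this stage.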

***'Education', 'Children', 'Pets'***

All 3 of these columns hold numeric values. The numbers under 'Children' and 'Pets' indicate how many children and pets a person has. The numbers under 'Education', however, have no numeric meaning: they represent categories. Hence, we will preprocess only the 'Education' column and transform it into a dummy variable.

In [77]:
df_reason_date_mod['Education'].unique()

array([1, 3, 2, 4])

In [78]:
df_reason_date_mod['Education'].value_counts()

1    583
3     73
2     40
4      4
Name: Education, dtype: int64

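Category 1 covers 583 of the 700 records, while categories 2, 3, and 4 together cover only 117. It makes sense to combine these 3 into 1 category, so we will have 2 broad categories: high school (mapped to 0) and university (mapped to 1).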

In [79]:
df_reason_date_mod['Education'] = df_reason_date_mod['Education'].map({1:0, 2:1, 3:1, 4:1})

In [80]:
df_reason_date_mod['Education'].unique()

array([0, 1])

In [81]:
df_reason_date_mod['Education'].value_counts()

0    583
1    117
Name: Education, dtype: int64
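
One property of `.map` with a dictionary is worth keeping in mind: any value missing from the dictionary becomes `NaN`. The sketch below shows the mapping on toy values, an equivalent threshold-based version, and the `NaN` behaviour. The code 5 is a made-up value that does not occur in this dataset.

```python
import pandas as pd

# Toy education codes (illustrative only)
education = pd.Series([1, 3, 2, 4, 1])

# Dictionary mapping, as in the cell above
mapped = education.map({1: 0, 2: 1, 3: 1, 4: 1})

# Equivalent here: everything above code 1 becomes 1
thresholded = (education > 1).astype(int)

# A code missing from the dictionary (hypothetical code 5) turns into NaN
unexpected = pd.Series([1, 5]).map({1: 0, 2: 1, 3: 1, 4: 1})

print(mapped.tolist(), thresholded.tolist(), unexpected.isna().tolist())
```

Checking `.unique()` after the mapping, as the notebook does, is what confirms that no such gaps were introduced.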

***Final Checkpoint***

In [82]:
df_preprocessed = df_reason_date_mod.copy()

In [83]:
df_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


In [84]:
from google.colab import files
df_preprocessed.to_csv('Absenteeism_preprocessed.csv', index=False) 
files.download('Absenteeism_preprocessed.csv')
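
The `index=False` argument keeps the row index out of the file. Without it, reloading the CSV produces an extra `Unnamed: 0` column. A small round-trip sketch using in-memory buffers instead of real files (the toy frame and buffer names are assumptions for illustration):

```python
import io
import pandas as pd

# Toy frame standing in for df_preprocessed
df_toy = pd.DataFrame({"Education": [0, 1], "Pets": [1, 0]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df_toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
df_toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```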
