<h1 align="center">Predicting Absenteeism at Work — Data Preprocessing</h1>

## **1. Project Overview**

### Objective

Absenteeism refers to an employee's habitual non-presence at work during normal working hours, often resulting in decreased productivity. By analyzing historical employee data, we aim to identify the factors contributing to absenteeism and prepare the dataset for building predictive models that help businesses proactively manage workforce availability.

### Notebook Purpose

This notebook focuses solely on the preprocessing stage of the project. The primary objective here is to transform the raw absenteeism data into a clean, structured, and model-ready format. This involves handling missing values, engineering meaningful features, encoding categorical variables, and ensuring the data is suitable for downstream analysis and machine learning.

---

## **2. Initial Setup and Dataset Overview**

### Dataset Source
The dataset is derived from a study on workplace absenteeism conducted at a courier company in Brazil. It contains 700 absence records described by 12 attributes covering demographic, lifestyle, and work-related variables that may influence absenteeism. The dataset used is **“Absenteeism at Work”** from the UCI Machine Learning Repository (Martiniano & Ferreira, 2012).

**File Name:** *absenteeism_raw_data.csv*

### Import Libraries and Load Dataset

In [5]:
# Core libraries
import pandas as pd
import numpy as np
from IPython.display import display, Markdown

# Display settings
pd.set_option('display.max_columns', None)

# Load dataset
df = pd.read_csv('absenteeism_raw_data.csv')

# Initial checkpoint for backup
df_raw = df.copy()

# Preview first few rows
df.head()

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,11,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,36,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,3,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,11,23,23/07/2015,289,36,33,239.554,30,1,2,1,2


### High-Level Feature Overview
Each row represents an instance of absenteeism, with features such as:

- **Reason for Absence:** Categorical codes indicating the reason (e.g., sickness, dental issues). Codes 1 to 21 are absences attested by the [International Classification of Diseases (ICD)](https://www.who.int/classifications/classification-of-diseases), stratified into 21 categories; codes 22 to 28 are 7 further categories without an ICD code.

- **Date:** When the absence occurred

- **Transportation Expense, Distance to Work:** Numerical indicators of commute burden

- **Age, Education, Children, Pets:** Demographic characteristics

- **Body Mass Index (BMI), Daily Work Load, Absenteeism Time in Hours**: Health and job-related data

---

## **3. Exploratory Data Review**

### Dataset Shape and Types

In [9]:
# Dataset shape
print(f"Dataset shape: {df.shape}")

# Column data types and non-null counts
df.info()

Dataset shape: (700, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID                         700 non-null    int64  
 1   Reason for Absence         700 non-null    int64  
 2   Date                       700 non-null    object 
 3   Transportation Expense     700 non-null    int64  
 4   Distance to Work           700 non-null    int64  
 5   Age                        700 non-null    int64  
 6   Daily Work Load Average    700 non-null    float64
 7   Body Mass Index            700 non-null    int64  
 8   Education                  700 non-null    int64  
 9   Children                   700 non-null    int64  
 10  Pets                       700 non-null    int64  
 11  Absenteeism Time in Hours  700 non-null    int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 65.8+ KB


#### 🧾 Notes on Variables and Modeling

* **No missing values** detected in the dataset.
* **Target variable:** `Absenteeism Time in Hours` is the dependent variable we aim to predict.
* **ID:** Identifies each individual employee and can be used to track records but **should not be used as a feature** during modeling.
* **Independent variables:** All other columns in the dataset are potential predictors of absenteeism and will be evaluated/processed accordingly.
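
The "no missing values" note can also be confirmed directly rather than read off `df.info()`. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are illustrative and not taken from the dataset); the same two calls would apply to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the absenteeism data
toy = pd.DataFrame({
    "Age": [33, 50, np.nan, 39],
    "Pets": [1, 0, 0, 0],
})

# Count missing entries per column; all zeros would mean nothing is missing
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Total number of missing cells across the whole frame
total_missing = int(missing_per_column.sum())
print(f"Total missing values: {total_missing}")
```

On the real data, a total of zero would match the non-null counts reported above.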

### Summary Statistics
Get a general sense of distributions and ranges in the dataset.

In [12]:
df.describe()

Unnamed: 0,ID,Reason for Absence,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
count,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0
mean,17.951429,19.411429,222.347143,29.892857,36.417143,271.801774,26.737143,1.282857,1.021429,0.687143,6.761429
std,11.028144,8.356292,66.31296,14.804446,6.379083,40.021804,4.254701,0.66809,1.112215,1.166095,12.670082
min,1.0,0.0,118.0,5.0,27.0,205.917,19.0,1.0,0.0,0.0,0.0
25%,9.0,13.0,179.0,16.0,31.0,241.476,24.0,1.0,0.0,0.0,2.0
50%,18.0,23.0,225.0,26.0,37.0,264.249,25.0,1.0,1.0,0.0,3.0
75%,28.0,27.0,260.0,50.0,40.0,294.217,31.0,1.0,2.0,1.0,8.0
max,36.0,28.0,388.0,52.0,58.0,378.884,38.0,4.0,4.0,8.0,120.0


In [13]:
dates = pd.to_datetime(df['Date'], format='%d/%m/%Y')
print("Date range:", dates.min(), "-", dates.max())

Date range: 2015-07-06 00:00:00 - 2018-05-31 00:00:00


- **Age:** ranges from 27 to 58, which is well within the expected working-age range for employees.
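- **Date:** ranges from July 2015 to May 2018.
- **Absenteeism Time:** ranges from 0 to 120 hours. These values are plausible, but the distribution is strongly right-skewed: the maximum of 120 hours lies far above the 75th percentile of 8 hours, so the largest values may merit a closer look before modeling.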

*At this point, there are no obvious data entry issues in these key fields.*
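
One conventional way to probe the range of the target is Tukey's 1.5 × IQR rule. The sketch below is an added illustration, not part of the original analysis; it uses only the quartiles and maximum of `Absenteeism Time in Hours` reported by `df.describe()` above, and the variable names are hypothetical.

```python
# Quartiles and maximum of Absenteeism Time in Hours, copied from df.describe()
q1, q3, observed_max = 2.0, 8.0, 120.0

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as potential outliers
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
print(f"Upper fence: {upper_fence} hours")

# Check whether the largest observed value exceeds the fence
max_is_flagged = observed_max > upper_fence
print(f"Maximum flagged as potential outlier: {max_is_flagged}")
```

A flagged value is not necessarily a data entry error; long absences can be genuine, so this is a prompt for inspection rather than removal.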

### Reason for Absence Inspection
Check the value range and frequency distribution for the `Reason for Absence` column, the main categorical predictor.

In [16]:
# Range and unique count
display(Markdown(f"**`Reason for Absence`** ranges from: **{df['Reason for Absence'].min()} to {df['Reason for Absence'].max()}**"))
display(Markdown(f"Number of unique reason codes: **{len(df['Reason for Absence'].unique())}**"))

# Unique reason codes
print(sorted(df['Reason for Absence'].unique()))

**`Reason for Absence`** ranges from: **0 to 28**

Number of unique reason codes: **28**

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28]


**Observation:**

* The `Reason for Absence` column contains **28 unique values**, ranging from **0 to 28**.
* This implies that one of the reason codes (specifically **20**) is **not present** in this dataset.
* The code **0** likely represents **"No reason given"** or a **placeholder**, and may need to be handled separately depending on the modeling strategy.
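
The gap in the code range can be found programmatically instead of by eye. This is a small sketch that assumes the valid codes run from 0 to 28 inclusive; `observed_codes` is copied from the output above, and the other names are illustrative.

```python
# Reason codes observed in the dataset (copied from the printed list above)
observed_codes = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
                  17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28]

# Assumed full code range: 0 (no reason) through 28
expected_codes = range(0, 29)

# Codes that are defined but never occur in the data
missing_codes = sorted(set(expected_codes) - set(observed_codes))
print(missing_codes)
```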

---

## **4. Data Cleaning**

### Standardize Column Names
- We use `snake_case` for column names to ensure consistency and ease of use in code.
- If any columns need clarification beyond formatting (e.g., ambiguous names), we manually rename them as well.

In [20]:
df.columns

Index(['ID', 'Reason for Absence', 'Date', 'Transportation Expense',
       'Distance to Work', 'Age', 'Daily Work Load Average', 'Body Mass Index',
       'Education', 'Children', 'Pets', 'Absenteeism Time in Hours'],
      dtype='object')

In [21]:
# Use lowercase and replace spaces with underscores
df.columns = df.columns.str.replace(' ', '_').str.lower()

In [22]:
# Map column names with their substitution
column_name_map = {'date': 'date_of_absence',
                        'transportation_expense': 'transportaion_expense_dollars',
                        'distance_to_work': 'distance_to_work_miles',
                        'absenteeism_time_in_hours': 'absenteeism_time_hours'
                       }

# Rename specified columns using the dictionary
df.rename(columns = column_name_map, inplace = True)

# Review updated column names
df.columns

Index(['id', 'reason_for_absence', 'date_of_absence',
       'transportaion_expense_dollars', 'distance_to_work_miles', 'age',
       'daily_work_load_average', 'body_mass_index', 'education', 'children',
       'pets', 'absenteeism_time_hours'],
      dtype='object')

### Drop Unnecessary Columns

In [24]:
# Drop the 'id' column
df = df.drop('id', axis=1)

- The `id` column uniquely identifies each employee and is useful for tracking purposes.
- However, it is a **nominal label** and does not carry any predictive or explanatory power related to absenteeism.
- Including it in the model could introduce noise or overfitting without providing real value.

*We drop it from the dataset before proceeding further.*

### Convert Date to Datetime Format

In [27]:
# Check data type of the first value
type(df['date_of_absence'][0])

str

In [28]:
# Convert to datetime format
df['date_of_absence'] = pd.to_datetime(df['date_of_absence'], format='%d/%m/%Y')
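
Passing an explicit `format` matters because day-first strings can be silently misread as month-first. The sketch below uses a few illustrative date strings in the same style as the raw column (the series and variable names are hypothetical) to show both outcomes.

```python
import pandas as pd

# Illustrative day-first strings in the same style as the raw date column
raw_dates = pd.Series(["07/07/2015", "14/07/2015", "05/08/2015"])

# An explicit day-first format removes ambiguity for values such as 05/08/2015
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")
print(parsed.dt.month.tolist())

# A month-first format cannot parse 14/07/2015 (there is no month 14)
try:
    pd.to_datetime(raw_dates, format="%m/%d/%Y")
    month_first_failed = False
except ValueError:
    month_first_failed = True
print("Month-first parse failed:", month_first_failed)
```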

### Create Checkpoint

In [30]:
# Create a checkpoint after basic cleaning
df_cleaned = df.copy()
df_cleaned.head()

Unnamed: 0,reason_for_absence,date_of_absence,transportaion_expense_dollars,distance_to_work_miles,age,daily_work_load_average,body_mass_index,education,children,pets,absenteeism_time_hours
0,26,2015-07-07,289,36,33,239.554,30,1,2,1,4
1,0,2015-07-14,118,13,50,239.554,31,1,1,0,0
2,23,2015-07-15,179,51,38,239.554,31,1,0,0,2
3,7,2015-07-16,279,5,39,239.554,24,1,2,0,4
4,23,2015-07-23,289,36,33,239.554,30,1,2,1,2


- We store a checkpoint of the cleaned dataset in `df_cleaned` to safeguard the work done so far.
- This allows us to refer back or roll back to this version if needed later during encoding, feature engineering, or modeling.

---

## **5. Feature Engineering**

### One-Hot Encoding — Reason for Absence

The `reason_for_absence` column contains categorical codes representing different types of work absences. To make this data suitable for modeling, we'll convert it into **dummy variables** using one-hot encoding. This process transforms each category into a separate binary column:

* `1` indicates the presence of a specific absence reason
* `0` indicates its absence

This encoding format helps machine learning models interpret categorical variables without implying any **ordinal relationship** between the reason codes.

In [35]:
# Create dummy variables for each reason code
reason_columns = pd.get_dummies(df['reason_for_absence'], dtype = int)

# Check how many reasons each record is associated with
reason_columns['number_of_reasons'] = reason_columns.sum(axis=1)

# Review the dummy variables
display(reason_columns)
display(Markdown(f"**Rows with exactly one reason:** {(reason_columns['number_of_reasons'] == 1).sum()}"))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,number_of_reasons
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
696,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
697,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
698,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1


**Rows with exactly one reason:** 700

- We created dummy variables for each of the 28 unique absence reason codes to prepare the data for modeling.
- To validate the structure of the data, we confirmed that each observation is associated with only one reason, meaning there are no multi-label cases.
  This makes one-hot encoding an appropriate and effective choice for representing this categorical feature.

In [37]:
# Drop the temporary check column
reason_columns = reason_columns.drop('number_of_reasons', axis=1)

### Handle Multicollinearity

One-hot encoding introduces one new binary column for each category of `reason_for_absence`. However, keeping all of these dummy columns leads to the **dummy variable trap**, a case of perfect multicollinearity.

> Multicollinearity arises when one feature can be perfectly predicted from a combination of others.
> This is especially problematic in **linear models**, as it causes:
>
> * Unstable coefficient estimates
> * Difficulty in interpreting model results

To prevent this, we **drop the first dummy** variable (reason code `0`, which likely means "no reason given"). This category then acts as the **baseline** (reference), and all other dummy variables are interpreted relative to it:

In [39]:
reason_columns = pd.get_dummies(df['reason_for_absence'], drop_first=True, dtype=int)
reason_columns.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0


To avoid carrying redundant information, we also **drop the original `reason_for_absence` column** from the main dataset. Since we’ve extracted the necessary information using one-hot encoding, and we'll soon be grouping the individual reason codes into broader categories, the original column is no longer needed in its raw form.

In [41]:
df.drop(['reason_for_absence'], axis = 1, inplace = True)

### Grouping Reasons for Absence into Broader Categories
We now have 27 dummy columns, one for each remaining reason code, which can be simplified and grouped into broader categories to enhance interpretability and reduce dimensionality.

Based on domain understanding, we group the reasons as follows:

* **Group 1 — Illness-related:** Codes **1-14** (e.g., infectious diseases, respiratory issues)
* **Group 2 — Pregnancy-related:** Codes **15-17**
* **Group 3 — Injuries and serious conditions:** Codes **18-21** (code 20 does not occur in this dataset)
* **Group 4 — Minor health issues & consultations:** Codes **22-28** (e.g., dental care, physiotherapy)

Each entry in the dataset is associated with only one reason, so we can safely collapse each group by taking the row-wise maximum with `max(axis=1)`, which keeps the encoding binary (`1` if the reason falls in the group, `0` otherwise).

In [43]:
# Create grouped reason columns
reason_group_1 = reason_columns.loc[:, 1:14].max(axis=1)
reason_group_2 = reason_columns.loc[:, 15:17].max(axis=1)
reason_group_3 = reason_columns.loc[:, 18:21].max(axis=1)
reason_group_4 = reason_columns.loc[:, 22:28].max(axis=1)

# Combine into a single DataFrame
grouped_reasons = pd.DataFrame({
    'reason_group_1': reason_group_1,
    'reason_group_2': reason_group_2,
    'reason_group_3': reason_group_3,
    'reason_group_4': reason_group_4,
})

grouped_reasons

Unnamed: 0,reason_group_1,reason_group_2,reason_group_3,reason_group_4
0,0,0,0,1
1,0,0,0,0
2,0,0,0,1
3,1,0,0,0
4,0,0,0,1
...,...,...,...,...
695,1,0,0,0
696,1,0,0,0
697,1,0,0,0
698,0,0,0,1


In [44]:
# Add grouped reason columns to main dataset
df = pd.concat([df, grouped_reasons], axis=1)
df.head()

Unnamed: 0,date_of_absence,transportaion_expense_dollars,distance_to_work_miles,age,daily_work_load_average,body_mass_index,education,children,pets,absenteeism_time_hours,reason_group_1,reason_group_2,reason_group_3,reason_group_4
0,2015-07-07,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,2015-07-14,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,2015-07-15,179,51,38,239.554,31,1,0,0,2,0,0,0,1
3,2015-07-16,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,2015-07-23,289,36,33,239.554,30,1,2,1,2,0,0,0,1


Now, the absence reasons are represented as four binary columns, capturing broader absence types that are interpretable, non-redundant, and ready for modeling.
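
As a cross-check of the grouping, the same four indicators can be derived directly from raw reason codes with range tests. This is an alternative sketch rather than the notebook's method: `raw_codes` holds the first five reason codes from the dataset preview plus two illustrative values (16 and 19), and `group_indicator` is a hypothetical helper.

```python
import pandas as pd

# First five reason codes from the preview, plus two illustrative values
raw_codes = pd.Series([26, 0, 23, 7, 23, 16, 19])

def group_indicator(codes, low, high):
    """Return 1 where the code lies in [low, high] (inclusive), else 0."""
    return codes.between(low, high).astype(int)

grouped = pd.DataFrame({
    "reason_group_1": group_indicator(raw_codes, 1, 14),
    "reason_group_2": group_indicator(raw_codes, 15, 17),
    "reason_group_3": group_indicator(raw_codes, 18, 21),
    "reason_group_4": group_indicator(raw_codes, 22, 28),
})
print(grouped)

# Each row belongs to at most one group; code 0 belongs to none (the baseline)
max_groups_per_row = int(grouped.sum(axis=1).max())
print(f"Maximum groups per row: {max_groups_per_row}")
```

The first five rows reproduce the grouped output shown earlier, with code `0` mapping to all zeros.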

### Deriving Time-Based Features

The `date_of_absence` column was read in **`day/month/year`** format and converted to `datetime` during the cleaning step. Since raw dates are rarely directly useful in modeling, we extract more informative features from it:

* **Day of the week** (`0` = Monday, `6` = Sunday)
* **Month** (`1` to `12`)

In [47]:
# Extract new time-based features
df['day_of_week'] = df['date_of_absence'].dt.weekday
df['month'] = df['date_of_absence'].dt.month

# Drop original date column
df.drop(['date_of_absence'], axis=1, inplace=True)

# Display updated dataframe
df.head()

Unnamed: 0,transportaion_expense_dollars,distance_to_work_miles,age,daily_work_load_average,body_mass_index,education,children,pets,absenteeism_time_hours,reason_group_1,reason_group_2,reason_group_3,reason_group_4,day_of_week,month
0,289,36,33,239.554,30,1,2,1,4,0,0,0,1,1,7
1,118,13,50,239.554,31,1,1,0,0,0,0,0,0,1,7
2,179,51,38,239.554,31,1,0,0,2,0,0,0,1,2,7
3,279,5,39,239.554,24,1,2,0,4,1,0,0,0,3,7
4,289,36,33,239.554,30,1,2,1,2,0,0,0,1,3,7
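
To make the weekday encoding easier to sanity-check, the numeric codes can be compared with day names. The sketch below uses three dates from the dataset preview; the `features` frame and the `day_name` column are illustrative additions.

```python
import pandas as pd

# Three dates taken from the dataset preview
dates = pd.to_datetime(
    pd.Series(["07/07/2015", "15/07/2015", "16/07/2015"]),
    format="%d/%m/%Y",
)

features = pd.DataFrame({
    "day_of_week": dates.dt.weekday,   # 0 = Monday, 6 = Sunday
    "day_name": dates.dt.day_name(),   # human-readable check of the code
    "month": dates.dt.month,           # 1 to 12
})
print(features)
```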


### Handling Categorical Variable — Education
The `education` column is **categorical**, represented by numerical codes:

* `1` → High School
* `2` → Graduate
* `3` → Postgraduate
* `4` → Master's or Doctoral degree

However, if the codes are left as integers, a model would read them as magnitudes on an evenly spaced scale, which is not a relationship we want to assume between values `1`, `2`, `3`, and `4`.

In [49]:
df['education'].value_counts()

education
1    583
3     73
2     40
4      4
Name: count, dtype: int64

#### Simplifying Education Levels

We observe that most employees only have a **high school education** and the remaining categories all represent **higher education levels**.

To simplify the feature and improve interpretability, we'll combine all higher education levels (2–4) into a single category. This transforms the variable into **binary**:

* `0` → High School
* `1` → Higher Education (Graduate, Postgraduate, Master's/Doctoral)

In [51]:
# Simplify education levels into binary categories
df['education'] = df['education'].map(lambda x: 0 if x == 1 else 1)

# Check the new value distribution
df['education'].value_counts()

education
0    583
1    117
Name: count, dtype: int64

This binarization allows the model to distinguish between individuals with and without higher education, which may influence absenteeism behavior.
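
The same binarization can be written as a vectorized comparison. The sketch below is one possible alternative to the `map` call, not the notebook's own code; it rebuilds a stand-in `education` series from the value counts shown earlier.

```python
import pandas as pd

# Stand-in series rebuilt from the value counts above (583, 73, 40, 4)
education = pd.Series([1] * 583 + [3] * 73 + [2] * 40 + [4] * 4)

# Codes above 1 are higher education; the comparison yields booleans, cast to 0/1
education_binary = (education > 1).astype(int)

counts = education_binary.value_counts().to_dict()
print(counts)
```

The counts should match the distribution after the `map` step, since 73 + 40 + 4 = 117.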

---

## **6. Final Dataset Preparation & Export**

### Reorder Columns for Clarity
To improve readability and maintain logical grouping, we’ll reorder the dataset columns:

In [55]:
# View current column order
df.columns.values

array(['transportaion_expense_dollars', 'distance_to_work_miles', 'age',
       'daily_work_load_average', 'body_mass_index', 'education',
       'children', 'pets', 'absenteeism_time_hours', 'reason_group_1',
       'reason_group_2', 'reason_group_3', 'reason_group_4',
       'day_of_week', 'month'], dtype=object)

In [56]:
# Define the desired column order
columns_reordered = [
    'reason_group_1', 'reason_group_2', 'reason_group_3', 'reason_group_4',
    'day_of_week', 'month', 'transportaion_expense_dollars',
    'distance_to_work_miles', 'age', 'daily_work_load_average',
    'body_mass_index', 'education', 'children', 'pets',
    'absenteeism_time_hours'  # Target variable
]

# Apply the new column order
df = df[columns_reordered]

### Create Final Checkpoint
We'll create a backup of the final preprocessed dataset before exporting and take a quick look at the first few rows:

In [58]:
# Final checkpoint
df_preprocessed = df.copy()
df_preprocessed.head()

Unnamed: 0,reason_group_1,reason_group_2,reason_group_3,reason_group_4,day_of_week,month,transportaion_expense_dollars,distance_to_work_miles,age,daily_work_load_average,body_mass_index,education,children,pets,absenteeism_time_hours
0,0,0,0,1,1,7,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,1,7,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,2,7,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,3,7,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,3,7,289,36,33,239.554,30,0,2,1,2


Note: `daily_work_load_average` is the average amount of time spent working per day, in minutes.

### Export the Cleaned Dataset
Save the final preprocessed dataset to a CSV file for modeling:

In [61]:
df_preprocessed.to_csv('absenteeism_preprocessed.csv', index=False)
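
After exporting, it can be worth confirming that the file reads back unchanged. The following is a minimal round-trip sketch on a hypothetical toy frame written to a temporary directory (the frame, file name, and variable names are illustrative); the same check could be run against `absenteeism_preprocessed.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame with a few of the preprocessed columns
toy = pd.DataFrame({
    "reason_group_1": [0, 0, 1],
    "daily_work_load_average": [239.554, 239.554, 239.554],
    "absenteeism_time_hours": [4, 0, 4],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_preprocessed.csv")

    # index=False keeps the row index out of the file, as in the export above
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises an AssertionError if columns, dtypes, or values differ
pd.testing.assert_frame_equal(reloaded, toy)
print("Round trip OK:", reloaded.shape)
```

Without `index=False`, the reloaded frame would gain an extra `Unnamed: 0` column holding the old index.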

---

## **References**

> Martiniano, A., & Ferreira, R. (2012). *Absenteeism at work* \[Dataset]. UCI Machine Learning Repository. [https://doi.org/10.24432/C5X882](https://doi.org/10.24432/C5X882)
> 
> Martiniano, A., Ferreira, R. P., Sassi, R. J., & Affonso, C. (2012). Application of a neuro fuzzy network in prediction of absenteeism at work. *7th Iberian Conference on Information Systems and Technologies (CISTI 2012)*, 1–4.

---