# 📈 Overtime Prediction using AutoGluon (Regression)

This notebook demonstrates how to use [AutoGluon](https://auto.gluon.ai) to predict **OvertimeHours** for employees based on other workplace factors such as salary, department, and performance.

The goal is to use AutoML to train a regression model that estimates overtime workload. Such estimates could support workload planning and HR decisions, for example:
- Identifying employees at risk of burnout or overwork.
- Supporting workforce planning and forecasting labor needs across departments.
- Complementing classification models that assess employee engagement.


### 📚 Data Dictionary

| Feature           | Type        | Description                             |
|-------------------|-------------|-----------------------------------------|
| Gender            | Categorical | Employee's gender                       |
| YearsWorked       | Numeric     | Number of years worked                  |
| Department        | Categorical | Department name                         |
| Country           | Categorical | Country of work                         |
| MonthlySalary     | Numeric     | Monthly salary                          |
| AnnualSalary      | Numeric     | Annual salary                           |
| JobRate           | Numeric     | Job performance rating (1–5)            |
| SickLeaves        | Numeric     | Number of sick leave days               |
| UnpaidLeaves      | Numeric     | Number of unpaid leave days             |
| Location_ID       | Numeric     | Encoded office location                 |
| Department_ID     | Numeric     | Encoded department                      |
| **OvertimeHours** | **Target**  | Total overtime hours (continuous value) |


### 📥 Load libraries and data

In [1]:
import pandas as pd
from autogluon.tabular import TabularPredictor

# Load the cleaned dataset (local Windows path; adjust for your machine)
df = pd.read_csv(r"C:\Users\19024\DataScience\Employees_clean.csv")

### Drop identifier columns and potential target leakage

- **Columns removed**:
  - `Performance_ID`, `Employee_ID`, `FirstName`, `LastName`, and `StartDate` are dropped because they are identifiers, names, or raw dates with little predictive value, or because they could introduce **target leakage**.
  - Keeping these columns could mislead the model or inflate its apparent performance.

- **Target variable**:
  - `'OvertimeHours'` is selected as the **target** for prediction.
  - This is a numeric variable, making the task a **regression problem**.

In [2]:
df_model = df.drop(['Performance_ID', 'Employee_ID', 'FirstName', 'LastName', 'StartDate'], axis=1)

# Set up regression target
target = 'OvertimeHours'

### 🧠 Train AutoGluon regression model

This step uses **AutoGluon's `TabularPredictor`** to automatically train multiple regression models.

- **`problem_type='regression'`**: Tells AutoGluon that the target (`OvertimeHours`) is a continuous numeric value.

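- **`.fit(df_model)`**: Trains many models (e.g., LightGBM, XGBoost, neural networks) using the cleaned dataset. Because no `presets` argument is passed, AutoGluon falls back to the `'medium'` preset, as the log output notes.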

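- **`predictor.leaderboard()`**:
  - Displays all trained models ranked by validation score. The metric is RMSE, which AutoGluon reports with a negative sign so that a higher `score_val` is always better.
  - This helps identify the **best-performing model** for making predictions.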


In [3]:
predictor = TabularPredictor(label=target, problem_type='regression').fit(df_model)

# View leaderboard of trained models
predictor.leaderboard()

No path specified. Models will be saved in: "AutogluonModels\ag-20250414_224956"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.2
Python Version:     3.11.9
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.26100
CPU Count:          8
Memory Avail:       1.94 GB / 11.78 GB (16.4%)
Disk Space Avail:   43.84 GB / 237.36 GB (18.5%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong accuracy with fast inference speed.
	presets=

|   | model               | score_val  | eval_metric             | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---------------------|------------|-------------------------|---------------|----------|------------------------|-------------------|-------------|-----------|-----------|
| 0 | WeightedEnsemble_L2 | -28.841677 | root_mean_squared_error | 0.058432      | 7.790944 | 0.002012               | 0.035606          | 2           | True      | 11        |
| 1 | NeuralNetFastAI     | -28.904347 | root_mean_squared_error | 0.024251      | 2.481977 | 0.024251               | 2.481977          | 1           | True      | 8         |
| 2 | LightGBMXT          | -28.950909 | root_mean_squared_error | 0.008325      | 1.138332 | 0.008325               | 1.138332          | 1           | True      | 3         |
| 3 | LightGBM            | -28.975198 | root_mean_squared_error | 0.00452       | 0.421193 | 0.00452                | 0.421193          | 1           | True      | 4         |
| 4 | CatBoost            | -28.984411 | root_mean_squared_error | 0.008508      | 8.150041 | 0.008508               | 8.150041          | 1           | True      | 6         |
| 5 | XGBoost             | -28.992312 | root_mean_squared_error | 0.006512      | 0.49973  | 0.006512               | 0.49973           | 1           | True      | 9         |
| 6 | LightGBMLarge       | -29.02923  | root_mean_squared_error | 0.006588      | 0.863825 | 0.006588               | 0.863825          | 1           | True      | 10        |
| 7 | KNeighborsUnif      | -30.214883 | root_mean_squared_error | 0.032169      | 5.27336  | 0.032169               | 5.27336           | 1           | True      | 1         |
| 8 | RandomForestMSE     | -30.973644 | root_mean_squared_error | 0.050382      | 0.969945 | 0.050382               | 0.969945          | 1           | True      | 5         |
| 9 | ExtraTreesMSE       | -30.9778   | root_mean_squared_error | 0.057063      | 0.854576 | 0.057063               | 0.854576          | 1           | True      | 7         |
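
The `score_val` values above are computed on validation data that AutoGluon sets aside internally during `fit`. As a complementary check, one could also keep a separate test split that the predictor never sees. The following is a minimal sketch of that idea, not part of the original notebook: the helper names `split_holdout` and `fit_and_evaluate`, the 80/20 ratio, and the seed are all assumptions chosen for illustration.

```python
import pandas as pd


def split_holdout(df: pd.DataFrame, test_frac: float = 0.2, seed: int = 0):
    """Randomly split df into (train, test).

    Assumes a unique index, such as the default RangeIndex from read_csv.
    The fraction and seed are illustrative defaults, not tuned values.
    """
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(index=test.index)
    return train, test


def fit_and_evaluate(train: pd.DataFrame, test: pd.DataFrame,
                     label: str = "OvertimeHours"):
    """Fit on the training rows only, then score on the unseen test rows."""
    # Imported lazily so split_holdout can be used without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label, problem_type="regression").fit(train)
    # evaluate() returns a dict of metrics; error metrics such as RMSE are
    # reported with a flipped sign so that higher is always better.
    return predictor, predictor.evaluate(test)
```

With the notebook's `df_model`, this could be called as `train_df, test_df = split_holdout(df_model)` followed by `predictor, scores = fit_and_evaluate(train_df, test_df)`.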


## 📊 Summary of Regression Results

AutoGluon automatically trained several models to predict `OvertimeHours` (a regression task). It evaluated model performance using **Root Mean Squared Error (RMSE)**, where **lower values indicate better predictions**.

### 🥇 Top Model:
- **Model**: `WeightedEnsemble_L2`
- **Validation RMSE**: **28.84** (shown as a `score_val` of -28.84 in the leaderboard)

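The model's predictions of **OvertimeHours** are typically about **28.84 hours** away from the actual values in the validation data. RMSE is not a plain average of the errors: each error is squared before averaging, so large misses count more heavily than small ones.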
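
To show how RMSE relates to a plain average error, here is a small worked example on made-up numbers (invented for illustration, not taken from the Employees dataset). It compares RMSE with Mean Absolute Error (MAE) when one prediction is far off.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: the plain average of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: square, average, then take the square root."""
    mean_sq = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mean_sq)


# Hypothetical overtime hours; the last prediction misses by 38 hours.
actual = [10, 20, 30, 40]
predicted = [12, 18, 32, 78]  # absolute errors: 2, 2, 2, 38

print(mae(actual, predicted))             # 11.0
print(round(rmse(actual, predicted), 2))  # 19.08
```

The single large miss pulls RMSE well above MAE, which is why an RMSE of 28.84 should be read as a typical error size that is sensitive to outliers rather than as the exact average error.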

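This ensemble model combined predictions from individual models, mainly:
- `NeuralNetFastAI`
- `KNeighborsUnif`

`NeuralNetFastAI` is the strongest single model, while `KNeighborsUnif` ranks only 8th of the 10 models on its own. The ensemble has the best validation score, although its margin over `NeuralNetFastAI` alone (RMSE 28.90) is small.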
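
The leaderboard itself offers a way to cross-check which base models the ensemble relies on. For a stacked model, `fit_time` and `pred_time_val` include the time of the base models it depends on, while the `_marginal` columns count only the ensemble's own step. A minimal sketch of that arithmetic, using values copied from the leaderboard (the variable names are illustrative):

```python
# Times in seconds, copied from the leaderboard.
fit_time = {
    "WeightedEnsemble_L2": 7.790944,
    "NeuralNetFastAI": 2.481977,
    "KNeighborsUnif": 5.273360,
}
pred_time = {
    "WeightedEnsemble_L2": 0.058432,
    "NeuralNetFastAI": 0.024251,
    "KNeighborsUnif": 0.032169,
}
ensemble_fit_marginal = 0.035606
ensemble_pred_marginal = 0.002012

# If the ensemble uses exactly these two base models, their times plus the
# ensemble's own marginal time should add up to the ensemble's total.
fit_gap = fit_time["WeightedEnsemble_L2"] - (
    fit_time["NeuralNetFastAI"] + fit_time["KNeighborsUnif"] + ensemble_fit_marginal
)
pred_gap = pred_time["WeightedEnsemble_L2"] - (
    pred_time["NeuralNetFastAI"] + pred_time["KNeighborsUnif"] + ensemble_pred_marginal
)

print(abs(fit_gap) < 1e-5, abs(pred_gap) < 1e-5)
```

Both totals agree to within rounding, which is consistent with the ensemble drawing on these two models.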

### 📌 Clarification: Regression vs. Classification (Cross-Notebook Comparison)

This notebook used **regression** to predict the continuous variable `OvertimeHours`, which is a numeric field indicating how many extra hours an employee works.

Previous notebooks in the [general project folder](https://github.com/w0435723/BIA_Repository/tree/main/Applied%20Data%20Science) explore the same Employees dataset with **PyCaret** and **Scikit-learn**. They focus on **classification** models that predict the categorical variable `EngagementLevel` (e.g., `High`, `Medium`).

### 🔄 Why the Results Can't Be Directly Compared:
- **Regression Metrics**: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R² Score
- **Classification Metrics**: Accuracy, Precision, Recall, F1-score, AUC

These two model types serve **different predictive goals** and use **different evaluation methods**, so comparing RMSE to classification accuracy is not meaningful.
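
As a small illustration, the sketch below computes one regression metric (R²) and one classification metric (accuracy) on made-up values. The data is hypothetical and does not come from either notebook. The point is that the two scores take different kinds of input and answer different questions, even when both happen to fall between 0 and 1.

```python
def r_squared(actual, predicted):
    """Share of the target's variance explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def accuracy(actual_labels, predicted_labels):
    """Share of labels predicted exactly right."""
    matches = sum(a == p for a, p in zip(actual_labels, predicted_labels))
    return matches / len(actual_labels)


# Regression: numeric targets (hypothetical overtime hours).
hours_true = [10, 20, 30]
hours_pred = [12, 18, 33]

# Classification: categorical targets (hypothetical engagement levels).
labels_true = ["High", "Medium", "High", "Medium"]
labels_pred = ["High", "Medium", "Medium", "Medium"]

print(round(r_squared(hours_true, hours_pred), 3))  # 0.915
print(accuracy(labels_true, labels_pred))           # 0.75
```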

### 🔗 View the Engagement Classification Notebook

 [Click here to view the Employee Engagement Classification Notebook (AutoGluon)]()

### 💡 Conclusion:

While this regression model does not replace the engagement classification work, it **adds value** by offering a **numerical prediction** that can be analyzed alongside engagement insights.
