# 🎯 Employee Engagement Classification (PyCaret)

## 📘 Introduction
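This notebook uses PyCaret to classify employee engagement levels. The labels are derived as Low, Medium, or High; because only two records qualify as Low, that class is merged into Medium and the final model is a binary Medium-vs-High classifier.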


## 📚 Data Dictionary
| Feature         | Type        | Description            |
|-----------------|-------------|------------------------|
| Gender          | Categorical | Employee's gender      |
| StartDate       | Date        | Date employee started  |
| YearsWorked     | Numeric     | Number of years worked |
| Department      | Categorical | Department name        |
| Country         | Categorical | Country of work        |
| MonthlySalary   | Numeric     | Monthly salary         |
| AnnualSalary    | Numeric     | Annual salary          |
| JobRate         | Numeric     | Job performance rating |
| SickLeaves      | Numeric     | Sick leave days        |
| UnpaidLeaves    | Numeric     | Unpaid leave days      |
| OvertimeHours   | Numeric     | Overtime hours         |
| EngagementLevel | Target      | Low / Medium / High    |

In [2]:
# 📥 Load the data
import pandas as pd
from pycaret.classification import setup, compare_models, evaluate_model, predict_model, pull

df = pd.read_csv(r"C:\Users\19024\DataScience\Employees_clean.csv")

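## 📊 Original Class Distribution
These are the counts produced by the engagement rules in the next cell, before 'Low' is merged into 'Medium':
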
| Engagement Level | Count |
|------------------|-------|
| Medium           | 666   |
| High             | 21    |
| Low              | 2 ❗   |


In [3]:
# 🛠️ Derive Engagement Level
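def get_engagement_level(row):
    """Label one employee row as 'Low', 'High', or 'Medium' using fixed business rules."""
    # Low: poor rating, little overtime, and many sick and unpaid leave days
    if (row['JobRate'] <= 2) and (row['OvertimeHours'] < 10) and (row['SickLeaves'] >= 5) and (row['UnpaidLeaves'] >= 5):
        return 'Low'
    # High: strong rating, heavy overtime, and almost no leave
    elif (row['JobRate'] >= 4) and (row['OvertimeHours'] > 50) and (row['SickLeaves'] <= 1) and (row['UnpaidLeaves'] <= 1):
        return 'High'
    # Everything else falls into the middle band
    else:
        return 'Medium'

# Apply the rules row by row
df['EngagementLevel'] = df.apply(get_engagement_level, axis=1)
# Merge the rare 'Low' class into 'Medium' (only 2 records qualify as 'Low')
df['EngagementLevel'] = df['EngagementLevel'].replace({'Low': 'Medium'})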

## 🧾 Engagement Level Setup
The dataset includes employee information such as job performance (`JobRate`), absences, and overtime.
We created a custom `EngagementLevel` column by applying business rules to these features.

🔸 The rules produce 3 engagement levels: **Low**, **Medium**, and **High**.

🔸 However, only **2 records** were labeled 'Low'. SMOTE with its default of 5 nearest neighbors needs at least 6 samples in a class, so it cannot oversample a class that small.

➡️ 'Low' was therefore merged into 'Medium' before training, leaving a binary target.
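
The size constraint can be checked before calling `setup()`. The snippet below is a minimal sketch, not part of the original notebook: the helper name `classes_too_small_for_smote` is hypothetical, and `k_neighbors=5` is assumed to match the SMOTE default in imbalanced-learn.

```python
from collections import Counter

def classes_too_small_for_smote(labels, k_neighbors=5):
    """Return the classes that have fewer than k_neighbors + 1 samples."""
    counts = Counter(labels)
    min_required = k_neighbors + 1  # each sample needs k other samples as neighbors
    return {cls: n for cls, n in counts.items() if n < min_required}

# Class counts taken from the original distribution table
labels = ["Medium"] * 666 + ["High"] * 21 + ["Low"] * 2
print(classes_too_small_for_smote(labels))  # {'Low': 2}

# After merging 'Low' into 'Medium', no class is below the threshold
merged = ["Medium" if label == "Low" else label for label in labels]
print(classes_too_small_for_smote(merged))  # {}
```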


In [4]:
# 🔍 View class distribution after merging
print(df['EngagementLevel'].value_counts())

EngagementLevel
Medium    668
High       21
Name: count, dtype: int64


## 📊 Updated Class Distribution
After merging 'Low' into 'Medium', we printed the new class counts to confirm:

- **Medium** = 668
- **High** = 21

This distribution is still heavily imbalanced (roughly 32 'Medium' records for every 'High' record), so we'll use **SMOTE** (Synthetic Minority Over-sampling Technique) in the setup to balance it.
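
For intuition, SMOTE creates each synthetic minority record by interpolating between an existing minority record and one of its nearest minority-class neighbors. The snippet below is a simplified sketch of that interpolation step only; it is not the imbalanced-learn implementation, the function name `smote_like_sample` is hypothetical, and the two feature vectors are made-up values rather than rows from the dataset.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the line segment between point and neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(123)
high_a = [4.0, 55.0]  # hypothetical (JobRate, OvertimeHours) of a 'High' employee
high_b = [5.0, 60.0]  # hypothetical nearby 'High' employee
synthetic = smote_like_sample(high_a, high_b, rng)
print(synthetic)  # a point somewhere between high_a and high_b
```

Real SMOTE also has to find the nearest neighbors and pick one at random, which is why each class needs more samples than the number of neighbors.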


In [5]:
# ⚙️ Set up PyCaret with class imbalance fix
clf = setup(
    data=df,
    target='EngagementLevel',
    session_id=123,
    categorical_features=['Gender', 'Department', 'Country'],
    ignore_features=['Performance_ID', 'Employee_ID', 'FirstName', 'LastName', 'StartDate', 'Department_ID', 'Location_ID'],
    fix_imbalance=True
)

| Description                 | Value              |
|-----------------------------|--------------------|
| Session id                  | 123                |
| Target                      | EngagementLevel    |
| Target type                 | Binary             |
| Target mapping              | High: 0, Medium: 1 |
| Original data shape         | (689, 18)          |
| Transformed data shape      | (1141, 34)         |
| Transformed train set shape | (934, 34)          |
| Transformed test set shape  | (207, 34)          |
| Ignore features             | 7                  |
| Numeric features            | 7                  |


## ⚙️ PyCaret Setup with SMOTE
We used PyCaret's `setup()` to initialize the machine learning environment.

Key options used:
- `target='EngagementLevel'`: our classification label
- `fix_imbalance=True`: applies SMOTE (PyCaret's default resampler) to the training data
- `categorical_features`: tells PyCaret which features are non-numeric
- `ignore_features`: removes identifiers like `Employee_ID` that don't help model learning

The summary shows a binary target and a transformed train set of 934 rows, up from the 482 original training rows (689 total minus 207 test), which indicates that oversampling was applied.
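
The row counts in the setup summary can be cross-checked with a little arithmetic. This is a sketch that uses only the shapes reported above; the per-class split of the training set is not shown in the notebook, so the last two values are inferred under the assumption that SMOTE balanced the two classes 1:1.

```python
total_rows = 689        # "Original data shape"
test_rows = 207         # "Transformed test set shape"
train_rows_after = 934  # "Transformed train set shape"

train_rows_before = total_rows - test_rows              # rows in the training split
synthetic_rows = train_rows_after - train_rows_before   # rows added by oversampling

# Assumption: SMOTE balanced the two classes 1:1 in the training set
per_class_after = train_rows_after // 2
minority_before = train_rows_before - per_class_after

print(train_rows_before, synthetic_rows, per_class_after, minority_before)
# 482 452 467 15
```

Under that assumption, about 15 of the 21 'High' records landed in the training split, and the test set was left unresampled.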


In [6]:
# 🔢 Compare models
best_model = compare_models()

|          | Model                           | Accuracy | AUC    | Recall | Prec.  | F1     | Kappa  | MCC    | TT (Sec) |
|----------|---------------------------------|----------|--------|--------|--------|--------|--------|--------|----------|
| lightgbm | Light Gradient Boosting Machine | 0.9917   | 0.9935 | 0.9917 | 0.9952 | 0.9925 | 0.893  | 0.9072 | 0.21     |
| ada      | Ada Boost Classifier            | 0.9896   | 0.9625 | 0.9896 | 0.9863 | 0.9875 | 0.7484 | 0.7565 | 0.128    |
| gbc      | Gradient Boosting Classifier    | 0.9896   | 0.9479 | 0.9896 | 0.9863 | 0.9875 | 0.7484 | 0.7565 | 0.165    |
| dt       | Decision Tree Classifier        | 0.9875   | 0.8479 | 0.9875 | 0.9822 | 0.9844 | 0.6484 | 0.6565 | 0.084    |
| et       | Extra Trees Classifier          | 0.9875   | 0.9926 | 0.9875 | 0.9816 | 0.9834 | 0.5972 | 0.6099 | 0.154    |
| nb       | Naive Bayes                     | 0.9834   | 0.9957 | 0.9834 | 0.9879 | 0.9845 | 0.7299 | 0.7495 | 0.082    |
| rf       | Random Forest Classifier        | 0.9813   | 0.9957 | 0.9813 | 0.9713 | 0.9755 | 0.4293 | 0.4378 | 0.174    |
| lr       | Logistic Regression             | 0.975    | 0.9894 | 0.975  | 0.9834 | 0.9777 | 0.5996 | 0.6241 | 1.533    |
| qda      | Quadratic Discriminant Analysis | 0.9689   | 0.6474 | 0.9689 | 0.9389 | 0.9537 | 0.0    | 0.0    | 0.072    |
| ridge    | Ridge Classifier                | 0.9669   | 0.985  | 0.9669 | 0.986  | 0.9734 | 0.6705 | 0.7149 | 0.075    |


## 🔍 Model Comparison Results
We used PyCaret's `compare_models()` function to automatically train and cross-validate several algorithms.

📈 Top performers based on **accuracy, F1-score, and precision**:
- 🥇 **LightGBM** - Accuracy: 0.9917, F1: 0.9925, Kappa: 0.893
- AdaBoost and Gradient Boosting tied just behind (Accuracy: 0.9896, F1: 0.9875), but with a noticeably lower Kappa (0.7484)
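
Because the classes are so imbalanced, it helps to compare these accuracies against a majority-class baseline. The sketch below uses the class counts printed earlier in the notebook; it is an illustration rather than something the notebook ran.

```python
# Class counts after merging 'Low' into 'Medium'
counts = {"Medium": 668, "High": 21}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total  # accuracy of always predicting the majority

print(majority_class, round(baseline_accuracy, 4))  # Medium 0.9695
```

A model that always predicts 'Medium' would already score about 0.97, which is roughly where QDA sits (accuracy 0.9689, Kappa 0.0). Kappa and MCC are therefore more informative than accuracy for ranking the models in this table.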



In [7]:
# 📊 Evaluate best model
evaluate_model(best_model)


## 📉 Visual Model Evaluation
Using `evaluate_model()`, we opened PyCaret's built-in interactive plot widget for the selected model.

This let us explore:
- ROC and AUC curves
- Confusion matrix
- Feature importance
- Model pipeline stages

✅ The plots indicated that LightGBM handled both engagement classes well.


In [8]:
# 📈 Show prediction results and metrics
predict_model(best_model)
results = pull()
results

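| Model                           | Accuracy | AUC    | Recall | Prec.  | F1     | Kappa  | MCC    |
|---------------------------------|----------|--------|--------|--------|--------|--------|--------|
| Light Gradient Boosting Machine | 0.9903   | 0.9764 | 0.9903 | 0.9903 | 0.9903 | 0.8284 | 0.8284 |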




## ✅ Summary
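This notebook used PyCaret with SMOTE to address class imbalance. Because SMOTE could not oversample a class with only two records, the smallest class ('Low') was first merged into 'Medium', leaving a binary Medium-vs-High problem. The selected LightGBM model reaches 0.9903 accuracy on the hold-out test set.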

### 📊 Final Model Metrics
Using `predict_model()` and `pull()`, we evaluated our model on unseen test data.

### Final Results for LightGBM:
| Metric      | Score  |
|-------------|--------|
| Accuracy    | 0.9903 |
| AUC         | 0.9764 |
| Recall      | 0.9903 |
| Precision   | 0.9903 |
| F1 Score    | 0.9903 |
| Kappa       | 0.8284 |
| MCC         | 0.8284 |

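Accuracy, recall, precision, and F1 all agree at 0.9903. Kappa and MCC are lower (0.8284) because they correct for the dominant 'Medium' class, so they are the better guide to how reliably the minority 'High' class is predicted. The confusion matrix from `evaluate_model()` shows the per-class errors directly.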
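
To see why Kappa can sit well below accuracy, the sketch below computes both from a 2x2 confusion matrix. The counts are hypothetical: the notebook does not print its confusion matrix, and these values were chosen only because they are consistent with the reported test-set size (207), accuracy, and Kappa. The function name `cohen_kappa` is likewise illustrative.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    observed = (tp + tn) / total  # plain accuracy
    # Agreement expected by chance, from the actual and predicted marginals
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical counts with 'High' as the positive class (not taken from the notebook)
tp, fn, fp, tn = 5, 1, 1, 200

accuracy = (tp + tn) / (tp + fn + fp + tn)
kappa = cohen_kappa(tp, fn, fp, tn)
print(round(accuracy, 4), round(kappa, 4))  # 0.9903 0.8284
```

With counts like these, only two errors out of 207 predictions are enough to pull Kappa down, because both errors involve the small 'High' class.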
