<a href="https://colab.research.google.com/github/techaiweb3/high-value-employee-identification/blob/main/High_Value_Employee_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Identifying High-Value Employees Using Data-Driven Decision Thresholds

## 1. Project Overview & Business Context
Organizations often need to identify high-value employees or candidates early in the decision-making process, before final compensation details are available. This project focuses on building a binary classification system that flags individuals who are likely to belong to a high-salary category using information available prior to salary determination. The goal is to support decision-making rather than automate it.

## 2. Dataset Description & Assumptions
The dataset used in this project is a simplified representation of employee information and includes age, department, and salary. Age and department are assumed to be known at prediction time, while salary is treated as post-outcome information. The dataset is intended for demonstrating a realistic end-to-end machine learning workflow rather than modeling a complete human resources system.


In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

import matplotlib.pyplot as plt


# Creating a simple illustrative dataset
data = {
    "Age": [25, 30, 45, 35, 50, 28, 42, 39, 48, 33],
    "Department": ["IT", "HR", "IT", "Finance", "IT", "HR", "Finance", "IT", "Finance", "HR"],
    "Salary": [40000, 38000, 80000, 60000, 90000, 42000, 75000, 85000, 70000, 45000]
}

df = pd.DataFrame(data)
df


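| | Age | Department | Salary |
|---|---|---|---|
| 0 | 25 | IT | 40000 |
| 1 | 30 | HR | 38000 |
| 2 | 45 | IT | 80000 |
| 3 | 35 | Finance | 60000 |
| 4 | 50 | IT | 90000 |
| 5 | 28 | HR | 42000 |
| 6 | 42 | Finance | 75000 |
| 7 | 39 | IT | 85000 |
| 8 | 48 | Finance | 70000 |
| 9 | 33 | HR | 45000 |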


## 3. Target Definition (Business-Driven)
Instead of predicting exact salary values, the problem is framed as a binary classification task. A target variable indicates whether an individual’s salary meets or exceeds a business-defined threshold, assumed here to be ₹70,000. This approach reflects real-world scenarios where stakeholders are more interested in identifying high-value cases than estimating precise numeric outcomes.

In [4]:
salary_threshold = 70000

df["High_Salary"] = (df["Salary"] >= salary_threshold).astype(int)
df


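| | Age | Department | Salary | High_Salary |
|---|---|---|---|---|
| 0 | 25 | IT | 40000 | 0 |
| 1 | 30 | HR | 38000 | 0 |
| 2 | 45 | IT | 80000 | 1 |
| 3 | 35 | Finance | 60000 | 0 |
| 4 | 50 | IT | 90000 | 1 |
| 5 | 28 | HR | 42000 | 0 |
| 6 | 42 | Finance | 75000 | 1 |
| 7 | 39 | IT | 85000 | 1 |
| 8 | 48 | Finance | 70000 | 1 |
| 9 | 33 | HR | 45000 | 0 |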


## 4. Feature Selection & Leakage Prevention
Only features that would be available at prediction time are used as model inputs. Salary is deliberately excluded from the feature set to prevent data leakage, as it directly represents the outcome being predicted. This ensures that model evaluation remains realistic and that performance metrics are not artificially inflated.


In [5]:
X = df[["Age", "Department"]]
y = df["High_Salary"]
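
The leakage rule described in this section can also be enforced in code rather than by convention. The following is a minimal sketch and not part of the original notebook: the helper `assert_no_leakage` and the `POST_OUTCOME_COLUMNS` set are hypothetical names introduced only for illustration.

```python
# Hypothetical guard: fail fast if a post-outcome column is used as a feature.
POST_OUTCOME_COLUMNS = {"Salary", "High_Salary"}  # assumed list of leaky columns


def assert_no_leakage(feature_columns, leaky_columns=POST_OUTCOME_COLUMNS):
    """Raise ValueError if any feature column is only known after the outcome."""
    leaked = sorted(set(feature_columns) & set(leaky_columns))
    if leaked:
        raise ValueError(f"Leaky columns in feature set: {leaked}")
    return list(feature_columns)


features = assert_no_leakage(["Age", "Department"])  # passes without error
print(features)
```

A check like this is most useful once the feature list grows, when a salary-derived column could slip in unnoticed.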


## 5. Encoding & Feature Preparation
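The age feature is numeric and is used directly. The department feature is categorical and does not have any inherent ordering, so one-hot encoding is applied. Because `drop_first=True` is set, the first category in sorted order (Finance) is dropped and serves as the reference level: a row with both `Department_HR` and `Department_IT` set to False belongs to Finance. This encoding allows the model to process categorical information without introducing false ordinal relationships between departments.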

In [6]:
X_encoded = pd.get_dummies(X, drop_first=True)
X_encoded


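| | Age | Department_HR | Department_IT |
|---|---|---|---|
| 0 | 25 | False | True |
| 1 | 30 | True | False |
| 2 | 45 | False | True |
| 3 | 35 | False | False |
| 4 | 50 | False | True |
| 5 | 28 | True | False |
| 6 | 42 | False | False |
| 7 | 39 | False | True |
| 8 | 48 | False | False |
| 9 | 33 | True | False |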
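
One practical consequence of encoding with `pd.get_dummies` is that a new record must be encoded into the same columns the model was trained on. The sketch below shows one possible way to do that with `DataFrame.reindex`. It is an illustration, not the notebook's method, and the names `train_demo` and `new_record` are hypothetical.

```python
import pandas as pd

# Small illustrative training frame (hypothetical rows, same columns as above).
train_demo = pd.DataFrame(
    {"Age": [25, 30, 35], "Department": ["IT", "HR", "Finance"]}
)
train_encoded = pd.get_dummies(train_demo, drop_first=True, dtype=int)
# Columns: Age, Department_HR, Department_IT (Finance is the dropped baseline).

# A single new record only contains one department, so encoding it alone
# produces only the dummy column for that department.
new_record = pd.DataFrame({"Age": [41], "Department": ["IT"]})
new_encoded = pd.get_dummies(new_record, dtype=int).reindex(
    columns=train_encoded.columns, fill_value=0
)

print(new_encoded)
```

Reindexing against the training columns fills absent categories with 0 and silently drops departments the model never saw, which may or may not be the desired behavior.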


## 6. Train–Test Split Strategy
The dataset is divided into training and testing subsets to evaluate model performance on unseen data. A random train–test split is used with `test_size=0.3`, and a fixed random seed ensures reproducibility. Holding out a test set gives an estimate of how well the model generalizes beyond the training data, although with only three test rows out of ten that estimate is very noisy.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.3, random_state=42
)
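
With so few rows, a purely random split can leave one class under-represented in the test set. One common option, shown here as a sketch rather than as the notebook's approach, is the `stratify` argument of `train_test_split`, which keeps class proportions roughly equal in both subsets. The variable names below are hypothetical; the ages and labels are copied from the tables above.

```python
from sklearn.model_selection import train_test_split

ages = [25, 30, 45, 35, 50, 28, 42, 39, 48, 33]
high_salary = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0]  # High_Salary column from Section 3

X_demo = [[age] for age in ages]  # one feature is enough to show the split

# stratify=high_salary asks for similar class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, high_salary, test_size=0.3, random_state=42, stratify=high_salary
)

print("train size:", len(y_tr), "test size:", len(y_te))
print("test labels:", sorted(y_te))
```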


## 7. Model Selection & Training
Logistic regression is selected as the baseline model due to its interpretability and suitability for binary classification problems. The model is trained using only the training dataset, allowing it to learn relationships between the input features and the probability of an individual belonging to the high-salary category.

In [8]:
model = LogisticRegression()
model.fit(X_train, y_train)
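
Since interpretability is the stated reason for choosing logistic regression, it can help to look at the fitted coefficients. The following self-contained sketch rebuilds the same data and split and then pairs each coefficient with its feature name. The names `df_demo`, `clf`, and `coefficients` are hypothetical, and `max_iter=1000` is an assumption added as a precaution because `Age` is not scaled.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df_demo = pd.DataFrame({
    "Age": [25, 30, 45, 35, 50, 28, 42, 39, 48, 33],
    "Department": ["IT", "HR", "IT", "Finance", "IT", "HR", "Finance", "IT", "Finance", "HR"],
    "Salary": [40000, 38000, 80000, 60000, 90000, 42000, 75000, 85000, 70000, 45000],
})
y_demo = (df_demo["Salary"] >= 70000).astype(int)
X_demo = pd.get_dummies(df_demo[["Age", "Department"]], drop_first=True, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=1000)  # assumption: extra iterations as a precaution
clf.fit(X_tr, y_tr)

# Each coefficient is the change in log-odds of High_Salary per unit of the feature.
coefficients = pd.Series(clf.coef_[0], index=X_demo.columns)
print(coefficients)
print("intercept:", clf.intercept_[0])
```

A positive coefficient raises the predicted probability of the high-salary class as the feature increases. With only seven training rows, the magnitudes should be read as illustrative rather than reliable.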


## 8. Evaluation Using Confusion Matrix
Model performance is evaluated using a confusion matrix along with precision and recall metrics. The confusion matrix provides a clear breakdown of correct predictions and different types of errors, enabling analysis of how well the model identifies high-salary cases and where misclassifications occur.

Predictions

In [9]:
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)  # rows: actual class, columns: predicted class

print(classification_report(y_test, y_pred))


[[2 0]
 [0 1]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         2
           1       1.00      1.00      1.00         1

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3
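
To make the link between the confusion matrix and the reported metrics concrete, the sketch below computes precision and recall for the positive class directly from the four counts. For binary labels, `confusion_matrix(...).ravel()` returns them in the order tn, fp, fn, tp. The helper `precision_recall` is a hypothetical name, and the second set of counts is an invented example, not a result from this notebook.

```python
def precision_recall(tn, fp, fn, tp):
    """Return (precision, recall) for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of flagged cases, how many were right
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of true cases, how many were flagged
    return precision, recall


# Counts implied by the report above: 2 true negatives, 1 true positive, no errors.
reported = precision_recall(tn=2, fp=0, fn=0, tp=1)

# Hypothetical variation: a single false positive on the same three rows.
one_error = precision_recall(tn=1, fp=1, fn=0, tp=1)

print(reported)
print(one_error)
```

The hypothetical case shows how sensitive the metrics are on a three-row test set: one misclassified row is enough to halve precision.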



## 9. Threshold Tuning & Trade-offs
Rather than relying on the default probability threshold of 0.5, the decision threshold is adjusted to explore the trade-off between recall and precision. Lowering the threshold lets the model capture more high-salary individuals at the cost of additional false positives. Because the threshold is applied to predicted probabilities after training, the model’s behavior can be aligned with business priorities without retraining. On this three-row test set, a threshold of 0.3 yields the same predictions as the default, so the trade-off is not visible in the report that follows.

*Here the threshold is lowered from the default 0.5 to 0.3 to favor recall.*

In [10]:
y_prob = model.predict_proba(X_test)[:, 1]

# Custom threshold
threshold = 0.3
y_pred_custom = (y_prob >= threshold).astype(int)

print(confusion_matrix(y_test, y_pred_custom))  # rows: actual class, columns: predicted class

print(classification_report(y_test, y_pred_custom))


[[2 0]
 [0 1]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         2
           1       1.00      1.00      1.00         1

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3
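
Because the three-row test set is too small to show any difference between thresholds, the following sketch illustrates the trade-off on a slightly larger set of made-up probabilities. Both the labels and the scores are hypothetical values chosen for illustration and do not come from the model above. The helper `precision_recall_at` is likewise a hypothetical name.

```python
# Hypothetical true labels and predicted probabilities for the positive class.
y_true_demo = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob_demo = [0.10, 0.35, 0.40, 0.55, 0.60, 0.80, 0.32, 0.45]


def precision_recall_at(threshold, y_true=y_true_demo, y_prob=y_prob_demo):
    """Apply a decision threshold and return (precision, recall) for class 1."""
    y_hat = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, h in zip(y_true, y_hat) if t == 1 and h == 1)
    fp = sum(1 for t, h in zip(y_true, y_hat) if t == 0 and h == 1)
    fn = sum(1 for t, h in zip(y_true, y_hat) if t == 1 and h == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for t in (0.5, 0.3):
    precision, recall = precision_recall_at(t)
    print(f"threshold={t}: precision={precision:.2f}, recall={recall:.2f}")
```

On these made-up scores, lowering the threshold from 0.5 to 0.3 raises recall from 0.50 to 1.00 while precision falls from 0.67 to 0.57, which is the pattern described in this section.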



## 10. Final Recommendation & Limitations
The model is best used as a decision-support or screening tool to help prioritize potentially high-value individuals for further review. The simplified nature of the dataset and the limited set of features mean that the model should not be treated as a standalone decision-maker. Human judgment and additional contextual information remain essential.