# Employee Compensation Fairness & Market Benchmarking Model

## Problem Statement
Our objective is to develop a robust machine learning model that accurately predicts optimal employee salary ranges based on internal factors (like experience, education, and role) and external market benchmarks. This model will enable the Human Resources department to ensure competitive and equitable compensation structures, proactively identify and rectify pay disparities, support data-driven salary negotiations for new hires, and optimize overall workforce budgeting for sustainable growth and talent retention.

## 1. Importing Libraries
We start by importing the libraries used throughout this project.

In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

## 2. Data Loading and Initial Exploration
We will use the 'Adult Income Dataset' (`adult.csv`) from [Kaggle](https://www.kaggle.com/datasets/uciml/adult-census-income). It contains demographic and employment-related features along with an income bracket (`<=50K` or `>50K`), which serves as our target variable for salary range prediction.

In [33]:
# Load the dataset. adult.csv must be in the working directory (or adjust the path).
try:
    data = pd.read_csv("adult.csv")
except FileNotFoundError:
    print("Error: adult.csv not found. Please ensure the file is in the correct directory.")
    # On Google Colab the file may live under sample_data instead:
    # data = pd.read_csv("/content/sample_data/adult.csv")
    raise  # Stop here: the statements below need `data` to exist

print("First 5 rows of the dataset:")
print(data.head(5))

print("\nDataset shape (rows, columns):")
print(data.shape)

First 5 rows of the dataset:
   age workclass  fnlwgt     education  education.num marital.status  \
0   90         ?   77053       HS-grad              9        Widowed   
1   82   Private  132870       HS-grad              9        Widowed   
2   66         ?  186061  Some-college             10        Widowed   
3   54   Private  140359       7th-8th              4       Divorced   
4   41   Private  264663  Some-college             10      Separated   

          occupation   relationship   race     sex  capital.gain  \
0                  ?  Not-in-family  White  Female             0   
1    Exec-managerial  Not-in-family  White  Female             0   
2                  ?      Unmarried  Black  Female             0   
3  Machine-op-inspct      Unmarried  White  Female             0   
4     Prof-specialty      Own-child  White  Female             0   

   capital.loss  hours.per.week native.country income  
0          4356              40  United-States  <=50K  
1          4356  

## 3. Finding Null Values and Initial Cleaning
We'll check for missing values and handle them. In this dataset, missing entries are recorded as the string '?' in categorical columns, so `isna()` reports zero nulls even though values are missing.

In [34]:
print("Missing values before handling '?' marks:")
print(data.isna().sum())

# Replace the '?' placeholder with an explicit 'Others' category (applies to every column)
data.replace('?', 'Others', inplace=True)

print("\nValue counts for 'workclass' after handling '?' marks:")
print(data['workclass'].value_counts())

print("\nValue counts for 'occupation' after handling '?' marks:")
print(data['occupation'].value_counts())

Missing values before handling '?' marks:
age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
income            0
dtype: int64

Value counts for 'workclass' after handling '?' marks:
workclass
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
Others               1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: count, dtype: int64

Value counts for 'occupation' after handling '?' marks:
occupation
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
Others               1843
Transport-moving     1597
Handlers-cleaners
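
As a minimal sketch of why `isna()` reports nothing here, the snippet below parses a tiny in-memory CSV twice: once with default settings and once with `na_values="?"`. The rows are hypothetical stand-ins that follow the same '?' convention as `adult.csv`; `csv_text`, `raw`, `marked`, and `filled` are illustrative names, not part of the notebook.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for adult.csv (hypothetical rows, same '?' convention).
csv_text = """age,workclass,occupation
90,?,?
82,Private,Exec-managerial
66,?,?
54,Private,Machine-op-inspct
"""

# Default parsing keeps '?' as an ordinary string, so isna() sees nothing missing.
raw = pd.read_csv(io.StringIO(csv_text))
missing_default = raw.isna().sum()

# Declaring '?' as a missing-value marker makes the gaps visible to isna().
marked = pd.read_csv(io.StringIO(csv_text), na_values="?")
missing_marked = marked.isna().sum()

# Filling the NaNs reaches the same 'Others' category as the replace() step.
filled = marked.fillna({"workclass": "Others", "occupation": "Others"})

print(missing_default.to_dict())
print(missing_marked.to_dict())
print(filled["workclass"].tolist())
```

Either route ends with the same 'Others' category; the `na_values` route simply makes the missing counts visible before they are filled.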

## 4. Handling Outliers and Domain-Based Data Cleaning
Using domain judgment, we'll remove rows that are less relevant for income prediction: ages outside the 17 to 75 range, individuals in the `Without-pay` or `Never-worked` work classes, and the lowest education levels (`Preschool`, `1st-4th`, `5th-6th`).

In [35]:
# Age outlier handling: keep ages 17 to 75 inclusive
print("Original shape:", data.shape)
data = data[(data['age'] <= 75) & (data['age'] >= 17)]
print("Shape after age filtering:", data.shape)

# Remove 'Without-pay' and 'Never-worked' from 'workclass'
data = data[data['workclass'] != 'Without-pay']
data = data[data['workclass'] != 'Never-worked']
print("Shape after workclass filtering:", data.shape)

# Remove very low education categories
data = data[data['education'] != '1st-4th']
data = data[data['education'] != '5th-6th']
data = data[data['education'] != 'Preschool']
print("Shape after education filtering:", data.shape)

Original shape: (32561, 15)
Shape after age filtering: (32320, 15)
Shape after workclass filtering: (32299, 15)
Shape after education filtering: (31758, 15)
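
The same filters can also be written as one boolean mask, which some readers find easier to audit. This is a sketch on hypothetical rows, not a replacement for the cell above; `df`, `excluded_workclass`, `excluded_education`, and `keep` are illustrative names.

```python
import pandas as pd

# Hypothetical sample rows; category names mimic those used in the notebook.
df = pd.DataFrame({
    "age": [16, 30, 80, 45, 52, 28, 40],
    "workclass": ["Private", "Never-worked", "Private", "Private",
                  "Without-pay", "Local-gov", "Private"],
    "education": ["HS-grad", "HS-grad", "Bachelors", "Preschool",
                  "Masters", "Bachelors", "HS-grad"],
})

excluded_workclass = ["Without-pay", "Never-worked"]
excluded_education = ["1st-4th", "5th-6th", "Preschool"]

# Combine all three conditions into a single row mask.
keep = (
    df["age"].between(17, 75)  # inclusive on both ends by default
    & ~df["workclass"].isin(excluded_workclass)
    & ~df["education"].isin(excluded_education)
)
filtered = df[keep].reset_index(drop=True)
print(filtered)
```

The `reset_index(drop=True)` call is optional; the notebook keeps the original row labels, which is why later outputs start at index 2.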


## 5. Feature Engineering/Selection
We'll drop redundant columns. `education` and `education.num` convey the same information in text and numeric form, so we keep only the numerical `education.num`.

In [36]:
data.drop(columns=['education'], inplace=True)
print("Columns after dropping 'education':", data.columns.tolist())

Columns after dropping 'education': ['age', 'workclass', 'fnlwgt', 'education.num', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'capital.gain', 'capital.loss', 'hours.per.week', 'native.country', 'income']
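
Before dropping a column as redundant, it can be worth checking that the two columns really map one-to-one. A minimal sketch, using the label and code pairs visible in the first rows of the dataset; `df`, `codes_per_label`, and `labels_per_code` are illustrative names.

```python
import pandas as pd

# Label/code pairs taken from the first five rows shown earlier.
df = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "Some-college", "7th-8th", "Some-college"],
    "education.num": [9, 9, 10, 4, 10],
})

# Distinct numeric codes per label, and distinct labels per numeric code.
codes_per_label = df.groupby("education")["education.num"].nunique()
labels_per_code = df.groupby("education.num")["education"].nunique()

# One-to-one means every group on both sides contains exactly one value.
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(is_one_to_one)
```

Run against the full DataFrame (before the drop), the same two `groupby` calls would confirm or refute the redundancy for every education level.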


## 6. Encoding Categorical Features
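Most machine learning algorithms require numerical input. We'll convert all categorical (object type) feature columns into integer codes using `LabelEncoder`. The codes follow the sorted order of the labels and carry no real ranking, which tree-based models such as XGBoost generally tolerate.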

In [37]:
import joblib
import os
from sklearn.preprocessing import LabelEncoder

# Ensure the 'models' directory exists for saving encoders
os.makedirs('models', exist_ok=True)

# Identify categorical columns, excluding 'income' as it's the target
categorical_input_features = data.select_dtypes(include='object').columns.tolist()
if 'income' in categorical_input_features:
    categorical_input_features.remove('income')

# Store fitted encoders in a dictionary (optional, but good for debugging/verification)
fitted_input_encoders = {}

# Apply Label Encoding and save each fitted encoder
for col in categorical_input_features:
    encoder = LabelEncoder()
    # Fit and transform the column
    data[col] = encoder.fit_transform(data[col])
    # Store the fitted encoder
    fitted_input_encoders[col] = encoder
    # Save the encoder to a .pkl file
    joblib.dump(encoder, f'models/{col}_encoder.pkl')
    print(f"Saved models/{col}_encoder.pkl")

print("\nData after Label Encoding categorical features (first 5 rows):")
print(data.head())

# The target column 'income' is left unencoded here. Its LabelEncoder (`le_income`)
# is fitted in section 10 and saved with the model in the final section.

Saved models/workclass_encoder.pkl
Saved models/marital.status_encoder.pkl
Saved models/occupation_encoder.pkl
Saved models/relationship_encoder.pkl
Saved models/race_encoder.pkl
Saved models/sex_encoder.pkl
Saved models/native.country_encoder.pkl

Data after Label Encoding categorical features (first 5 rows):
   age  workclass  fnlwgt  education.num  marital.status  occupation  \
2   66          2  186061             10               6           8   
3   54          3  140359              4               0           6   
4   41          3  264663             10               5          10   
5   34          3  216864              9               0           7   
6   38          3  150601              6               5           0   

   relationship  race  sex  capital.gain  capital.loss  hours.per.week  \
2             4     2    0             0          4356              40   
3             4     4    0             0          3900              40   
4             3     4    0       
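
To make the encoding behaviour concrete, here is a small sketch of how `LabelEncoder` assigns codes and what happens when it meets a label it never saw during fitting. The label list is a hypothetical subset of `workclass` values; `mapping` and `unseen_ok` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["Private", "Others", "Local-gov", "Private"])

# classes_ is sorted, so each code reflects alphabetical position only.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())

# A label that was not present during fit raises ValueError at transform time.
try:
    encoder.transform(["Self-emp-inc"])
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(unseen_ok)
```

This is why the fitted encoders are saved: any application that reuses the model has to apply the same label-to-code mapping and should restrict user input to the known labels.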

## 7. Splitting Data into X (Independent) and Y (Dependent) Variables
We separate the features (X) that will be used for prediction from the target variable (Y), which is 'income' in our case.

In [38]:
X = data.drop(columns=['income'])
Y = data['income']

print("X (features) head:")
print(X.head())
print("\nY (target) value counts:")
print(Y.value_counts())

X (features) head:
   age  workclass  fnlwgt  education.num  marital.status  occupation  \
2   66          2  186061             10               6           8   
3   54          3  140359              4               0           6   
4   41          3  264663             10               5          10   
5   34          3  216864              9               0           7   
6   38          3  150601              6               5           0   

   relationship  race  sex  capital.gain  capital.loss  hours.per.week  \
2             4     2    0             0          4356              40   
3             4     4    0             0          3900              40   
4             3     4    0             0          3900              40   
5             4     4    0             0          3770              45   
6             4     4    1             0          3770              40   

   native.country  
2              39  
3              39  
4              39  
5              39  
6  

## 8. Feature Scaling
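Scaling maps every feature into a uniform range (0 to 1) using `MinMaxScaler`. This matters most for distance-based algorithms, where features with larger magnitudes would otherwise dominate; tree-based models such as XGBoost are largely insensitive to it. Note that the scaler in this cell is fit on the full dataset before the train-test split, so test-set minima and maxima influence the transformation.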

In [39]:
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print("First 5 rows of scaled X (features):")
print(X_scaled[:5])

First 5 rows of scaled X (features):
[[0.84482759 0.33333333 0.11802067 0.5        1.         0.57142857
  0.8        0.5        0.         0.         1.         0.39795918
  0.95121951]
 [0.63793103 0.5        0.08698198 0.         0.         0.42857143
  0.8        1.         0.         0.         0.8953168  0.39795918
  0.95121951]
 [0.4137931  0.5        0.17140354 0.5        0.83333333 0.71428571
  0.6        1.         0.         0.         0.8953168  0.39795918
  0.95121951]
 [0.29310345 0.5        0.13894066 0.41666667 0.         0.5
  0.8        1.         0.         0.         0.86547291 0.44897959
  0.95121951]
 [0.36206897 0.5        0.09393787 0.16666667 0.83333333 0.
  0.8        1.         1.         0.         0.86547291 0.39795918
  0.95121951]]
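
A minimal sketch of the min-max formula, `(x - min) / (max - min)`, and of fitting the scaler on the training split only so that test rows do not influence it. The age bounds 17 and 75 come from the filter in section 4; the random data and the names `X_demo`, `y_demo`, and `scaler_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Worked example: age 66 with min 17 and max 75 gives the first value printed above.
age_scaled = (66 - 17) / (75 - 17)
print(round(age_scaled, 8))

# Hypothetical data: two numeric features and balanced binary labels.
rng = np.random.default_rng(42)
X_demo = rng.integers(17, 76, size=(100, 2)).astype(float)
y_demo = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit on the training rows only, then reuse the learned min/max for the test rows.
scaler_demo = MinMaxScaler()
X_train_scaled = scaler_demo.fit_transform(X_train)
X_test_scaled = scaler_demo.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

With this ordering, scaled test values can fall slightly outside 0 to 1 when a test row exceeds the training range, which is expected behaviour.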


## 9. Train-Test Split
We divide the data into training and testing sets to evaluate the model's performance on unseen data. `stratify=Y` ensures that the proportion of income categories is maintained in both sets, which is important for imbalanced datasets.

In [40]:
xtrain, xtest, ytrain, ytest = train_test_split(X_scaled, Y, test_size=0.2, random_state=42, stratify=Y)

print("Shape of xtrain:", xtrain.shape)
print("Shape of xtest:", xtest.shape)
print("Shape of ytrain:", ytrain.shape)
print("Shape of ytest:", ytest.shape)

Shape of xtrain: (25406, 13)
Shape of xtest: (6352, 13)
Shape of ytrain: (25406,)
Shape of ytest: (6352,)
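
The effect of `stratify` can be checked directly by comparing class shares in the two splits. A sketch on hypothetical labels with a 75/25 imbalance; `y_demo`, `train_share`, and `test_share` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75% '<=50K', 25% '>50K'.
y_demo = np.array(["<=50K"] * 750 + [">50K"] * 250)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of the minority class in each split.
train_share = float(np.mean(y_tr == ">50K"))
test_share = float(np.mean(y_te == ">50K"))
print(train_share, test_share)
```

Without `stratify`, the two shares would differ by sampling noise, which matters more as the minority class gets smaller.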


## 10. Model Training: XGBoost Classifier
We'll use an XGBoost Classifier, a powerful gradient boosting algorithm known for its high performance and efficiency.

In [41]:
# XGBoost expects integer class labels (0 and 1), but 'income' holds '<=50K' and '>50K'.
# Fit the encoder on the training labels and reuse it for the test labels.
le_income = LabelEncoder()
ytrain_encoded = le_income.fit_transform(ytrain)
ytest_encoded = le_income.transform(ytest)

# `use_label_encoder` is omitted: recent XGBoost releases ignore it and only emit a warning.
xgb_model = XGBClassifier(objective='binary:logistic', eval_metric='logloss', random_state=42)
xgb_model.fit(xtrain, ytrain_encoded)

print("XGBoost Model trained successfully.")

XGBoost Model trained successfully.


## 11. Model Prediction
Making predictions on the test set using the trained XGBoost model.

In [42]:
ypred_encoded = xgb_model.predict(xtest)

# Decode predictions back to original labels for readability
ypred = le_income.inverse_transform(ypred_encoded)

print("First 10 predictions:", ypred[:10])
print("First 10 actual values:", ytest[:10].values)

First 10 predictions: ['<=50K' '<=50K' '<=50K' '<=50K' '<=50K' '>50K' '<=50K' '<=50K' '<=50K'
 '>50K']
First 10 actual values: ['<=50K' '<=50K' '>50K' '<=50K' '<=50K' '<=50K' '>50K' '<=50K' '<=50K'
 '>50K']


## 12. Model Evaluation
Evaluating the model's performance using various metrics: accuracy, classification report (precision, recall, f1-score), and confusion matrix.

In [43]:
accuracy = accuracy_score(ytest, ypred)
print(f"Accuracy Score: {accuracy:.4f}")

print("\nClassification Report:")
print(classification_report(ytest, ypred))

print("\nConfusion Matrix:")
print(confusion_matrix(ytest, ypred))

Accuracy Score: 0.8684

Classification Report:
              precision    recall  f1-score   support

       <=50K       0.89      0.94      0.91      4796
        >50K       0.77      0.66      0.71      1556

    accuracy                           0.87      6352
   macro avg       0.83      0.80      0.81      6352
weighted avg       0.86      0.87      0.86      6352


Confusion Matrix:
[[4494  302]
 [ 534 1022]]
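
The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when reading it. The sketch below uses the matrix printed above and treats `>50K` as the positive class; rows are actual labels and columns are predicted labels, following scikit-learn's convention.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted.
cm = np.array([[4494, 302],
               [534, 1022]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()     # correct predictions over all predictions
precision = tp / (tp + fp)          # of predicted '>50K', how many were right
recall = tp / (tp + fn)             # of actual '>50K', how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The recall of 0.66 means roughly one in three actual `>50K` earners is classified as `<=50K`, which is the main weakness to keep in mind when using the model.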


## 13. Hyperparameter Tuning
Using `GridSearchCV` to find the best hyperparameters for the XGBoost Classifier, which can further improve model performance.

In [44]:
# Define the parameter grid for GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],            # Number of boosting rounds
    'max_depth': [3, 5, 7],                     # Maximum depth of a tree
    'learning_rate': [0.01, 0.1, 0.2],          # Step size shrinkage to prevent overfitting
    'subsample': [0.7, 1.0],                    # Subsample ratio of the training instance
    'colsample_bytree': [0.7, 1.0]              # Subsample ratio of columns when constructing each tree
}

# Initialize GridSearchCV
# Use ytrain_encoded for GridSearchCV target
# `use_label_encoder` is omitted: recent XGBoost releases ignore it and only emit a warning.
grid_search = GridSearchCV(estimator=XGBClassifier(objective='binary:logistic', eval_metric='logloss', random_state=42),
                           param_grid=param_grid,
                           cv=3,
                           n_jobs=-1,
                           verbose=2,
                           scoring='accuracy')

# Fit GridSearchCV to the training data
print("Starting GridSearchCV for XGBoost...")
grid_search.fit(xtrain, ytrain_encoded)

print("\nBest parameters found:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

# Evaluate the best model on the test set
best_xgb_model = grid_search.best_estimator_
y_pred_tuned_encoded = best_xgb_model.predict(xtest)

# Decode predictions back to original labels
y_pred_tuned = le_income.inverse_transform(y_pred_tuned_encoded)

tuned_accuracy = accuracy_score(ytest, y_pred_tuned)

print(f"\nAccuracy of the tuned XGBoost model on test set: {tuned_accuracy:.4f}")
print("\nClassification Report of the tuned XGBoost model:")
print(classification_report(ytest, y_pred_tuned))

Starting GridSearchCV for XGBoost...
Fitting 3 folds for each of 108 candidates, totalling 324 fits

Best parameters found: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200, 'subsample': 1.0}
Best cross-validation score: 0.8701094551044378

Accuracy of the tuned XGBoost model on test set: 0.8758

Classification Report of the tuned XGBoost model:
              precision    recall  f1-score   support

       <=50K       0.89      0.95      0.92      4796
        >50K       0.80      0.65      0.72      1556

    accuracy                           0.88      6352
   macro avg       0.85      0.80      0.82      6352
weighted avg       0.87      0.88      0.87      6352
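
The "108 candidates, 324 fits" line follows directly from the grid: the candidate count is the product of the list lengths, and each candidate is fitted once per fold. A short sketch using scikit-learn's `ParameterGrid` with the same grid as above; `n_candidates` and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.7, 1.0],
    'colsample_bytree': [0.7, 1.0],
}

# 3 * 3 * 3 * 2 * 2 parameter combinations, each evaluated on cv=3 folds.
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 3
print(n_candidates, n_fits)
```

Counting fits this way before launching a search gives a quick sense of its cost; adding one more three-value parameter would triple the work.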



## Next Steps & Deployment
After finalizing our model, we save `best_xgb_model`, `scaler`, and the `le_income` LabelEncoder with `joblib`.
Our Streamlit application (`app.py`) can then load these objects, together with the per-column encoders saved in section 6, to make predictions from user input.

In [45]:
import joblib
import os

# Create the 'models' directory if it doesn't exist
os.makedirs('models', exist_ok=True)

# Save the best trained XGBoost model
joblib.dump(best_xgb_model, 'models/best_salary_predictor_xgb_model.pkl')
print("Trained XGBoost model saved successfully to models/best_salary_predictor_xgb_model.pkl")

# Save the MinMaxScaler
joblib.dump(scaler, 'models/minmax_scaler.pkl')
print("MinMaxScaler saved successfully to models/minmax_scaler.pkl")

# Save the LabelEncoder for the income target variable
# This is crucial for decoding predictions in the deployed app
joblib.dump(le_income, 'models/income_label_encoder.pkl')
print("Income LabelEncoder saved successfully to models/income_label_encoder.pkl")

Trained XGBoost model saved successfully to models/best_salary_predictor_xgb_model.pkl
MinMaxScaler saved successfully to models/minmax_scaler.pkl
Income LabelEncoder saved successfully to models/income_label_encoder.pkl
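
On the loading side, the saved objects are applied in the same order as during training: scale the input, predict, then decode the label. The sketch below round-trips a scaler and a label encoder through `joblib` in a temporary directory; the fitted objects are small hypothetical stand-ins, the model itself is left out to keep the example dependency-free, and how `app.py` actually does this is not shown in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical stand-ins for the fitted objects saved by the notebook.
scaler_demo = MinMaxScaler().fit(np.array([[17.0, 1.0], [75.0, 99.0]]))
le_demo = LabelEncoder().fit(["<=50K", ">50K"])

with tempfile.TemporaryDirectory() as model_dir:
    scaler_path = os.path.join(model_dir, "minmax_scaler.pkl")
    encoder_path = os.path.join(model_dir, "income_label_encoder.pkl")
    joblib.dump(scaler_demo, scaler_path)
    joblib.dump(le_demo, encoder_path)

    # What a consuming application would do at start-up.
    loaded_scaler = joblib.load(scaler_path)
    loaded_le = joblib.load(encoder_path)

# Scale one input row, then decode integer predictions back to income brackets.
row = np.array([[46.0, 50.0]])
scaled_row = loaded_scaler.transform(row)
decoded = loaded_le.inverse_transform(np.array([0, 1]))
print(scaled_row, decoded.tolist())
```

Pickled scikit-learn and XGBoost objects are generally safest to load with the same library versions that produced them, so recording those versions next to the `.pkl` files is a reasonable precaution.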
