# **Predicting Salary using Gradient Boosting Model**

In this notebook, we build a Gradient Boosting model (XGBoost) to predict annual salary based on [`Age Range`, `Industry`, `Job Title`, `Education`, `Country`, `Gender`, `Experience`, `Annual Bonus` and `Signon Bonus`].

The notebook covers data preprocessing, model training, hyperparameter tuning, model evaluation using cross-validation, and a final evaluation of the model's performance on a hold-out test set. Together, these steps produce a predictive model that estimates annual salaries from the features provided in the dataset.

# Developing the Model

We will be taking the following steps to develop the model:
1. Data Preparation
2. Model Training: Training the model on the training dataset, adjusting parameters as needed to optimize performance.
3. Model Evaluation: Assessing the model with cross-validation and on a hold-out test set.
4. Hyperparameter Tuning: Using techniques such as grid search or random search to find good settings for the model's parameters.
5. Iterate and Optimize: Refining the model based on the evaluation by adjusting its parameters, reselecting features, or further addressing data imbalances, then repeating the training and evaluation process as needed.

Importing the necessary Libraries

In [1]:
#import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor


# Data Preparation/Preprocessing:

Developing the Model:
1.	Data Collection: Gather a dataset that is representative of the population you're studying. The quality and quantity of your data are crucial for building a reliable model.
2.	Data Cleaning and Preprocessing: Clean the data to handle missing values, outliers, and errors. Preprocess the data by encoding categorical variables, normalizing numerical features, and splitting the dataset into training and testing sets.
3.	Feature Selection and Engineering: Identify the most relevant features that contribute to the target variable and create new features that could improve model performance.

In [2]:
# Load the dataset (the capped copy, df_capped, is created from df further below)
df = pd.read_csv('Cleaned_SalSur.csv')

# Outlier Detection using IQR
Q1 = df[['Annual Salary', 'Annual Bonus', 'Signon Bonus']].quantile(0.25)
Q3 = df[['Annual Salary', 'Annual Bonus', 'Signon Bonus']].quantile(0.75)

IQR = Q3 - Q1

# Define bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Count outliers
outliers = ((df[['Annual Salary', 'Annual Bonus', 'Signon Bonus']] < lower_bound) | 
            (df[['Annual Salary', 'Annual Bonus', 'Signon Bonus']] > upper_bound)).sum()


# Cap Annual Salary and Annual Bonus at their upper IQR bounds
# (Signon Bonus is left uncapped)
df_capped = df.copy()
df_capped['Annual Salary'] = df_capped['Annual Salary'].clip(upper=upper_bound['Annual Salary'])
df_capped['Annual Bonus'] = df_capped['Annual Bonus'].clip(upper=upper_bound['Annual Bonus'])


# Define filters
jobfilter = ['Audit Associate', 'Account Executive', 'Recruiter', 'Nurse', 'Sales',
             'Actuarial Associate', 'Software Engineer', 'Project Manager',
             'Account Manager', 'Teacher', 'Marketing', 'Coo', 'Consultant',
             'Director', 'Engineer', 'Analyst', 'Manager']

industryfilter = ['Computer Software', 'Construction', 'Architecture', 'Engineering', 
                  'Sales', 'Non-Profit', 'Law/Legal/Attorney', 'Advertising', 
                  'Banking', 'Insurance', 'Retail', 'Accounting', 'Consulting', 
                  'Marketing', 'Finance', 'Education', 'Information Technology/It', 
                  'Biotech', 'Health Care']

countryfilter = ['United States', 'Canada', 'United Kingdom', 'Australia', 'Ireland']

# Re-apply filters to the capped dataframe
df_filtered_capped = df_capped[df_capped['Job Title'].isin(jobfilter) & 
                               df_capped['Industry'].isin(industryfilter) & 
                               df_capped['Country'].isin(countryfilter)]

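# Define features (X) and target (y)
# Note: X and y are built from df_capped, not df_filtered_capped,
# so the filters above do not affect the model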
X = df_capped.drop(['Annual Salary'], axis=1)
y = df_capped['Annual Salary']
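
To make the IQR rule used above concrete, here is a minimal sketch on a small made-up salary series. The values are illustrative assumptions and are not taken from `Cleaned_SalSur.csv`. It computes the same bounds and applies the same upper-bound `clip`.

```python
import pandas as pd

# Hypothetical toy salaries; the last value is an obvious outlier
salaries = pd.Series([40_000, 50_000, 55_000, 60_000, 65_000, 70_000, 500_000])

q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Count values that fall outside the IQR bounds
n_outliers = ((salaries < lower_bound) | (salaries > upper_bound)).sum()

# Cap only the upper tail, as the notebook does
capped = salaries.clip(upper=upper_bound)

print(f"bounds: [{lower_bound}, {upper_bound}], outliers: {n_outliers}")
print(capped.tolist())
```

Capping keeps every row in the dataset, whereas dropping the outliers would shrink it. The trade-off is that the capped target no longer distinguishes between salaries above the bound.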

# Model Training and Parameter Tuning using Randomized Search:

Define preprocessing steps for numeric and categorical features, including imputation and scaling for numeric features and one-hot encoding for categorical features.
Combine preprocessing steps into a single ColumnTransformer.

After conducting hyperparameter tuning using RandomizedSearchCV, the following hyperparameters were selected for the XGBoost model:

- `n_estimators`: 300 (the code below uses 350)
- `learning_rate`: 0.3
- `max_depth`: 5
- `subsample`: 0.7
- `colsample_bytree`: 0.7
- `reg_alpha`: 1
- `reg_lambda`: 4
- `gamma`: 3

In [3]:
# Preprocessing steps for numeric and categorical columns
numeric_features = ['Experience', 'Annual Bonus', 'Signon Bonus']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['Age Range', 'Industry', 'Job Title', 'Education', 'Country', 'Gender']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps into a single ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Define the XGBRegressor with optimized hyperparameters
xgb_model = XGBRegressor(
    n_estimators=350,       # previously tried: 300
    learning_rate=0.3,
    max_depth=5,
    subsample=0.7,          # previously tried: 1
    colsample_bytree=0.7,   # previously tried: 0.8
    reg_alpha=1,
    reg_lambda=4,
    gamma=3,
    random_state=42)

# Create a pipeline that includes the preprocessor and the model
xgb_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('model', xgb_model)])
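
As a rough illustration of what the preprocessing steps above do, the sketch below fits a reduced version of the same `ColumnTransformer` on a made-up two-column frame and transforms a row containing a country that was not present during fitting. The toy data and the `toy_preprocessor` name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value per column
train = pd.DataFrame({
    "Experience": [1.0, 5.0, np.nan, 10.0],
    "Country": ["Canada", "Ireland", "Canada", np.nan],
})
# A new row whose country was never seen during fitting
new_row = pd.DataFrame({"Experience": [3.0], "Country": ["Japan"]})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["Experience"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["Country"]),
])

toy_preprocessor.fit(train)
encoded = toy_preprocessor.transform(new_row)
# The result may be sparse depending on its density; densify for display
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

# One scaled numeric column followed by one-hot columns for
# 'Canada', 'Ireland' and 'missing'; the unseen 'Japan' maps to all zeros
print(encoded)
```

With `handle_unknown='ignore'`, an unseen category does not raise an error at prediction time. It is encoded as all zeros, so the model receives no category information for that row.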

# Model Evaluation:

Validating the Model:
1.	Cross-Validation: Use cross-validation techniques to assess how the model will generalize to an independent dataset. This involves dividing the dataset into complementary subsets, training the model on one subset, and validating it on the other.
2.	Performance Metrics: Evaluate your model using metrics appropriate for regression tasks, such as mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and R². The choice of metrics should reflect your project's objectives, for example how heavily large errors should be penalized.
3.	Model Interpretation: Understand how your model makes predictions and which features are most influential. This can help identify any biases or weaknesses in the model.
4.	Ethical Consideration and Bias Mitigation: Assess and address potential biases in your model or data. Ensure that your model's use complies with ethical standards, especially because the features include sensitive attributes such as gender and age.
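
The cross-validation step in the list above can be sketched on ten made-up samples to show how `KFold` partitions the data. The array `X_toy` and the other variable names are hypothetical, and no model is trained here.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples, identified by their index
X_toy = np.arange(10).reshape(-1, 1)

kf_toy = KFold(n_splits=5, shuffle=True, random_state=42)

validated = []
for fold, (train_idx, val_idx) in enumerate(kf_toy.split(X_toy), start=1):
    # Each fold trains on 8 samples and validates on the remaining 2
    print(f"Fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")
    validated.extend(val_idx.tolist())

# Every sample is used for validation exactly once across the folds
print(sorted(validated) == list(range(10)))
```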


In [4]:
# Perform 5-fold cross-validation on the full dataset (X, y) using KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(xgb_pipeline, X, y, cv=kf, scoring='neg_mean_squared_error', n_jobs=-1)
cv_scores = np.abs(cv_scores)  # Convert scores to positive values

# Print cross-validation MSE scores
print(f"CV MSE Scores: {cv_scores}")
print(f"Mean CV MSE: {np.mean(cv_scores)}")
print(f"Standard Deviation of CV MSE: {np.std(cv_scores)}")

# Split the capped data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model on the training data
xgb_pipeline.fit(X_train, y_train)

# Evaluate the model
y_pred = xgb_pipeline.predict(X_test)

# Calculate the performance metrics for the hold-out test set
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

#print the results
print(f"Hold-out Test Set Mean Square Error(MSE): {mse}")
print(f"Hold-out Test Set R-Square: {r2}")

CV MSE Scores: [4.45972173e+08 4.81263818e+08 4.81200522e+08 4.65374553e+08
 4.67688888e+08]
Mean CV MSE: 468299990.8142292
Standard Deviation of CV MSE: 12976498.594821848
Hold-out Test Set Mean Square Error(MSE): 446904196.1233598
Hold-out Test Set R-Square: 0.6021761531567582


# Model Interpretation

# Cross-Validation (CV) MSE Scores

**CV MSE Scores:** The MSE scores from cross-validation are relatively consistent, ranging from approximately 446 million to 481 million across the five folds. This indicates that the model is reasonably stable across different subsets of the data.

**Mean CV MSE:** The average MSE across all CV folds is about 468 million. This value gives a general idea of the model's prediction error across the CV process.

**Standard Deviation of CV MSE:** The standard deviation of the CV MSE scores is approximately 12.98 million, which is under 3% of the mean MSE. This low standard deviation suggests that the model's performance is consistent across folds and is not very sensitive to the randomness in the data splitting.
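
Because MSE is expressed in squared salary units, it can be easier to read these numbers as a root mean squared error (RMSE). The short sketch below takes the square root of the values printed above. Note that the square root of the mean fold MSE is not exactly the mean of the per-fold RMSEs, so treat the first figure as an approximation.

```python
import math

# Values copied from the notebook output above
mean_cv_mse = 468299990.8142292
holdout_mse = 446904196.1233598

# RMSE is the square root of MSE, expressed in salary units
mean_cv_rmse = math.sqrt(mean_cv_mse)
holdout_rmse = math.sqrt(holdout_mse)

print(f"RMSE from mean CV MSE: {mean_cv_rmse:,.0f}")
print(f"RMSE from hold-out MSE: {holdout_rmse:,.0f}")
```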

# Hold-out Test Set Performance
**Hold-out Test Set MSE:** The MSE on the hold-out test set is approximately 447 million, which is slightly better (lower) than the mean CV MSE and within the range of the individual fold scores. This suggests that the model's error on unseen data is in line with the cross-validation estimate, at least as far as the MSE metric is concerned.

**Hold-out Test Set R-Square (R²)**: The R² value of 0.602 indicates that about 60.2% of the variance in the (capped) salary is explained by the model, leaving roughly 40% unexplained. This is a moderate level of predictive power, suggesting the model has learned meaningful patterns from the features that contribute to salary prediction.
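
For reference, R² compares the model's squared error with that of always predicting the mean: R² = 1 - SS_res / SS_tot. The sketch below implements that definition in plain Python on made-up numbers. The `r_squared` helper and the toy values are illustrative assumptions, and the notebook itself uses `sklearn.metrics.r2_score`.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical values, not taken from the salary data
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.0, 5.0, 8.0, 9.0]

r2_toy = r_squared(y_true_toy, y_pred_toy)
print(r2_toy)
```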


**Model Performance:** The gradient boosting model performs consistently, as shown by the similar CV MSE scores and the comparable hold-out result, which points to low variance across data splits. An R² of just over 0.60 means the model captures a majority of the variance in the target variable, while a substantial share remains unexplained by the available features.

# Extended Hyperparameter Tuning using RandomizedSearchCV

In [6]:
from sklearn.model_selection import RandomizedSearchCV, KFold
from scipy.stats import uniform, randint

# Define the cross-validation strategy
cv = KFold(n_splits=5, shuffle=True, random_state=42)

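# Hyperparameter distributions
# scipy's uniform(loc, scale) samples from [loc, loc + scale];
# randint(low, high) samples integers from low to high - 1
param_dist = {
    'model__n_estimators': randint(100, 600),      # 100 to 599
    'model__learning_rate': uniform(0.01, 0.2),    # 0.01 to 0.21
    'model__max_depth': randint(3, 10),            # 3 to 9
    'model__subsample': uniform(0.7, 0.3),         # 0.7 to 1.0
    'model__colsample_bytree': uniform(0.7, 0.3),  # 0.7 to 1.0
    'model__reg_alpha': uniform(0, 2),             # 0 to 2
    'model__reg_lambda': uniform(1, 3),            # 1 to 4
    'model__gamma': uniform(0, 5)                  # 0 to 5
}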

# RandomizedSearch setup
random_search = RandomizedSearchCV(
    estimator=xgb_pipeline,
    param_distributions=param_dist,
    n_iter=50,  # Number of parameter settings sampled
    scoring='neg_mean_squared_error',
    cv=cv,
    verbose=1,
    n_jobs=-1,
    random_state=42
)

# Fit RandomizedSearch
random_search.fit(X_train, y_train)

# Best parameters and score
print("Best parameters found: ", random_search.best_params_)
print("Lowest MSE found: ", np.abs(random_search.best_score_))

# Predict with best model
y_pred_optimized = random_search.predict(X_test)

# Performance Metrics
mse_optimized = mean_squared_error(y_test, y_pred_optimized)
r2_optimized = r2_score(y_test, y_pred_optimized)

print("Optimized Mean Square Error(MSE):", mse_optimized)
print("Optimized R-Square:", r2_optimized)


Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters found:  {'model__colsample_bytree': 0.7418481581956125, 'model__gamma': 1.4607232426760908, 'model__learning_rate': 0.08327236865873834, 'model__max_depth': 8, 'model__n_estimators': 545, 'model__reg_alpha': 1.5703519227860272, 'model__reg_lambda': 1.5990213464750793, 'model__subsample': 0.8542703315240834}
Lowest MSE found:  469876439.70147216
Optimized Mean Square Error(MSE): 436818537.3437755
Optimized R-Square: 0.6111541748635292
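
To put the tuned result in context, the sketch below compares the hold-out metrics printed above for the initial and the tuned model. It only restates the reported numbers as a relative change. Both models were scored on the same test split, so the difference is indicative rather than a statistically tested improvement.

```python
# Values copied from the notebook outputs above
baseline_mse, baseline_r2 = 446904196.1233598, 0.6021761531567582
tuned_mse, tuned_r2 = 436818537.3437755, 0.6111541748635292

# Relative reduction in hold-out MSE and absolute gain in R²
mse_reduction = (baseline_mse - tuned_mse) / baseline_mse
r2_gain = tuned_r2 - baseline_r2

print(f"Relative MSE reduction: {mse_reduction:.2%}")
print(f"R² gain: {r2_gain:.4f}")
```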
