
# Salary Prediction Model

This notebook demonstrates a complete workflow for building a salary prediction model using machine learning. The model is trained using a Random Forest Regressor, and its hyperparameters are optimized with GridSearchCV. The final model is exported for later use in production.

**Sections:**
1. Data Preprocessing
2. Model Training and Hyperparameter Tuning
3. Model Evaluation
4. Model Export
5. Model Testing

Let's dive in!


## Importing Necessary Libraries

In [None]:
# Import libraries
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings


warnings.filterwarnings('ignore')

In [None]:
# Load the cleaned Glassdoor dataset
df = pd.read_csv("../Data/Cleaned/glassdoor-cleaned.csv")
df.head()

### Feature Engineering

#### Making a new dataframe with relevant features for the regression model

In [None]:
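# Select the features used by the regression model, plus the target (salary_estimate)
cols_model = ['job_state', 'seniority', 'job_education', 'job_experience', 'company_industry', 'company_rating', 'salary_estimate']

# Copy the selection so later column assignments do not act on a view of df
df_model = df[cols_model].copy()
df_model.head()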

#### Dropping states that have 5 or fewer data points

In [None]:
# Count the number of rows per state
jobstate_count = df_model['job_state'].value_counts()
jobstate_count

In [None]:
# Keep only the states that occur more than 5 times
states_g5 = jobstate_count[jobstate_count > 5].index.tolist()

df_model = df_model[df_model["job_state"].isin(states_g5)]
df_model['job_state'].value_counts()
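The same filter can be applied to more than one column, for example to `company_industry` as well as `job_state`. The following is a minimal sketch on toy data, not the notebook's own method: the helper `drop_rare_categories`, the `min_count` parameter, and all values in `toy` are assumptions made for illustration.

```python
import pandas as pd


def drop_rare_categories(frame, column, min_count=5):
    """Keep only rows whose value in `column` occurs more than `min_count` times."""
    counts = frame[column].value_counts()
    frequent = counts[counts > min_count].index
    return frame[frame[column].isin(frequent)].copy()


# Toy data with hypothetical values (not taken from the Glassdoor dataset)
toy = pd.DataFrame({
    "job_state": ["CA"] * 7 + ["NY"] * 7 + ["TX"] * 2,
    "company_industry": ["Tech"] * 7 + ["Finance"] * 7 + ["Retail"] * 2,
})

# Apply the filter to each categorical column in turn
filtered = toy
for col in ["job_state", "company_industry"]:
    filtered = drop_rare_categories(filtered, col)

print(filtered["job_state"].value_counts())
print(filtered["company_industry"].value_counts())
```

Filtering one column can push a category in another column below the threshold, so it is worth re-checking the counts after the loop.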

#### Unique values for the Predict Salary page

In [None]:
# List the company ratings available as input options
df_model['company_rating'].unique()

In [None]:
# List the company industries available as input options
df_model['company_industry'].unique()

#### Label Encoding

In [None]:
# Import libraries
from sklearn.preprocessing import LabelEncoder

In [None]:
# Encode job_state as integer labels
le_state = LabelEncoder()
df_model['job_state'] = le_state.fit_transform(df_model['job_state'])
df_model['job_state'].unique()

In [None]:
# Encode seniority as integer labels
le_sen = LabelEncoder()
df_model['seniority'] = le_sen.fit_transform(df_model['seniority'])
df_model['seniority'].unique()

In [None]:
# Encode job_education as integer labels
le_edu = LabelEncoder()
df_model['job_education'] = le_edu.fit_transform(df_model['job_education'])
df_model['job_education'].unique()

In [None]:
# Encode company_industry as integer labels
le_indu = LabelEncoder()
df_model['company_industry'] = le_indu.fit_transform(df_model['company_industry'])
df_model['company_industry'].unique()

In [None]:
# Inspect the sorted rating values before encoding them
unique_company_ratings = np.sort(df_model['company_rating'].unique())
unique_company_ratings

In [None]:
# Encode company_rating as integer labels
le_rating = LabelEncoder()
df_model['company_rating'] = le_rating.fit_transform(df_model['company_rating'])
df_model['company_rating'].unique()

In [None]:
# Encode job_experience as integer labels
le_exp = LabelEncoder()
df_model['job_experience'] = le_exp.fit_transform(df_model['job_experience'])
df_model['job_experience'].unique()

In [None]:
# Preview the fully encoded dataframe
df_model.head()
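`LabelEncoder` maps each category to an integer, which gives nominal features such as `job_state` an arbitrary numeric order. One-hot encoding is a common alternative that avoids this. The sketch below uses `pandas.get_dummies` on toy data; the values in `toy` are hypothetical and only the column names mirror the notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; the column names mirror the notebook's
toy = pd.DataFrame({
    "job_state": ["CA", "NY", "CA", "TX"],
    "seniority": ["senior", "junior", "junior", "senior"],
    "company_rating": [4.5, 3.9, 4.1, 3.2],
})

# One indicator column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["job_state", "seniority"], dtype=int)
print(encoded.columns.tolist())
```

Tree-based models such as the random forest used later often cope with integer labels, while linear models are usually more sensitive to the imposed order, so the choice depends on the model.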

In [None]:
# Plot the correlation matrix of the encoded features and the salary
corrmat = df_model.corr()
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=0.9, annot=True, square=True)

#### The salary is moderately correlated with seniority, company industry, and company rating. Surprisingly, it is negatively correlated with the required job experience and the job state.

## Multiple Linear Regression

In [None]:
# Import libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [None]:
# Split the data into features (X) and target (y)
X = df_model.drop("salary_estimate", axis=1)
y = df_model["salary_estimate"].values

In [None]:
# Hold out 25% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
# Create the baseline linear regression model
model = LinearRegression()

In [None]:
# Fit the model on the training set
model.fit(X_train, y_train)

In [None]:
# Predict salaries for the test set
predictions = model.predict(X_test)

In [None]:
# Mean absolute error of the linear model on the test set
print(f'Mean Absolute Error : ${round(mean_absolute_error(y_test, predictions), 2)}')

In [None]:
# Root mean squared error of the linear model on the test set
error = np.sqrt(mean_squared_error(y_test, predictions))
print("Root Mean Squared Error : ${:,.02f}".format(error))

## Random Forest Regressor Model

In [None]:
# Import libraries
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Rebuild the feature matrix and target (same definitions as in the linear regression section)
X = df_model.drop("salary_estimate", axis=1)
y = df_model["salary_estimate"].values

### Hyperparameter Tuning

In [None]:
# Using GridSearchCV to find the best parameters for RandomForestRegressor
max_depth = [None, 2, 4, 6, 8, 10, 12]

parameters = {"max_depth": max_depth}

regressor = RandomForestRegressor(n_estimators = 100, random_state=0)
gs = GridSearchCV(regressor, parameters, scoring='neg_mean_squared_error')
gs.fit(X_train, y_train)
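After fitting, a `GridSearchCV` object exposes the winning parameters as `best_params_` and the mean cross-validated score as `best_score_`. With `scoring='neg_mean_squared_error'` that score is a negated MSE, so an RMSE can be recovered as the square root of its negation. The sketch below shows this on synthetic data; `X_demo`, `y_demo`, `search`, and the reduced grid are assumptions for illustration and are not the notebook's data or results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for X_train / y_train
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=10.0, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    {"max_depth": [None, 2, 4]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE across folds, so flip the sign before the square root
best_rmse = np.sqrt(-search.best_score_)
print("Best parameters:", search.best_params_)
print(f"Cross-validated RMSE: {best_rmse:,.2f}")
```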

In [None]:
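# Evaluate the best estimator found by the grid search on the test set
regressor = gs.best_estimator_

# GridSearchCV already refits the best estimator on X_train by default; this call repeats that fit
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

print(f'Mean Absolute Error : ${round(mean_absolute_error(y_test, y_pred), 2)}')
error = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error : ${:,.02f}".format(error))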

In [None]:
# Absolute error of each test prediction
errors = abs(y_pred - y_test)

# Absolute percentage error of each prediction; the mean of these values is the MAPE
mape = 100 * (errors / y_test)
# Report accuracy as 100 minus the MAPE
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%')

## Training the Model

In [None]:
# Refit the tuned regressor on the full dataset before exporting it
regressor.fit(X, y)

## Feature Importances

In [None]:
# Get numerical feature importances
importances = regressor.feature_importances_
# Pair each feature name with its rounded importance (X.columns excludes the target column)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(X.columns, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
# Print each feature and its importance
for feature, importance in feature_importances:
    print(f'Variable: {feature:20} Importance: {importance}')

In [None]:
# Example input, in feature order:
# job_state, seniority, job_education, job_experience, company_industry, company_rating
# NumPy stores this mixed-type row as strings, including the rating

X_example = np.array([["CA", "senior", "bachelor", "0-2 years", "Information Technology Support Services", 4.5]])

In [None]:
# Encode each column with the encoder fitted on the training data
X_example[:, 0] = le_state.transform(X_example[:, 0])
X_example[:, 1] = le_sen.transform(X_example[:, 1])
X_example[:, 2] = le_edu.transform(X_example[:, 2])
X_example[:, 3] = le_exp.transform(X_example[:, 3])
X_example[:, 4] = le_indu.transform(X_example[:, 4])
# The rating is stored as a string here, so cast it back to float to match the values le_rating was fitted on
X_example[:, 5] = le_rating.transform(X_example[:, 5].astype(float))

X_example = X_example.astype(float)
X_example

In [None]:
# Predict the salary for the encoded example
y_pred = regressor.predict(X_example)
salary = int(y_pred[0])
print(f"Predicted salary: ${salary:,}")
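Packing strings and numbers into one NumPy array turns every value into a string, which is why the example above has to be cast back to float after encoding. One way around this is to keep the raw record in a dict and encode each field separately. The sketch below illustrates that idea with toy encoders; `encode_example`, the `encoders` dict, and the fitted values are hypothetical stand-ins for `le_state`, `le_rating`, and the other encoders.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy encoders fitted on hypothetical values (stand-ins for le_state, le_rating, ...)
encoders = {
    "job_state": LabelEncoder().fit(["CA", "NY", "TX"]),
    "company_rating": LabelEncoder().fit([3.5, 4.0, 4.5]),
}


def encode_example(raw, encoders):
    """Encode one raw record (a dict) into a 2-D float array, in encoder order."""
    row = [encoders[name].transform([raw[name]])[0] for name in encoders]
    return np.array([row], dtype=float)


X_one = encode_example({"job_state": "NY", "company_rating": 4.5}, encoders)
print(X_one)
```

Because each value keeps its original type, a float rating is looked up as a float. A value the encoder has never seen still raises a `ValueError`, so inputs on a prediction page would need to be limited to the known categories.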

## Model Export

In [None]:
# Import libraries
import pickle

data = {"model": regressor, "le_state": le_state, "le_sen": le_sen, "le_edu": le_edu, "le_exp": le_exp, "le_indu": le_indu, "le_rating": le_rating}

with open('../Models/model_salary_pred.pkl', 'wb') as file:
    pickle.dump(data, file)

### Loading the Exported Model

Let's test the model after loading it back from the pickle file.

In [None]:
# Load the trained model and the fitted encoders from the pickle file
with open('../Models/model_salary_pred.pkl', 'rb') as file:
    data = pickle.load(file)

regressor_loaded = data["model"]
le_state = data["le_state"]
le_sen = data["le_sen"]
le_edu = data["le_edu"]
le_exp = data["le_exp"]
le_indu = data["le_indu"]
le_rating = data["le_rating"]

In [None]:
# Predict with the loaded model; the result should match the earlier prediction
y_pred = regressor_loaded.predict(X_example)
salary = int(y_pred[0])
print(f"Predicted salary: ${salary:,}")
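A quick way to confirm that an export worked is to compare predictions before and after the round trip. The sketch below does this in memory with `pickle.dumps` and `pickle.loads` on a small model trained on random data; `X_demo`, `y_demo`, `model`, and `restored` are assumptions for illustration, not the notebook's objects.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic dataset standing in for the encoded features and salaries
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Serialize and restore the same kind of dict the notebook writes to disk
restored = pickle.loads(pickle.dumps({"model": model}))["model"]

same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after round trip:", same)
```

Pickled scikit-learn models are generally only safe to load with the same library version that wrote them, so recording that version next to the file is a reasonable precaution.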


## Model Performance and Accuracy

The model's performance is evaluated using the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). These metrics provide insights into how well the model is performing:

- **MAE:** Represents the average magnitude of errors in a set of predictions, without considering their direction.
- **RMSE:** Measures the square root of the average of squared differences between predicted and observed values, penalizing larger errors more than MAE.
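Both metrics can be computed by hand to see how they differ. The sketch below uses four hypothetical salaries and predictions (not results from this notebook) and only the standard library.

```python
import math

# Hypothetical true salaries and predictions, for illustration only
y_true = [100_000, 120_000, 90_000, 150_000]
y_hat = [110_000, 115_000, 90_000, 120_000]

# Signed error of each prediction
errors = [p - t for p, t in zip(y_hat, y_true)]

# MAE: mean of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the mean squared error
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"MAE:  ${mae:,.2f}")
print(f"RMSE: ${rmse:,.2f}")
```

The single large miss of 30,000 pulls the RMSE well above the MAE of 11,250, which is the penalizing effect described above.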

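**Accuracy:** The accuracy is reported as 100 minus the mean absolute percentage error (MAPE) on the test set. It reflects how close the predictions are to the actual salaries on average, not the share of salaries predicted exactly.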
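As a worked example of this accuracy figure, the sketch below computes the MAPE for four hypothetical salaries and predictions and subtracts it from 100. The numbers are illustrative and are not results from this notebook.

```python
# Hypothetical true salaries and predictions, for illustration only
y_true = [100_000, 120_000, 90_000, 150_000]
y_hat = [110_000, 115_000, 90_000, 120_000]

# Absolute percentage error of each prediction
pct_errors = [100 * abs(p - t) / t for p, t in zip(y_hat, y_true)]

# MAPE is the mean of those percentages; accuracy is its complement
mape = sum(pct_errors) / len(pct_errors)
accuracy = 100 - mape

print(f"MAPE: {mape:.2f}%")
print(f"Accuracy: {accuracy:.2f}%")
```

Because each error is divided by the true salary, this measure is undefined when a true value is zero and weighs misses on low salaries more heavily than the same dollar miss on high salaries.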


### Visualizing Feature Importance

In [None]:

import matplotlib.pyplot as plt

# Feature importance visualization
importance = regressor.feature_importances_
features = X_train.columns

plt.figure(figsize=(10, 6))
plt.barh(features, importance)
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Feature Importance for Salary Prediction Model")
plt.show()
