# **Final Project** - `Yassine ERRAJI` & `William PELTIER`

## ***Business Problem Definition*** 

William and I noticed that companies struggle to offer competitive salaries for data science positions.

So we decided to build a project that analyzes salary determinants, predicts expected compensation, segments job roles, and visualizes job families to improve HR compensation strategy and hiring decisions.

The chapters covered in class are the pillars of this project. We will strive to cover as many of the concepts studied as possible, but **not** at the expense of high-quality insights or of solving a concrete business problem.

## ***The Dataset*** 

* The data was extracted from Glassdoor, a job posting website.

* The dataset covers data science job postings and their salaries, offering a clear view of job opportunities.

* It is packed with essential details like job titles, estimated salaries, job descriptions, company ratings, and key company info such as location, size, and industry. 

### A few columns in the dataset include information on

**Job Title:** Title of the Job

**Salary Estimate:** Estimated salary for the job that the company provides

**Job Description:** The description of the job

**Rating:** Rating of the company

**Company Name:** Name of the Company

**Location:** Location of the job

**Headquarters:** Headquarters of the company

**Size:** Number of employees in the company

**Founded:** The year the company was founded

**Type of ownership:** Ownership types like private, public, government, and non-profit organizations

**Industry:** Industry in which the company operates, such as Aerospace or Energy

**Sector:** Type of services the company provides within the industry, e.g., Industry (Energy), Sector (Oil, Gas)

**Revenue:** Total revenue of the company

**Competitors:** Company competitors

... and more!

## ***Tasks Performed*** 

* **Regression** : Predict average salary.

* **Classification** : Predict whether a job is "high-paying".

* **Clustering** : Segment job postings into meaningful job families.

* **Dimensionality Reduction** : PCA & t-SNE to visualize job similarities.

## 1. Data Importation & Base Cleaning

Task 1.1 – Load the dataset
	•	pd.read_csv("Glassdoor_Salary_Cleaned_Version.csv")

Task 1.2 – Inspect structure
	•	df.info(), df.describe(), df.head()
	•	Identify numeric, categorical, and binary columns.

Task 1.3 – Handle missing values
	•	Check missingness per column
	•	Decide between imputation and dropping (we expect few missing values)

Task 1.4 – Convert relevant columns
	•	job_state, industry, sector, company_txt, revenue, etc. → categorical
	•	min_salary, max_salary, avg_salary, age, rating → numeric

Task 1.5 – Remove non-informative columns

Columns like:

	•	index
	•	Unnamed:…
	•	Possibly Competitors

(Only if they contain no usable variation.)
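
A minimal sketch of Tasks 1.1 to 1.5, assuming pandas. A tiny inline CSV stands in for the real file so the snippet runs on its own; its values and the `basic_clean` helper are illustrative assumptions, and median imputation is only one possible choice.

```python
import io

import pandas as pd

# Tiny inline stand-in for the real CSV (values are made up for illustration).
SAMPLE_CSV = """index,job_state,industry,avg_salary,rating,Competitors
0,CA,Tech,120.0,4.1,-1
1,NY,Finance,,3.8,-1
2,CA,Tech,95.5,,-1
"""


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop non-informative columns, cast dtypes, and impute numeric gaps."""
    df = df.copy()
    # Task 1.5: drop index-like columns and columns with a single unique value.
    index_like = [c for c in df.columns if c == "index" or c.startswith("Unnamed")]
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    df = df.drop(columns=sorted(set(index_like + constant)))
    # Task 1.4: cast the chosen text columns to categorical.
    for col in ["job_state", "industry"]:
        df[col] = df[col].astype("category")
    # Task 1.3: median imputation for numeric columns (one possible choice).
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    return df


# With the real data: pd.read_csv("Glassdoor_Salary_Cleaned_Version.csv")
raw = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(raw.isna().sum())  # Task 1.3: missingness per column
clean = basic_clean(raw)
clean.info()             # Task 1.2: inspect structure
```

Here `Competitors` is dropped only because it holds a single value in the sample, which matches the "no usable variation" rule above.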

## 2. Exploratory Data Analysis 

Task 2.1 – Salary Distribution
	•	Histogram of avg_salary
	•	Boxplot by state, industry, size

Task 2.2 – Salary differences across categories
	•	Salary by:
		•	job_state
		•	industry
		•	sector
		•	company_size

Boxplots + groupby summaries.

Task 2.3 – Skill prevalence
	•	Count plots for python_yn, R_yn, spark, aws, excel
	•	Heatmap of skills vs salary

Task 2.4 – Correlation matrix for numerical features

Include:
	•	rating, age, salaries, skill binaries (as ints), same_state

Task 2.5 – Outliers and anomalies detection

Identify:

	•	Very high salaries
	•	Inconsistent ratings
	•	Very young or very old companies

We do not need to remove them unless they break models.
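
The non-graphical parts of Tasks 2.2, 2.4 and 2.5 could be sketched as follows. The data frame is made up for illustration, and the 1.5 × IQR rule is an assumed outlier criterion, not one fixed by the project.

```python
import pandas as pd

# Illustrative data only; salaries are in thousands of USD.
df = pd.DataFrame({
    "job_state": ["CA", "CA", "NY", "NY", "TX", "TX", "CA", "NY"],
    "avg_salary": [120.0, 135.0, 110.0, 105.0, 90.0, 95.0, 300.0, 100.0],
    "rating": [4.1, 3.9, 3.5, 4.0, 3.2, 3.8, 4.5, 3.6],
    "python_yn": [1, 1, 0, 1, 0, 0, 1, 1],
})

# Task 2.2: salary summary per category.
by_state = df.groupby("job_state")["avg_salary"].agg(["count", "mean", "median"])
print(by_state)

# Task 2.4: correlation matrix over the numeric columns.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))

# Task 2.5: flag salaries more than 1.5 * IQR outside the quartiles.
q1, q3 = df["avg_salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df["salary_outlier"] = ~df["avg_salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df[df["salary_outlier"]])
```

Flagging rather than dropping keeps the outliers available, in line with the decision to remove them only if they break models.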

## 3. Feature Engineering

Task 3.1 – Encode categorical features

Use:

	•	One-hot encoding for nominal columns: sector, industry, company size, job_state, type_of_ownership
	•	Ordinal encoding for revenue if its bands are logically ordered (e.g., “Less than $1M”, “$1-5M”, …); otherwise one-hot encode it as well
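
As a rough sketch of the two encodings, assuming scikit-learn's `OrdinalEncoder` and pandas' `get_dummies`. The revenue labels and their order below are assumptions for illustration; the real bands must be read from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative rows; the revenue labels and their order are assumptions.
df = pd.DataFrame({
    "sector": ["Tech", "Finance", "Tech", "Health"],
    "revenue": ["$1-5M", "Less than $1M", "$5-10M", "$1-5M"],
})
revenue_order = ["Less than $1M", "$1-5M", "$5-10M"]

# Ordinal encoding keeps the low-to-high order of the revenue bands.
encoder = OrdinalEncoder(categories=[revenue_order])
df["revenue_ord"] = encoder.fit_transform(df[["revenue"]]).ravel()

# One-hot encoding for nominal columns with no inherent order.
encoded = pd.get_dummies(df.drop(columns="revenue"), columns=["sector"], dtype=int)
print(encoded)
```

Passing the category order explicitly matters: left to itself, the encoder sorts labels alphabetically, which would scramble the bands.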

Task 3.2 – Create additional features

Features to engineer:

	•	salary_range = max_salary - min_salary
	•	is_senior = 1 if ("Senior" in job_title or "Sr" in job_title)
	•	is_manager = 1 if ("Manager" in job_title)
	•	num_skills = python_yn + R_yn + spark + aws + excel
	•	Extract simplified job categories from job titles (data engineer, analyst, DS, ML, cloud, etc.)
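
The engineered features listed above can be sketched in pandas as follows. The sample rows and the `simplify_title` rules are hypothetical; the real mapping from titles to job families will need more cases.

```python
import pandas as pd

# Illustrative postings; titles and skill flags are made up.
df = pd.DataFrame({
    "job_title": ["Senior Data Scientist", "Sr. Data Engineer",
                  "Data Analyst", "Analytics Manager"],
    "min_salary": [110, 100, 55, 90],
    "max_salary": [160, 150, 80, 130],
    "python_yn": [1, 1, 0, 0],
    "R_yn": [1, 0, 0, 0],
    "spark": [0, 1, 0, 0],
    "aws": [0, 1, 0, 0],
    "excel": [0, 0, 1, 1],
})

df["salary_range"] = df["max_salary"] - df["min_salary"]
# \b word boundaries stop "sr" from matching inside unrelated words.
df["is_senior"] = df["job_title"].str.contains(
    r"\b(?:senior|sr)\b", case=False, regex=True).astype(int)
df["is_manager"] = df["job_title"].str.contains("manager", case=False).astype(int)
df["num_skills"] = df[["python_yn", "R_yn", "spark", "aws", "excel"]].sum(axis=1)


def simplify_title(title: str) -> str:
    """Map a raw job title to a coarse job family (hypothetical rules)."""
    t = title.lower()
    if "engineer" in t:
        return "data engineer"
    if "analyst" in t or "analytics" in t:
        return "analyst"
    if "scientist" in t:
        return "data scientist"
    return "other"


df["job_family"] = df["job_title"].apply(simplify_title)
print(df[["job_title", "is_senior", "is_manager", "num_skills", "job_family"]])
```

Note that a plain `"Senior" or "Sr" in job_title` test in Python is always truthy, which is why each substring is checked explicitly here.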

Task 3.3 – Train/validation split

	•	train_test_split(df, test_size=0.2, random_state=42)

This split is the one we will use for both regression and classification.

## 4. Regression - Predicting Salary

Task 4.1 – Baseline model

	•	Linear Regression
	•	Evaluate: RMSE, MAE, R²

Task 4.2 – Regularized models

Use:

	•	Ridge
	•	Lasso
	•	Tune the regularization strength (alpha) of each via GridSearchCV and compare the results

Task 4.3 – Tree-based regressors

	•	Random Forest Regressor
	•	Gradient Boosting Regressor

Task 4.4 – Model comparison table

Include:

	•	RMSE on train & test
	•	R² test
	•	Which model generalizes best?
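
One way to build the comparison table for Tasks 4.1 to 4.4, assuming scikit-learn. Synthetic data from `make_regression` stands in for the engineered features and `avg_salary`, and the alpha grid is an arbitrary starting point.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the engineered feature matrix and avg_salary.
X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

alphas = {"alpha": [0.01, 0.1, 1.0, 10.0]}
models = {
    "Linear": LinearRegression(),
    "Ridge": GridSearchCV(Ridge(), alphas, cv=5),
    "Lasso": GridSearchCV(Lasso(max_iter=10_000), alphas, cv=5),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
    "GradientBoosting": GradientBoostingRegressor(random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred_train, pred_test = model.predict(X_train), model.predict(X_test)
    rows.append({
        "model": name,
        "rmse_train": np.sqrt(mean_squared_error(y_train, pred_train)),
        "rmse_test": np.sqrt(mean_squared_error(y_test, pred_test)),
        "mae_test": mean_absolute_error(y_test, pred_test),
        "r2_test": r2_score(y_test, pred_test),
    })

results = pd.DataFrame(rows).set_index("model").sort_values("rmse_test")
print(results.round(2))
```

A large gap between `rmse_train` and `rmse_test` is the sign of overfitting to look for when deciding which model generalizes best.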

Task 4.5 – Interpret results

For the best model:

	•	Feature importances
	•	Which skills increase salary?
	•	Which industries pay more?
	•	How much does a 1-point rating increase salary?
	•	Effect of seniority

This is where the business insights shine.
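
To illustrate how such questions can be read off a fitted model, here is a sketch on synthetic data with known effects (a 10-unit salary gain per rating point, 15 for Python, 25 for seniority). These numbers are invented for the demonstration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data with known effects, so the recovered numbers can be checked.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "rating": rng.uniform(1.0, 5.0, n),
    "python_yn": rng.integers(0, 2, n),
    "is_senior": rng.integers(0, 2, n),
})
y = (60 + 10 * X["rating"] + 15 * X["python_yn"] + 25 * X["is_senior"]
     + rng.normal(0, 5, n))

# Linear coefficients: salary change per one-unit increase, others held fixed.
linear = LinearRegression().fit(X, y)
coefs = pd.Series(linear.coef_, index=X.columns)
print(coefs.round(2))

# Tree importances rank features but carry no sign or unit.
forest = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)
importances = pd.Series(
    forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.round(3))
```

The two views complement each other: coefficients answer "how much", importances answer "which features matter most", even when the best model is not linear.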

## 5. Classification - Predicting High-Paying Jobs

Task 5.1 – Create a classification target

Define:

```python
threshold = df["avg_salary"].quantile(0.75)
df["high_salary"] = (df["avg_salary"] >= threshold).astype(int)
```

Task 5.2 – Train classification models

Models to include:

	•	Logistic Regression
	•	KNN
	•	Decision Tree
	•	Random Forest / Gradient Boosting

Task 5.3 – Evaluate performance

	•	Confusion matrix
	•	Precision, recall, F1
	•	ROC curve and AUC
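
A possible training and evaluation loop for Tasks 5.2 and 5.3, assuming scikit-learn. `make_classification` generates a stand-in with roughly 25% positives to mirror the 75th-percentile target; the hyperparameters are untuned defaults chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: about 25% positives, like the high_salary target.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           weights=[0.75, 0.25], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale-sensitive models get a StandardScaler; tree models do not need one.
models = {
    "LogisticRegression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "DecisionTree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}

auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    auc[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(name)
    print(confusion_matrix(y_test, pred))
    print(classification_report(y_test, pred, digits=3))

print(pd.Series(auc).sort_values(ascending=False).round(3))
```

Because the classes are imbalanced by construction, precision, recall and AUC are more informative here than plain accuracy.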

Task 5.4 – Explain model

From logistic regression:

	•	Significant predictors
	•	Odds ratios

From tree models:

	•	Feature importances
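
Odds ratios are the exponentiated logistic regression coefficients. The sketch below uses synthetic data in which `aws` raises and `excel` lowers the log-odds of a high salary; both effects are invented for the demonstration. scikit-learn does not report p-values, so testing significance would require another tool such as statsmodels.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: aws raises the log-odds of a high salary, excel lowers them.
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "aws": rng.integers(0, 2, n),
    "excel": rng.integers(0, 2, n),
})
log_odds = -1.0 + 1.2 * X["aws"] - 0.8 * X["excel"]
y = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# A very large C makes the fit close to unregularized, so coefficients are
# not shrunk toward zero before being read as odds ratios.
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios.round(2))  # > 1 raises the odds, < 1 lowers them
```

An odds ratio of 2 for a skill means the odds of a posting being high-paying double when that skill is required, holding the other features fixed.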

Business interpretation:

	•	Which skills most strongly predict high-paying jobs?
	•	Which industries/states tend to offer higher salaries?
	•	What role does company size or age play?

## 6. Dimensionality Reduction (PCA + t-SNE)

Task 6.1 – Select relevant features

Two possible approaches:

Approach A — Skill space only

	•	python_yn, R_yn, spark, aws, excel

Approach B — Full feature set
(use scaled numeric + encoded categorical)

Task 6.2 – Standardize features

Use StandardScaler.

Task 6.3 – PCA

	•	Show explained variance ratio
	•	Plot PC1 vs PC2
	•	Interpret loadings (e.g., PC1 = cloud skills vs analyst skills)
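
Tasks 6.2 and 6.3 on the skill space (Approach A) might look like this. The skill flags are simulated so that `spark` and `aws` tend to co-occur while `excel` appears without them; this structure is an assumption made to give PC1 something to find.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300
cloud = rng.integers(0, 2, n)  # hidden "cloud-heavy role" indicator


def noisy(flag, keep=0.9):
    """Return a copy of flag with about 10% of the entries flipped."""
    return np.where(rng.random(n) < keep, flag, 1 - flag)


# Simulated skill flags (illustrative, not taken from the dataset).
skills = pd.DataFrame({
    "python_yn": rng.integers(0, 2, n),
    "R_yn": rng.integers(0, 2, n),
    "spark": noisy(cloud),
    "aws": noisy(cloud),
    "excel": noisy(1 - cloud),
})

X_scaled = StandardScaler().fit_transform(skills)
pca = PCA(n_components=2, random_state=42)
components = pca.fit_transform(X_scaled)

print("explained variance ratio:", pca.explained_variance_ratio_.round(3))

# Loadings: rows are original features, columns are principal components.
loadings = pd.DataFrame(
    pca.components_.T, index=skills.columns, columns=["PC1", "PC2"])
print(loadings.round(2))
```

Features that load on a component with opposite signs pull in opposite directions along it, which is how a reading such as "cloud skills vs analyst skills" is obtained.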

Task 6.4 – t-SNE or UMAP

	•	Tune the perplexity (t-SNE) and compare the resulting maps
	•	Visualize role clusters in 2D
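
A small sketch of perplexity tuning with scikit-learn's `TSNE`. The three synthetic groups stand in for role families, and the perplexity values tried are arbitrary; each embedding would then be drawn as a 2D scatter plot.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three separated groups in a 6-dimensional feature space.
rng = np.random.default_rng(42)
centers = rng.normal(0, 6, size=(3, 6))
X = np.vstack([c + rng.normal(0, 1, size=(60, 6)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# t-SNE layouts depend on perplexity, so compare a few values side by side.
embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca",
                random_state=42)
    embeddings[perplexity] = tsne.fit_transform(X_scaled)
    print(perplexity, embeddings[perplexity].shape,
          round(float(tsne.kl_divergence_), 3))
```

Perplexity must stay below the number of samples, and distances between clusters on a t-SNE map should not be over-interpreted.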

Insight:

Reveal natural divisions such as:

	•	Data engineering vs. analyst vs. ML roles
	•	Cloud-heavy vs. coding-heavy vs. business roles

## 7. Clustering Analysis

Task 7.1 – Apply clustering on PCA-reduced data

Methods:

	•	K-means
	•	Hierarchical clustering
	•	(optional) DBSCAN

Task 7.2 – Determine optimal cluster number

Use:

	•	Elbow method
	•	Silhouette score
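
Both criteria can be computed in one loop, as sketched below with scikit-learn. The four blobs from `make_blobs` are a synthetic stand-in for the PCA-reduced postings, and the range of `k` is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the PCA-reduced postings: four separated groups.
centers = [(-6, -6), (-6, 6), (6, -6), (6, 6)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0,
                  random_state=42)

inertia, silhouette = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertia[k] = km.inertia_                         # input to the elbow plot
    silhouette[k] = silhouette_score(X, km.labels_)  # higher is better

best_k = max(silhouette, key=silhouette.get)
print("inertia:", {k: round(v, 1) for k, v in inertia.items()})
print("silhouette:", {k: round(v, 3) for k, v in silhouette.items()})
print("best k by silhouette:", best_k)
```

Inertia always falls as `k` grows, so the elbow is read by eye, while the silhouette score gives a single number to maximize; on real data the two may disagree.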

Task 7.3 – Visualize clusters (very important)

Show:

	•	Clusters on PCA components
	•	Clusters on t-SNE map

Task 7.4 – Characterize clusters

For each cluster:

	•	Mean salary
	•	Most common skills
	•	Dominant industries or job titles
	•	Typical company size
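
Once each posting carries a cluster label, the profile per cluster is a single grouped aggregation. The rows below are made up, and `most_common` is a small hypothetical helper.

```python
import pandas as pd

# Illustrative postings with cluster labels already assigned (values made up).
df = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "avg_salary": [120.0, 130.0, 125.0, 70.0, 75.0, 65.0],
    "python_yn": [1, 1, 1, 0, 1, 0],
    "aws": [1, 1, 0, 0, 0, 0],
    "excel": [0, 0, 1, 1, 1, 1],
    "industry": ["Tech", "Tech", "Finance", "Retail", "Retail", "Health"],
    "size": ["1001-5000", "1001-5000", "51-200", "51-200", "51-200", "201-500"],
})


def most_common(values: pd.Series) -> str:
    """Return the most frequent value in a column."""
    return values.mode().iloc[0]


profile = df.groupby("cluster").agg(
    n_postings=("avg_salary", "size"),
    mean_salary=("avg_salary", "mean"),
    python_share=("python_yn", "mean"),  # mean of a 0/1 flag = share of postings
    aws_share=("aws", "mean"),
    excel_share=("excel", "mean"),
    top_industry=("industry", most_common),
    typical_size=("size", most_common),
)
print(profile.round(2))
```

Reading the table row by row gives a candidate name for each job family, for example a high-salary, cloud-heavy cluster versus a lower-salary, Excel-heavy one.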

Business insights:

	•	Identify distinct job families
	•	Understand which clusters correspond to high-paying profiles
	•	Recommend tailored compensation bands


## 8. Business Interpretation & Recommendation Section 

Task 8.1 – Synthesize key insights from all models

Examples:

	•	Python + AWS adds X salary premium
	•	Data engineers cluster separately and earn more than analysts
	•	Industry A systematically underpays compared to industry B
	•	Rating and company age have moderate influence

Task 8.2 – Provide actionable recommendations

For example:

	•	For companies:
		•	“Increase salary offers in states X and Y to remain competitive”
		•	“Roles requiring cloud skills should have independent pay scales”
	•	For job seekers:
		•	“Adding Spark or AWS significantly increases expected salary”

Task 8.3 – Highlight limitations and future improvements

	•	Dataset size
	•	Potential sampling bias
	•	Missing compensation factors (benefits, bonuses)
	•	Text processing could be extended