<a href="https://colab.research.google.com/github/biswajit-j5/Glassdoor-project/blob/main/17jun_Sample_EDA_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Glassdoor Project



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -**  Biswajit Jena


# **Project Summary -**

The objective of this project was to develop a data-driven solution that can predict the average salary for a given job posting using data from Glassdoor. This type of predictive tool is highly valuable for organizations seeking to benchmark compensation packages, ensure competitive salary offerings, and make informed hiring decisions based on market trends. Our approach combined data wrangling, exploratory data analysis (EDA), and machine learning modeling to address the business problem effectively.

We began by loading and inspecting the glassdoor_jobs_cleaned.csv dataset. The dataset contained various attributes including average salary (avg_salary), company rating (Rating), company age (company_age), job state, job title, and more. During the data wrangling phase, we focused on handling missing values, encoding categorical variables, and preparing the data for analysis. Specifically, rows with missing avg_salary values were dropped as they directly impacted the target variable. Missing numerical values such as Rating and company_age were filled using mean and median imputation, respectively. Categorical variables like job_state were one-hot encoded to make them usable by machine learning algorithms. We also removed irrelevant columns such as text-heavy descriptions and unnecessary indexes.

Our exploratory data analysis (EDA) provided insights into the distribution and relationships between variables. We used pair plots, box plots, and correlation heatmaps to visualize how features like company rating and age influence average salaries. For example, higher-rated companies generally offered slightly better salaries, though other factors also played significant roles. The EDA helped us understand potential multicollinearity issues and guided the feature selection process for the machine learning model.

For the machine learning component, we applied a simple yet interpretable model — linear regression — as a baseline. We selected Rating, company_age, and other relevant numeric or encoded features as predictors. The dataset was split into training and test sets using an 80-20 split. The model was trained on the training data and evaluated on the test data. Performance was measured using metrics such as Mean Squared Error (MSE) and R² score. The initial model produced reasonable predictions, with an R² score that indicated it captured key patterns in the data, though we recognize that more advanced algorithms like random forests or gradient boosting could further improve accuracy.

The outcome of this project is a functional salary prediction model that the client can integrate into their HR analytics processes. With this model, the client can estimate competitive salaries for different roles and locations based on historical data trends. This helps in designing offers that attract top talent while aligning with market standards. Additionally, the model can be updated over time with new job postings and salary data to maintain its relevance as market conditions change.

In conclusion, our solution provides the client with a practical and interpretable machine learning tool to achieve their business objective of data-driven salary estimation. Beyond its immediate use in salary prediction, this project lays the groundwork for future enhancements, such as incorporating additional features (e.g., skills, certifications, benefits offered) or adopting more complex models for better accuracy. By continuing to refine this model and integrate external market data, the client can stay ahead in a competitive hiring landscape and make compensation decisions with confidence.



# **GitHub Link -**

https://github.com/biswajit-j5

# **Problem Statement**


The objective of this project is to analyze salary trends across various job attributes and develop a predictive model capable of estimating salaries based on these factors. The analysis focuses on understanding how average salary differs by job position, such as Data Scientist, Software Engineer, and DevOps Engineer. It also examines the impact of company size on salary levels and explores how salaries vary across different locations.

#### **Define Your Business Objective?**

The primary business objective of this project is to develop a reliable machine learning model that can accurately predict the average salary for a job posting based on company and job-related features. This model aims to provide the client with a data-driven tool to benchmark and estimate salaries for different job roles across industries, locations, and company profiles.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart is accompanied by answers in the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')


### Dataset Loading

In [None]:
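# Load the dataset; prompt for an upload only when the file is missing
import os

if not os.path.exists("glassdoor_jobs_cleaned.csv"):
    from google.colab import files  # available only inside Google Colab
    uploaded = files.upload()  # Upload your 'glassdoor_jobs_cleaned.csv' file

df = pd.read_csv("glassdoor_jobs_cleaned.csv")
df.head()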


### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
num_rows, num_columns = df.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
print(missing_values)

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique()
print(unique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.


### What all manipulations have you done and insights you found?

Answer Here.
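
The manipulations listed in the Project Summary (dropping rows without `avg_salary`, mean-imputing `Rating`, median-imputing `company_age`, one-hot encoding `job_state`, and removing text-heavy or index columns) could be sketched roughly as below. This is a minimal sketch, not the exact code used in the project: the helper name `prepare_salary_data`, the column names `Job Description` and `Unnamed: 0`, and the toy rows are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def prepare_salary_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Return an analysis-ready copy of the postings (hypothetical helper)."""
    # Rows without a target value cannot be used, so drop them first.
    out = raw.dropna(subset=["avg_salary"])

    # Impute numeric gaps: mean for Rating, median for company_age.
    out = out.assign(
        Rating=out["Rating"].fillna(out["Rating"].mean()),
        company_age=out["company_age"].fillna(out["company_age"].median()),
    )

    # Drop text-heavy or index-like columns if present (assumed names).
    out = out.drop(columns=["Job Description", "Unnamed: 0"], errors="ignore")

    # One-hot encode the state so that models can consume it.
    out = pd.get_dummies(out, columns=["job_state"], prefix="state", dtype=int)
    return out.reset_index(drop=True)


# Toy rows standing in for glassdoor_jobs_cleaned.csv (made-up values).
toy = pd.DataFrame({
    "avg_salary": [72.0, np.nan, 95.0, 110.0],
    "Rating": [3.8, 4.1, np.nan, 4.4],
    "company_age": [12.0, 30.0, 8.0, np.nan],
    "job_state": ["CA", "NY", "CA", "TX"],
    "Job Description": ["...", "...", "...", "..."],
})
clean = prepare_salary_data(toy)
print(clean)
```

Imputing after dropping the rows without a target means the fill values are computed only from rows that are actually kept; whether that ordering matches the original analysis is not stated in the notebook.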

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1  Distribution of Average Salary

In [None]:
# Chart - 1 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8,5))
sns.histplot(df['avg_salary'], kde=True, bins=30, color='skyblue')
plt.title('Distribution of Average Salary')
plt.xlabel('Average Salary ($1000s)')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram shows the distribution of a continuous variable, which helps identify the salary spread, skewness, and central tendency in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The average salary distribution is right-skewed, with most jobs paying between \$65K and \$90K.

There's a long tail indicating some high-paying outliers.
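
The visual impression of right skew can also be checked numerically. The following is a minimal sketch that uses a small made-up `salaries` series in place of `df['avg_salary']`; positive skewness and a mean above the median are the usual signs of a long upper tail, and the 1.5 × IQR fence is one common (assumed) rule for flagging outliers.

```python
import pandas as pd

# Made-up salaries in $1000s, standing in for df['avg_salary'].
salaries = pd.Series([60, 65, 68, 70, 72, 75, 78, 80, 85, 90, 120, 180], dtype=float)

skewness = salaries.skew()                 # > 0 suggests a right (upper) tail
mean, median = salaries.mean(), salaries.median()
q1, q3 = salaries.quantile([0.25, 0.75])   # bounds of the middle 50% of postings

# Flag upper-tail outliers with the common 1.5 * IQR rule.
upper_fence = q3 + 1.5 * (q3 - q1)
high_outliers = salaries[salaries > upper_fence]

print(f"skewness={skewness:.2f}, mean={mean:.1f}, median={median:.1f}")
print(f"IQR: {q1:.2f} to {q3:.2f}; values above {upper_fence:.2f}: {list(high_outliers)}")
```

Running the same lines on `df['avg_salary']` would give the actual figures for the dataset.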

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding salary distribution helps set competitive and realistic salary benchmarks for job listings, especially to attract talent without overspending.

Any Negative Growth Insight?
If too many roles offer lower salaries than the market average, this may lead to higher attrition rates or fewer applicants — a sign for HR to reevaluate compensation structures.

#### Chart - 2  Top 10 Job Locations by Count

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10,5))
df['job_state'].value_counts().head(10).plot(kind='bar', color='salmon')
plt.title('Top 10 States by Job Count')
plt.xlabel('State')
plt.ylabel('Number of Job Postings')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

Bar charts are great for comparing categorical values — in this case, job distribution by U.S. state.



##### 2. What is/are the insight(s) found from the chart?

States like California, New York, and Texas have the highest number of job postings.

These regions are key employment hubs in tech and consulting.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Targeting job ads in high-demand states can maximize visibility, while identifying underrepresented states may highlight expansion opportunities.

#### Chart - 3 Average Salary by Company Rating

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8,5))
sns.scatterplot(data=df, x='Rating', y='avg_salary')
plt.title('Company Rating vs. Average Salary')
plt.xlabel('Company Rating')
plt.ylabel('Average Salary ($1000s)')
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot shows the relationship between two numerical variables, here: company rating and average salary.

##### 2. What is/are the insight(s) found from the chart?

There's a slight upward trend — better-rated companies tend to offer higher salaries.

However, it's not a strong linear relationship.
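
The strength of this relationship can be quantified with a correlation coefficient. The sketch below uses made-up values in place of the notebook's `df`, so the numbers it prints are illustrative only. Pearson measures linear association and Spearman measures monotonic association; small values for both would support the "slight trend" reading.

```python
import pandas as pd

# Made-up (Rating, avg_salary) pairs standing in for the notebook's df.
toy = pd.DataFrame({
    "Rating": [2.9, 3.2, 3.5, 3.7, 3.9, 4.0, 4.2, 4.5],
    "avg_salary": [70, 95, 64, 88, 72, 110, 81, 99],
})

pearson = toy["Rating"].corr(toy["avg_salary"], method="pearson")
spearman = toy["Rating"].corr(toy["avg_salary"], method="spearman")

# r squared is the share of salary variance that a straight-line fit on Rating explains.
print(f"Pearson r={pearson:.2f}, Spearman rho={spearman:.2f}, r^2={pearson ** 2:.2f}")
```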

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight supports the strategy of investing in employer branding — improving Glassdoor ratings could attract better candidates even at competitive salary levels.

Negative Growth Insight?
Companies with high salaries but poor ratings may struggle with employee satisfaction, suggesting internal issues that need addressing.

#### Chart - 4   Average Salary by Company Age

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(8,5))
sns.lineplot(data=df.sort_values('company_age'), x='company_age', y='avg_salary', marker='o')
plt.title('Company Age vs. Average Salary')
plt.xlabel('Company Age (Years)')
plt.ylabel('Average Salary ($1000s)')
plt.show()


##### 1. Why did you pick the specific chart?

Line plots reveal trends across ordered data, such as age.






##### 2. What is/are the insight(s) found from the chart?

Companies aged 10–30 years tend to offer higher average salaries.

Very new or very old firms often pay slightly less.
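
A line plot over every distinct age can be noisy, so the age-bracket reading is easier to check by binning. This is a sketch with assumed bracket edges (up to 10, 11 to 30, and over 30 years) and made-up rows; on the real data the same `pd.cut` and `groupby` calls would be applied to `df`.

```python
import pandas as pd

# Made-up rows standing in for the notebook's df.
toy = pd.DataFrame({
    "company_age": [2, 5, 9, 12, 18, 25, 29, 45, 80, 120],
    "avg_salary": [70, 75, 72, 95, 102, 98, 90, 80, 76, 74],
})

# Assumed brackets: [0, 10], (10, 30], (30, inf).
bins = [0, 10, 30, float("inf")]
labels = ["0-10", "11-30", "31+"]
toy["age_bracket"] = pd.cut(
    toy["company_age"], bins=bins, labels=labels, include_lowest=True
)

# Count, mean and median salary per bracket; the count shows how much data backs each mean.
summary = (
    toy.groupby("age_bracket", observed=True)["avg_salary"]
       .agg(["count", "mean", "median"])
)
print(summary)
```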

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If your firm is in the high-paying age bracket, you can leverage that in hiring campaigns.

#### Chart - 5  Correlation Heatmap



In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(6,4))
sns.heatmap(df[['min_salary', 'max_salary', 'avg_salary', 'Rating', 'company_age']].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Between Numerical Features')
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap gives a quick visual summary of correlations between numeric variables — essential before modeling.

##### 2. What is/are the insight(s) found from the chart?




min_salary and max_salary have strong correlation with avg_salary (expected).

Rating and company_age have weaker correlation, but could still be helpful features.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 It helps focus on features that influence salary, guiding feature selection for more accurate ML predictions.

 Negative Growth Insight?
If features like Rating don’t correlate with salary, companies may wrongly assume improving ratings alone boosts hiring impact — a holistic strategy is better.

#### Chart - 6  Average Salary by Job State (Boxplot)

In [None]:
# Chart - 6 visualization code
if 'job_state' in df.columns:
    plt.figure(figsize=(10,6))
    sns.boxplot(y='job_state', x='avg_salary', data=df)
    plt.title('Average Salary by Job State')
    plt.show()


##### 1. Why did you pick the specific chart?

To compare salary levels across states.



##### 2. What is/are the insight(s) found from the chart?

We can identify high-paying states vs low-paying ones, useful for remote work policies.
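
To name the high-paying and low-paying states rather than read them off the boxplot, the medians can be tabulated. The sketch below is built on made-up rows; the `MIN_POSTINGS` threshold is an assumption added so that states with very few postings do not dominate the ranking.

```python
import pandas as pd

# Made-up rows standing in for the notebook's df.
toy = pd.DataFrame({
    "job_state": ["CA", "CA", "CA", "NY", "NY", "TX", "TX", "TX", "WA"],
    "avg_salary": [120, 135, 110, 105, 115, 85, 90, 80, 125],
})

MIN_POSTINGS = 2  # assumed threshold; a single posting is too noisy to rank a state

by_state = (
    toy.groupby("job_state")["avg_salary"]
       .agg(postings="count", median_salary="median")
       .loc[lambda t: t["postings"] >= MIN_POSTINGS]
       .sort_values("median_salary", ascending=False)
)
print(by_state)
```

The median is used here because it is what the boxplot's centre line shows and it is less sensitive to a few very high salaries than the mean.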



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — Enables smarter location-based offers.
Negative if it encourages underpaying in lower-salary states, risking talent loss.

#### Chart - 7  Average Salary by Job State and Size (Barplot)

In [None]:
# Chart - 7 visualization code
if 'job_state' in df.columns and 'Size' in df.columns:
    plt.figure(figsize=(12,6))
    sns.barplot(x='job_state', y='avg_salary', hue='Size', data=df)
    plt.title('Average Salary by Job State and Company Size')
    plt.show()


##### 1. Why did you pick the specific chart?

To compare average salary across combinations of job state and company size.



##### 2. What is/are the insight(s) found from the chart?

Large firms in premium states likely pay best.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — Informs geo-strategy.



#### Chart - 8  Average Salary by Company Size (Boxplot)

In [None]:
# Chart - 8 visualization code
if 'Size' in df.columns:
    plt.figure(figsize=(10,6))
    sns.boxplot(x='Size', y='avg_salary', data=df)
    plt.title('Average Salary by Company Size')
    plt.xticks(rotation=45)
    plt.show()

##### 1. Why did you pick the specific chart?

To understand if size affects pay.



##### 2. What is/are the insight(s) found from the chart?

For example, large companies may show a narrower salary range, while startups may show a wider spread.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — Helps position pay packages.

Negative if misinterpreted — assuming size always means better pay.

#### Chart - 9  Average Salary by Top Industries (Boxplot)

In [None]:
# Chart - 9 visualization code
if 'Industry' in df.columns:
    plt.figure(figsize=(10,6))
    top_industries = df['Industry'].value_counts().nlargest(10).index
    sns.boxplot(y='Industry', x='avg_salary', data=df[df['Industry'].isin(top_industries)])
    plt.title('Average Salary by Top Industries')
    plt.show()

##### 1. Why did you pick the specific chart?

To compare industry salary standards.



##### 2. What is/are the insight(s) found from the chart?

Certain industries, such as tech and finance, pay a premium.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — Refines talent targeting.



#### Chart - 10   Average Salary by Sector (Boxplot)

In [None]:
# Chart - 10 visualization code
if 'Sector' in df.columns:
    plt.figure(figsize=(10,6))
    sns.boxplot(y='Sector', x='avg_salary', data=df)
    plt.title('Average Salary by Sector')
    plt.show()

##### 1. Why did you pick the specific chart?

To see sector-level trends (e.g. public vs private).



##### 2. What is/are the insight(s) found from the chart?

Sector differences help align pay policies.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — Smarter sector-based pay offers.



#### Chart - 11  Rating Distribution by Company Age (Line plot)


In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(8,5))
sns.lineplot(x='company_age', y='Rating', data=df)
plt.title('Rating by Company Age')
plt.show()


##### 1. Why did you pick the specific chart?

To see if older companies have better/worse ratings.



##### 2. What is/are the insight(s) found from the chart?

Likely no clear linear trend.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — Helps brand strategy.



#### Chart - 12  Company Age Distribution (Histogram)

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(8,5))
sns.histplot(df['company_age'], bins=20, kde=True)
plt.title('Company Age Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

To understand the age distribution of employers in the dataset.



##### 2. What is/are the insight(s) found from the chart?

The distribution may skew young if the data is startup-heavy.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — Helps segment market.



#### Chart - 13  Rating Distribution (Histogram)

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(8,5))
sns.histplot(df['Rating'], bins=20, kde=True)
plt.title('Rating Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

To see how company ratings spread.



##### 2. What is/are the insight(s) found from the chart?

It can show a skew towards higher- or lower-rated firms.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — Useful for employer branding.



#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Select only the numeric columns from the DataFrame
numeric_df = df.select_dtypes(include=np.number)

# Compute the correlation matrix on the numeric columns
corr = numeric_df.corr()

# Set up the matplotlib figure
plt.figure(figsize=(10, 8))

# Draw the heatmap
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)

# Titles and labels
plt.title('Correlation Heatmap of Numerical Features')
plt.show()


##### 1. Why did you pick the specific chart?

I chose the correlation heatmap because it provides a visual summary of how numerical features relate to each other, helping identify strong positive or negative relationships.
It is an efficient way to spot multicollinearity and key drivers of salary in one glance.
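
Multicollinearity can also be listed explicitly instead of being read from the colours. The following is a minimal sketch on made-up rows; it assumes `avg_salary` is the midpoint of `min_salary` and `max_salary`, and the 0.9 cut-off is an arbitrary illustrative threshold, not a value taken from the project.

```python
import numpy as np
import pandas as pd

# Made-up rows; avg_salary is built as the midpoint of min and max by assumption.
toy = pd.DataFrame({
    "min_salary": [50, 60, 70, 80, 90, 100],
    "max_salary": [90, 95, 120, 125, 150, 160],
    "Rating": [3.9, 3.1, 4.4, 3.6, 4.0, 3.3],
    "company_age": [40, 12, 7, 95, 23, 55],
})
toy["avg_salary"] = (toy["min_salary"] + toy["max_salary"]) / 2

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().reset_index()
pairs.columns = ["feature_a", "feature_b", "r"]

THRESHOLD = 0.9  # assumed cut-off for "too correlated to use together"
flagged = pairs[pairs["r"].abs() >= THRESHOLD].sort_values("r", ascending=False)
print(flagged)
```

If `avg_salary` is indeed derived from `min_salary` and `max_salary`, those two columns would leak the target and would normally be excluded from the predictors.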

##### 2. What is/are the insight(s) found from the chart?

The chart shows which factors (e.g., rating, company age) have higher or lower correlation with average salary.
It reveals potential predictors for salary modeling and highlights where features might overlap in their influence.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Select numerical columns for the pair plot
numeric_cols = ['avg_salary', 'Rating', 'company_age']

# If you have other numeric features, add them to the list
# numeric_cols.append('other_numeric_feature')

# Create the pair plot
sns.pairplot(df[numeric_cols])
plt.suptitle("Pair Plot of Numeric Features", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

To visually explore pairwise relationships before modeling.



##### 2. What is/are the insight(s) found from the chart?

Confirms or denies trends like salary rising with rating or age.



## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To help the client achieve their business objective of accurately estimating employee salaries based on company and job characteristics, we propose a data-driven machine learning solution. Using the cleaned Glassdoor dataset, we built a predictive model that leverages key features such as company rating, company age, job location, and job type to estimate average salaries. This model enables the client to gain deeper insights into salary trends across industries, regions, and job roles, supporting data-informed decisions on compensation strategies. By integrating this model into their workflow, the client can benchmark salaries against competitors, optimize their job offers, and improve talent acquisition by offering competitive packages that align with market standards. Additionally, this solution provides a foundation for continuous improvement — as more data becomes available, the model can be refined for even greater accuracy, ensuring the client stays ahead in a dynamic job market.
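
The baseline described in the Project Summary (linear regression, an 80-20 split, and evaluation with MSE and R²) could look roughly like the sketch below. The notebook does not include the modelling code, so this is an illustration under stated assumptions: it fits ordinary least squares with NumPy as a stand-in for whichever library was used, and it runs on synthetic data whose coefficients are invented, so the printed scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the cleaned dataset; the coefficients are invented.
n = 200
toy = pd.DataFrame({
    "Rating": rng.uniform(2.5, 5.0, n),
    "company_age": rng.integers(1, 100, n).astype(float),
})
toy["avg_salary"] = (
    60 + 8 * toy["Rating"] + 0.05 * toy["company_age"] + rng.normal(0, 5, n)
)

features = ["Rating", "company_age"]

# 80-20 split on a shuffled index.
order = rng.permutation(n)
cut = int(0.8 * n)
train, test = toy.iloc[order[:cut]], toy.iloc[order[cut:]]


def design(frame: pd.DataFrame) -> np.ndarray:
    """Feature matrix with a leading column of ones for the intercept."""
    return np.column_stack([np.ones(len(frame)), frame[features].to_numpy()])


# Ordinary least squares fit on the training rows only.
coef, *_ = np.linalg.lstsq(design(train), train["avg_salary"].to_numpy(), rcond=None)

# Evaluate on the held-out rows.
y_true = test["avg_salary"].to_numpy()
y_pred = design(test) @ coef
mse = float(np.mean((y_true - y_pred) ** 2))
r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"intercept={coef[0]:.2f}, slopes={dict(zip(features, coef[1:].round(3)))}")
print(f"test MSE={mse:.2f}, test R^2={r2:.3f}")
```

Scoring on rows the model never saw during fitting is what makes the MSE and R² an estimate of performance on new job postings rather than a measure of fit to the training data.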

# **Conclusion**

In this project, we successfully developed a machine learning solution that predicts average salaries based on key job and company features from the Glassdoor dataset. Through systematic data wrangling, exploratory data analysis, and model training, we gained valuable insights into the factors that influence salaries — such as company rating and age. The final model provides the client with a reliable tool to estimate compensation levels, enabling more competitive and data-backed salary offers. This not only supports smarter recruitment strategies but also helps in aligning with industry standards. Moving forward, incorporating additional features (like skills, education requirements, or job descriptions) and regularly updating the model with new data will further enhance prediction accuracy and business value.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***