# **Project Name**    -     Glassdoor Salary Analysis



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member 1 -** Sourasish Mondal

# **Project Summary -**


This project aims to predict salaries based on job postings from Glassdoor. The dataset contains various features such as job title, location, company size, industry, and more. We will perform exploratory data analysis (EDA), data preprocessing, and build regression models to predict salaries. The goal is to understand which factors most influence salary and create a model that can accurately predict salaries based on these factors.

# **GitHub Link -**

[glassdoor_project ](https://github.com/souraOP/Projects-Labmentix/tree/6ce2f2cf33026aa6f943fdd26db8590e01aa8810/Glassdoor%20Project)  Provide your GitHub Link here.

# **Problem Statement**


Predict the average salary of a job listing using features such as company size, location, industry, job description, and company rating.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
#%pip install contractions
#%pip install textblob

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import string
import re
import nltk
from nltk.corpus import stopwords

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('glassdoor_jobs.csv')
dataframe = pd.DataFrame(df)

### Dataset First View

In [None]:
# Dataset First Look

df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape
print(f"Total 'rows' of the dataset: ", df.shape[0])
print(f"Total 'columns' of the dataset: ", df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
print("Dataset Information: \n")
print(df.info(), "\n")

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cmap="crest")
plt.show()
print("The uniform colour of the heatmap indicates that there are NO missing values")

### What did you know about your dataset?

The dataset contains job postings from Glassdoor website with various features such as job title, salary estimate, job description, company name, location, headquarters, revenue, sector and many more. There are some missing values in the dataset, particularly in the 'Competitors' column. The dataset has both numerical and categorical features, which will require preprocessing before building the model.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

- **Job Title**: The title of the job posting.
- **Salary Estimate**: The estimated salary range for the job given.
- **Job Description**: A description/responsibilities of the job posting.
- **Rating**: The company's rating on Glassdoor's website.
- **Company Name**: The name of the company posting the job on the website.
- **Location**: The location of the job.
- **Headquarters**: The headquarters location of the company.
- **Size**: The size of the company in terms of total employee count.
- **Founded**: The year the company was founded.
- **Type of ownership**: The type of ownership (private/public etc).
- **Industry**: The industry of the company.
- **Sector**: The sector of the company.
- **Revenue**: The revenue of the company.
- **Competitors**: Competitors of the company (with other company(s) if not then -1).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Before cleaning the data
print(df['Salary Estimate'], "\n")
print(df['Salary Estimate'].shape)

In [None]:
# Write your code to make your dataset analysis ready.

def clean_salary(salary):
  if pd.isna(salary) or salary == '-1':
    return np.nan

  salary = salary.split('(')[0].strip()

  if 'Employer Provided Salary:' in salary:
    salary=salary.replace('Employer Provided Salary:', '').strip()

  if 'Per Hour' in salary.lower():
    salary = salary.lower().replace(' Per Hour', '').strip()
    if '-' in salary:
      min_rate, max_rate = salary.split('-')
      min_rate = float(min_rate.replace('$', '').strip())
      max_rate = float(max_rate.replace('$', '').strip())
    else:

      rate = float(salary.replace('$', '').strip())
      min_rate = max_rate = rate

    # Convert to annual (40 hours/week, 52 weeks/year)
    min_annual = int(min_rate * 40 * 52 / 1000)
    max_annual = int(max_rate * 40 * 52 / 1000)
    return f"{min_annual}K-{max_annual}K"

    # min_hourly, max_hourly = hourly_range.split('-')
    # min_hourly = float(min_hourly.replace('$', '').strip())
    # max_hourly=float(max_hourly.replace('$', '').strip())

    # min_annual = min_hourly * 40 * 52 / 1000 # to thousands
    # max_annual = max_hourly * 40 * 52 / 1000 # to thousands
    # return f"{min_annual:.0f}K-{max_annual:.0f}K"

  if '-' in salary:
    min_salary, max_salary = salary.split('-')
    min_salary = min_salary.replace('$', '').replace('K', '').strip()
    max_salary = max_salary.replace('$', '').replace('K', '').strip()
    try:
      min_salary=int(float(min_salary))
      max_salary=int(float(max_salary))
      return f"{min_salary}K-{max_salary}K"
    except ValueError:
      return np.nan

    # If no valid format is found, return NaN
  return np.nan

cleaned_df = pd.read_csv('glassdoor_jobs.csv')
cleaned_df = cleaned_df.loc[:, ~cleaned_df.columns.str.contains('^Unnamed')]
cleaned_df['Salary Estimate'] = cleaned_df['Salary Estimate'].apply(clean_salary)
cleaned_df.dropna(subset=['Salary Estimate'], inplace=True)
cleaned_df['Company Age'] = 2025 - cleaned_df['Founded']
cleaned_df.drop('Founded', axis=1, inplace=True)

# taking out the max and min salary from the range
cleaned_df['Min_Salary'] = cleaned_df['Salary Estimate'].apply(lambda a: int(a.split('-')[0].replace('K', '')))
cleaned_df['Min_Salary']
cleaned_df['Max_Salary'] = cleaned_df['Salary Estimate'].apply(lambda a: int(a.split('-')[1].replace('K', '')))
cleaned_df['Max_Salary']

# avg sal
cleaned_df['Average_Salary'] = (cleaned_df['Min_Salary'] + cleaned_df['Max_Salary']) / 2
print(cleaned_df[['Salary Estimate', 'Min_Salary', 'Max_Salary', 'Average_Salary']].head())

In [None]:
print(cleaned_df)
print(cleaned_df['Salary Estimate'].isnull().sum())

In [None]:
cleaned_df['Job Title'].unique()

### What all manipulations have you done and insights you found?

- Extracted the minimum and maximum salary from the 'Salary Estimate' column.
- Calculated the average salary.
- Dropped the original 'Salary Estimate' column as it is no longer needed.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart 1: Distribution of Average Salary (Univariate)
plt.figure(figsize=(10, 6))
sns.histplot(cleaned_df['Average_Salary'], kde=True, bins=30, color='blue')
plt.title('Distribution of Average Salary')
plt.xlabel('Average Salary (in thousands)')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

This histogram plot shows the frequency of different average salaries for multiple job roles.

##### 2. What is/are the insight(s) found from the chart?

From the above histogram plot we can observe that as the average salary from $80K to $110k the frequency of the curve increase drastically and reaches highest around the range from `$85k to $94k` the peak is at the highest.

- After the `$100k` mark the frequency of the curve decreases and almost converges with the x-axis from `$200k` onwards

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Positive Impacts:

- <b>Competitive Mid Range Salaries:</b> The distribution shows a strong concentration of salaries between _$75,000-$125,000_, suggesting the company is paying competitive market rates for most positions. This can help with employee retention and attraction of talent, reducing costly turnover.

- <b>Room for Growth:</b> The extended right tail (showing salaries up to _$250,000_) demonstrates potential for career advancement and higher earnings, which can be a strong motivator for high-performing employees.

#### Negative Impacts:

- <b>Salary Compression:</b> The high concentration in the middle range might indicate salary compression, where experienced employees and new hires have similar salaries. This could lead to dissatisfaction among experienced employees and potentially leaving the company.
  
- <b>Wide Gap between Mid range and top earners:</b> The significant drop in frequency between the peak (_$100,000_) and higher salaries (_$200,000+_) might create perception issues about advancement opportunities, potentially causing mid-level talent to seek opportunities elsewhere.

#### Chart - 2

In [None]:
# Chart 2: Distribution of Company Age (Univariate)
plt.figure(figsize=(10, 6))
sns.histplot(cleaned_df['Company Age'], kde=True, bins=30, color='green')
plt.title('Distribution of Company Age')
plt.xlabel('Company Age (in years)')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Chart - 2 visualization code
top_jobs = cleaned_df['Job Title'].value_counts().nlargest(10).index
sns.boxplot(x='Job Title', y='Average_Salary', data=cleaned_df[cleaned_df['Job Title'].isin(top_jobs)])
plt.xticks(rotation=90)
plt.title('Average Salary by Job Title')
plt.show()

##### 1. Why did you pick the specific chart?

The boxplot effectively visualizes the distribution of average salaries across different job titles.
- It provides clear summary of the distribution.
- Easy to compare with other job titles along with their average salaries.
- Able to detect outliers.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart 3: Count of Job Titles (Univariate)
top_job_3 = cleaned_df['Job Title'].value_counts().nlargest(15).index
plt.figure(figsize=(12, 8))
sns.countplot(y='Job Title', data=cleaned_df[cleaned_df['Job Title'].isin(top_job_3)], palette='viridis', hue='Job Title', legend=False)
plt.title('Count of Job Titles')
plt.xlabel('Count')
plt.ylabel('Job Title')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart 4: Average Salary by Job Title - (Bivariate)
top_job_titles = cleaned_df['Job Title'].value_counts().nlargest(30).index
plt.figure(figsize=(15, 8))
sns.boxplot(x='Job Title', y='Average_Salary', hue='Job Title', legend=False, data=cleaned_df[cleaned_df['Job Title'].isin(top_job_titles)], palette='viridis')
plt.xticks(rotation=90)
plt.title('Average Salary by Job Title')
plt.xlabel('Job Titles (20)', fontsize=20)
plt.ylabel('Average Salary (in thousands)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart 5: Average Salary by Company Size (Bivariate - Numerical:Categorical)
plt.figure(figsize=(12, 6))
sns.boxplot(x='Size', hue='Size', legend=False, y='Average_Salary', data=cleaned_df, palette='coolwarm')
plt.title('Average Salary by Company Size')
plt.xlabel('Company Size')
plt.xticks(rotation=90)
plt.ylabel('Average Salary (in thousands)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart 6: Average Salary by Revenue - (Bivariate, Numerical:Categorical)
plt.figure(figsize=(12, 6))
sns.boxplot(x='Revenue', y='Average_Salary', hue='Revenue', legend=False, data=cleaned_df, palette='magma')
plt.xticks(rotation=90)
plt.title('Average Salary by Revenue')
plt.xlabel('Revenue')
plt.ylabel('Average Salary (in thousands)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart 7: Average Salary by Location
top_location = cleaned_df['Location'].value_counts().nlargest(25).index
plt.figure(figsize=(18, 6))
sns.boxplot(x='Location', y='Average_Salary', hue='Location', data=cleaned_df[cleaned_df['Location'].isin(top_location)], palette='plasma')
plt.xticks(rotation=90)
plt.title('Average Salary by 25 Locations')
plt.xlabel('Locations (25)')
plt.ylabel('Average Salary (in thousands)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart 8: Scatter Plot - Company Age vs. Average Salary
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Company Age', y='Average_Salary', data=cleaned_df, color='purple')
plt.title('Company Age vs. Average Salary')
plt.xlabel('Company Age (in years)')
plt.ylabel('Average Salary (in thousands)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# B11: Company Size vs. Rating
plt.figure(figsize=(10, 6))
sns.boxplot(x='Size', y='Rating', data=cleaned_df, hue='Size', legend=False, palette='viridis')
plt.title('Company Size vs. Rating')
plt.xlabel('Company Size')
plt.xticks(rotation=90)
plt.ylabel('Rating')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart 10: Scatter Plot - Min_Salary vs. Max_Salary
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Min_Salary', y='Max_Salary', data=cleaned_df, color='orange')
plt.title('Min_Salary vs. Max_Salary')
plt.xlabel('Minimum Salary (in thousands)')
plt.ylabel('Maximum Salary (in thousands)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# B4: Average Salary by Industry
plt.figure(figsize=(16, 6))
sns.boxplot(x='Industry', y='Average_Salary', hue='Industry', legend=False, data=cleaned_df, palette='Set2')
plt.xticks(rotation=90)
plt.title('Average Salary by Industry')
plt.xlabel('Industry')
plt.ylabel('Average Salary (in thousands)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
def extract_seniority(job_title):
    # Convert job_title to string to avoid AttributeError
    job_title = str(job_title).lower()
    if 'senior' in job_title or 'lead' in job_title or 'principal' in job_title or 'manager' in job_title:
        return 'Senior'
    elif 'jr' in job_title or 'junior' in job_title or 'entry level' in job_title:
        return 'Junior'
    else:
        return 'Mid'  # Default to Mid-level

cleaned_df['Seniority'] = cleaned_df['Job Title'].apply(extract_seniority)



# B14: Seniority vs. Sector
plt.figure(figsize=(16, 9))
sns.countplot(x='Industry', hue='Seniority', data=cleaned_df, palette='Set2')
plt.xticks(rotation=45, ha='right')
plt.title('Seniority Count by Sector')
plt.xlabel('Sector')
plt.ylabel('Count')
plt.legend(title='Seniority')
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# B10: Scatter Plot - Max_Salary vs. Rating
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Rating', y='Max_Salary', data=cleaned_df, color='red')
plt.title('Max_Salary vs. Rating')
plt.xlabel('Company Rating')
plt.ylabel('Maximum Salary (in thousands)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Select only numerical columns
numerical_columns = cleaned_df.select_dtypes(include=['int64', 'float64']).columns
numerical_df = cleaned_df[numerical_columns]

# Compute correlation matrix
corr = numerical_df.corr()

# Plot correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap (Numerical Columns Only)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Chart 11: Pair Plot for Numerical Features
sns.pairplot(cleaned_df[['Min_Salary', 'Max_Salary', 'Average_Salary', 'Company Age', 'Rating']])
plt.suptitle('Pair Plot for Numerical Features', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

#### Statement 1: Job Title and Salary
- Hypothesis: The average salary differs significantly across different job titles.
  - Null Hypothesis (H_0): There is no significant difference in average salary across job titles.
  - Alternative Hypothesis (H_1): There is a significant difference in average salary across job titles.

#### 2. Perform an appropriate statistical test.

In [None]:
import scipy.stats as stats

# Group the data by 'Job Title' and extract 'Avg_Salary'
groups = [group['Average_Salary'].values for name, group in cleaned_df.groupby('Job Title')]

# Perform ANOVA
f_statistic, p_value = stats.f_oneway(*groups)

# Print results
print(f"ANOVA Results for Job Title vs. Avg_Salary:")
print(f"F-statistic: {f_statistic:.2f}, p-value: {p_value:.4f}")

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis as significant difference in average salary across job titles.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in average salary across job titles.")

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

#### Statement 2: Sector and Company Rating

- Statement: The "Aerospace & Defense" sector has a different distribution of company ratings compared to the "Business Services" sector.
  - Null Hypothesis (H0): The distributions of company ratings are the same for the "Aerospace & Defense" and "Business Services" sectors.
  - Alternative Hypothesis (H1): The distributions of company ratings are different for the "Aerospace & Defense" and "Business Services" sectors.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy import stats

aerospace_ratings = cleaned_df[cleaned_df['Sector'] == 'Aerospace & Defense']['Rating'].dropna()
business_ratings = cleaned_df[cleaned_df['Sector'] == 'Business Services']['Rating'].dropna()

# Perform the Mann-Whitney U test

u_statistic, p_value_2 = stats.mannwhitneyu(aerospace_ratings, business_ratings, alternative='two-sided') #Non-parametric Test

print(f"Mann-Whitney U statistic: {u_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05

if p_value_2 < alpha:
    print("Reject the null hypothesis.")
    print("There is evidence to support that the distributions of company ratings are different for the Aerospace & Defense and Business Services sectors.")
else:
    print("Fail to reject the null hypothesis.")
    print("There is no significant evidence to support that the distributions of company ratings are different for the Aerospace & Defense and Business Services sectors.")


##### Which statistical test have you done to obtain P-Value?

I have used Mann Whitney U statistical testing to obtain the P-value

##### Why did you choose the specific statistical test?

Mann Whitney U is a non-parametric test that doesn't assume normality but does assume that the data is at least ordinal.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

#### Statement 3: Seniority and Average Salary

Statement: Senior positions have a higher average salary than mid-level positions.
- Null Hypothesis (H_0): There is no difference in average salary between senior and mid-level positions.
- Alternative Hypothesis (H_1): The average salary of senior positions is greater than the average salary of mid-level positions. (One-tailed test)

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy import stats

senior_salaries = cleaned_df[cleaned_df['Seniority'] == 'Senior']['Average_Salary'].dropna()
mid_salaries = cleaned_df[cleaned_df['Seniority'] == 'Mid']['Average_Salary'].dropna()

levene_stat, levene_p = stats.levene(senior_salaries, mid_salaries)
print(f"Levene's test statistic: {levene_stat}, p-value: {levene_p}")


if levene_p < 0.05:
    print("Variances are unequal, using Welch's t-test")
    t_statistic, p_value_3 = stats.ttest_ind(senior_salaries, mid_salaries, equal_var=False, alternative='greater')
else:
    t_statistic, p_value_3 = stats.ttest_ind(senior_salaries, mid_salaries, equal_var=True, alternative='greater')

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value_3}")

alpha = 0.05

if p_value_3 < alpha:
    print("Reject the null hypothesis.")
    print("There is evidence to support that senior positions have a higher average salary than mid-level positions.")
else:
    print("Fail to reject the null hypothesis.")
    print("There is no significant evidence to support that senior positions have a higher average salary than mid-level positions.")


##### Which statistical test have you done to obtain P-Value?

For statement 3 I have used ANOVA (Analysis of variance) to test if there’s a significant difference in average salary across seniority in job titles.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
cleaned_df.replace("-1", np.nan, inplace=True)

numerical_columns = cleaned_df.select_dtypes(include=['int64', 'float64']).columns
cleaned_df[numerical_columns] = cleaned_df[numerical_columns].fillna(cleaned_df[numerical_columns].median())

# categorical encoding with mode
categorical_columns = cleaned_df.select_dtypes(include=['object']).columns
cleaned_df[categorical_columns]= cleaned_df[categorical_columns].fillna(cleaned_df[categorical_columns].mode().iloc[0])


#### What all missing value imputation techniques have you used and why did you use those techniques?

In the dataset there were some places where the data had -1 in it, so I have replaced them with Nan.

- After that those places where there numerical values in the column like the Revenue column, I have replaced them with median value
- In place where categorical values like in Competitor, I have replaced missing values with their mode

### 2. Handling Outliers

In [None]:
# Handle outliers in 'Avg_Salary' using IQR
Q1 = cleaned_df['Average_Salary'].quantile(0.25)
Q3 = cleaned_df['Average_Salary'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

cleaned_df = cleaned_df[(cleaned_df['Average_Salary'] >= lower_bound) & (cleaned_df['Average_Salary'] <= upper_bound)]
cleaned_df['Average_Salary']

##### What all outlier treatment techniques have you used and why did you use those techniques?

Used IQR (Interquartile Range) to detect and handle outliers, because it is a very robust method and works well with this type of datasets.
- Works well for skewed dataset

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# using One hot encoding

label_encoder = LabelEncoder()
categorical_columns = ['Job Title', 'Location', 'Company Name', 'Size', 'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Headquarters', 'Competitors']
for i in categorical_columns:
  cleaned_df[i]= label_encoder.fit_transform(cleaned_df[i])

cleaned_df.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

I have used LabelEncoder from scikit-learn as a method of one hot encoding.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
%pip install contractions

In [None]:
# Expand Contraction
import contractions

cleaned_df['Job Description'] = cleaned_df['Job Description'].apply(lambda x: contractions.fix(x))
cleaned_df['Job Description']
# basically if a sentence given is like : I'll be there within 6 min.
# after contraction.fix() : I will be there within 6 min.

#### 2. Lower Casing

In [None]:
# Lower Casing
cleaned_df['Job Description'] = cleaned_df['Job Description'].str.lower()
cleaned_df['Job Description']

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
cleaned_df['Job Description'] = cleaned_df['Job Description'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
cleaned_df['Job Description']

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

# for remvoing urls
cleaned_df['Job Description'] = cleaned_df['Job Description'].apply(lambda a: re.sub(r'https\S+|www\S+|https\S+', '', a, flags=re.MULTILINE))

# digits removal
cleaned_df['Job Description'] = cleaned_df['Job Description'].apply(lambda x: re.sub(r'\d+', '', x))

# remove special word
cleaned_df['Job Description'] = cleaned_df['Job Description'].apply(lambda b: re.sub(r'\b(job|description)\b', '', b))

cleaned_df['Job Description']

#### 5. Removing Stopwords & Removing White spaces

In [None]:
#%pip install nltk

In [None]:
# Remove Stopwords
nltk.download('stopwords')

stop_words= set(stopwords.words('english'))
cleaned_df['Job Description'] = cleaned_df['Job Description'].apply(lambda x: ' '.join(i for i in x.split() if i not in stop_words))
cleaned_df['Job Description']

In [None]:
# Remove White spaces
cleaned_df['Job Description'] = cleaned_df['Job Description'].apply(lambda x: ' '.join(x.split()))
cleaned_df['Job Description']


#### 6. Rephrase Text

In [None]:
#%pip install pyspellchecker

In [None]:
# Rephrase Text
# from textblob import TextBlob
# from multiprocessing import Pool, cpu_count
# from spellchecker import SpellChecker

# def rep_text(text):
#   b = TextBlob(text)
#   return str(b.correct())

# spell = SpellChecker()

# def correct_text(t):
#   words= t.split()
#   cor_word = [spell.correction(word) or word for word in words]
#   return " ".join(cor_word)

# cleaned_df['Job Description'] = cleaned_df['Job Description'].apply(correct_text)

# with Pool(cpu_count()) as pool:
#   cleaned_df['Job Description'] = pool.map(rep_text, cleaned_df['Job Description'])


# this part was taking way too much time, so commented out


#### 7. Tokenization

In [None]:
# Tokenization
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')
cleaned_df['Job Description'] = cleaned_df['Job Description'].apply(word_tokenize)


#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')

stemmer = PorterStemmer()
lemm = WordNetLemmatizer()

cleaned_df['Stemmed_Description'] = cleaned_df['Job Description'].apply(lambda tk: [stemmer.stem(t) for t in tk])

#lemmatization
cleaned_df['Lemmatized_Description'] = cleaned_df['Job Description'].apply(lambda x: [lemm.lemmatize(i) for i in x])

##### Which text normalization technique have you used and why?

I have used PorterStemmer and WordNetLemmatizer from nltk library

In [None]:
cleaned_df['Stemmed_Description']

#### 9. Part of speech tagging

In [None]:
# POS Taging
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger_eng')
cleaned_df['POS_tags'] = cleaned_df['Job Description'].apply(pos_tag)

In [None]:
cleaned_df['POS_tags']

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import TfidfVectorizer

cleaned_df['Lemmatized_Text'] = cleaned_df['Lemmatized_Description'].apply(lambda tokens: ' '.join(tokens))
cleaned_df['Lemmatized_Text']
cleaned_df['Lemmatized_Description']
# cleaned_df['Stemmed_Description']

In [None]:
# from sklearn.feature_extraction.text import TfidfVectorizer

# tfidf_vector = TfidfVectorizer(max_features=1000)
# tfidf_matrix = tfidf_vector.fit_transform(cleaned_df['Lemmatized_Text'])
# tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vector.get_feature_names_out())
# cleaned_df = pd.concat([cleaned_df, tfidf_df], axis=1)

# # Drop the original 'Lemmatized_Description' column
# cleaned_df.drop('Lemmatized_Description', axis=1, inplace=True)


##### Which text vectorization technique have you used and why?

I have used TF-IDF (Term Frequency Inverse Document Frequency)

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features


cleaned_df['Salary Range'] = cleaned_df['Max_Salary'] - cleaned_df['Min_Salary']

# to reduce skewness
cleaned_df['Log_avg_salary'] = np.log(cleaned_df['Average_Salary'] + 1)
cleaned_df['Log_avg_salary']

In [None]:
print(cleaned_df.head())

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
corr_target = numerical_df.corr()['Average_Salary'].abs().sort_values(ascending=False)
imp_features = corr_target[1:5].index.tolist()
print(corr_target)
print("Top 10 features: ", imp_features)

# cleaned_df.head()

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

If any feature is highly skewed, we can apply transformations like Log Transform to reduce the skewness of the data, making it normally distributed

In [None]:
# Transform Your data
skewed_features = ['Min_Salary', 'Max_Salary', 'Company Age', 'Average_Salary']
for feature in skewed_features:
  cleaned_df[f'Log_{feature}'] = np.log(cleaned_df[feature] + 1)

cleaned_df['Company Age']

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
numerical_features = ['Rating', 'Salary Range', 'Log_avg_salary', 'Min_Salary', 'Max_Salary', 'Company Age', 'Log_Min_Salary', 'Log_Max_Salary', 'Log_Company Age']
cleaned_df[numerical_features] = scaler.fit_transform(cleaned_df[numerical_features])
cleaned_df[numerical_features]

# X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
# X_test[numerical_features] = scaler.transform(X_test[numerical_features])

##### Which method have you used to scale you data and why?

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
cleaned_df.head()

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
to_drop = ['Job Description', 'Stemmed_Description', 'Lemmatized_Description', 'POS_tags', 'Lemmatized_Text', 'Seniority', 'Salary Estimate']
cleaned_df = cleaned_df.drop(columns=to_drop)
X = cleaned_df.drop(['Average_Salary', 'Min_Salary', 'Max_Salary'], axis=1)
y = cleaned_df['Average_Salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape: ", X_train.shape, y_train.shape)
print("Testing set shape: ", X_test.shape)

In [None]:
X_train

In [None]:
y_train

##### What data splitting ratio have you used and why?

I have used the default 80 20 ratio to split the data into training and testing sets respectively.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:

# Compute correlation matrix
correlation_matrix = numerical_df.corr()

# Display correlation with target variable ('Avg_Salary')
correlation_with_target = correlation_matrix['Average_Salary'].abs().sort_values(ascending=False)
print("Correlation with Avg_Salary:")
print(correlation_with_target)

In [None]:
# Handling Imbalanced Dataset (If needed)
# from imblearn.over_sampling import SMOTE

# smote = SMOTE(random_state=42)
# X_train_resampled, y_train_resampled =smote.fit_resample(X_train, y_train)
# print("Resampled training set shape: ", X_train_resampled.shape)
# print(X_train_resampled)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

#### Linear Regression

In [None]:
# ML Model - 1 Implementation
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Fit the Algorithm

linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
# Predict on the model
y_pred_linear = linear_model.predict(X_test)
y_pred_linear

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
mse_linear = mean_squared_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)
print("Linear Regression - MSE: ", mse_linear)
print("Linear Regressiond - R2 Score", r2_linear)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Calculate residuals
residuals_linear = y_test - y_pred_linear

# Plot residuals
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_linear, residuals_linear, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Salary')
plt.ylabel('Residuals')
plt.title('Linear Regression: Residual Plot')
plt.show()

In [None]:
plt.figure(figsize=(10, 8))
sns.scatterplot(x=y_test, y=y_pred_linear, alpha=0.6, color='blue', label="Predicted vs Actual")
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='dashed', linewidth=2, label="Perfect Fit")
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Linear Regression: Actual vs Predicted")
plt.legend()
plt.show()

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import cross_val_score
# Fit the Algorithm
cv_sc = cross_val_score(linear_model, X_train, y_train, cv=5, scoring='r2')
print("Linear Regression - Cross Val r2 scores: ", cv_sc)
print("Linear Regression - Mean Cross Val r2 score: ", cv_sc.mean())
# Predict on the model

In [None]:
plt.figure(figsize=(8, 5))
plt.plot(range(1, 6), cv_sc, marker='o', linestyle='-', color='blue')


plt.xlabel("Fold Number")
plt.ylabel("R^2 Score")
plt.title("Cross-Validation R^2 Scores Across Folds")
plt.xticks(range(1, 6))
plt.grid(True)
plt.show()

##### Which hyperparameter optimization technique have you used and why?

Linear Regression has no hyperparameters to tune. But I have used cross validation score from scikit learn and applied 5 folds cross validation.

The model performs well overall, with R^2 scores around 0.968 to 0.976 (close to 1, indicating a strong fit).
There's some variation in scores across folds:
- The lowest score is around 0.968 (Fold 2).
- The highest score is about 0.976 (Fold 3).
- This means the model performs slightly worse in some splits but remains consistent overall.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

The actual vs predicted graph is showing a perfect graph of linear regreesion that fits our dataset without any underfitting or overfitting!

### ML Model - 2

#### Random Forest


In [None]:
# ML Model - 2
from sklearn.ensemble import RandomForestRegressor
# Fit the Algorithm
randForest = RandomForestRegressor(random_state=40)
randForest.fit(X_train, y_train)
# Predict on the model
y_pred_RF = randForest.predict(X_test)
y_pred_RF

#### It's an ensemble model that combines multiple decision trees to improve prediction accuracy and reduce overfitting.

- #### It is robust to outliers and can handle non-linear relationships.

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
mse_RF = mean_squared_error(y_test, y_pred_RF)
r2_RF = r2_score(y_test, y_pred_RF)
print('Random Forest - MSE: ', mse_RF)
print("Random Forest - R2 Score: ", r2_RF)

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_test, y=y_pred_RF, color='blue', alpha=0.6, label="Predicted vs Actual")
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='dashed', linewidth=2, label="Perfect Fit")
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Random Forest Regression: Actual vs. Predicted")
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Calculate residuals
residuals_rf = y_test - y_pred_RF

# Plot residuals
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_RF, residuals_rf, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Salary')
plt.ylabel('Residuals')
plt.title('Random Forest: Residual Plot')
plt.show()

In [None]:
# Get feature importances
importances_rf = randForest.feature_importances_
feature_names = X_train.columns

# Plot feature importances
plt.figure(figsize=(12, 6))
plt.barh(feature_names, importances_rf, color='skyblue')
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Random Forest: Feature Importance')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

cv_scores_rf = cross_val_score(randForest, X_train, y_train, cv=5, scoring='r2')
print("Random Forest - Cross-Validation R2 Scores:", cv_scores_rf)
print("Random Forest - Mean Cross-Validation R2 Score:", cv_scores_rf.mean())


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

grid_search_rf = GridSearchCV(randForest, param_grid_rf, cv=5, scoring='r2')
grid_search_rf.fit(X_train, y_train)
print("Random Forest - Best Parameters:", grid_search_rf.best_params_)
y_pred_rf_tuned = grid_search_rf.predict(X_test)
mse_rf_tuned = mean_squared_error(y_test, y_pred_rf_tuned)
r2_rf_tuned = r2_score(y_test, y_pred_rf_tuned)
print("Random Forest (Tuned) - Mean Squared Error:", mse_rf_tuned)
print("Random Forest (Tuned) - R2 Score:", r2_rf_tuned)

##### Which hyperparameter optimization technique have you used and why?

For random forest I have used GridSearch Cross Validation from scikit learn

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

Answer Here.

### ML Model - 3

#### XGBoost

In [None]:
#%pip install xgboost

In [None]:
# ML Model - 3 Implementation
from xgboost import XGBRegressor
# Fit the Algorithm
xgb_model = XGBRegressor(random_state=40)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
y_pred_xgb
# Predict on the model

In [None]:
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)
print("XGBoost - Mean Squared Error: ", mse_xgb)
print("XGBoost - R2 Score: ", r2_xgb)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_xgb, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)  # Diagonal line
plt.xlabel('Actual Salary')
plt.ylabel('Predicted Salary')
plt.title('XGBoost: Actual vs. Predicted')
plt.show()

In [None]:
# Calculate residuals
residuals_xgb = y_test - y_pred_xgb

# Plot residuals
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_xgb, residuals_xgb, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Salary')
plt.ylabel('Residuals')
plt.title('XGBoost: Residual Plot')
plt.show()

In [None]:

importances_xgb = xgb_model.feature_importances_

# Plot feature importances
plt.figure(figsize=(12, 6))
plt.barh(feature_names, importances_xgb, color='lightgreen')
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('XGBoost: Feature Importance')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

cv_scores_xgb = cross_val_score(xgb_model, X_train, y_train, cv=5, scoring='r2')
print("XGBoost - Cross-Validation R2 Scores:", cv_scores_xgb)
print("XGBoost - Mean Cross-Validation R2 Score:", cv_scores_xgb.mean())

In [None]:
param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1, 0.2]
}

# Initialize GridSearchCV
grid_search_xgb = GridSearchCV(xgb_model, param_grid_xgb, cv=5, scoring='r2')

# Fit the model
grid_search_xgb.fit(X_train, y_train)

# Best parameters
print("XGBoost - Best Parameters:", grid_search_xgb.best_params_)

# Evaluate the tuned model
y_pred_xgb_tuned = grid_search_xgb.predict(X_test)
mse_xgb_tuned = mean_squared_error(y_test, y_pred_xgb_tuned)
r2_xgb_tuned = r2_score(y_test, y_pred_xgb_tuned)
print("XGBoost (Tuned) - Mean Squared Error:", mse_xgb_tuned)
print("XGBoost (Tuned) - R2 Score:", r2_xgb_tuned)

##### Which hyperparameter optimization technique have you used and why?

I have used GridSearchCV as hyperparameter optimization for the XGBoost model

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Previously the Mean Squared Error:  0.342832715

After using GridSearchCV the MSE came down to : 0.16411464


### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For all of the three models I have used Mean Squared Error and R2 Score for evaluation.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import joblib

best_model = grid_search_xgb.best_estimator_
joblib.dump(best_model, 'best_xgb_model.pkl')
print("Best Model saved : ")

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
loadedModel = joblib.load('best_xgb_model.pkl')
print("model loaded successfully! ")

In [None]:
y_pred_test = loadedModel.predict(X_test)
X_test['Predicted Salary'] = y_pred_test
print(X_test[['Predicted Salary']].head())

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

mse_test = mean_squared_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)

print('MSE on test data: ', mse_test)
print('R2 score on test data', r2_test)

In [None]:
import matplotlib.pyplot as plt

# Plot actual vs. predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_test, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)  # Diagonal line
plt.xlabel('Actual Salary')
plt.ylabel('Predicted Salary')
plt.title('XGBoost: Actual vs. Predicted (Test Data)')
plt.show()

In [None]:
# Calculate residuals
residuals_test = y_test - y_pred_test

# Plot residuals
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_test, residuals_test, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Salary')
plt.ylabel('Residuals')
plt.title('XGBoost: Residual Plot (Test Data)')
plt.show()

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***