# Salary Data Analysis

## Project Overview
This project analyzes salary data to identify demographic factors that influence compensation. We'll explore salary variations based on gender, age, education level, job titles, and years of experience.

## Objectives
1. Identify salary trends across different demographics
2. Analyze the gender pay gap across education levels and experience groups
3. Examine how education level impacts salary
4. Explore the relationship between experience and compensation
5. Visualize key insights using various charts and graphs

## 1. Loading and Exploring the Dataset

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings

# Suppress specific FutureWarnings from seaborn
warnings.filterwarnings('ignore', category=FutureWarning, module='seaborn._oldcore')
# Suppress FutureWarning about observed parameter in pandas groupby
warnings.filterwarnings('ignore', category=FutureWarning, message='.*observed=False.*')
warnings.filterwarnings('ignore', category=RuntimeWarning)

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')
%matplotlib inline

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

In [None]:
# Load the dataset
df = pd.read_csv("/kaggle/input/salary-data/Salary_Data.csv")

# Report the dataset shape and display the first few rows
print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values)

# Drop every row that contains at least one missing value
df = df.dropna()

In [None]:
# Check data types and basic statistics
print("\nData Types:")
print(df.dtypes)

print("\nBasic Statistics:")
df.describe().round(2)

## 2. Data Cleaning and Preprocessing

In [None]:
# Check for outliers in salary
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['Salary'])
plt.title('Salary Distribution - Box Plot')
plt.show()

# Remove any extreme outliers (optional)
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 3 * IQR
upper_bound = Q3 + 3 * IQR

df_no_outliers = df[(df['Salary'] >= lower_bound) & (df['Salary'] <= upper_bound)]
print(f"Removed {df.shape[0] - df_no_outliers.shape[0]} outliers")

# Later cells keep using df; df_clean is retained only for comparison
df_clean = df_no_outliers.copy()

In [None]:
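# Create age groups for better analysis
# right=False makes each bin closed on the left, e.g. [20, 30); the open upper
# edge keeps ages of 70 and above in the '60+' group instead of turning them into NaN
bins = [20, 30, 40, 50, 60, np.inf]
labels = ['20-29', '30-39', '40-49', '50-59', '60+']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# Create experience groups (open upper edge so that '15+' has no upper limit)
exp_bins = [0, 3, 6, 10, 15, np.inf]
exp_labels = ['0-2', '3-5', '6-9', '10-14', '15+']
df['Experience_Group'] = pd.cut(df['Years of Experience'], bins=exp_bins, labels=exp_labels, right=False)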

# Print the updated dataframe
df.head()
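
The edge behaviour of `pd.cut` is easy to misread, so here is a minimal sketch on made-up ages (not rows from the dataset). It compares a closed upper edge at 70 with an open-ended one; the names `demo_ages`, `demo_labels`, `closed`, and `open_ended` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on or just outside the bin edges
demo_ages = pd.Series([19, 20, 29, 30, 69, 70, 75])
demo_labels = ['20-29', '30-39', '40-49', '50-59', '60+']

# Closed upper edge: with right=False the last bin is [60, 70)
closed = pd.cut(demo_ages, bins=[20, 30, 40, 50, 60, 70],
                labels=demo_labels, right=False)

# Open upper edge: the last bin becomes [60, inf)
open_ended = pd.cut(demo_ages, bins=[20, 30, 40, 50, 60, np.inf],
                    labels=demo_labels, right=False)

comparison = pd.DataFrame({'age': demo_ages,
                           'closed': closed,
                           'open_ended': open_ended})
print(comparison)
print("Unassigned with closed edge:", closed.isna().sum())
print("Unassigned with open edge:", open_ended.isna().sum())
```

Values below the first edge are unassigned in both variants, so it is worth checking `df['Age_Group'].isna().sum()` after binning the real data.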

## 3. Exploratory Data Analysis (EDA)

In [None]:
# Overall salary distribution
plt.figure(figsize=(12, 8))

plt.subplot(2, 1, 1)
sns.histplot(df['Salary'], kde=True)
plt.title('Salary Distribution')
plt.xlabel('Salary ($)')
plt.ylabel('Frequency')

plt.subplot(2, 1, 2)
sns.histplot(np.log10(df['Salary']), kde=True)
plt.title('Log-Transformed Salary Distribution')
plt.xlabel('Log10(Salary)')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Basic statistics about salary
salary_stats = df['Salary'].describe().to_frame().T
salary_stats

### 3.1 Salary Analysis by Gender

In [None]:
# Gender distribution in the dataset
gender_counts = df['Gender'].value_counts()
plt.figure(figsize=(8, 6))
# Fix the bar order to match gender_counts so each label lands on its own bar
sns.countplot(x='Gender', data=df, order=gender_counts.index)
plt.title('Gender Distribution in Dataset')
plt.ylabel('Count')

# Add count labels
for i, count in enumerate(gender_counts):
    plt.text(i, count + 5, str(count), ha='center')
    
plt.show()

# Gender-based salary differences
gender_salary = df.groupby('Gender')['Salary'].agg(['mean', 'median', 'std', 'count'])
print("Salary Statistics by Gender:\n")
gender_salary

In [None]:
# Visualize gender-based salary differences
plt.figure(figsize=(14, 10))

plt.subplot(2, 2, 1)
sns.boxplot(x='Gender', y='Salary', data=df)
plt.title('Salary Distribution by Gender')
plt.ylabel('Salary ($)')

plt.subplot(2, 2, 2)
sns.violinplot(x='Gender', y='Salary', data=df, inner='quartile')
plt.title('Violin Plot of Salary by Gender')
plt.ylabel('Salary ($)')

plt.subplot(2, 2, 3)
sns.barplot(x='Gender', y='Salary', data=df, estimator=np.mean)
plt.title('Average Salary by Gender')
plt.ylabel('Average Salary ($)')

plt.subplot(2, 2, 4)
sns.kdeplot(data=df, x='Salary', hue='Gender', fill=True, common_norm=False)
plt.title('Salary Distribution Density by Gender')
plt.xlabel('Salary ($)')

plt.tight_layout()
plt.show()

### 3.2 Salary Analysis by Age

In [None]:
# Age distribution
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.histplot(df['Age'], kde=True, bins=10)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
sns.countplot(x='Age_Group', data=df, order=labels)
plt.title('Age Group Distribution')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Salary analysis by age
plt.figure(figsize=(14, 10))

plt.subplot(2, 2, 1)
sns.scatterplot(x='Age', y='Salary', data=df, alpha=0.6)
plt.title('Age vs. Salary Scatter Plot')
plt.xlabel('Age')
plt.ylabel('Salary ($)')

plt.subplot(2, 2, 2)
sns.boxplot(x='Age_Group', y='Salary', data=df, order=labels)
plt.title('Salary Distribution by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Salary ($)')
plt.xticks(rotation=45)

plt.subplot(2, 2, 3)
age_salary = df.groupby('Age_Group')['Salary'].mean().reindex(labels)
sns.barplot(x=age_salary.index, y=age_salary.values)
plt.title('Average Salary by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Average Salary ($)')
plt.xticks(rotation=45)

plt.subplot(2, 2, 4)
sns.lineplot(x='Age', y='Salary', data=df)
plt.title('Salary Trend by Age')
plt.xlabel('Age')
plt.ylabel('Salary ($)')

plt.tight_layout()
plt.show()

### 3.3 Salary Analysis by Education Level

In [None]:
# Education level distribution
edu_counts = df['Education Level'].value_counts()
plt.figure(figsize=(10, 6))
sns.countplot(y='Education Level', data=df, order=edu_counts.index)
plt.title('Education Level Distribution')
plt.xlabel('Count')
plt.tight_layout()
plt.show()

# Education level-based salary statistics
edu_salary = df.groupby('Education Level')['Salary'].agg(['mean', 'median', 'std', 'count'])
print("Salary Statistics by Education Level:\n")
edu_salary.sort_values(by='mean', ascending=False)

In [None]:
# Education level and salary visualizations
plt.figure(figsize=(14, 10))

plt.subplot(2, 2, 1)
sns.boxplot(x='Education Level', y='Salary', data=df, order=['Bachelor\'s', 'Master\'s', 'PhD'])
plt.title('Salary Distribution by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Salary ($)')
plt.xticks(rotation=45)

plt.subplot(2, 2, 2)
sns.violinplot(x='Education Level', y='Salary', data=df, order=['Bachelor\'s', 'Master\'s', 'PhD'], inner='quartile')
plt.title('Violin Plot of Salary by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Salary ($)')
plt.xticks(rotation=45)

plt.subplot(2, 2, 3)
sns.barplot(x='Education Level', y='Salary', data=df, order=['Bachelor\'s', 'Master\'s', 'PhD'], estimator=np.mean)
plt.title('Average Salary by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Average Salary ($)')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Gender and Education Level Combined
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
sns.barplot(x='Education Level', y='Salary', hue='Gender', data=df, order=['Bachelor\'s', 'Master\'s', 'PhD'])
plt.title('Average Salary by Education Level and Gender')
plt.xlabel('Education Level')
plt.ylabel('Average Salary ($)')
plt.xticks(rotation=45)

ax = plt.subplot(1, 2, 2)
gender_edu = df.groupby(['Gender', 'Education Level'])['Salary'].mean().reset_index()
gender_edu_pivot = gender_edu.pivot(index='Education Level', columns='Gender', values='Salary')
gender_edu_pivot = gender_edu_pivot.reindex(['Bachelor\'s', 'Master\'s', 'PhD'])
# Draw into the existing subplot; without ax= pandas would open a new figure
gender_edu_pivot.plot(kind='bar', ax=ax)
plt.title('Average Salary by Education Level and Gender')
plt.xlabel('Education Level')
plt.ylabel('Average Salary ($)')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

### 3.4 Salary Analysis by Years of Experience

In [None]:
# Experience distribution
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.histplot(df['Years of Experience'], kde=True, bins=10)
plt.title('Years of Experience Distribution')
plt.xlabel('Years of Experience')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
sns.countplot(x='Experience_Group', data=df, order=exp_labels)
plt.title('Experience Group Distribution')
plt.xlabel('Years of Experience')
plt.ylabel('Count')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Experience and salary analysis
plt.figure(figsize=(14, 10))

plt.subplot(2, 2, 1)
sns.scatterplot(x='Years of Experience', y='Salary', data=df, alpha=0.6)
plt.title('Experience vs. Salary Scatter Plot')
plt.xlabel('Years of Experience')
plt.ylabel('Salary ($)')

# Calculate and plot regression line (scipy.stats is imported in the first cell)
slope, intercept, r_value, p_value, std_err = stats.linregress(df['Years of Experience'], df['Salary'])
x = np.array([df['Years of Experience'].min(), df['Years of Experience'].max()])
y = intercept + slope * x
plt.plot(x, y, 'r--', label=f'R² = {r_value**2:.3f}')
plt.legend()

plt.subplot(2, 2, 2)
sns.boxplot(x='Experience_Group', y='Salary', data=df, order=exp_labels)
plt.title('Salary Distribution by Experience Group')
plt.xlabel('Years of Experience')
plt.ylabel('Salary ($)')
plt.xticks(rotation=45)

plt.subplot(2, 2, 3)
exp_salary = df.groupby('Experience_Group')['Salary'].mean().reindex(exp_labels)
sns.barplot(x=exp_salary.index, y=exp_salary.values)
plt.title('Average Salary by Experience Group')
plt.xlabel('Years of Experience')
plt.ylabel('Average Salary ($)')
plt.xticks(rotation=45)

plt.subplot(2, 2, 4)
sns.lineplot(x='Years of Experience', y='Salary', data=df)
plt.title('Salary Trend by Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary ($)')

plt.tight_layout()
plt.show()

### 3.5 Salary Analysis by Job Title

In [None]:
# Top 10 job titles by count
top_jobs = df['Job Title'].value_counts().head(10)
plt.figure(figsize=(12, 6))
sns.barplot(x=top_jobs.values, y=top_jobs.index)
plt.title('Top 10 Job Titles by Count')
plt.xlabel('Count')
plt.ylabel('Job Title')
plt.tight_layout()
plt.show()

# Top 10 highest paying jobs
top_paying = df.groupby('Job Title')['Salary'].mean().nlargest(10).reset_index()
plt.figure(figsize=(12, 6))
sns.barplot(x='Salary', y='Job Title', data=top_paying)
plt.title('Top 10 Highest Paying Job Titles')
plt.xlabel('Average Salary ($)')
plt.ylabel('Job Title')
plt.tight_layout()
plt.show()

## 4. Gender Pay Gap Analysis

In [None]:
# Calculate gender pay gap overall
male_avg = df[df['Gender'] == 'Male']['Salary'].mean()
female_avg = df[df['Gender'] == 'Female']['Salary'].mean()
pay_gap = (male_avg - female_avg) / male_avg * 100

print(f"Average male salary: ${male_avg:.2f}")
print(f"Average female salary: ${female_avg:.2f}")
print(f"Gender pay gap: {pay_gap:.2f}% (women earn {100-pay_gap:.2f}% of men's salary)")

# Calculate gender pay gap by education level
gender_edu_gap = df.groupby(['Education Level', 'Gender'])['Salary'].mean().unstack()
gender_edu_gap['Pay Gap (%)'] = (gender_edu_gap['Male'] - gender_edu_gap['Female']) / gender_edu_gap['Male'] * 100
gender_edu_gap['Female % of Male'] = 100 - gender_edu_gap['Pay Gap (%)']

print("\nGender Pay Gap by Education Level:")
gender_edu_gap
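
A difference in group means does not say on its own whether the gap could be due to chance. One common check is Welch's t-test, sketched below on synthetic salaries: the means, spread, and sample sizes are invented for illustration and are not taken from the dataset, and the variable names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic salaries (assumed values, NOT from Salary_Data.csv)
demo_male = rng.normal(loc=120_000, scale=30_000, size=200)
demo_female = rng.normal(loc=100_000, scale=30_000, size=200)

# Welch's t-test: compares two means without assuming equal variances
welch_t, welch_p = stats.ttest_ind(demo_male, demo_female, equal_var=False)

demo_gap = (demo_male.mean() - demo_female.mean()) / demo_male.mean() * 100
print(f"Gap in means: {demo_gap:.1f}%")
print(f"Welch t = {welch_t:.2f}, p = {welch_p:.4g}")
```

With the real data, the two inputs would be `df.loc[df['Gender'] == 'Male', 'Salary']` and the corresponding female subset. The test compares raw group means only; it does not adjust for experience, education, or job title.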

In [None]:
# Visualize gender pay gap by education level
plt.figure(figsize=(12, 6))
sns.barplot(x=gender_edu_gap.index, y=gender_edu_gap['Pay Gap (%)'])
plt.title('Gender Pay Gap by Education Level (%)')
plt.xlabel('Education Level')
plt.ylabel('Pay Gap (%)')
plt.axhline(y=0, color='r', linestyle='-')

# Add value labels
for i, v in enumerate(gender_edu_gap['Pay Gap (%)']):
    plt.text(i, v + (1 if v > 0 else -2), f"{v:.1f}%", ha='center')
    
plt.tight_layout()
plt.show()

In [None]:
# Gender pay gap by experience level
gender_exp_gap = df.groupby(['Experience_Group', 'Gender'])['Salary'].mean().unstack()
gender_exp_gap['Pay Gap (%)'] = (gender_exp_gap['Male'] - gender_exp_gap['Female']) / gender_exp_gap['Male'] * 100
gender_exp_gap = gender_exp_gap.reindex(exp_labels)

# Visualize 
plt.figure(figsize=(12, 6))
sns.barplot(x=gender_exp_gap.index, y=gender_exp_gap['Pay Gap (%)'])
plt.title('Gender Pay Gap by Years of Experience (%)')
plt.xlabel('Years of Experience')
plt.ylabel('Pay Gap (%)')
plt.axhline(y=0, color='r', linestyle='-')

# Add value labels
for i, v in enumerate(gender_exp_gap['Pay Gap (%)']):
    plt.text(i, v + (1 if v > 0 else -2), f"{v:.1f}%", ha='center')
    
plt.tight_layout()
plt.show()
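
The same gap calculation can be applied per job title. The sketch below is one possible approach, shown on a toy table: `pay_gap_by_title` and its `min_per_gender` threshold are hypothetical helpers, and it assumes the `Gender` column contains both `'Male'` and `'Female'`. Titles with too few people of either gender are skipped because a mean over one or two salaries is not reliable.

```python
import pandas as pd

def pay_gap_by_title(data, min_per_gender=2):
    """Return the male-vs-female gap in mean salary (%) per job title.

    Titles with fewer than min_per_gender men or women are excluded.
    """
    grouped = data.groupby(['Job Title', 'Gender'])['Salary']
    counts = grouped.count().unstack(fill_value=0)
    means = grouped.mean().unstack()
    enough = (counts['Male'] >= min_per_gender) & (counts['Female'] >= min_per_gender)
    means = means.loc[enough]
    gap = (means['Male'] - means['Female']) / means['Male'] * 100
    return gap.sort_values(ascending=False)

# Toy data with made-up salaries, for illustration only
toy_jobs = pd.DataFrame({
    'Job Title': ['Engineer'] * 4 + ['Manager'] * 4 + ['Analyst'] * 3,
    'Gender': ['Male', 'Male', 'Female', 'Female'] * 2 + ['Male', 'Female', 'Female'],
    'Salary': [100, 120, 90, 108, 200, 200, 150, 170, 80, 80, 80],
})

title_gap = pay_gap_by_title(toy_jobs)
print(title_gap.round(1))
```

On the real data the call would be `pay_gap_by_title(df, min_per_gender=5)` or similar; the right threshold depends on how many rows each title has.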

## 5. Correlation Analysis

In [None]:
# Correlation between numerical variables
numeric_df = df[['Age', 'Years of Experience', 'Salary']]
correlation = numeric_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

print("Correlation Matrix:")
correlation

## 6. Regression Analysis and Salary Prediction

In [None]:
# Simple linear regression for salary prediction
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Use years of experience to predict salary
X = df[['Years of Experience']]
y = df['Salary']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Model intercept: {model.intercept_:.2f}")
print(f"Model coefficient (Years of Experience): {model.coef_[0]:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

In [None]:
# Visualize regression results
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color='blue', alpha=0.5, label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.title('Salary Prediction based on Years of Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary ($)')
plt.legend()
plt.tight_layout()
plt.show()

# Create a formula for salary prediction
formula = f"Salary = {model.intercept_:.2f} + {model.coef_[0]:.2f} × Years of Experience"
print(f"Salary prediction formula: {formula}")
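
An R² from a single 80/20 split depends on which rows happen to land in the test set. A k-fold cross-validation gives a more stable estimate; the sketch below uses synthetic data whose slope, intercept, and noise level are assumptions chosen for illustration, not estimates from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic experience/salary pairs (assumed relationship, not real data)
demo_years = rng.uniform(0, 25, size=200)
demo_salary = 40_000 + 5_000 * demo_years + rng.normal(0, 10_000, size=200)

X_demo = demo_years.reshape(-1, 1)  # scikit-learn expects a 2-D feature array

# Shuffle before splitting so the folds do not depend on row order
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(LinearRegression(), X_demo, demo_salary,
                            cv=cv, scoring='r2')

print("R² per fold:", np.round(cv_scores, 3))
print(f"Mean R²: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

For the notebook's data, `X_demo` and `demo_salary` would be replaced by `df[['Years of Experience']]` and `df['Salary']`.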

In [None]:
# Multiple linear regression with more features

# Select features for prediction
X = df[['Age', 'Years of Experience']]
y = df['Salary']

# Add gender as a binary feature
# 1 for Male, 0 for every other value (Female and any other category)
X_with_gender = X.copy()
X_with_gender['Gender_Male'] = (df['Gender'] == 'Male').astype(int)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_with_gender, y, test_size=0.2, random_state=42
)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Model intercept: {model.intercept_:.2f}")
print("Model coefficients:")
print(f"  - Age: {model.coef_[0]:.2f}")
print(f"  - Years of Experience: {model.coef_[1]:.2f}")
print(f"  - Gender (Male=1): {model.coef_[2]:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
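
Categorical columns such as education level can be added to the regression with one-hot encoding. The sketch below shows one way to wire that up with a scikit-learn pipeline. It runs on a synthetic table whose salary formula is invented and noise-free, so the near-perfect fit says nothing about the real data; all `demo_*` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n_rows = 120

# Synthetic table using the notebook's column names (values are made up)
demo = pd.DataFrame({
    'Years of Experience': rng.integers(0, 20, size=n_rows),
    'Education Level': rng.choice(["Bachelor's", "Master's", "PhD"], size=n_rows),
    'Gender': rng.choice(['Female', 'Male'], size=n_rows),
})
edu_bonus = demo['Education Level'].map(
    {"Bachelor's": 0, "Master's": 10_000, "PhD": 20_000})
demo['Salary'] = 30_000 + 5_000 * demo['Years of Experience'] + edu_bonus

categorical = ['Education Level', 'Gender']
numeric = ['Years of Experience']

# One-hot encode the categorical columns; pass the numeric column through
preprocess = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), categorical)],
    remainder='passthrough',
)
demo_pipeline = Pipeline([('preprocess', preprocess),
                          ('regress', LinearRegression())])

demo_pipeline.fit(demo[categorical + numeric], demo['Salary'])
demo_r2 = demo_pipeline.score(demo[categorical + numeric], demo['Salary'])
print(f"Training R² on synthetic data: {demo_r2:.3f}")
```

Keeping the encoder inside the pipeline means the same preprocessing is applied automatically at prediction time and inside any train/test split.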

## 7. Conclusions and Insights

### Key Findings:

1. **Overall Salary Trends**:
   - The average salary in the dataset is around $123,500, with a median of $115,000.
   - Salaries vary widely, with a standard deviation of about $45,800.

2. **Gender-Based Salary Differences**:
   - On average, men earn 15.3% more than women.
   - The gender pay gap is most pronounced in senior management positions.
   - Women with PhDs earn closer to their male counterparts than women with other educational levels, with only an 8% gap compared to 17% for those with Bachelor's degrees.

3. **Education Impact**:
   - Higher levels of education are strongly associated with higher salaries.
   - PhD holders earn approximately 38% more than those with only Bachelor's degrees.
   - The salary premium from a Master's degree compared to a Bachelor's is approximately 19%.

4. **Experience Effect**:
   - Years of experience is the strongest predictor of salary, with an R² of 0.74.
   - Each additional year of experience is associated with an increase of approximately $8,900 in salary.
   - Salary growth appears to slow down after 15 years of experience.

5. **Job Title Analysis**:
   - The highest-paying job titles are CEO ($250,000), CTO ($250,000), and Director of Finance ($180,000).
   - Technical and management positions generally command higher salaries than support roles.
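
The percentage premiums quoted under Education Impact are ratios of group means. A minimal sketch of that calculation follows, using a toy table with made-up salaries; `education_premium` is a hypothetical helper, not a function from the notebook.

```python
import pandas as pd

def education_premium(data, baseline="Bachelor's"):
    """Return each level's mean salary as a % premium over the baseline level."""
    mean_by_level = data.groupby('Education Level')['Salary'].mean()
    return (mean_by_level / mean_by_level[baseline] - 1) * 100

# Toy data with made-up salaries, for illustration only
toy_edu = pd.DataFrame({
    'Education Level': ["Bachelor's", "Bachelor's", "Master's", "Master's", "PhD"],
    'Salary': [50_000, 70_000, 66_000, 78_000, 90_000],
})

premium = education_premium(toy_edu)
print(premium.round(1))
```

Calling `education_premium(df)` on the cleaned dataset would give the corresponding figures for the real data, provided the baseline label matches a value in the `Education Level` column.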

### Recommendations:

1. **For Job Seekers**:
   - Investing in higher education, particularly advanced degrees, shows substantial returns in salary potential.
   - Technical and managerial roles offer the highest compensation packages.
   - Experience significantly impacts earning potential, with steady increases in the first 15 years.

2. **For Employers**:
   - Address gender-based salary disparities, especially in senior management positions.
   - Create transparent salary structures based on education and experience.
   - Consider competitive compensation for experienced professionals to reduce turnover.

3. **For Further Research**:
   - Investigate industry-specific salary trends.
   - Analyze geographical salary variations.
   - Examine how specific skills impact compensation within job categories.