# ECON 0150 | Replication Notebook

**Title:** Gender Wage Gap

**Original Authors:** Charley Wan and Asliddin Nurboev

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis. You can run this notebook yourself to explore the data, reproduce the findings, and try the extension exercises at the end.

## About This Replication

**Research Question:** Does the gender wage gap change when controlling for job title?

**Data Source:** Glassdoor Gender Pay Gap dataset (1,000 employees across 10 job titles)

**Methods:** OLS regression with and without categorical control variables

**Main Finding:** The raw gender pay gap (males earning about 9.5% more in base pay) shrinks to a statistically insignificant -0.3% once job title is controlled for, suggesting that occupational sorting, rather than unequal pay within the same job, explains the gap in this dataset.

**Course Concepts Used:**
- OLS regression
- Categorical control variables
- Log transformations
- Residual analysis
- Omitted variable bias

---
## Step 0 | Setup

First, we import the necessary libraries and load the data.

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
data_url = 'https://tayweid.github.io/econ-0150/projects/replications/0037/data/glassdoor_gender_pay_gap.csv'
data = pd.read_csv(data_url)

# Preview the data
data.head()

In [None]:
# Check the shape and columns
print(f"Dataset has {len(data)} rows and {len(data.columns)} columns")
print(f"\nColumns: {list(data.columns)}")

---
## Step 1 | Data Exploration

Before analyzing the data, we explore its structure and key variables.

In [None]:
# Summary statistics
data.describe()

In [None]:
# Job title distribution
data['JobTitle'].value_counts()

In [None]:
# Gender counts by job title
gender_jobtitle_counts = data.groupby(['JobTitle', 'Gender']).size().unstack(fill_value=0)
gender_jobtitle_counts.columns.name = None
gender_jobtitle_counts = gender_jobtitle_counts.rename(columns={'Female': 'Female Count', 'Male': 'Male Count'})

# Get total job title counts
total_job_counts_df = data['JobTitle'].value_counts().reset_index()
total_job_counts_df.columns = ['JobTitle', 'Total Count']
total_job_counts_df = total_job_counts_df.set_index('JobTitle')

# Merge and display
combined_job_counts = gender_jobtitle_counts.merge(total_job_counts_df, left_index=True, right_index=True)
combined_job_counts.sort_values(by='Total Count', ascending=True)
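
The same kind of table can also be built in one step with `pd.crosstab`. The sketch below is only an illustration: `toy` is a small made-up DataFrame that borrows the notebook's column names, not the Glassdoor data.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` (hypothetical rows, same column names)
toy = pd.DataFrame({
    'JobTitle': ['Manager', 'Manager', 'Driver', 'Driver', 'Driver'],
    'Gender':   ['Female', 'Male', 'Male', 'Male', 'Female'],
})

# Rows: job titles; columns: gender counts; margins=True adds row and column totals
counts = pd.crosstab(toy['JobTitle'], toy['Gender'],
                     margins=True, margins_name='Total Count')

# Keep the per-job totals column, drop the grand-total row
counts = counts.drop(index='Total Count')
print(counts.sort_values('Total Count'))
```

Swapping `toy` for `data` should give the same counts as the merge-based cell above, assuming the `Gender` column contains only `Female` and `Male`.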

In [None]:
# Visualize job title distribution
job_title_counts = data['JobTitle'].value_counts()
sns.countplot(data=data, y='JobTitle', hue='JobTitle', order=job_title_counts.index, legend=False)
plt.title('Distribution of Job Titles')
plt.show()

In [None]:
# Gender distribution
gender_counts = data['Gender'].value_counts()
print(gender_counts)

plt.figure(figsize=(6, 4))
sns.countplot(data=data, x='Gender', hue='Gender', palette='viridis', legend=False)
plt.title('Distribution of Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()

---
## Step 2 | Data Preparation

We prepare the data for analysis by creating new variables.

In [None]:
# Create log of BasePay and binary MALE indicator
data['LogBasePay'] = np.log(data['BasePay'])
data['MALE'] = data['Gender'].map({'Female': 0, 'Male': 1})

data.head()
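
Two things can go wrong silently in this step: `np.log` is not finite for zero or negative pay, and `.map` returns `NaN` for any label missing from the dictionary. A minimal sanity-check sketch, again on a made-up `toy` DataFrame rather than the real file:

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the two columns used in this step
toy = pd.DataFrame({
    'Gender':  ['Female', 'Male', 'Male', 'Female'],
    'BasePay': [52000, 61000, 48000, 75000],
})

toy['LogBasePay'] = np.log(toy['BasePay'])
toy['MALE'] = toy['Gender'].map({'Female': 0, 'Male': 1})

# np.log needs strictly positive pay; zero gives -inf and negatives give NaN
all_positive = bool((toy['BasePay'] > 0).all())

# .map leaves NaN for labels not in the dictionary (e.g. 'male' or 'F')
n_unmapped = int(toy['MALE'].isna().sum())

print(f"All pay values positive: {all_positive}")
print(f"Unmapped gender labels:  {n_unmapped}")
```

Running the same two checks on `data` before fitting any model is a cheap way to confirm that no rows will be dropped or distorted.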

---
## Step 3 | Visualization

We visualize the key relationships in the data.

In [None]:
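# Whole population: BasePay by Gender
gender_order = ['Female', 'Male']  # fix the category order so the mean line matches the boxes
mean_basepay_by_gender = data.groupby('Gender')['BasePay'].mean().reindex(gender_order)
plt.figure(figsize=(10, 7))

sns.boxplot(x='Gender', y='BasePay', data=data, hue='Gender', order=gender_order,
            hue_order=gender_order, palette='viridis', legend=False)
sns.stripplot(x='Gender', y='BasePay', data=data, order=gender_order, color='red', alpha=0.6, jitter=0.2)

# Connect the two group means; x positions 0 and 1 follow gender_order
plt.plot(range(len(gender_order)), mean_basepay_by_gender.values,
         color='yellow', linestyle='-', linewidth=2, label='Mean BasePay Trend', zorder=10)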

plt.title('Glassdoor BasePay Gender Disparity - Whole Population', fontsize=20)
plt.xlabel('Gender', fontsize=16)
plt.ylabel('BasePay', fontsize=16)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.legend()
plt.show()

In [None]:
# BasePay by Gender WITHIN each Job Title
mean_basepay_by_job_gender = data.groupby(['JobTitle', 'Gender'])['BasePay'].mean().reset_index()

plt.figure(figsize=(18, 9))

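gender_order = ['Female', 'Male']  # fixed hue order so the mean lines below line up with the boxes
ax = sns.boxplot(x='JobTitle', y='BasePay', hue='Gender', hue_order=gender_order, data=data,
                 palette='viridis', zorder=2)
# Both hue levels are drawn in red; the list palette avoids mixing color= with hue=
sns.stripplot(x='JobTitle', y='BasePay', hue='Gender', hue_order=gender_order, data=data,
              palette=['red', 'red'], alpha=0.6, jitter=0.2, dodge=True, zorder=3, legend=False)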

# Overlay mean lines
job_title_categories = data['JobTitle'].unique()
mean_basepay_pivot = mean_basepay_by_job_gender.pivot(index='JobTitle', columns='Gender', values='BasePay')
mean_basepay_pivot = mean_basepay_pivot.reindex(job_title_categories)

for i, job_title in enumerate(mean_basepay_pivot.index):
    female_mean = mean_basepay_pivot.loc[job_title, 'Female']
    male_mean = mean_basepay_pivot.loc[job_title, 'Male']
    x_female_pos = i - 0.2
    x_male_pos = i + 0.2
    plt.plot([x_female_pos, x_male_pos], [female_mean, male_mean],
             color='yellow', linestyle='-', linewidth=2, 
             label='Mean BasePay Trend' if i == 0 else "", zorder=10)

plt.title('Distribution of BasePay by Gender Across All Job Titles', fontsize=20)
plt.xlabel('Job Title', fontsize=16)
plt.ylabel('BasePay', fontsize=16)
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

---
## Step 4 | Statistical Analysis

We run two regression models to answer the research question:
1. **Uncontrolled Model:** LogBasePay ~ MALE
2. **Controlled Model:** LogBasePay ~ MALE + JobTitle
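
Written out, the two specifications look roughly as follows. The symbols $\beta$, $\gamma_j$, and $\varepsilon_i$ are notation introduced here for illustration, and the sum assumes that one of the 10 job titles is left out as the reference category, which is how a formula interface typically encodes a categorical variable:

$$
\log(\text{BasePay}_i) = \beta_0 + \beta_1\,\text{MALE}_i + \varepsilon_i
$$

$$
\log(\text{BasePay}_i) = \beta_0 + \beta_1\,\text{MALE}_i + \sum_{j=2}^{10} \gamma_j\,\mathbf{1}[\text{JobTitle}_i = j] + \varepsilon_i
$$

In both models, $\beta_1$ is the gender gap in log points. The research question is whether $\beta_1$ changes once the $\gamma_j$ terms absorb pay differences across job titles.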

### Uncontrolled Model

First, we examine the raw gender pay gap without any controls.

In [None]:
Uncontrolled_Model = smf.ols('LogBasePay ~ MALE', data=data).fit()
print(Uncontrolled_Model.summary().tables[1])
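
With a single 0/1 regressor, the OLS slope is simply the difference between the two group means of the outcome, and the intercept is the mean for the group coded 0. A minimal sketch on synthetic data (not the Glassdoor file), using plain NumPy least squares in place of `smf.ols`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 people, log pay with a built-in gap of 0.10 log points
male = rng.integers(0, 2, size=200)
log_pay = 11.0 + 0.10 * male + rng.normal(0, 0.25, size=200)

# OLS of log_pay on a constant and the MALE dummy
X = np.column_stack([np.ones(len(male)), male])
intercept, slope = np.linalg.lstsq(X, log_pay, rcond=None)[0]

# Difference in group means of log pay
mean_gap = log_pay[male == 1].mean() - log_pay[male == 0].mean()

print(f"OLS slope:           {slope:.6f}")
print(f"Difference in means: {mean_gap:.6f}")
```

The two printed numbers should agree up to floating-point error, which is why the MALE coefficient in the uncontrolled model can be read as the raw gap in average log pay.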

In [None]:
# Residual plot for uncontrolled model
residuals_uncontrolled = Uncontrolled_Model.resid
fitted_values_uncontrolled = Uncontrolled_Model.fittedvalues

plt.figure(figsize=(10, 6))
sns.residplot(x=fitted_values_uncontrolled, y=residuals_uncontrolled, lowess=True, 
              line_kws={'color': 'red', 'lw': 1})
plt.title('Residual Plot for Uncontrolled Model (LogBasePay ~ MALE)', fontsize=16)
plt.xlabel('Fitted Values (LogBasePay)', fontsize=12)
plt.ylabel('Residuals', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7)
plt.axhline(y=0, color='grey', linestyle='--', linewidth=0.8)
plt.show()
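
A note on reading this plot: because MALE is the only regressor, the model can predict only two values, one per gender, so the residuals line up in two vertical strips. The sketch below illustrates this on synthetic data (not the Glassdoor file); all variable names are local to the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic log pay with one binary regressor
male = rng.integers(0, 2, size=100)
log_pay = 11.0 + 0.10 * male + rng.normal(0, 0.25, size=100)

# Fit by least squares and compute fitted values and residuals
X = np.column_stack([np.ones(len(male)), male])
beta = np.linalg.lstsq(X, log_pay, rcond=None)[0]
fitted = X @ beta
resid = log_pay - fitted

# Only two distinct fitted values exist: one for each gender
n_distinct = len(np.unique(np.round(fitted, 10)))
print(f"Distinct fitted values: {n_distinct}")

# Residuals average to zero within each group by construction
print(f"Mean residual (female): {resid[male == 0].mean():.2e}")
print(f"Mean residual (male):   {resid[male == 1].mean():.2e}")
```

In a plot like this, the useful comparison is the spread of residuals in the two strips; the smoothed line between them carries little information about functional form.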

### Controlled Model

Now we add JobTitle as a categorical control variable.

In [None]:
Controlled_Model = smf.ols('LogBasePay ~ MALE + JobTitle', data=data).fit()
print(Controlled_Model.summary().tables[1])
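
One way to read the controlled MALE coefficient is as a within-job comparison: it equals the slope obtained after subtracting each job title's average from both log pay and the MALE indicator. A minimal sketch on synthetic data with three hypothetical jobs (not the Glassdoor file), using NumPy least squares in place of `smf.ols`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300

# Synthetic data: men are more likely to hold the higher-paid jobs,
# but pay depends on the job only (no direct gender effect)
job = rng.integers(0, 3, size=n)
male = (rng.random(n) < 0.2 + 0.3 * job).astype(float)
log_pay = 10.5 + 0.4 * job + rng.normal(0, 0.2, size=n)
df = pd.DataFrame({'job': job, 'male': male, 'log_pay': log_pay})

# (a) Regression on a constant, MALE, and job dummies (first job is the reference)
dummies = pd.get_dummies(df['job'], prefix='job', drop_first=True, dtype=float)
X = np.column_stack([np.ones(n), df['male'].to_numpy(), dummies.to_numpy()])
coef_controlled = np.linalg.lstsq(X, df['log_pay'].to_numpy(), rcond=None)[0][1]

# (b) Same coefficient from variables demeaned within each job
male_within = df['male'] - df.groupby('job')['male'].transform('mean')
pay_within = df['log_pay'] - df.groupby('job')['log_pay'].transform('mean')
coef_within = (male_within * pay_within).sum() / (male_within ** 2).sum()

print(f"Coefficient with job dummies:  {coef_controlled:.6f}")
print(f"Coefficient after demeaning:   {coef_within:.6f}")
```

The two estimates should match up to floating-point error, which is the sense in which the controlled model compares men and women who share a job title.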

In [None]:
# Residual plot for controlled model
residuals_controlled = Controlled_Model.resid
fitted_values_controlled = Controlled_Model.fittedvalues
male_indicator = data['MALE']

# Separate by gender
residuals_female = residuals_controlled[male_indicator == 0]
fitted_values_female = fitted_values_controlled[male_indicator == 0]
residuals_male = residuals_controlled[male_indicator == 1]
fitted_values_male = fitted_values_controlled[male_indicator == 1]

plt.figure(figsize=(12, 7))

rng = np.random.default_rng(0)  # seeded so the jittered plot is reproducible
jitter_amount_y = 0.005
plt.scatter(fitted_values_female,
            residuals_female + rng.uniform(-jitter_amount_y, jitter_amount_y, len(residuals_female)),
            alpha=0.6, color='blue', s=30, label='Female Residuals')
plt.scatter(fitted_values_male,
            residuals_male + rng.uniform(-jitter_amount_y, jitter_amount_y, len(residuals_male)),
            alpha=0.6, color='green', s=30, label='Male Residuals')

sns.regplot(x=fitted_values_controlled, y=residuals_controlled,
            scatter=False, lowess=True, line_kws={'color': 'red', 'lw': 2}, ax=plt.gca())

plt.title('Residual Plot for Controlled Model (by Gender)', fontsize=16)
plt.xlabel('Fitted Values (LogBasePay)', fontsize=12)
plt.ylabel('Residuals', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7)
plt.axhline(y=0, color='grey', linestyle='--', linewidth=0.8)
plt.legend(title='Gender', loc='upper right')
plt.tight_layout()
plt.show()

---
## Step 5 | Results Interpretation

### Key Findings

**Uncontrolled Model (no controls):**
- MALE coefficient: **+0.095** (p < 0.001)
- Interpretation: Males earn approximately 9.5% more than females on average (a gap of 0.095 log points)
- The gap is statistically significant at conventional levels

**Controlled Model (with JobTitle):**
- MALE coefficient: **-0.003** (p = 0.87)
- Interpretation: Comparing men and women with the same job title, the point estimate implies males earn about 0.3% *less* than females
- This estimate is **not statistically significant**, so it cannot be distinguished from zero
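
The percentages above rely on the common approximation that a coefficient of b in a log-outcome regression corresponds to roughly a 100·b percent difference. A minimal sketch of the exact conversion, 100·(exp(b) − 1), applied to the two reported coefficients; the helper name `log_points_to_percent` is made up for this example:

```python
import math

def log_points_to_percent(coef):
    """Exact percent difference implied by a coefficient in a log-outcome regression."""
    return 100 * (math.exp(coef) - 1)

# Coefficients as reported in the Key Findings above
uncontrolled = log_points_to_percent(0.095)
controlled = log_points_to_percent(-0.003)

print(f"Uncontrolled model: {uncontrolled:.2f}%")
print(f"Controlled model:   {controlled:.2f}%")
```

For coefficients this small the approximation and the exact figure are close; the difference grows as the coefficient moves further from zero.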

### Conclusion

Once we control for job title, the gender pay gap is no longer statistically distinguishable from zero. This suggests that, in this dataset, the raw gap is driven by **occupational sorting** (men and women working in different jobs) rather than **direct pay discrimination** (men and women being paid differently for the same job). The regression does not explain why men and women hold different jobs in the first place.
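
The logic behind this conclusion is omitted variable bias: when job is left out, the MALE coefficient picks up the pay premium of the jobs men are more likely to hold. A minimal simulation sketch on synthetic data (not the Glassdoor file), assuming a single hypothetical high-paying job indicator and no direct effect of gender on pay; the `ols` helper is defined only for this example:

```python
import numpy as np

def ols(y, *cols):
    """Least-squares coefficients for y on a constant plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(3)
n = 500

male = rng.integers(0, 2, size=n).astype(float)
# Hypothetical sorting: men are more likely to hold the high-paying job
high_pay_job = (rng.random(n) < 0.3 + 0.4 * male).astype(float)
# Pay depends on the job only
log_pay = 11.0 + 0.25 * high_pay_job + rng.normal(0, 0.2, size=n)

short = ols(log_pay, male)[1]                       # job omitted
_, long_male, gamma = ols(log_pay, male, high_pay_job)  # job included
delta = ols(high_pay_job, male)[1]                  # how strongly job varies with gender

print(f"Short regression MALE coefficient: {short:.4f}")
print(f"Long regression MALE coefficient:  {long_male:.4f}")
print(f"long + gamma * delta:              {long_male + gamma * delta:.4f}")
```

The first and last printed values should coincide: the short-regression coefficient equals the long-regression coefficient plus the job premium (`gamma`) times the gender difference in job holding (`delta`).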

Note: This finding is specific to this Glassdoor dataset and should not be generalized without further research.

---
## Replication Exercises

Try extending this analysis with the following exercises:

### Exercise 1: Add More Controls
Add `Education` and `Seniority` as additional control variables. Does the MALE coefficient change? What does this tell you about other potential sources of the pay gap?

### Exercise 2: Analyze a Specific Job Title
Pick one job title (e.g., Software Engineer or Manager) and run a regression just for that subset. Is there a gender pay gap within that specific occupation?

### Exercise 3: Total Compensation
Create a new variable for total compensation (BasePay + Bonus) and re-run the analysis. Do the results change when including bonuses?

### Challenge Exercise
The dataset shows that Software Engineers are 93% male while Marketing Associates are 91% female. How might this occupational segregation relate to broader questions about the gender pay gap? Write a brief analysis connecting the statistical findings to potential policy implications.
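
As a starting point for checking gender shares like these, the mean of a 0/1 indicator within each group is the share of that group coded 1. A minimal sketch on a made-up `toy` DataFrame (the rows are hypothetical and do not reproduce the dataset's shares):

```python
import pandas as pd

# Hypothetical rows using the notebook's column names
toy = pd.DataFrame({
    'JobTitle': ['Software Engineer'] * 4 + ['Marketing Associate'] * 4,
    'Gender':   ['Male', 'Male', 'Male', 'Female',
                 'Female', 'Female', 'Female', 'Male'],
})
toy['MALE'] = toy['Gender'].map({'Female': 0, 'Male': 1})

# Mean of the 0/1 indicator per job title = share of men, shown in percent
male_share = (toy.groupby('JobTitle')['MALE'].mean()
                 .mul(100).round(1).sort_values())
print(male_share)
```

Applied to `data`, the same expression would give the percentage of men in each job title.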

In [None]:
# Your code for Exercise 1: Add Education and Seniority controls


In [None]:
# Your code for Exercise 2: Analyze a specific job title


In [None]:
# Your code for Exercise 3: Total compensation analysis


In [None]:
# Your code for Challenge Exercise
