# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step by step hints just you, your AI assistant, and real world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

*(Paste your Cursor todo list here)*

- [ ] Example todo item
- [ ] Another example
- [ ] ...

# My Data Cleaning Plan - Step 0

Based on my analysis of the Ask A Manager Salary Survey dataset, here's my comprehensive data cleaning plan:

## Dataset Overview
- **Size**: ~28,000 responses from 2021 salary survey
- **Format**: TSV file with 18 columns
- **Key Challenges**: Multiple currencies, inconsistent formatting, messy job titles, various country/state formats

## Data Quality Issues Identified
1. **Salary Data**: Multiple currencies (USD, GBP, CAD), inconsistent formatting, potential outliers
2. **Location Data**: Inconsistent US state formats ("US", "USA", "United States"), missing states
3. **Job Titles**: Highly variable, need to identify "Software Engineers" and "tech workers"
4. **Experience**: Range format ("5-7 years", "8-10 years") needs conversion to numeric
5. **Industry**: Need to categorize tech vs non-tech industries
6. **Education/Gender**: For bonus questions, need clean categories

## Step-by-Step Cleaning Plan

### Phase 1: Data Exploration & Setup
1. Load dataset and examine structure, data types, missing values
2. Identify all unique currencies and conversion rates needed
3. Map out all unique job titles, industries, and location formats

### Phase 2: Core Data Cleaning
1. **Salary Standardization**:
   - Convert all salaries to USD using 2021 exchange rates
   - Handle missing/zero salaries appropriately
   - Remove extreme outliers (likely data entry errors)
   - Add bonus compensation to total salary

2. **Location Standardization**:
   - Standardize country names to "United States"
   - Clean and standardize US state names
   - Filter to US-only data for core questions

3. **Job Title Categorization**:
   - Create "Software Engineer" category (exact matches + variations)
   - Create "Tech Worker" category (all computing/tech industry roles)
   - Handle job title variations and context

4. **Experience Conversion**:
   - Convert experience ranges to midpoint numeric values
   - Handle edge cases and missing data

### Phase 3: Analysis Preparation
1. **Industry Classification**:
   - Identify tech vs non-tech industries
   - Standardize industry names

2. **Education & Gender Cleaning**:
   - Standardize education levels for bonus questions
   - Clean gender categories

### Phase 4: Business Question Analysis
1. **Question 1**: Median salary for Software Engineers in US
2. **Question 2**: US state with highest average tech worker salary  
3. **Question 3**: Salary increase per year of experience in tech
4. **Question 4**: Highest median salary non-tech industry
5. **Question 5**: Gender salary gap in tech (bonus)
6. **Question 6**: Education level salary comparison (bonus)

## Key Business Decisions
- **Currency Conversion**: Use 2021 average exchange rates (GBP≈1.37, CAD≈0.80)
- **Outlier Handling**: Remove salaries <$20k or >$500k as likely errors
- **Tech Definition**: "Computing or Tech" industry + specific tech job titles
- **Experience Mapping**: Use range midpoints (e.g., "5-7 years" = 6 years)

## Success Criteria
- Final answers within 5% of expected values
- Clean, reproducible analysis
- Clear documentation of cleaning decisions

This plan addresses the real-world messiness of survey data while ensuring we can answer the specific business questions accurately.

## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 50)

print("Libraries imported successfully!")

# Load the dataset
file_path = '../../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv'

try:
    df = pd.read_csv(file_path, sep='\t')
    print(f"✅ Dataset loaded successfully!")
    print(f"📊 Dataset shape: {df.shape}")
    print(f"📋 Columns: {len(df.columns)}")
except FileNotFoundError:
    print("❌ File not found. Please check the file path.")
except Exception as e:
    print(f"❌ Error loading dataset: {e}")

# Data Exploration

# Basic dataset information
print("🔍 DATASET OVERVIEW")
print("=" * 50)
print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print()

print("📋 COLUMN NAMES:")
print("-" * 30)
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

# Data types and missing values
print("📊 DATA TYPES & MISSING VALUES")
print("=" * 50)

info_df = pd.DataFrame({
    'Data Type': df.dtypes,
    'Non-Null Count': df.count(),
    'Null Count': df.isnull().sum(),
    'Null %': (df.isnull().sum() / len(df) * 100).round(1)
})

print(info_df)

# First few rows to understand the data structure
print("👀 FIRST 5 ROWS")
print("=" * 50)
display(df.head())

# Key columns for our analysis
key_columns = [
    'What industry do you work in?',
    'Job title', 
    'What is your annual salary? (You\'ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)',
    'Please indicate the currency',
    'What country do you work in?',
    'If you\'re in the U.S., what state do you work in?',
    'How many years of professional work experience do you have overall?',
    'What is your highest level of education completed?',
    'What is your gender?'
]

print("🎯 KEY COLUMNS FOR ANALYSIS")
print("=" * 50)
for i, col in enumerate(key_columns, 1):
    if col in df.columns:
        print(f"✅ {i}. {col}")
    else:
        print(f"❌ {i}. {col} - NOT FOUND")

# Let's examine the salary column more closely
salary_col = 'What is your annual salary? (You\'ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)'

print("💰 SALARY DATA EXPLORATION")
print("=" * 50)
print(f"Salary column: {salary_col}")
print(f"Non-null salary entries: {df[salary_col].count():,}")
print(f"Null salary entries: {df[salary_col].isnull().sum():,}")
print()

print("Sample salary values:")
print(df[salary_col].dropna().head(10).tolist())

# Currency distribution
currency_col = 'Please indicate the currency'
print("💱 CURRENCY DISTRIBUTION")
print("=" * 50)
currency_counts = df[currency_col].value_counts()
print(currency_counts)
print(f"\nTotal currencies: {len(currency_counts)}")

# Country distribution
country_col = 'What country do you work in?'
print("🌍 COUNTRY DISTRIBUTION")
print("=" * 50)
country_counts = df[country_col].value_counts().head(10)
print(country_counts)
print(f"\nTotal countries: {df[country_col].nunique()}")

# Industry distribution
industry_col = 'What industry do you work in?'
print("🏭 INDUSTRY DISTRIBUTION")
print("=" * 50)
industry_counts = df[industry_col].value_counts()
print(industry_counts)
print(f"\nTotal industries: {len(industry_counts)}")

# Check for tech-related industries
tech_industries = industry_counts[industry_counts.index.str.contains('Tech|Computing', case=False, na=False)]
print(f"\n🔧 TECH-RELATED INDUSTRIES:")
print(tech_industries)

# Job titles - let's look for Software Engineers
job_title_col = 'Job title'
print("💼 JOB TITLE EXPLORATION")
print("=" * 50)
print(f"Total unique job titles: {df[job_title_col].nunique():,}")
print()

# Look for software engineer variations
software_engineer_mask = df[job_title_col].str.contains('software|engineer', case=False, na=False)
software_engineer_titles = df[software_engineer_mask][job_title_col].value_counts().head(10)

print("🔧 SOFTWARE ENGINEER JOB TITLES (Top 10):")
print(software_engineer_titles)
print(f"\nTotal Software Engineer entries: {software_engineer_mask.sum():,}")

# US State distribution
state_col = 'If you\'re in the U.S., what state do you work in?'
print("🇺🇸 US STATE DISTRIBUTION")
print("=" * 50)
state_counts = df[state_col].value_counts().head(15)
print(state_counts)
print(f"\nTotal US states represented: {df[state_col].nunique()}")
print(f"Missing state data: {df[state_col].isnull().sum():,}")

# Experience distribution
exp_col = 'How many years of professional work experience do you have overall?'
print("📈 EXPERIENCE DISTRIBUTION")
print("=" * 50)
exp_counts = df[exp_col].value_counts()
print(exp_counts)
print(f"\nTotal experience categories: {len(exp_counts)}")

# Education distribution
edu_col = 'What is your highest level of education completed?'
print("🎓 EDUCATION DISTRIBUTION")
print("=" * 50)
edu_counts = df[edu_col].value_counts()
print(edu_counts)
print(f"\nTotal education levels: {len(edu_counts)}")

# Gender distribution
gender_col = 'What is your gender?'
print("👥 GENDER DISTRIBUTION")
print("=" * 50)
gender_counts = df[gender_col].value_counts()
print(gender_counts)
print(f"\nTotal gender categories: {len(gender_counts)}")

Libraries imported successfully!
✅ Dataset loaded successfully!
📊 Dataset shape: (28062, 18)
📋 Columns: 18
🔍 DATASET OVERVIEW
Shape: 28,062 rows × 18 columns
Memory usage: 25.8 MB

📋 COLUMN NAMES:
------------------------------
 1. Timestamp
 2. How old are you?
 3. What industry do you work in?
 4. Job title
 5. If your job title needs additional context, please clarify here:
 6. What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)
 7. How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.
 8. Please indicate the currency
 9. If "Other," please indicate the currency here: 
10. If your income needs additional context, please provide it here:
11. What country do you work in?
12. I

Unnamed: 0,Timestamp,How old are you?,What industry do you work in?,Job title,"If your job title needs additional context, please clarify here:","What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)","How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",Please indicate the currency,"If ""Other,"" please indicate the currency here:","If your income needs additional context, please provide it here:",What country do you work in?,"If you're in the U.S., what state do you work in?",What city do you work in?,How many years of professional work experience do you have overall?,How many years of professional work experience do you have in your field?,What is your highest level of education completed?,What is your gender?,What is your race? (Choose all that apply.)
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


🎯 KEY COLUMNS FOR ANALYSIS
✅ 1. What industry do you work in?
✅ 2. Job title
✅ 3. What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)
✅ 4. Please indicate the currency
✅ 5. What country do you work in?
✅ 6. If you're in the U.S., what state do you work in?
✅ 7. How many years of professional work experience do you have overall?
✅ 8. What is your highest level of education completed?
✅ 9. What is your gender?
💰 SALARY DATA EXPLORATION
Salary column: What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)
Non-null salary entries: 28,062
Null salary entries: 0

Sample salary values:
['55,000', '54,600', '34,000', '62,000', '60,000', '62,000', '3

## Step 2: Data Cleaning


In [4]:
# Create a copy of the original dataset for cleaning
df_clean = df.copy()
print("📋 Created clean dataset copy")
print(f"Original shape: {df.shape}")
print(f"Clean dataset shape: {df_clean.shape}")

# Define column names for easier access
salary_col = 'What is your annual salary? (You\'ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)'
currency_col = 'Please indicate the currency'
bonus_col = 'How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.'
country_col = 'What country do you work in?'
state_col = 'If you\'re in the U.S., what state do you work in?'
industry_col = 'What industry do you work in?'
job_title_col = 'Job title'
exp_col = 'How many years of professional work experience do you have overall?'
edu_col = 'What is your highest level of education completed?'
gender_col = 'What is your gender?'

print("✅ Column references defined")

### 2.1: Salary Data Cleaning
#First, let's tackle the most complex part - cleaning and standardizing salary data across multiple currencies.

# Define 2021 exchange rates (approximate averages for the year)
exchange_rates = {
    'USD': 1.0,
    'GBP': 1.37,  # 1 GBP = 1.37 USD (2021 average)
    'CAD': 0.80,  # 1 CAD = 0.80 USD (2021 average)
    'EUR': 1.18,  # 1 EUR = 1.18 USD (2021 average)
    'AUD': 0.75,  # 1 AUD = 0.75 USD (2021 average)
    'CHF': 1.09,  # 1 CHF = 1.09 USD (2021 average)
    'SEK': 0.12,  # 1 SEK = 0.12 USD (2021 average)
    'NOK': 0.12,  # 1 NOK = 0.12 USD (2021 average)
    'DKK': 0.16,  # 1 DKK = 0.16 USD (2021 average)
    'JPY': 0.009, # 1 JPY = 0.009 USD (2021 average)
    'INR': 0.014, # 1 INR = 0.014 USD (2021 average)
    'SGD': 0.74,  # 1 SGD = 0.74 USD (2021 average)
    'NZD': 0.71,  # 1 NZD = 0.71 USD (2021 average)
    'ZAR': 0.067, # 1 ZAR = 0.067 USD (2021 average)
    'BRL': 0.19,  # 1 BRL = 0.19 USD (2021 average)
    'MXN': 0.050, # 1 MXN = 0.050 USD (2021 average)
    'ILS': 0.31,  # 1 ILS = 0.31 USD (2021 average)
    'PLN': 0.26,  # 1 PLN = 0.26 USD (2021 average)
    'CZK': 0.046, # 1 CZK = 0.046 USD (2021 average)
    'HUF': 0.0033,# 1 HUF = 0.0033 USD (2021 average)
    'TRY': 0.12,  # 1 TRY = 0.12 USD (2021 average)
    'RUB': 0.014, # 1 RUB = 0.014 USD (2021 average)
    'CNY': 0.15,  # 1 CNY = 0.15 USD (2021 average)
    'KRW': 0.0009,# 1 KRW = 0.0009 USD (2021 average)
    'THB': 0.030, # 1 THB = 0.030 USD (2021 average)
    'MYR': 0.24,  # 1 MYR = 0.24 USD (2021 average)
    'PHP': 0.020, # 1 PHP = 0.020 USD (2021 average)
    'IDR': 0.00007,# 1 IDR = 0.00007 USD (2021 average)
    'VND': 0.000044,# 1 VND = 0.000044 USD (2021 average)
}

print("💱 Exchange rates defined for 2021")
print(f"Supported currencies: {len(exchange_rates)}")
print("Top currencies by frequency:")
print(df_clean[currency_col].value_counts().head(10))

# Function to clean and convert salary to USD
def clean_salary_to_usd(salary_str, currency_str, bonus_str=None):
    """
    Clean salary string and convert to USD
    """
    if pd.isna(salary_str) or salary_str == '':
        return np.nan
    
    # Convert to string and clean
    salary_str = str(salary_str).strip()
    
    # Remove common non-numeric characters but keep decimal point
    import re
    salary_clean = re.sub(r'[^\d.,]', '', salary_str)
    
    # Handle different number formats
    if ',' in salary_clean and '.' in salary_clean:
        # Format like "1,234.56" - US format
        salary_clean = salary_clean.replace(',', '')
    elif ',' in salary_clean:
        # Could be European format "1.234,56" or thousands separator "1,234"
        if salary_clean.count(',') == 1 and salary_clean.count('.') == 0:
            # Check if it's likely thousands separator (comma near end) or decimal
            comma_pos = salary_clean.find(',')
            if len(salary_clean) - comma_pos <= 3:  # Likely thousands separator
                salary_clean = salary_clean.replace(',', '')
            else:  # Likely decimal separator
                salary_clean = salary_clean.replace(',', '.')
        else:
            # Multiple commas, likely thousands separators
            salary_clean = salary_clean.replace(',', '')
    try:
        salary_num = float(salary_clean)
    except (ValueError, TypeError):
        return np.nan
    
    # Handle bonus if provided
    bonus_num = 0
    if bonus_str is not None and not pd.isna(bonus_str) and str(bonus_str).strip() != '':
        try:
            bonus_clean = re.sub(r'[^\d.,]', '', str(bonus_str))
            if ',' in bonus_clean:
                bonus_clean = bonus_clean.replace(',', '')
            bonus_num = float(bonus_clean)
        except (ValueError, TypeError):
            bonus_num = 0
    
    # Total compensation
    total_comp = salary_num + bonus_num
    
    # Convert to USD
    if pd.isna(currency_str) or currency_str not in exchange_rates:
        return np.nan
    
    usd_amount = total_comp * exchange_rates[currency_str]
    
    return usd_amount

print("🔧 Salary cleaning function defined")

# Apply salary cleaning function
print("💰 Cleaning salary data...")

# Create new column for cleaned USD salaries
df_clean['salary_usd'] = df_clean.apply(
    lambda row: clean_salary_to_usd(
        row[salary_col], 
        row[currency_col], 
        row[bonus_col]
    ), 
    axis=1
)

print(f"✅ Salary cleaning completed")
print(f"Valid USD salaries: {df_clean['salary_usd'].notna().sum():,}")
print(f"Invalid/missing salaries: {df_clean['salary_usd'].isna().sum():,}")

# Show some examples of cleaned salaries
print("\n📊 Sample of cleaned salaries:")
sample_data = df_clean[['salary_usd', salary_col, currency_col, bonus_col]].dropna(subset=['salary_usd']).head(10)
display(sample_data)

# Remove salary outliers (likely data entry errors)
print("🔍 Analyzing salary distribution for outliers...")

# Get basic statistics
salary_stats = df_clean['salary_usd'].describe()
print("Salary statistics (USD):")
print(salary_stats)

# Define reasonable salary bounds (2021 context)
min_salary = 20000  # $20k minimum (very low but possible for part-time/entry level)
max_salary = 500000  # $500k maximum (very high but possible for executives)

# Count outliers
outliers_low = (df_clean['salary_usd'] < min_salary).sum()
outliers_high = (df_clean['salary_usd'] > max_salary).sum()
total_outliers = outliers_low + outliers_high

print(f"\n🚨 Outlier Analysis:")
print(f"Salaries below ${min_salary:,}: {outliers_low:,}")
print(f"Salaries above ${max_salary:,}: {outliers_high:,}")
print(f"Total outliers: {total_outliers:,}")

# Remove outliers
df_clean = df_clean[
    (df_clean['salary_usd'] >= min_salary) | 
    (df_clean['salary_usd'].isna())
].copy()

print(f"\n✅ Outliers removed")
print(f"Remaining records: {len(df_clean):,}")
print(f"Valid salaries after cleaning: {df_clean['salary_usd'].notna().sum():,}")

### 2.2: Location Data Standardization
#Now let's clean up the location data to standardize country and state names.

# Standardize country names
print("🌍 Standardizing country names...")

# Create mapping for US variations
us_variations = {
    'United States': 'United States',
    'US': 'United States', 
    'USA': 'United States',
    'U.S.': 'United States',
    'U.S.A.': 'United States',
    'United States of America': 'United States'
}

# Apply country standardization
df_clean['country_clean'] = df_clean[country_col].replace(us_variations)

print("Country distribution after cleaning:")
print(df_clean['country_clean'].value_counts().head(10))

# Filter for US data only (since our questions focus on US)
us_mask = df_clean['country_clean'] == 'United States'
df_us = df_clean[us_mask].copy()

print(f"\n🇺🇸 US-only dataset:")
print(f"US records: {len(df_us):,}")
print(f"Non-US records: {len(df_clean) - len(df_us):,}")
print(f"US percentage: {len(df_us)/len(df_clean)*100:.1f}%")

# Standardize US state names
print("🗺️ Standardizing US state names...")

# Create comprehensive state mapping
state_mapping = {
    # Standard abbreviations
    'AL': 'Alabama', 'AK': 'Alaska', 'AZ': 'Arizona', 'AR': 'Arkansas', 'CA': 'California',
    'CO': 'Colorado', 'CT': 'Connecticut', 'DE': 'Delaware', 'FL': 'Florida', 'GA': 'Georgia',
    'HI': 'Hawaii', 'ID': 'Idaho', 'IL': 'Illinois', 'IN': 'Indiana', 'IA': 'Iowa',
    'KS': 'Kansas', 'KY': 'Kentucky', 'LA': 'Louisiana', 'ME': 'Maine', 'MD': 'Maryland',
    'MA': 'Massachusetts', 'MI': 'Michigan', 'MN': 'Minnesota', 'MS': 'Mississippi', 'MO': 'Missouri',
    'MT': 'Montana', 'NE': 'Nebraska', 'NV': 'Nevada', 'NH': 'New Hampshire', 'NJ': 'New Jersey',
    'NM': 'New Mexico', 'NY': 'New York', 'NC': 'North Carolina', 'ND': 'North Dakota', 'OH': 'Ohio',
    'OK': 'Oklahoma', 'OR': 'Oregon', 'PA': 'Pennsylvania', 'RI': 'Rhode Island', 'SC': 'South Carolina',
    'SD': 'South Dakota', 'TN': 'Tennessee', 'TX': 'Texas', 'UT': 'Utah', 'VT': 'Vermont',
    'VA': 'Virginia', 'WA': 'Washington', 'WV': 'West Virginia', 'WI': 'Wisconsin', 'WY': 'Wyoming',
    'DC': 'District of Columbia',
    
    # Common variations and full names
    'California': 'California', 'New York': 'New York', 'Texas': 'Texas', 'Florida': 'Florida',
    'Illinois': 'Illinois', 'Pennsylvania': 'Pennsylvania', 'Ohio': 'Ohio', 'Georgia': 'Georgia',
    'North Carolina': 'North Carolina', 'Michigan': 'Michigan', 'New Jersey': 'New Jersey',
    'Virginia': 'Virginia', 'Washington': 'Washington', 'Arizona': 'Arizona', 'Massachusetts': 'Massachusetts',
    'Tennessee': 'Tennessee', 'Indiana': 'Indiana', 'Missouri': 'Missouri', 'Maryland': 'Maryland',
    'Wisconsin': 'Wisconsin', 'Colorado': 'Colorado', 'Minnesota': 'Minnesota', 'South Carolina': 'South Carolina',
    'Alabama': 'Alabama', 'Louisiana': 'Louisiana', 'Kentucky': 'Kentucky', 'Oregon': 'Oregon',
    'Oklahoma': 'Oklahoma', 'Connecticut': 'Connecticut', 'Utah': 'Utah', 'Iowa': 'Iowa',
    'Nevada': 'Nevada', 'Arkansas': 'Arkansas', 'Mississippi': 'Mississippi', 'Kansas': 'Kansas',
    'New Mexico': 'New Mexico', 'Nebraska': 'Nebraska', 'West Virginia': 'West Virginia', 'Idaho': 'Idaho',
    'Hawaii': 'Hawaii', 'New Hampshire': 'New Hampshire', 'Maine': 'Maine', 'Montana': 'Montana',
    'Rhode Island': 'Rhode Island', 'Delaware': 'Delaware', 'South Dakota': 'South Dakota', 'North Dakota': 'North Dakota',
    'Alaska': 'Alaska', 'Vermont': 'Vermont', 'Wyoming': 'Wyoming',

    # Special cases
    'District of Columbia': 'District of Columbia', 'Washington DC': 'District of Columbia',
    'Washington, DC': 'District of Columbia', 'Washington D.C.': 'District of Columbia',
    'D.C.': 'District of Columbia', 'DC': 'District of Columbia'
}

# Apply state standardization
df_us['state_clean'] = df_us[state_col].replace(state_mapping)

print("State distribution after cleaning (top 15):")
print(df_us['state_clean'].value_counts().head(15))

# Check for unmapped states
unmapped_states = df_us[~df_us[state_col].isin(state_mapping.keys()) & df_us[state_col].notna()][state_col].unique()
if len(unmapped_states) > 0:
    print(f"\n⚠️ Unmapped states found: {unmapped_states[:10]}")  # Show first 10
    print(f"Total unmapped state entries: {len(unmapped_states)}")

print(f"\n✅ State standardization completed")
print(f"Records with valid states: {df_us['state_clean'].notna().sum():,}")
print(f"Records with missing states: {df_us['state_clean'].isna().sum():,}")

### 2.3: Job Title and Industry Classification
#Now let's clean job titles and classify industries to identify Software Engineers and tech workers.

# Clean and standardize job titles
print("💼 Cleaning job titles...")

# Function to clean job titles
def clean_job_title(title):
    if pd.isna(title):
        return title
    
    # Convert to string and clean
    title_clean = str(title).strip()
    
    # Remove extra whitespace
    title_clean = ' '.join(title_clean.split())
    
    # Standardize common variations
    title_clean = title_clean.replace('Sr.', 'Senior')
    title_clean = title_clean.replace('Jr.', 'Junior')
    title_clean = title_clean.replace('Sr ', 'Senior ')
    title_clean = title_clean.replace('Jr ', 'Junior ')
    
    return title_clean

# Apply job title cleaning
df_us['job_title_clean'] = df_us[job_title_col].apply(clean_job_title)

print("✅ Job titles cleaned")
print(f"Unique job titles: {df_us['job_title_clean'].nunique():,}")

# Show some examples
print("\nSample cleaned job titles:")
print(df_us['job_title_clean'].value_counts().head(10))

# Identify Software Engineers
print("🔧 Identifying Software Engineers...")

# Define patterns for Software Engineer identification
software_engineer_patterns = [
    'software engineer', 'software developer', 'software architect', 'software analyst',
    'software consultant', 'software manager', 'software lead', 'software director',
    'software specialist', 'software technician', 'software programmer', 'software designer',
    'senior software engineer', 'principal software engineer', 'staff software engineer',
    'lead software engineer', 'software engineering', 'software development',
    'backend engineer', 'frontend engineer', 'full stack engineer', 'full-stack engineer',
    'mobile developer', 'web developer', 'application developer', 'systems developer',
    'devops engineer', 'platform engineer', 'infrastructure engineer', 'cloud engineer',
    'data engineer', 'machine learning engineer', 'ai engineer', 'ml engineer'
]

# Create function to identify software engineers
def is_software_engineer(title):
    if pd.isna(title):
        return False
    
    title_lower = str(title).lower()
    
    # Check for exact matches or contains patterns
    for pattern in software_engineer_patterns:
        if pattern in title_lower:
            return True
    
    return False

# Apply software engineer identification
df_us['is_software_engineer'] = df_us['job_title_clean'].apply(is_software_engineer)

# Count software engineers
se_count = df_us['is_software_engineer'].sum()
print(f"✅ Software Engineers identified: {se_count:,}")

# Show software engineer job titles
print("\n🔧 Software Engineer job titles (top 15):")
se_titles = df_us[df_us['is_software_engineer']]['job_title_clean'].value_counts().head(15)
print(se_titles)

# Identify tech workers (broader category)
print("💻 Identifying tech workers...")

# Define tech industry and job patterns
tech_industries = [
    'Computing or Tech', 'Technology', 'Software', 'IT', 'Information Technology',
    'Computer', 'Tech', 'Digital', 'Cybersecurity', 'Data', 'AI', 'Machine Learning'
]

tech_job_patterns = [
    'engineer', 'developer', 'programmer', 'analyst', 'architect', 'consultant',
    'manager', 'director', 'lead', 'specialist', 'technician', 'designer',
    'data scientist', 'data analyst', 'product manager', 'technical', 'systems',
    'network', 'security', 'cloud', 'devops', 'platform', 'infrastructure',
    'database', 'qa', 'quality assurance', 'test', 'automation', 'scrum',
    'agile', 'product owner', 'business analyst', 'technical writer', 'support'
]

# Function to identify tech workers
def is_tech_worker(industry, title):
    if pd.isna(industry) and pd.isna(title):
        return False
    
    # Check industry
    if not pd.isna(industry):
        industry_lower = str(industry).lower()
        for tech_ind in tech_industries:
            if tech_ind.lower() in industry_lower:
                return True
            
    # Check job title
    if not pd.isna(title):
        title_lower = str(title).lower()
        for pattern in tech_job_patterns:
            if pattern in title_lower:
                return True
    
    return False

# Apply tech worker identification
df_us['is_tech_worker'] = df_us.apply(
    lambda row: is_tech_worker(row[industry_col], row['job_title_clean']), 
    axis=1
)

# Count tech workers
tech_count = df_us['is_tech_worker'].sum()
print(f"✅ Tech workers identified: {tech_count:,}")

# Show tech industry distribution
print("\n💻 Tech industry distribution:")
tech_industries_dist = df_us[df_us['is_tech_worker']][industry_col].value_counts().head(10)
print(tech_industries_dist)

### 2.4: Experience Data Conversion
#Now let's convert experience ranges to numeric values for analysis.

# Convert experience ranges to numeric values
print("📈 Converting experience ranges to numeric values...")

# Define experience mapping (using midpoints of ranges)
experience_mapping = {
    '1 year or less': 0.5,
    '2 - 4 years': 3,
    '5-7 years': 6,
    '8 - 10 years': 9,
    '11 - 20 years': 15.5,
    '21 - 30 years': 25.5,
    '31 - 40 years': 35.5,
    '41 years or more': 45
}

# Apply experience conversion
df_us['experience_years'] = df_us[exp_col].replace(experience_mapping)

print("Experience distribution after conversion:")
exp_dist = df_us['experience_years'].value_counts().sort_index()
print(exp_dist)

print(f"\n✅ Experience conversion completed")
print(f"Records with valid experience: {df_us['experience_years'].notna().sum():,}")
print(f"Records with missing experience: {df_us['experience_years'].isna().sum():,}")

# Show experience statistics
print(f"\n📊 Experience statistics:")
print(df_us['experience_years'].describe())

### 2.5: Education and Gender Standardization
#Finally, let's clean education and gender data for the bonus questions.

# Standardize education levels
print("🎓 Standardizing education levels...")

# Define education mapping
education_mapping = {
    'High School': 'High School',
    'Some college': 'Some College',
    'College degree': 'Bachelor\'s Degree',
    'Bachelor\'s degree': 'Bachelor\'s Degree',
    'Master\'s degree': 'Master\'s Degree',
    'PhD': 'PhD',
    'Professional degree (MD, JD, etc.)': 'Professional Degree',
    'Some high school': 'High School',
    'Trade school': 'Trade School',
    'Associate degree': 'Associate Degree'
}

# Apply education standardization
df_us['education_clean'] = df_us[edu_col].replace(education_mapping)

print("Education distribution after cleaning:")
edu_dist = df_us['education_clean'].value_counts()
print(edu_dist)

print(f"\n✅ Education standardization completed")
print(f"Records with valid education: {df_us['education_clean'].notna().sum():,}")

# Standardize gender categories
print("👥 Standardizing gender categories...")

# Define gender mapping
gender_mapping = {
    'Man': 'Man',
    'Woman': 'Woman',
    'Non-binary': 'Non-binary',
    'Another option not listed here or prefer not to answer': 'Other/Prefer not to answer',
    'Prefer not to answer': 'Other/Prefer not to answer'
}

# Apply gender standardization
df_us['gender_clean'] = df_us[gender_col].replace(gender_mapping)

print("Gender distribution after cleaning:")
gender_dist = df_us['gender_clean'].value_counts()
print(gender_dist)

print(f"\n✅ Gender standardization completed")
print(f"Records with valid gender: {df_us['gender_clean'].notna().sum():,}")

📋 Created clean dataset copy
Original shape: (28062, 18)
Clean dataset shape: (28062, 18)
✅ Column references defined
💱 Exchange rates defined for 2021
Supported currencies: 29
Top currencies by frequency:
Please indicate the currency
USD        23374
CAD         1673
GBP         1591
EUR          643
AUD/NZD      504
Other        160
CHF           37
SEK           37
JPY           23
ZAR           16
Name: count, dtype: int64
🔧 Salary cleaning function defined
💰 Cleaning salary data...
✅ Salary cleaning completed
Valid USD salaries: 27,394
Invalid/missing salaries: 668

📊 Sample of cleaned salaries:


Unnamed: 0,salary_usd,"What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)",Please indicate the currency,"How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits."
0,55.0,55000,USD,0.0
1,5554.802,54600,GBP,4000.0
2,34.0,34000,USD,
3,3062.0,62000,USD,3000.0
4,7060.0,60000,USD,7000.0
5,62.0,62000,USD,
6,2033.0,33000,USD,2000.0
7,50.0,50000,USD,
8,10112.0,112000,USD,10000.0
9,45.0,45000,USD,0.0


🔍 Analyzing salary distribution for outliers...
Salary statistics (USD):
count    2.739400e+04
mean     2.134217e+05
std      2.900788e+07
min      0.000000e+00
25%      7.700000e+01
50%      3.105000e+03
75%      5.000000e+04
max      4.800056e+09
Name: salary_usd, dtype: float64

🚨 Outlier Analysis:
Salaries below $20,000: 17,992
Salaries above $500,000: 75
Total outliers: 18,067

✅ Outliers removed
Remaining records: 10,070
Valid salaries after cleaning: 9,402
🌍 Standardizing country names...
Country distribution after cleaning:
country_clean
United States     7089
Canada             427
Australia          313
United States      276
USA                216
Usa                205
UK                 156
United Kingdom     125
United states      112
New Zealand         93
Name: count, dtype: int64

🇺🇸 US-only dataset:
US records: 7,089
Non-US records: 2,981
US percentage: 70.4%
🗺️ Standardizing US state names...
State distribution after cleaning (top 15):
state_clean
California         

  df_us['experience_years'] = df_us[exp_col].replace(experience_mapping)


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [5]:
# Question 1: What is the median salary for Software Engineers in the United States?

print("🔧 QUESTION 1: Software Engineer Salary Analysis")
print("=" * 60)

# First, let's check if we have the cleaned data or need to do quick cleaning
if 'is_software_engineer' not in df_us.columns:
    print("⚠️ Need to identify Software Engineers from raw data...")
    
    # Quick Software Engineer identification from raw data
    job_title_col = 'Job title'
    software_engineer_patterns = [
        'software engineer', 'software developer', 'software architect', 'software analyst',
        'software consultant', 'software manager', 'software lead', 'software director',
        'software specialist', 'software technician', 'software programmer', 'software designer',
        'senior software engineer', 'principal software engineer', 'staff software engineer',
        'lead software engineer', 'software engineering', 'software development',
        'backend engineer', 'frontend engineer', 'full stack engineer', 'full-stack engineer',
        'mobile developer', 'web developer', 'application developer', 'systems developer',
        'devops engineer', 'platform engineer', 'infrastructure engineer', 'cloud engineer',
        'data engineer', 'machine learning engineer', 'ai engineer', 'ml engineer'
    ]
    
    def is_software_engineer(title):
        if pd.isna(title):
            return False
        title_lower = str(title).lower()
        return any(pattern in title_lower for pattern in software_engineer_patterns)
    
    df_us['is_software_engineer'] = df_us[job_title_col].apply(is_software_engineer)
    print(f"✅ Identified Software Engineers from raw data")

    # Filter for Software Engineers with valid salaries
se_with_salary = df_us[
    (df_us['is_software_engineer'] == True) & 
    (df_us['salary_usd'].notna())
].copy()

print(f"\n📊 Software Engineer Data:")
print(f"Total Software Engineers: {df_us['is_software_engineer'].sum():,}")
print(f"Software Engineers with valid salaries: {len(se_with_salary):,}")

if len(se_with_salary) == 0:
    print("❌ No Software Engineers with valid salary data found!")
    print("Let's check what we have...")
    print(f"Total records: {len(df_us):,}")
    print(f"Records with salary data: {df_us['salary_usd'].notna().sum():,}")
    print(f"Software Engineers (any salary): {df_us['is_software_engineer'].sum():,}")
else:
    # Calculate median salary
    median_salary = se_with_salary['salary_usd'].median()
    
    # Additional statistics
    mean_salary = se_with_salary['salary_usd'].mean()
    min_salary = se_with_salary['salary_usd'].min()
    max_salary = se_with_salary['salary_usd'].max()
    std_salary = se_with_salary['salary_usd'].std()
    
    print(f"\n💰 SOFTWARE ENGINEER SALARY STATISTICS:")
    print(f"Median Salary: ${median_salary:,.0f}")
    print(f"Mean Salary: ${mean_salary:,.0f}")
    print(f"Min Salary: ${min_salary:,.0f}")
    print(f"Max Salary: ${max_salary:,.0f}")
    print(f"Standard Deviation: ${std_salary:,.0f}")

    # Show salary distribution
    print(f"\n📈 Salary Distribution:")
    salary_quartiles = se_with_salary['salary_usd'].quantile([0.25, 0.5, 0.75])
    print(f"25th Percentile: ${salary_quartiles[0.25]:,.0f}")
    print(f"50th Percentile (Median): ${salary_quartiles[0.5]:,.0f}")
    print(f"75th Percentile: ${salary_quartiles[0.75]:,.0f}")
    
    # Show some sample Software Engineer job titles and salaries
    print(f"\n🔧 Sample Software Engineer Job Titles & Salaries:")
    sample_data = se_with_salary[['Job title', 'salary_usd']].head(10)
    for idx, row in sample_data.iterrows():
        print(f"  • {row['Job title']}: ${row['salary_usd']:,.0f}")
    
    print(f"\n🎯 ANSWER TO QUESTION 1:")
    print(f"The median salary for Software Engineers in the United States is ${median_salary:,.0f}")

🔧 QUESTION 1: Software Engineer Salary Analysis

📊 Software Engineer Data:
Total Software Engineers: 475
Software Engineers with valid salaries: 475

💰 SOFTWARE ENGINEER SALARY STATISTICS:
Median Salary: $125,000
Mean Salary: $146,461
Min Salary: $20,075
Max Salary: $1,675,000
Standard Deviation: $147,795

📈 Salary Distribution:
25th Percentile: $60,612
50th Percentile (Median): $125,000
75th Percentile: $175,000

🔧 Sample Software Engineer Job Titles & Salaries:
  • Software Engineer: $189,156
  • Senior Software Engineer : $20,120
  • Senior Data Engineer: $50,110
  • Senior Software Engineer : $24,215
  • Software engineer : $100,160
  • Software Developer: $20,100
  • Director, Software Engineering: $22,651
  • Software Development Manager: $850,250
  • Senior Software Engineering Manager: $50,220
  • Software Developer: $60,225

🎯 ANSWER TO QUESTION 1:
The median salary for Software Engineers in the United States is $125,000


In [8]:
# Question 2: Which US state has the highest average salary for tech workers?

print("💻 QUESTION 2: Tech Worker Salaries by State Analysis")
print("=" * 65)

# First, let's check if we have the cleaned data or need to do quick cleaning
if 'is_tech_worker' not in df_us.columns:
    print("⚠️ Need to identify tech workers from raw data...")
    
    # Quick tech worker identification from raw data
    industry_col = 'What industry do you work in?'
    job_title_col = 'Job title'
    
    # Define tech industry and job patterns
    tech_industries = [
        'Computing or Tech', 'Technology', 'Software', 'IT', 'Information Technology',
        'Computer', 'Tech', 'Digital', 'Cybersecurity', 'Data', 'AI', 'Machine Learning'
    ]
    
    tech_job_patterns = [
        'engineer', 'developer', 'programmer', 'analyst', 'architect', 'consultant',
        'manager', 'director', 'lead', 'specialist', 'technician', 'designer',
        'data scientist', 'data analyst', 'product manager', 'technical', 'systems',
        'network', 'security', 'cloud', 'devops', 'platform', 'infrastructure',
        'database', 'qa', 'quality assurance', 'test', 'automation', 'scrum',
        'agile', 'product owner', 'business analyst', 'technical writer', 'support'
    ]
    
    def is_tech_worker(industry, title):
        if pd.isna(industry) and pd.isna(title):
            return False
        
        # Check industry
        if not pd.isna(industry):
            industry_lower = str(industry).lower()
            for tech_ind in tech_industries:
                if tech_ind.lower() in industry_lower:
                    return True
        
        # Check job title
        if not pd.isna(title):
            title_lower = str(title).lower()
            for pattern in tech_job_patterns:
                if pattern in title_lower:
                    return True
        
        return False
    
    df_us['is_tech_worker'] = df_us.apply(
        lambda row: is_tech_worker(row[industry_col], row[job_title_col]), 
        axis=1
    )
    print(f"✅ Identified tech workers from raw data")

# Check if we have state data
if 'state_clean' not in df_us.columns:
    print("⚠️ Need to clean state data...")

    # Quick state cleaning
    state_col = 'If you\'re in the U.S., what state do you work in?'
    
    # Basic state mapping (most common ones)
    state_mapping = {
        'CA': 'California', 'NY': 'New York', 'TX': 'Texas', 'FL': 'Florida',
        'IL': 'Illinois', 'PA': 'Pennsylvania', 'OH': 'Ohio', 'GA': 'Georgia',
        'NC': 'North Carolina', 'MI': 'Michigan', 'NJ': 'New Jersey', 'VA': 'Virginia',
        'WA': 'Washington', 'AZ': 'Arizona', 'MA': 'Massachusetts', 'TN': 'Tennessee',
        'IN': 'Indiana', 'MO': 'Missouri', 'MD': 'Maryland', 'WI': 'Wisconsin',
        'CO': 'Colorado', 'MN': 'Minnesota', 'SC': 'South Carolina', 'AL': 'Alabama',
        'LA': 'Louisiana', 'KY': 'Kentucky', 'OR': 'Oregon', 'OK': 'Oklahoma',
        'CT': 'Connecticut', 'UT': 'Utah', 'IA': 'Iowa', 'NV': 'Nevada',
        'AR': 'Arkansas', 'MS': 'Mississippi', 'KS': 'Kansas', 'NM': 'New Mexico',
        'NE': 'Nebraska', 'WV': 'West Virginia', 'ID': 'Idaho', 'HI': 'Hawaii',
        'NH': 'New Hampshire', 'ME': 'Maine', 'MT': 'Montana', 'RI': 'Rhode Island',
        'DE': 'Delaware', 'SD': 'South Dakota', 'ND': 'North Dakota', 'AK': 'Alaska',
        'VT': 'Vermont', 'WY': 'Wyoming', 'DC': 'District of Columbia',
        # Full names
        'California': 'California', 'New York': 'New York', 'Texas': 'Texas',
        'Florida': 'Florida', 'Illinois': 'Illinois', 'Pennsylvania': 'Pennsylvania',
        'Ohio': 'Ohio', 'Georgia': 'Georgia', 'North Carolina': 'North Carolina',
        'Michigan': 'Michigan', 'New Jersey': 'New Jersey', 'Virginia': 'Virginia',
        'Washington': 'Washington', 'Arizona': 'Arizona', 'Massachusetts': 'Massachusetts',
        'Tennessee': 'Tennessee', 'Indiana': 'Indiana', 'Missouri': 'Missouri',
        'Maryland': 'Maryland', 'Wisconsin': 'Wisconsin', 'Colorado': 'Colorado',
        'Minnesota': 'Minnesota', 'South Carolina': 'South Carolina', 'Alabama': 'Alabama',
        'Louisiana': 'Louisiana', 'Kentucky': 'Kentucky', 'Oregon': 'Oregon',
        'Oklahoma': 'Oklahoma', 'Connecticut': 'Connecticut', 'Utah': 'Utah',
        'Iowa': 'Iowa', 'Nevada': 'Nevada', 'Arkansas': 'Arkansas', 'Mississippi': 'Mississippi',
        'Kansas': 'Kansas', 'New Mexico': 'New Mexico', 'Nebraska': 'Nebraska',
        'West Virginia': 'West Virginia', 'Idaho': 'Idaho', 'Hawaii': 'Hawaii',
        'New Hampshire': 'New Hampshire', 'Maine': 'Maine', 'Montana': 'Montana',
        'Rhode Island': 'Rhode Island', 'Delaware': 'Delaware', 'South Dakota': 'South Dakota',
        'North Dakota': 'North Dakota', 'Alaska': 'Alaska', 'Vermont': 'Vermont',
        'Wyoming': 'Wyoming', 'District of Columbia': 'District of Columbia',
        'Washington DC': 'District of Columbia', 'Washington, DC': 'District of Columbia',
        'Washington D.C.': 'District of Columbia', 'D.C.': 'District of Columbia'
    }

    df_us['state_clean'] = df_us[state_col].replace(state_mapping)
    print(f"✅ Cleaned state data")

# Filter for tech workers with valid salaries and states
tech_workers_with_data = df_us[
    (df_us['is_tech_worker'] == True) & 
    (df_us['salary_usd'].notna()) & 
    (df_us['state_clean'].notna())
].copy()

print(f"\n📊 Tech Worker Data:")
print(f"Total tech workers: {df_us['is_tech_worker'].sum():,}")
print(f"Tech workers with valid salary and state data: {len(tech_workers_with_data):,}")

if len(tech_workers_with_data) == 0:
    print("❌ No tech workers with valid salary and state data found!")
    print("Let's check what we have...")
    print(f"Total records: {len(df_us):,}")
    print(f"Tech workers: {df_us['is_tech_worker'].sum():,}")
    print(f"Records with salary data: {df_us['salary_usd'].notna().sum():,}")
    print(f"Records with state data: {df_us['state_clean'].notna().sum():,}")
else:
    # Calculate average salary by state
    state_salary_stats = tech_workers_with_data.groupby('state_clean')['salary_usd'].agg([
        'count', 'mean', 'median', 'std'
    ]).round(0)
    
    # Filter states with at least 5 tech workers for statistical significance
    state_salary_stats = state_salary_stats[state_salary_stats['count'] >= 5]
    
    # Sort by average salary (descending)
    state_salary_stats = state_salary_stats.sort_values('mean', ascending=False)
    
    print(f"\n💰 TECH WORKER SALARY BY STATE (Top 15):")
    print("=" * 80)
    print(f"{'State':<20} {'Count':<8} {'Avg Salary':<12} {'Median':<12} {'Std Dev':<12}")
    print("-" * 80)

    for state, row in state_salary_stats.head(15).iterrows():
        print(f"{state:<20} {int(row['count']):<8} ${row['mean']:,.0f}     ${row['median']:,.0f}     ${row['std']:,.0f}")
    
    # Get the state with highest average salary
    highest_paying_state = state_salary_stats.index[0]
    highest_avg_salary = state_salary_stats.iloc[0]['mean']
    state_count = int(state_salary_stats.iloc[0]['count'])
    
    print(f"\n🎯 ANSWER TO QUESTION 2:")
    print(f"The US state with the highest average salary for tech workers is {highest_paying_state}")
    print(f"Average salary: ${highest_avg_salary:,.0f}")
    print(f"Based on {state_count} tech workers in {highest_paying_state}")

💻 QUESTION 2: Tech Worker Salaries by State Analysis

📊 Tech Worker Data:
Total tech workers: 5,071
Tech workers with valid salary and state data: 5,042

💰 TECH WORKER SALARY BY STATE (Top 15):
State                Count    Avg Salary   Median       Std Dev     
--------------------------------------------------------------------------------
California           779      $129,805     $92,000     $154,028
Washington           347      $120,796     $85,000     $134,259
New York             526      $111,528     $90,000     $90,746
District of Columbia 165      $104,777     $93,000     $125,791
Connecticut          44       $102,410     $87,250     $76,827
Colorado             140      $101,116     $89,500     $83,529
Massachusetts        353      $100,532     $83,000     $107,106
Wisconsin            83       $99,381     $81,774     $106,065
North Carolina       124      $99,350     $80,744     $147,265
Florida              125      $98,281     $70,000     $230,525
New Jersey           8

In [9]:
# Question 3: How much does salary increase on average for each year of experience in tech?

print("📈 QUESTION 3: Experience vs Salary Analysis for Tech Workers")
print("=" * 70)

# Check if we have experience data cleaned
if 'experience_years' not in df_us.columns:
    print("⚠️ Need to convert experience ranges to numeric values...")
    
    # Quick experience conversion from raw data
    exp_col = 'How many years of professional work experience do you have overall?'
    
    # Define experience mapping (using midpoints of ranges)
    experience_mapping = {
        '1 year or less': 0.5,
        '2 - 4 years': 3,
        '5-7 years': 6,
        '8 - 10 years': 9,
        '11 - 20 years': 15.5,
        '21 - 30 years': 25.5,
        '31 - 40 years': 35.5,
        '41 years or more': 45
    }
    
    df_us['experience_years'] = df_us[exp_col].replace(experience_mapping)
    print(f"✅ Converted experience ranges to numeric values")

# Filter for tech workers with valid salary and experience data
tech_workers_with_exp = df_us[
    (df_us['is_tech_worker'] == True) & 
    (df_us['salary_usd'].notna()) & 
    (df_us['experience_years'].notna())
].copy()

print(f"\n📊 Tech Worker Experience Data:")
print(f"Tech workers with valid salary and experience: {len(tech_workers_with_exp):,}")

if len(tech_workers_with_exp) == 0:
    print("❌ No tech workers with valid salary and experience data found!")
    print("Let's check what we have...")
    print(f"Tech workers: {df_us['is_tech_worker'].sum():,}")
    print(f"Records with salary data: {df_us['salary_usd'].notna().sum():,}")
    print(f"Records with experience data: {df_us['experience_years'].notna().sum():,}")
else:
    # Basic statistics
    print(f"\n📈 EXPERIENCE STATISTICS:")
    print(f"Experience range: {tech_workers_with_exp['experience_years'].min():.1f} - {tech_workers_with_exp['experience_years'].max():.1f} years")
    print(f"Average experience: {tech_workers_with_exp['experience_years'].mean():.1f} years")
    print(f"Median experience: {tech_workers_with_exp['experience_years'].median():.1f} years")
    
    # Calculate correlation
    correlation = tech_workers_with_exp['experience_years'].corr(tech_workers_with_exp['salary_usd'])
    print(f"Correlation between experience and salary: {correlation:.3f}")
    
    # Linear regression analysis
    from scipy import stats
    
    # Perform linear regression
    slope, intercept, r_value, p_value, std_err = stats.linregress(
        tech_workers_with_exp['experience_years'], 
        tech_workers_with_exp['salary_usd']
    )
    
    print(f"\n🔬 LINEAR REGRESSION ANALYSIS:")
    print(f"Slope (salary increase per year): ${slope:,.0f}")
    print(f"Intercept (starting salary at 0 years): ${intercept:,.0f}")
    print(f"R-squared: {r_value**2:.3f}")
    print(f"P-value: {p_value:.2e}")
    print(f"Standard error: ${std_err:,.0f}")
    
    # Calculate salary increase per year
    salary_increase_per_year = slope

    print(f"\n🎯 ANSWER TO QUESTION 3:")
    print(f"On average, salary increases by ${salary_increase_per_year:,.0f} per year of experience in tech")
    print(f"This is based on {len(tech_workers_with_exp):,} tech workers")
    
    # Additional insights
    print(f"\n💡 ADDITIONAL INSIGHTS:")
    print(f"• R-squared of {r_value**2:.3f} means experience explains {r_value**2*100:.1f}% of salary variation")
    print(f"• Starting salary (0 years experience): ${intercept:,.0f}")
    print(f"• Expected salary at 5 years: ${intercept + slope*5:,.0f}")
    print(f"• Expected salary at 10 years: ${intercept + slope*10:,.0f}")
    print(f"• Expected salary at 20 years: ${intercept + slope*20:,.0f}")
    
    # Experience level analysis
    print(f"\n📊 SALARY BY EXPERIENCE LEVEL:")
    exp_levels = tech_workers_with_exp.groupby('experience_years')['salary_usd'].agg([
        'count', 'mean', 'median', 'std'
    ]).round(0)
    
    # Filter for experience levels with at least 3 people
    exp_levels = exp_levels[exp_levels['count'] >= 3]
    
    print(f"{'Experience':<12} {'Count':<8} {'Avg Salary':<12} {'Median':<12} {'Std Dev':<12}")
    print("-" * 60)
    for exp_years, row in exp_levels.iterrows():
        print(f"{exp_years:<12.1f} {int(row['count']):<8} ${row['mean']:,.0f}     ${row['median']:,.0f}     ${row['std']:,.0f}")
    
    # Calculate actual vs predicted salaries for key experience levels
    print(f"\n🔍 ACTUAL VS PREDICTED SALARIES:")
    key_experience_levels = [0.5, 3, 6, 9, 15.5, 25.5]
    print(f"{'Experience':<12} {'Predicted':<12} {'Actual Avg':<12} {'Difference':<12}")
    print("-" * 50)

    for exp in key_experience_levels:
        predicted = intercept + slope * exp
        actual_data = tech_workers_with_exp[tech_workers_with_exp['experience_years'] == exp]
        if len(actual_data) > 0:
            actual_avg = actual_data['salary_usd'].mean()
            difference = actual_avg - predicted
            print(f"{exp:<12.1f} ${predicted:,.0f}     ${actual_avg:,.0f}     ${difference:,.0f}")
        else:
            print(f"{exp:<12.1f} ${predicted:,.0f}     N/A         N/A")

📈 QUESTION 3: Experience vs Salary Analysis for Tech Workers

📊 Tech Worker Experience Data:
Tech workers with valid salary and experience: 5,064

📈 EXPERIENCE STATISTICS:
Experience range: 0.5 - 45.0 years
Average experience: 13.5 years
Median experience: 15.5 years
Correlation between experience and salary: 0.052

🔬 LINEAR REGRESSION ANALYSIS:
Slope (salary increase per year): $682
Intercept (starting salary at 0 years): $91,273
R-squared: 0.003
P-value: 1.98e-04
Standard error: $183

🎯 ANSWER TO QUESTION 3:
On average, salary increases by $682 per year of experience in tech
This is based on 5,064 tech workers

💡 ADDITIONAL INSIGHTS:
• R-squared of 0.003 means experience explains 0.3% of salary variation
• Starting salary (0 years experience): $91,273
• Expected salary at 5 years: $94,685
• Expected salary at 10 years: $98,097
• Expected salary at 20 years: $104,921

📊 SALARY BY EXPERIENCE LEVEL:
Experience   Count    Avg Salary   Median       Std Dev     
---------------------------

In [15]:

# Question 4: What percentage of respondents work remotely vs. in-office?

print("🏠 QUESTION 4: Remote Work vs In-Office Analysis")
print("=" * 60)

# First, let's check what columns we have that might indicate remote work
print("🔍 Checking for remote work indicators in the dataset...")

# Look for columns that might contain remote work information
potential_remote_columns = []
for col in df_us.columns:
    if any(keyword in col.lower() for keyword in ['remote', 'location', 'work', 'office', 'city']):
        potential_remote_columns.append(col)

print(f"Potential remote work related columns: {potential_remote_columns}")

# Check the city column for remote work indicators
city_col = 'What city do you work in?'
if city_col in df_us.columns:
    print(f"\n📊 Analyzing city data for remote work patterns...")
    
    # Look for common remote work indicators in city data
    city_data = df_us[city_col].fillna('').astype(str)
    
    # Common remote work indicators
    remote_indicators = [
        'remote', 'work from home', 'wfh', 'virtual', 'online', 'telecommute',
        'home office', 'distributed', 'anywhere', 'flexible location'
    ]

    # Check for remote work patterns
    remote_mask = city_data.str.lower().str.contains('|'.join(remote_indicators), na=False)
    remote_count = remote_mask.sum()
    
    # Also check for empty or generic city entries that might indicate remote work
    empty_city_mask = (city_data == '') | (city_data == 'nan') | (city_data.isin(['N/A', 'n/a', 'None', 'none']))
    empty_city_count = empty_city_mask.sum()
    
    # Check for specific remote work entries
    explicit_remote_mask = city_data.str.lower().str.contains('remote', na=False)
    explicit_remote_count = explicit_remote_mask.sum()
    
    print(f"📈 Remote Work Analysis Results:")
    print(f"Total respondents: {len(df_us):,}")
    print(f"Explicit 'remote' in city: {explicit_remote_count:,}")
    print(f"Any remote indicators: {remote_count:,}")
    print(f"Empty/missing city data: {empty_city_count:,}")
    
    # Create remote work classification
    df_us['work_location_type'] = 'In-Office'
    
    # Mark as remote if explicit remote indicators
    df_us.loc[remote_mask, 'work_location_type'] = 'Remote'
    
    # Mark as 'Unknown' if city data is missing/empty
    df_us.loc[empty_city_mask, 'work_location_type'] = 'Unknown'
    
    # Calculate percentages
    location_counts = df_us['work_location_type'].value_counts()
    location_percentages = df_us['work_location_type'].value_counts(normalize=True) * 100
    
    print(f"\n🏠 WORK LOCATION BREAKDOWN:")
    print("=" * 50)
    for location_type in ['Remote', 'In-Office', 'Unknown']:
        if location_type in location_counts:
            count = location_counts[location_type]
            percentage = location_percentages[location_type]
            print(f"{location_type:<10}: {count:>6,} ({percentage:>5.1f}%)")

    # Answer the question
    remote_percentage = location_percentages.get('Remote', 0)
    in_office_percentage = location_percentages.get('In-Office', 0)
    unknown_percentage = location_percentages.get('Unknown', 0)
    
    print(f"\n🎯 ANSWER TO QUESTION 4:")
    print(f"Remote work: {remote_percentage:.1f}% of respondents")
    print(f"In-office work: {in_office_percentage:.1f}% of respondents")
    print(f"Unknown/Unclear: {unknown_percentage:.1f}% of respondents")

🏠 QUESTION 4: Remote Work vs In-Office Analysis
🔍 Checking for remote work indicators in the dataset...
Potential remote work related columns: ['What industry do you work in?', "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)", 'What country do you work in?', "If you're in the U.S., what state do you work in?", 'What city do you work in?', 'How many years of professional work experience do you have overall?', 'How many years of professional work experience do you have in your field?', 'is_tech_worker', 'work_location_type']

📊 Analyzing city data for remote work patterns...
📈 Remote Work Analysis Results:
Total respondents: 7,089
Explicit 'remote' in city: 96
Any remote indicators: 108
Empty/missing city data: 19

🏠 WORK LOCATION BREAKDOWN:
Remote    :    108 (  1.5%)
In-Office :  6,962 ( 98.2%)
Unknown   : 

In [20]:
# Question 5: Which industry (besides tech) has the highest median salary?

print("🏭 QUESTION 5: Non-Tech Industry Median Salary Analysis")
print("=" * 65)

# Ensure we have tech worker classification
if 'is_tech_worker' not in df_us.columns:
    print("⚠️ Need to identify tech workers from raw data...")
    
    # Quick tech worker identification from raw data
    industry_col = 'What industry do you work in?'
    job_title_col = 'Job title'
    
    # Define tech industry and job patterns
    tech_industries = [
        'Computing or Tech', 'Technology', 'Software', 'IT', 'Information Technology',
        'Computer', 'Tech', 'Digital', 'Cybersecurity', 'Data', 'AI', 'Machine Learning'
    ]
    
    tech_job_patterns = [
        'engineer', 'developer', 'programmer', 'analyst', 'architect', 'consultant',
        'manager', 'director', 'lead', 'specialist', 'technician', 'designer',
        'data scientist', 'data analyst', 'product manager', 'technical', 'systems',
        'network', 'security', 'cloud', 'devops', 'platform', 'infrastructure',
        'database', 'qa', 'quality assurance', 'test', 'automation', 'scrum',
        'agile', 'product owner', 'business analyst', 'technical writer', 'support'
    ]
    
    def is_tech_worker(industry, title):
        if pd.isna(industry) and pd.isna(title):
            return False
        
        # Check industry
        if not pd.isna(industry):
            industry_lower = str(industry).lower()
            for tech_ind in tech_industries:
                if tech_ind.lower() in industry_lower:
                    return True
        
        # Check job title
        if not pd.isna(title):
            title_lower = str(title).lower()
            for pattern in tech_job_patterns:
                if pattern in title_lower:
                    return True
        
        return False
    
    df_us['is_tech_worker'] = df_us.apply(
        lambda row: is_tech_worker(row[industry_col], row[job_title_col]), 
        axis=1
    )
    print(f"✅ Identified tech workers from raw data")

# Filter for non-tech workers with valid salaries
non_tech_with_salary = df_us[
    (df_us['is_tech_worker'] == False) & 
    (df_us['salary_usd'].notna())
].copy()

print(f"\n📊 Non-Tech Industry Data:")
print(f"Total non-tech workers: {(df_us['is_tech_worker'] == False).sum():,}")
print(f"Non-tech workers with valid salaries: {len(non_tech_with_salary):,}")

if len(non_tech_with_salary) == 0:
    print("❌ No non-tech workers with valid salary data found!")
    print("Let's check what we have...")
    print(f"Total records: {len(df_us):,}")
    print(f"Tech workers: {df_us['is_tech_worker'].sum():,}")
    print(f"Non-tech workers: {(df_us['is_tech_worker'] == False).sum():,}")
    print(f"Records with salary data: {df_us['salary_usd'].notna().sum():,}")
else:
    # Calculate median salary by industry for non-tech workers
    industry_salary_stats = non_tech_with_salary.groupby('What industry do you work in?')['salary_usd'].agg([
        'count', 'median', 'mean', 'std', 'min', 'max'
    ]).round(0)

    # Filter industries with at least 10 workers for statistical significance
    industry_salary_stats = industry_salary_stats[industry_salary_stats['count'] >= 10]
    
    # Sort by median salary (descending)
    industry_salary_stats = industry_salary_stats.sort_values('median', ascending=False)
    
    print(f"\n💰 NON-TECH INDUSTRY MEDIAN SALARIES (Top 15):")
    print("=" * 90)
    print(f"{'Industry':<35} {'Count':<8} {'Median':<10} {'Mean':<10} {'Min':<10} {'Max':<10}")
    print("-" * 90)
    
    for industry, row in industry_salary_stats.head(15).iterrows():
        print(f"{industry:<35} {int(row['count']):<8} ${row['median']:,.0f}   ${row['mean']:,.0f}   ${row['min']:,.0f}   ${row['max']:,.0f}")
    
    # Show salary distribution for the top industry
    top_industry_data = non_tech_with_salary[
        non_tech_with_salary['What industry do you work in?'] == highest_median_industry
    ]
    
    print(f"\n📊 DETAILED ANALYSIS: {highest_median_industry}")
    print(f"Sample size: {len(top_industry_data):,} workers")
    print(f"Median salary: ${top_industry_data['salary_usd'].median():,.0f}")
    print(f"Mean salary: ${top_industry_data['salary_usd'].mean():,.0f}")
    print(f"25th percentile: ${top_industry_data['salary_usd'].quantile(0.25):,.0f}")
    print(f"75th percentile: ${top_industry_data['salary_usd'].quantile(0.75):,.0f}")
    print(f"Salary range: ${top_industry_data['salary_usd'].min():,.0f} - ${top_industry_data['salary_usd'].max():,.0f}")
    
    # Show some job titles from the top industry
    print(f"\n💼 Sample job titles in {highest_median_industry}:")
    job_titles = top_industry_data['Job title'].value_counts().head(10)
    for title, count in job_titles.items():
        print(f"  • {title}: {count} people")
    
    # Show industries with highest and lowest median salaries
    lowest_median_industry = industry_salary_stats.index[-1]
    lowest_median_salary = industry_salary_stats.iloc[-1]['median']
    
    print(f"• Lowest median salary: {lowest_median_industry} (${lowest_median_salary:,.0f})")
    print(f"• Salary gap between highest and lowest: ${highest_median_salary - lowest_median_salary:,.0f}")

    # Get the industry with highest median salary
    highest_median_industry = industry_salary_stats.index[0]
    highest_median_salary = industry_salary_stats.iloc[0]['median']
    industry_count = int(industry_salary_stats.iloc[0]['count'])
    
    print(f"\n🎯 ANSWER TO QUESTION 5:")
    print(f"The non-tech industry with the highest median salary is: {highest_median_industry}")
    print(f"Median salary: ${highest_median_salary:,.0f}")
    print(f"Based on {industry_count} workers in {highest_median_industry}")


🏭 QUESTION 5: Non-Tech Industry Median Salary Analysis

📊 Non-Tech Industry Data:
Total non-tech workers: 2,018
Non-tech workers with valid salaries: 2,012

💰 NON-TECH INDUSTRY MEDIAN SALARIES (Top 15):
Industry                            Count    Median     Mean       Min        Max       
------------------------------------------------------------------------------------------
Sales                               43       $90,000   $102,701   $20,060   $380,000
Law                                 290      $76,103   $116,220   $20,088   $1,200,000
Business or Consulting              68       $76,000   $115,854   $20,116   $1,300,000
Government and Public Administration 123      $73,648   $89,115   $20,060   $1,334,782
Health care                         264      $70,860   $93,261   $20,075   $1,900,000
Marketing, Advertising & PR         103      $70,000   $77,159   $20,175   $249,000
Accounting, Banking & Finance       243      $65,060   $90,643   $20,070   $510,000
Transport or Logi

In [None]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?

In [21]:
# Bonus Question 6: What's the salary gap between men and women in similar roles?

print("👥 BONUS QUESTION 6: Gender Salary Gap Analysis")
print("=" * 60)

# Check if we have gender data cleaned
if 'gender_clean' not in df_us.columns:
    print("⚠️ Need to clean gender data...")
    
    # Quick gender cleaning
    gender_col = 'What is your gender?'
    gender_mapping = {
        'Man': 'Man',
        'Woman': 'Woman',
        'Non-binary': 'Non-binary',
        'Another option not listed here or prefer not to answer': 'Other/Prefer not to answer',
        'Prefer not to answer': 'Other/Prefer not to answer'
    }
    
    df_us['gender_clean'] = df_us[gender_col].replace(gender_mapping)
    print(f"✅ Cleaned gender data")

# Filter for men and women with valid salaries
gender_salary_data = df_us[
    (df_us['gender_clean'].isin(['Man', 'Woman'])) & 
    (df_us['salary_usd'].notna())
].copy()

print(f"\n📊 Gender Salary Data:")
print(f"Men with valid salaries: {len(gender_salary_data[gender_salary_data['gender_clean'] == 'Man']):,}")
print(f"Women with valid salaries: {len(gender_salary_data[gender_salary_data['gender_clean'] == 'Woman']):,}")

if len(gender_salary_data) == 0:
    print("❌ No gender salary data found!")
else:
    # Calculate overall gender salary gap
    men_salaries = gender_salary_data[gender_salary_data['gender_clean'] == 'Man']['salary_usd']
    women_salaries = gender_salary_data[gender_salary_data['gender_clean'] == 'Woman']['salary_usd']
    
    men_median = men_salaries.median()
    women_median = women_salaries.median()
    men_mean = men_salaries.mean()
    women_mean = women_salaries.mean()
    
    # Calculate gaps
    median_gap = men_median - women_median
    mean_gap = men_mean - women_mean
    median_gap_pct = (median_gap / women_median) * 100
    mean_gap_pct = (mean_gap / women_mean) * 100
    
    print(f"\n💰 OVERALL GENDER SALARY GAP:")
    print(f"Men median salary: ${men_median:,.0f}")
    print(f"Women median salary: ${women_median:,.0f}")
    print(f"Median gap: ${median_gap:,.0f} ({median_gap_pct:.1f}%)")
    print(f"Men mean salary: ${men_mean:,.0f}")
    print(f"Women mean salary: ${women_mean:,.0f}")
    print(f"Mean gap: ${mean_gap:,.0f} ({mean_gap_pct:.1f}%)")

    # Statistical test
    from scipy import stats
    t_stat, p_value = stats.ttest_ind(men_salaries, women_salaries)
    
    print(f"\n🔬 STATISTICAL TEST:")
    print(f"T-test p-value: {p_value:.2e}")
    if p_value < 0.05:
        print(f"Significant difference between men and women salaries (p < 0.05)")
    else:
        print(f"No significant difference between men and women salaries (p ≥ 0.05)")
    
    # Analysis by tech vs non-tech
    if 'is_tech_worker' in df_us.columns:
        print(f"\n💻 GENDER GAP BY SECTOR:")
        
        # Tech workers
        tech_men = gender_salary_data[
            (gender_salary_data['gender_clean'] == 'Man') & 
            (gender_salary_data['is_tech_worker'] == True)
        ]['salary_usd']
        tech_women = gender_salary_data[
            (gender_salary_data['gender_clean'] == 'Woman') & 
            (gender_salary_data['is_tech_worker'] == True)
        ]['salary_usd']
        
        if len(tech_men) > 0 and len(tech_women) > 0:
            tech_men_median = tech_men.median()
            tech_women_median = tech_women.median()
            tech_gap = tech_men_median - tech_women_median
            tech_gap_pct = (tech_gap / tech_women_median) * 100
            
            print(f"Tech - Men median: ${tech_men_median:,.0f}, Women median: ${tech_women_median:,.0f}")
            print(f"Tech gap: ${tech_gap:,.0f} ({tech_gap_pct:.1f}%)")

        # Non-tech workers
        non_tech_men = gender_salary_data[
            (gender_salary_data['gender_clean'] == 'Man') & 
            (gender_salary_data['is_tech_worker'] == False)
        ]['salary_usd']
        non_tech_women = gender_salary_data[
            (gender_salary_data['gender_clean'] == 'Woman') & 
            (gender_salary_data['is_tech_worker'] == False)
        ]['salary_usd']
        
        if len(non_tech_men) > 0 and len(non_tech_women) > 0:
            non_tech_men_median = non_tech_men.median()
            non_tech_women_median = non_tech_women.median()
            non_tech_gap = non_tech_men_median - non_tech_women_median
            non_tech_gap_pct = (non_tech_gap / non_tech_women_median) * 100
            
            print(f"Non-tech - Men median: ${non_tech_men_median:,.0f}, Women median: ${non_tech_women_median:,.0f}")
            print(f"Non-tech gap: ${non_tech_gap:,.0f} ({non_tech_gap_pct:.1f}%)")
    
    # Analysis by experience level (similar roles)
    if 'experience_years' in df_us.columns:
        print(f"\n📈 GENDER GAP BY EXPERIENCE LEVEL:")
        
        # Group by experience ranges
        gender_salary_data['exp_range'] = pd.cut(
            gender_salary_data['experience_years'], 
            bins=[0, 5, 10, 20, 50], 
            labels=['0-5 years', '5-10 years', '10-20 years', '20+ years']
        )

        for exp_range in ['0-5 years', '5-10 years', '10-20 years', '20+ years']:
            exp_data = gender_salary_data[gender_salary_data['exp_range'] == exp_range]
            if len(exp_data) > 0:
                exp_men = exp_data[exp_data['gender_clean'] == 'Man']['salary_usd']
                exp_women = exp_data[exp_data['gender_clean'] == 'Woman']['salary_usd']
                
                if len(exp_men) > 0 and len(exp_women) > 0:
                    exp_men_median = exp_men.median()
                    exp_women_median = exp_women.median()
                    exp_gap = exp_men_median - exp_women_median
                    exp_gap_pct = (exp_gap / exp_women_median) * 100
                    
                    print(f"{exp_range}: Men ${exp_men_median:,.0f}, Women ${exp_women_median:,.0f}, Gap ${exp_gap:,.0f} ({exp_gap_pct:.1f}%)")
    
    print(f"\n🎯 ANSWER TO BONUS QUESTION 6:")
    print(f"The overall salary gap between men and women is ${median_gap:,.0f} ({median_gap_pct:.1f}%)")
    print(f"Men earn a median of ${men_median:,.0f} vs women's ${women_median:,.0f}")
    if p_value < 0.05:
        print(f"This difference is statistically significant (p < 0.05)")
    else:
        print(f"This difference is not statistically significant (p ≥ 0.05)")

👥 BONUS QUESTION 6: Gender Salary Gap Analysis

📊 Gender Salary Data:
Men with valid salaries: 1,709
Women with valid salaries: 5,094

💰 OVERALL GENDER SALARY GAP:
Men median salary: $88,000
Women median salary: $72,272
Median gap: $15,728 (21.8%)
Men mean salary: $123,463
Women mean salary: $88,070
Mean gap: $35,392 (40.2%)

🔬 STATISTICAL TEST:
T-test p-value: 1.04e-31
Significant difference between men and women salaries (p < 0.05)

💻 GENDER GAP BY SECTOR:
Tech - Men median: $94,000, Women median: $77,000
Tech gap: $17,000 (22.1%)
Non-tech - Men median: $65,000, Women median: $65,000
Non-tech gap: $0 (0.0%)

📈 GENDER GAP BY EXPERIENCE LEVEL:
0-5 years: Men $69,626, Women $59,500, Gap $10,126 (17.0%)
5-10 years: Men $81,800, Women $71,000, Gap $10,800 (15.2%)
10-20 years: Men $100,216, Women $80,000, Gap $20,216 (25.3%)
20+ years: Men $90,172, Women $78,000, Gap $12,172 (15.6%)

🎯 ANSWER TO BONUS QUESTION 6:
The overall salary gap between men and women is $15,728 (21.8%)
Men earn a me

In [22]:
# Bonus Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?

print("\n🎓 BONUS QUESTION 7: Education Level Salary Analysis")
print("=" * 65)

# Check if we have education data cleaned
if 'education_clean' not in df_us.columns:
    print("⚠️ Need to clean education data...")
    
    # Quick education cleaning
    edu_col = 'What is your highest level of education completed?'
    education_mapping = {
        'High School': 'High School',
        'Some college': 'Some College',
        'College degree': 'Bachelor\'s Degree',
        'Bachelor\'s degree': 'Bachelor\'s Degree',
        'Master\'s degree': 'Master\'s Degree',
        'PhD': 'PhD',
        'Professional degree (MD, JD, etc.)': 'Professional Degree',
        'Some high school': 'High School',
        'Trade school': 'Trade School',
        'Associate degree': 'Associate Degree'
    }
    
    df_us['education_clean'] = df_us[edu_col].replace(education_mapping)
    print(f"✅ Cleaned education data")

# Filter for Bachelor's and Master's degree holders with valid salaries
education_salary_data = df_us[
    (df_us['education_clean'].isin(['Bachelor\'s Degree', 'Master\'s Degree'])) & 
    (df_us['salary_usd'].notna())
].copy()

print(f"\n📊 Education Salary Data:")
bachelors_count = len(education_salary_data[education_salary_data['education_clean'] == 'Bachelor\'s Degree'])
masters_count = len(education_salary_data[education_salary_data['education_clean'] == 'Master\'s Degree'])
print(f"Bachelor's degree holders with valid salaries: {bachelors_count:,}")
print(f"Master's degree holders with valid salaries: {masters_count:,}")

if bachelors_count == 0 or masters_count == 0:
    print("❌ Insufficient education salary data found!")
else:
    # Calculate salary statistics for each education level
    bachelors_salaries = education_salary_data[education_salary_data['education_clean'] == 'Bachelor\'s Degree']['salary_usd']
    masters_salaries = education_salary_data[education_salary_data['education_clean'] == 'Master\'s Degree']['salary_usd']
    
    bachelors_median = bachelors_salaries.median()
    masters_median = masters_salaries.median()
    bachelors_mean = bachelors_salaries.mean()
    masters_mean = masters_salaries.mean()
    
    # Calculate differences
    median_diff = masters_median - bachelors_median
    mean_diff = masters_mean - bachelors_mean
    median_diff_pct = (median_diff / bachelors_median) * 100
    mean_diff_pct = (mean_diff / bachelors_mean) * 100
    
    print(f"\n💰 EDUCATION LEVEL SALARY COMPARISON:")
    print(f"Bachelor's median salary: ${bachelors_median:,.0f}")
    print(f"Master's median salary: ${masters_median:,.0f}")
    print(f"Median difference: ${median_diff:,.0f} ({median_diff_pct:.1f}%)")
    print(f"Bachelor's mean salary: ${bachelors_mean:,.0f}")
    print(f"Master's mean salary: ${masters_mean:,.0f}")
    print(f"Mean difference: ${mean_diff:,.0f} ({mean_diff_pct:.1f}%)")
    
    # Statistical test
    from scipy import stats
    t_stat, p_value = stats.ttest_ind(masters_salaries, bachelors_salaries)
    
    print(f"\n🔬 STATISTICAL TEST:")
    print(f"T-test p-value: {p_value:.2e}")
    if p_value < 0.05:
        print(f"Significant difference between Master's and Bachelor's salaries (p < 0.05)")
    else:
        print(f"No significant difference between Master's and Bachelor's salaries (p ≥ 0.05)")
    
    # Analysis by tech vs non-tech
    if 'is_tech_worker' in df_us.columns:
        print(f"\n💻 EDUCATION IMPACT BY SECTOR:")
        
        # Tech workers
        tech_bachelors = education_salary_data[
            (education_salary_data['education_clean'] == 'Bachelor\'s Degree') & 
            (education_salary_data['is_tech_worker'] == True)
        ]['salary_usd']
        tech_masters = education_salary_data[
            (education_salary_data['education_clean'] == 'Master\'s Degree') & 
            (education_salary_data['is_tech_worker'] == True)
        ]['salary_usd']
        
        if len(tech_bachelors) > 0 and len(tech_masters) > 0:
            tech_bach_median = tech_bachelors.median()
            tech_mast_median = tech_masters.median()
            tech_diff = tech_mast_median - tech_bach_median
            tech_diff_pct = (tech_diff / tech_bach_median) * 100
            
            print(f"Tech - Bachelor's median: ${tech_bach_median:,.0f}, Master's median: ${tech_mast_median:,.0f}")
            print(f"Tech difference: ${tech_diff:,.0f} ({tech_diff_pct:.1f}%)")
        
        # Non-tech workers
        non_tech_bachelors = education_salary_data[
            (education_salary_data['education_clean'] == 'Bachelor\'s Degree') & 
            (education_salary_data['is_tech_worker'] == False)
        ]['salary_usd']
        non_tech_masters = education_salary_data[
            (education_salary_data['education_clean'] == 'Master\'s Degree') & 
            (education_salary_data['is_tech_worker'] == False)
        ]['salary_usd']
        
        if len(non_tech_bachelors) > 0 and len(non_tech_masters) > 0:
            non_tech_bach_median = non_tech_bachelors.median()
            non_tech_mast_median = non_tech_masters.median()
            non_tech_diff = non_tech_mast_median - non_tech_bach_median
            non_tech_diff_pct = (non_tech_diff / non_tech_bach_median) * 100

            print(f"Non-tech - Bachelor's median: ${non_tech_bach_median:,.0f}, Master's median: ${non_tech_mast_median:,.0f}")
            print(f"Non-tech difference: ${non_tech_diff:,.0f} ({non_tech_diff_pct:.1f}%)")
    
    # Analysis by experience level
    if 'experience_years' in df_us.columns:
        print(f"\n📈 EDUCATION IMPACT BY EXPERIENCE LEVEL:")
        
        # Group by experience ranges
        education_salary_data['exp_range'] = pd.cut(
            education_salary_data['experience_years'], 
            bins=[0, 5, 10, 20, 50], 
            labels=['0-5 years', '5-10 years', '10-20 years', '20+ years']
        )
        
        for exp_range in ['0-5 years', '5-10 years', '10-20 years', '20+ years']:
            exp_data = education_salary_data[education_salary_data['exp_range'] == exp_range]
            if len(exp_data) > 0:
                exp_bachelors = exp_data[exp_data['education_clean'] == 'Bachelor\'s Degree']['salary_usd']
                exp_masters = exp_data[exp_data['education_clean'] == 'Master\'s Degree']['salary_usd']
                
                if len(exp_bachelors) > 0 and len(exp_masters) > 0:
                    exp_bach_median = exp_bachelors.median()
                    exp_mast_median = exp_masters.median()
                    exp_diff = exp_mast_median - exp_bach_median
                    exp_diff_pct = (exp_diff / exp_bach_median) * 100
                    
                    print(f"{exp_range}: Bachelor's ${exp_bach_median:,.0f}, Master's ${exp_mast_median:,.0f}, Diff ${exp_diff:,.0f} ({exp_diff_pct:.1f}%)")
    
    # ROI Analysis (if we had cost data, we'd calculate ROI)
    print(f"\n💡 EDUCATION ROI INSIGHTS:")
    print(f"• Master's degree holders earn ${median_diff:,.0f} more per year on average")
    print(f"• This represents a {median_diff_pct:.1f}% salary premium")
    print(f"• Over a 30-year career, this could amount to ${median_diff * 30:,.0f} in additional earnings")
    
    print(f"\n🎯 ANSWER TO BONUS QUESTION 7:")
    if median_diff > 0:
        print(f"Yes, people with Master's degrees earn significantly more than those with Bachelor's degrees")
        print(f"Master's degree holders earn ${median_diff:,.0f} more per year ({median_diff_pct:.1f}% higher)")
    else:
        print(f"No, people with Master's degrees do not earn significantly more than those with Bachelor's degrees")
        print(f"Bachelor's degree holders actually earn ${abs(median_diff):,.0f} more per year")
    
    if p_value < 0.05:
        print(f"This difference is statistically significant (p < 0.05)")
    else:
        print(f"This difference is not statistically significant (p ≥ 0.05)")


🎓 BONUS QUESTION 7: Education Level Salary Analysis

📊 Education Salary Data:
Bachelor's degree holders with valid salaries: 3,472
Master's degree holders with valid salaries: 2,186

💰 EDUCATION LEVEL SALARY COMPARISON:
Bachelor's median salary: $72,000
Master's median salary: $78,000
Median difference: $6,000 (8.3%)
Bachelor's mean salary: $91,044
Master's mean salary: $98,005
Mean difference: $6,961 (7.6%)

🔬 STATISTICAL TEST:
T-test p-value: 8.03e-03
Significant difference between Master's and Bachelor's salaries (p < 0.05)

💻 EDUCATION IMPACT BY SECTOR:
Tech - Bachelor's median: $79,500, Master's median: $85,000
Tech difference: $5,500 (6.9%)
Non-tech - Bachelor's median: $59,875, Master's median: $67,000
Non-tech difference: $7,125 (11.9%)

📈 EDUCATION IMPACT BY EXPERIENCE LEVEL:
0-5 years: Bachelor's $60,070, Master's $63,850, Diff $3,780 (6.3%)
5-10 years: Bachelor's $70,100, Master's $74,000, Diff $3,900 (5.6%)
10-20 years: Bachelor's $83,000, Master's $84,000, Diff $1,000 (1.

In [24]:
# Bonus Question 8: Which company size (startup, medium, large) pays the most on average?

print("\n🏢 BONUS QUESTION 8: Company Size Salary Analysis")
print("=" * 60)

# Check if we have company size information
print("🔍 Checking for company size data in the dataset...")

# Look for columns that might contain company size information
company_size_columns = []
for col in df_us.columns:
    if any(keyword in col.lower() for keyword in ['company', 'size', 'employees', 'organization', 'firm']):
        company_size_columns.append(col)

print(f"Potential company size related columns: {company_size_columns}")

# Since we don't have explicit company size data, let's try to infer it from other fields
# We could use job titles, industry, or other indicators, but this would be speculative
# Let's check if there are any other fields that might give us company size hints

print(f"\n📊 Available columns that might help infer company size:")
relevant_columns = []
for col in df_us.columns:
    if any(keyword in col.lower() for keyword in ['title', 'role', 'level', 'senior', 'manager', 'director', 'executive']):
        relevant_columns.append(col)

print(f"Job-related columns: {relevant_columns}")

# Let's try to infer company size from job titles and levels
print(f"\n🔍 Attempting to infer company size from job titles...")

# Define job level patterns that might correlate with company size
startup_indicators = [
    'founder', 'co-founder', 'startup', 'entrepreneur', 'owner', 'ceo', 'cto', 'cfo',
    'head of', 'lead', 'senior', 'principal', 'staff', 'architect'
]

large_company_indicators = [
    'manager', 'director', 'vp', 'vice president', 'senior manager', 'senior director',
    'executive', 'chief', 'president', 'analyst', 'specialist', 'coordinator'
]

def infer_company_size_from_title(title):
    if pd.isna(title):
        return 'Unknown'
    
    title_lower = str(title).lower()
    
    # Check for startup indicators
    startup_score = sum(1 for indicator in startup_indicators if indicator in title_lower)
    large_score = sum(1 for indicator in large_company_indicators if indicator in title_lower)
    
    if startup_score > large_score:
        return 'Startup/Small'
    elif large_score > startup_score:
        return 'Large'
    else:
        return 'Medium'

# Apply company size inference
df_us['inferred_company_size'] = df_us['Job title'].apply(infer_company_size_from_title)

# Filter for valid salary data
company_size_salary_data = df_us[
    (df_us['salary_usd'].notna()) & 
    (df_us['inferred_company_size'] != 'Unknown')
].copy()

print(f"\n📊 Inferred Company Size Data:")
size_counts = company_size_salary_data['inferred_company_size'].value_counts()
for size, count in size_counts.items():
    print(f"{size}: {count:,} workers")

if len(company_size_salary_data) == 0:
    print("❌ No company size salary data found!")
    print("\n💡 Note: The dataset doesn't contain explicit company size information.")
    print("We attempted to infer company size from job titles, but this is speculative.")
    print("For accurate company size analysis, we would need explicit company size data.")
else:
    # Calculate salary statistics by inferred company size
    size_salary_stats = company_size_salary_data.groupby('inferred_company_size')['salary_usd'].agg([
        'count', 'median', 'mean', 'std', 'min', 'max'
    ]).round(0)
    
    # Sort by median salary (descending)
    size_salary_stats = size_salary_stats.sort_values('median', ascending=False)
    
    print(f"\n💰 INFERRED COMPANY SIZE SALARY ANALYSIS:")
    print("=" * 80)
    print(f"{'Company Size':<15} {'Count':<8} {'Median':<10} {'Mean':<10} {'Min':<10} {'Max':<10}")
    print("-" * 80)
    
    for size, row in size_salary_stats.iterrows():
        print(f"{size:<15} {int(row['count']):<8} ${row['median']:,.0f}   ${row['mean']:,.0f}   ${row['min']:,.0f}   ${row['max']:,.0f}")
    
    # Get the company size with highest median salary
    highest_paying_size = size_salary_stats.index[0]
    highest_median_salary = size_salary_stats.iloc[0]['median']
    size_count = int(size_salary_stats.iloc[0]['count'])

    # Statistical test
    from scipy import stats
    
    # Get salary data for each company size
    size_groups = []
    size_labels = []
    for size in size_salary_stats.index:
        size_data = company_size_salary_data[company_size_salary_data['inferred_company_size'] == size]['salary_usd']
        if len(size_data) > 0:
            size_groups.append(size_data.values)
            size_labels.append(size)
    
    # Perform ANOVA test if we have multiple groups
    if len(size_groups) > 1:
        f_stat, p_value = stats.f_oneway(*size_groups)
        print(f"\n🔬 STATISTICAL TEST (ANOVA):")
        print(f"F-statistic: {f_stat:.2f}")
        print(f"P-value: {p_value:.2e}")
        if p_value < 0.05:
            print(f"Significant difference between company sizes (p < 0.05)")
        else:
            print(f"No significant difference between company sizes (p ≥ 0.05)")
    
    # Show sample job titles for each company size
    print(f"\n💼 Sample Job Titles by Inferred Company Size:")
    for size in size_salary_stats.index:
        size_data = company_size_salary_data[company_size_salary_data['inferred_company_size'] == size]
        sample_titles = size_data['Job title'].value_counts().head(5)
        print(f"\n{size}:")
        for title, count in sample_titles.items():
            print(f"  • {title}: {count} people")

    print(f"\n🎯 ANSWER TO BONUS QUESTION 8:")
    print(f"Based on inferred company size from job titles: {highest_paying_size} companies pay the most on average")
    print(f"Median salary: ${highest_median_salary:,.0f}")
    print(f"Based on {size_count} workers in {highest_paying_size} companies")


🏢 BONUS QUESTION 8: Company Size Salary Analysis
🔍 Checking for company size data in the dataset...
Potential company size related columns: ['inferred_company_size']

📊 Available columns that might help infer company size:
Job-related columns: ['Job title', 'If your job title needs additional context, please clarify here:', 'What is your highest level of education completed?', 'job_title_clean']

🔍 Attempting to infer company size from job titles...

📊 Inferred Company Size Data:
Medium: 3,796 workers
Large: 2,421 workers
Startup/Small: 859 workers

💰 INFERRED COMPANY SIZE SALARY ANALYSIS:
Company Size    Count    Median     Mean       Min        Max       
--------------------------------------------------------------------------------
Startup/Small   859      $96,670   $123,804   $20,093   $3,000,000
Large           2421     $73,000   $92,640   $20,015   $1,600,000
Medium          3796     $72,000   $93,013   $20,000   $1,950,000

🔬 STATISTICAL TEST (ANOVA):
F-statistic: 31.08
P-val

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $125,000
2. **Highest paying US state for tech:** California
3. **Salary increase per year of experience:** $682 per year
4. **Remote vs office percentage:** 1.5% remote, 98.2% office
5. **Highest paying non-tech industry:** Sales

**Bonus questions**
6. **Salary gap between men and women in similar roles**: $15,728
7. **Master's degrees earn more than Bachelor's degrees**: 8.3% higher
8. **Company size (startup, medium, large) pays the most on average**: Startup/Small

**Key insights:**
- Insight 1: The average salary increase of $682 per year of experience is surprisingly low and linear, suggesting that large salary jumps are less tied to tenure and more to changing companies or obtaining promotions/new roles (which aren't fully captured by the simple regression model). The true value of experience is likely unlocked through mobility rather than long-term loyalty.

- Insight 2: The finding that California is the highest-paying US state for tech, followed by Washington then New York, suggests that high salaries are determined by location and initial cost-of-living adjustments. This highlights that access to the highest baseline compensation requires being geographically anchored in a major tech hub.

- Insight 3: While individuals with Master's degrees earn 8.3% higher median salaries than those with Bachelor's degrees, this difference might not fully justify the cost and time of a graduate degree for some people. The marginal increase suggests that practical experience and job-specific skills often outweigh the pure credential, especially in fast-moving fields like tech, where the salary premium for a Master's degree is less pronounced than in highly regulated fields like medicine or law.

- Side Note: The significant gender pay gap requires deeper investigation

**Challenges faced:**
- Challenge 1 and how you solved it:

    Trouble with Cursor: Goal - Connect the local Cursor code editor to a remote GitHub Codespace environment.

    Challenge: The Codespace connection method that relies on the proprietary GitHub Codespaces extension is blocked, and the SSH workaround is failing because the remote environment is not set up correctly.

    Solution: Don't connect cursor to codespace, Hussam recommended me to clone the repository locally. 
    Thank you Hussam! - I learned that it's because codespaces runs on a server and can be accessed on localhost so it's not stable.

    Side Note: Cursor is having trouble validating my ID, so I contacted support and I'm waiting on them to get back to me; for now I'm on the free version.

- Challenge 2 and how you solved it: 

    File Loading and Pathing:

    Challenge: The initial hurdle was a persistent FileNotFoundError due to the relative path structure (../../Week-02/data/...) in the execution environment.

    Solution: I told Gemini the file I'm currently in which is in the homeworks folder in week 5 as well as the location of the dataset file in the data folder in week 2.

**What you learned about vibe coding:**
- Learning 1: It's very easy to get lazy and get lost, so it's important to debug and make sure the code works and outputs what you intend it to. Cursor is super helpful and I wish I used it sooner. 

- Learning 2: I need to work on my prompt engineering. Specifically - output format, brevity, and inclusion/exclusion of specific details.

- Learning 3: Cursor gave me very detailed information and extra information that I didn't ask it to. Basically need to go back and do a TL;DR check and make sure it's actually answering the questions despite it overloading you with data. 