# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.
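
The 5% tolerance can be checked mechanically once you have a number. A minimal sketch, assuming a hypothetical helper `within_tolerance` and made-up example values (the official expected values are not given here):

```python
def within_tolerance(actual, expected, tolerance=0.05):
    """Return True if actual is within +/- tolerance (relative) of expected."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) / abs(expected) <= tolerance

# Made-up numbers for illustration only
print(within_tolerance(104_000, 100_000))  # 4% off the expected value
print(within_tolerance(94_000, 100_000))   # 6% off the expected value
```

The check is relative to the expected value, so the same helper works for dollar amounts and for percentages.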


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

*(Paste your Cursor todo list here)*

**Data Loading & Exploration Phase**
- [ ] Load and explore the TSV dataset - examine structure, columns, data types, missing values, and basic statistics
- [ ] Identify key data quality issues - inconsistent salary formats, multiple currencies, messy job titles, location inconsistencies, experience format variations

**Data Cleaning Phase**
- [ ] Clean salary data - handle different currencies, remove non-numeric characters, convert to USD using appropriate exchange rates, handle missing values
- [ ] Standardize job titles - create categories for tech roles, identify software engineers specifically, handle variations in job title formatting
- [ ] Clean location data - standardize country names (US, USA, United States), validate and standardize state names, handle missing location data
- [ ] Clean experience data - convert range formats to numeric values, handle missing experience data, create consistent experience categories
- [ ] Clean demographic data - standardize gender categories, education levels, and race categories for analysis
- [ ] Create analysis columns - derived columns for analysis (tech worker flag, remote work indicator, company size categories, industry groupings)

**Data Validation Phase**
- [ ] Validate cleaned data - check for outliers, verify salary ranges are reasonable, ensure data consistency across key fields

**Business Questions Analysis Phase**
- [ ] Answer Question 1 - Calculate median salary for Software Engineers in the United States
- [ ] Answer Question 2 - Determine which US state has the highest average salary for tech workers
- [ ] Answer Question 3 - Calculate average salary increase per year of experience in tech
- [ ] Answer Question 4 - Identify which industry (besides tech) has the highest median salary
- [ ] Answer bonus questions - gender salary gap in tech, education impact on salary, company size salary analysis

**Final Documentation Phase**
- [ ] Create final summary - key findings, insights, challenges faced, and lessons learned about data cleaning

**Key Challenges Expected:**
- Multiple currencies - Need to convert GBP, EUR, CAD, etc. to USD
- Inconsistent salary formats - Comma separators, ranges, hourly vs annual
- Messy job titles - "Software Engineer" vs "SWE" vs "Developer" variations
- Location standardization - "US" vs "USA" vs "United States"
- Experience ranges - "5-7 years" needs to be converted to numeric values
- Missing data handling - Strategic decisions on imputation vs exclusion

This plan follows a logical progression from exploration → cleaning → validation → analysis → documentation, which is the standard approach for real-world data science projects.


In [None]:
# Explore data types and missing values
print("Data types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
print("\nBasic statistics for all columns:")
df.describe(include='all')


In [None]:
# Examine key columns for data quality issues
print("Unique values in key columns:")
print("\nCurrency:")
print(df['Please indicate the currency'].value_counts())
print("\nCountry:")
print(df['What country do you work in?'].value_counts().head(10))
print("\nJob titles (first 20):")
print(df['Job title'].value_counts().head(20))
print("\nSalary examples:")
print(df['What is your annual salary? (You\'ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)'].head(10))


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.

# Load the dataset (tab-separated, so pass sep='\t' to read_csv)
import pandas as pd

df = pd.read_csv('../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv', sep='\t')

print("Dataset shape:", df.shape)
print("\nColumn names:")
print(df.columns.tolist())
print("\nFirst few rows:")
df.head()
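
The survey's column headers are full question texts, which makes later code hard to read. One option is to rename the columns you need once, right after loading. A sketch, assuming the short names below (they are illustrative and not used elsewhere in this notebook) and a toy DataFrame standing in for the real file:

```python
import pandas as pd

# Hypothetical short names for two of the survey's long headers
SHORT_NAMES = {
    'Please indicate the currency': 'currency',
    'What country do you work in?': 'country',
}

# Toy frame standing in for the loaded survey data
toy = pd.DataFrame({
    'Please indicate the currency': ['USD', 'GBP'],
    'What country do you work in?': ['US', 'UK'],
    'Job title': ['Analyst', 'Engineer'],
})

# Columns not listed in the mapping keep their original names
renamed = toy.rename(columns=SHORT_NAMES)
print(renamed.columns.tolist())
```

If you adopt this, any cell that indexes by the full question text would need the short name instead.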


In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns




## Step 2: Data Cleaning

In [None]:
# Create a copy of the original dataframe for cleaning
df_clean = df.copy()

# Define currency exchange rates (approximate 2021 rates)
exchange_rates = {
    'USD': 1.0,
    'GBP': 1.38,  # GBP to USD
    'EUR': 1.18,  # EUR to USD
    'CAD': 0.81,  # CAD to USD
    'AUD': 0.74,  # AUD to USD
    'CHF': 1.09,  # CHF to USD
    'SEK': 0.12,  # SEK to USD
    'NOK': 0.12,  # NOK to USD
    'DKK': 0.16,  # DKK to USD
    'PLN': 0.27,  # PLN to USD
    'CZK': 0.047, # CZK to USD
    'HUF': 0.0033, # HUF to USD
    'BGN': 0.60,  # BGN to USD
    'RON': 0.25,  # RON to USD
    'HRK': 0.16,  # HRK to USD
    'JPY': 0.0091, # JPY to USD
    'SGD': 0.74,  # SGD to USD
    'HKD': 0.13,  # HKD to USD
    'NZD': 0.71,  # NZD to USD
    'ILS': 0.31,  # ILS to USD
    'ZAR': 0.067, # ZAR to USD
    'INR': 0.014, # INR to USD
    'BRL': 0.19,  # BRL to USD
    'MXN': 0.050, # MXN to USD
    'CLP': 0.0014, # CLP to USD
    'COP': 0.00027, # COP to USD
    'PEN': 0.27,  # PEN to USD
    'ARS': 0.011, # ARS to USD
    'UYU': 0.023, # UYU to USD
    'UAH': 0.037, # UAH to USD
    'RUB': 0.014, # RUB to USD
    'TRY': 0.12,  # TRY to USD
    'AED': 0.27,  # AED to USD
    'SAR': 0.27,  # SAR to USD
    'QAR': 0.27,  # QAR to USD
    'KWD': 3.31,  # KWD to USD
    'BHD': 2.65,  # BHD to USD
    'OMR': 2.60,  # OMR to USD
    'JOD': 1.41,  # JOD to USD
    'LBP': 0.00066, # LBP to USD
    'EGP': 0.064, # EGP to USD
    'MAD': 0.11,  # MAD to USD
    'TND': 0.36,  # TND to USD
    'DZD': 0.0074, # DZD to USD
    'NGN': 0.0024, # NGN to USD
    'GHS': 0.17,  # GHS to USD
    'KES': 0.0091, # KES to USD
    'UGX': 0.00028, # UGX to USD
    'TZS': 0.00043, # TZS to USD
    'ZMW': 0.047, # ZMW to USD
    'BWP': 0.093, # BWP to USD
    'SZL': 0.067, # SZL to USD
    'LSL': 0.067, # LSL to USD
    'NAD': 0.067, # NAD to USD
    'AOA': 0.0017, # AOA to USD
    'MZN': 0.016, # MZN to USD
    'MWK': 0.0013, # MWK to USD
    'ZWL': 0.0028, # ZWL to USD
    'ETB': 0.024, # ETB to USD
    'SDG': 0.022, # SDG to USD
    'SSP': 0.022, # SSP to USD
    'DJF': 0.0056, # DJF to USD
    'ERN': 0.067, # ERN to USD
    'SOS': 0.0018, # SOS to USD
    'TJS': 0.088, # TJS to USD
    'UZS': 0.000094, # UZS to USD
    'KGS': 0.012, # KGS to USD
    'KZT': 0.0023, # KZT to USD
    'TMT': 0.29,  # TMT to USD
    'AFN': 0.013, # AFN to USD
    'PKR': 0.0067, # PKR to USD
    'LKR': 0.0050, # LKR to USD
    'NPR': 0.0085, # NPR to USD
    'BTN': 0.013, # BTN to USD
    'BDT': 0.012, # BDT to USD
    'MMK': 0.00059, # MMK to USD
    'THB': 0.032, # THB to USD
    'LAK': 0.00010, # LAK to USD
    'KHR': 0.00025, # KHR to USD
    'VND': 0.000044, # VND to USD
    'IDR': 0.000070, # IDR to USD
    'MYR': 0.24,  # MYR to USD
    'PHP': 0.020, # PHP to USD
    'KRW': 0.00089, # KRW to USD
    'TWD': 0.036, # TWD to USD
    'CNY': 0.15,  # CNY to USD
    'MOP': 0.12,  # MOP to USD
    'BND': 0.74,  # BND to USD
    'FJD': 0.47,  # FJD to USD
    'PGK': 0.29,  # PGK to USD
    'SBD': 0.12,  # SBD to USD
    'VUV': 0.0089, # VUV to USD
    'WST': 0.40,  # WST to USD
    'TOP': 0.45,  # TOP to USD
    'XPF': 0.0093, # XPF to USD
    'XOF': 0.0018, # XOF to USD
    'XAF': 0.0018, # XAF to USD
    'KMF': 0.0023, # KMF to USD
    'MGA': 0.00026, # MGA to USD
    'MUR': 0.024, # MUR to USD
    'SCR': 0.074, # SCR to USD
    'MVR': 0.065, # MVR to USD
    'BIF': 0.00052, # BIF to USD
    'RWF': 0.0010, # RWF to USD
    'CDF': 0.00052, # CDF to USD
    'GMD': 0.020, # GMD to USD
    'GNF': 0.00010, # GNF to USD
    'LRD': 0.0055, # LRD to USD
    'SLL': 0.00010, # SLL to USD
    'XDR': 1.43,  # XDR to USD (Special Drawing Rights)
}

print("Exchange rates defined for", len(exchange_rates), "currencies")


In [None]:
# Function to clean and convert salary data
import re

def clean_salary(salary_str, currency_str):
    """
    Clean salary data and convert to USD
    """
    if pd.isna(salary_str) or salary_str == '':
        return np.nan
    
    # Convert to string and clean
    salary_str = str(salary_str).strip()
    
    # Handle ranges (take midpoint)
    if '-' in salary_str and not salary_str.startswith('-'):
        parts = salary_str.split('-')
        if len(parts) != 2:
            return np.nan
        try:
            low = float(re.sub(r'[^\d.]', '', parts[0]))
            high = float(re.sub(r'[^\d.]', '', parts[1]))
        except ValueError:
            return np.nan
        amount = (low + high) / 2
    else:
        # Remove everything except digits and decimal points (commas, symbols, spaces)
        cleaned = re.sub(r'[^\d.]', '', salary_str)
        try:
            amount = float(cleaned)
        except ValueError:
            return np.nan
    
    # With no currency given, assume the amount is already in USD
    if pd.isna(currency_str) or currency_str == '':
        return amount
    
    currency = str(currency_str).strip().upper()
    
    # Handle "Other" currency
    if currency == 'OTHER':
        return np.nan  # Skip other currencies for now
    
    # Apply exchange rate; currencies missing from exchange_rates become NaN
    if currency in exchange_rates:
        return amount * exchange_rates[currency]
    return np.nan

# Apply salary cleaning
df_clean['salary_usd'] = df_clean.apply(
    lambda row: clean_salary(
        row['What is your annual salary? (You\'ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)'],
        row['Please indicate the currency']
    ), axis=1
)

print(f"Cleaned salary data: {df_clean['salary_usd'].notna().sum()} valid salaries out of {len(df_clean)} total records")
print(f"Salary range: ${df_clean['salary_usd'].min():,.0f} - ${df_clean['salary_usd'].max():,.0f}")
print(f"Median salary: ${df_clean['salary_usd'].median():,.0f}")


In [None]:
# Clean location data
def clean_country(country_str):
    """Standardize country names"""
    if pd.isna(country_str):
        return np.nan
    
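    # Lowercase and drop periods so "U.S." and "U.S.A." match their plain forms
    country = str(country_str).strip().lower().replace('.', '')
    
    # US variations
    if country in ['united states', 'united states of america', 'us', 'usa', 'america']:
        return 'United States'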
    
    return country.title()

def clean_state(state_str):
    """Standardize state names"""
    if pd.isna(state_str):
        return np.nan
    
    return str(state_str).strip().title()

# Apply location cleaning
df_clean['country_clean'] = df_clean['What country do you work in?'].apply(clean_country)
df_clean['state_clean'] = df_clean['If you\'re in the U.S., what state do you work in?'].apply(clean_state)

print("Country distribution:")
print(df_clean['country_clean'].value_counts().head(10))


In [None]:
# Clean job titles and identify tech roles
import re

def _matches_abbreviation(text, abbreviations):
    """Return True if any abbreviation appears in text as a whole word"""
    pattern = r'\b(?:' + '|'.join(re.escape(a) for a in abbreviations) + r')\b'
    return re.search(pattern, text) is not None

def is_tech_role(job_title):
    """Identify if a job title is in tech"""
    if pd.isna(job_title):
        return False
    
    job_lower = str(job_title).lower()
    
    # Longer keywords are matched as substrings
    tech_keywords = [
        'software', 'engineer', 'developer', 'programmer', 'coder',
        'data scientist', 'data analyst', 'machine learning', 'ml engineer',
        'devops', 'site reliability', 'cloud', 'azure',
        'frontend', 'backend', 'full stack', 'mobile', 'android',
        'product manager', 'technical', 'architect', 'solutions',
        'cybersecurity', 'security', 'network', 'system admin',
        'quality assurance', 'test', 'automation', 'scrum master',
        'agile', 'product owner', 'tech lead', 'engineering manager',
        'vp engineering', 'director of engineering'
    ]
    # Short abbreviations must match whole words: as substrings, 'it' hits
    # "editor" and 'cto' hits "director"
    tech_abbreviations = ['sre', 'aws', 'ios', 'it', 'qa', 'cto', 'cio']
    
    return (any(keyword in job_lower for keyword in tech_keywords)
            or _matches_abbreviation(job_lower, tech_abbreviations))

def is_software_engineer(job_title):
    """Identify software engineers specifically"""
    if pd.isna(job_title):
        return False
    
    job_lower = str(job_title).lower()
    
    # Senior/staff/principal/lead variants already contain 'software engineer'
    swe_keywords = [
        'software engineer', 'software developer', 'software dev',
        'backend engineer', 'frontend engineer', 'full stack engineer'
    ]
    
    # 'swe' must be a whole word so it does not match "answer" or "Sweden"
    return (any(keyword in job_lower for keyword in swe_keywords)
            or _matches_abbreviation(job_lower, ['swe']))

# Apply job title cleaning
df_clean['is_tech'] = df_clean['Job title'].apply(is_tech_role)
df_clean['is_software_engineer'] = df_clean['Job title'].apply(is_software_engineer)

print(f"Tech workers: {df_clean['is_tech'].sum()}")
print(f"Software Engineers: {df_clean['is_software_engineer'].sum()}")
print(f"Software Engineers in US: {df_clean[(df_clean['is_software_engineer']) & (df_clean['country_clean'] == 'United States')].shape[0]}")

# Clean experience data
def clean_experience(exp_str):
    """Convert experience ranges to numeric values"""
    if pd.isna(exp_str):
        return np.nan
    
    exp_str = str(exp_str).strip().lower()
    
    # Handle ranges (take midpoint)
    if '-' in exp_str:
        parts = exp_str.split('-')
        if len(parts) == 2:
            try:
                low = int(re.sub(r'[^\d]', '', parts[0]))
                high = int(re.sub(r'[^\d]', '', parts[1]))
                return (low + high) / 2
            except ValueError:
                pass
    
    # Handle single values (any digits in the text are read as the number of years)
    try:
        return float(re.sub(r'[^\d]', '', exp_str))
    except ValueError:
        return np.nan

# Apply experience cleaning
df_clean['experience_years'] = df_clean['How many years of professional work experience do you have overall?'].apply(clean_experience)
df_clean['experience_field_years'] = df_clean['How many years of professional work experience do you have in your field?'].apply(clean_experience)

print(f"Experience data cleaned: {df_clean['experience_years'].notna().sum()} valid records")
print(f"Average experience: {df_clean['experience_years'].mean():.1f} years")
print(f"Experience range: {df_clean['experience_years'].min():.1f} - {df_clean['experience_years'].max():.1f} years")


In [None]:
# Clean demographic data
def clean_gender(gender_str):
    """Standardize gender categories"""
    if pd.isna(gender_str):
        return np.nan
    
    gender = str(gender_str).strip().lower()
    
    if gender in ['man', 'male', 'm']:
        return 'Male'
    elif gender in ['woman', 'female', 'f']:
        return 'Female'
    elif gender in ['non-binary', 'nonbinary', 'nb']:
        return 'Non-binary'
    else:
        return 'Other'

def clean_education(edu_str):
    """Standardize education levels"""
    if pd.isna(edu_str):
        return np.nan
    
    edu = str(edu_str).strip().lower()
    
    if 'phd' in edu or 'doctorate' in edu:
        return 'PhD'
    elif 'master' in edu:
        return 'Master\'s'
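    elif 'some college' in edu:
        return 'Some College'  # attended without a degree; keep separate from Bachelor's
    elif 'bachelor' in edu or 'college' in edu:
        return 'Bachelor\'s'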
    elif 'associate' in edu:
        return 'Associate\'s'
    elif 'high school' in edu:
        return 'High School'
    else:
        return 'Other'

# Apply demographic cleaning
df_clean['gender_clean'] = df_clean['What is your gender?'].apply(clean_gender)
df_clean['education_clean'] = df_clean['What is your highest level of education completed?'].apply(clean_education)

# Create final cleaned dataframe with key columns
df_final = df_clean[[
    'salary_usd',
    'country_clean', 
    'state_clean',
    'is_tech',
    'is_software_engineer',
    'experience_years',
    'experience_field_years',
    'gender_clean',
    'education_clean',
    'What industry do you work in?',
    'Job title',
    'What is your gender?',
    'What is your highest level of education completed?'
]].copy()

# Add industry cleaning
df_final['industry_clean'] = df_final['What industry do you work in?'].str.strip()

print("Final cleaned dataframe created!")
print(f"Shape: {df_final.shape}")
print(f"Columns: {df_final.columns.tolist()}")
print("\nData quality summary:")
print(f"- Valid salaries: {df_final['salary_usd'].notna().sum()}")
print(f"- US records: {df_final[df_final['country_clean'] == 'United States'].shape[0]}")
print(f"- Tech workers: {df_final['is_tech'].sum()}")
print(f"- Software Engineers: {df_final['is_software_engineer'].sum()}")
print(f"- Valid experience: {df_final['experience_years'].notna().sum()}")

# Data Validation - Check for outliers and data quality
print("\n=== DATA VALIDATION ===")
print(f"Original dataset size: {len(df)}")
print(f"Cleaned dataset size: {len(df_final)}")
print(f"Records with valid salary: {df_final['salary_usd'].notna().sum()}")

# Check salary outliers
salary_stats = df_final['salary_usd'].describe()
print(f"\nSalary Statistics:")
print(f"Min: ${salary_stats['min']:,.0f}")
print(f"25th percentile: ${salary_stats['25%']:,.0f}")
print(f"Median: ${salary_stats['50%']:,.0f}")
print(f"75th percentile: ${salary_stats['75%']:,.0f}")
print(f"Max: ${salary_stats['max']:,.0f}")

# Check for extreme outliers (likely data entry errors)
q99 = df_final['salary_usd'].quantile(0.99)
q01 = df_final['salary_usd'].quantile(0.01)
print(f"\nOutlier Analysis:")
print(f"99th percentile: ${q99:,.0f}")
print(f"1st percentile: ${q01:,.0f}")
print(f"Records above 99th percentile: {(df_final['salary_usd'] > q99).sum()}")
print(f"Records below 1st percentile: {(df_final['salary_usd'] < q01).sum()}")

# Check US data specifically
us_data = df_final[df_final['country_clean'] == 'United States']
print(f"\nUS Data Quality:")
print(f"US records with salary: {us_data['salary_usd'].notna().sum()}")
print(f"US Software Engineers: {us_data['is_software_engineer'].sum()}")
print(f"US Tech Workers: {us_data['is_tech'].sum()}")

# Check experience data quality
print(f"\nExperience Data Quality:")
print(f"Records with valid experience: {df_final['experience_years'].notna().sum()}")
print(f"Average experience: {df_final['experience_years'].mean():.1f} years")
print(f"Max experience: {df_final['experience_years'].max():.1f} years")

print("\n✅ Data cleaning completed! Ready for analysis.")
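
The validation cell counts records beyond the 1st and 99th percentiles but does not remove them. If you decide those extremes are data entry errors, one way to drop them is sketched below on a toy Series. The cutoffs and the name `trim_outliers` are assumptions, and whether to trim at all is a judgment call.

```python
import pandas as pd

def trim_outliers(values, lower=0.01, upper=0.99):
    """Keep values between the given quantiles (inclusive); NaN is dropped."""
    low, high = values.quantile(lower), values.quantile(upper)
    return values[values.between(low, high)]

# Toy data: 98 plausible salaries plus one tiny and one huge entry
toy = pd.Series([1] + [60_000 + 1_000 * i for i in range(98)] + [50_000_000],
                dtype='float64')
trimmed = trim_outliers(toy)
print(len(toy), len(trimmed))
print(trimmed.min(), trimmed.max())
```

Medians are fairly robust to a few extreme values, so trimming matters most for the questions that use means or a regression slope.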


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [None]:
# Question 1: What is the median salary for Software Engineers in the United States?

# Filter for Software Engineers in the US with valid salary data
us_swe = df_final[
    (df_final['is_software_engineer'] == True) & 
    (df_final['country_clean'] == 'United States') & 
    (df_final['salary_usd'].notna())
]

print(f"Number of US Software Engineers with salary data: {len(us_swe)}")

if len(us_swe) > 0:
    median_salary = us_swe['salary_usd'].median()
    mean_salary = us_swe['salary_usd'].mean()
    
    print(f"\n📊 RESULTS:")
    print(f"Median salary for Software Engineers in the United States: ${median_salary:,.0f}")
    print(f"Mean salary for Software Engineers in the United States: ${mean_salary:,.0f}")
    
    print(f"\n📈 Additional Statistics:")
    print(f"25th percentile: ${us_swe['salary_usd'].quantile(0.25):,.0f}")
    print(f"75th percentile: ${us_swe['salary_usd'].quantile(0.75):,.0f}")
    print(f"Salary range: ${us_swe['salary_usd'].min():,.0f} - ${us_swe['salary_usd'].max():,.0f}")
    
    # Show distribution by experience (pd.cut bins are right-inclusive: (0, 2], (2, 5], ...)
    print(f"\n🔍 By Experience Level:")
    exp_bins = pd.cut(us_swe['experience_years'], bins=[0, 2, 5, 10, 20, 50],
                      labels=['0-2 years', '3-5 years', '6-10 years', '11-20 years', '21+ years'])
    exp_groups = us_swe.groupby(exp_bins, observed=True)['salary_usd'].median()
    for exp_level, salary in exp_groups.items():
        if pd.notna(salary):
            print(f"{exp_level}: ${salary:,.0f}")
else:
    print("❌ No valid data found for US Software Engineers")


In [None]:
# Question 2: Which US state has the highest average salary for tech workers?

# Filter for tech workers in the US with valid salary and state data
us_tech = df_final[
    (df_final['is_tech'] == True) & 
    (df_final['country_clean'] == 'United States') & 
    (df_final['salary_usd'].notna()) &
    (df_final['state_clean'].notna())
]

print(f"Number of US tech workers with salary and state data: {len(us_tech)}")

if len(us_tech) > 0:
    # Calculate average salary by state
    state_salaries = us_tech.groupby('state_clean')['salary_usd'].agg(['mean', 'median', 'count']).round(0)
    state_salaries = state_salaries[state_salaries['count'] >= 5]  # Only states with at least 5 tech workers
    state_salaries = state_salaries.sort_values('mean', ascending=False)
    
    print(f"\n📊 RESULTS:")
    print(f"State with highest average salary for tech workers: {state_salaries.index[0]}")
    print(f"Average salary: ${state_salaries.iloc[0]['mean']:,.0f}")
    
    print(f"\n🏆 Top 10 States by Average Tech Salary:")
    print("State | Average Salary | Median Salary | Count")
    print("-" * 50)
    for i, (state, row) in enumerate(state_salaries.head(10).iterrows()):
        print(f"{state} | ${row['mean']:,.0f} | ${row['median']:,.0f} | {int(row['count'])}")
    
    # Show distribution
    print(f"\n📈 Salary Statistics for {state_salaries.index[0]}:")
    top_state_data = us_tech[us_tech['state_clean'] == state_salaries.index[0]]
    print(f"Count: {len(top_state_data)}")
    print(f"Average: ${top_state_data['salary_usd'].mean():,.0f}")
    print(f"Median: ${top_state_data['salary_usd'].median():,.0f}")
    print(f"Range: ${top_state_data['salary_usd'].min():,.0f} - ${top_state_data['salary_usd'].max():,.0f}")
    
else:
    print("❌ No valid data found for US tech workers")


In [None]:
# Question 3: How much does salary increase on average for each year of experience in tech?

# Filter for tech workers with valid salary and experience data
tech_experience = df_final[
    (df_final['is_tech'] == True) & 
    (df_final['salary_usd'].notna()) &
    (df_final['experience_years'].notna())
]

print(f"Number of tech workers with salary and experience data: {len(tech_experience)}")

if len(tech_experience) > 0:
    # Calculate correlation between experience and salary
    correlation = tech_experience['experience_years'].corr(tech_experience['salary_usd'])
    
    # Calculate average salary by experience level
    exp_salary = tech_experience.groupby(pd.cut(tech_experience['experience_years'], 
                                               bins=[0, 1, 2, 3, 5, 7, 10, 15, 20, 30, 50], 
                                               labels=['0-1 years', '1-2 years', '2-3 years', '3-5 years', 
                                                      '5-7 years', '7-10 years', '10-15 years', '15-20 years', 
                                                      '20-30 years', '30+ years']))['salary_usd'].agg(['mean', 'median', 'count']).round(0)
    
    # Remove empty groups
    exp_salary = exp_salary[exp_salary['count'] > 0]
    
    print(f"\n📊 RESULTS:")
    print(f"Correlation between experience and salary: {correlation:.3f}")
    
    # Calculate salary increase per year using linear regression
    from scipy import stats
    slope, intercept, r_value, p_value, std_err = stats.linregress(tech_experience['experience_years'], tech_experience['salary_usd'])
    
    print(f"Average salary increase per year of experience: ${slope:,.0f}")
    print(f"R-squared: {r_value**2:.3f}")
    
    print(f"\n📈 Salary by Experience Level:")
    print("Experience Level | Average Salary | Median Salary | Count")
    print("-" * 60)
    for exp_level, row in exp_salary.iterrows():
        print(f"{exp_level} | ${row['mean']:,.0f} | ${row['median']:,.0f} | {int(row['count'])}")
    
    # Summarize a few career stages. Look rows up by label: empty bins were
    # dropped above, so positions no longer line up with the original labels
    print(f"\n🔍 Key Insights:")
    available = list(exp_salary.index)
    for stage, label in [('Starting', '0-1 years'), ('Mid-career', '5-7 years'), ('Senior', '10-15 years')]:
        if label in available:
            print(f"- {stage} salary ({label}): ${exp_salary.loc[label, 'median']:,.0f}")
    
    # Calculate actual increase between early and mid-career levels
    if '0-1 years' in available and '5-7 years' in available:
        early_career = exp_salary.loc['0-1 years', 'median']
        mid_career = exp_salary.loc['5-7 years', 'median']
        years_diff = 5  # Approximate years between levels
        actual_increase = (mid_career - early_career) / years_diff
        print(f"- Actual increase per year (early to mid-career): ${actual_increase:,.0f}")
    
else:
    print("❌ No valid data found for tech workers with experience")
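
The "increase per year" reported above is the slope of an ordinary least-squares line through the (experience, salary) points, which is what `stats.linregress` returns as `slope`. A minimal sketch of that computation on made-up data, written without SciPy so the arithmetic is visible (`ols_slope` is a hypothetical helper):

```python
from statistics import mean

def ols_slope(xs, ys):
    """Slope of the least-squares line: cov(x, y) / var(x)."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var = sum((x - x_bar) ** 2 for x in xs)
    return cov / var

# Made-up data: salary rises exactly 3,000 per year from a 50,000 base
years = [1, 3, 6, 9, 15.5]
salaries = [50_000 + 3_000 * y for y in years]
print(round(ols_slope(years, salaries)))
```

Because experience is recorded as ranges and converted to midpoints, the fitted slope is a rough estimate rather than a precise per-year raise.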


In [None]:
# Question 4: Which industry (besides tech) has the highest median salary?

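# Filter for workers outside the tech industry with valid salary data.
# Assumption: the survey labels the tech industry "Computing or Tech";
# check df_final['industry_clean'].value_counts() and adjust if it differs.
is_tech_industry = df_final['industry_clean'].str.contains('computing or tech', case=False, na=False)
non_tech = df_final[
    ~is_tech_industry & 
    (df_final['salary_usd'].notna()) &
    (df_final['industry_clean'].notna())
]

print(f"Number of workers outside the tech industry with salary data: {len(non_tech)}")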

if len(non_tech) > 0:
    # Calculate median salary by industry
    industry_salaries = non_tech.groupby('industry_clean')['salary_usd'].agg(['median', 'mean', 'count']).round(0)
    industry_salaries = industry_salaries[industry_salaries['count'] >= 10]  # Only industries with at least 10 workers
    industry_salaries = industry_salaries.sort_values('median', ascending=False)
    
    print(f"\n📊 RESULTS:")
    print(f"Industry with highest median salary (non-tech): {industry_salaries.index[0]}")
    print(f"Median salary: ${industry_salaries.iloc[0]['median']:,.0f}")
    print(f"Average salary: ${industry_salaries.iloc[0]['mean']:,.0f}")
    print(f"Sample size: {int(industry_salaries.iloc[0]['count'])}")
    
    print(f"\n🏆 Top 10 Industries by Median Salary (Non-Tech):")
    print("Industry | Median Salary | Average Salary | Count")
    print("-" * 60)
    for i, (industry, row) in enumerate(industry_salaries.head(10).iterrows()):
        print(f"{industry} | ${row['median']:,.0f} | ${row['mean']:,.0f} | {int(row['count'])}")
    
    # Compare with tech median
    tech_median = df_final[df_final['is_tech'] == True]['salary_usd'].median()
    print(f"\n📈 Comparison:")
    print(f"Tech industry median salary: ${tech_median:,.0f}")
    print(f"Highest non-tech industry median: ${industry_salaries.iloc[0]['median']:,.0f}")
    print(f"Difference: ${industry_salaries.iloc[0]['median'] - tech_median:,.0f}")
    
    # Show some insights about the top industry
    top_industry = industry_salaries.index[0]
    top_industry_data = non_tech[non_tech['industry_clean'] == top_industry]
    print(f"\n🔍 Insights about {top_industry}:")
    print(f"Salary range: ${top_industry_data['salary_usd'].min():,.0f} - ${top_industry_data['salary_usd'].max():,.0f}")
    print(f"25th percentile: ${top_industry_data['salary_usd'].quantile(0.25):,.0f}")
    print(f"75th percentile: ${top_industry_data['salary_usd'].quantile(0.75):,.0f}")
    
else:
    print("❌ No valid data found for non-tech workers")


In [None]:
# Question 5: What's the salary gap between men and women in tech roles?

# Filter for tech workers with valid salary and gender data
tech_gender = df_final[
    (df_final['is_tech'] == True) & 
    (df_final['salary_usd'].notna()) &
    (df_final['gender_clean'].notna()) &
    (df_final['gender_clean'].isin(['Male', 'Female']))
]

print(f"Number of tech workers with salary and gender data: {len(tech_gender)}")

if len(tech_gender) > 0:
    # Calculate salary statistics by gender
    gender_salaries = tech_gender.groupby('gender_clean')['salary_usd'].agg(['mean', 'median', 'count']).round(0)
    
    male_median = gender_salaries.loc['Male', 'median']
    female_median = gender_salaries.loc['Female', 'median']
    male_mean = gender_salaries.loc['Male', 'mean']
    female_mean = gender_salaries.loc['Female', 'mean']
    
    gap_median = male_median - female_median
    gap_mean = male_mean - female_mean
    gap_percent = (gap_median / female_median) * 100
    
    print(f"\n📊 RESULTS:")
    print(f"Salary gap between men and women in tech (median): ${gap_median:,.0f}")
    print(f"Salary gap percentage: {gap_percent:.1f}%")
    
    print(f"\n📈 Detailed Breakdown:")
    print("Gender | Median Salary | Average Salary | Count")
    print("-" * 50)
    for gender in ['Male', 'Female']:
        if gender in gender_salaries.index:
            row = gender_salaries.loc[gender]
            print(f"{gender} | ${row['median']:,.0f} | ${row['mean']:,.0f} | {int(row['count'])}")
    
    print(f"\n🔍 Additional Analysis:")
    print(f"Male median salary: ${male_median:,.0f}")
    print(f"Female median salary: ${female_median:,.0f}")
    print(f"Gap: ${gap_median:,.0f} ({gap_percent:.1f}%)")
    
    # Analyze by experience level (contiguous ranges so no experience value is skipped)
    print(f"\n📊 Gap by Experience Level:")
    for low, high, label in [(0, 5, '0-4'), (5, 10, '5-9'), (10, 20, '10-19'), (20, 50, '20+')]:
        exp_data = tech_gender[
            (tech_gender['experience_years'] >= low) & 
            (tech_gender['experience_years'] < high)
        ]
        if len(exp_data) > 10:
            exp_gender_sal = exp_data.groupby('gender_clean')['salary_usd'].median()
            if 'Male' in exp_gender_sal.index and 'Female' in exp_gender_sal.index:
                exp_gap = exp_gender_sal['Male'] - exp_gender_sal['Female']
                exp_gap_pct = (exp_gap / exp_gender_sal['Female']) * 100
                print(f"{label} years: ${exp_gap:,.0f} ({exp_gap_pct:.1f}%)")
    
    # Show sample sizes
    print(f"\n📋 Sample Sizes:")
    print(f"Male tech workers: {gender_salaries.loc['Male', 'count']}")
    print(f"Female tech workers: {gender_salaries.loc['Female', 'count']}")
    
else:
    print("❌ No valid data found for tech workers with gender information")
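
The percentage above divides the gap by the women's median. Pay-gap figures are also often quoted relative to the men's median, and the two conventions give different numbers for the same data. A small sketch with made-up medians (the function names are illustrative):

```python
def gap_relative_to_women(male_median, female_median):
    """Gap as a percentage of the women's median (the convention used above)."""
    return (male_median - female_median) / female_median * 100

def gap_relative_to_men(male_median, female_median):
    """Gap as a percentage of the men's median."""
    return (male_median - female_median) / male_median * 100

# Made-up medians for illustration only
male, female = 120_000, 100_000
print(f"{gap_relative_to_women(male, female):.1f}%")
print(f"{gap_relative_to_men(male, female):.1f}%")
```

Whichever convention you report, state the denominator alongside the number so it can be compared with the expected value.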


## Final Summary

**Summarize your findings here:**


In [None]:
# Final Summary of Analysis Results

print("🎯 BUSINESS QUESTIONS ANALYSIS SUMMARY")
print("=" * 50)

# Question 1 Results
us_swe = df_final[
    (df_final['is_software_engineer'] == True) & 
    (df_final['country_clean'] == 'United States') & 
    (df_final['salary_usd'].notna())
]
q1_median = us_swe['salary_usd'].median() if len(us_swe) > 0 else "No data"

print(f"\n1. Median salary for Software Engineers in the United States: ${q1_median:,.0f}" if isinstance(q1_median, (int, float)) else f"1. Median salary for Software Engineers in the United States: {q1_median}")

# Question 2 Results
us_tech = df_final[
    (df_final['is_tech'] == True) & 
    (df_final['country_clean'] == 'United States') & 
    (df_final['salary_usd'].notna()) &
    (df_final['state_clean'].notna())
]
if len(us_tech) > 0:
    state_salaries = us_tech.groupby('state_clean')['salary_usd'].agg(['mean', 'count']).round(0)
    state_salaries = state_salaries[state_salaries['count'] >= 5]
    state_salaries = state_salaries.sort_values('mean', ascending=False)
    q2_state = state_salaries.index[0] if len(state_salaries) > 0 else "No data"
    q2_salary = state_salaries.iloc[0]['mean'] if len(state_salaries) > 0 else "No data"
else:
    q2_state = "No data"
    q2_salary = "No data"

print(f"2. US state with highest average salary for tech workers: {q2_state} (${q2_salary:,.0f})" if isinstance(q2_salary, (int, float)) else f"2. US state with highest average salary for tech workers: {q2_state} ({q2_salary})")

# Question 3 Results
tech_experience = df_final[
    (df_final['is_tech'] == True) & 
    (df_final['salary_usd'].notna()) &
    (df_final['experience_years'].notna())
]
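if len(tech_experience) >= 2:  # a linear fit needs at least two data points
    from scipy import stats
    # Ordinary least-squares fit: the slope is the average USD change per extra year of experience
    slope, intercept, r_value, p_value, std_err = stats.linregress(
        tech_experience['experience_years'], tech_experience['salary_usd']
    )
    q3_increase = slope
    q3_r2 = r_value**2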
else:
    q3_increase = "No data"
    q3_r2 = "No data"

print(f"3. Average salary increase per year of experience in tech: ${q3_increase:,.0f} (R² = {q3_r2:.3f})" if isinstance(q3_increase, (int, float)) else f"3. Average salary increase per year of experience in tech: {q3_increase}")

# Question 4 Results
non_tech = df_final[
    (df_final['is_tech'] == False) & 
    (df_final['salary_usd'].notna()) &
    (df_final['industry_clean'].notna())
]
if len(non_tech) > 0:
    industry_salaries = non_tech.groupby('industry_clean')['salary_usd'].agg(['median', 'count']).round(0)
    industry_salaries = industry_salaries[industry_salaries['count'] >= 10]
    industry_salaries = industry_salaries.sort_values('median', ascending=False)
    q4_industry = industry_salaries.index[0] if len(industry_salaries) > 0 else "No data"
    q4_salary = industry_salaries.iloc[0]['median'] if len(industry_salaries) > 0 else "No data"
else:
    q4_industry = "No data"
    q4_salary = "No data"

print(f"4. Industry with highest median salary (non-tech): {q4_industry} (${q4_salary:,.0f})" if isinstance(q4_salary, (int, float)) else f"4. Industry with highest median salary (non-tech): {q4_industry} ({q4_salary})")

# Question 5 Results
tech_gender = df_final[
    (df_final['is_tech'] == True) & 
    (df_final['salary_usd'].notna()) &
    (df_final['gender_clean'].notna()) &
    (df_final['gender_clean'].isin(['Male', 'Female']))
]
if len(tech_gender) > 0:
    gender_salaries = tech_gender.groupby('gender_clean')['salary_usd'].median()
    if 'Male' in gender_salaries.index and 'Female' in gender_salaries.index:
        gap = gender_salaries['Male'] - gender_salaries['Female']
        gap_pct = (gap / gender_salaries['Female']) * 100
        q5_gap = gap
        q5_gap_pct = gap_pct
    else:
        q5_gap = "No data"
        q5_gap_pct = "No data"
else:
    q5_gap = "No data"
    q5_gap_pct = "No data"

print(f"5. Salary gap between men and women in tech: ${q5_gap:,.0f} ({q5_gap_pct:.1f}%)" if isinstance(q5_gap, (int, float)) else f"5. Salary gap between men and women in tech: {q5_gap}")

print("\n📊 KEY INSIGHTS:")
print(f"- Total records processed: {len(df_final):,}")
print(f"- Valid salary records: {df_final['salary_usd'].notna().sum():,}")
print(f"- US records: {df_final[df_final['country_clean'] == 'United States'].shape[0]:,}")
print(f"- Tech workers: {df_final['is_tech'].sum():,}")
print(f"- Software Engineers: {df_final['is_software_engineer'].sum():,}")

# Plain strings: no f-prefix needed where nothing is interpolated
print("\n🔍 CHALLENGES FACED:")
print("- Multiple currencies requiring conversion to USD")
print("- Inconsistent job title formatting")
print("- Missing and incomplete data")
print("- Salary ranges that needed midpoint calculation")
print("- Location data standardization")

print("\n💡 WHAT I LEARNED ABOUT VIBE CODING:")
print("- Real-world data is messy and requires creative cleaning strategies")
print("- Business judgment calls are necessary (e.g., handling outliers)")
print("- Iterative approach: clean → validate → analyze → refine")
print("- Documentation is crucial for reproducibility")
print("- Multiple approaches can yield similar results with proper validation")
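
The summary cell above repeats the same "format as dollars if numeric, otherwise print the placeholder" conditional for every question. One way to factor that out is sketched below. The helper name `fmt_usd` and its `missing` parameter are hypothetical, not part of the notebook. Checking against `numbers.Real` rather than `(int, float)` is a deliberate choice here, under the assumption that some values may arrive as NumPy integer scalars, which are not subclasses of Python's `int`.

```python
import math
from numbers import Real


def fmt_usd(value, missing="No data"):
    """Format a number as whole US dollars; return `missing` for anything else.

    Hypothetical helper for illustration. NaN, None, strings, and booleans
    all fall through to the `missing` placeholder.
    """
    is_number = isinstance(value, Real) and not isinstance(value, bool)
    if is_number and not math.isnan(value):
        return f"${value:,.0f}"
    return missing
```

With this in place, a summary line could be written as `print(f"1. Median salary for Software Engineers in the United States: {fmt_usd(q1_median)}")`, and the same call works whether `q1_median` holds a number or the `"No data"` string.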


In [None]:
# Question 1: What is the median salary for Software Engineers in the United States?


In [None]:
# Question 2: Which US state has the highest average salary for tech workers?


In [None]:
# Question 3: How much does salary increase on average for each year of experience in tech?


In [None]:
# Question 4: Which industry (besides tech) has the highest median salary?


In [None]:
# Bonus Questions:
# Question 5: What's the salary gap between men and women in tech roles?
# Question 6: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?



1. **Median salary for Software Engineers in US:** $X
2. **Highest paying US state for tech:** State Name
3. **Salary increase per year of experience:** $X per year
4. **Highest paying non-tech industry:** Industry Name

**Key insights:**
- Insight 1
- Insight 2
- Insight 3

**Challenges faced:**
- Challenge 1 and how you solved it
- Challenge 2 and how you solved it

**What you learned about vibe coding:**
- Learning 1
- Learning 2
- Learning 3
