# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.
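
The 5% rule above can be checked with a small helper. This is only a sketch: the function name and the example numbers are made up for illustration, and the real expected values are not part of this notebook.

```python
def within_tolerance(actual, expected, tolerance=0.05):
    """Return True when actual is within the relative tolerance of expected."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) / abs(expected) <= tolerance

# Hypothetical numbers, not the official answers
print(within_tolerance(102_000, 100_000))  # 2% off
print(within_tolerance(94_000, 100_000))   # 6% off
```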


---
## TODO Plan for the Homework
1. **Load the data**
    - Read the TSV file into a pandas DataFrame
    - Inspect columns and data types
2. **Initial Data Exploration**
    - Check for missing values and obvious data issues
    - Explore salary, job title, location, and experience columns
3. **Data Cleaning**
    - Standardize salary values (remove currency symbols, convert to USD, handle ranges)
    - Clean job titles (normalize "Software Engineer" and tech roles)
    - Clean location data (extract US states, handle international entries)
    - Handle missing or invalid entries
    - Standardize experience and education columns
4. **Answer Core Business Questions**
    - Median salary for Software Engineers in the US
    - US state with highest average tech salary
    - Average salary increase per year of experience in tech
    - Industry (besides tech) with highest median salary
5. **(Optional) Answer Bonus Questions**
    - Salary gap between men and women in tech
    - Salary difference: Master’s vs Bachelor’s degrees
6. **Validate Results**
    - Compare with official/expected results if available
    - Ensure answers are within 5% of expected values
7. **Document Your Process**
    - Add comments and markdown explanations for each step
---

## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [71]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re



In [72]:
# Load the dataset using the absolute path
file_path = "/workspaces/ds-fall-2025-tue/Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv"
df = pd.read_csv(file_path, sep="\t")
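
The plan calls for inspecting columns, data types, and missing values before cleaning. One possible way to do that in a single table is sketched below. The helper name `summarize_columns` and the tiny sample are assumptions for illustration; in the notebook you would call the helper on `df`.

```python
import io

import pandas as pd


def summarize_columns(frame):
    """Return one row per column with its dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })


# Tiny made-up sample standing in for the survey file
sample_tsv = "Job title\tAnnual salary\tCurrency\nEngineer\t55,000\tUSD\nAnalyst\t\tUSD\n"
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
summary = summarize_columns(sample)
print(summary)
```

Columns with many missing values or an unexpected text dtype (such as a salary stored as strings) are the first candidates for cleaning.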

## Step 2: Data Cleaning


In [73]:
# Rename columns to shorter, well-formatted names
df = df.rename(columns={
    'How old are you?': 'age',
    'What industry do you work in?': 'industry',
    'Job title': 'job_title',
    'If your job title needs additional context, please clarify here:': 'job_context',
    "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)": 'salary',
    'How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.': 'additional_comp',
    'Please indicate the currency': 'currency',
    'If "Other," please indicate the currency here: ': 'currency_other',
    'If your income needs additional context, please provide it here:': 'income_context',
    'What country do you work in?': 'country',
    "If you're in the U.S., what state do you work in?": 'state',
    'What city do you work in?': 'city',
    'How many years of professional work experience do you have overall?': 'years_exp_overall',
    'How many years of professional work experience do you have in your field?': 'years_exp_field',
    'What is your highest level of education completed?': 'education',
    'What is your gender?': 'gender',
    'What is your race? (Choose all that apply.)': 'race'
})
df.columns

Index(['Timestamp', 'age', 'industry', 'job_title', 'job_context', 'salary',
       'additional_comp', 'currency', 'currency_other', 'income_context',
       'country', 'state', 'city', 'years_exp_overall', 'years_exp_field',
       'education', 'gender', 'race'],
      dtype='object')

In [74]:
# Strip leading and trailing whitespace from every text (object) column
def clean_obj_col(val):
    if isinstance(val, str):
        return val.strip()
    return val

for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].apply(clean_obj_col)

In [75]:
# Clean the salary column

def parse_salary(val):
    """Convert a raw salary string to a float; return NaN when it cannot be parsed."""
    if pd.isnull(val):
        return np.nan
    val = str(val).replace(",", "").replace("$", "").strip()
    # Handle salary ranges like '50000-60000' or '50,000 - 60,000'
    if "-" in val:
        parts = [p.strip() for p in val.split("-")]
        try:
            nums = [float(p) for p in parts if p.replace(".", "", 1).isdigit()]
        except ValueError:
            return np.nan
        if len(nums) == 2:
            return sum(nums) / 2  # take the midpoint of the range
        if len(nums) == 1:
            return nums[0]
        # No usable number on either side: fall through to the generic cleanup
    # Remove any non-numeric characters left
    val = re.sub(r"[^0-9.]", "", val)
    try:
        return float(val)
    except ValueError:
        return np.nan

# Apply the parser to create a new column
df["salary_clean"] = df["salary"].apply(parse_salary)
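
The parser above strips every non-numeric character as a last resort, so an entry such as `50k-60k` would collapse to `5060`. A stricter variant is sketched below as an alternative, not as the method used in this notebook. The name `parse_amount` and the handling of a `k` suffix are assumptions, and European-style separators such as `55.000` are not handled.

```python
import math
import re

# A number with an optional decimal part, optionally followed by a 'k' suffix
_NUMBER = re.compile(r"(\d+(?:\.\d+)?)\s*(k\b)?", re.IGNORECASE)


def parse_amount(text):
    """Parse '55,000', '$60k', or '50000-60000'; return NaN for anything else."""
    if text is None:
        return math.nan
    cleaned = str(text).replace(",", "").replace("$", "").strip()
    values = [
        float(number) * (1000 if suffix else 1)
        for number, suffix in _NUMBER.findall(cleaned)
    ]
    if len(values) == 1:
        return values[0]
    if len(values) == 2:
        return sum(values) / 2  # midpoint of a range
    return math.nan  # no number, or too many to interpret safely


for raw in ["55,000", "$60k", "50k-60k", "n/a"]:
    print(raw, "->", parse_amount(raw))
```

Returning NaN for ambiguous input keeps bad rows visible in a missing-value count instead of hiding them as implausibly small salaries.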

In [76]:
# Clean additional_comp column
# The raw values follow the same format as salary, so reuse parse_salary
df["additional_comp_clean"] = df["additional_comp"].apply(parse_salary)

In [77]:
# Add total compensation column
# Missing additional compensation counts as 0; a missing salary keeps the total missing
df["total_comp_clean"] = df["salary_clean"] + df["additional_comp_clean"].fillna(0)

In [78]:
# Clean the currency column: standardize and map similar to USD
usd_aliases = [
    'usd', 'us dollars', 'us dollar', 'u.s. dollars', 'u.s. dollar', 'dollars', 'dollar', 'us$', 'u.s.$', 'american dollars', 'american dollar', 'usa dollars', 'usa dollar', 'usdollars', 'usdollar', 'usd (united states dollar)', 'united states dollar', 'united states dollars', 'u.s.d.', 'u.s.d', 'us', 'us currency', 'us (united states dollar)', 'united states', 'usa', 'u.s.', 'u.s', 'american'
]

def clean_currency(val):
    if pd.isnull(val):
        return np.nan
    val = str(val).strip().lower()
    if val in usd_aliases:
        return 'USD'
    # Catch longer variants. A bare 'dollar' substring check is avoided because
    # it would also match 'canadian dollar' or 'australian dollar'.
    if 'usd' in val or 'us dollar' in val or 'u.s.' in val or 'american' in val or 'united states' in val or 'usa' in val:
        return 'USD'
    return val.upper()

# Apply the cleaning function
df['currency_clean'] = df['currency'].apply(clean_currency)

In [79]:
# Clean the currency_other column: move values to currency_clean where needed

def merge_currency_columns(row):
    primary = row['currency_clean']
    other = row['currency_other']
    # If currency is missing or 'Other', fall back to the free-text currency_other value
    if pd.isnull(primary) or primary.lower() in ['other', 'nan', '']:
        if pd.notnull(other) and str(other).strip() != '':
            # Clean the value the same way as the main currency column
            return clean_currency(other)
    return primary

# Apply the merge function to create a unified currency column
df['currency_final'] = df.apply(merge_currency_columns, axis=1)

In [80]:
# Clean the country column: map USA/US/United States variants (including typos and the flag emoji) to 'USA'
usa_aliases = [
    'usa', 'us', 'united states', 'united states of america', 'u.s.', 'u.s', 'u.s.a.', 'u.s.a', 'america',
    'united states america', 'united states, america', 'united states, usa', 'united states (usa)', 'us of a', 'usofa',
    'the united states', 'united states - usa', 'united states (us)', 'united states (america)', 'united states (u.s.)',
    'united states (u.s.a.)', 'united states (u.s.a)', 'united states (u.s)', 'united states (america)', 'united states (usofa)',
    'united states (us of a)', 'united state', 'unites states', 'u nited states', 'united stated', 'united ststes',
    'united sates', 'united staes', 'united stats', 'united stares', 'united stares', 'united states.', 'united states ',
    'united state ', 'unites states ', 'us ', 'usa ', 'america ', 'u.s ', 'u.s. ', 'u.s.a ', 'u.s.a. ', 'unitedstates',
    'unitedstates ', 'unitedstatss', 'uniterstatez', 'uniitedstates', '🇺🇸', 'is', 'isa', 'unitestates', 'unitedstatus',
    'uniyedstates', 'unitedstattes', 'uniyesstates', 'unitedstatws', 'untedstates', 'uniitedstates', 'us>', 'uss',
    'uniteedstates', 'unitedsttes', 'unitedstatues', 'untiedstates', 'uniterstatez', 'unitefstated', 'unitedstatss'
    # Add more as needed
 ]

def normalize_country(text):
    # Lowercase and drop whitespace and common punctuation so variants compare equal
    return re.sub(r"[\s.,\-_]", "", str(text).lower())

# Normalize the aliases once instead of on every call
usa_aliases_normalized = {normalize_country(alias) for alias in usa_aliases}

def clean_country(val):
    if pd.isnull(val):
        return np.nan
    raw = str(val).strip()
    key = normalize_country(raw)
    if key in usa_aliases_normalized:
        return 'USA'
    # Try to catch remaining variants and typos by substring
    if 'unitedstate' in key or 'unitesstates' in key or 'usa' in key or 'america' in key or 'usofa' in key:
        return 'USA'
    # Keep the original spacing for every other country
    return raw.title()

# Apply the cleaning function
df['country_clean'] = df['country'].apply(clean_country)

In [81]:
# Only keep the US rows
us_rows = (df['country_clean'] == 'USA')
df_us = df[us_rows].copy().reset_index(drop=True)
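
The filter above keeps rows by country only, so a US respondent paid in another currency would still be compared as if the amount were in USD. One option, sketched here as an assumption and not applied to the results in this notebook, is to also require `currency_final == 'USD'`. The helper name and the demo rows are made up.

```python
import pandas as pd


def us_usd_rows(frame, country_col="country_clean", currency_col="currency_final"):
    """Keep rows whose country is 'USA' and whose currency is 'USD'."""
    mask = (frame[country_col] == "USA") & (frame[currency_col] == "USD")
    return frame[mask].reset_index(drop=True)


# Made-up rows for illustration
demo = pd.DataFrame({
    "country_clean": ["USA", "USA", "Canada"],
    "currency_final": ["USD", "CAD", "CAD"],
    "salary_clean": [90000.0, 85000.0, 70000.0],
})
kept = us_usd_rows(demo)
print(len(kept))
```

Dropping non-USD rows is simpler than converting them, at the cost of losing a small number of responses.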

In [82]:
# Clean state column: standardize US state names and abbreviations
us_states = {
    'ALABAMA': 'AL', 'ALASKA': 'AK', 'ARIZONA': 'AZ', 'ARKANSAS': 'AR', 'CALIFORNIA': 'CA', 'COLORADO': 'CO',
    'CONNECTICUT': 'CT', 'DELAWARE': 'DE', 'FLORIDA': 'FL', 'GEORGIA': 'GA', 'HAWAII': 'HI', 'IDAHO': 'ID',
    'ILLINOIS': 'IL', 'INDIANA': 'IN', 'IOWA': 'IA', 'KANSAS': 'KS', 'KENTUCKY': 'KY', 'LOUISIANA': 'LA',
    'MAINE': 'ME', 'MARYLAND': 'MD', 'MASSACHUSETTS': 'MA', 'MICHIGAN': 'MI', 'MINNESOTA': 'MN', 'MISSISSIPPI': 'MS',
    'MISSOURI': 'MO', 'MONTANA': 'MT', 'NEBRASKA': 'NE', 'NEVADA': 'NV', 'NEW HAMPSHIRE': 'NH', 'NEW JERSEY': 'NJ',
    'NEW MEXICO': 'NM', 'NEW YORK': 'NY', 'NORTH CAROLINA': 'NC', 'NORTH DAKOTA': 'ND', 'OHIO': 'OH', 'OKLAHOMA': 'OK',
    'OREGON': 'OR', 'PENNSYLVANIA': 'PA', 'RHODE ISLAND': 'RI', 'SOUTH CAROLINA': 'SC', 'SOUTH DAKOTA': 'SD',
    'TENNESSEE': 'TN', 'TEXAS': 'TX', 'UTAH': 'UT', 'VERMONT': 'VT', 'VIRGINIA': 'VA', 'WASHINGTON': 'WA',
    'WEST VIRGINIA': 'WV', 'WISCONSIN': 'WI', 'WYOMING': 'WY', 'DISTRICT OF COLUMBIA': 'DC', 'WASHINGTON DC': 'DC',
    'DC': 'DC', 'D.C.': 'DC', 'D C': 'DC',
}
state_abbrs = set(us_states.values())
# Full state names with spaces and periods removed, so 'NEWYORK' matches 'NEW YORK'
name_to_abbr = {name.replace(' ', '').replace('.', ''): abbr for name, abbr in us_states.items()}
# Common shorthand and variants
state_aliases = {
    'CALIF': 'CA', 'CAL': 'CA', 'CALI': 'CA', 'TEX': 'TX', 'TEXS': 'TX', 'NYC': 'NY', 'FLA': 'FL',
    'ILL': 'IL', 'MASS': 'MA', 'PENN': 'PA',
    # Add more as needed
}

def clean_state(val):
    if pd.isnull(val):
        return np.nan
    # Uppercase and drop spaces and punctuation so variants compare equal
    val = str(val).strip().upper().replace('.', '').replace(',', '').replace('-', '').replace('_', '').replace(' ', '')
    # Try to match state abbreviations
    if val in state_abbrs:
        return val
    # Try to match full state names
    if val in name_to_abbr:
        return name_to_abbr[val]
    # Handle common shorthand; unrecognized values are treated as missing
    return state_aliases.get(val, np.nan)

# Apply the cleaning function
df_us['state_clean'] = df_us['state'].apply(clean_state)

In [83]:
# Clean years_exp_field column: extract a numeric value for analysis
def parse_years_exp_field(val):
    if pd.isnull(val):
        return np.nan
    val = str(val).strip().lower()
    # Handle ranges like '11 - 20 years', '3-5', '21+ years', etc.
    match = re.findall(r'\d+', val)
    if len(match) == 0:
        return np.nan
    if '+' in val:
        # For '21+ years', take the number as is
        return float(match[0])
    if len(match) == 1:
        return float(match[0])
    if len(match) == 2:
        # For ranges, take the average
        return (float(match[0]) + float(match[1])) / 2
    return np.nan

df_us['years_exp_field_clean'] = df_us['years_exp_field'].apply(parse_years_exp_field)

In [84]:
# Clean industry column: map to broader categories
# Optional: exact labels you know appear in the data
DIRECT_MAP = {
    "computing or tech": "Tech",
    "nonprofits": "Nonprofit/NGO",
    "education (higher education)": "Education",
    "education (primary/secondary)": "Education",
    "health care": "Healthcare",
    "accounting, banking & finance": "Finance",
    "engineering or manufacturing": "Manufacturing",
    "government and public administration": "Government/Nonprofit",
    "law": "Legal",
    "marketing, advertising & pr": "Media/Arts/Marketing",
    "media & digital": "Media/Arts/Marketing",
    "insurance": "Insurance",
    "recruitment or hr": "HR/Staffing",
    "retail": "Retail",
    "property or construction": "Construction/Real Estate",
    "utilities & telecommunications": "Energy/Utilities",
    "transport or logistics": "Transportation/Logistics",
    "art & design": "Media/Arts/Marketing",
    "sales": "Retail",
}

def map_industry(val):
    if pd.isnull(val):
        return np.nan
    v_raw = str(val).strip()
    v = v_raw.lower()

    # 0) exact label shortcut (fast path)
    if v in DIRECT_MAP:
        return DIRECT_MAP[v]

    # 1) Tech
    if any(x in v for x in [
        "tech","software","it "," information technology","computer","web","developer","programmer",
        "data ","data-", "data/","data science","data engineer","data analyst","ml","machine learning",
        "ai "," artificial intelligence","cloud","devops","cyber","security engineer","infosec",
        "network","system admin","systems engineer","database","backend","front end","frontend","full stack",
        "saas","platform","product engineer","qa","quality assurance","ui","ux","mobile app","ios","android",
        "help desk","site reliability","sre","embedded","firmware","robotics","automation engineer",
        "mlops","data platform","it support","software engineer","solutions architect","scrum","agile"
    ]):
        return "Tech"

    # 2) Education
    if any(x in v for x in [
        "school","education","higher ed","university","universit","college","academic","teaching","teacher",
        "professor","tutor","k-12","k12","student affairs","registrar","campus","admissions","classroom",
        "curriculum","instructional designer","edtech","education technology"
    ]):
        return "Education"

    # 3) Healthcare
    if any(x in v for x in [
        "health care","healthcare","hospital","medical","medicine","clinic","nurse","rn ","lpn","physician",
        "doctor","dental","mental health","therapy","therapist","counselor","pharmacy","public health",
        "radiology","laboratory","biomedical","home health","caregiver"
    ]):
        return "Healthcare"

    # 4) Finance
    if any(x in v for x in [
        "finance","financial","bank","credit union","brokerage","trading","investment","asset management",
        "wealth management","hedge fund","private equity","venture capital","fintech","accounting",
        "accountant","cpa","audit","bookkeep","payroll","treasury","mortgage","actuar","risk management",
        "compliance","financial analyst","loan officer","underwriter"
    ]):
        return "Finance"

    # 5) Insurance
    if any(x in v for x in [
        "insurance","insurer","underwriting","claims adjuster","broker","actuary"
    ]):
        return "Insurance"

    # 6) Government / Nonprofit
    if any(x in v for x in [
        "government","public sector","municipal","state agency","federal","city","county","civic",
        "public policy","policy","public administration","court","judiciar","legislativ","authority",
        "nonprofit","not for profit","ngo","charity","foundation","community organization"
    ]):
        return "Government/Nonprofit"

    # 7) Legal
    if any(x in v for x in [
        "legal","law firm","attorney","lawyer","paralegal","litigation","counsel","prosecutor","legal aid","compliance counsel"
    ]):
        return "Legal"

    # 8) Manufacturing
    if any(x in v for x in [
        "manufactur","factory","production","plant","assembly","industrial","machinery","fabrication",
        "mill","foundry","tool and die","process engineer","quality engineer","six sigma","lean manufacturing"
    ]):
        return "Manufacturing"

    # 9) Construction / Real Estate
    if any(x in v for x in [
        "construction","contractor","general contractor","building","carpentry","masonry","hvac","plumbing",
        "electrical","roofing","drywall","civil","infrastructure","real estate","realtor","brokerage",
        "property management","leasing","landlord","facilities management","hoa","commercial real estate","cre"
    ]):
        return "Construction/Real Estate"

    # 10) Media / Arts / Marketing
    if any(x in v for x in [
        "media","journalism","news","publishing","writer","editor","film","tv","television","radio",
        "music","entertainment","broadcast","streaming","video production","animation","design","graphic",
        "creative","advertising","adtech","brand","branding","public relations","pr","communications",
        "social media","seo","sem","content marketing","copywriter","art director","photography","ui/ux designer"
    ]):
        return "Media/Arts/Marketing"

    # 11) Transportation / Logistics
    if any(x in v for x in [
        "transport","transportation","trucking","driver","delivery","rideshare","taxi","bus","rail","railroad",
        "metro","subway","airline","aviation","airport","freight","shipping","maritime","ocean","courier",
        "last mile","logistics","warehouse","warehousing","fulfillment","3pl","distribution center","inventory",
        "supply chain","procurement","purchasing","sourcing"
    ]):
        return "Transportation/Logistics"

    # 12) Hospitality / Food
    if any(x in v for x in [
        "hospitality","hotel","lodging","resort","motel","restaurant","cafe","bar","catering","food service",
        "kitchen","chef","housekeeping","host","server","line cook","banquet"
    ]):
        return "Hospitality/Food"

    # 13) Energy / Utilities
    if any(x in v for x in [
        "energy","utility","utilities","electricity","electric power","power","renewable","solar","wind","hydro",
        "geothermal","nuclear","oil","gas","petroleum","pipeline","water","waste","telecom","telecommunications",
        "fiber","broadband","wireless","cellular","isp","carrier"
    ]):
        return "Energy/Utilities"

    # 14) Science / Engineering (non-software)
    if any(x in v for x in [
        "research","r&d","r and d","scientist","physics","chemistry","biology","geology","environmental science",
        "ecology","biostat","epidemiolog","lab technician","field technician","process scientist"
    ]):
        return "Science/Engineering"

    # 15) Consulting
    if "consult" in v or "advisory" in v or "management consulting" in v or "strategy consulting" in v:
        return "Consulting"

    # 16) HR / Staffing
    if any(x in v for x in [
        "human resource","hr "," hr","staffing","recruit","headhunt","talent acquisition","rpo","people ops","hr generalist"
    ]):
        return "HR/Staffing"

    # 17) Customer Support / BPO
    if any(x in v for x in [
        "bpo","call center","contact center","customer support","customer service","csr","support specialist","help desk"
    ]):
        return "Customer Support/BPO"

    # 18) Agriculture / Food Production
    if any(x in v for x in [
        "agriculture","farm","agri","horticulture","livestock","dairy","ranch","fishery","aquaculture","poultry",
        "grain","crop","food processing","packhouse"
    ]):
        return "Agriculture/Food Production"

    # 19) Automotive
    if any(x in v for x in [
        "automotive","auto part","car dealership","dealership","oem","aftermarket","body shop","service advisor"
    ]):
        return "Automotive"

    # 20) Telecommunications (extra catch)
    if any(x in v for x in [
        "telecom","telco","5g","4g","fiber","broadband","isp","carrier","switching","rf engineer","network planner"
    ]):
        return "Telecommunications"

    # 21) Pharma / Biotech
    if any(x in v for x in [
        "pharma","biotech","biopharma","life science","clinical trial","fda","drug","vaccine","gxp","gmp"
    ]):
        return "Pharmaceuticals/Biotech"

    # 22) Aerospace / Defense
    if any(x in v for x in [
        "aerospace","space","satellite","defense","defence","defense contractor","aeronautic","missile"
    ]):
        return "Aerospace/Defense"

    # 23) Beauty / Personal Care
    if any(x in v for x in [
        "beauty","salon","spa","barber","cosmetolog","aesthetic","skincare","skin care","hair care","nail salon","esthetician"
    ]):
        return "Beauty/Personal Care"

    # 24) Printing / Packaging
    if any(x in v for x in [
        "printing","print shop","press","packaging","label","carton","box","prepress","finishing"
    ]):
        return "Printing/Packaging"

    # 25) Textiles / Apparel / Fashion
    if any(x in v for x in [
        "apparel","clothing","garment","textile","fashion","footwear","shoe","merchandiser","pattern maker"
    ]):
        return "Textiles/Apparel/Fashion"

    # 26) Mining / Metals / Chemicals / Materials
    if any(x in v for x in [
        "mining","mineral","metal","steel","aluminum","copper","smelt","quarry","chemical","polymer","plastic",
        "resin","coating","paint","adhesive","materials science"
    ]):
        return "Mining/Chemicals/Materials"

    # 27) Travel / Tourism
    if any(x in v for x in [
        "travel","tourism","tour operator","travel agency","cruise","tour guide","concierge","itinerary"
    ]):
        return "Travel/Tourism"

    # 28) Religion / Faith-Based
    if any(x in v for x in [
        "church","mosque","temple","synagogue","parish","ministry","faith based","faith-based"
    ]):
        return "Religion/Faith-Based"

    # Fallback
    return "Other"


# Apply mapping
df_us['industry_clean'] = df_us['industry'].apply(map_industry)
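
Plain substring checks with short keywords can misfire: `"ai "` appears inside `"retail and wholesale"` and `"ui"` inside `"building"`, so both labels would be tagged as Tech by the function above. A possible alternative is to match whole words only, as sketched below. The helper names and the short keyword list are assumptions for illustration, and stems such as `"manufactur"` would need to be spelled out in full with this approach.

```python
import re


def build_matcher(keywords):
    """Compile keywords into one regex that matches whole words or phrases only."""
    ordered = sorted(keywords, key=len, reverse=True)  # prefer longer phrases
    alternatives = "|".join(re.escape(keyword.strip()) for keyword in ordered)
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")


# Short illustrative list, not the full keyword set used above
TECH = build_matcher(["ai", "ml", "it", "ui", "qa", "software", "data science"])


def is_tech(label):
    """Return True when the label contains a tech keyword as a whole word."""
    return bool(TECH.search(str(label).lower()))


for label in ["AI research", "Retail and wholesale", "Building maintenance", "IT"]:
    print(label, "->", is_tech(label))
```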

In [85]:
# Clean job_title column: standardize common titles
def clean_job_title(val):
    if pd.isnull(val):
        return np.nan
    v = str(val).strip().lower()
    v = re.sub(r'[^a-z0-9\s]', '', v)
    v = re.sub(r'\s+', ' ', v).strip()
    # Tech/Software roles
    if any(x in v for x in [
        'software engineer', 'swe', 'developer', 'programmer', 'dev', 'full stack', 'frontend', 'front end', 'backend', 'back end', 'web developer',
        'data scientist', 'data science', 'data analyst', 'machine learning', 'ml engineer', 'ai engineer', 'mlops', 'data engineer',
        'cloud engineer', 'site reliability', 'sre', 'devops', 'qa engineer', 'quality assurance', 'test engineer', 'automation engineer',
        'it support', 'it specialist', 'systems engineer', 'network engineer', 'security engineer', 'cybersecurity', 'database admin', 'db admin',
        'solutions architect', 'product engineer', 'ui designer', 'ux designer', 'uiux', 'mobile developer', 'ios developer', 'android developer',
        'embedded engineer', 'firmware engineer', 'robotics engineer', 'scrum master', 'agile coach'
    ]):
        return 'Software Engineer' if 'engineer' in v or 'swe' in v or 'software' in v else 'Tech/IT'
    # Management roles
    if any(x in v for x in [
        'manager', 'lead', 'head', 'director', 'vp', 'vice president', 'chief', 'cto', 'cio', 'ceo', 'coo', 'cso', 'cfo', 'president', 'principal', 'supervisor', 'team lead', 'project manager', 'product manager', 'program manager', 'owner'
    ]):
        return 'Manager/Lead'
    # Analyst roles
    if 'analyst' in v:
        return 'Analyst'
    # Scientist roles
    if 'scientist' in v:
        return 'Scientist'
    # Consultant roles
    if 'consultant' in v or 'consulting' in v:
        return 'Consultant'
    # Designer roles
    if 'designer' in v or 'design' in v:
        return 'Designer'
    # Teacher/Education roles
    if any(x in v for x in ['teacher', 'professor', 'instructor', 'educator', 'lecturer', 'tutor', 'adjunct', 'faculty']):
        return 'Teacher/Education'
    # Healthcare roles
    if any(x in v for x in ['nurse', 'rn', 'lpn', 'doctor', 'physician', 'md', 'do', 'dentist', 'dental', 'therapist', 'counselor', 'psychologist', 'psychiatrist', 'pharmacist', 'pharmacy', 'medical', 'surgeon', 'veterinarian', 'vet']):
        return 'Healthcare'
    # Legal roles
    if any(x in v for x in ['attorney', 'lawyer', 'paralegal', 'legal', 'counsel', 'judge', 'prosecutor', 'public defender']):
        return 'Legal'
    # HR/Recruiting roles
    if any(x in v for x in ['hr', 'human resources', 'recruiter', 'talent acquisition', 'staffing', 'people ops']):
        return 'HR/Recruiting'
    # Sales roles
    if any(x in v for x in ['sales', 'account executive', 'account manager', 'business development', 'bdm', 'salesperson', 'sales rep', 'sales representative']):
        return 'Sales'
    # Marketing roles
    if any(x in v for x in ['marketing', 'brand', 'public relations', 'pr', 'communications', 'advertising', 'seo', 'content', 'copywriter', 'social media']):
        return 'Marketing/PR'
    # Customer Support roles
    if any(x in v for x in ['customer support', 'customer service', 'csr', 'call center', 'help desk', 'support specialist']):
        return 'Customer Support'
    # Operations roles
    if any(x in v for x in ['operations', 'ops', 'logistics', 'supply chain', 'warehouse', 'inventory', 'fulfillment']):
        return 'Operations/Logistics'
    # Finance roles
    if any(x in v for x in ['finance', 'financial', 'accountant', 'accounting', 'cpa', 'controller', 'bookkeeper', 'auditor', 'treasury', 'payroll', 'tax', 'investment', 'bank', 'banker', 'loan officer', 'underwriter', 'analyst']):
        return 'Finance/Accounting'
    # Administrative roles
    if any(x in v for x in ['admin', 'administrative', 'executive assistant', 'assistant', 'secretary', 'office manager', 'receptionist', 'clerk']):
        return 'Administrative'
    # Research roles
    if 'research' in v:
        return 'Research'
    # Student roles
    if 'student' in v:
        return 'Student'
    # Intern roles
    if 'intern' in v:
        return 'Intern'
    # Owner/Founder roles
    if any(x in v for x in ['owner', 'founder', 'cofounder', 'entrepreneur']):
        return 'Owner/Founder'
    # Other catch-all
    return v.title() if len(v) < 40 else 'Other'
# Apply cleaning to job_title
df_us['job_title_clean'] = df_us['job_title'].apply(clean_job_title)
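
Because the mapping above labels any matching title that contains "engineer" as `Software Engineer` (for example "QA Engineer"), it helps to look at which raw titles end up in a bucket before trusting a median. The helper below is a sketch with a made-up name and demo data; in the notebook it could be called on `df_us` with `job_title` and `job_title_clean`.

```python
import pandas as pd


def raw_values_for_label(frame, raw_col, clean_col, label, top_n=10):
    """Count the raw values that were mapped to one cleaned label."""
    matched = frame.loc[frame[clean_col] == label, raw_col]
    return matched.str.lower().str.strip().value_counts().head(top_n)


# Made-up rows for illustration
demo = pd.DataFrame({
    "job_title": ["Software Engineer", "software engineer", "QA Engineer", "Teacher"],
    "job_title_clean": ["Software Engineer", "Software Engineer",
                        "Software Engineer", "Teacher/Education"],
})
counts = raw_values_for_label(demo, "job_title", "job_title_clean", "Software Engineer")
print(counts)
```

If unexpected titles dominate the list, the bucket definition is probably too broad for the question being asked.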

In [86]:
# Create a DataFrame with only the columns needed to answer the business questions, with clean names
columns_needed = [
    'salary_clean',
    'additional_comp_clean',
    'total_comp_clean',
    'currency_final',
    'state_clean',
    'years_exp_field_clean',
    'industry_clean',
    'job_title_clean',
    'education',
    'gender',
    'race'
]

# Rename columns to original names (but now all are clean)
rename_map = {
    'salary_clean': 'salary',
    'additional_comp_clean': 'additional_comp',
    'total_comp_clean': 'total_comp',
    'currency_final': 'currency',
    'state_clean': 'state',
    'years_exp_field_clean': 'years_exp_field',
    'industry_clean': 'industry',
    'job_title_clean': 'job_title',
}

df_us_clean = df_us[columns_needed].copy().rename(columns=rename_map)

In [87]:
pd.set_option('display.max_columns', None)
df_us_clean.head()

|   | salary | additional_comp | total_comp | currency | state | years_exp_field | industry | job_title | education | gender | race |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 55000.0 | 0.0 | 55000.0 | USD | MA | 6.0 | Education | Research | Master's degree | Woman | White |
| 1 | 34000.0 | NaN | 34000.0 | USD | TN | 3.0 | Finance | Marketing/PR | College degree | Woman | White |
| 2 | 62000.0 | 3000.0 | 65000.0 | USD | WI | 6.0 | Nonprofit/NGO | Manager/Lead | College degree | Woman | White |
| 3 | 60000.0 | 7000.0 | 67000.0 | USD | SC | 6.0 | Finance | Manager/Lead | College degree | Woman | White |
| 4 | 62000.0 | NaN | 62000.0 | USD | NH | 3.0 | Education | Scholarly Publishing Librarian | Master's degree | Man | White |


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [88]:
# Question 1: What is the median salary for Software Engineers in the United States?

software_engineers = df_us_clean[df_us_clean['job_title'] == 'Software Engineer']

# Median base salary
median_salary = software_engineers['salary'].median()

# Median total compensation (salary + additional comp)
median_total_comp = software_engineers['total_comp'].median()

print(f"Median salary for Software Engineers in the United States: ${median_salary:,.0f}")
print(f"Median total compensation for Software Engineers in the United States: ${median_total_comp:,.0f}")

Median salary for Software Engineers in the United States: $135,000
Median total compensation for Software Engineers in the United States: $142,000


In [89]:
# Question 2: Which US state has the highest average salary for tech workers?

tech_roles = df_us_clean[df_us_clean['job_title'].isin(['Software Engineer', 'Tech/IT'])]

# Average salary
state_salary = tech_roles.groupby('state')['salary'].mean().sort_values(ascending=False)
top_state = state_salary.idxmax()
top_salary = state_salary.max()
print(f"US state with the highest average salary for tech workers: {top_state} (${top_salary:,.0f})")

# Average total compensation
state_total_comp = tech_roles.groupby('state')['total_comp'].mean().sort_values(ascending=False)
top_state_total = state_total_comp.idxmax()
top_total = state_total_comp.max()
print(f"US state with the highest average total compensation for tech workers: {top_state_total} (${top_total:,.0f})")

US state with the highest average salary for tech workers: WA ($155,008)
US state with the highest average total compensation for tech workers: CA ($200,332)
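
An average can be pulled up by a state with only a handful of respondents, so it may help to look at the respondent count next to each mean. The sketch below is one way to run that check. The toy DataFrame, the helper name `top_state_by_mean`, and the `min_n` threshold are all assumptions made for illustration and are not part of the notebook above.

```python
import pandas as pd

def top_state_by_mean(df, value_col="salary", min_n=3):
    """Return the state with the highest mean of value_col among
    states that have at least min_n rows, plus the per-state stats."""
    stats = df.groupby("state")[value_col].agg(mean_value="mean", n="count")
    eligible = stats[stats["n"] >= min_n]
    return eligible["mean_value"].idxmax(), stats

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "state": ["WA", "WA", "WA", "CA", "CA", "CA", "WY"],
    "salary": [150000, 160000, 155000, 140000, 150000, 145000, 300000],
})

best_state, state_stats = top_state_by_mean(toy, min_n=3)
print(best_state)
print(state_stats)
```

In the notebook, the same helper could be called on `tech_roles` with `value_col='salary'` or `value_col='total_comp'`; the right value for `min_n` is a judgment call.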


In [90]:
# Question 3: How much does salary increase on average for each year of experience in tech?

# Filter for tech roles
tech_roles = df_us_clean[df_us_clean['job_title'].isin(['Software Engineer', 'Tech/IT'])]

# Drop rows with missing salary or years of experience
tech_roles_exp = tech_roles.dropna(subset=['salary', 'years_exp_field'])

# Fit a linear regression: salary ~ years_exp_field
from scipy.stats import linregress

slope, intercept, r_value, p_value, std_err = linregress(tech_roles_exp['years_exp_field'], tech_roles_exp['salary'])

print(f"Estimated average salary increase per year of experience in tech: ${slope:,.0f} per year")

Estimated average salary increase per year of experience in tech: $2,097 per year
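
The slope on its own does not say how well a straight line describes the data. As a rough sketch, the other fields returned by `linregress` can be reported alongside it. The helper name `describe_fit` and the toy numbers are assumptions for illustration, and the interval uses a normal approximation that is only reasonable for larger samples.

```python
from scipy.stats import linregress

def describe_fit(years, salaries):
    """Fit salaries ~ years and report the slope, R squared,
    and an approximate 95% interval for the slope."""
    fit = linregress(years, salaries)
    margin = 1.96 * fit.stderr  # normal approximation, assumes a large sample
    return {
        "slope": fit.slope,
        "r_squared": fit.rvalue ** 2,
        "slope_low": fit.slope - margin,
        "slope_high": fit.slope + margin,
    }

# Toy data, invented for illustration only
toy_years = [1, 3, 6, 8.5, 15.5, 25.5]
toy_salaries = [72000, 75000, 83000, 86000, 102000, 120000]

summary = describe_fit(toy_years, toy_salaries)
print(f"Slope: ${summary['slope']:,.0f} per year")
print(f"R squared: {summary['r_squared']:.2f}")
print(f"Approx. 95% interval: ${summary['slope_low']:,.0f} to ${summary['slope_high']:,.0f}")
```

With the notebook's data, the inputs would be `tech_roles_exp['years_exp_field']` and `tech_roles_exp['salary']`. A low R squared would suggest that years of experience explain only a small part of the variation in salary.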


In [91]:
# Question 4: Which industry (besides tech) has the highest median salary?

# Exclude the 'Tech' industry and any missing values
non_tech = df_us_clean[(df_us_clean['industry'] != 'Tech') & df_us_clean['industry'].notna()]

# Group by industry and calculate median salary
industry_medians = non_tech.groupby('industry')['salary'].median().sort_values(ascending=False)

top_industry = industry_medians.idxmax()
top_median_salary = industry_medians.max()

print(f"Industry (besides tech) with the highest median salary: {top_industry} (${top_median_salary:,.0f})")

Industry (besides tech) with the highest median salary: Pharmaceuticals/Biotech ($119,500)


In [92]:
# Question 5: What’s the salary gap between men and women in tech roles?

# Filter for tech roles (Software Engineer and Tech/IT)
tech_roles = df_us_clean[df_us_clean['job_title'].isin(['Software Engineer', 'Tech/IT'])]

# Only keep rows with gender as 'Man' or 'Woman' (case-insensitive, robust)
tech_roles_gender = tech_roles[tech_roles['gender'].str.lower().isin(['man', 'woman'])].copy()

# Calculate median salary for each gender
median_salary_by_gender = tech_roles_gender.groupby(tech_roles_gender['gender'].str.title())['salary'].median()
men_median = median_salary_by_gender.get('Man', float('nan'))
women_median = median_salary_by_gender.get('Woman', float('nan'))

# Calculate the gap (men - women)
gap = men_median - women_median


print(f"Median salary for men in tech roles: ${men_median:,.0f}")
print(f"Median salary for women in tech roles: ${women_median:,.0f}")
print(f"Salary gap (men - women): ${gap:,.0f}")

Median salary for men in tech roles: $132,999
Median salary for women in tech roles: $96,000
Salary gap (men - women): $36,999
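
A dollar gap is easier to compare across groups when it is also expressed as a percentage. The short sketch below does that with the two medians printed above. The helper name `pay_gap_percent` and the choice of the men's median as the reference are assumptions for illustration.

```python
def pay_gap_percent(reference_median, comparison_median):
    """Gap expressed as a percentage of the reference group's median."""
    return (reference_median - comparison_median) / reference_median * 100

# Medians taken from the output above
gap_pct = pay_gap_percent(132_999, 96_000)
print(f"Women's median salary is {gap_pct:.1f}% below men's median salary")
```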


In [93]:
# Question 6: Do people with Master’s degrees earn significantly more than those with Bachelor’s degrees?

# Filter for Bachelor's (College degree) and Master's only
degree_roles = df_us_clean[df_us_clean['education'].isin(['College degree', "Master's degree"])].copy()

# Median base salary by degree
median_salary_by_degree = degree_roles.groupby('education')['salary'].median()
bachelor_salary = median_salary_by_degree.get('College degree', float('nan'))
master_salary = median_salary_by_degree.get("Master's degree", float('nan'))
gap_salary = master_salary - bachelor_salary

# Median total compensation by degree
median_total_by_degree = degree_roles.groupby('education')['total_comp'].median()
bachelor_total = median_total_by_degree.get('College degree', float('nan'))
master_total = median_total_by_degree.get("Master's degree", float('nan'))
gap_total = master_total - bachelor_total

# Print results
print(f"Median salary for Bachelor's degree holders (College degree): ${bachelor_salary:,.0f}")
print(f"Median salary for Master's degree holders: ${master_salary:,.0f}")
print(f"Salary gap (Master's - Bachelor's): ${gap_salary:,.0f}\n")

print(f"Median total compensation for Bachelor's degree holders (College degree): ${bachelor_total:,.0f}")
print(f"Median total compensation for Master's degree holders: ${master_total:,.0f}")
print(f"Total compensation gap (Master's - Bachelor's): ${gap_total:,.0f}")


Median salary for Bachelor's degree holders (College degree): $75,000
Median salary for Master's degree holders: $80,000
Salary gap (Master's - Bachelor's): $5,000

Median total compensation for Bachelor's degree holders (College degree): $78,000
Median total compensation for Master's degree holders: $83,000
Total compensation gap (Master's - Bachelor's): $5,000
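
Comparing two medians shows the size of the gap but not whether it is statistically significant. One possible way to check this, which the notebook above does not do, is a Mann-Whitney U test. It compares two samples without assuming salaries are normally distributed. The helper name `compare_groups`, the `alpha` cutoff, and the toy salaries are assumptions for illustration.

```python
from scipy.stats import mannwhitneyu

def compare_groups(group_a, group_b, alpha=0.05):
    """Two-sided Mann-Whitney U test; returns the p-value and
    whether it falls below the chosen alpha."""
    result = mannwhitneyu(group_a, group_b, alternative="two-sided")
    return float(result.pvalue), bool(result.pvalue < alpha)

# Toy data, invented for illustration only
toy_bachelor = [52000, 60000, 65000, 70000, 75000, 78000, 80000, 85000, 90000, 95000]
toy_master = [58000, 66000, 72000, 79000, 80000, 84000, 88000, 93000, 99000, 110000]

p_value, is_significant = compare_groups(toy_bachelor, toy_master)
print(f"p-value: {p_value:.3f}, significant at 0.05: {is_significant}")
```

With the notebook's data, the two inputs would be the `salary` values of `degree_roles` for each `education` value, with missing values dropped first.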


## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $135,000 (Total Comp: $142,000)  
2. **Highest paying US state for tech:** WA $155,008 (Total Comp: CA $200,332)  
3. **Salary increase per year of experience:** $2,097 per year  
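4. **Highest paying non-tech industry:** Pharmaceuticals/Biotech $119,500  
5. **Gender salary gap in tech roles:** $36,999 (median: men $132,999, women $96,000)  
6. **Master's vs Bachelor's degree:** Master's median is $5,000 higher ($80,000 vs $75,000)  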

**Key insights:**  
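* Software Engineers earn a median base salary of $135,000, well above the $75,000 to $80,000 medians for Bachelor's and Master's degree holders across all US respondents.  
* Washington has the highest average base salary for tech workers ($155,008), while California leads on average total compensation ($200,332).  
* A linear fit puts the average salary increase in tech at about $2,097 for each year of experience.  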

**Challenges faced:**  
* The country column was really messy. I had to run `value_counts` many times to find every value that was close to "US" and map them all to one category. I spent most of my time on this, and I may still have missed a few.  
* Job titles and industries were also messy, so I wrote cleaning functions to group similar ones together.  
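
The country cleanup described above can be sketched roughly as follows. The list of spellings in `US_VARIANTS`, the helper name `normalize_country`, and the toy Series are all hypothetical. In practice the list would be built up by inspecting `value_counts()` on the real column.

```python
import pandas as pd

# Hypothetical list of spellings; the real one would come from value_counts()
US_VARIANTS = {
    "us", "usa", "u.s.", "u.s.a.", "united states",
    "united states of america", "america",
}

def normalize_country(value):
    """Map known US spellings to 'United States'; leave other values as they are."""
    if pd.isna(value):
        return value
    text = str(value).strip()
    key = " ".join(text.lower().split())  # lowercase and collapse repeated spaces
    return "United States" if key in US_VARIANTS else text

# Toy data, invented for illustration only
toy = pd.Series(
    ["USA", " usa", "U.S.", "United  States", "Canada", None, "Unite States"],
    name="country",
)
cleaned = toy.map(normalize_country)

# Re-running value_counts shows which spellings still slip through
print(cleaned.value_counts(dropna=False))
```

Here the misspelling "Unite States" is left untouched on purpose. It shows up in the final `value_counts()` output and can then be added to the list, which mirrors the repeated inspection loop described in the first bullet.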

**What you learned about vibe coding:**  
* This was new to me, but vibe coding helped me spot mistakes faster.  
* I don’t need to copy and paste from ChatGPT anymore because Codespaces makes the cell for me.  
* Coding together made my work quicker and easier.  