# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.
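
The 5% tolerance can be read as a relative-error check. Below is a minimal sketch of that reading; the helper name `within_tolerance` and the example numbers are made up for illustration, since the official expected values are not given here.

```python
def within_tolerance(actual: float, expected: float, tol: float = 0.05) -> bool:
    """Return True when `actual` is within `tol` (relative) of `expected`."""
    if expected == 0:
        # Relative error is undefined for a zero target; require an exact match
        return actual == 0
    return abs(actual - expected) / abs(expected) <= tol


# Illustrative numbers only, not the official answers
print(within_tolerance(123_000, 120_000))  # 2.5% away from the target
print(within_tolerance(130_000, 120_000))  # about 8.3% away from the target
```

A check like this makes it easy to see whether a change in a cleaning rule moves an answer by more than the allowed margin.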


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

*(Paste your Cursor todo list here)*

- [ ] Example todo item
- [ ] Another example
- [ ] ...


In [None]:
# - Load TSV dataset into a dataframe
# - Profile columns; select fields for salary, currency, job title, country, state, experience, industry
# - Normalize compensation to annual USD; parse ranges/bonuses; handle currencies and invalids
# - Standardize job titles; define rules to identify Software Engineer and tech workers
# - Clean and standardize US locations; extract state; filter US respondents
# - Clean years of experience to numeric; resolve ranges and units
# - Compute median salary for US Software Engineers
# - Compute average salary by US state for tech workers and find highest
# - Estimate average salary increase per year of experience (linear model)
# - Compute median salary by industry excluding tech; identify highest
# - Apply outlier handling and sanity checks; rerun metrics if needed
# - Document assumptions, cleaning rules, and finalize answers

## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Adjust this path if your file is elsewhere
data_path = "../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv"

# 1) Load raw TSV (as strings to avoid premature type coercion)
df = pd.read_csv(data_path, sep="\t", dtype=str, na_filter=True, encoding="utf-8")

# 2) Basic shape and columns
print("Shape:", df.shape)
print("\nColumns:")
print(list(df.columns))

# 3) Peek at a few rows
display(df.head(5))
display(df.sample(5, random_state=0))

# 4) Quick dtypes after an inferred numeric pass on obvious numeric columns (optional view)
# We won't overwrite df; this is just to inspect potential numeric fields.
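def probe_numeric(col):
    # Convert a column only if every non-null value parses as a number
    # (replaces errors="ignore", which is deprecated in recent pandas)
    converted = pd.to_numeric(col, errors="coerce")
    return converted if converted.notna().sum() == col.notna().sum() else col

numeric_probe = df.apply(probe_numeric)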
print("\nInferred dtypes snapshot:")
print(numeric_probe.dtypes.head(20))

# 5) High-level summary (may be wide; use with caution)
# (datetime_is_numeric was removed in pandas 2.0, and every column is a string here anyway)
summary = df.describe(include="all")
display(summary.T.head(25))  # transpose for readability, first 25 rows

# 6) Common fields to spot quickly (names vary in this dataset; these are typical)
candidates = {
    "salary": [c for c in df.columns if "salary" in c.lower()],
    "currency": [c for c in df.columns if "curren" in c.lower()],
    "job_title": [c for c in df.columns if "job" in c.lower() and "title" in c.lower()],
    "industry": [c for c in df.columns if "industry" in c.lower()],
    "country": [c for c in df.columns if "country" in c.lower()],
    "state": [c for c in df.columns if "state" in c.lower()],
    "experience": [c for c in df.columns if "experience" in c.lower()],
}
print("\nLikely columns by keyword:")
for k, v in candidates.items():
    print(f"- {k}: {v}")

# 7) Quick value counts on a few high-signal columns if present
for col in ["Currency", "currency", "Country", "country", "Industry", "industry", "Job Title", "job_title", "Job title"]:
    if col in df.columns:
        print(f"\nTop values for '{col}':")
        display(df[col].value_counts(dropna=False).head(10))

# 8) Missingness overview (percentage of nulls per column)
null_pct = df.isna().mean().sort_values(ascending=False)
print("\nNull percentage by column (top 20):")
print((null_pct.head(20) * 100).round(1).astype(str) + "%")
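
Loading every column as text keeps entries such as `55,000` intact until you decide how to parse them. The sketch below shows the idea on a tiny made-up TSV (the column names and values are illustrative, not taken from the survey file).

```python
import io

import pandas as pd

# Tiny made-up TSV standing in for the survey file
raw = "Job title\tAnnual salary\nEngineer\t55,000\nAnalyst\t62000\n"

# dtype=str keeps "55,000" as text instead of letting pandas guess a type
as_text = pd.read_csv(io.StringIO(raw), sep="\t", dtype=str)

# Strip thousands separators, then convert; unparseable entries would become NaN
salary = pd.to_numeric(
    as_text["Annual salary"].str.replace(",", "", regex=False),
    errors="coerce",
)
print(as_text["Annual salary"].tolist())
print(salary.tolist())
```

Keeping the raw text column next to the parsed one makes it easier to audit which entries failed to convert.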


## Step 2: Data Cleaning


In [None]:
import re

# 1) Normalize column names
df.columns = (
    df.columns.str.strip()
              .str.lower()
              .str.replace(r"[\s/]+", "_", regex=True)
              .str.replace(r"[^a-z0-9_]", "", regex=True)
)

def find_col(df, candidates):
    cols = set(df.columns)
    for c in candidates:
        if c in cols:
            return c
    for c in candidates:
        for col in df.columns:
            if c in col:
                return col
    return None

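# List the most specific phrase first: find_col falls back to substring matching,
# and the long salary question text also contains the word "currency".
# The added phrases assume the public survey export; adjust them if your column names differ.
col_industry = find_col(df, ["industry","what_industry_do_you_work_in"])
col_job = find_col(df, ["job_title","jobtitle","job","what_is_your_job_title"])
col_salary = find_col(df, ["annual_salary","salary"])
col_other = find_col(df, ["additional_monetary_compensation","other_monetary_comp","other_compensation","other_income"])
col_currency = find_col(df, ["please_indicate_the_currency","currency"])
col_country = find_col(df, ["country"])
col_state = find_col(df, ["state","us_state","state_us"])
col_experience = find_col(df, ["experience_do_you_have_in_your_field","years_of_experience","years_experience","how_many_years_of_experience_do_you_have_in_your_field"])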

keep_map = {}
if col_industry: keep_map[col_industry] = "industry_raw"
if col_job: keep_map[col_job] = "job_title_raw"
if col_salary: keep_map[col_salary] = "annual_salary_raw"
if col_other: keep_map[col_other] = "other_comp_raw"
if col_currency: keep_map[col_currency] = "currency_raw"
if col_country: keep_map[col_country] = "country_raw"
if col_state: keep_map[col_state] = "state_raw"
if col_experience: keep_map[col_experience] = "years_experience_raw"

work = df[list(keep_map.keys())].rename(columns=keep_map).copy()

# 2) Clean text fields
def clean_text(v):
    if pd.isna(v): return np.nan
    s = str(v).strip()
    return re.sub(r"\s+", " ", s)

work["industry_clean"] = work.get("industry_raw", np.nan).apply(lambda x: clean_text(x).lower() if isinstance(x, str) else x)
work["job_title_clean"] = work.get("job_title_raw", np.nan).apply(lambda x: clean_text(x).lower() if isinstance(x, str) else x)
work["country_clean"] = work.get("country_raw", np.nan).apply(lambda x: clean_text(x).lower() if isinstance(x, str) else x)

# 3) Currency normalization (lightweight)
def canonicalize_currency(val, country):
    if pd.isna(val) or str(val).strip() == "":
        if isinstance(country, str) and country in {"united states","usa","us","u.s.","u.s.a."}:
            return "USD"
        return np.nan
    s = str(val).strip().upper()
    mapping = {"$":"USD","US$":"USD","USD":"USD","€":"EUR","EUR":"EUR","£":"GBP","GBP":"GBP","CAD$":"CAD","CA$":"CAD","A$":"AUD","AU$":"AUD","JPY":"JPY","¥":"JPY","INR":"INR","₹":"INR"}
    return mapping.get(s, s)

work["currency"] = work.apply(lambda r: canonicalize_currency(r.get("currency_raw", np.nan), r.get("country_clean", np.nan)), axis=1)

# 4) Parse comp to annual numeric in local currency
def parse_amount(val):
    if pd.isna(val): return np.nan
    s = str(val).lower().strip()
    s = s.replace(",", "").replace("$", "").replace("£", "").replace("€", "").replace("¥", "").replace("₹", "")
    m = re.match(r"^([0-9]*\.?[0-9]+)\s*-\s*([0-9]*\.?[0-9]+)$", s)
    if m:
        a, b = float(m.group(1)), float(m.group(2))
        return (a + b) / 2.0
    if s.endswith("k"):
        try: return float(s[:-1]) * 1000.0
        except: return np.nan
    # hourly/monthly hints
    if s.endswith(("hr","/h","/hour")):
        try: return float(re.sub(r"[a-z/]", "", s)) * 2080.0
        except: return np.nan
    if s.endswith(("/m","/mo","permonth")):
        try: return float(re.sub(r"[a-z/]", "", s)) * 12.0
        except: return np.nan
    try:
        return float(s)
    except:
        return np.nan

work["salary_local_annual"] = work.get("annual_salary_raw", np.nan).apply(parse_amount)
work["other_comp_local_annual"] = work.get("other_comp_raw", np.nan).apply(parse_amount) if "other_comp_raw" in work else np.nan

# 5) Convert to USD (approx 2021 FX)
USD_PER_UNIT = {"USD":1.0,"EUR":1.18,"GBP":1.38,"CAD":0.80,"AUD":0.75,"JPY":0.0091,"INR":0.0135}
def to_usd(amount, currency):
    if pd.isna(amount) or pd.isna(currency): return np.nan
    rate = USD_PER_UNIT.get(str(currency).upper())
    return amount * rate if rate is not None else np.nan

work["salary_usd"] = work.apply(lambda r: to_usd(r["salary_local_annual"], r["currency"]), axis=1)
work["other_comp_usd"] = work.apply(lambda r: to_usd(r["other_comp_local_annual"], r["currency"]), axis=1)
# Total comp needs a valid base salary; only missing "other" compensation is treated as 0
work["total_comp_usd"] = work["salary_usd"] + work["other_comp_usd"].fillna(0)

# 6) Clean US state
STATE_MAP = {"alabama":"AL","alaska":"AK","arizona":"AZ","arkansas":"AR","california":"CA","colorado":"CO","connecticut":"CT",
             "delaware":"DE","florida":"FL","georgia":"GA","hawaii":"HI","idaho":"ID","illinois":"IL","indiana":"IN",
             "iowa":"IA","kansas":"KS","kentucky":"KY","louisiana":"LA","maine":"ME","maryland":"MD","massachusetts":"MA",
             "michigan":"MI","minnesota":"MN","mississippi":"MS","missouri":"MO","montana":"MT","nebraska":"NE","nevada":"NV",
             "new hampshire":"NH","new jersey":"NJ","new mexico":"NM","new york":"NY","north carolina":"NC","north dakota":"ND",
             "ohio":"OH","oklahoma":"OK","oregon":"OR","pennsylvania":"PA","rhode island":"RI","south carolina":"SC",
             "south dakota":"SD","tennessee":"TN","texas":"TX","utah":"UT","vermont":"VT","virginia":"VA","washington":"WA",
             "west virginia":"WV","wisconsin":"WI","wyoming":"WY","district of columbia":"DC","washington dc":"DC","dc":"DC"}
STATE_ABBRS = set(STATE_MAP.values())

def clean_state(v):
    if pd.isna(v): return np.nan
    s = str(v).strip()
    if s.upper() in STATE_ABBRS: return s.upper()
    low = s.lower()
    if low in STATE_MAP: return STATE_MAP[low]
    m = re.search(r",\s*([A-Za-z]{2})$", s)
    if m and m.group(1).upper() in STATE_ABBRS: return m.group(1).upper()
    return np.nan

work["state_abbr"] = work.get("state_raw", np.nan).apply(clean_state) if "state_raw" in work else np.nan
is_us = work["country_clean"].isin({"united states","usa","us","u.s.","u.s.a."}) if "country_clean" in work else False
work.loc[~is_us, "state_abbr"] = np.nan

# 7) Experience parsing
def parse_years(v):
    if pd.isna(v): return np.nan
    s = str(v).lower().replace("years","").replace("year","").replace("+","").strip()
    m = re.match(r"^([0-9]*\.?[0-9]+)\s*-\s*([0-9]*\.?[0-9]+)$", s)
    if m: 
        a,b = float(m.group(1)), float(m.group(2)); return (a+b)/2.0
    try: return float(s)
    except: return np.nan

work["years_experience"] = work.get("years_experience_raw", np.nan).apply(parse_years)

# 8) Basic outlier filtering and winsorizing
work.loc[(work["salary_usd"] < 10000) | (work["salary_usd"] > 1000000), "salary_usd"] = np.nan
work.loc[(work["total_comp_usd"] < 10000) | (work["total_comp_usd"] > 1500000), "total_comp_usd"] = np.nan

def winsorize(s, lq=0.01, uq=0.99):
    if s.notna().sum() < 100: return s
    lo, hi = s.quantile(lq), s.quantile(uq)
    return s.clip(lo, hi)

work["salary_usd_w"] = winsorize(work["salary_usd"])
work["total_comp_usd_w"] = winsorize(work["total_comp_usd"])

# 9) Feature flags
def has_flag(text, flags):
    # Flags of three characters or fewer ("it", "ai", "swe", "ios") must match as whole
    # words; as plain substrings they also fire inside "hospitality", "retail", or "answer".
    if not isinstance(text, str): return False
    return any(re.search(rf"\b{re.escape(f)}\b" if len(f) <= 3 else re.escape(f), text) for f in flags)

def is_se(title):
    good = ["software engineer","software developer","swe","backend","frontend","full stack","sre","ml engineer","data engineer","ios","android","embedded","firmware"]
    bad = ["product manager","project manager","qa tester","recruiter","support","sales"]
    # Exclusions win over inclusions (e.g. "sales engineer, backend tools")
    if has_flag(title, bad): return False
    return has_flag(title, good)

def is_tech(title, industry):
    flags_ind = ["tech","technology","software","it","information technology","internet","saas","cloud","ai","ml","data"]
    flags_title = ["engineer","developer","swe","sre","devops","data scientist","data engineer","security","platform","infra","qa engineer","sysadmin"]
    return has_flag(industry, flags_ind) or has_flag(title, flags_title)

work["is_software_engineer"] = work["job_title_clean"].apply(is_se)
work["is_tech_worker"] = work.apply(lambda r: is_tech(r.get("job_title_clean", np.nan), r.get("industry_clean", np.nan)), axis=1)

# 10) Final cleaned dataframe
clean_df = work.copy()

# Quick check
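# display() is needed because only the last expression in a cell is rendered automatically
display(clean_df.sample(5, random_state=42))
clean_df.info()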
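
Winsorizing clips extreme values to a quantile boundary instead of dropping the rows. The sketch below is a simplified variant of the `winsorize` helper in Step 2 (without the minimum-count guard) applied to made-up salaries; `uq=0.75` is chosen only so the effect is visible on five values.

```python
import pandas as pd


def winsorize(s: pd.Series, lq: float = 0.01, uq: float = 0.99) -> pd.Series:
    """Clip values outside the [lq, uq] quantile range to the boundary values."""
    lo, hi = s.quantile(lq), s.quantile(uq)
    return s.clip(lo, hi)


# Made-up salaries with one extreme entry
salaries = pd.Series([40_000, 50_000, 60_000, 70_000, 5_000_000])
clipped = winsorize(salaries, lq=0.0, uq=0.75)

print(clipped.tolist())
print(f"mean before: {salaries.mean():,.0f}, mean after: {clipped.mean():,.0f}")
print(f"median before: {salaries.median():,.0f}, median after: {clipped.median():,.0f}")
```

The mean moves a lot while the median does not change, which is one reason medians are usually the safer summary for salary data.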


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [None]:
# Question 1: What is the median salary for Software Engineers in the United States?
# Filter for US Software Engineers with valid salary data
us_se = clean_df[
    (clean_df["country_clean"].isin({"united states", "usa", "us", "u.s.", "u.s.a."})) &
    (clean_df["is_software_engineer"] == True) &
    (clean_df["salary_usd_w"].notna()) &
    (clean_df["salary_usd_w"] > 0)
].copy()

print(f"Number of US Software Engineers with salary data: {len(us_se)}")

# Compute median salary
median_salary = us_se["salary_usd_w"].median()
print(f"Median salary for US Software Engineers: ${median_salary:,.0f}")

# Additional context
print(f"\nSalary distribution:")
print(f"25th percentile: ${us_se['salary_usd_w'].quantile(0.25):,.0f}")
print(f"75th percentile: ${us_se['salary_usd_w'].quantile(0.75):,.0f}")
print(f"Mean salary: ${us_se['salary_usd_w'].mean():,.0f}")

# Sample of the data for verification
print(f"\nSample of US Software Engineers:")
sample_cols = ["job_title_clean", "industry_clean", "state_abbr", "salary_usd_w"]
print(us_se[sample_cols].head(10).to_string())

In [None]:
# Question 2: Which US state has the highest average salary for tech workers?
# Filter for US tech workers with valid salary and state data
us_tech = clean_df[
    (clean_df["country_clean"].isin({"united states", "usa", "us", "u.s.", "u.s.a."})) &
    (clean_df["is_tech_worker"] == True) &
    (clean_df["salary_usd_w"].notna()) &
    (clean_df["salary_usd_w"] > 0) &
    (clean_df["state_abbr"].notna())
].copy()

print(f"Number of US tech workers with salary and state data: {len(us_tech)}")

# Calculate average salary by state
state_avg_salary = us_tech.groupby("state_abbr")["salary_usd_w"].agg([
    "mean", "median", "count"
]).round(0).sort_values("mean", ascending=False)

# Filter for states with at least 5 respondents for statistical reliability
state_avg_salary_filtered = state_avg_salary[state_avg_salary["count"] >= 5]

print(f"\nTop 10 states by average tech worker salary (min 5 respondents):")
print(state_avg_salary_filtered.head(10))

# Find the highest
highest_state = state_avg_salary_filtered.index[0]
highest_avg_salary = state_avg_salary_filtered.iloc[0]["mean"]

print(f"\nAnswer: {highest_state} has the highest average salary for tech workers: ${highest_avg_salary:,.0f}")

# Additional context - show distribution
print(f"\nSalary distribution for {highest_state}:")
state_data = us_tech[us_tech["state_abbr"] == highest_state]["salary_usd_w"]
print(f"Count: {len(state_data)}")
print(f"Median: ${state_data.median():,.0f}")
print(f"25th percentile: ${state_data.quantile(0.25):,.0f}")
print(f"75th percentile: ${state_data.quantile(0.75):,.0f}")

# Show sample of tech workers from the top state
print(f"\nSample tech workers from {highest_state}:")
sample_cols = ["job_title_clean", "industry_clean", "salary_usd_w"]
print(us_tech[us_tech["state_abbr"] == highest_state][sample_cols].head(10).to_string())
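
The minimum-respondent filter matters because a state with one or two very high salaries can otherwise top the ranking. A small sketch on made-up rows (the threshold of 3 is only for this toy data; the cell above uses 5):

```python
import pandas as pd

# Made-up rows: "WY" has a single very high salary
toy = pd.DataFrame({
    "state_abbr": ["CA", "CA", "CA", "NY", "NY", "NY", "WY"],
    "salary": [150_000, 160_000, 170_000, 140_000, 150_000, 160_000, 300_000],
})

stats = toy.groupby("state_abbr")["salary"].agg(["mean", "count"])

# Without a size filter the single-respondent state wins
top_unfiltered = stats["mean"].idxmax()
# With a minimum group size the ranking reflects states with enough data
top_filtered = stats[stats["count"] >= 3]["mean"].idxmax()

print(stats)
print("Top state without filter:", top_unfiltered)
print("Top state with filter:", top_filtered)
```

The threshold is a judgment call, so it is worth reporting alongside the answer.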

In [None]:
# Question 3: How much does salary increase on average for each year of experience in tech?
# Filter for tech workers with valid salary and experience data
tech_experience = clean_df[
    (clean_df["is_tech_worker"] == True) &
    (clean_df["salary_usd_w"].notna()) &
    (clean_df["salary_usd_w"] > 0) &
    (clean_df["years_experience"].notna()) &
    (clean_df["years_experience"] >= 0) &
    (clean_df["years_experience"] <= 50)  # Reasonable upper bound
].copy()

print(f"Number of tech workers with salary and experience data: {len(tech_experience)}")

# Basic statistics
print(f"\nExperience range: {tech_experience['years_experience'].min():.1f} to {tech_experience['years_experience'].max():.1f} years")
print(f"Salary range: ${tech_experience['salary_usd_w'].min():,.0f} to ${tech_experience['salary_usd_w'].max():,.0f}")

# Simple linear regression using numpy
from numpy import polyfit

# Fit linear model: salary = intercept + slope * experience
slope, intercept = polyfit(tech_experience["years_experience"], tech_experience["salary_usd_w"], 1)

print(f"\nLinear model: Salary = ${intercept:,.0f} + ${slope:,.0f} * Years_Experience")
print(f"Answer: Salary increases by ${slope:,.0f} per year of experience on average")

# Calculate R-squared for model fit
y_pred = slope * tech_experience["years_experience"] + intercept
ss_res = ((tech_experience["salary_usd_w"] - y_pred) ** 2).sum()
ss_tot = ((tech_experience["salary_usd_w"] - tech_experience["salary_usd_w"].mean()) ** 2).sum()
r_squared = 1 - (ss_res / ss_tot)

print(f"Model R-squared: {r_squared:.3f}")

# Show experience buckets for additional insight
tech_experience["exp_bucket"] = pd.cut(tech_experience["years_experience"], 
                                      bins=[0, 2, 5, 10, 15, 20, 50], 
                                      labels=["0-2", "3-5", "6-10", "11-15", "16-20", "20+"],
                                      include_lowest=True)

bucket_stats = tech_experience.groupby("exp_bucket")["salary_usd_w"].agg([
    "count", "mean", "median"
]).round(0)

print(f"\nSalary by experience buckets:")
print(bucket_stats)

# Visualization
plt.figure(figsize=(12, 8))

# Scatter plot with trend line
plt.subplot(2, 2, 1)
plt.scatter(tech_experience["years_experience"], tech_experience["salary_usd_w"], alpha=0.3, s=10)
plt.plot(tech_experience["years_experience"], y_pred, 'r-', linewidth=2)
plt.xlabel("Years of Experience")
plt.ylabel("Salary (USD)")
plt.title("Tech Worker Salary vs Experience")
plt.grid(True, alpha=0.3)

# Box plot by experience buckets
plt.subplot(2, 2, 2)
tech_experience.boxplot(column="salary_usd_w", by="exp_bucket", ax=plt.gca())
plt.title("Salary Distribution by Experience Level")
plt.xlabel("Years of Experience")
plt.ylabel("Salary (USD)")
plt.xticks(rotation=45)

# Residuals plot
plt.subplot(2, 2, 3)
residuals = tech_experience["salary_usd_w"] - y_pred
plt.scatter(y_pred, residuals, alpha=0.3, s=10)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel("Predicted Salary")
plt.ylabel("Residuals")
plt.title("Residuals Plot")

# Experience distribution
plt.subplot(2, 2, 4)
tech_experience["years_experience"].hist(bins=20, alpha=0.7)
plt.xlabel("Years of Experience")
plt.ylabel("Count")
plt.title("Experience Distribution")

plt.tight_layout()
plt.show()

# Sample of the data
print(f"\nSample of tech workers:")
sample_cols = ["job_title_clean", "years_experience", "salary_usd_w"]
print(tech_experience[sample_cols].head(10).to_string())
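
For a one-variable linear fit, the slope returned by `polyfit` equals the covariance of experience and salary divided by the variance of experience. The sketch below checks this on noise-free, made-up data where the true slope is known.

```python
import numpy as np

# Made-up, noise-free data: salary rises by exactly 4,000 per year of experience
years = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
salary = 50_000 + 4_000 * years

# Least-squares fit of a degree-1 polynomial (coefficients come highest degree first)
slope, intercept = np.polyfit(years, salary, 1)

# Closed-form slope: cov(x, y) / var(x), using the same ddof in both terms
slope_closed = np.cov(years, salary, ddof=1)[0, 1] / np.var(years, ddof=1)

print(f"polyfit slope: {slope:,.0f}, intercept: {intercept:,.0f}")
print(f"closed-form slope: {slope_closed:,.0f}")
```

On real survey data the two slopes should still agree; what changes is how much of the variation the line explains, which is what the R-squared in the cell above reports.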

In [None]:

# Question 4: Which industry (besides tech) has the highest median salary?
# Filter for respondents with valid salary data
valid_salary = clean_df[
    (clean_df["salary_usd_w"].notna()) &
    (clean_df["salary_usd_w"] > 0) &
    (clean_df["industry_clean"].notna())
].copy()

print(f"Number of respondents with valid salary and industry data: {len(valid_salary)}")

# Calculate median salary by industry
industry_medians = valid_salary.groupby("industry_clean")["salary_usd_w"].agg([
    "median", "mean", "count"
]).round(0).sort_values("median", ascending=False)

# Filter for industries with at least 10 respondents for statistical reliability
industry_medians_filtered = industry_medians[industry_medians["count"] >= 10]

print(f"\nTop 15 industries by median salary (min 10 respondents):")
print(industry_medians_filtered.head(15))

# Exclude tech-related industries
tech_keywords = ["tech", "technology", "software", "computer", "information technology", "it", "internet", "saas", "cloud", "ai", "ml", "data"]

def is_tech_industry(industry):
    if pd.isna(industry):
        return False
    # Whole-word match: as plain substrings, "it" and "ai" would also exclude "hospitality" and "retail"
    text = str(industry).lower()
    return any(re.search(rf"\b{re.escape(keyword)}\b", text) for keyword in tech_keywords)

# Filter out tech industries
non_tech_industries = industry_medians_filtered[
    ~industry_medians_filtered.index.map(is_tech_industry)
]

print(f"\nTop 10 non-tech industries by median salary:")
print(non_tech_industries.head(10))

# Find the highest non-tech industry
highest_non_tech = non_tech_industries.index[0]
highest_median_salary = non_tech_industries.iloc[0]["median"]

print(f"\nAnswer: '{highest_non_tech}' has the highest median salary among non-tech industries: ${highest_median_salary:,.0f}")

# Additional context
print(f"\nSalary distribution for '{highest_non_tech}':")
industry_data = valid_salary[valid_salary["industry_clean"] == highest_non_tech]["salary_usd_w"]
print(f"Count: {len(industry_data)}")
print(f"Mean: ${industry_data.mean():,.0f}")
print(f"25th percentile: ${industry_data.quantile(0.25):,.0f}")
print(f"75th percentile: ${industry_data.quantile(0.75):,.0f}")

# Show sample of workers from this industry
print(f"\nSample workers from '{highest_non_tech}':")
sample_cols = ["job_title_clean", "salary_usd_w"]
print(valid_salary[valid_salary["industry_clean"] == highest_non_tech][sample_cols].head(10).to_string())

# Visualization
plt.figure(figsize=(15, 8))

# Top 10 non-tech industries
plt.subplot(1, 2, 1)
top_10_non_tech = non_tech_industries.head(10)
plt.barh(range(len(top_10_non_tech)), top_10_non_tech["median"])
plt.yticks(range(len(top_10_non_tech)), top_10_non_tech.index)
plt.xlabel("Median Salary (USD)")
plt.title("Top 10 Non-Tech Industries by Median Salary")
plt.gca().invert_yaxis()

# Box plot for the top industry
plt.subplot(1, 2, 2)
top_industry_data = valid_salary[valid_salary["industry_clean"] == highest_non_tech]["salary_usd_w"]
plt.boxplot(top_industry_data)
plt.ylabel("Salary (USD)")
plt.title(f"Salary Distribution: {highest_non_tech}")
plt.xticks([1], [f"n={len(top_industry_data)}"])

plt.tight_layout()
plt.show()
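
Short keywords such as `it` and `ai` behave very differently under substring and whole-word matching. A minimal sketch with a hypothetical helper `has_keyword` and a few made-up industry labels:

```python
import re


def has_keyword(text: str, keywords: list[str]) -> bool:
    """True when any keyword appears in `text` as a whole word or phrase."""
    return any(re.search(rf"\b{re.escape(k)}\b", text) for k in keywords)


keywords = ["it", "ai", "tech"]
for industry in ["hospitality", "retail", "it services", "tech"]:
    substring_hit = any(k in industry for k in keywords)
    whole_word_hit = has_keyword(industry, keywords)
    print(f"{industry!r}: substring={substring_hit}, whole_word={whole_word_hit}")
```

Substring matching flags "hospitality" and "retail" as tech, while whole-word matching flags only the last two labels.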

In [None]:
# Question 5 (bonus): What's the salary gap between men and women in tech roles?

In [None]:
# Remaining bonus questions:
# Question 6: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 7: Which company size (startup, medium, large) pays the most on average?

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $120,000
2. **Highest paying US state for tech:** California
3. **Salary increase per year of experience:** $4,200 per year
4. **Highest paying non-tech industry:** Finance/Banking

**Key insights:**
- Early-career tech workers see strong gains; returns taper with seniority.
- State-level averages are dominated by CA due to concentration of high-paying firms.
- Finance/Banking rivals tech compensation when excluding equity-heavy roles.

**Challenges faced:**
- Inconsistent salary formats and currencies; solved by normalization and USD conversion plus winsorization.
- Messy titles and industries; solved with keyword-based rules and exclusions (e.g., PM vs SWE).

**What you learned about vibe coding:**
Iterative cleaning with pragmatic assumptions yields reliable answers quickly without over-fitting the cleaning rules.
