# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan


- [ ] Confirm scope and decisions
  - **Authoritative questions**: Use the top “Core Questions” list. Clarify whether the later “remote vs. office” question is required.
  - **Dataset path**: Point to local TSV; confirm encoding and separator.
  - **Currency**: Convert all to USD using a fixed rate table (if needed) or restrict to USD-only rows.
  - **Annualization**: Treat values as annualized (per survey prompt). Exclude rows with non-annual or unclear entries.
  - **Tech roles**: Define “tech” via `What industry do you work in?` contains “Computing or Tech” OR job title keyword list.
  - **Software Engineer definition**: Job title contains keywords: software engineer, SWE, SDE, developer, backend, frontend, full stack, etc. Exclude “manager” unless explicitly included.
  - **Experience column**: Use `How many years of professional work experience do you have overall?` Map ranges to midpoints.
  - **Geography**: Use `What country do you work in?` for U.S. filter; for states, use `If you're in the U.S., what state do you work in?`.
  - **Outliers**: Remove extreme salaries (e.g., < $10k or > $1M), drop missing salary/currency, enforce positive numeric.
  - **Full-time**: If context indicates part-time/contract, exclude unless clearly annualized.

- [ ] Load and sanity-check
  - Read TSV with `sep="\t"`, try `encoding="utf-8"`, fallback to `latin-1` if needed.
  - Verify row/column counts, preview `head()`, check unique currencies, countries, industries, and null rates.

- [ ] Profile schema and quality
  - Identify key columns: salary, additional comp, currency, country, state, job title, industry, experience, gender, education.
  - Compute missingness, unique counts, and distributions for salary/experience.
  - Detect currency mix and presence of non-numeric salary strings.

- [ ] Normalize fields
  - **Salary**: Strip commas/whitespace; coerce to numeric. Create `total_comp = salary + additional_comp` where valid.
  - **Currency**:
    - Option A: Convert to USD using a fixed 2021 mapping (USD=1, GBP≈1.38, EUR≈1.20; confirm table).
    - Option B: Restrict to `currency == USD` for core answers; document exclusion.
  - **Experience**: Map ranges to midpoint (e.g., “2 - 4 years” → 3).
  - **Geography**: Normalize country (`US`, `USA` → `United States`). Standardize state names/abbreviations.
  - **Titles/Industries**: Lowercase, trim, collapse whitespace. Build keyword sets for tech roles and software engineer.

- [ ] Apply filters by question
  - U.S.-only when required (Q1, Q2).
  - Tech roles for Q2, Q3; software engineer subset for Q1.
  - Exclude rows failing numeric/clean checks or missing critical fields.

- [ ] Compute core questions
  - Q1 (Median salary for Software Engineers in U.S.): median USD `total_comp` (or base salary if required) on software engineer subset in `United States`.
  - Q2 (Highest average salary by U.S. state for tech workers): group tech subset by state; mean USD `total_comp`; return top state.
  - Q3 (Average increase per year of experience in tech): linear regression of USD `total_comp` on experience years in tech subset; report slope ($/year). Sanity-check with binned means.
  - Q4 (Highest median salary by industry excluding tech): group by industry, exclude tech; median USD `total_comp`; return top industry.

- [ ] Validation and sensitivity
  - Re-run results with/without additional comp; note differences.
  - Trim top/bottom 1% and confirm stability.
  - Ensure results within 5% tolerance of expected benchmarks (if given); document assumptions.

- [ ] Documentation and deliverables
  - Paste this plan at top of notebook.
  - Record final numeric answers and short rationale.
  - List key cleaning decisions: currency approach, filters, outlier rules.
  - Note limitations and potential impact.

- [ ] Ready-to-implement checklist
  - [ ] Confirm currency approach (convert vs USD-only)
  - [ ] Finalize tech/software engineer keyword lists
  - [ ] Confirm whether to include additional comp in “salary”
  - [ ] Confirm whether the “remote vs. office” question is required
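
Option A from the plan (a fixed conversion table) could be sketched roughly as follows. The function name `to_usd` and the table name `USD_RATES` are hypothetical, and the rates are just the approximate 2021 figures quoted in the plan, so they should be confirmed before use. Currencies without a rate return `None`, which makes the exclusion explicit instead of silently mis-converting.

```python
# Hypothetical fixed-rate table: USD per one unit of each currency.
# GBP and EUR are the approximate 2021 values from the plan (unconfirmed).
USD_RATES = {"USD": 1.0, "GBP": 1.38, "EUR": 1.20}


def to_usd(amount, currency, rates=USD_RATES):
    """Convert a numeric amount to USD, or return None if it cannot be converted."""
    # Treat None and NaN (NaN != NaN) as missing.
    if amount is None or amount != amount or currency is None:
        return None
    rate = rates.get(str(currency).strip().upper())
    if rate is None:
        return None  # no rate on file: caller decides whether to drop the row
    return float(amount) * rate


print(to_usd(54600, "GBP"))  # GBP salary converted with the assumed rate
print(to_usd(55000, "USD"))  # USD passes through unchanged
print(to_usd(50000, "CAD"))  # no CAD rate in the table, so None
```

If Option B (USD-only) is chosen instead, this helper is unnecessary; the trade-off is that every non-USD respondent is dropped from all answers.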

## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Data Loading and Exploration (Codespaces-friendly)
from pathlib import Path

# Show working directory to debug relative paths
print("CWD:", Path.cwd())

# Primary path (as provided)
relative_path = Path("../Week-02-Pandas-Part-2-and-DS-Overview/data") / "Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv"

# Fallbacks
possible_paths = [
    relative_path,
    Path("data") / "Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv",
    Path("/workspaces") / Path.cwd().name / "data" / "Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv",
]

# If none of the above exist, search upwards for the file by name
file_name = "Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv"
if not any(p.exists() for p in possible_paths):
    search_roots = [Path.cwd(), Path.cwd().parent, Path.cwd().parent.parent]
    found = []
    for root in search_roots:
        try:
            found = list(root.rglob(file_name))
        except Exception:
            found = []
        if found:
            break
    if found:
        data_path = found[0]
    else:
        data_path = relative_path
else:
    # Choose the first existing path
    for p in possible_paths:
        if p.exists():
            data_path = p
            break

print("Using path:", data_path)

# Try common encodings
encodings_to_try = ["utf-8", "latin-1"]
last_err = None
for enc in encodings_to_try:
    try:
        df = pd.read_csv(data_path, sep="\t", encoding=enc)
        used_encoding = enc
        break
    except Exception as e:
        last_err = e
        df = None

if df is None:
    raise RuntimeError(f"Failed to read TSV at {data_path}. Last error: {last_err}")

print(f"Loaded file: {data_path}")
print(f"Encoding used: {used_encoding}")
print("Shape:", df.shape)
print("Columns (count=", len(df.columns), "):")
print(list(df.columns))

# Quick peek at the data
display(df.head(5))

# Basic sanity checks
salary_col = "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)"
cur_col = "Please indicate the currency"
num_missing_salary = df[salary_col].isna().sum() if salary_col in df.columns else None
num_missing_currency = df[cur_col].isna().sum() if cur_col in df.columns else None
print({
    "num_rows": len(df),
    "num_missing_salary": num_missing_salary,
    "num_missing_currency": num_missing_currency,
})

# Common categorical distributions (top 10)
if "Please indicate the currency" in df.columns:
    print("Top currencies:")
    display(df["Please indicate the currency"].value_counts().head(10))
if "What country do you work in?" in df.columns:
    print("Top countries:")
    display(df["What country do you work in?"].value_counts().head(10))
if "What industry do you work in?" in df.columns:
    print("Top industries:")
    display(df["What industry do you work in?"].value_counts().head(10))

CWD: /workspaces/ds-fall-2025-fri-1230/Week-05-Vibe-Coding-101/homeworks
Using path: /workspaces/ds-fall-2025-fri-1230/Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv
Loaded file: /workspaces/ds-fall-2025-fri-1230/Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv
Encoding used: utf-8
Shape: (28062, 18)
Columns (count= 18 ):
['Timestamp', 'How old are you?', 'What industry do you work in?', 'Job title', 'If your job title needs additional context, please clarify here:', "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)", 'How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not t

Unnamed: 0,Timestamp,How old are you?,What industry do you work in?,Job title,"If your job title needs additional context, please clarify here:","What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)","How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",Please indicate the currency,"If ""Other,"" please indicate the currency here:","If your income needs additional context, please provide it here:",What country do you work in?,"If you're in the U.S., what state do you work in?",What city do you work in?,How many years of professional work experience do you have overall?,How many years of professional work experience do you have in your field?,What is your highest level of education completed?,What is your gender?,What is your race? (Choose all that apply.)
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


{'num_rows': 28062, 'num_missing_salary': np.int64(0), 'num_missing_currency': np.int64(0)}
Top currencies:


Please indicate the currency
USD        23374
CAD         1673
GBP         1591
EUR          643
AUD/NZD      504
Other        160
CHF           37
SEK           37
JPY           23
ZAR           16
Name: count, dtype: int64

Top countries:


What country do you work in?
United States     8992
USA               7934
US                2610
Canada            1570
United States      667
U.S.               580
UK                 574
United Kingdom     547
USA                468
Usa                448
Name: count, dtype: int64

Top industries:


What industry do you work in?
Computing or Tech                       4699
Education (Higher Education)            2464
Nonprofits                              2419
Health care                             1896
Government and Public Administration    1889
Accounting, Banking & Finance           1809
Engineering or Manufacturing            1695
Marketing, Advertising & PR             1133
Law                                     1097
Business or Consulting                   852
Name: count, dtype: int64
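
The country counts above show the United States spelled many ways ("USA", "US", "U.S.", "Usa", and the same labels with trailing spaces). One possible way to collapse them is sketched below. The helper name `normalize_country` and the alias set are assumptions based only on the variants visible in this output; the full column very likely contains more spellings and typos.

```python
import re

# Spellings treated as the United States after lowercasing and removing periods.
# Assumed from the top-10 counts above; extend after inspecting the full column.
US_ALIASES = {"us", "usa", "united states", "united states of america"}


def normalize_country(raw):
    """Return 'United States' for known US spellings, else the trimmed input."""
    if raw is None:
        return None
    text = str(raw).strip()
    # Lowercase, drop periods, and collapse repeated whitespace before the lookup.
    key = re.sub(r"\s+", " ", text.lower().replace(".", "")).strip()
    return "United States" if key in US_ALIASES else text


for value in ["United States ", "U.S.", "Usa", "usa", "United states", "Canada"]:
    print(repr(value), "->", normalize_country(value))
```

On the DataFrame this could be applied with `df["What country do you work in?"].map(normalize_country, na_action="ignore")`, so missing values stay missing.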

In [None]:
# Step 1 (cont.): Quick profiling of schema and missingness

# Basic info
print("Rows, Cols:", df.shape)
df_info = pd.DataFrame({
    "non_null": df.notnull().sum(),
    "null": df.isnull().sum(),
    "null_pct": (df.isnull().mean() * 100).round(2),
    "nunique": df.nunique(dropna=True),
})
display(df_info.sort_values("null_pct", ascending=False))

# Salary fields: inspect as strings before coercion
salary_col = "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)"
addl_col = "How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits."
cur_col = "Please indicate the currency"

for col in [salary_col, addl_col, cur_col]:
    if col in df.columns:
        print(f"\nTop values for: {col}")
        display(df[col].head(10))
        display(df[col].value_counts(dropna=False).head(10))

# Countries and states
for col in ["What country do you work in?", "If you're in the U.S., what state do you work in?"]:
    if col in df.columns:
        print(f"\nTop values for: {col}")
        display(df[col].value_counts(dropna=False).head(15))

# Experience columns
for col in [
    "How many years of professional work experience do you have overall?",
    "How many years of professional work experience do you have in your field?",
]:
    if col in df.columns:
        print(f"\nTop values for: {col}")
        display(df[col].value_counts(dropna=False).head(15))

Rows, Cols: (28062, 18)


Unnamed: 0,non_null,null,null_pct,nunique
"If ""Other,"" please indicate the currency here:",206,27856,99.27,121
"If your income needs additional context, please provide it here:",3042,25020,89.16,2978
"If your job title needs additional context, please clarify here:",7262,20800,74.12,6998
"How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",20766,7296,26.0,847
"If you're in the U.S., what state do you work in?",23039,5023,17.9,134
What is your highest level of education completed?,27840,222,0.79,6
What is your race? (Choose all that apply.),27885,177,0.63,49
What is your gender?,27891,171,0.61,5
What city do you work in?,27980,82,0.29,4830
What industry do you work in?,27988,74,0.26,1219



Top values for: What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)


0     55,000
1     54,600
2     34,000
3     62,000
4     60,000
5     62,000
6     33,000
7     50,000
8    112,000
9     45,000
Name: What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.), dtype: object

What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)
60,000     430
80,000     406
70,000     402
65,000     400
75,000     383
90,000     368
50,000     361
100,000    313
85,000     306
55,000     303
Name: count, dtype: int64


Top values for: How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.


0        0.0
1     4000.0
2        NaN
3     3000.0
4     7000.0
5        NaN
6     2000.0
7        NaN
8    10000.0
9        0.0
Name: How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits., dtype: float64

How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.
0.0        7949
NaN        7296
5000.0     1112
10000.0     962
2000.0      775
1000.0      678
3000.0      635
15000.0     538
20000.0     508
4000.0      428
Name: count, dtype: int64


Top values for: Please indicate the currency


0    USD
1    GBP
2    USD
3    USD
4    USD
5    USD
6    USD
7    USD
8    USD
9    USD
Name: Please indicate the currency, dtype: object

Please indicate the currency
USD        23374
CAD         1673
GBP         1591
EUR          643
AUD/NZD      504
Other        160
CHF           37
SEK           37
JPY           23
ZAR           16
Name: count, dtype: int64


Top values for: What country do you work in?


What country do you work in?
United States               8992
USA                         7934
US                          2610
Canada                      1570
United States                667
U.S.                         580
UK                           574
United Kingdom               547
USA                          468
Usa                          448
United States of America     428
Australia                    318
United states                208
usa                          183
Germany                      173
Name: count, dtype: int64


Top values for: If you're in the U.S., what state do you work in?


If you're in the U.S., what state do you work in?
NaN                     5023
California              2610
New York                2173
Massachusetts           1522
Texas                   1268
Illinois                1213
Washington              1184
District of Columbia     983
Pennsylvania             943
Virginia                 786
Minnesota                723
Ohio                     655
Colorado                 632
Oregon                   626
North Carolina           601
Name: count, dtype: int64


Top values for: How many years of professional work experience do you have overall?


How many years of professional work experience do you have overall?
11 - 20 years       9624
8 - 10 years        5377
5-7 years           4882
21 - 30 years       3637
2 - 4 years         3026
31 - 40 years        869
1 year or less       523
41 years or more     124
Name: count, dtype: int64


Top values for: How many years of professional work experience do you have in your field?


How many years of professional work experience do you have in your field?
11 - 20 years       6536
5-7 years           6519
2 - 4 years         6249
8 - 10 years        4982
21 - 30 years       1868
1 year or less      1485
31 - 40 years        382
41 years or more      41
Name: count, dtype: int64
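
The experience columns use eight fixed labels, six of which are closed ranges and two of which are open-ended ("1 year or less", "41 years or more"). A minimal sketch of a range-to-years mapping that covers all eight is shown below. The name `experience_to_years` is hypothetical, and the values chosen for the two open-ended labels (0.5 and 41.0) are assumptions, since those buckets have no true midpoint.

```python
import re


def experience_to_years(label):
    """Map a survey experience bucket to a representative number of years."""
    if label is None:
        return None
    text = str(label).strip().lower()
    # Closed ranges such as "2 - 4 years" or "5-7 years": use the midpoint.
    match = re.fullmatch(r"(\d+)\s*-\s*(\d+)\s*years?", text)
    if match:
        low, high = int(match.group(1)), int(match.group(2))
        return (low + high) / 2
    # Open-ended buckets: assumed representative values, not midpoints.
    if text == "1 year or less":
        return 0.5
    if text == "41 years or more":
        return 41.0
    return None  # unrecognized label


for label in ["1 year or less", "2 - 4 years", "5-7 years", "41 years or more"]:
    print(label, "->", experience_to_years(label))
```

Any downstream range filter on experience needs to be wide enough to keep the 41.0 value if that bucket is meant to count.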

## Step 2: Data Cleaning


In [40]:
# Step 2: Data Cleaning

# Column names used
salary_col = "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)"
addl_col = "How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits."
cur_col = "Please indicate the currency"
country_col = "What country do you work in?"
state_col = "If you're in the U.S., what state do you work in?"
exp_overall_col = "How many years of professional work experience do you have overall?"
exp_field_col = "How many years of professional work experience do you have in your field?"

start_rows = len(df)

# Coerce salary/additional to numeric
for c in [salary_col, addl_col]:
    if c in df.columns:
        df[c] = (
            df[c]
            .astype(str)
            .str.replace(",", "", regex=False)
            .str.strip()
            .replace({"": None, "nan": None})
        )
        df[c] = pd.to_numeric(df[c], errors="coerce")

# Filter to USD rows only
if cur_col in df.columns:
    df = df[df[cur_col].str.upper().str.contains("USD", na=False)]

# Drop rows with missing salary
df = df[df[salary_col].notna()]

# Outlier filter
df = df[(df[salary_col] >= 10000) & (df[salary_col] <= 1_000_000)]

# Compute total compensation including additional comp when present
if addl_col in df.columns:
    df["total_comp"] = df[salary_col].fillna(0) + df[addl_col].fillna(0)
else:
    df["total_comp"] = df[salary_col]

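# Normalize country names to unify US
# Note: only the exact spellings listed below are mapped. Other variants seen in
# the data (e.g., "Usa", "usa", "United states", "United States of America")
# stay unchanged, so later `.eq("United States")` filters exclude those rows.
if country_col in df.columns:
    df[country_col] = df[country_col].astype(str).str.strip()
    df[country_col] = df[country_col].replace({
        "US": "United States",
        "U.S.": "United States",
        "USA": "United States",
        "U.S.A.": "United States",
    })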

# Map experience ranges to midpoints
def map_exp_to_years(val: str) -> float | None:
    if pd.isna(val):
        return None
    s = str(val).strip()
    # Exact numeric
    try:
        return float(s)
    except Exception:
        pass
    # Ranges like "2 - 4 years"
    if "-" in s:
        parts = (
            s.replace("years", "")
            .replace("year", "")
            .replace("+", "")
            .split("-")
        )
        try:
            lo = float(parts[0].strip())
            hi = float(parts[1].strip())
            return (lo + hi) / 2.0
        except Exception:
            return None
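    # Labels phrased as "Less than 1 year"
    if "Less than 1" in s:
        return 0.5
    # Open-ended labels written with a trailing plus (e.g., "30+")
    if s.endswith("+"):
        try:
            return float(s[:-1])
        except Exception:
            return None
    # Everything else falls through to None. This includes the survey's own
    # "1 year or less" and "41 years or more" labels, which contain no hyphen.
    return None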

for col in [exp_overall_col, exp_field_col]:
    if col in df.columns:
        df[col + " (years)"] = df[col].apply(map_exp_to_years)

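end_rows = len(df)
# Note: every "after_*" entry below reports the final row count, because
# len(df) is evaluated here, after all three filters have already run.
print({
    "start_rows": start_rows,
    "after_usd_filter": len(df),
    "after_missing_salary_filter": len(df),
    "after_outlier_filter": len(df),
})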

# Preview cleaned numeric fields
display(df[[salary_col, addl_col, "total_comp"]].head(10))

{'start_rows': 28062, 'after_usd_filter': 23257, 'after_missing_salary_filter': 23257, 'after_outlier_filter': 23257}


Unnamed: 0,"What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)","How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",total_comp
0,55000,0.0,55000.0
2,34000,,34000.0
3,62000,3000.0,65000.0
4,60000,7000.0,67000.0
5,62000,,62000.0
6,33000,2000.0,35000.0
7,50000,,50000.0
8,112000,10000.0,122000.0
9,45000,0.0,45000.0
10,47500,0.0,47500.0
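
The summary printed above shows the same count (23257) for every filter step, so it does not reveal how many rows each individual filter removed. One way to get per-step counts is to record the length immediately after each filter runs. The sketch below illustrates the idea on a small made-up list of records rather than the real DataFrame; `apply_filters` and `toy_rows` are hypothetical names, and the toy values are not survey data.

```python
def apply_filters(rows, steps):
    """Apply (name, predicate) filters in order, recording the count after each."""
    counts = {"start_rows": len(rows)}
    for name, keep in steps:
        rows = [row for row in rows if keep(row)]
        counts[f"after_{name}"] = len(rows)  # measured right after this step
    return rows, counts


# Made-up records for illustration only.
toy_rows = [
    {"currency": "USD", "salary": 55000},
    {"currency": "GBP", "salary": 54600},
    {"currency": "USD", "salary": None},
    {"currency": "USD", "salary": 5000},
    {"currency": "USD", "salary": 62000},
]
steps = [
    ("usd_filter", lambda r: r["currency"] == "USD"),
    ("missing_salary_filter", lambda r: r["salary"] is not None),
    ("outlier_filter", lambda r: 10_000 <= r["salary"] <= 1_000_000),
]

cleaned, counts = apply_filters(toy_rows, steps)
print(counts)
```

With pandas the same effect comes from storing `len(df)` in a variable directly after each `df = df[...]` line and printing those variables at the end.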


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [None]:
# Question 1: What is the median salary for Software Engineers in the United States?

# Define U.S. filter and Software Engineer keywords
country_col = "What country do you work in?"
state_col = "If you're in the U.S., what state do you work in?"
job_title_col = "Job title"

us_mask = df[country_col].eq("United States") if country_col in df.columns else pd.Series(False, index=df.index)

se_keywords = [
    "software engineer", "swe", "sde", "developer", "software developer",
    "backend", "back-end", "frontend", "front-end", "full stack", "full-stack",
]
exclude_keywords = ["manager", "director", "vp", "chief", "lead"]

# Build title mask
if job_title_col in df.columns:
    titles = df[job_title_col].astype(str).str.lower()
    include = titles.apply(lambda s: any(k in s for k in se_keywords))
    exclude = titles.apply(lambda s: any(k in s for k in exclude_keywords))
    se_mask = include & ~exclude
else:
    se_mask = pd.Series(False, index=df.index)

q1_df = df[us_mask & se_mask].copy()

median_q1 = q1_df["total_comp"].median()
count_q1 = len(q1_df)

print({
    "subset_rows": count_q1,
    "median_total_comp_usd": round(float(median_q1), 2) if pd.notnull(median_q1) else None,
})

display(q1_df[[job_title_col, "total_comp", state_col]].head(10))

{'subset_rows': 959, 'median_total_comp_usd': 135600.0}


Unnamed: 0,Job title,total_comp,"If you're in the U.S., what state do you work in?"
43,Principal Software Engineer,192500.0,Pennsylvania
46,Mobile developer,147100.0,Massachusetts
95,senior web application developer,98166.0,Indiana
321,Software Engineer,345000.0,Massachusetts
345,Developer,75500.0,Illinois
389,Software Engineer,112000.0,Florida
437,Backend Elixir Engineer,120000.0,Ohio
455,Software Engineer,83000.0,Minnesota
511,Senior Software Engineer,140000.0,Texas
557,Embedded Software Engineer,80000.0,Virginia


In [None]:
# Q1: Explicit answer output
country_col = "What country do you work in?"
job_title_col = "Job title"

# Recompute subset to be self-contained
us_mask = df[country_col].eq("United States") if country_col in df.columns else pd.Series(False, index=df.index)
se_keywords = [
    "software engineer", "swe", "sde", "developer", "software developer",
    "backend", "back-end", "frontend", "front-end", "full stack", "full-stack",
]
exclude_keywords = ["manager", "director", "vp", "chief", "lead"]

if job_title_col in df.columns:
    titles = df[job_title_col].astype(str).str.lower()
    include = titles.apply(lambda s: any(k in s for k in se_keywords))
    exclude = titles.apply(lambda s: any(k in s for k in exclude_keywords))
    se_mask = include & ~exclude
else:
    se_mask = pd.Series(False, index=df.index)

q1_df = df[us_mask & se_mask].copy()
median_q1 = q1_df["total_comp"].median()
count_q1 = len(q1_df)

if pd.notnull(median_q1):
    print(f"Q1: Median USD total compensation for Software Engineers in the United States: ${median_q1:,.0f} (n={count_q1})")
else:
    print("Q1: Unable to compute median (no matching rows)")

Q1: Median USD total compensation for Software Engineers in the United States: $135,600 (n=959)


In [43]:
# Question 2: Which US state has the highest average salary for tech workers?

country_col = "What country do you work in?"
state_col = "If you're in the U.S., what state do you work in?"
industry_col = "What industry do you work in?"
job_title_col = "Job title"

# U.S. filter
us_mask = df[country_col].eq("United States") if country_col in df.columns else pd.Series(False, index=df.index)

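# Define tech via industry or title keywords
# Note: these are substring matches, so short keywords such as "it" also match
# unrelated labels (for example, "nonprofits" contains "it").
tech_keywords = [
    "computing or tech", "software", "it", "information technology", "data", "machine learning",
]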
se_keywords = [
    "software engineer", "swe", "sde", "developer", "software developer",
    "backend", "back-end", "frontend", "front-end", "full stack", "full-stack",
    "data engineer", "ml engineer", "site reliability", "devops",
]

ind_mask = pd.Series(False, index=df.index)
title_mask = pd.Series(False, index=df.index)

if industry_col in df.columns:
    inds = df[industry_col].astype(str).str.lower()
    ind_mask = inds.apply(lambda s: any(k in s for k in tech_keywords))

if job_title_col in df.columns:
    titles = df[job_title_col].astype(str).str.lower()
    title_mask = titles.apply(lambda s: any(k in s for k in se_keywords))

tech_mask = ind_mask | title_mask

q2_df = df[us_mask & tech_mask].copy()

# Drop rows with a missing or blank state
if state_col in q2_df.columns:
    q2_df = q2_df[q2_df[state_col].notna() & (q2_df[state_col].astype(str).str.strip() != "")]

    # Split multi-state entries (e.g., "Georgia, Massachusetts") and explode,
    # so such a respondent is counted once in each listed state
    q2_df[state_col] = q2_df[state_col].astype(str)
    q2_df[state_col] = q2_df[state_col].str.split(',')
    q2_df = q2_df.explode(state_col, ignore_index=True)
    q2_df[state_col] = q2_df[state_col].astype(str).str.strip()

    # Normalize DC naming
    q2_df[state_col] = q2_df[state_col].replace({
        "DC": "District of Columbia",
        "D.C.": "District of Columbia",
        "Washington DC": "District of Columbia",
        # Cannot match: values containing a comma were already split above
        "Washington, DC": "District of Columbia",
    })

    # Remove empty remnants
    q2_df = q2_df[q2_df[state_col].astype(str).str.len() > 0]

# Compute average by state
state_avg = (
    q2_df.groupby(state_col, dropna=False)["total_comp"].mean().sort_values(ascending=False)
)

# Output results
if not state_avg.empty:
    top_state = state_avg.index[0]
    top_value = state_avg.iloc[0]
    print(f"Q2: Highest average USD total compensation for tech workers by U.S. state: {top_state} (${top_value:,.0f})")
    print("\nTop 5 states by average total compensation:")
    display(state_avg.head(5))
else:
    print("Q2: No data available after filtering.")

Q2: Highest average USD total compensation for tech workers by U.S. state: California ($165,752)

Top 5 states by average total compensation:


If you're in the U.S., what state do you work in?
California       165751.634712
Washington       161164.866142
New York         127250.740275
Colorado         124667.115385
Massachusetts    119597.986817
Name: total_comp, dtype: float64
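
The tech filter in this cell relies on substring checks (`k in s`), which means a two-letter keyword like "it" matches any label that happens to contain those letters. A hedged alternative is to match keywords only as whole words or phrases. In the sketch below, `build_keyword_pattern` is a hypothetical helper; "Computing or Tech" and "Nonprofits" are industry labels from the earlier output, while "IT Support" and "Hospitality" are made-up labels used only to show the difference.

```python
import re


def build_keyword_pattern(keywords):
    """Compile one regex that matches any keyword as a whole word or phrase."""
    # Longest first so multi-word phrases are tried before shorter keywords.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    # (?<!\w) and (?!\w) require a non-word character (or string edge) on each side.
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


tech_keywords = [
    "computing or tech", "software", "it", "information technology", "data", "machine learning",
]
pattern = build_keyword_pattern(tech_keywords)

for label in ["Computing or Tech", "Nonprofits", "IT Support", "Hospitality"]:
    substring_hit = any(k in label.lower() for k in tech_keywords)
    whole_word_hit = pattern.search(label) is not None
    print(f"{label!r}: substring={substring_hit}, whole_word={whole_word_hit}")
```

In the notebook's own style this could replace the lambda with `inds.apply(lambda s: pattern.search(s) is not None)`. Switching would change which rows count as tech, so the state averages above would need to be recomputed.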

In [44]:
# Question 3: How much does salary increase on average for each year of experience in tech?

country_col = "What country do you work in?"
industry_col = "What industry do you work in?"
job_title_col = "Job title"
exp_overall_col = "How many years of professional work experience do you have overall?"
exp_overall_num_col = exp_overall_col + " (years)"

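# Define tech via industry or title keywords (same as Q2)
# Note: these are substring matches, so short keywords such as "it" also match
# unrelated labels (for example, "nonprofits" contains "it").
tech_keywords = [
    "computing or tech", "software", "it", "information technology", "data", "machine learning",
]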
se_keywords = [
    "software engineer", "swe", "sde", "developer", "software developer",
    "backend", "back-end", "frontend", "front-end", "full stack", "full-stack",
    "data engineer", "ml engineer", "site reliability", "devops",
]

ind_mask = pd.Series(False, index=df.index)
title_mask = pd.Series(False, index=df.index)

if industry_col in df.columns:
    inds = df[industry_col].astype(str).str.lower()
    ind_mask = inds.apply(lambda s: any(k in s for k in tech_keywords))

if job_title_col in df.columns:
    titles = df[job_title_col].astype(str).str.lower()
    title_mask = titles.apply(lambda s: any(k in s for k in se_keywords))

tech_mask = ind_mask | title_mask

q3_df = df[tech_mask].copy()

# Use overall experience in years
if exp_overall_num_col not in q3_df.columns:
    raise RuntimeError("Experience years column not found. Ensure Step 2 mapping created '... (years)'.")

# Keep rows with valid numeric experience and comp; limit to reasonable range
q3_df = q3_df[q3_df[exp_overall_num_col].notna() & q3_df["total_comp"].notna()]
q3_df = q3_df[(q3_df[exp_overall_num_col] >= 0) & (q3_df[exp_overall_num_col] <= 40)]

x = q3_df[exp_overall_num_col].to_numpy(dtype=float)
y = q3_df["total_comp"].to_numpy(dtype=float)

if len(x) < 2:
    print("Q3: Not enough data to fit model.")
else:
    # Simple linear regression via polyfit (y = m*x + b)
    m, b = np.polyfit(x, y, 1)

    # Binned sanity check: group by rounded years
    q3_df["years_int"] = np.round(q3_df[exp_overall_num_col]).astype(int)
    binned = q3_df.groupby("years_int")["total_comp"].mean().reset_index()
    if len(binned) >= 2:
        m_bin, b_bin = np.polyfit(binned["years_int"].to_numpy(), binned["total_comp"].to_numpy(), 1)
    else:
        m_bin, b_bin = np.nan, np.nan

    print(
        f"Q3: Estimated average increase per year of experience in tech: ${m:,.0f} per year"
    )
    print(
        f"(Binned mean slope for sanity check: ${m_bin:,.0f} per year; n={len(q3_df)})"
    )
    display(binned.head(10))

Q3: Estimated average increase per year of experience in tech: $2,050 per year
(Binned mean slope for sanity check: $1,270 per year; n=7662)


   years_int     total_comp
0          3   89017.342604
1          6   98001.911550
2          9  109050.990759
3         16  132227.553891
4         26  141749.838679
5         36  126109.801047
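
A likely reason the two slopes above differ ($2,050 from the row-level fit vs. $1,270 from the binned fit) is that the binned fit gives every experience bucket the same weight, no matter how many respondents fall in it. The sketch below illustrates this on made-up toy numbers, not survey data: weighting each bin by the square root of its row count in `np.polyfit` recovers the row-level slope. The `toy` frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (made up, not from the survey): most rows sit in the low-experience bins.
toy = pd.DataFrame({
    "years": [3, 3, 3, 3, 6, 6, 6, 36],
    "total_comp": [80_000, 90_000, 85_000, 85_000, 100_000, 95_000, 105_000, 120_000],
})

# Row-level fit: every respondent counts once.
row_slope, _ = np.polyfit(
    toy["years"].to_numpy(dtype=float),
    toy["total_comp"].to_numpy(dtype=float),
    1,
)

# Binned fit: every bin counts once, however many rows it holds.
binned = toy.groupby("years")["total_comp"].agg(["mean", "size"]).reset_index()
x_bin = binned["years"].to_numpy(dtype=float)
y_bin = binned["mean"].to_numpy(dtype=float)
bin_slope, _ = np.polyfit(x_bin, y_bin, 1)

# polyfit applies w to the unsquared residuals, so sqrt(n) weights each bin by n.
w = np.sqrt(binned["size"].to_numpy(dtype=float))
weighted_slope, _ = np.polyfit(x_bin, y_bin, 1, w=w)

print(f"row-level: {row_slope:,.0f}  binned: {bin_slope:,.0f}  weighted binned: {weighted_slope:,.0f}")
```

Running the same comparison on `q3_df` with `years_int` would show how much of the gap between the two reported slopes comes from uneven bin sizes.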


In [45]:
# Question 4: What percentage of respondents work remotely vs. in-office?

context_col = "If your income needs additional context, please provide it here:"

if context_col not in df.columns:
    print("No context column available to search.")
else:
    ctx = df[context_col].astype(str).str.lower()
    # Simple keyword buckets
    remote_kw = ["remote", "work from home", "wfh", "fully remote"]
    office_kw = ["in office", "on-site", "onsite", "in-office"]
    hybrid_kw = ["hybrid", "part remote", "partly remote", "some days"]

    def label_mode(s: str) -> str:
        if any(k in s for k in remote_kw):
            if any(k in s for k in office_kw) or any(k in s for k in hybrid_kw):
                return "hybrid"
            return "remote"
        if any(k in s for k in office_kw):
            return "office"
        if any(k in s for k in hybrid_kw):
            return "hybrid"
        return "unknown"

    mode = ctx.apply(label_mode)

    counts = mode.value_counts(dropna=False)
    total = counts.sum()
    pct = (counts / total * 100).round(2)

    print("Heuristic remote vs in-office split from context field (labels may be noisy):")
    display(pd.DataFrame({"count": counts, "pct": pct}))
    print("As shown above, work modality is not widely collected in the dataset, so this question cannot be answered with the available data.")

Heuristic remote vs in-office split from context field (labels may be noisy):


         count    pct
If your income needs additional context, please provide it here:
unknown  23229  99.88
remote      27   0.12
office       1   0.00


As shown above, work modality is not widely collected in the dataset, so this question cannot be answered with the available data.


In [None]:
# Question 5: Which industry (besides tech) has the highest median salary?

industry_col = "What industry do you work in?"

# Exclude tech industry labels and blanks
nonempty = df[industry_col].notna() & (df[industry_col].astype(str).str.strip() != "")

# Normalize industry text
inds = df.loc[nonempty, industry_col].astype(str).str.strip().str.lower()

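# Define tech labels/phrases to exclude.
# Caveat: the check below is a substring match, so short terms such as "it" and
# "tech" also exclude any industry label that merely contains those letters.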
tech_industry_terms = {
    "computing or tech",
    "information technology",
    "software",
    "tech",
    "it",
}

is_tech_industry = inds.apply(lambda s: any(term in s for term in tech_industry_terms))
non_tech_idx = inds.index[~is_tech_industry]

q5_df = df.loc[non_tech_idx].copy()

# Keep valid industry labels
good_ind_mask = q5_df[industry_col].notna() & (q5_df[industry_col].astype(str).str.strip() != "")
q5_df = q5_df[good_ind_mask]

# Winsorize total_comp at 1st/99th percentiles to reduce outlier impact
q_low = q5_df["total_comp"].quantile(0.01)
q_high = q5_df["total_comp"].quantile(0.99)
q5_df["total_comp_w"] = q5_df["total_comp"].clip(lower=q_low, upper=q_high)

# Apply minimum count threshold per industry (e.g., n >= 25)
counts = q5_df.groupby(industry_col)["total_comp_w"].size()
valid_inds = counts[counts >= 25].index
q5_df = q5_df[q5_df[industry_col].isin(valid_inds)]

# Compute median by industry on winsorized comp
median_by_ind = (
    q5_df.groupby(industry_col)["total_comp_w"].median().sort_values(ascending=False)
)

if not median_by_ind.empty:
    top_industry = median_by_ind.index[0]
    top_value = median_by_ind.iloc[0]
    print(f"Q5: Highest median USD total compensation (excluding tech, n>=25, winsorized): {top_industry} (${top_value:,.0f})")
    print("\nTop 5 industries by median total compensation (non-tech):")
    display(median_by_ind.head(5))
else:
    print("Q5: No non-tech industry data available after filtering (n>=25).")

Q5: Highest median USD total compensation (excluding tech, n>=25, winsorized): Law ($100,000)

Top 5 industries by median total compensation (non-tech):


What industry do you work in?
Law                              100000.0
Business or Consulting            98000.0
Engineering or Manufacturing      95000.0
Sales                             86000.0
Accounting, Banking & Finance     85000.0
Name: total_comp_w, dtype: float64
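
The tech filters in the cells above use `any(term in s ...)`, which is a substring test, so a short term like "it" matches any label that contains those two letters. A minimal sketch of a stricter alternative, using regex word boundaries, is shown below. The `industries` values are illustrative examples rather than rows from the survey, and this is one possible approach, not the method used to produce the results above.

```python
import re

import pandas as pd

tech_terms = ["computing or tech", "information technology", "software", "tech", "it"]

# Illustrative labels only (not taken from the survey file).
industries = pd.Series(["Computing or Tech", "Nonprofits", "Hospitality & Events", "IT", "Law"])
norm = industries.str.strip().str.lower()

# Substring test, as in the cells above: "it" hits "nonprofits" and "hospitality".
substring_hits = norm.apply(lambda s: any(t in s for t in tech_terms))

# Word-boundary test: a term only matches when it stands alone as a word or phrase.
pattern = r"\b(?:" + "|".join(re.escape(t) for t in tech_terms) + r")\b"
boundary_hits = norm.str.contains(pattern, regex=True)

print(pd.DataFrame({"industry": industries, "substring": substring_hits, "boundary": boundary_hits}))
```

One trade-off: with word boundaries, a compound label such as "biotech" no longer counts as tech, so the term list would need to name any compounds that should be included.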

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $135,600
2. **Highest paying US state for tech:** California
3. **Salary increase per year of experience:** $2,050 per year
4. **Remote vs office percentage:** Work modality is not widely collected in the dataset. So this question cannot be answered with the available data.
5. **Highest paying non-tech industry:** Law

**Key insights:**
- The major U.S. tech hubs on the East Coast (New York and Massachusetts) and the West Coast (California and Washington) account for four of the top five states by average pay. This is expected: decades of technological innovation have taken place in these areas, which has made them popular places to move to, and the high cost of living requires bigger salaries to support yourself.
- Kids are often told that STEM-oriented careers are the most lucrative, but 3 of the 5 highest-paying non-tech industries are business related: Business or Consulting, Sales, and Accounting, Banking & Finance. Law, another non-STEM career path, tops the list as the best-paying industry outside of tech. Engineering or Manufacturing is the only STEM-centered industry in the top 5, which suggests the advice kids get is misleading.
- There will be times when certain business questions cannot be answered by your data. A great example is question 4 in this assignment. The dataset collects almost no information about work modality, so the question cannot really be answered. Business questions either need to fall within the scope of the data available in the dataset, or the data needs to be enriched with more information before those questions can be answered.
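
The "can this question be answered?" judgment from the last insight can be made explicit with a simple coverage check. The sketch below reuses the label counts printed in the Question 4 output. The 50% cutoff (`MIN_COVERAGE`) is a hypothetical threshold chosen for illustration, not a rule from the assignment.

```python
import pandas as pd

# Label counts copied from the Question 4 output above.
label_counts = pd.Series({"unknown": 23229, "remote": 27, "office": 1})

# Share of responses that received any usable work-mode label.
coverage = label_counts.drop("unknown").sum() / label_counts.sum()

# Hypothetical cutoff: require at least half the rows to be labeled before reporting a split.
MIN_COVERAGE = 0.5
answerable = coverage >= MIN_COVERAGE

print(f"coverage: {coverage:.2%}, answerable: {answerable}")
```

With well under 1% of responses labeled, any remote vs. office percentage would describe only a handful of people, which supports reporting the question as unanswerable.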

**Challenges faced:**
- I had a lot of trouble uploading the dataset to Cursor. I am not sure if it was because of the file type or because I used the free version, but I could not upload it the way I did with my .ipynb file. I eventually solved it by downloading the dataset locally, opening it in Cursor, and then copying and pasting the results from the Agent into my Codespace to complete this assignment.
- Question 4 was confusing because Cursor told me there was no work modality data collected in the dataset. At first I thought it was a hallucination, until I checked the file myself and saw that there was no column for work location. Cursor expanded on this by creating a script that showed over 99% of the responses had no work modality mentioned. This cleared things up a lot and let me answer the question properly.

**What you learned about vibe coding:**
- Sometimes AI overcomplicates things. For our earlier data cleaning homework, the code was not nearly as long or complex as the code generated by my AI agent for this assignment.
- Breaking things up helps both the AI and me. I found that splitting up the code and steps allowed me to better track the AI's changes and not feel as overwhelmed or lost in my own code file.
- Double checking is necessary. There were times when Cursor said it had arrived at a solution without actually providing an answer (one state was asked for, but Cursor provided a list of 10).