# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

- Acquire the survey TSV and skim columns; draft a brief data dictionary.
- Load the TSV robustly; infer data types; quantify missingness and duplicates.
- Normalize compensation to annual USD (base; base+bonus as variant).
- Set sanity bounds and outlier rules; winsorize or exclude extremes.
- Standardize job titles; map synonyms; tag the Software Engineer cohort.
- Define the “tech worker” cohort with explicit inclusion/exclusion rules.
- Clean and consolidate industry labels into canonical categories.
- Normalize US locations; map to state codes; filter to US respondents.
- Parse years-of-experience fields; convert to numeric; handle ranges/bins.
- Build analysis-ready tables per question with consistent filters/features.
- Q1: Compute median salary for US Software Engineers; report N and CI.
- Q2: Compute mean salary by state for tech workers; apply min-N threshold; identify top state.
- Q3: Estimate salary increase per year of experience (OLS and robust median-slope); check nonlinearity.
- Q4: Compute median salary by industry excluding tech; apply min-N; identify highest.
- Run sensitivity checks (base vs total comp; with/without outliers; alt cohort rules).
- Produce clear tables/plots; export artifacts; note caveats and assumptions.
- Package in a reproducible notebook/script; save versioned outputs.

This plan covers ingestion, cleaning, cohort definitions, per-question computations, QA/sensitivity checks, and deliverables.
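
The plan mentions winsorizing as an alternative to excluding extremes, while the cleaning code further down only applies hard bounds. As a rough illustration of the alternative, here is a minimal stdlib sketch. The function name `winsorize`, the cut points, and the demo salaries are all made up for this example, and the quantile rule (a simple floor-index lookup) is an assumption; other quantile definitions would give slightly different cut values.

```python
def winsorize(values, lower_pct=0.01, upper_pct=0.99):
    """Clamp values outside the given quantiles to the quantile values.

    Hypothetical helper for illustration; not part of the notebook's pipeline.
    """
    if not values:
        return []
    ordered = sorted(values)
    n = len(ordered)
    # Floor-index quantile lookup (an assumption; interpolation is also common)
    lo = ordered[int(lower_pct * (n - 1))]
    hi = ordered[int(upper_pct * (n - 1))]
    # Keep every row, but pull extreme values in to the cut points
    return [min(max(v, lo), hi) for v in values]


# Made-up salaries with one implausible entry
salaries = [30_000, 45_000, 50_000, 52_000, 55_000,
            60_000, 65_000, 70_000, 90_000, 5_000_000]
clipped = winsorize(salaries, lower_pct=0.1, upper_pct=0.9)
print(clipped)
```

Unlike exclusion, winsorizing keeps the row count unchanged, which matters for means (Q2, Q3) more than for medians (Q1, Q4).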


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


/Users/tableleaf/Documents/GitHub/ds-fall-2025-fri-1230/Week-05-Vibe-Coding-101/homeworks


## Step 2: Data Cleaning


In [None]:
import csv, re, statistics
from pathlib import Path

# Provide the exact filename once; code will find it anywhere in the repo if needed
FILENAME = "Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv"

# Put your absolute path here if known; otherwise leave as None
USER_ABS_PATH = "/Users/tableleaf/Documents/GitHub/ds-fall-2025-fri-1230/Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv"

# Common relative fallback within repo
RELATIVE_PATH = "Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv"

def locate_tsv() -> str:
    candidates = []
    # 1) User absolute path
    if USER_ABS_PATH:
        p = Path(USER_ABS_PATH).expanduser()
        if p.is_file():
            return str(p)
        candidates.append(USER_ABS_PATH)
    # 2) Relative known path
    p = Path(RELATIVE_PATH)
    if p.is_file():
        return str(p)
    candidates.append(RELATIVE_PATH)
    # 3) Recursive search by filename
    for hit in Path(".").rglob(FILENAME):
        if hit.is_file():
            return str(hit.resolve())
    raise FileNotFoundError(
        f"Could not locate TSV. Tried: {candidates} and recursive search for {FILENAME}."
    )

COL_SALARY = "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)"
COL_BONUS = "How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits."
COL_CURR = "Please indicate the currency"
COL_COUNTRY = "What country do you work in?"
COL_STATE = "If you're in the U.S., what state do you work in?"
COL_INDUSTRY = "What industry do you work in?"
COL_TITLE = "Job title"
COL_YOE_FIELD = "How many years of professional work experience do you have in your field?"

US_NAMES = {s.lower() for s in ["united states","united states of america","usa","us","u.s.","u.s.a.","america"]}

STATE_NAME_TO_CODE = {
    'alabama':'AL','alaska':'AK','arizona':'AZ','arkansas':'AR','california':'CA','colorado':'CO','connecticut':'CT','delaware':'DE','district of columbia':'DC',
    'florida':'FL','georgia':'GA','hawaii':'HI','idaho':'ID','illinois':'IL','indiana':'IN','iowa':'IA','kansas':'KS','kentucky':'KY','louisiana':'LA','maine':'ME',
    'maryland':'MD','massachusetts':'MA','michigan':'MI','minnesota':'MN','mississippi':'MS','missouri':'MO','montana':'MT','nebraska':'NE','nevada':'NV',
    'new hampshire':'NH','new jersey':'NJ','new mexico':'NM','new york':'NY','north carolina':'NC','north dakota':'ND','ohio':'OH','oklahoma':'OK','oregon':'OR',
    'pennsylvania':'PA','rhode island':'RI','south carolina':'SC','south dakota':'SD','tennessee':'TN','texas':'TX','utah':'UT','vermont':'VT','virginia':'VA',
    'washington':'WA','west virginia':'WV','wisconsin':'WI','wyoming':'WY','puerto rico':'PR'
}

YOE_MAP = {
    "1 year or less": 0.5, "2 - 4 years": 3.0, "5-7 years": 6.0, "8 - 10 years": 9.0,
    "11 - 20 years": 15.5, "21 - 30 years": 25.5, "31 - 40 years": 35.5, "41 years or more": 45.0,
}

def parse_money(text):
    if not text: return None
    t = str(text).strip().replace(",", "")
    m = re.findall(r"[-+]?[0-9]*\.?[0-9]+", t)
    return float(m[0]) if m else None

def is_us(country):
    return bool(country) and country.strip().lower() in US_NAMES

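def normalize_state(state_raw):
    if not state_raw: return None
    # Multi-select answers such as "California, Oregon" keep only the first state listed
    first = state_raw.split(",")[0].strip().lower()
    # Accept two-letter input only if it is a known code (e.g., CA), not any two letters
    if len(first) == 2 and first.upper() in STATE_NAME_TO_CODE.values():
        return first.upper()
    return STATE_NAME_TO_CODE.get(first)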

ENGINEER_OR_DEV = re.compile(r"\b(engineer|developer|dev)\b", re.I)

def title_is_software_engineer(title):
    if not title: return False
    t = title.strip().lower()
    if re.search(r"\bsoftware\s+(engineer|developer)\b", t): return True
    if re.search(r"\b(full\s*stack|front\s*end|back\s*end|android|ios|mobile)\b", t) and ENGINEER_OR_DEV.search(t): return True
    if re.search(r"\b(sde|swe)\b", t): return True
    if re.search(r"\b(staff|senior|principal)\s+software\b", t): return True
    return False

def is_tech_worker(industry, title):
    ind = (industry or "").strip().lower()
    t = (title or "").strip().lower()
    if "computing or tech" in ind: return True
    if re.search(r"\b(software|developer|sde|swe|devops|sre|data\s+(scientist|engineer)|security\s+engineer|cloud|infrastructure)\b", t): return True
    if re.search(r"\b(front\s*end|back\s*end|full\s*stack)\b", t) and ENGINEER_OR_DEV.search(t): return True
    return False

def load_clean_rows(path=None):
    if path is None:
        path = locate_tsv()
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            # DictReader fills missing fields in short rows with None, so guard before .strip()
            if (row.get(COL_CURR) or "").strip() != "USD": continue
            if not is_us(row.get(COL_COUNTRY)): continue
            salary = parse_money(row.get(COL_SALARY, ""))
            if salary is None or not (10000 <= salary <= 1_000_000): continue
            bonus = parse_money(row.get(COL_BONUS, "")) or 0.0
            rows.append({
                "salary": float(salary),
                "total_comp": float(salary + bonus),
                "state": normalize_state(row.get(COL_STATE, "")),
                "industry": row.get(COL_INDUSTRY, ""),
                "title": row.get(COL_TITLE, ""),
                "yoe": YOE_MAP.get((row.get(COL_YOE_FIELD) or "").strip()),
                "is_se": title_is_software_engineer(row.get(COL_TITLE, "")),
                "is_tech": is_tech_worker(row.get(COL_INDUSTRY, ""), row.get(COL_TITLE, "")),
            })
    return rows

def fmt_usd(x): return f"{int(round(x)):,}"

rows = load_clean_rows()




## Step 3: Business Questions Analysis

Now answer those important business questions!


In [25]:
# Question 1: What is the median salary for Software Engineers in the United States?
# Q1: median salary for US Software Engineers (base salary)
se_salaries = [r["salary"] for r in rows if r["is_se"]]
q1_median = statistics.median(se_salaries)
print(fmt_usd(q1_median))

140,000
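
The plan calls for reporting N and a confidence interval alongside the Q1 median, which the cell above does not do. One common way to get an interval for a median is a percentile bootstrap. The sketch below is stdlib only; `bootstrap_median_ci`, its defaults, and the demo values are hypothetical and chosen for illustration, not taken from the survey.

```python
import random
import statistics


def bootstrap_median_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median.

    Returns (sample_median, lower, upper). Hypothetical helper for illustration.
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(values)
    # Resample with replacement and record the median of each resample
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lower = medians[int((alpha / 2) * (n_boot - 1))]
    upper = medians[int((1 - alpha / 2) * (n_boot - 1))]
    return statistics.median(values), lower, upper


# Made-up salaries; in the notebook this would be the `se_salaries` list
demo = [95_000, 110_000, 120_000, 125_000, 140_000,
        140_000, 150_000, 165_000, 180_000, 210_000]
med, lower, upper = bootstrap_median_ci(demo)
print(len(demo), med, lower, upper)
```

Passing `se_salaries` instead of `demo` would give N, the median, and an approximate 95% interval for the Software Engineer cohort.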


In [31]:
# Question 2: Which US state has the highest average salary for tech workers?
from collections import defaultdict
import statistics

# Uses `rows` from the Step 2 cleaning cell

by_state = defaultdict(list)
for r in rows:
    if r['is_tech'] and r['state']:
        by_state[r['state']].append(r['salary'])

state_avg = {s: (sum(v)/len(v), len(v)) for s,v in by_state.items() if len(v) >= 30}
top_state, (top_avg, n) = max(state_avg.items(), key=lambda kv: kv[1][0])
print(top_state, round(top_avg), n)



CA 156415 696


In [36]:
# Question 3: How much does salary increase on average for each year of experience in tech?
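# Closed-form ordinary least squares slope of base salary on years of experience.
# `yoe` is the midpoint of each survey experience bin (see YOE_MAP), so the slope is approximate.
xy = [(r['yoe'], r['salary']) for r in rows if r['is_tech'] and r['yoe'] is not None]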
n = len(xy)
sx = sum(x for x,_ in xy); sy = sum(y for _,y in xy)
sxx = sum(x*x for x,_ in xy); sxy = sum(x*y for x,y in xy)
den = n * sxx - sx * sx
slope = (n * sxy - sx * sy) / den
print(round(slope, 2))

2359.58
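
The plan also lists a robust median-slope estimate for Q3, which the OLS cell above does not compute. A standard estimator of that kind is the Theil-Sen slope: the median of the slopes between all pairs of points. The sketch below is a minimal stdlib version; `theil_sen_slope`, the `max_pairs` cap, and the demo points are assumptions made for this example. Because the experience values are bin midpoints, many pairs share the same x and are skipped.

```python
import random
import statistics
from itertools import combinations


def theil_sen_slope(points, max_pairs=200_000, seed=0):
    """Median of pairwise slopes (Theil-Sen). Hypothetical helper for illustration."""
    n = len(points)
    total_pairs = n * (n - 1) // 2
    if total_pairs > max_pairs:
        # Too many pairs to enumerate: fall back to a random sample of pairs
        rng = random.Random(seed)
        pairs = ((rng.choice(points), rng.choice(points)) for _ in range(max_pairs))
    else:
        pairs = combinations(points, 2)
    # Pairs with equal x have no defined slope and are skipped
    slopes = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in pairs if x2 != x1]
    if not slopes:
        raise ValueError("need at least two points with different x values")
    return statistics.median(slopes)


# Made-up points on y = 2x + 1, plus one extreme outlier
demo_points = [(1, 3), (2, 5), (3, 7), (4, 9), (5, 11), (6, 500)]
print(theil_sen_slope(demo_points))
```

In the notebook, the `xy` list from the Q3 cell could be passed in directly. A large gap between this value and the OLS slope would suggest that a few extreme salaries are driving the OLS estimate.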


In [43]:

# Question 4: What percentage of respondents work remotely vs. in-office?

# Heuristic based on the free-text "What city do you work in?" field.
# Rows whose city text contains a remote keyword count as remote; rows containing "hybrid"
# count as hybrid; every other non-empty answer counts as in-office.
# Note: The survey lacks a dedicated remote flag, so this likely undercounts remote work.

import csv, re
from pathlib import Path

FILENAME = "Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv"
CANDIDATES = [
    "/Users/tableleaf/Documents/GitHub/ds-fall-2025-fri-1230/Week-02-Pandas-Part-2-and-DS-Overview/data/" + FILENAME,
    "Week-02-Pandas-Part-2-and-DS-Overview/data/" + FILENAME,
]

def locate():
    for p in CANDIDATES:
        if Path(p).is_file():
            return p
    for hit in Path(".").rglob(FILENAME):
        if hit.is_file():
            return str(hit.resolve())
    raise FileNotFoundError(FILENAME)

path = locate()
city_field = "What city do you work in?"
remote_rx = re.compile(r"\b(remote|wfh|work\s*from\s*home|telework|telecommute)\b", re.I)
hybrid_rx = re.compile(r"\bhybrid\b", re.I)

total = with_city = remote = hybrid = in_office = 0
with open(path, newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        total += 1
        city = (row.get(city_field) or "").strip()
        if not city:
            continue
        with_city += 1
        lc = city.lower()
        if remote_rx.search(lc):
            remote += 1
        elif hybrid_rx.search(lc):
            hybrid += 1
        else:
            in_office += 1

def pct(a, b):
    return round(a / b * 100, 2) if b else 0.0
print({"remote_pct": pct(remote, with_city),
       "hybrid_pct": pct(hybrid, with_city),
       "in_office_pct": pct(in_office, with_city),
       "denominator": with_city})

{'remote_pct': 1.2, 'hybrid_pct': 0.0, 'in_office_pct': 98.8, 'denominator': 28053}


In [42]:
# Question 5: Which industry (besides tech) has the highest median salary?

import statistics
from collections import defaultdict

by_ind = defaultdict(list)
for r in rows:
    ind = (r['industry'] or '').strip()
    if not ind: continue
    lower = ind.lower()
    if 'computing or tech' in lower or lower in {'tech','technology'}: continue
    by_ind[ind].append(r['salary'])

candidates = [(ind, statistics.median(vals)) for ind, vals in by_ind.items() if len(vals) >= 30]
top_ind, top_median = max(candidates, key=lambda kv: kv[1])
print(top_ind, round(top_median))


Biotech 110000


In [39]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $140,000
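2. **Highest paying US state for tech:** California ($156,415 average, n = 696)
3. **Salary increase per year of experience:** $2,359.58 per year (OLS slope)
4. **Remote vs office percentage:** 1.2% remote, 98.8% in-office (keyword heuristic on the city field; likely undercounts remote)
5. **Highest paying non-tech industry:** Biotech ($110,000 median)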

**Key insights:**
- AI is very useful, but if you don't understand the code, you can't tell whether the result is accurate.
- Be specific about the answer you want, because otherwise the AI will interpret the question on its own.
- Keep prompts short enough that they don't exceed the token limit.

**Challenges faced:**
- The first challenge I faced was running into errors. I showed Cursor the error I was getting, and it fixed the code.
- I did not face any other challenges.

**What you learned about vibe coding:**
- Have the AI prompt itself
- Be specific with your prompt
- You can feed the bugs to the AI
