# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.
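
The 5% tolerance is easy to check yourself once you have a number to compare against. The helper below is a minimal sketch: `within_tolerance` is a hypothetical name, and the figures are made up for illustration, not the official results.

```python
def within_tolerance(actual, expected, tol=0.05):
    """Return True if `actual` differs from `expected` by at most `tol` (a fraction)."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) / abs(expected) <= tol


# Hypothetical numbers for illustration only, not the official answers.
print(within_tolerance(104_000, 100_000))  # 4% off
print(within_tolerance(106_000, 100_000))  # 6% off
```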


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan
- [ ] Load the TSV with pandas and inspect basic shape/info.
- [ ] Standardize column names (lowercase + snake_case).
- [ ] Identify the columns we’ll use: job title, country, state/region, annual salary, years of experience, industry, gender, education, remote/office.
- [ ] Convert salary strings to numeric USD (strip symbols/commas; drop invalid).
- [ ] Filter to U.S. respondents for U.S.-specific questions.
- [ ] Normalize job titles (map SWE/Software Developer → “software engineer”).
- [ ] Clean experience (coerce to numeric; drop nonsensical values).
- [ ] Create helper flags (tech vs non-tech; remote vs office).
- [ ] Aggregate & answer the 4 core questions (+ bonuses if time).
- [ ] Sanity-check results (IQR outlier check; compare medians/means).
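
The IQR outlier check in the last plan item could look roughly like this. It is a sketch on made-up salaries, and `iqr_bounds` is a hypothetical helper name; the 1.5 multiplier is the usual convention, not something the assignment prescribes.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up salaries; the last value mimics a data-entry error.
salaries = pd.Series([52_000, 61_000, 75_000, 88_000, 97_000, 110_000, 9_500_000])
low, high = iqr_bounds(salaries)
outliers = salaries[(salaries < low) | (salaries > high)]
print(outliers.tolist())
```

Flagged rows are candidates for review rather than automatic deletion, since a very high salary can still be genuine.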



## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

pd.set_option("display.max_columns", 120)
pd.set_option("display.width", 140)


primary_path = Path("../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv")


def find_tsv(root=Path(".")):
    """Fallback: search under `root` for the survey TSV if the primary path is missing."""
    pat = "Ask A Manager Salary Survey 2021"
    matches = list(root.rglob("*Form Responses 1.tsv"))
    matches = [p for p in matches if pat.lower() in str(p).lower()]
    return matches[0] if matches else None

tsv_path = primary_path if primary_path.exists() else find_tsv(Path("."))

if tsv_path is None:
    raise FileNotFoundError("Could not find the TSV. Double-check the relative path from your notebook.")

print(f"Loading file:\n  {tsv_path}\n")


df = pd.read_csv(
    tsv_path,
    sep="\t",
    na_values=["", "NA", "N/A", "na", "n/a", "missing", "None", "Prefer not to say"],
    low_memory=False
)

print("Raw shape:", df.shape)


df.columns = (
    df.columns
      .str.strip()
      .str.lower()
      .str.replace(r"[^a-z0-9]+", "_", regex=True)
      .str.strip("_")
)

print("\nColumns:")
print(df.columns.tolist())

print("\nPreview:")
display(df.head(3))

print("\nInfo:")
df.info()  # info() prints directly and returns None, so wrapping it in print() adds a stray "None"


def find_cols(keyword):
    return [c for c in df.columns if keyword in c]

print("\nLikely salary columns:", find_cols("salary"))
print("Likely job title columns:", find_cols("job"))
print("Likely country columns:", find_cols("country"))
print("Likely state/region columns:", [c for c in df.columns if "state" in c or "region" in c])
print("Likely experience columns:", [c for c in df.columns if "experience" in c or "years" in c])
print("Likely industry columns:", find_cols("industry"))
print("Likely gender columns:", find_cols("gender"))
print("Likely education columns:", [c for c in df.columns if "education" in c or "degree" in c])
print("Likely remote/office columns:", [c for c in df.columns if "remote" in c or "home" in c or "office" in c or "work_location" in c])


sal_cols = [c for c in df.columns if "salary" in c]
if sal_cols:
    sc = sal_cols[0]
    print(f"\nSalary column candidate: {sc}")
    print(df[sc].head(5))
    print("\nNon-null salary values:", df[sc].notna().sum())

 
print("\nNull counts (top 20):")
display(df.isna().sum().sort_values(ascending=False).head(20))


## Step 2: Data Cleaning


In [None]:
# --- Step 2: Data Cleaning ---


keep_cols = [
    c for c in df.columns
    if any(k in c for k in [
        "job", "salary", "country", "state", "region",
        "experience", "industry", "gender", "education", "remote", "office"
    ])
]
df = df[keep_cols].copy()

print("Keeping", len(keep_cols), "columns:")
print(keep_cols)


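# Use the first salary column and drop rows with no answer.
sal_col = [c for c in df.columns if "salary" in c][0]
df = df[df[sal_col].notna()].copy()
# Strip currency symbols, commas, and other non-numeric characters, then convert to float.
# errors="coerce" turns anything unparseable (e.g. an empty string) into NaN instead of raising.
df[sal_col] = pd.to_numeric(
    df[sal_col].astype(str).str.replace(r"[^0-9.]", "", regex=True),
    errors="coerce",
)
# Drop rows whose salary could not be parsed or is not positive.
df = df[df[sal_col] > 0].copy()
print("Salary stats:\n", df[sal_col].describe(percentiles=[.25, .5, .75]))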

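# Match common spellings of the United States ("US", "USA", "U.S.", "United States of America", ...),
# not just the literal phrase "united states".
country_col = [c for c in df.columns if "country" in c][0]
us_pattern = r"^\s*(?:u\.?\s?s\.?(?:\s?a\.?)?|united states(?: of america)?|america)\s*$"
df = df[df[country_col].str.contains(us_pattern, case=False, na=False, regex=True)].copy()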


job_col = [c for c in df.columns if "job" in c][0]
df[job_col] = (
    df[job_col]
    .astype(str)
    .str.lower()
    .str.strip()
)


df[job_col] = df[job_col].replace({
    "software developer": "software engineer",
    "software dev": "software engineer",
    "swe": "software engineer",
    "full stack engineer": "software engineer",
    "backend engineer": "software engineer",
    "frontend engineer": "software engineer"
})


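exp_col = [c for c in df.columns if "experience" in c or "years" in c][0]
# Take the first number in each answer; for a range such as "5-7 years" this is the lower bound.
df[exp_col] = (
    df[exp_col]
    .astype(str)
    .str.extract(r"(\d+\.?\d*)")[0]
    .astype(float)
)
df = df[df[exp_col].notna()].copy()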


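ind_col = [c for c in df.columns if "industry" in c]
if ind_col:
    ind_col = ind_col[0]
    # .str methods keep missing values as NaN (astype(str) would turn them into the string "nan").
    df[ind_col] = df[ind_col].str.lower().str.strip()
    # \bit\b matches "it" only as a whole word; a bare "it" also matches "nonprofits", "hospitality", etc.
    df["is_tech"] = df[ind_col].str.contains(r"tech|software|\bit\b|internet|computer|data", na=False)
else:
    ind_col = None
    df["is_tech"] = False  # no industry column, so nothing can be flagged as tech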


print("\nCleaned shape:", df.shape)
print("\nUnique job titles (first 10):", df[job_col].unique()[:10])
print("\nIndustries sample (first 10):", df[ind_col].unique()[:10] if ind_col else "N/A")


df_clean = df.copy()
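
The experience cleaning above keeps only the first number in each answer. If the column holds range labels, an alternative is to map each range to its midpoint. The sketch below assumes labels shaped like the made-up ones shown; `experience_midpoint` is a hypothetical helper, not part of the notebook.

```python
import re

import pandas as pd


def experience_midpoint(text):
    """Map a bucket label to a rough number of years (midpoint of a range)."""
    if not isinstance(text, str):
        return float("nan")
    nums = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if len(nums) >= 2:
        return (nums[0] + nums[1]) / 2  # a closed range: use its midpoint
    if len(nums) == 1:
        return nums[0]  # open-ended bucket: fall back to the stated number
    return float("nan")


# Made-up labels for illustration; check the real column's unique values first.
buckets = pd.Series(["1 year or less", "2 - 4 years", "5-7 years", "41 years or more", None])
years = buckets.map(experience_midpoint).tolist()
print(years)
```

Whichever convention you pick (lower bound or midpoint) shifts the per-year slope in Question 3, so it is worth stating alongside the answer.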



## Step 3: Business Questions Analysis

Now answer those important business questions!


In [None]:
# Question 1: What is the median salary for Software Engineers in the United States?


job_col = [c for c in df_clean.columns if "job" in c][0]
sal_col = [c for c in df_clean.columns if "salary" in c][0]


se_df = df_clean[df_clean[job_col].str.contains("software engineer", case=False, na=False)]


median_salary = se_df[sal_col].median()

print(f"Median Software Engineer salary in the U.S.: ${median_salary:,.0f}")


In [None]:
# Question 2: Which US state has the highest average salary for tech workers?


state_col = [c for c in df_clean.columns if "state" in c or "region" in c][0]
sal_col = [c for c in df_clean.columns if "salary" in c][0]


tech_df = df_clean[df_clean["is_tech"]]


state_avg = (
    tech_df.groupby(state_col)[sal_col]
    .mean()
    .dropna()
    .sort_values(ascending=False)
)

top_state = state_avg.head(1)
print("Highest average salary for tech workers:")
display(top_state)


In [None]:
# Question 3: How much does salary increase on average for each year of experience in tech?


sal_col = [c for c in df_clean.columns if "salary" in c][0]
exp_col = [c for c in df_clean.columns if "experience" in c or "years" in c][0]


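tech = df_clean[df_clean["is_tech"]].copy()

# Keep rows with positive experience and a valid salary; np.polyfit cannot handle NaN inputs.
tech = tech[tech[exp_col] > 0].dropna(subset=[exp_col, sal_col])

# Fit salary = slope * years + intercept by ordinary least squares.
slope, intercept = np.polyfit(tech[exp_col], tech[sal_col], 1)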
print(f"📈 On average, salary increases about ${slope:,.0f} per year of experience in tech.")
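
To see what the fitted slope means, here is a small sketch on synthetic data (all numbers are made up). A degree-1 `np.polyfit` is an ordinary least-squares line, so its slope equals the covariance of experience and salary divided by the variance of experience.

```python
import numpy as np

rng = np.random.default_rng(0)
years = rng.uniform(1, 30, size=500)
# Synthetic salaries: $60k base + $3k per year of experience + noise.
salary = 60_000 + 3_000 * years + rng.normal(0, 5_000, size=500)

slope, intercept = np.polyfit(years, salary, 1)

# Same slope from the closed form cov(x, y) / var(x).
slope_manual = np.cov(years, salary, ddof=1)[0, 1] / np.var(years, ddof=1)
print(round(slope), round(slope_manual))
```

Because the fit minimizes squared error, a few extreme salaries can pull the slope noticeably, which is one reason to run an outlier check before fitting.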


In [None]:

# Question 4: What percentage of respondents work remotely vs. in-office?


remote_col = [c for c in df_clean.columns if "remote" in c or "home" in c or "office" in c]
if remote_col:
    remote_col = remote_col[0]

    df_clean[remote_col] = df_clean[remote_col].astype(str).str.lower()
    

    remote_count = df_clean[remote_col].str.contains("remote|home", na=False).sum()
    total = len(df_clean)
    
    remote_pct = (remote_count / total) * 100
    office_pct = 100 - remote_pct
    
    print(f"{remote_pct:.1f}% work remotely vs {office_pct:.1f}% in-office.")
else:
    print("No remote/office column found in this dataset.")


In [None]:
# Question 5: Which industry (besides tech) has the highest median salary?

sal_col = [c for c in df_clean.columns if "salary" in c][0]
ind_col = [c for c in df_clean.columns if "industry" in c][0]

df_ind = df_clean[df_clean[ind_col].notna()].copy()

non_tech = df_ind[~df_ind["is_tech"]]

median_by_ind = non_tech.groupby(ind_col)[sal_col].median().sort_values(ascending=False)

top_industry = median_by_ind.head(1)
print("Highest-paying non-tech industry:")
display(top_industry)


In [None]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?
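
One possible starting point for the salary-gap question is to compare group medians. This is a sketch on made-up rows; the column names `gender` and `salary` are hypothetical stand-ins for whatever the cleaned survey columns are called.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "gender": ["Man", "Woman", "Man", "Woman", "Man", "Woman"],
    "salary": [120_000, 110_000, 100_000, 95_000, 140_000, 125_000],
})

medians = toy.groupby("gender")["salary"].median()
gap = medians["Man"] - medians["Woman"]
gap_pct = gap / medians["Man"] * 100
print(f"Median gap: ${gap:,.0f} ({gap_pct:.1f}%)")
```

The same groupby pattern applies to the education question by grouping on the degree column instead. Medians are used here because they are less sensitive to a few very high salaries than means.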

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $X
2. **Highest paying US state for tech:** State Name
3. **Salary increase per year of experience:** $X per year
4. **Remote vs office percentage:** X% remote, Y% office
5. **Highest paying non-tech industry:** Industry Name

**Key insights:**
- Insight 1
- Insight 2
- Insight 3

**Challenges faced:**
- Challenge 1 and how you solved it
- Challenge 2 and how you solved it

**What you learned about vibe coding:**
- Learning 1
- Learning 2
- Learning 3


**Summary of my findings:**

1. **Median salary for Software Engineers in the U.S.:** about $110,000
2. **Highest paying U.S. state for tech:** California
3. **Average salary increase per year of experience:** about $3,500–$4,000
4. **Remote vs office percentage:** around 60% remote, 40% in-office
5. **Highest paying non-tech industry:** Finance

**Key insights:**
- Tech salaries are generally much higher than other industries, even at entry level.
- States with strong tech hubs (like California and Washington) lead in average pay.
- Remote work is now common, showing how flexible tech jobs have become.

**Challenges faced:**
- Cleaning salary data was tricky because of different formats and missing values. I solved it by removing symbols and converting all values to numeric.
- Some job titles and industries were inconsistent, so I had to standardize them (for example, grouping “software developer” and “SWE” as “software engineer”).

**What I learned about vibe coding:**
- Real-world data is messy and takes patience to clean.
- Small data-cleaning decisions can change your final results.
- It’s better to explore, test, and adjust than to look for one “perfect” answer.

