# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

- [ ] Confirm dataset path and load TSV with proper delimiter and encoding
- [ ] Inspect schema: columns, dtypes, missingness, sample rows, unique value checks
- [ ] Standardize column names to snake_case and trim whitespace
- [ ] Clean salary fields: parse numeric, handle ranges, strip symbols, convert all to USD
- [ ] Handle currencies: detect currency column/patterns, map FX rates, convert consistently
- [ ] Filter invalid/irrelevant rows (e.g., missing salary, non-employed, internships if out of scope)
- [ ] Remove exact duplicates and near-duplicates if any
- [ ] Winsorize or cap extreme salary outliers with transparent rule; keep raw copy
- [ ] Normalize job titles; create flag for tech roles and a `Software Engineer` canonical label
- [ ] Parse/normalize location: extract country, US state codes; map free-text to standard
- [ ] Parse years of experience; coerce text ranges to numeric (midpoint) and handle 0/NaN
- [ ] Normalize education levels to ordered categories (HS, Assoc, BA/BS, MS/MA, PhD)
- [ ] Standardize gender categories (male, female, nonbinary/other, prefer_not_say)
- [ ] Create a clean, analysis-ready dataframe `df_clean` with documented transformations
- [ ] Validate with spot checks: distribution of salaries, counts by country/state/role
- [ ] Q1: Compute median salary for `Software Engineer` in United States
- [ ] Q2: Compute average salary by US state for tech workers; identify max
- [ ] Q3: Fit simple model/regression or groupwise slope: salary increase per YoE in tech
- [ ] Q4: Identify highest median salary industry excluding tech
- [ ] Bonus: Gender pay gap in tech roles (controlled/uncontrolled)
- [ ] Bonus: Compare MS vs BA median salaries; significance test if feasible
- [ ] Create clear tables/plots supporting each answer
- [ ] Document assumptions and cleaning rules inline (markdown) next to code
- [ ] Summarize final answers; ensure results within ±5% of expected benchmarks
- [ ] Save cleaned dataset and key aggregates to CSV for reproducibility


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [9]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
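
A minimal sketch of the load-and-inspect step, using a tiny in-memory TSV in place of the real file so it runs on its own. The sample rows and the short column names (`Job title`, `Annual salary`, `Country`) are made up for illustration; for the real data, swap `io.StringIO(sample_tsv)` for the dataset path.

```python
import io
import pandas as pd

# Hypothetical stand-in for the survey file: tab-separated, with messy values.
sample_tsv = (
    "Job title\tAnnual salary\tCountry\n"
    "Software Engineer\t120,000\tUSA\n"
    "software engineer\t95000\tUnited States\n"
    "Teacher\t\tUK\n"
)

# Read everything as text first so nothing is silently reinterpreted.
raw = pd.read_csv(io.StringIO(sample_tsv), sep="\t", dtype=str)

print(raw.shape)          # (rows, columns)
print(raw.dtypes)         # every column holds strings at this point
print(raw.isna().sum())   # missing values per column
print(raw["Country"].value_counts(dropna=False))  # spot inconsistent spellings
```

Reading as text first is a design choice, not a requirement: it makes values such as `120,000` visible before any numeric conversion has a chance to drop them.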


## Step 2: Data Cleaning


In [10]:

df = pd.read_csv('/workspaces/ds-fall-2025-tue/Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv', sep='\t')
df = df.drop_duplicates()

df.rename(columns={
    "How old are you?": "age",
    "What industry do you work in?": "industry",
    "Job title": "title",
    "If your job title needs additional context, please clarify here:": "title_context",
    "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)": "salary",
    "How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.": "additional_comp",
    "Please indicate the currency": "currency",
    "If 'Other,' please indicate the currency here:": "other_currency",
    "If your income needs additional context, please provide it here:": "salary_context",
    "What country do you work in?": "country",
    "If you're in the U.S., what state do you work in?": "state",
    "What city do you work in?": "city",
    "How many years of professional work experience do you have overall?": "total_yoe",
    "How many years of professional work experience do you have in your field?": "field_yoe",
    "What is your highest level of education completed?": "education",
    "What is your gender?": "gender",
    "What is your race? (Choose all that apply.)": "race"
}, inplace=True)

df["country"] = df["country"].str.strip().str.title()
df['country'] = df['country'].replace({
    'United States of America': 'US',
    'United States': 'US',
    'U.S.': 'US',
    'USA': 'US',
    'U.S.A.': 'US'
})

df['title'] = df['title'].replace({
    'Software engineer': 'Software Engineer',
    'Senior Software Engineer' : 'Software Engineer',
    'Software Engineer III (VP)' : 'Software Engineer',
    'software engineer': 'Software Engineer',
})

df["salary"] = pd.to_numeric(df["salary"], errors="coerce")
df = df.dropna(subset=["salary", "title", "industry", "country", "city"])

# filtered = df[df['industry'] == 'Computing or Tech']
# filtered['title'].value_counts()
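
Two details in this cell are worth checking against the raw data. `.str.title()` turns `USA` into `Usa` and `United States of America` into `United States Of America`, so replacement keys written in their original casing may never match. And `pd.to_numeric(..., errors="coerce")` converts any salary written with a thousands separator or currency symbol to `NaN`, which the later `dropna` then removes. The sketch below shows one way to handle both, on made-up sample values; the column names `country_clean` and `salary_clean` and the alias set are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample with the kinds of values a free-text survey produces.
demo = pd.DataFrame({
    "country": ["USA", " usa", "U.S.", "United States of America", "us", "Canada"],
    "salary": ["120,000", "95000", "$80,000", "n/a", "150000", "70,000"],
})

# Normalize case and punctuation BEFORE matching, so one alias covers many spellings.
country_key = (
    demo["country"]
    .str.strip()
    .str.lower()
    .str.replace(".", "", regex=False)
)
us_aliases = {"us", "usa", "united states", "united states of america"}
demo["country_clean"] = demo["country"].str.strip()
demo.loc[country_key.isin(us_aliases), "country_clean"] = "US"

# Strip currency symbols and thousands separators before numeric conversion;
# anything that is still not a number (for example "n/a") becomes NaN.
salary_text = demo["salary"].str.replace(r"[^0-9.]", "", regex=True)
demo["salary_clean"] = pd.to_numeric(salary_text, errors="coerce")

print(demo[["country_clean", "salary_clean"]])
```

Comparing `df["salary"].isna().sum()` before and after the conversion on the real data shows how many rows each approach would discard.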


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [11]:
# Question 1: What is the median salary for Software Engineers in the United States?
filtered = df[
    (df['title'] == 'Software Engineer') &
    (df['country'] == 'US')
]

median_salary = filtered['salary'].median()

print(median_salary)

135200.0
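
The exact-match filter above depends on which spellings were mapped to `Software Engineer` by hand. A minimal sketch of two alternative rules, on made-up rows: a strict match after normalizing case and whitespace, and a looser phrase match. Whether senior titles belong in the answer is a judgment call that the original question does not settle.

```python
import pandas as pd

# Hypothetical sample rows; real titles are free text.
titles = pd.DataFrame({
    "title": ["Software Engineer", "software engineer ", "Senior Software Engineer",
              "Software Engineering Manager", "Data Analyst"],
    "country": ["US", "US", "US", "US", "US"],
    "salary": [120000, 100000, 160000, 190000, 80000],
})

normalized = titles["title"].str.strip().str.lower()
in_us = titles["country"] == "US"

# Strict rule: the title is exactly "software engineer" after normalization.
exact = titles[(normalized == "software engineer") & in_us]

# Looser rule: any title containing the phrase. The trailing word boundary
# keeps out "software engineering manager" but lets senior levels in.
loose_mask = normalized.str.contains(r"\bsoftware engineer\b", regex=True)
loose = titles[loose_mask & in_us]

print(exact["salary"].median())
print(loose["salary"].median())
```

Reporting the row count next to each median makes it clear how much the choice of rule changes the sample.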


In [12]:
# Question 2: Which US state has the highest average salary for tech workers?
tech_us = df[
    (df['industry'] == 'Computing or Tech') &
    (df['country'] == 'US')
]
state_avg_salary = tech_us.groupby("state")["salary"].mean().sort_values(ascending=False)

top_state = state_avg_salary.head(1)
print(top_state)

state
Connecticut    230000.0
Name: salary, dtype: float64
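
A mean of exactly 230000.0 may come from a single respondent, which would make the ranking fragile. A minimal sketch of reporting the response count next to the mean and requiring a minimum group size, on made-up rows; the threshold `MIN_RESPONSES` is an assumption to be chosen and documented.

```python
import pandas as pd

# Hypothetical tech-worker rows; one state has a single, very high response.
tech_demo = pd.DataFrame({
    "state": ["California"] * 4 + ["Washington"] * 3 + ["Connecticut"],
    "salary": [150000, 170000, 160000, 180000, 140000, 155000, 155000, 230000],
})

MIN_RESPONSES = 3  # assumed threshold; pick one and document it

by_state = tech_demo.groupby("state")["salary"].agg(["mean", "count"])
reliable = by_state[by_state["count"] >= MIN_RESPONSES]

print(by_state.sort_values("mean", ascending=False))
print(reliable["mean"].idxmax())
```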


In [23]:
# Question 3: How much does salary increase on average for each year of experience in tech?

tech = df[df["industry"] == 'Computing or Tech']

average_yoe_salary = tech.groupby('field_yoe')['salary'].mean()

salary_diffs = average_yoe_salary.diff()

avg_increase_per_year = salary_diffs.mean().round(2)
print(avg_increase_per_year)

6131.75
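
If `field_yoe` holds text ranges rather than numbers, `groupby` sorts the groups alphabetically, so `.diff()` compares buckets out of experience order and the average difference is not a per-year figure. A minimal sketch of an alternative: map each bucket to a numeric midpoint and fit a least-squares line. The bucket labels, the midpoints, and the demo salaries are assumptions; check the labels against `df["field_yoe"].unique()`.

```python
import numpy as np
import pandas as pd

# Assumed bucket labels and midpoints (in years).
YOE_MIDPOINTS = {
    "1 year or less": 0.5,
    "2 - 4 years": 3.0,
    "5-7 years": 6.0,
    "8 - 10 years": 9.0,
    "11 - 20 years": 15.5,
}

# Hypothetical tech rows built so salary rises by exactly 5,000 per year.
yoe_demo = pd.DataFrame({"field_yoe": list(YOE_MIDPOINTS) * 2})
yoe_demo["yoe_years"] = yoe_demo["field_yoe"].map(YOE_MIDPOINTS)
yoe_demo["salary"] = 60000 + 5000 * yoe_demo["yoe_years"]

# Alphabetical order puts "11 - 20 years" before "2 - 4 years".
print(sorted(YOE_MIDPOINTS))

# Least-squares slope of salary on numeric experience: dollars per year.
slope, intercept = np.polyfit(yoe_demo["yoe_years"], yoe_demo["salary"], deg=1)
print(round(float(slope), 2))
```

Midpoints are a simplification: an open-ended top bucket has no natural midpoint, so whatever value is used there should be stated with the answer.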


In [14]:

# Question 4: What percentage of respondents work remotely vs. in-office?
# Normalize text to avoid case issues
df["city"] = df["city"].str.strip().str.lower()

# Create a new column for work type: flag rows whose city text mentions "remote"
df["work_type"] = np.where(df["city"].str.contains("remote", na=False), "Remote", "In-office")

# Count percentages
work_counts = df["work_type"].value_counts(normalize=True) * 100

print(work_counts)


work_type
In-office    98.922498
Remote        1.077502
Name: proportion, dtype: float64


In [15]:
# Question 5: Which industry (besides tech) has the highest median salary?

non_tech = df[df['industry'] != 'Computing or Tech']

industry_medians = non_tech.groupby("industry")["salary"].median().sort_values(ascending=False)

highest_paying_industry = industry_medians.head(1)

print(highest_paying_industry)

industry
ESL Teacher    36000000.0
Name: salary, dtype: float64
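
A median of 36,000,000 suggests a write-in industry with very few respondents and a salary that is either mistyped or not in US dollars; that reading is an assumption, not something verified here. A minimal sketch of guarding against it by restricting to one currency and requiring a minimum group size, on made-up rows. The `currency` values and `MIN_RESPONSES` are assumptions.

```python
import pandas as pd

# Hypothetical rows: one free-text industry with a single extreme salary.
ind_demo = pd.DataFrame({
    "industry": ["Law"] * 3 + ["Nonprofits"] * 3 + ["ESL Teacher"],
    "currency": ["USD"] * 6 + ["Other"],
    "salary": [130000, 150000, 110000, 60000, 65000, 70000, 36000000],
})

MIN_RESPONSES = 3  # assumed threshold

# Compare like with like: keep a single currency, then drop tiny groups.
usd_only = ind_demo[ind_demo["currency"] == "USD"]
stats = usd_only.groupby("industry")["salary"].agg(["median", "count"])
stats = stats[stats["count"] >= MIN_RESPONSES].sort_values("median", ascending=False)

print(stats.head(1))
```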


In [16]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?
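
A minimal sketch of how the gender and education comparisons could start, on made-up rows. The category labels (`Man`, `Woman`, `College degree`, `Master's degree`) are assumptions and need to match the cleaned data; the medians here are uncontrolled, so they describe a raw gap rather than a like-for-like one.

```python
import pandas as pd

# Hypothetical cleaned rows for tech roles.
bonus_demo = pd.DataFrame({
    "gender": ["Man", "Man", "Woman", "Woman", "Man", "Woman"],
    "education": ["College degree", "Master's degree", "College degree",
                  "Master's degree", "Master's degree", "College degree"],
    "salary": [100000, 130000, 90000, 120000, 140000, 95000],
})

# Raw gender gap: difference in medians, also as a share of the men's median.
gender_medians = bonus_demo.groupby("gender")["salary"].median()
gap = gender_medians["Man"] - gender_medians["Woman"]
gap_pct = gap / gender_medians["Man"] * 100

# Raw education premium: Master's median minus Bachelor's median.
edu_medians = bonus_demo.groupby("education")["salary"].median()
premium = edu_medians["Master's degree"] - edu_medians["College degree"]

print(gap, round(gap_pct, 1), premium)
```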

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $135,200.00
2. **Highest paying US state for tech:** Connecticut
3. **Salary increase per year of experience:** $6,131.75
4. **Remote vs office percentage:** about 1% remote, 99% in-office
5. **Highest paying non-tech industry:** ESL Teacher

**Key insights:**
- Code quality depends on the direction the human gives.
- It feels like pair programming: you hold an actual conversation while you collaborate.

**Challenges faced:**
- Cursor didn't have access to my entire dataset, so without that context its responses were sometimes off. I worked around this by iterating and pasting in snippets of the data.

**What you learned about vibe coding:**
- You have to be very specific about the output you want.
- Sometimes the output isn't what you want, so you have to iterate.
