# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

- [ ] Ingest raw TSV and validate schema against 2021 codebook
- [ ] Profile dataset: shape, types, missingness, distributions
- [ ] Normalize column names and set consistent data types
- [ ] Standardize strings, booleans, and categorical levels
- [ ] Remove duplicates and clearly invalid responses
- [ ] Parse compensation fields: base, bonus, currency, pay basis
- [ ] Annualize pay, convert to USD, compute total compensation
- [ ] Handle textual/range compensation entries with rules
- [ ] Standardize location fields and derive remote status
- [ ] Normalize job titles; derive job family and seniority
- [ ] Standardize industry and company size categories
- [ ] Clean tenure/experience; derive years-in-role and years-at-company
- [ ] Define and apply missing/inconsistent value rules
- [ ] Detect outliers; winsorize or cap with documented thresholds
- [ ] Add data quality flags and audit columns
- [ ] Implement reproducible cleaning pipeline with config-driven rules
- [ ] Add validation tests and automated checks
- [ ] Persist cleaned dataset, codebook, and metadata with versioning
- [ ] Produce summary EDA report and sanity checks
- [ ] Document cleaning decisions and assumptions in README


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [120]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(
    "/workspaces/ds-fall-2025-tue/Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv",
    sep="\t"
)

# Quick look at the data
print(df.shape)  
print(df.columns)
df.head()

(28062, 18)
Index(['Timestamp', 'How old are you?', 'What industry do you work in?',
       'Job title',
       'If your job title needs additional context, please clarify here:',
       'What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)',
       'How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.',
       'Please indicate the currency',
       'If "Other," please indicate the currency here: ',
       'If your income needs additional context, please provide it here:',
       'What country do you work in?',
       'If you're in the U.S., what state do you work in?',
       'What city do you work in?',
       'How many years of professional work experience do you have overal

Unnamed: 0,Timestamp,How old are you?,What industry do you work in?,Job title,"If your job title needs additional context, please clarify here:","What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)","How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",Please indicate the currency,"If ""Other,"" please indicate the currency here:","If your income needs additional context, please provide it here:",What country do you work in?,"If you're in the U.S., what state do you work in?",What city do you work in?,How many years of professional work experience do you have overall?,How many years of professional work experience do you have in your field?,What is your highest level of education completed?,What is your gender?,What is your race? (Choose all that apply.)
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White
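
Beyond `shape` and `head()`, a per-column profile can show where the mess is before any cleaning starts. The sketch below is a minimal illustration, not part of the original notebook: `profile_columns` is a hypothetical helper, and `sample` is a small made-up frame standing in for the survey. On the real data you would pass `df` instead.

```python
import pandas as pd

def profile_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, share of missing values, and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing_pct": frame.isna().mean().mul(100).round(1),
        "n_unique": frame.nunique(dropna=True),
    })

# Toy stand-in for the survey (illustrative values, not real responses)
sample = pd.DataFrame({
    "salary": ["55,000", "54600", None, "62000"],
    "country": ["United States", "US", "USA", "usa "],
})

report = profile_columns(sample)
print(report)
```

A high `n_unique` on a column that should have only a handful of values (such as country) is usually a sign of inconsistent spellings worth normalizing.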


## Step 2: Data Cleaning


In [121]:

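# Work on a copy and give the 18 survey columns short snake_case names
df = df.copy()
df.columns = [
    'timestamp', 'age', 'field', 'job_title', 'additional_context',
    'annual_salary', 'monetary_compensations', 'currency', 'other_currency',
    'income_context', 'country', 'state', 'city', 'years_worked_total',
    'years_worked_in_field', 'highest_education', 'gender', 'race',
]

# Normalize country labels: trim whitespace, then map common US spellings to "US".
# Without regex=True, Series.replace matches whole values exactly, so only
# these literal spellings are mapped (not everything starting with "United States").
df["country_clean"] = df["country"].str.strip()
df["country_clean"] = df["country_clean"].replace({
    "United States": "US",
    "United States of America": "US",
    "USA": "US",
    "Usa": "US",
    "U.S.": "US",
})
# Title-casing turns "US" into "Us", which is the label the analysis cells filter on
df["country_clean"] = df["country_clean"].str.title()
df['country_clean'].value_counts()

# Coerce salaries to numbers. Values that fail to parse (for example strings with
# thousands separators such as "55,000") become NaN, and those rows are dropped.
df['annual_salary'] = pd.to_numeric(df['annual_salary'], errors='coerce')
df = df.dropna(subset=['annual_salary']).copy()
df.head()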

Unnamed: 0,timestamp,age,field,job_title,additional_context,annual_salary,monetary_compensations,currency,other_currency,income_context,country,state,city,years_worked_total,years_worked_in_field,highest_education,gender,race,country_clean
97,4/27/2021 11:05:08,55-64,Environmental regulation,Quality Assurance Lead,,58.0,,USD,,,USA,Maryland,Baltimore,21 - 30 years,8 - 10 years,Master's degree,Woman,White,Us
166,4/27/2021 11:06:07,25-34,Health care,occupational therapist,,35.0,,EUR,,,finland,,helsinki,2 - 4 years,2 - 4 years,College degree,Woman,White,Finland
895,4/27/2021 11:16:37,45-54,Education (Primary/Secondary),Special Education Teacher,,38.0,0.0,USD,,,USA,Idaho,Pocatello,21 - 30 years,11 - 20 years,Master's degree,Woman,White,Us
968,4/27/2021 11:17:54,25-34,Media & Digital,Managing Editor,,61.0,0.0,USD,,,United States,Maryland,Bethesda,5-7 years,5-7 years,College degree,Woman,White,Us
1607,4/27/2021 11:28:23,25-34,Agriculture or Forestry,Chief Data Scientist,,130.0,0.0,USD,,,USA,New Mexico,Santa Fe,8 - 10 years,8 - 10 years,PhD,Woman,White,Us
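
The surviving rows above have salaries such as 58.0 and 130.0, which suggests the coercion step is discarding more than it should. If salaries arrive as text with thousands separators, `pd.to_numeric(..., errors='coerce')` turns them into NaN. The sketch below shows one possible way to parse them first, along with a case-insensitive country mapping. `parse_salary`, `normalize_country`, and the `US_ALIASES` list are hypothetical names and assumptions, not the notebook's method, and the toy frame is made up.

```python
import pandas as pd

# Assumed, incomplete list of US spellings (compared after trimming and case-folding)
US_ALIASES = {"us", "usa", "u.s.", "u.s.a.", "united states", "united states of america"}

def parse_salary(raw: pd.Series) -> pd.Series:
    """Remove thousands separators and whitespace, then coerce to numbers (NaN if unparseable)."""
    cleaned = raw.astype(str).str.replace(",", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")

def normalize_country(raw: pd.Series) -> pd.Series:
    """Map case and whitespace variants of the US to "US"; other values are only trimmed."""
    trimmed = raw.str.strip()
    is_us = trimmed.str.casefold().isin(US_ALIASES)
    return trimmed.mask(is_us, "US")

# Toy data (illustrative values, not real responses)
toy = pd.DataFrame({
    "annual_salary": ["55,000", " 54600 ", "n/a", None],
    "country": ["United States", "usa ", "U.S.", "Canada"],
})
toy["salary_num"] = parse_salary(toy["annual_salary"])
toy["country_clean"] = normalize_country(toy["country"])
print(toy[["salary_num", "country_clean"]])
```

Comparing `df['annual_salary'].isna().sum()` before and after parsing is a quick check on how many rows a cleaning step removes.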


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [122]:
# Question 1: What is the median salary for Software Engineers in the United States?
median_se_us = df.loc[
    (df["country_clean"] == "Us") &
    (df["job_title"].str.contains("software engineer", case=False, na=False)),
    "annual_salary"
].median()
print(median_se_us)

141875.0


In [123]:
# Question 2: Which US state has the highest average salary for tech workers?
df_tech = df[df['field'] == "Computing or Tech"].copy()
state_salary = (
    df_tech.groupby('state')['annual_salary']
    .agg(['mean'])
    .sort_values("mean", ascending=False)
)
print(state_salary)

                                         mean
state                                        
Michigan, Texas, Washington     340000.000000
Florida                         210711.000000
California, Oregon              200000.000000
California, Colorado            176000.000000
Delaware                        169000.000000
Connecticut                     161610.000000
California                      153286.988827
New York                        146594.112150
Wisconsin                       145806.888889
District of Columbia            142923.076923
New Jersey, New York            137500.000000
Ohio                            137066.666667
Washington                      136923.916667
Massachusetts                   125899.155844
Nevada                          124550.000000
New Jersey                      124090.909091
Oregon                          122640.000000
West Virginia                   120000.000000
Maryland                        119955.461538
Illinois                        11
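
The ranking above is led by a multi-state entry ("Michigan, Texas, Washington"), and a state with only a handful of respondents can also dominate a mean. One possible refinement, sketched here under assumptions, is to keep only single-state rows and require a minimum number of respondents per state. `mean_salary_by_state` and the `min_n` threshold are hypothetical choices, and the toy frame is made up.

```python
import pandas as pd

def mean_salary_by_state(frame: pd.DataFrame, min_n: int = 2) -> pd.DataFrame:
    """Mean salary per state, ignoring multi-state or missing entries and tiny groups."""
    # Rows whose state contains a comma list several states; na=True also drops missing states
    multi_or_missing = frame["state"].str.contains(",", regex=False, na=True)
    single = frame[~multi_or_missing]
    stats = single.groupby("state")["annual_salary"].agg(["mean", "count"])
    return stats[stats["count"] >= min_n].sort_values("mean", ascending=False)

# Toy data (illustrative values, not real responses)
toy_states = pd.DataFrame({
    "state": ["California", "California", "Michigan, Texas, Washington",
              "Ohio", "Ohio", "Delaware", None],
    "annual_salary": [150000, 160000, 340000, 100000, 120000, 169000, 90000],
})
ranked = mean_salary_by_state(toy_states, min_n=2)
print(ranked)
```

The right `min_n` is a business judgment call; reporting the count next to the mean makes that choice visible to the reader.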

In [124]:
# Question 3: How much does salary increase on average for each year of experience in tech?
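
Experience in this survey is recorded as bands such as "5-7 years" or "8 - 10 years", so a per-year figure needs a numeric stand-in for each band. A minimal sketch, assuming the band midpoint is an acceptable proxy and that a straight-line fit is adequate: `band_midpoint` is a hypothetical helper, and the toy salaries are made up so the fit is easy to check.

```python
import re

import numpy as np
import pandas as pd

def band_midpoint(label):
    """Return the mean of the integers in a band label, e.g. '8 - 10 years' -> 9.0."""
    if not isinstance(label, str):
        return np.nan
    nums = [int(n) for n in re.findall(r"\d+", label)]
    return sum(nums) / len(nums) if nums else np.nan

# Toy data (illustrative values, not real responses)
toy_exp = pd.DataFrame({
    "years_worked_in_field": ["2 - 4 years", "5-7 years", "8 - 10 years", "11 - 20 years"],
    "annual_salary": [60000, 72000, 84000, 110000],
})
toy_exp["years_mid"] = toy_exp["years_worked_in_field"].map(band_midpoint)

# Least-squares line: salary is approximately slope * years + intercept
slope, intercept = np.polyfit(toy_exp["years_mid"], toy_exp["annual_salary"], deg=1)
print(f"Estimated increase per year of experience: {slope:.0f}")
```

Open-ended bands (a label with a single number) collapse to that number under this rule, which understates experience at the top end; that limitation is worth stating next to the final answer.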

In [125]:

# Question 4: Which industry (besides tech) has the highest median salary?


In [126]:
# Bonus Questions:
# Question 5: What's the salary gap between men and women in tech roles?
# Question 6: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
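
For the industry question (highest median salary outside tech), one possible approach is a grouped median with a minimum group size, so that a single high earner in a rarely chosen industry does not win by default. This is a sketch under assumptions: `top_industries_by_median` and `min_n` are hypothetical, the column names follow the renamed columns from Step 2, and the toy frame is made up.

```python
import pandas as pd

def top_industries_by_median(frame: pd.DataFrame,
                             exclude: str = "Computing or Tech",
                             min_n: int = 2) -> pd.DataFrame:
    """Median salary per industry, excluding one industry and groups smaller than min_n."""
    others = frame[frame["field"] != exclude]
    stats = others.groupby("field")["annual_salary"].agg(["median", "count"])
    return stats[stats["count"] >= min_n].sort_values("median", ascending=False)

# Toy data (illustrative values, not real responses)
toy_fields = pd.DataFrame({
    "field": ["Computing or Tech", "Computing or Tech", "Law", "Law", "Law",
              "Nonprofits", "Nonprofits", "Health care"],
    "annual_salary": [150000, 130000, 120000, 90000, 100000, 60000, 64000, 300000],
})
industry_rank = top_industries_by_median(toy_fields)
print(industry_rank)
```

The same pattern (group, aggregate, filter on count) carries over to comparisons by gender or education level by swapping the grouping column.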

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $X
2. **Highest paying US state for tech:** State Name
3. **Salary increase per year of experience:** $X per year
4. **Highest paying non-tech industry:** Industry Name

**Key insights:**
- Insight 1
- Insight 2
- Insight 3

**Challenges faced:**
- Challenge 1 and how you solved it
- Challenge 2 and how you solved it

**What you learned about vibe coding:**
- Learning 1: Vibe coding requires very specific instructions to the AI model; otherwise its output will not help accomplish the task.
- Learning 2
- Learning 3
