# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

1. **Define goals and scope**
   - Core questions: pay levels by job title, industry, location, experience; total comp distributions; remote vs on-site; company size effects; trends by education and tenure.
   - Assumptions: raw is 2021 survey; we’ll convert all comp to 2021 USD using annual average FX.
2. **Set up project and data inventory**
   - Create `data/raw`, `data/interim`, `data/processed`, `reports`, `docs`.
   - Save raw as `data/raw/askamanager_2021.csv` (no edits).
   - Draft initial data dictionary in `docs/data_dictionary.md` with column names, types, and meaning.
3. **Profile the raw dataset**
   - Inspect row count, unique respondents, duplicate keys, null rates, basic distributions for salary, additional comp, currency, country, job title, industry.
   - Flag messy fields: ranges (e.g., “80–90k”), text with “k”, commas, mixed currencies, free-text locations and titles.
4. **Standardize schema**
   - Normalize column names to snake_case; trim whitespace; unify yes/no values to booleans; parse dates.
   - Coerce base types: numeric, categorical, string.
5. **Normalize compensation (critical)**
   - Parse `annual_salary` and `additional_compensation`:
     - Remove symbols, commas, “k/K”, ranges (take mid or split strategy), notes in parentheses.
   - Currency handling:
     - Use `currency` or `currency_other`; map to ISO codes.
     - Convert to USD with 2021 annual average FX; document source.
   - Compute `total_comp_usd = base_salary_usd + additional_comp_usd`.
   - Validation: 0 < base ≤ 5,000,000; total_comp ≥ base; drop or flag impossible values.
6. **Location harmonization**
   - Split and trim country, state, city; map to ISO-3166 (country) and USPS (US states).
   - Derive `is_us`, `us_region`, `metro_guess` (if city/state present).
   - Standardize remote status (remote/hybrid/on-site) if available.
   - Validation: country coverage ≥ 98%; unknowns bucketed as Unknown.
7. **Job title and seniority normalization**
   - Clean titles (case, punctuation); remove employer names and extraneous terms.
   - Extract seniority (Intern, Junior, Mid, Senior, Lead, Manager, Director, VP, C-level) into separate field.
   - Map titles to normalized roles (e.g., Software Engineer, Data Analyst, Product Manager) with a lookup table.
   - Validation: top 100 raw titles mapped; ≥90% rows have a normalized role or Other.
8. **Industry and company attributes**
   - Map free-text industry to a controlled list (10–20 buckets).
   - Normalize company size (employee count bands), company type (public/private/non-profit/gov).
   - Validation: bucket coverage ≥95%.
9. **Experience and tenure**
   - Parse `years_experience` and `years_with_employer` to numeric; bucket to ranges.
   - Derive `seniority_level` cross-checking experience and title.
   - Validation: non-negative, reasonable caps (e.g., ≤ 60).
10. **Handle missing data**
    - Decide per-field strategy: drop, impute (median for numeric within role/region), or set to Unknown.
    - Document rationale in `docs/data_quality_notes.md`.
11. **Outliers and anomalies**
    - Detect with log-scale IQR within role×country (or role×US_region).
    - Winsorize or exclude extreme outliers; keep an audit trail in `reports/row_exclusions.csv`.
12. **Feature engineering (analysis-ready)**
    - Create `pay_band` (deciles by role), `cost_level` (optional), `role_family`, `experience_bucket`, `company_size_bucket`, `remote_flag`.
    - Compute `ppp_adjusted_comp` (optional) if PPP index available; otherwise skip.
13. **Quality assurance**
    - Row reconciliation: raw rows = kept + dropped.
    - Spot checks: 20 sampled records before/after cleaning.
    - Aggregate sanity: median comp by role and country within expected ranges; no negative or null totals in analysis set.
14. **Outputs**
    - `data/processed/salary_clean.parquet` (typed, deduped, analysis-ready).
    - `data/interim/mappings/*` for title/industry lookups.
    - `docs/data_dictionary.md`, `docs/data_quality_notes.md`, `reports/qa_summary.md`.
15. **Initial insights deliverables**
    - Tables/figures: comp by role, role×location, experience curves, remote premium, company size impact.
    - Executive summary slide with 5–7 key findings and caveats.
16. **Key early decisions to document**
    - FX rate source and handling for missing currencies.
    - Range parsing policy and thresholds for exclusion.
    - Title/industry mapping coverage targets and fallback buckets.
17. **Reproducibility**
    - Lock environment; parameterize file paths; write seeds for any stochastic steps.
    - Version all lookup tables and exclusions.

**Summary:**

- A phased checklist to clean, normalize, and validate the 2021 survey into an analysis-ready dataset with documented assumptions, mappings, and QA.
- Focus areas are compensation parsing/currency conversion, location/title normalization, and robust outlier handling to support reliable business insights.
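
The compensation-parsing item in the plan (strip symbols and commas, expand "k", take the midpoint of a range, ignore notes in parentheses) could be sketched roughly as follows. The function name `parse_amount` and the midpoint policy are assumptions chosen for illustration; the plan leaves the range strategy open, and the sample strings are made up.

```python
import re
from typing import Optional


def parse_amount(raw: Optional[str]) -> Optional[float]:
    """Parse a free-text pay value into a float; return None if nothing numeric is found."""
    if raw is None:
        return None
    # Drop notes in parentheses, then currency symbols, commas, and whitespace.
    text = re.sub(r"\(.*?\)", "", str(raw))
    text = re.sub(r"[,$£€\s]", "", text).lower()
    # Capture each number together with an optional trailing "k" multiplier.
    matches = re.findall(r"(\d+(?:\.\d+)?)(k?)", text)
    if not matches:
        return None
    values = [float(num) * (1000 if k else 1) for num, k in matches]
    # In a range such as "80-90k", assume the "k" applies to both ends.
    if len(values) >= 2 and matches[1][1] and not matches[0][1]:
        values[0] *= 1000
    # Assumed policy: a range becomes its midpoint; a single value is kept as is.
    ends = values[:2]
    return sum(ends) / len(ends)


# Made-up examples of the messy formats the plan mentions.
for example in ["$85,000", "85k", "80-90k", "55,000 (plus bonus 5k)", "n/a"]:
    print(example, "->", parse_amount(example))
```

Returning `None` instead of raising keeps unparseable rows visible, so they can be counted and logged rather than silently dropped.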


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
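
Since the dataset is a `.tsv` file, one possible way to load it is `pd.read_csv` with a tab separator, reading every column as text so that no messy value is coerced before you have looked at it. The helper name `load_survey`, the shortened column names, and the three sample rows below are made up for illustration; the real file's headers will differ.

```python
import io

import pandas as pd


def load_survey(source) -> pd.DataFrame:
    """Read a tab-separated survey export, keeping every column as text."""
    df = pd.read_csv(source, sep="\t", dtype=str)
    # Trim stray whitespace around the header names.
    df.columns = df.columns.str.strip()
    return df


# Tiny made-up stand-in for the real file (hypothetical column names).
sample_tsv = io.StringIO(
    "Job title\tAnnual salary\tCurrency\tCountry\n"
    "Software Engineer\t120,000\tUSD\tUnited States\n"
    "Data Analyst\t\tUSD\tusa\n"
    "Teacher\t45000\tGBP\tUK\n"
)
survey = load_survey(sample_tsv)

print(survey.shape)                   # rows, columns
print(survey.isna().mean().round(2))  # share of missing values per column
```

For the real data, pass the path listed under **Location** to `load_survey` in place of `sample_tsv`.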


## Step 2: Data Cleaning
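
As a starting point, a cleaning pass might fold the different spellings of the United States into one flag, coerce salaries to numbers, and convert them to USD. This is only a sketch under stated assumptions: the column names `country`, `salary`, and `currency` are hypothetical, the alias list is incomplete, and the exchange rates are placeholders rather than real 2021 averages.

```python
import pandas as pd

# Assumed spellings to treat as the US; extend after inspecting value_counts().
US_ALIASES = {"us", "usa", "u.s.", "u.s.a.", "united states", "united states of america"}

# PLACEHOLDER rates for illustration only; replace with sourced 2021 averages.
FX_TO_USD = {"USD": 1.0, "GBP": 1.25, "CAD": 0.75}


def clean_survey(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with an is_us flag and a numeric salary_usd column."""
    out = df.copy()
    # Case- and whitespace-insensitive match against the alias list.
    country = out["country"].str.strip().str.lower()
    out["is_us"] = country.isin(US_ALIASES)
    # Remove thousands separators; anything still non-numeric becomes NaN.
    out["salary"] = pd.to_numeric(
        out["salary"].str.replace(",", "", regex=False), errors="coerce"
    )
    # Currencies missing from the table yield NaN and are filtered out below.
    out["salary_usd"] = out["salary"] * out["currency"].map(FX_TO_USD)
    return out[out["salary_usd"] > 0].reset_index(drop=True)


# Made-up rows covering the common problems.
raw = pd.DataFrame({
    "country": [" USA", "United States", "Canada", "us", None],
    "salary": ["100,000", "90000", "80,000", "not sure", "70000"],
    "currency": ["USD", "USD", "CAD", "USD", "EUR"],
})
clean = clean_survey(raw)
print(clean[["is_us", "salary_usd"]])
print(f"dropped {len(raw) - len(clean)} of {len(raw)} rows")
```

Printing the number of dropped rows at each step makes it easier to explain later why your totals differ from the raw row count.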


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [None]:
# Question 1: What is the median salary for Software Engineers in the United States?
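
One way to approach this question is to filter on a title pattern and the US flag, then take the median. The sketch below assumes the cleaned frame has hypothetical columns `job_title`, `is_us`, and `salary_usd`; the rows are made up and are not survey results.

```python
import pandas as pd


def median_salary_for_title(df: pd.DataFrame, title_pattern: str) -> float:
    """Median USD salary of US respondents whose title matches the pattern."""
    is_match = df["job_title"].str.contains(
        title_pattern, case=False, na=False, regex=True
    )
    return df.loc[is_match & df["is_us"], "salary_usd"].median()


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "job_title": [
        "Software Engineer",
        "Senior software engineer",
        "software engineer II",
        "Teacher",
        "Software Engineer",
    ],
    "is_us": [True, True, True, True, False],
    "salary_usd": [100000, 150000, 120000, 50000, 90000],
})
result = median_salary_for_title(toy, r"software engineer")
print(result)
```

Whether "Senior software engineer" counts as a Software Engineer is a judgment call: a substring match includes it, while an exact match would not, and the two choices can give different medians.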


In [None]:
# Question 2: Which US state has the highest average salary for tech workers?
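
Averages are sensitive to small groups, so a state with a single very high earner can top the ranking. A possible guard is to require a minimum number of responses per state before comparing means. The column names `state` and `salary_usd`, the threshold, and the rows below are all assumptions for illustration.

```python
import pandas as pd


def top_state_by_mean(df: pd.DataFrame, min_responses: int = 2) -> str:
    """Name of the state with the highest mean salary among states with enough responses."""
    stats = df.groupby("state")["salary_usd"].agg(["mean", "count"])
    eligible = stats[stats["count"] >= min_responses]
    return eligible["mean"].idxmax()


# Made-up rows: one state has a single extreme value.
toy = pd.DataFrame({
    "state": ["California", "California", "Texas", "Texas", "Wyoming"],
    "salary_usd": [150000, 130000, 110000, 120000, 400000],
})
print(top_state_by_mean(toy, min_responses=2))
print(top_state_by_mean(toy, min_responses=1))
```

The two calls return different states on this toy data, which shows how much the threshold matters; record the value you choose.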


In [None]:
# Question 3: How much does salary increase on average for each year of experience in tech?
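
If experience is recorded as bands rather than exact years, one option is to map each band to a midpoint and fit a straight line; the slope is then the average salary change per year. The band labels, midpoints, and column names below are hypothetical, and a least-squares line via `np.polyfit` is only one of several reasonable definitions of "increase per year".

```python
import numpy as np
import pandas as pd

# Hypothetical band labels mapped to a midpoint in years.
EXPERIENCE_MIDPOINTS = {
    "0-2 years": 1.0,
    "3-5 years": 4.0,
    "6-8 years": 7.0,
    "9-11 years": 10.0,
}


def salary_per_year(df: pd.DataFrame) -> float:
    """Slope of a least-squares line of salary_usd against experience midpoints."""
    years = df["experience_band"].map(EXPERIENCE_MIDPOINTS)
    # Keep only rows where both the band and the salary are usable.
    usable = years.notna() & df["salary_usd"].notna()
    slope, _intercept = np.polyfit(
        years[usable].to_numpy(), df.loc[usable, "salary_usd"].to_numpy(), deg=1
    )
    return float(slope)


# Made-up rows that lie on a straight line, plus one unmapped band.
toy = pd.DataFrame({
    "experience_band": ["0-2 years", "3-5 years", "6-8 years", "9-11 years", "unknown"],
    "salary_usd": [60000, 75000, 90000, 105000, 999999],
})
slope = salary_per_year(toy)
print(round(slope, 2))
```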

In [None]:
# Question 4: What percentage of respondents work remotely vs. in-office?


In [None]:
# Question 5: Which industry (besides tech) has the highest median salary?
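
For the industry question, a possible pattern is to exclude the tech bucket, compute the median per remaining industry, and ignore industries with too few responses. The label `"Computing or Tech"`, the column names, the threshold, and the rows are assumptions; check the labels your cleaned data actually contains.

```python
import pandas as pd


def top_industry_by_median(
    df: pd.DataFrame,
    exclude: tuple = ("Computing or Tech",),
    min_responses: int = 2,
) -> str:
    """Industry with the highest median salary, skipping excluded and tiny groups."""
    pool = df[~df["industry"].isin(exclude)]
    stats = pool.groupby("industry")["salary_usd"].agg(["median", "count"])
    eligible = stats[stats["count"] >= min_responses]
    return eligible["median"].idxmax()


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "industry": [
        "Computing or Tech", "Computing or Tech",
        "Law", "Law",
        "Education", "Education",
        "Rocket surgery",
    ],
    "salary_usd": [200000, 180000, 150000, 130000, 60000, 70000, 500000],
})
print(top_industry_by_median(toy))
```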

In [None]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?
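
The group-comparison questions (gender, education) share one shape: compute the median per group and report the difference in absolute and percentage terms. A minimal sketch, assuming hypothetical columns `education` and `salary_usd` and made-up rows:

```python
import pandas as pd


def median_gap(df: pd.DataFrame, group_col: str, group_a: str, group_b: str):
    """Return (absolute gap, percent gap relative to group_b) between group medians."""
    medians = df.groupby(group_col)["salary_usd"].median()
    diff = medians[group_a] - medians[group_b]
    return float(diff), float(diff / medians[group_b] * 100)


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "education": ["Master's degree"] * 3 + ["College degree"] * 3,
    "salary_usd": [100000, 120000, 110000, 90000, 100000, 110000],
})
gap_usd, gap_pct = median_gap(toy, "education", "Master's degree", "College degree")
print(gap_usd, round(gap_pct, 1))
```

A raw gap like this says nothing about statistical significance or about differences in role mix between the groups, so "significantly more" would need a further test or at least a comparison within similar roles.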

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $X
2. **Highest paying US state for tech:** State Name
3. **Salary increase per year of experience:** $X per year
4. **Remote vs office percentage:** X% remote, Y% office
5. **Highest paying non-tech industry:** Industry Name

**Key insights:**
- Insight 1
- Insight 2
- Insight 3

**Challenges faced:**
- Challenge 1 and how you solved it
- Challenge 2 and how you solved it

**What you learned about vibe coding:**
- Learning 1
- Learning 2
- Learning 3
