# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step by step hints just you, your AI assistant, and real world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

*(Paste your Cursor todo list here)*

## Cleaning Steps

### 1. Load and Initial Inspection

- [ ] Load TSV file with proper encoding
- [ ] Examine column names, data types, and basic statistics
- [ ] Check for duplicate records
- [ ] Identify total missing values per column

### 2. Handle Missing and Invalid Data

- [ ] Remove rows with missing salary values (critical field)
- [ ] Remove rows with missing country data
- [ ] Handle missing bonus values (treat as 0 or separate flag)
- [ ] Remove or flag obviously invalid salaries (< $10,000 or > $10,000,000)
- [ ] Handle missing education levels
- [ ] Handle missing gender data

### 3. Standardize Country and Location Data

- [ ] Standardize country names (US, USA, United States → United States)
- [ ] Standardize state names (handle abbreviations and full names)
- [ ] Filter for US-only data where needed for business questions
- [ ] Clean city names (handle extra spaces, capitalization)

### 4. Currency Conversion

- [ ] Identify all unique currencies in dataset
- [ ] Apply 2021 exchange rates to convert all salaries to USD
- [ ] Convert bonus amounts to USD as well
- [ ] Create flag column indicating original currency

### 5. Standardize Experience Data

- [ ] Parse experience ranges (e.g., "5-7 years", "8 - 10 years")
- [ ] Convert to numeric values (use midpoint of range)
- [ ] Handle special cases like "1 year or less", "41 years or more"

### 6. Clean and Categorize Job Titles

- [ ] Standardize "Software Engineer" variations (Software Developer, Software Eng, etc.)
- [ ] Identify tech roles (Software Engineer, Developer, Data Scientist, etc.)
- [ ] Create industry/role category flags for analysis
- [ ] Handle job title variations and typos

### 7. Standardize Industry Categories

- [ ] Identify "Computing or Tech" industry variations
- [ ] Group similar industries for comparison
- [ ] Create tech vs non-tech flag

### 8. Clean Education Data

- [ ] Standardize education levels (College degree, Bachelor's, etc.)
- [ ] Create education hierarchy for comparison
- [ ] Handle missing education values

### 9. Clean Gender Data

- [ ] Standardize gender categories
- [ ] Handle "prefer not to answer" and missing values
- [ ] Group for analysis (Man, Woman, Non-binary, Other)

### 10. Handle Outliers and Validation

- [ ] Use IQR method or domain knowledge to identify salary outliers
- [ ] Validate experience vs age consistency
- [ ] Check for logical inconsistencies (e.g., 30 years experience at age 25)

### 11. Create Derived Features

- [ ] Total compensation = salary + bonus
- [ ] Experience numeric (midpoint of range)
- [ ] Age numeric (midpoint of range)
- [ ] Tech worker flag
- [ ] Software Engineer flag

### 12. Final Dataset Preparation

- [ ] Save cleaned dataset to new file
- [ ] Document cleaning decisions and row counts at each step
- [ ] Create summary statistics of cleaned data


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [16]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

df = pd.read_csv('/workspaces/ds-fall-2025-fri-1230/Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv', sep='\t')
df.head(10)

Unnamed: 0,Timestamp,How old are you?,What industry do you work in?,Job title,"If your job title needs additional context, please clarify here:","What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)","How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",Please indicate the currency,"If ""Other,"" please indicate the currency here:","If your income needs additional context, please provide it here:",What country do you work in?,"If you're in the U.S., what state do you work in?",What city do you work in?,How many years of professional work experience do you have overall?,How many years of professional work experience do you have in your field?,What is your highest level of education completed?,What is your gender?,What is your race? (Choose all that apply.)
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White
5,4/27/2021 11:02:46,25-34,Education (Higher Education),Scholarly Publishing Librarian,,62000,,USD,,,USA,New Hampshire,Hanover,8 - 10 years,2 - 4 years,Master's degree,Man,White
6,4/27/2021 11:02:51,25-34,Publishing,Publishing Assistant,,33000,2000.0,USD,,,USA,South Carolina,Columbia,2 - 4 years,2 - 4 years,College degree,Woman,White
7,4/27/2021 11:03:00,25-34,Education (Primary/Secondary),Librarian,"High school, FT",50000,,USD,,,United States,Arizona,Yuma,5-7 years,5-7 years,Master's degree,Man,White
8,4/27/2021 11:03:01,45-54,Computing or Tech,Systems Analyst,Data developer/ETL Developer,112000,10000.0,USD,,,US,Missouri,St. Louis,21 - 30 years,21 - 30 years,College degree,Woman,White
9,4/27/2021 11:03:02,35-44,"Accounting, Banking & Finance",Senior Accountant,,45000,0.0,USD,,I work for a Charter School,United States,Florida,Palm Coast,21 - 30 years,21 - 30 years,College degree,Woman,"Hispanic, Latino, or Spanish origin, White"


## Step 2: Data Cleaning


In [17]:

# Rename columns to short, meaningful names
df = df.rename(columns={
    "Timestamp": "timestamp",
    "How old are you?": "age",
    "What industry do you work in?": "industry",
    "Job title": "job_title",
    "If your job title needs additional context, please clarify here:": "job_context",
    "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)": "salary",
    "How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.": "bonus",
    "Please indicate the currency": "currency",
    "If \"Other,\" please indicate the currency here: ": "currency_other",
    "If your income needs additional context, please provide it here:": "income_context",
    "What country do you work in?": "country",
    "If you're in the U.S., what state do you work in?": "state",
    "What city do you work in?": "city",
    "How many years of professional work experience do you have overall?": "exp_overall",
    "How many years of professional work experience do you have in your field?": "exp_field",
    "What is your highest level of education completed?": "education",
    "What is your gender?": "gender",
    "What is your race? (Choose all that apply.)": "race"
})

# Basic overview
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
print("\nMissing values:\n", df.isna().sum())
print("\nDuplicate rows:", df.duplicated().sum())

# ============================================================
# STEP 2: HANDLE MISSING AND INVALID DATA
# ============================================================

# Remove rows missing salary or country
df = df.dropna(subset=["salary", "country"])

# Replace missing bonuses with 0
df["bonus"] = df["bonus"].fillna(0)

# Convert salary to numeric (remove commas, text, symbols)
df["salary"] = (
    df["salary"]
    .astype(str)
    .str.replace(r"[^0-9.]", "", regex=True)
    .replace("", np.nan)
    .astype(float)
)

# Remove invalid salaries (<10k or >10M)
df = df[(df["salary"] >= 10000) & (df["salary"] <= 10000000)]

# Fill missing education or gender with "Unknown"
df["education"] = df["education"].fillna("Unknown")
df["gender"] = df["gender"].fillna("Unknown")

# ============================================================
# STEP 3: STANDARDIZE COUNTRY AND LOCATION DATA
# ============================================================

# Standardize US variations
df["country"] = df["country"].replace(
    {"USA": "United States", "US": "United States", "U.S.": "United States"}
)

# Clean city capitalization and trim whitespace
df["city"] = df["city"].astype(str).str.strip().str.title()

# Optional: Keep only U.S. rows for certain analyses
df_us = df[df["country"] == "United States"].copy()

# ============================================================
# STEP 4: CURRENCY CONVERSION
# ============================================================

# Define 2021 exchange rates (to USD)
exchange_rates = {
    "USD": 1.0,
    "EUR": 1.18,
    "GBP": 1.38,
    "CAD": 0.80,
    "AUD": 0.75,
    "INR": 0.0137,
    "CNY": 0.155,
    "JPY": 0.0091
}

# Convert all salaries and bonuses to USD
df["currency"] = df["currency"].replace("", np.nan).fillna(df["currency_other"])
df["currency"] = df["currency"].str.upper().str.strip()

df["exchange_rate"] = df["currency"].map(exchange_rates).fillna(1.0)
df["salary_usd"] = df["salary"] * df["exchange_rate"]
df["bonus_usd"] = df["bonus"] * df["exchange_rate"]
df["orig_currency"] = df["currency"]

# ============================================================
# STEP 5: STANDARDIZE EXPERIENCE DATA
# ============================================================

def parse_experience(s):
    if pd.isna(s):
        return np.nan
    s = str(s)
    if "less" in s.lower():
        return 0.5
    if "more" in s.lower():
        nums = re.findall(r"\d+", s)
        return float(nums[0]) + 1 if nums else np.nan
    nums = re.findall(r"\d+", s)
    if len(nums) == 2:
        return (float(nums[0]) + float(nums[1])) / 2
    elif len(nums) == 1:
        return float(nums[0])
    return np.nan

df["exp_overall_yrs"] = df["exp_overall"].apply(parse_experience)
df["exp_field_yrs"] = df["exp_field"].apply(parse_experience)

# ============================================================
# STEP 6: CLEAN AND CATEGORIZE JOB TITLES
# ============================================================

# Normalize job titles
df["job_title_clean"] = (
    df["job_title"]
    .str.lower()
    .str.strip()
    .replace({
        "software developer": "software engineer",
        "software eng": "software engineer",
        "dev": "developer",
    }, regex=False)
)

# Create tech-role flag
tech_keywords = ["engineer", "developer", "data", "scientist", "analyst", "it", "software"]
df["is_tech_role"] = df["job_title_clean"].apply(lambda x: any(k in str(x) for k in tech_keywords))

# ============================================================
# STEP 7: STANDARDIZE INDUSTRY CATEGORIES
# ============================================================

def clean_industry_field(text):
    """Clean and split industry strings into lists, separated by 'and' or commas."""
    if pd.isna(text):
        return []
    text = str(text).strip()
    # Replace ' and ' with commas for consistent splitting
    text = re.sub(r'\s+and\s+', ',', text)
    # Split by comma, strip whitespace, and remove empty entries
    items = [x.strip() for x in text.split(',') if x.strip()]
    return items

# Apply the function to industry column
df["industry_list"] = df["industry"].apply(clean_industry_field)

# For analysis convenience, you might also create a single normalized string
df["industry_clean"] = df["industry_list"].apply(lambda x: "; ".join(x))

# ============================================================
# STEP 8: CLEAN EDUCATION DATA
# ============================================================

edu_map = {
    "High school": "High School",
    "Some college": "Some College",
    "College degree": "Bachelor's",
    "Bachelor": "Bachelor's",
    "Master": "Master's",
    "PhD": "Doctorate",
}
df["education_clean"] = df["education"].replace(edu_map, regex=False)
df["education_clean"] = df["education_clean"].fillna("Unknown")

# ============================================================
# STEP 9: CLEAN GENDER DATA
# ============================================================

gender_map = {
    "Man": "Man",
    "Male": "Man",
    "Woman": "Woman",
    "Female": "Woman",
    "Nonbinary": "Non-binary",
    "Prefer not to answer": "Unknown"
}
df["gender_clean"] = df["gender"].replace(gender_map, regex=False)
df["gender_clean"] = df["gender_clean"].fillna("Unknown")

# ============================================================
# STEP 10: HANDLE OUTLIERS AND VALIDATION
# ============================================================

# IQR method for salary outliers
Q1 = df["salary_usd"].quantile(0.25)
Q3 = df["salary_usd"].quantile(0.75)
IQR = Q3 - Q1
df["is_outlier"] = ~df["salary_usd"].between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)

# ============================================================
# STEP 11: CREATE DERIVED FEATURES
# ============================================================

df["total_comp_usd"] = df["salary_usd"] + df["bonus_usd"]
df["is_software_eng"] = df["job_title_clean"].str.contains("software engineer", na=False)

# ============================================================
# STEP 12: FINAL DATASET PREPARATION
# ============================================================

# Save cleaned dataset
df.to_csv("/workspaces/ds-fall-2025-fri-1230/Week-05-Vibe-Coding-101/homeworks/cleaned_salary_survey.tsv", index=False)

# Print summary
print("\nCLEANING COMPLETE ✅")
print("Final shape:", df.shape)
print("Saved to: askamanager_2021_cleaned.csv")


Shape: (28062, 18)
Columns: ['timestamp', 'age', 'industry', 'job_title', 'job_context', 'salary', 'bonus', 'currency', 'currency_other', 'income_context', 'country', 'state', 'city', 'exp_overall', 'exp_field', 'education', 'gender', 'race']

Missing values:
 timestamp             0
age                   0
industry             74
job_title             1
job_context       20800
salary                0
bonus              7296
currency              0
currency_other    27856
income_context    25020
country               0
state              5023
city                 82
exp_overall           0
exp_field             0
education           222
gender              171
race                177
dtype: int64

Duplicate rows: 0

CLEANING COMPLETE ✅
Final shape: (27902, 33)
Saved to: askamanager_2021_cleaned.csv


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [29]:
# Question 1: What is the median salary for Software Engineers in the United States?
import pandas as pd
import numpy as np

# ------------------------------------------------------------
# STEP 1: LOAD CLEANED DATA
# ------------------------------------------------------------

df = pd.read_csv("/workspaces/ds-fall-2025-fri-1230/Week-05-Vibe-Coding-101/homeworks/cleaned_salary_survey.tsv", sep="\t", encoding="utf-8")

print(f"Loaded {len(df):,} rows, {len(df.columns)} columns\n")

# ------------------------------------------------------------
# STEP 2: NORMALIZE COLUMN NAMES (lowercase and underscores)
# ------------------------------------------------------------
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# For debugging — uncomment to see what columns exist
print("Available columns:\n", df.columns.tolist(), "\n")

# Define flexible name mapping to actual columns present
col_map = {
    "job_title": next((c for c in df.columns if "job" in c and "title" in c), None),
    "salary": next((c for c in df.columns if "salary" in c), None),
    "country": next((c for c in df.columns if "country" in c), None),
    "state": next((c for c in df.columns if "state" in c), None),
    "industry": next((c for c in df.columns if "industry" in c and "list" not in c and "clean" not in c), None),
    "industry_clean": next((c for c in df.columns if "industry_clean" in c), None),
    "exp_overall_yrs": next((c for c in df.columns if "exp_overall_yrs" in c or "exp_overall" in c), None)
}

print("Detected column map:\n", col_map, "\n")

# ============================================================
# QUESTION 1:
# What is the median salary for Software Engineers in the United States?
# ============================================================

software_mask = df[col_map["job_title"]].str.contains("software", case=False, na=False)
us_mask = df[col_map["country"]].str.lower().isin(["united states", "usa", "us"])

software_us = df[software_mask & us_mask]

median_salary_software_us = software_us[col_map["salary"]].median()

print("1️⃣ Median salary for Software Engineers in the United States:")
print(f"${median_salary_software_us:,.2f}\n")


Loaded 27,902 rows, 1 columns

Available columns:
 ['timestamp,age,industry,job_title,job_context,salary,bonus,currency,currency_other,income_context,country,state,city,exp_overall,exp_field,education,gender,race,exchange_rate,salary_usd,bonus_usd,orig_currency,exp_overall_yrs,exp_field_yrs,job_title_clean,is_tech_role,industry_list,industry_clean,education_clean,gender_clean,is_outlier,total_comp_usd,is_software_eng'] 

Detected column map:
 {'job_title': 'timestamp,age,industry,job_title,job_context,salary,bonus,currency,currency_other,income_context,country,state,city,exp_overall,exp_field,education,gender,race,exchange_rate,salary_usd,bonus_usd,orig_currency,exp_overall_yrs,exp_field_yrs,job_title_clean,is_tech_role,industry_list,industry_clean,education_clean,gender_clean,is_outlier,total_comp_usd,is_software_eng', 'salary': 'timestamp,age,industry,job_title,job_context,salary,bonus,currency,currency_other,income_context,country,state,city,exp_overall,exp_field,education,gender,ra

In [None]:
# Question 2: Which US state has the highest average salary for tech workers?
mask_us_tech = (
    df[country_col].str.lower().eq("united states") &
    df[job_col].str.contains("tech|software|developer|engineer|data|it|cyber", case=False, na=False)
)
state_avg = (
    df.loc[mask_us_tech, [state_col, salary_col]]
    .groupby(state_col, dropna=True)[salary_col]
    .mean()
    .sort_values(ascending=False)
)
top_state = state_avg.head(1)
print(f"2️⃣ Highest average tech salary: {top_state.index[0]} (${top_state.values[0]:,.0f})")

In [None]:
# Question 3: How much does salary increase on average for each year of experience in tech?
# Keep only tech workers with numeric experience
mask_exp = df[job_col].str.contains("tech|software|developer|engineer|data|it|cyber", case=False, na=False)
exp_df = df.loc[mask_exp, [salary_col, exp_col]].dropna()

if not exp_df.empty:
    coef = np.polyfit(exp_df[exp_col], exp_df[salary_col], 1)[0]
    print(f"3️⃣ Average salary increase per year of experience in tech: ${coef:,.0f}")
else:
    print("3️⃣ Not enough data for experience-based calculation.")

In [None]:

# Question 4: What percentage of respondents work remotely vs. in-office?
non_tech = df[~df[industry_col].str.contains("tech|software|it|data|cyber", case=False, na=False, na=False)]
industry_medians = (
    non_tech.groupby(industry_col, dropna=True)[salary_col]
    .median()
    .sort_values(ascending=False)
)
top_industry = industry_medians.head(1)
print(f"4️⃣ Highest median salary (non-tech industry): {top_industry.index[0]} (${top_industry.values[0]:,.0f})")

In [22]:
# Question 5: Which industry (besides tech) has the highest median salary?

In [23]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $X
2. **Highest paying US state for tech:** State Name
3. **Salary increase per year of experience:** $X per year
4. **Remote vs office percentage:** X% remote, Y% office
5. **Highest paying non-tech industry:** Industry Name

**Key insights:**
- Insight 1
- Insight 2
- Insight 3

**Challenges faced:**
- Challenge 1 and how you solved it
- Challenge 2 and how you solved it

**What you learned about vibe coding:**
- Learning 1
- Learning 2
- Learning 3
