# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.
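
The 5% tolerance can be checked with a relative-error comparison. The sketch below is only an illustration: `within_tolerance` is a hypothetical helper, and the numbers are made up rather than taken from the "official" results.

```python
def within_tolerance(answer: float, expected: float, tolerance: float = 0.05) -> bool:
    """Return True when `answer` is within `tolerance` (relative) of `expected`."""
    if expected == 0:
        return answer == 0
    return abs(answer - expected) / abs(expected) <= tolerance


# Made-up numbers, for illustration only
print(within_tolerance(104_000, 100_000))  # 4% off
print(within_tolerance(94_000, 100_000))   # 6% off
```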


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

- [ ] Load the dataset and inspect the columns
- [ ] Identify and handle missing or invalid entries
- [ ] Standardize salary formats and convert all currencies to USD
- [ ] Normalize job titles (e.g., "Software Engineer" variants)
- [ ] Clean and standardize location data (country, state)
- [ ] Remove or correct outliers in salary and experience
- [ ] Validate and clean experience/year fields
- [ ] Standardize education levels
- [ ] Address gender and industry columns for analysis
- [ ] Document all cleaning decisions and assumptions


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [27]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [28]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Load the dataset using absolute path
file_path = '/workspaces/ds-fall-2025-tue/Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv'
if os.path.exists(file_path):
    df = pd.read_csv(file_path, sep='\t', low_memory=False)
    print('Shape:', df.shape)
    display(df.head())
else:
    print(f"File not found: {file_path}")

Shape: (28062, 18)


Unnamed: 0,Timestamp,How old are you?,What industry do you work in?,Job title,"If your job title needs additional context, please clarify here:","What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)","How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",Please indicate the currency,"If ""Other,"" please indicate the currency here:","If your income needs additional context, please provide it here:",What country do you work in?,"If you're in the U.S., what state do you work in?",What city do you work in?,How many years of professional work experience do you have overall?,How many years of professional work experience do you have in your field?,What is your highest level of education completed?,What is your gender?,What is your race? (Choose all that apply.)
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White
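
The survey headers are full questions, which makes every later line of code long. One option, sketched here under assumptions, is to rename them once by matching the start of each header. The short names and the `shorten_columns` helper are hypothetical, and the demo frame uses an abbreviated salary header rather than the real one.

```python
import pandas as pd

# Hypothetical short names, keyed by the start of each survey header
PREFIX_TO_NAME = {
    "What industry": "industry",
    "Job title": "job_title",
    "What is your annual salary": "salary_raw",
    "Please indicate the currency": "currency",
    "What country": "country",
    "If you're in the U.S.": "state",
    "What is your highest level of education": "education",
    "What is your gender": "gender",
}


def shorten_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with matching long headers replaced by short names."""
    mapping = {}
    for col in frame.columns:
        for prefix, name in PREFIX_TO_NAME.items():
            if col.strip().startswith(prefix):
                mapping[col] = name
                break
    return frame.rename(columns=mapping)


# Toy frame with an abbreviated salary header, for illustration only
demo = pd.DataFrame({
    "Job title": ["Program Manager"],
    "What is your annual salary? (abbreviated header)": ["62000"],
    "Please indicate the currency": ["USD"],
    'If "Other," please indicate the currency here: ': [None],
})
short = shorten_columns(demo)
print(short.columns.tolist())
```

Matching on the start of the header keeps the `If "Other," ...` currency column from colliding with the main currency column; unmatched headers are left unchanged.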


## Step 2: Data Cleaning


In [29]:
# Step 2: Data Cleaning
# 1. Inspect columns and missing values
print('Columns:', df.columns.tolist())
print('Missing values per column:')
print(df.isnull().sum())

# 2. Remove rows with missing salary or currency
salary_col = "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)"
clean_df = df.dropna(subset=[salary_col, "Please indicate the currency"])

# 3. Keep USD rows only (other currencies are dropped, not converted)
# .copy() makes clean_df independent of df, so the column assignments below are safe
clean_df = clean_df[clean_df["Please indicate the currency"] == "USD"].copy()

# 4. Convert salary to numeric
# "$" is an end-of-string anchor in a regex, so it is matched inside a character class
clean_df["salary"] = clean_df[salary_col].astype(str).str.replace(r"[$,]", "", regex=True)
clean_df["salary"] = pd.to_numeric(clean_df["salary"], errors="coerce")

# 5. Remove outliers (salary < $10,000 or > $500,000)
clean_df = clean_df[(clean_df["salary"] >= 10000) & (clean_df["salary"] <= 500000)]

# 6. Standardize job titles (example: lowercase, strip)
clean_df["job_title_clean"] = clean_df["Job title"].str.lower().str.strip()

# 7. Standardize US state names
clean_df["state_clean"] = clean_df["If you're in the U.S., what state do you work in?"].str.title().str.strip()

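# 8. Clean experience column (convert to midpoint of range)
def experience_to_years(val):
    """Map a survey range such as "8 - 10 years" to its midpoint in years."""
    if pd.isnull(val):
        return None
    val = str(val)
    if "year" not in val:
        return None
    if "-" in val:
        # "5-7 years" and "8 - 10 years" both split into two numeric parts
        parts = val.split("-")
        try:
            return (int(parts[0].split()[0]) + int(parts[1].split()[0])) / 2
        except (ValueError, IndexError):
            return None
    if "less" in val:
        # Open-ended bottom bucket
        return 1
    if "more" in val:
        # Open-ended top bucket, mapped to 30. Known simplification: this falls
        # below the midpoint of any range that starts above 30 years.
        return 30
    try:
        return int(val.split()[0])
    except (ValueError, IndexError):
        return None
clean_df["experience_years"] = clean_df["How many years of professional work experience do you have overall?"].apply(experience_to_years)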

# 9. Standardize education levels
clean_df["education_clean"] = clean_df["What is your highest level of education completed?"].str.lower().str.strip()

# 10. Standardize gender
clean_df["gender_clean"] = clean_df["What is your gender?"].str.lower().str.strip()

# 11. Standardize industry
clean_df["industry_clean"] = clean_df["What industry do you work in?"].str.lower().str.strip()

# 12. Show cleaned data summary
print('Cleaned shape:', clean_df.shape)
clean_df.head()

Columns: ['Timestamp', 'How old are you?', 'What industry do you work in?', 'Job title', 'If your job title needs additional context, please clarify here:', "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)", 'How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.', 'Please indicate the currency', 'If "Other," please indicate the currency here: ', 'If your income needs additional context, please provide it here:', 'What country do you work in?', "If you're in the U.S., what state do you work in?", 'What city do you work in?', 'How many years of professional work experience do you have overall?', 'How many years of professional work experience do you have in your field?', 'Wha

Timestamp                                                                                                                                                                                                                                   0
How old are you?                                                                                                                                                                                                                            0
What industry do you work in?                                                                                                                                                                                                              74
Job title                                                                                                                                                                                                                                   1
If your job title needs additional context, plea

Unnamed: 0,Timestamp,How old are you?,What industry do you work in?,Job title,"If your job title needs additional context, please clarify here:","What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)","How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",Please indicate the currency,"If ""Other,"" please indicate the currency here:","If your income needs additional context, please provide it here:",...,What is your highest level of education completed?,What is your gender?,What is your race? (Choose all that apply.),salary,job_title_clean,state_clean,experience_years,education_clean,gender_clean,industry_clean
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,...,Master's degree,Woman,White,55000,research and instruction librarian,Massachusetts,6.0,master's degree,woman,education (higher education)
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,...,College degree,Woman,White,34000,marketing specialist,Tennessee,3.0,college degree,woman,"accounting, banking & finance"
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,...,College degree,Woman,White,62000,program manager,Wisconsin,9.0,college degree,woman,nonprofits
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,...,College degree,Woman,White,60000,accounting manager,South Carolina,9.0,college degree,woman,"accounting, banking & finance"
5,4/27/2021 11:02:46,25-34,Education (Higher Education),Scholarly Publishing Librarian,,62000,,USD,,,...,Master's degree,Man,White,62000,scholarly publishing librarian,New Hampshire,9.0,master's degree,man,education (higher education)
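
The salary-parsing step is easy to get subtly wrong, because `$` is an end-of-string anchor in a regular expression. A minimal sketch on toy strings (not survey rows) shows one way to strip currency symbols and thousands separators before converting to numbers:

```python
import pandas as pd

# Toy values for illustration; not rows from the survey
raw = pd.Series(["55,000", "$62,000", "54600", "n/a"])

# Inside a character class, "$" matches a literal dollar sign
stripped = raw.astype(str).str.replace(r"[$,\s]", "", regex=True)

# Anything that still is not a number becomes NaN instead of raising
salary = pd.to_numeric(stripped, errors="coerce")
print(salary)
```

Counting the NaN values after this step (`salary.isna().sum()`) is a quick way to see how many entries need manual review.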


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [30]:
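# Question 1: What is the median salary for Software Engineers in the United States?
# Filter for US and job title containing 'software engineer'.
# Caveat: the country match only catches spellings that contain "united states";
# free-text variants such as "US" or "USA" (both visible in df.head()) are excluded.
is_software_engineer = clean_df["job_title_clean"].str.contains("software engineer", na=False)
is_us = clean_df["What country do you work in?"].str.lower().str.contains("united states", na=False)
se_us = clean_df[is_software_engineer & is_us]
median_salary = se_us["salary"].median()
print(f"Median salary for Software Engineers in the United States: ${median_salary:,.0f}")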

Median salary for Software Engineers in the United States: $142,000
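
Because the country field is free text, a substring match on one spelling can miss many US respondents. The sketch below shows one possible normalization. The alias set and the `is_us_country` helper are assumptions for illustration and are certainly incomplete for the real data.

```python
import re

# Assumed aliases; extend after inspecting the column's value_counts()
US_ALIASES = {"us", "usa", "united states", "united states of america"}


def is_us_country(country) -> bool:
    """Return True when a free-text country entry looks like the United States."""
    if not isinstance(country, str):
        return False  # missing values (None/NaN) are not strings
    key = re.sub(r"[^a-z ]", "", country.lower())  # drop dots and other punctuation
    key = re.sub(r"\s+", " ", key).strip()         # collapse repeated spaces
    return key in US_ALIASES


for entry in ["United States", "US", "U.S.", "usa ", "United Kingdom", None]:
    print(repr(entry), is_us_country(entry))
```

With pandas, the same helper can be applied as `df["What country do you work in?"].map(is_us_country)` to build a boolean mask.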


In [31]:
# Question 2: Which US state has the highest average salary for tech workers?
# Filter for US and tech industry
is_tech = clean_df["industry_clean"].str.contains("tech|computing|software|information", na=False)
is_us = clean_df["What country do you work in?"].str.lower().str.contains("united states", na=False)
tech_us = clean_df[is_tech & is_us]
# Caveat: state_clean is used as-is, so a multi-state response counts as its own "state"
state_salary = tech_us.groupby("state_clean")["salary"].mean().sort_values(ascending=False)
if not state_salary.empty:
    top_state = state_salary.index[0]
    top_salary = state_salary.iloc[0]
    print(f"US state with highest average tech salary: {top_state} (${top_salary:,.0f})")
else:
    print("No data available for US tech workers by state.")

US state with highest average tech salary: California, Oregon ($200,000)
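
The winning "state" here is a comma-separated multi-state response. One way to handle such rows, shown on toy data under the assumption that they should simply be excluded, is to keep only single-state entries before grouping:

```python
import pandas as pd

# Toy rows for illustration only; these are not survey results
toy = pd.DataFrame({
    "state_clean": ["California", "California", "California, Oregon", "Texas", "Texas"],
    "salary": [150000, 130000, 200000, 110000, 90000],
})

# Ranking with multi-state rows left in: the combined label wins
with_multi = toy.groupby("state_clean")["salary"].mean().sort_values(ascending=False)

# Ranking after dropping any entry that lists more than one state
single_state = toy[~toy["state_clean"].str.contains(",", na=False)]
without_multi = single_state.groupby("state_clean")["salary"].mean().sort_values(ascending=False)

print(with_multi.index[0], "->", without_multi.index[0])
```

Excluding these rows is a judgment call; assigning them to every listed state is another defensible option, and the choice belongs in the cleaning notes.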


In [32]:
# Question 3: How much does salary increase on average for each year of experience in tech?
from sklearn.linear_model import LinearRegression

# Filter for tech industry
tech_exp = clean_df[clean_df["industry_clean"].str.contains("tech|computing|software|information", na=False)]
# Drop rows with missing experience or salary
tech_exp = tech_exp.dropna(subset=["experience_years", "salary"])

# Fit salary = intercept + slope * experience_years; the slope is the average change per year.
# experience_years holds range midpoints, so the slope is an approximation.
X = tech_exp[["experience_years"]].to_numpy()
y = tech_exp["salary"].to_numpy()
model = LinearRegression().fit(X, y)
slope = model.coef_[0]
print(f"Average salary increase per year of experience in tech: ${slope:,.0f}")

Average salary increase per year of experience in tech: $1,437
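
For a single predictor, the regression slope equals the covariance of experience and salary divided by the variance of experience. The sketch below checks that identity with `np.polyfit` on synthetic data built with a known slope of 2,000; none of these numbers come from the survey.

```python
import numpy as np

# Synthetic data with a known slope, for illustration only
years = np.array([1.0, 3.0, 6.0, 9.0, 15.5])
salary = 60000 + 2000 * years

# Least-squares line: polyfit returns coefficients from highest degree down
slope, intercept = np.polyfit(years, salary, deg=1)

# Same slope from the closed form cov(x, y) / var(x)
slope_formula = np.cov(years, salary, ddof=1)[0, 1] / np.var(years, ddof=1)

print(f"polyfit slope: {slope:,.2f} | closed-form slope: {slope_formula:,.2f}")
```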


In [33]:
# Question 4: What percentage of respondents work remotely vs. in-office?
# Note: this question is not in the brief above, and the survey has no work-location column,
# so the classification below relies on keywords in free-text fields.
remote_keywords = ["remote", "home", "telecommute"]
office_keywords = ["office", "onsite", "in-person"]

# Use the income-context and city columns for clues
def classify_work_type(row):
    context = str(row.get("If your income needs additional context, please provide it here:", "")).lower()
    city = str(row.get("What city do you work in?", "")).lower()
    if any(word in context for word in remote_keywords) or any(word in city for word in remote_keywords):
        return "remote"
    elif any(word in context for word in office_keywords) or any(word in city for word in office_keywords):
        return "office"
    else:
        return "unknown"

clean_df["work_type"] = clean_df.apply(classify_work_type, axis=1)

remote_count = (clean_df["work_type"] == "remote").sum()
office_count = (clean_df["work_type"] == "office").sum()
unknown_count = (clean_df["work_type"] == "unknown").sum()
total = len(clean_df)

remote_pct = remote_count / total * 100
office_pct = office_count / total * 100
unknown_pct = unknown_count / total * 100

print(f"Remote: {remote_pct:.1f}% | Office: {office_pct:.1f}% | Unknown: {unknown_pct:.1f}%")

Remote: 1.5% | Office: 0.1% | Unknown: 98.4%


In [34]:
# Question 5: Which industry (besides tech) has the highest median salary?
# Exclude tech-related industries
non_tech = clean_df[~clean_df["industry_clean"].str.contains("tech|computing|software|information", na=False)]
# Group by industry and calculate median salary
industry_median = non_tech.groupby("industry_clean")["salary"].median().sort_values(ascending=False)
if not industry_median.empty:
    top_industry = industry_median.index[0]
    top_median = industry_median.iloc[0]
    print(f"Highest paying non-tech industry: {top_industry} (Median salary: ${top_median:,.0f})")
else:
    print("No data available for non-tech industries.")

Highest paying non-tech industry: commercial building material distribution (Median salary: $400,000)
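
A median computed over a handful of respondents can dominate the ranking. A minimal sketch, assuming a hypothetical cut-off of 3 respondents and toy data, reports the group size next to the median and ranks only groups above the cut-off:

```python
import pandas as pd

# Toy rows for illustration only; these are not survey results
toy = pd.DataFrame({
    "industry_clean": ["law", "law", "law", "nonprofits", "nonprofits",
                       "nonprofits", "one-off free text"],
    "salary": [120000, 90000, 100000, 60000, 55000, 70000, 400000],
})

# Median and respondent count per industry
summary = toy.groupby("industry_clean")["salary"].agg(median="median", n="count")

MIN_RESPONDENTS = 3  # hypothetical threshold; pick one that suits the real data
reliable = summary[summary["n"] >= MIN_RESPONDENTS].sort_values("median", ascending=False)
print(reliable)
```

Printing `summary.sort_values("median", ascending=False).head()` before filtering shows how many top-ranked groups rest on one or two responses.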


In [35]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?
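
The bonus questions are left unanswered above. As a starting point for the gender-gap question, the sketch below compares group medians on toy data; the rows are invented, and a real answer would first restrict `clean_df` to comparable roles.

```python
import pandas as pd

# Toy rows for illustration only; these are not survey results
toy = pd.DataFrame({
    "gender_clean": ["man", "man", "woman", "woman", "non-binary"],
    "salary": [120000, 100000, 99000, 88000, 105000],
})

# Median salary per group, then the absolute and relative gap
medians = toy.groupby("gender_clean")["salary"].median()
gap = medians["man"] - medians["woman"]
gap_pct = gap / medians["man"] * 100
print(f"Median gap: ${gap:,.0f} ({gap_pct:.1f}% of the median for men)")
```

The same groupby pattern works for the degree question by grouping on `education_clean` instead.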

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $142,000
2. **Highest paying US state for tech:** California, Oregon ($200,000)
3. **Salary increase per year of experience:** $1,437
4. **Remote vs office percentage:** Remote: 1.5% | Office: 0.1% | Unknown: 98.4%
5. **Highest paying non-tech industry:** commercial building material distribution (Median salary: $400,000)

**Key insights:**
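- Cleaning choices change the answers: the top "state" for tech pay, California, Oregon, is a multi-state response rather than a single state
- In tech, the fitted line gives about $1,437 more salary per year of experience (USD respondents only, using range midpoints)
- The remote vs. office share cannot be determined from this survey: keyword matching left 98.4% of respondents unclassified
- The top non-tech result ($400,000 median) comes from an unusually specific industry label and should be checked against its respondent count before it is reported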

**Challenges faced:**
- Handling inconsistent job titles and location formats
- Standardizing salary and currency data
- Missing or ambiguous work type information
- Outlier detection and removal

**What you learned about vibe coding:**
- Real-world data requires flexible, judgment-based cleaning
- Automated cleaning steps must be validated with manual review
- Business questions drive the cleaning and analysis process
