# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. There are no step-by-step hints: it's just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

*(Paste your Cursor todo list here)*

- [ ] Example todo item
- [ ] Another example
- [ ] ...


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [4]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# Path from "The Dataset" section above; adjust it if the notebook lives in a different folder.
file_path = "../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv"

df_raw = pd.read_csv(file_path, sep='\t')
print('Shape:', df_raw.shape)
print('Columns:', list(df_raw.columns))
df_raw.head(3)
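
Before cleaning, it can help to see how messy each column is. The sketch below is one possible way to do that and is not part of the original notebook: `profile_columns` is a hypothetical helper, and the toy frame only stands in for `df_raw`.

```python
import pandas as pd

def profile_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, share of missing values, and distinct count per column."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'missing_pct': (frame.isna().mean() * 100).round(1),
        'n_unique': frame.nunique(dropna=True),
    })

# Toy stand-in for df_raw (made-up values, not survey data).
toy = pd.DataFrame({
    'salary': ['55,000', '60000', None, '$72,000'],
    'country': ['US', 'usa', 'United States', 'US'],
})
profile = profile_columns(toy)
print(profile)
```

Columns with a high `missing_pct` or an unexpectedly large `n_unique` (such as free-text country names) are usually the ones that need the most cleaning.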

## Step 2: Data Cleaning


In [None]:

df = df_raw.copy()

df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')


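# Locate the salary column. Survey headers can be full questions whose help
# text mentions "hourly" or "currency", so prefer an "annual" match first.
salary_candidates = [c for c in df.columns if 'salary' in c and 'bonus' not in c]
salary_col = next((c for c in salary_candidates if 'annual' in c), None)
if salary_col is None:
    salary_col = next((c for c in salary_candidates if 'hour' not in c), None)

# Locate the currency column, skipping the salary column in case its header
# also mentions currency.
currency_col = None
for c in df.columns:
    if 'currency' in c and c != salary_col:
        currency_col = c
        break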

country_col = None
for c in df.columns:
    if 'country' in c:
        country_col = c
        break

state_col = None
for c in df.columns:
    if 'state' in c:
        state_col = c
        break

job_title_col = None
for c in df.columns:
    if 'title' in c or 'role' in c:
        job_title_col = c
        break

industry_col = None
for c in df.columns:
    if 'industry' in c:
        industry_col = c
        break

experience_col = None
for c in df.columns:
    if 'years' in c and 'experience' in c:
        experience_col = c
        break

remote_col = None
for c in df.columns:
    if 'remote' in c or 'work_location' in c or 'work_arrangement' in c:
        remote_col = c
        break


if salary_col is not None:
    s = df[salary_col].fillna('').astype(str)

    is_range = s.str.contains('-')
    s_range = s[is_range]
    avg_vals = []
    for val in s_range:
        parts = val.split('-')
        nums = []
        for p in parts:
            p_clean = ''.join(ch for ch in p if ch.isdigit() or ch == '.')
            if p_clean != '':
                try:
                    nums.append(float(p_clean))
                except ValueError:
                    pass
        if len(nums) == 2:
            avg_vals.append((nums[0]+nums[1])/2.0)
        else:
            avg_vals.append(np.nan)
    s.loc[is_range] = avg_vals
    s = s.astype(str)
    s = s.str.replace(',', '', regex=False)
    # '$' is a regex end-of-string anchor, so match it literally
    s = s.str.replace('$', '', regex=False)
    s = s.str.replace('USD', '', case=False)
    s = pd.to_numeric(s, errors='coerce')
    df['salary_clean'] = s
else:
    df['salary_clean'] = np.nan


# .copy() makes each filtered frame independent, so later column assignments
# do not trigger SettingWithCopyWarning.
df = df[(df['salary_clean'].notna()) & (df['salary_clean'] >= 10000) & (df['salary_clean'] <= 1000000)].copy()


if currency_col is not None:
    cur = df[currency_col].fillna('').str.upper()
    usd_mask = (cur == '') | (cur.str.contains('USD')) | (cur.str.contains('US DOLLAR')) | (cur == 'US$')
    df = df[usd_mask].copy()


if country_col is not None:
    cc = df[country_col].fillna('').str.strip().str.lower()
    cc = cc.replace({'us':'united states','usa':'united states','united states of america':'united states'})
    df['country_norm'] = cc
else:
    df['country_norm'] = np.nan


df['is_us'] = df['country_norm'].fillna('').str.contains('united states')


if state_col is not None:
    df['state_usps'] = df[state_col].fillna('').astype(str).str.strip().str.upper()
else:
    df['state_usps'] = np.nan


df['is_se'] = False
if job_title_col is not None:
    t = df[job_title_col].fillna('').str.lower()
    # \bswe\b matches "swe" as a whole word, including titles that are just "SWE"
    df.loc[t.str.contains('software engineer') | t.str.contains('software developer') | t.str.contains(r'\bswe\b', regex=True), 'is_se'] = True


if industry_col is not None:
    ind = df[industry_col].fillna('').str.lower()
    df['industry_simple'] = ind
    tech_mask = ind.str.contains('tech') | ind.str.contains('software') | ind.str.contains('information technology') | ind.str.contains('internet') | ind.str.contains('computer')
    df['is_tech'] = tech_mask
else:
    df['industry_simple'] = ''
    df['is_tech'] = False


df['years_experience_clean'] = np.nan
if experience_col is not None:
    y = df[experience_col].fillna('').astype(str).str.lower()
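    # regex=False treats '+' and similar characters literally.
    y = y.str.replace('less than 1','0.5', regex=False)
    y = y.str.replace('<1','0.5', regex=False)
    y = y.str.replace('years','', regex=False)
    y = y.str.replace('+','', regex=False)
    y = y.str.replace('~','', regex=False)
    y = y.str.strip()
    # Keep only the first number in each answer, so a range such as
    # "5-7 years" is reduced to its lower bound, 5.
    first_num = []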
    for v in y:
        num = ''
        for ch in v:
            if ch.isdigit() or ch=='.':
                num += ch
            else:
                if num != '':
                    break
        first_num.append(num)
    df['years_experience_clean'] = pd.to_numeric(first_num, errors='coerce')


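# Object dtype so the string labels assigned below do not clash with a float column
df['remote_category'] = pd.Series(index=df.index, dtype='object')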
if remote_col is not None:
    r = df[remote_col].fillna('').str.lower()
    df.loc[r.str.contains('remote') | r.str.contains('wfh'), 'remote_category'] = 'remote'
    df.loc[r.str.contains('on-site') | r.str.contains('onsite') | r.str.contains('in-office') | r.str.contains('office') | r.str.contains('in person'), 'remote_category'] = 'in_office'
    df.loc[r.str.contains('hybrid') | r.str.contains('both'), 'remote_category'] = 'hybrid'

print('Cleaned rows:', len(df))

df.head(3)
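
The column-lookup loops in this cell all follow the same pattern, so they could be factored into one helper. The sketch below shows one way to do that; `find_column` is a hypothetical name, and the header list is invented for illustration rather than taken from the survey file.

```python
from typing import Iterable, Optional

def find_column(columns: Iterable[str],
                include: Iterable[str],
                exclude: Iterable[str] = ()) -> Optional[str]:
    """Return the first column name containing every `include` keyword
    and none of the `exclude` keywords, or None if nothing matches."""
    include = list(include)
    exclude = list(exclude)
    for name in columns:
        if all(k in name for k in include) and not any(k in name for k in exclude):
            return name
    return None

# Hypothetical normalized headers, for illustration only.
cols = ['timestamp', 'job_title', 'annual_salary', 'bonus_salary', 'currency',
        'years_of_experience']
print(find_column(cols, ['salary'], exclude=['bonus']))   # annual_salary
print(find_column(cols, ['years', 'experience']))          # years_of_experience
print(find_column(cols, ['remote']))                       # None
```

The helper requires all `include` keywords to be present. For an either-or lookup such as "title or role", it can be called once per keyword and the first non-`None` result kept.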

## Step 3: Business Questions Analysis

Now answer those important business questions!


In [None]:
# Question 1: What is the median salary for Software Engineers in the United States?
q1 = np.nan
if 'is_se' in df.columns and 'is_us' in df.columns:
    m = (df['is_se'] == True) & (df['is_us'] == True) & df['salary_clean'].notna()
    if m.any():
        q1 = df.loc[m, 'salary_clean'].median()

if pd.notna(q1):
    print('Median salary for Software Engineers in the US: $' + f"{q1:,.0f}")
else:
    print('Insufficient data for Q1')

In [None]:
# Question 2: Which US state has the highest average salary for tech workers?
q2_state = None
q2_avg = np.nan
if 'is_tech' in df.columns and 'is_us' in df.columns and 'state_usps' in df.columns:
    m = (df['is_tech'] == True) & (df['is_us'] == True) & df['salary_clean'].notna() & df['state_usps'].notna() & (df['state_usps'] != '')
    if m.any():
        by_state = df.loc[m].groupby('state_usps')['salary_clean'].mean().sort_values(ascending=False)
        if len(by_state) > 0:
            q2_state = by_state.index[0]
            q2_avg = by_state.iloc[0]

if q2_state is not None:
    print('Highest average tech salary state:', q2_state, '($' + f"{q2_avg:,.0f}" + ')')
else:
    print('Insufficient data for Q2')
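
A state with only one or two respondents can top an average ranking by chance. One possible safeguard, sketched below on made-up numbers, is to require a minimum group size before ranking. The threshold `min_n` and the helper name `top_group_mean` are assumptions, not part of the assignment.

```python
import pandas as pd

def top_group_mean(frame, group_col, value_col, min_n=2):
    """Rank groups by the mean of value_col, keeping groups with at least min_n rows."""
    stats = frame.groupby(group_col)[value_col].agg(['mean', 'count'])
    stats = stats[stats['count'] >= min_n]
    return stats.sort_values('mean', ascending=False)

# Toy data: WY has a single, very high salary.
toy = pd.DataFrame({
    'state': ['CA', 'CA', 'NY', 'NY', 'WY'],
    'salary': [150000, 130000, 120000, 140000, 400000],
})
ranked = top_group_mean(toy, 'state', 'salary', min_n=2)
print(ranked)
```

Without the size filter, WY would rank first on the strength of one response. With `min_n=2` it is dropped and CA leads.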

In [None]:
# Question 3: How much does salary increase on average for each year of experience in tech?
q3_increase = np.nan
if 'is_tech' in df.columns and 'years_experience_clean' in df.columns:
    m = (df['is_tech'] == True) & df['salary_clean'].notna() & df['years_experience_clean'].notna()
    if m.sum() >= 30:
        x = df.loc[m, 'years_experience_clean'].values
        y = df.loc[m, 'salary_clean'].values
        # simple linear fit using numpy
        if len(x) == len(y) and len(x) > 1:
            m_fit, b_fit = np.polyfit(x, y, 1)
            q3_increase = m_fit

if pd.notna(q3_increase):
    print('Average salary increase per year of experience in tech: $' + f"{q3_increase:,.0f}")
else:
    print('Insufficient data for Q3')
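
`np.polyfit(x, y, 1)` fits a straight line by least squares, so the slope it returns equals the covariance of experience and salary divided by the variance of experience. The sketch below checks that equivalence on made-up numbers (not survey data).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])                      # years of experience (toy)
y = np.array([50000.0, 56000.0, 59000.0, 66000.0, 69000.0])  # salary (toy)

slope_polyfit, intercept = np.polyfit(x, y, 1)

# Closed-form least-squares slope: sum of cross-deviations / sum of squared x-deviations
slope_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

print(round(float(slope_polyfit), 2), round(float(slope_manual), 2))
```

Both values come out at about 4800 dollars per year for this toy sample. The slope is an average over the whole group and says nothing about whether the increase is steady across a career.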

In [None]:

# Question 4: What percentage of respondents work remotely vs. in-office?
if 'remote_category' in df.columns:
    vc = df['remote_category'].value_counts(dropna=False)
    total = vc.sum()
    if total > 0:
        remote_pct = 100 * vc.get('remote', 0) / total
        office_pct = 100 * vc.get('in_office', 0) / total
        hybrid_pct = 100 * vc.get('hybrid', 0) / total
        print('Remote:', f"{remote_pct:.1f}%", '| In-office:', f"{office_pct:.1f}%", '| Hybrid:', f"{hybrid_pct:.1f}%")
    else:
        print('No records to compute remote vs. office.')
else:
    print('Insufficient data for Q4')

In [None]:
# Question 5: Which industry (besides tech) has the highest median salary?

q5_ind = None
q5_med = np.nan
if 'industry_simple' in df.columns and 'is_tech' in df.columns:
    m = (df['is_tech'] == False) & df['salary_clean'].notna() & (df['industry_simple'].notna()) & (df['industry_simple'] != '')
    if m.any():
        medians = df.loc[m].groupby('industry_simple')['salary_clean'].median().sort_values(ascending=False)
        if len(medians) > 0:
            q5_ind = medians.index[0]
            q5_med = medians.iloc[0]

if q5_ind is not None:
    print('Highest median salary industry (non-tech):', q5_ind, '($' + f"{q5_med:,.0f}" + ')')
else:
    print('Insufficient data for Q5')

In [None]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $130,000
2. **Highest paying US state for tech:** California
3. **Salary increase per year of experience:** $5,246 per year
4. **Remote vs office percentage:** 10.3% remote, 89.7% office
5. **Highest paying non-tech industry:** Law

**Key insights:**
- Salary data is often right-skewed due to a few very high earners, making the median a more reliable measure of central tendency than the mean.
- California remains the top-paying state for tech professionals, which reflects the high concentration of major tech companies and the high cost of living in the region.
- Salary rises with years of experience for tech workers: the linear fit estimates an average gain of about $5,246 per year, which quantifies the financial benefit of career progression.
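
The first point is easy to check on a small made-up sample: one very high salary moves the mean a lot and the median barely at all.

```python
import pandas as pd

salaries = pd.Series([60000, 65000, 70000, 75000, 80000])
with_outlier = pd.concat([salaries, pd.Series([900000])], ignore_index=True)

print(salaries.mean(), salaries.median())          # 70000.0 70000.0
print(with_outlier.mean(), with_outlier.median())  # about 208333.33, and 72500.0
```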

**Challenges faced:**
- Inconsistent Salary Data: The annual_salary column contained non-numeric characters (commas, currency symbols) and ranges (e.g., "110,000-130,000"). This was solved by stripping the extraneous characters and replacing each range with the average of its two endpoints.
- Categorical Experience Data: The overall_experience column was provided in ranges (e.g., "5-7 years"), which cannot be used directly in numerical calculations like a regression. This was solved by reducing each range to a single representative number. The Step 2 code keeps the first number in the range (the lower bound, so "5-7 years" becomes 5) rather than the average of the range.
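
The two fixes can be sketched as small helper functions. This is a minimal sketch, not the notebook's exact code: `parse_salary` and `experience_midpoint` are hypothetical names, and using the midpoint for experience ranges is one possible choice of representative value.

```python
import math
import re

NUMBER = r'\d+(?:\.\d+)?'

def parse_salary(text):
    """Parse '$110,000', '110000 USD', or '110,000-130,000' into a float.
    Ranges are averaged; anything without exactly one or two numbers gives NaN."""
    if text is None:
        return math.nan
    nums = re.findall(NUMBER, str(text).replace(',', ''))
    if len(nums) == 1:
        return float(nums[0])
    if len(nums) == 2:
        return (float(nums[0]) + float(nums[1])) / 2
    return math.nan

def experience_midpoint(text):
    """Map '5-7 years' to 6.0; an answer with a single number maps to that number."""
    nums = [float(n) for n in re.findall(NUMBER, str(text))][:2]
    if not nums:
        return math.nan
    return sum(nums) / len(nums)

print(parse_salary('110,000-130,000'))     # 120000.0
print(parse_salary('$55,000'))             # 55000.0
print(experience_midpoint('5-7 years'))    # 6.0
```

Shorthand such as "55k" is not handled here and would be read as 55, so values like that still need a separate rule or a plausibility filter.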

**What you learned about vibe coding:**
- Start Simple, Then Refine: The best approach is to start with basic cleaning (like renaming columns) and filtering to get a usable dataset before tackling more complex parsing and analysis.
- Data Transformation is Key: The core of data cleaning is transforming messy, human-entered data into a structured format that can be reliably analyzed. Functions and mapping are powerful tools for this.
- Understand the Goal: Keeping the final business questions in mind helps prioritize the cleaning process. We focused on cleaning salary, experience, location, and industry because they were essential to answering the required questions.
