# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

*(Paste your Cursor todo list here)*

- [ ] Example todo item
- [ ] Another example
- [ ] ...


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [129]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## Step 2: Data Cleaning


In [130]:
df = pd.read_csv("../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv", sep= "\t")

In [131]:
# Rename the long survey-question columns to short snake_case names
df.rename(columns={
    'Timestamp': 'timestamp',
    'How old are you?': 'age',
    'What industry do you work in?': 'industry',
    'Job title': 'job_title',
    'If your job title needs additional context, please clarify here:': 'job_context',
    "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)": 'annual_salary',
    'How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.': 'additional_comp',
    'Please indicate the currency': 'currency',
    'If "Other," please indicate the currency here: ': 'currency_other',
    'If your income needs additional context, please provide it here:': 'income_context',
    'What country do you work in?': 'country',
    "If you're in the U.S., what state do you work in?": 'state',
    'What city do you work in?': 'city',
    'How many years of professional work experience do you have overall?': 'total_experience',
    'How many years of professional work experience do you have in your field?': 'field_experience',
    'What is your highest level of education completed?': 'education',
    'What is your gender?': 'gender',
    'What is your race? (Choose all that apply.)': 'race'
}, inplace=True)

In [132]:
df.head(3)

,timestamp,age,industry,job_title,job_context,annual_salary,additional_comp,currency,currency_other,income_context,country,state,city,total_experience,field_experience,education,gender,race
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White


In [133]:
df.shape

(28062, 18)
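
Before cleaning anything, it can help to see how much of each column is missing. The following is a minimal sketch on a toy frame with made-up values (the `toy` frame and its contents are assumptions for illustration, not survey rows); the same two lines should work on the loaded `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; substitute the loaded survey `df`.
toy = pd.DataFrame({
    "annual_salary": ["55,000", "54600", "34,000"],
    "additional_comp": [0.0, 4000.0, np.nan],
    "state": ["Massachusetts", np.nan, "Tennessee"],
    "currency_other": [np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, worst first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns near 100% missing (here `currency_other`) are usually optional free-text fields rather than data problems.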

In [134]:
df['job_title']

0              Research and Instruction Librarian
1        Change & Internal Communications Manager
2                            Marketing Specialist
3                                 Program Manager
4                              Accounting Manager
                           ...                   
28057                                Veterinarian
28058                           Systems Architect
28059                   Risk Management Associate
28060                                          IT
28061                      Clinical physiologist 
Name: job_title, Length: 28062, dtype: object

In [135]:
# Remove commas and convert to float
df['annual_salary'] = df['annual_salary'].str.replace(',', '').astype(float)
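
The conversion above raises if any entry is not a clean number. A more defensive variant, sketched here on a toy Series with made-up values, uses `pd.to_numeric(..., errors="coerce")` so that unparseable entries become NaN and can be inspected afterwards. Whether the real column contains such entries is an assumption to check.

```python
import pandas as pd

# Toy stand-in for the raw 'annual_salary' column (hypothetical values).
raw_salary = pd.Series(["55,000", "54600", " 34,000 ", "n/a", None])

# Strip thousands separators and whitespace, then coerce anything
# non-numeric to NaN instead of raising an exception.
clean_salary = pd.to_numeric(
    raw_salary.str.replace(",", "", regex=False).str.strip(),
    errors="coerce",
)
print(clean_salary.tolist())
```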

In [136]:
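# Total compensation = base salary + additional compensation
# (NaN in additional_comp propagates, so those rows get a NaN total)
df['total_compensation'] = df['annual_salary'] + df['additional_comp']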
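
Because `additional_comp` is blank for many respondents, a plain sum leaves those totals as NaN. One possible alternative, shown on a toy frame with made-up values, treats a blank as zero extra compensation. That interpretation is a judgment call, not something the survey states.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mirroring the two compensation columns.
toy = pd.DataFrame({
    "annual_salary": [55000.0, 54600.0, 34000.0],
    "additional_comp": [0.0, 4000.0, np.nan],
})

# Assumption: a missing additional_comp means "no additional compensation".
toy["total_compensation"] = toy["annual_salary"] + toy["additional_comp"].fillna(0)
print(toy)
```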

In [137]:
# Replace all industries containing 'Tech' with 'Tech'
df['industry'] = np.where(df['industry'].str.contains('Tech', case=False, na=False),
                          'Tech',
                          df['industry'])
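
A case-insensitive match on "tech" also catches labels such as "Biotech", which may or may not belong in the tech group. A quick audit of which raw labels would be collapsed can be sketched like this; the toy labels are made up, and the check is meant to run before the column is overwritten.

```python
import pandas as pd

# Toy industry labels (hypothetical); use the raw 'industry' column in practice.
raw_industry = pd.Series([
    "Computing or Tech",
    "Biotech",
    "Education (Higher Education)",
    "Law",
    None,
])

# Same matching rule as the np.where call: substring "tech", any case.
is_tech = raw_industry.str.contains("tech", case=False, na=False)

# Labels that would all be renamed to 'Tech', with their frequencies.
collapsed_labels = raw_industry[is_tech].value_counts()
print(collapsed_labels)
```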



In [138]:
# Likewise, collapse every industry label containing 'Education' into 'Education'
df['industry'] = np.where(df['industry'].str.contains('Education', case=False, na=False),
                          'Education',
                          df['industry'])



In [139]:
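# Step 1: Split multi-state answers on commas; missing values become an empty list
df['state'] = df['state'].apply(lambda x: x.split(',') if isinstance(x, str) else [])

# Step 2: Explode each list into one row per state (explode returns a new DataFrame, so reassign)
# Empty lists become NaN, and multi-state respondents now appear once per state
df = df.explode('state')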
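
If multi-state answers are separated by a comma and a space, splitting on `','` alone leaves a leading space on every state after the first, so `' Texas'` and `'Texas'` end up as different groups. A minimal sketch of split, explode, then strip, on made-up values (the separator format in the real data is an assumption to verify):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one multi-state answer and one blank.
toy = pd.DataFrame({
    "state": ["Massachusetts", "California, Colorado", np.nan],
    "annual_salary": [55000.0, 120000.0, 34000.0],
})

# Split on commas, give each state its own row, then trim stray whitespace.
toy["state"] = toy["state"].str.split(",")
toy = toy.explode("state").reset_index(drop=True)
toy["state"] = toy["state"].str.strip()
print(toy)
```

Exploding duplicates the salary for multi-state respondents, so row counts after this step are no longer respondent counts.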


In [145]:
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Extract month and year into new columns
df['month'] = df['timestamp'].dt.month
df['year'] = df['timestamp'].dt.year


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [140]:
# Question 1: What is the median salary for Software Engineers in the United States?
df[(df['job_title'] == "Software Engineer") & (df['country'] == "United States")].agg({"annual_salary":"median","total_compensation":"median"})

annual_salary         128000.0
total_compensation    143000.0
dtype: float64
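
The filter above only matches the exact strings "Software Engineer" and "United States", yet the preview rows already show "US" as a country value. A hedged sketch of a looser match follows; the alias set `US_ALIASES` and the helper `is_us` are hypothetical names, and the list is certainly incomplete for free-text answers.

```python
import pandas as pd

# Assumption: a hand-picked, incomplete set of spellings for the United States.
US_ALIASES = {"united states", "united states of america", "us", "usa", "u.s.", "u.s.a."}

def is_us(country: pd.Series) -> pd.Series:
    """Return a boolean mask for rows whose country matches a known US alias."""
    normalized = country.fillna("").str.strip().str.lower()
    return normalized.isin(US_ALIASES)

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "job_title": ["Software Engineer", "software engineer ", "Software Engineer", "Librarian"],
    "country": ["United States", "USA", "Canada", "US"],
    "annual_salary": [120000.0, 100000.0, 90000.0, 55000.0],
})

# Case- and whitespace-insensitive title match.
is_swe = toy["job_title"].str.strip().str.lower().eq("software engineer")
median_swe_us = toy.loc[is_swe & is_us(toy["country"]), "annual_salary"].median()
print(median_swe_us)
```

Whether titles like "Senior Software Engineer" should count is a separate business decision that this sketch does not make.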

In [141]:
df['job_title'].value_counts()

job_title
Software Engineer              289
Project Manager                230
Director                       199
Senior Software Engineer       196
Program Manager                152
                              ... 
instructor of medicine           1
Director of Agency Services      1
Teacher - high school            1
Payroll Administration           1
Clinical physiologist            1
Name: count, Length: 14348, dtype: int64

In [142]:
# Question 2: Which US state has the highest average salary for tech workers?
tech_by_state = df[df['industry'] == 'Tech'].groupby('state').agg({"annual_salary":"mean"})
tech_by_state.sort_values(by = "annual_salary", ascending = False)

state,annual_salary
Texas,222500.000000
Oregon,200000.000000
Washington,193333.333333
Colorado,176000.000000
Massachusetts,175000.000000
...,...
Louisiana,86080.000000
Oklahoma,86011.846154
Pennsylvania,83000.000000
Arkansas,74257.000000
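
Very round averages at the top of a ranking often come from groups with only one or two rows. One way to guard against that, sketched on made-up values, is to compute the count alongside the mean and require a minimum group size. The threshold `MIN_RESPONSES` is a hypothetical choice, not a value from the assignment.

```python
import pandas as pd

MIN_RESPONSES = 2  # hypothetical threshold; pick one that suits the real data

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "state": ["Texas", "California", "California", "California", "Oregon", "Oregon"],
    "annual_salary": [222500.0, 150000.0, 130000.0, 140000.0, 100000.0, 110000.0],
})

# Mean salary and number of responses per state.
by_state = toy.groupby("state")["annual_salary"].agg(["mean", "count"])

# Keep only states with enough responses, highest mean first.
reliable = by_state[by_state["count"] >= MIN_RESPONSES].sort_values("mean", ascending=False)
print(reliable)
```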


In [152]:
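# Question 3: How much does salary increase on average for each year of experience in tech?
# Mean salary per experience band in tech; the bands are text ranges, so this ranks them rather than giving a per-year figure
df[df['industry'] == 'Tech'].groupby('field_experience').agg({'annual_salary': 'mean'}).reset_index().sort_values('annual_salary', ascending=False)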

,field_experience,annual_salary
3,21 - 30 years,180862.244344
5,41 years or more,155666.666667
4,31 - 40 years,148553.396825
1,11 - 20 years,147654.005268
7,8 - 10 years,137133.17268
6,5-7 years,135303.546448
0,1 year or less,114074.372881
2,2 - 4 years,106837.628114
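
To turn the experience bands into a dollars-per-year figure, one possible approach is to map each band to an approximate midpoint and fit a straight line. The sketch below uses made-up salaries; the midpoints in `EXPERIENCE_MIDPOINTS` (especially 41 for the open-ended top band) are assumptions, and a linear fit is only one of several reasonable definitions of "average increase per year".

```python
import numpy as np
import pandas as pd

# Assumed midpoint (in years) for each survey experience band.
EXPERIENCE_MIDPOINTS = {
    "1 year or less": 0.5,
    "2 - 4 years": 3.0,
    "5-7 years": 6.0,
    "8 - 10 years": 9.0,
    "11 - 20 years": 15.5,
    "21 - 30 years": 25.5,
    "31 - 40 years": 35.5,
    "41 years or more": 41.0,
}

# Toy frame with hypothetical salaries that rise by exactly 2000 per year.
toy = pd.DataFrame({
    "field_experience": ["1 year or less", "2 - 4 years", "5-7 years", "8 - 10 years"],
    "annual_salary": [101000.0, 106000.0, 112000.0, 118000.0],
})

years = toy["field_experience"].map(EXPERIENCE_MIDPOINTS).to_numpy()
salary = toy["annual_salary"].to_numpy()

# Least-squares line: salary ~ slope * years + intercept.
slope, intercept = np.polyfit(years, salary, deg=1)
print(f"Estimated increase per year of experience: {slope:.0f}")
```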


In [159]:
# Question 4: What percentage of respondents work remotely vs. in-office?
df.columns

Index(['timestamp', 'age', 'industry', 'job_title', 'job_context',
       'annual_salary', 'additional_comp', 'currency', 'currency_other',
       'income_context', 'country', 'state', 'city', 'total_experience',
       'field_experience', 'education', 'gender', 'race', 'total_compensation',
       'month', 'year'],
      dtype='object')

In [174]:
# Estimate the share of remote vs in-office respondents

# Count mentions of 'remote' in country plus mentions in city
# (a row that mentions it in both columns is counted twice)
remote_count = df['country'].str.contains('remote', case=False, na=False).sum() + df['city'].str.contains('remote', case=False, na=False).sum()

# Total number of respondents
total_respondents = len(df)

# Percentage of remote workers
remote_percentage = (remote_count / total_respondents) * 100
in_office_percentage = 100 - remote_percentage

print(f"Remote: {remote_percentage:.2f}%")
print(f"In-office: {in_office_percentage:.2f}%")


Remote: 1.19%
In-office: 98.81%
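
Adding the two counts counts a respondent twice when both `country` and `city` mention "remote". Combining the two boolean masks with `|` counts each row once. A minimal sketch on made-up values follows; note that the survey has no dedicated remote-work column, so any figure derived this way is a rough proxy.

```python
import pandas as pd

# Toy frame with hypothetical values; row 1 mentions "remote" in both columns.
toy = pd.DataFrame({
    "country": ["United States", "Remote", "US (remote)", "Canada"],
    "city": ["Boston", "Remote", "Austin", None],
})

# True when either column mentions "remote"; each row counts at most once.
mentions_remote = (
    toy["country"].str.contains("remote", case=False, na=False)
    | toy["city"].str.contains("remote", case=False, na=False)
)

remote_pct = mentions_remote.mean() * 100
print(f"Remote: {remote_pct:.2f}%")
```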


In [177]:
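# Question 5: Which industry (besides tech) has the highest median salary?
# Note: this ranks every industry label, including 'Tech' and one-off write-ins, and mixes currencies
df.groupby('industry').agg({"annual_salary":"median"}).sort_values(by = "annual_salary", ascending = False).reset_index()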

,industry,annual_salary
0,ESL Teacher,36000000.0
1,video games,3225000.0
2,Customer service,2300000.0
3,Energy / renewables,1115000.0
4,translation,1100000.0
...,...,...
1093,Stay-at-home parent,0.0
1094,Student,0.0
1095,I'm currently a student and don't have a job,0.0
1096,student,0.0
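
The ranking above is dominated by single write-in labels and by salaries that are probably not in US dollars. One possible cleanup, sketched on made-up values, restricts to a single currency, drops 'Tech', and requires a minimum number of responses per industry. The USD-only filter and the `MIN_RESPONSES` threshold are assumptions, not requirements from the assignment.

```python
import pandas as pd

MIN_RESPONSES = 2  # hypothetical threshold for a trustworthy median

# Toy frame with hypothetical values, including one extreme write-in.
toy = pd.DataFrame({
    "industry": ["Tech", "Law", "Law", "Education", "Education", "ESL Teacher"],
    "currency": ["USD", "USD", "USD", "USD", "USD", "Other"],
    "annual_salary": [150000.0, 120000.0, 100000.0, 60000.0, 50000.0, 36000000.0],
})

# Compare like with like: one currency, tech excluded.
subset = toy[(toy["currency"] == "USD") & (toy["industry"] != "Tech")]

# Median and group size per industry, then drop tiny groups.
stats = subset.groupby("industry")["annual_salary"].agg(["median", "count"])
top_industries = stats[stats["count"] >= MIN_RESPONSES].sort_values("median", ascending=False)
print(top_industries)
```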


In [None]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?
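
For the gender gap question, a simple starting point is to compare group medians within tech. The sketch below uses made-up values and assumes the `gender` column contains the labels "Man" and "Woman" (only "Woman" and "Non-binary" are visible in the preview rows). It does not control for role, seniority, or location, so it is a raw gap rather than a like-for-like comparison.

```python
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "industry": ["Tech", "Tech", "Tech", "Tech", "Law"],
    "gender": ["Man", "Man", "Woman", "Woman", "Woman"],
    "annual_salary": [120000.0, 100000.0, 100000.0, 98000.0, 90000.0],
})

# Median salary per gender, tech rows only.
tech = toy[toy["industry"] == "Tech"]
medians = tech.groupby("gender")["annual_salary"].median()

# Raw gap in dollars and as a percentage of the median for men.
gap = medians["Man"] - medians["Woman"]
gap_pct = gap / medians["Man"] * 100
print(f"Gap: {gap:.0f} ({gap_pct:.1f}%)")
```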

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $X
2. **Highest paying US state for tech:** State Name
3. **Salary increase per year of experience:** $X per year
4. **Remote vs office percentage:** X% remote, Y% office
5. **Highest paying non-tech industry:** Industry Name

**Key insights:**
- Insight 1
- Insight 2
- Insight 3

**Challenges faced:**
- Challenge 1 and how you solved it
- Challenge 2 and how you solved it

**What you learned about vibe coding:**
- Learning 1
- Learning 2
- Learning 3
