# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step by step hints just you, your AI assistant, and real world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

*(Paste your Cursor todo list here)*

- Load TSV and profile schema, missingness, and distributions
- Normalize column names and data types; trim/clean text
- Standardize compensation (parse, annualize, convert to USD)
- Standardize job metadata and location fields
- Handle missing data, outliers, and duplicates; add flags
- Document transformations and export clean dataset


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [33]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


df = pd.read_csv('Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv', sep='\t')
print(df.shape)
#print(df.head(5))
print(df.columns.tolist())
#print(df.dtypes)


(28062, 18)
['Timestamp', 'How old are you?', 'What industry do you work in?', 'Job title', 'If your job title needs additional context, please clarify here:', "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)", 'How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.', 'Please indicate the currency', 'If "Other," please indicate the currency here: ', 'If your income needs additional context, please provide it here:', 'What country do you work in?', "If you're in the U.S., what state do you work in?", 'What city do you work in?', 'How many years of professional work experience do you have overall?', 'How many years of professional work experience do you have in your field?', '

## Step 2: Data Cleaning


In [34]:
df.rename(columns={
    'How old are you?': 'Age',
    'What industry do you work in?': 'Industry',
    'Job title': 'Job_Title',
    'If your job title needs additional context, please clarify here:': 'Job_Context',
    'How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.': 'Additional_Compensation',
    'Please indicate the currency': 'Currency',
    'If "Other," please indicate the currency here: ': 'Currency_Other',
    'If your income needs additional context, please provide it here:': 'Income_Context',
    'What country do you work in?': 'Country',
    "If you're in the U.S., what state do you work in?": 'State',
    'What city do you work in?': 'City',
    'How many years of professional work experience do you have overall?': 'Years_Experience_Total',
    'How many years of professional work experience do you have in your field?': 'Years_Experience_Field',
    'What is your highest level of education completed?': 'Education',
    'What is your gender?': 'Gender',
    'What is your race? (Choose all that apply.)': 'Race'
}, inplace=True)

df = df.rename(columns={
    "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)": "salary"
})

df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(r"[^a-z0-9]+", "_", regex=True)
    .str.strip("_")
)

for col in df.select_dtypes(include="object").columns:
    df[col] = (
        df[col]
        .astype(str)
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
    )



In [35]:
print(df.shape)
#print(df.head(5))
print(df.columns.tolist())
#print(df.dtypes)


(28062, 18)
['timestamp', 'age', 'industry', 'job_title', 'job_context', 'salary', 'additional_compensation', 'currency', 'currency_other', 'income_context', 'country', 'state', 'city', 'years_experience_total', 'years_experience_field', 'education', 'gender', 'race']


In [36]:
#cursor suggested cleaning
df.columns = df.columns.str.strip().str.lower().str.replace(r"[^a-z0-9]+", "_", regex=True).str.strip("_")

for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype(str).str.strip().str.replace(r"\s+", " ", regex=True)

#fillna
text_cols = ['job_context','income_context','state','currency_other']
for col in text_cols:
    if col in df:
        df[col] = df[col].replace({'nan':'N/A'}).fillna('N/A')


#convert num colsn
numeric_cols = ['salary','additional_compensation','years_experience_field']
for col in numeric_cols:
    if col in df:
        # Remove commas 
        if col in ['salary','additional_compensation']:
            df[col] = df[col].astype(str).str.replace(r"[^0-9.]", "", regex=True)
        df[col] = pd.to_numeric(df[col], errors='coerce')

# total compensation
if {'salary','additional_compensation'}.issubset(df.columns):
    df['total_comp'] = df['salary'].fillna(0) + df['additional_compensation'].fillna(0)

# 8️⃣ Convert currencies to USD
fx = {"USD":1.0, "CAD":0.79, "GBP":1.37, "EUR":1.18, "AUD":0.75, "INR":0.013}
if 'currency' in df:
    rate = df['currency'].str.upper().str.strip().map(fx).fillna(1.0)
    for col in ['salary','additional_compensation','total_comp']:
        if col in df:
            df[f"{col}_usd"] = df[col] * rate

# 9️⃣ Clean job title and country
if 'job_title' in df:
    df['job_title_clean'] = df['job_title'].str.title()
if 'country' in df:
    df['country_std'] = df['country'].str.title()

# Fix US country aliases
us_aliases = ['Us', 'Usa', 'United States', 'United States Of America', 'U.S.', 'U.S.A.']
if 'country_std' in df:
    df.loc[df['country_std'].isin(us_aliases), 'country_std'] = 'United States'

# 🔟 Drop rows with missing critical columns and duplicates
required_cols = ['salary_usd','job_title_clean','country_std']
df = df.dropna(subset=[c for c in required_cols if c in df])

keys = ['timestamp','job_title_clean','salary_usd','country_std']
df = df.drop_duplicates(subset=[k for k in keys if k in df])

# 1️⃣1️⃣ Convert years of experience to numeric (extract first number)
if 'years_experience_field' in df:
    df['years_experience_field'] = df['years_experience_field'].astype(str).str.extract(r'(\d+)')
    df['years_experience_field'] = pd.to_numeric(df['years_experience_field'], errors='coerce')

# ✅ DataFrame is now clean and ready for analysis
print(df.head())

            timestamp    age                       industry  \
0  4/27/2021 11:02:10  25-34   Education (Higher Education)   
1  4/27/2021 11:02:22  25-34              Computing or Tech   
2  4/27/2021 11:02:38  25-34  Accounting, Banking & Finance   
3  4/27/2021 11:02:41  25-34                     Nonprofits   
4  4/27/2021 11:02:42  25-34  Accounting, Banking & Finance   

                                  job_title job_context  salary  \
0        Research and Instruction Librarian         N/A   55000   
1  Change & Internal Communications Manager         N/A   54600   
2                      Marketing Specialist         N/A   34000   
3                           Program Manager         N/A   62000   
4                        Accounting Manager         N/A   60000   

   additional_compensation currency currency_other income_context  ...  \
0                      0.0      USD            N/A            N/A  ...   
1                   4000.0      GBP            N/A            N/A  ...

In [37]:
print(df.shape)
print(df.head(10))
print(df.columns.tolist())
print(df.dtypes)

(28062, 24)
            timestamp    age                       industry  \
0  4/27/2021 11:02:10  25-34   Education (Higher Education)   
1  4/27/2021 11:02:22  25-34              Computing or Tech   
2  4/27/2021 11:02:38  25-34  Accounting, Banking & Finance   
3  4/27/2021 11:02:41  25-34                     Nonprofits   
4  4/27/2021 11:02:42  25-34  Accounting, Banking & Finance   
5  4/27/2021 11:02:46  25-34   Education (Higher Education)   
6  4/27/2021 11:02:51  25-34                     Publishing   
7  4/27/2021 11:03:00  25-34  Education (Primary/Secondary)   
8  4/27/2021 11:03:01  45-54              Computing or Tech   
9  4/27/2021 11:03:02  35-44  Accounting, Banking & Finance   

                                  job_title                   job_context  \
0        Research and Instruction Librarian                           N/A   
1  Change & Internal Communications Manager                           N/A   
2                      Marketing Specialist                    

## Step 3: Business Questions Analysis

Now answer those important business questions!


In [38]:
# Question 1: What is the median salary for Software Engineers in the United States?
median_salary = df.query('job_title_clean.str.contains("Software Engineer", case=False) and country_std == "United States"')['salary_usd'].median()

print(f"The median salary for Software Engineers in the US is ${median_salary:,.0f}")

df_filtered = df[
    df['job_title_clean'].str.contains("Software Engineer", case=False, na=False) &
    (df['country_std'] == "United States") &
    (df['salary_usd'] > 0)
]
df_filtered[['job_title_clean','country_std','salary_usd']].head(10)


The median salary for Software Engineers in the US is $145,000


Unnamed: 0,job_title_clean,country_std,salary_usd
43,Principal Software Engineer,United States,187500.0
215,Software Engineer,United States,139000.0
321,Software Engineer,United States,156000.0
389,Software Engineer,United States,112000.0
455,Software Engineer,United States,80000.0
511,Senior Software Engineer,United States,120000.0
557,Embedded Software Engineer,United States,80000.0
746,Senior Software Engineer,United States,160367.0
831,Software Engineer 2,United States,128000.0
838,Software Engineer Team Lead,United States,110500.0


In [39]:
# Question 2: Which US state has the highest average salary for tech workers?
df_tech = df[df['job_title_clean'].str.contains("engineer|developer|software|tech|analyst", case=False, na=False)]
df_tech = df_tech[df_tech['state'].notna() & (df_tech['state'].str.strip() != 'N/A')]
df_tech['state'] = df_tech['state'].str.split(',').str[0].str.strip()

avg_by_state = df_tech.groupby('state')['salary_usd'].mean()
best_state = avg_by_state.idxmax()
best_salary = avg_by_state.max()

print(f"{best_state} has the highest average tech salary of ${best_salary:,.0f}")


California has the highest average tech salary of $139,587


In [40]:
# Question 3: How much does salary increase on average for each year of experience in tech?

''' I am using years_experience_total instead of years_experience_field because I had no idea how to make a range of numbers into 
just one number, and vibe coding is not helping and making things more confusing '''

df['years_experience_total'] = (
    df['years_experience_total']
      .astype(str)
      .str.extract(r'(\d+)')
      .astype(float)
)

# Filter US tech workers
df_us_tech = df[
    (df['country_std'].str.lower().isin(['united states','us','usa'])) &
    (df['salary_usd'] > 0) &
    (df['years_experience_total'].notna()) &
    (df['job_title_clean'].str.contains("engineer|developer|software|tech|analyst", case=False, na=False))
]

avg_increase = df_us_tech['salary_usd'].sum() / df_us_tech['years_experience_total'].sum()
print(f"Salary average increase per year of experience: ${avg_increase:,.2f}")



Salary average increase per year of experience: $11,348.36


In [41]:

# Question 4: What percentage of respondents work remotely vs. in-office?
total_respondents = len(df)

#guessing remote is where theres Nan for any city?? 
remote = len(df[
    ((df['city'].isna()) | (df['city'].str.strip().isin(['', 'N/A', 'nan']))) &
    ((df['state'].isna()) | (df['state'].str.strip().isin(['', 'N/A', 'nan']))) &
    (df['country'].notna()) & 
    (~df['country'].str.strip().isin(['', 'N/A', 'nan']))
])

in_office = total_respondents - remote

remote_pct = (remote / total_respondents) * 100
office_pct = (in_office / total_respondents) * 100

print(f"Based on city not being given:")
print(f"Remote workers: {remote_pct:.1f}% ({remote:,})")
print(f"In-office workers: {office_pct:.1f}% ({in_office:,})")



Based on city not being given:
Remote workers: 0.1% (28)
In-office workers: 99.9% (28,034)


In [42]:
# Question 5: Which industry (besides tech) has the highest median salary?

#NO tech
df_non_tech_industry = df[
    (df['industry'].str.contains('tech|computing', case=False, na=False)) &
    (df['salary_usd'] > 0) &
    (df['salary_usd'] <= 300_000)  #Taking away outliers (??)
]

#groupby + median
median_by_industry = df_non_tech_industry.groupby('industry')['salary_usd'].median().sort_values(ascending=False)

#top not tech industry
top_industry = median_by_industry.index[0]
top_salary = median_by_industry.iloc[0]

print(f"The industry besides tech with the highest median salary is '{top_industry}' at ${top_salary:,.0f}")

# Show top 5
# print("\nTop 5 highest paying industries (excluding tech):")
# for i, (industry, salary) in enumerate(median_by_industry.head().items(), 1):
#     print(f"{i}. {industry}: ${salary:,.0f}")

The industry besides tech with the highest median salary is 'Pharmaceutical/biotechnology' at $227,500


In [43]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $145,000
2. **Highest paying US state for tech:** California
3. **Salary increase per year of experience:** $11,348.36
4. **Remote vs office percentage:** Do not think this is correct, 0.1% remote, 99.9% in person
5. **Highest paying non-tech industry:** Pharmaceutical/biotechnology

**Key insights:**
- I dont think my code is completely correct, but industries outside of tech, like sports or healthcare, have high median salaries, showing that tech and medicine aren't the only high-paying fields.
- Cleaning the data and making sure your variable names are correct is very important, and one mistake with naming can change the entire solution. If you're vibe coding, the LLM can change the variable names from the ones you like to use, messing up the data. 
- We took the median in this challenge because the mean can be easily skewed by something like an outlier or a group of people who earn a lot. The median is a good way to find the middle balance.

**Challenges faced:**
- I did not like how vibe coding cleaned and renamed the data. I solved this by using my own skills and renaming it myself.
- The salary data was stored as strings with symbols and commas, which made calculating it difficult. I fixed this by cleaned the column by removing the symbols and converting it to a numeric format.

**What you learned about vibe coding:**
- I learnt that although vibe coding is very popular and can be useful, it is sometimes harder to use than just coding yourself. You have to be really careful and go step by step, but even then it can be confusing and the LLM can go in a different direction from what you want. It also gets super confusing, since you ask the LLM to do one thing, and it gets very complicated and provides a solution that you have to take a long time to understand
- You have to be very specific with what you want. You must give the LLM the variables you want to use, and the methods you want to avoid.
- Vibe coding may give you more complicated solutions than the human coded ones.
