# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

*(Paste your Cursor todo list here)*

- [ ] Example todo item
- [ ] Another example
- [ ] ...


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [6]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (TSV format)
file_path = "../../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv"

# Use tab separator since it's a TSV file
df = pd.read_csv(file_path, sep="\t", low_memory=False)

# Display shape and first few rows
print("Dataset shape:", df.shape)
df.head()

Dataset shape: (28062, 18)


Unnamed: 0,Timestamp,How old are you?,What industry do you work in?,Job title,"If your job title needs additional context, please clarify here:","What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)","How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",Please indicate the currency,"If ""Other,"" please indicate the currency here:","If your income needs additional context, please provide it here:",What country do you work in?,"If you're in the U.S., what state do you work in?",What city do you work in?,How many years of professional work experience do you have overall?,How many years of professional work experience do you have in your field?,What is your highest level of education completed?,What is your gender?,What is your race? (Choose all that apply.)
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


## Step 2: Data Cleaning


In [7]:
# Step 2: Data Cleaning

# 1. Check for missing values
missing_values = df.isnull().sum().sort_values(ascending=False)
#print("Missing values per column:\n", missing_values.head(20))

# 2. Drop completely empty rows
df.dropna(how='all', inplace=True)

# 3. Drop duplicates (if any)
initial_shape = df.shape
df.drop_duplicates(inplace=True)
print(f"Dropped {initial_shape[0] - df.shape[0]} duplicate rows.")

# 4. Standardize column names (lowercase, underscores, remove special chars)
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(' ', '_')
    .str.replace(r'[^a-z0-9_]', '', regex=True)
)

# 5. Rename long columns to concise names
df.rename(columns={
    'timestamp': 'timestamp',
    'how_old_are_you': 'age',
    'what_industry_do_you_work_in': 'industry',
    'job_title': 'job_title',
    'if_your_job_title_needs_additional_context_please_clarify_here': 'job_context',
    'what_is_your_annual_salary_youll_indicate_the_currency_in_a_later_question_if_you_are_parttime_or_hourly_please_enter_an_annualized_equivalent__what_you_would_earn_if_you_worked_the_job_40_hours_a_week_52_weeks_a_year': 'annual_salary',
    'how_much_additional_monetary_compensation_do_you_get_if_any_for_example_bonuses_or_overtime_in_an_average_year_please_only_include_monetary_compensation_here_not_the_value_of_benefits': 'bonus',
    'please_indicate_the_currency': 'currency',
    'if_other_please_indicate_the_currency_here': 'currency_other',
    'if_your_income_needs_additional_context_please_provide_it_here': 'income_context',
    'what_country_do_you_work_in': 'country',
    'if_youre_in_the_us_what_state_do_you_work_in': 'state',
    'what_city_do_you_work_in': 'city',
    'how_many_years_of_professional_work_experience_do_you_have_overall': 'experience_overall',
    'how_many_years_of_professional_work_experience_do_you_have_in_your_field': 'experience_field',
    'what_is_your_highest_level_of_education_completed': 'education',
    'what_is_your_gender': 'gender',
    'what_is_your_race_choose_all_that_apply': 'race'
}, inplace=True)

# 6. Clean salary-related columns
for col in ['annual_salary', 'bonus']:
    if col in df.columns:
        df[col] = (
            df[col]
            .astype(str)
            .str.replace(',', '', regex=False)
            .str.replace('$', '', regex=False)
            .replace('', np.nan)
            .astype(float)
        )

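# 7. Handle invalid or unrealistic salaries
# Note: values are still in their original currencies at this point.
# .copy() avoids SettingWithCopyWarning when columns are modified later.
df = df[(df['annual_salary'] > 1000) & (df['annual_salary'] < 1_000_000)].copy()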

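# 8. Standardize text fields (country, state, gender, race)
# Caveats: astype(str) turns missing values into the string 'nan' ('Nan' after
# title-casing), and .str.title() rewrites 'US' / 'USA' as 'Us' / 'Usa'.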
text_cols = ['country', 'state', 'gender', 'race']
for col in text_cols:
    if col in df.columns:
        df[col] = df[col].astype(str).str.strip().str.title()

# 9. Confirm cleaning results
print("\nAfter cleaning:")
print(df.info())
print(df.describe(include='all').T)





Dropped 0 duplicate rows.

After cleaning:
<class 'pandas.core.frame.DataFrame'>
Index: 27875 entries, 0 to 28060
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   timestamp           27875 non-null  object 
 1   age                 27875 non-null  object 
 2   industry            27806 non-null  object 
 3   job_title           27875 non-null  object 
 4   job_context         7213 non-null   object 
 5   annual_salary       27875 non-null  float64
 6   bonus               20648 non-null  float64
 7   currency            27875 non-null  object 
 8   currency_other      164 non-null    object 
 9   income_context      3002 non-null   object 
 10  country             27875 non-null  object 
 11  state               27875 non-null  object 
 12  city                27795 non-null  object 
 13  experience_overall  27875 non-null  object 
 14  experience_field    27875 non-null  object 
 15  education      
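
One thing the cleaning cell does not yet handle is currency: the sample rows above include both USD and GBP salaries, so the salary thresholds and any later medians compare raw numbers across currencies. A minimal sketch of restricting to a single currency before computing statistics. The rows below are toy values loosely modelled on the sample rows, and converting other currencies instead of dropping them is an equally valid choice that is not shown here:

```python
import pandas as pd

# Toy rows for illustration only; column names follow the renamed DataFrame
toy = pd.DataFrame({
    "annual_salary": [55000.0, 54600.0, 34000.0, 62000.0],
    "currency": ["USD", "GBP", "usd ", "USD"],
})

# Normalise the currency code (trim whitespace, uppercase), then keep one currency
currency_code = toy["currency"].str.strip().str.upper()
usd_only = toy[currency_code == "USD"].copy()

print(f"Kept {len(usd_only)} of {len(toy)} rows")
print("Median USD salary:", usd_only["annual_salary"].median())
```

On the real data the same mask could be built from `df['currency']` before answering the US-only questions.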

## Step 3: Business Questions Analysis

Now answer those important business questions!


In [8]:
# Question 1: What is the median salary for Software Engineers in the United States?

# Filter to U.S. respondents only
us_df = df[df['country'].str.contains('United States', case=False, na=False)]

# Filter to Software Engineer roles
software_df = us_df[us_df['job_title'].str.contains('software engineer', case=False, na=False)]

# Calculate median salary
median_salary = software_df['annual_salary'].median()

print(f"💼 Median salary for Software Engineers in the United States: ${median_salary:,.2f}")
print(f"Number of respondents considered: {len(software_df)}")

💼 Median salary for Software Engineers in the United States: $142,000.00
Number of respondents considered: 395
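
The filter above keeps only rows whose country contains the text "United States", yet the sample rows show respondents entering `US` and `USA` (which step 8 title-cases to `Us` and `Usa`), so those rows are dropped. A minimal sketch of one way to fold such variants into a single label before filtering. The alias set and the helper name `normalize_country` are assumptions for illustration, not an exhaustive list of what appears in the survey:

```python
import re

# Hypothetical alias set: extend it after inspecting df['country'].value_counts()
US_ALIASES = {"us", "usa", "united states", "united states of america", "america"}

def normalize_country(raw):
    """Return 'United States' for common US spellings, else the tidied input."""
    if not isinstance(raw, str):
        return raw  # leave NaN / None untouched
    # Lowercase, drop periods (U.S.A. -> usa), collapse repeated whitespace
    key = re.sub(r"\s+", " ", raw.lower().replace(".", "")).strip()
    return "United States" if key in US_ALIASES else raw.strip()

for value in ["US", "Usa", "U.S.", " united states ", "United Kingdom"]:
    print(repr(value), "->", normalize_country(value))
```

Applied with `df['country'] = df['country'].map(normalize_country)`, an exact comparison such as `df['country'] == 'United States'` could then replace the substring match. The respondent count and the median would likely change once the extra rows are included.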


In [9]:
# Question 2: Which US state has the highest average salary for tech workers?

# Filter U.S. respondents
us_df = df[df['country'].str.contains('United States', case=False, na=False)]

# Define what counts as a "tech" job (you can expand these keywords)
tech_keywords = [
    'software', 'developer', 'engineer', 'data', 'it', 'systems', 'programmer',
    'tech', 'technology', 'qa', 'security', 'network', 'web', 'machine learning', 'ai'
]

# Keep rows where job title matches any tech keyword
tech_df = us_df[us_df['job_title'].str.contains('|'.join(tech_keywords), case=False, na=False)]

# Group by state and calculate average salary
state_avg_salary = (
    tech_df.groupby('state')['annual_salary']
    .mean()
    .sort_values(ascending=False)
    .reset_index()
)

# Display the top state
top_state = state_avg_salary.iloc[0]
print(f"🌎 The US state with the highest average tech salary is {top_state['state']} "
      f"with an average salary of ${top_state['annual_salary']:,.2f}.")




🌎 The US state with the highest average tech salary is California, Colorado with an average salary of $176,000.00.
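
Because `str.contains` does substring matching, short keywords such as `it` and `ai` also hit titles like "Editor" or "Personal Trainer". A sketch of a stricter matcher that anchors each keyword at word boundaries. The keyword list here is a trimmed, hypothetical one, and which titles count as "tech" remains a judgment call:

```python
import re

# Hypothetical, trimmed keyword list; \b anchors each term at word boundaries
TECH_TERMS = ["software", "developer", "data", "it", "qa", "ai", "machine learning"]
TECH_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in TECH_TERMS) + r")\b",
    flags=re.IGNORECASE,
)

def is_tech_title(title):
    """True if the job title contains any tech term as a whole word."""
    return isinstance(title, str) and bool(TECH_PATTERN.search(title))

for title in ["IT Manager", "Senior Software Engineer", "Editor", "Personal Trainer"]:
    print(f"{title!r}: {is_tech_title(title)}")
```

With pandas, the same pattern can be passed as `us_df['job_title'].str.contains(TECH_PATTERN.pattern, case=False, na=False)`. Whole-word matching also skips plurals such as "Developers", so those need their own entries. Separately, the top result "California, Colorado" is a multi-state answer, so it is worth checking how many respondents sit behind each state before ranking.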


In [12]:
# Question 3: How much does salary increase on average for each year of experience in tech?

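# Clean experience column: keep the first number in each bracket
# (e.g. "5-7 years" -> 5.0), i.e. the bracket's lower bound
tech_df = tech_df.copy()  # work on a copy to avoid SettingWithCopyWarning
tech_df['experience_field'] = (
    tech_df['experience_field']
    .astype(str)
    .str.extract(r'(\d+)', expand=False)
    .astype(float)
)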

# Drop missing values
tech_df = tech_df.dropna(subset=['annual_salary', 'experience_field'])

# Group by years of experience
avg_salary_by_exp = (
    tech_df.groupby('experience_field')['annual_salary']
    .mean()
    .reset_index()
    .sort_values('experience_field')
)

print(avg_salary_by_exp.head(10))  # show first 10 experience levels

# Estimate the average increase between experience brackets
# (difference between consecutive brackets, which span several years each)
avg_salary_by_exp['yearly_increase'] = avg_salary_by_exp['annual_salary'].diff()

# Take the mean of those bracket-to-bracket increases
average_increase = avg_salary_by_exp['yearly_increase'].mean()

print(f"💵 Average salary increase per year of experience: ${average_increase:,.2f}")


   experience_field  annual_salary
0               1.0   68600.458824
1               2.0   82851.441860
2               5.0   96418.526542
3               8.0  114413.917695
4              11.0  116923.304860
5              21.0  133558.279070
6              31.0  134152.076923
7              41.0  118400.000000
💵 Average salary increase per year of experience: $7,114.22
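
The figure above is the mean difference between consecutive experience brackets, and those brackets are up to ten years wide, so it is a per-bracket step rather than a per-year rate. One way to get a per-year estimate is a least-squares slope over the bracket averages. The sketch below reuses the eight averages printed above and, as an assumption, treats each bracket's lower bound as its position in years and weights every bracket equally:

```python
# Bracket lower bounds (years) and average salaries copied from the output above
years = [1, 2, 5, 8, 11, 21, 31, 41]
avg_salary = [68600.458824, 82851.441860, 96418.526542, 114413.917695,
              116923.304860, 133558.279070, 134152.076923, 118400.000000]

def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# What the notebook cell computes: the mean of consecutive differences,
# which telescopes to (last - first) / (number of brackets - 1)
mean_bracket_step = (avg_salary[-1] - avg_salary[0]) / (len(avg_salary) - 1)

# A per-year alternative: dollars gained per additional year of experience
per_year_slope = least_squares_slope(years, avg_salary)

print(f"Mean step between brackets: ${mean_bracket_step:,.2f}")
print(f"Least-squares slope per year: ${per_year_slope:,.2f}")
```

Fitting on individual respondents rather than on eight bracket averages (for example with `numpy.polyfit`) would weight each bracket by how many people are in it. The survey records experience only as ranges, so any per-year number remains an approximation.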


In [13]:

# Question 4: What percentage of respondents work remotely vs. in-office?

col_city = 'city'  # make sure the column name matches your cleaned DataFrame

# 1. Clean the city column
df[col_city] = df[col_city].astype(str).str.lower().str.strip()

# 2. Categorize based on city name
def categorize_work_mode(city):
    if any(term in city for term in ['remote', 'home', 'telecommute', 'wfh', 'anywhere']):
        return 'Remote'
    elif city in ['nan', '', 'none']:
        return 'Unknown'
    else:
        return 'In-Office'

df['work_mode'] = df[col_city].apply(categorize_work_mode)

# 3. Calculate percentages
work_mode_counts = df['work_mode'].value_counts(normalize=True) * 100

# 4. Display results
print("🌍 Work Arrangement Breakdown (inferred from city data):")
for mode, pct in work_mode_counts.items():
    print(f"{mode}: {pct:.1f}%")





🌍 Work Arrangement Breakdown (inferred from city data):
In-Office: 98.3%
Remote: 1.4%
Unknown: 0.3%


In [15]:
# Question 5: Which industry (besides tech) has the highest median salary?

# 1. Clean the industry column
df['industry'] = (
    df['industry']
    .astype(str)
    .str.strip()
    .str.lower()
)

# 2. Filter out rows with missing or invalid salary
df = df.dropna(subset=['annual_salary'])
df = df[df['annual_salary'] > 1000]

# 3. Compute median salary by industry
industry_salary = (
    df.groupby('industry')['annual_salary']
    .median()
    .sort_values(ascending=False)
)

# 4. Identify the top industries
top_industries = industry_salary.head(10)
print("🏆 Top 10 industries by median salary:")
print(top_industries)

# 5. Exclude 'tech' or similar terms
# (\bit\b matches "IT" as a whole word instead of any industry containing "it")
non_tech_industries = industry_salary[
    ~industry_salary.index.str.contains(r'tech|software|\bit\b|information', case=False)
]
top_non_tech = non_tech_industries.head(1)

print("\n💼 Highest median salary (non-tech industry):")
print(top_non_tech)



🏆 Top 10 industries by median salary:
industry
interpretation                                                    905000.0
oil                                                               689000.0
biotech (r&d)                                                     556200.0
commercial building material distribution                         400000.0
sports                                                            300000.0
corporate training                                                280000.0
geologist                                                         260000.0
energy (oil & gas & associated products, renewable power, etc)    253300.0
multilateral organisation                                         250000.0
pharmaceutical/biotechnology                                      227500.0
Name: annual_salary, dtype: float64

💼 Highest median salary (non-tech industry):
industry
interpretation    905000.0
Name: annual_salary, dtype: float64
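
The industry column appears to mix broad categories with free-text answers, so a label such as "interpretation" may represent only one or two respondents, and a median over so few rows says little about an industry. A sketch of ranking medians only among industries with enough responses. The threshold `MIN_RESPONDENTS` and the toy rows are made up for illustration and are not survey values:

```python
import pandas as pd

# Toy data for illustration only (not taken from the survey)
toy = pd.DataFrame({
    "industry": ["law", "law", "law", "nonprofits", "nonprofits", "nonprofits",
                 "interpretation"],
    "annual_salary": [120000, 95000, 150000, 52000, 61000, 58000, 905000],
})

MIN_RESPONDENTS = 3  # hypothetical threshold; pick one suited to the real data

# Median and group size per industry in one pass
summary = toy.groupby("industry")["annual_salary"].agg(["median", "count"])

# Keep only industries with enough respondents, then rank by median
reliable = (
    summary[summary["count"] >= MIN_RESPONDENTS]
    .sort_values("median", ascending=False)
)
print(reliable)
```

On the real DataFrame, the same `agg(["median", "count"])` call could replace `.median()`, with the tech exclusion applied afterwards.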


In [None]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $142,000.00
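2. **Highest paying US state for tech:** California, Colorado (a multi-state response), at an average of $176,000.00
3. **Salary increase per year of experience:** $7,114.22 (mean step between consecutive experience brackets)
4. **Remote vs office percentage:** 1.4% remote, 98.3% in-office, 0.3% unknown (inferred from the city field)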
5. **Highest paying non-tech industry:** interpretation

**Key insights:**
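- Median salary varies widely across industries. The top non-tech result, “interpretation” ($905,000 median), looks like a free-text answer rather than a broad industry category, so it may rest on very few responses and should be read with caution.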

- Certain industries, such as biotech and energy, show high earning potential even outside the typical “tech” sector.

- There is a wide disparity in compensation for similar roles depending on location, experience, and industry sector.

**Challenges faced:**
- Cleaning inconsistent industry names. Solution: Used `.str.strip().str.lower()` to standardize the names.
- Handling missing or invalid salary entries. Solution: Dropped rows with missing salaries and kept only values between 1,000 and 1,000,000.

**What you learned about vibe coding:**
- Let the AI generate its own follow-up prompts.
- Give the AI detailed, specific instructions for better results.
- Provide your errors or issues directly to the AI so it can help debug them.
