# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.
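One way to read the 5% criterion is as a relative-error check against the expected value. The sketch below assumes that reading; `within_tolerance` is a hypothetical helper and the numbers are made up for illustration, not official results.

```python
def within_tolerance(actual: float, expected: float, tolerance: float = 0.05) -> bool:
    """Return True when `actual` is within `tolerance` (relative) of `expected`."""
    if expected == 0:
        # Relative error is undefined for zero; fall back to an exact match.
        return actual == 0
    return abs(actual - expected) / abs(expected) <= tolerance


# Illustrative values only.
print(within_tolerance(104_000, 100_000))  # 4% off -> True
print(within_tolerance(94_000, 100_000))   # 6% off -> False
```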


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

- [ ] Load and explore the salary survey dataset to understand structure and data quality issues
- [ ] Clean salary data - handle different currencies, remove non-numeric values, standardize format
- [ ] Clean location data - standardize country/state formats, handle variations in US entries
- [ ] Clean and categorize job titles - identify software engineers and tech workers
- [ ] Clean demographic data - standardize gender, education, experience fields
- [ ] Handle missing values and outliers appropriately for each field

## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [13]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("/workspaces/ds-fall-2025-tue/Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv",delimiter='\t')

print("Data Types:")
print(df.dtypes)

print("\nMissing Values Per Column:")
print(df.isnull().sum())

print("\nDataset info:")
df.info()





Data Types:
Timestamp                                                                                                                                                                                                                                object
How old are you?                                                                                                                                                                                                                         object
What industry do you work in?                                                                                                                                                                                                            object
Job title                                                                                                                                                                                                                                object
If your job title needs addi
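The survey's column headers are the full question text, which makes every selection verbose. One possible convenience is to rename the columns once, right after loading. This is a sketch: the short names are hypothetical, and only headers that appear elsewhere in this notebook are listed.

```python
import pandas as pd

# Hypothetical short names; the keys are survey headers as they appear in this notebook.
SHORT_NAMES = {
    "What industry do you work in?": "industry",
    "Job title": "job_title",
    "Please indicate the currency": "currency",
    "What country do you work in?": "country",
    "If you're in the U.S., what state do you work in?": "state",
}


def shorten_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with known long headers renamed; other columns are left alone."""
    return frame.rename(columns=SHORT_NAMES)


# Tiny made-up frame to show the effect.
demo = pd.DataFrame({"Job title": ["Analyst"], "Please indicate the currency": ["USD"]})
print(shorten_columns(demo).columns.tolist())  # ['job_title', 'currency']
```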

In [17]:
print("=== SALARY COLUMN ANALYSIS ===")
salary_col = 'What is your annual salary? (You\'ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)'
print("Unique values in salary column (first 20):")
print(df[salary_col].value_counts().head(20))

print("\n=== CURRENCY COLUMN ANALYSIS ===")
print("Unique currencies:")
print(df['Please indicate the currency'].value_counts())

print("\n=== COUNTRY COLUMN ANALYSIS ===")
print("Unique countries (first 20):")
print(df['What country do you work in?'].value_counts().head(20))

print("\n=== JOB TITLE ANALYSIS ===")
print("Unique job titles containing 'software' or 'engineer':")
software_jobs = df[df['Job title'].str.contains('software|engineer', case=False, na=False)]
print(f"Found {len(software_jobs)} potential software engineering jobs")
print(software_jobs['Job title'].value_counts().head(10))


=== SALARY COLUMN ANALYSIS ===
Unique values in salary column (first 20):
What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)
60,000     430
80,000     406
70,000     402
65,000     400
75,000     383
90,000     368
50,000     361
100,000    313
85,000     306
55,000     303
120,000    295
110,000    263
52,000     251
45,000     247
130,000    245
40,000     234
72,000     218
95,000     213
125,000    197
105,000    191
Name: count, dtype: int64

=== CURRENCY COLUMN ANALYSIS ===
Unique currencies:
Please indicate the currency
USD        23374
CAD         1673
GBP         1591
EUR          643
AUD/NZD      504
Other        160
CHF           37
SEK           37
JPY           23
ZAR           16
HKD            4
Name: count, dtype: int64

=== COUNTRY COLUMN ANALYSIS ===
Unique countries (first 20):
What country d

## Step 2: Data Cleaning


In [28]:
# Step 1: Clean Salary Data
# First, let's create a copy to work with
df_clean = df.copy()

# Define the long column names for easier reference
salary_col = 'What is your annual salary? (You\'ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)'
currency_col = 'Please indicate the currency'
country_col = 'What country do you work in?'
state_col = 'If you\'re in the U.S., what state do you work in?'
job_title_col = 'Job title'
industry_col = 'What industry do you work in?'

# Let's examine the salary data more closely
print("Sample salary values:")
print(df_clean[salary_col].head(10))
print("\nSalary data types and null values:")
print(df_clean[salary_col].dtype)
print("Null values:", df_clean[salary_col].isnull().sum())

# Let's look at some problematic salary entries
print("\nSample of non-numeric salary entries:")
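# regex=False treats '$' and '.' as literal characters on every pandas version
digits_only = (
    df_clean[salary_col].astype(str)
    .str.replace(',', '', regex=False)
    .str.replace('$', '', regex=False)
    .str.replace('.', '', regex=False)
)
non_numeric_salaries = df_clean[~digits_only.str.isdigit()]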
print(non_numeric_salaries[salary_col].value_counts().head(10))


Sample salary values:
0     55,000
1     54,600
2     34,000
3     62,000
4     60,000
5     62,000
6     33,000
7     50,000
8    112,000
9     45,000
Name: What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.), dtype: object

Salary data types and null values:
object
Null values: 0

Sample of non-numeric salary entries:
Series([], Name: count, dtype: int64)
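An alternative way to look for unparseable salary entries is to let `pd.to_numeric` attempt the conversion and inspect whatever it could not parse. The sketch below uses a made-up sample rather than the survey data; `find_non_numeric` is a hypothetical helper.

```python
import pandas as pd


def find_non_numeric(values: pd.Series) -> pd.Series:
    """Return the raw entries that do not parse as numbers once ',' and '$' are removed."""
    stripped = (
        values.astype(str)
        .str.replace(",", "", regex=False)
        .str.replace("$", "", regex=False)
        .str.strip()
    )
    parsed = pd.to_numeric(stripped, errors="coerce")  # unparseable entries become NaN
    return values[parsed.isna()]


# Made-up sample mixing clean and messy entries.
sample = pd.Series(["55,000", "$62,000", "45k", "n/a", "112000.50"])
print(find_non_numeric(sample).tolist())  # ['45k', 'n/a']
```

An empty result, as in the output above, suggests the column holds only digits and separators, so the harder problems are more likely to be scale and currency than formatting.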


In [29]:
def clean_salary(salary_str):
    if pd.isna(salary_str):
        return np.nan
    
    # Convert to string and clean
    salary_str = str(salary_str).strip()
    
    # Remove common non-numeric characters except decimal points
    cleaned = salary_str.replace(',', '').replace('$', '').replace('€', '').replace('£', '').replace('¥', '')
    
    # Handle ranges (take midpoint)
    if '-' in cleaned and 'k' not in cleaned.lower():
        parts = cleaned.split('-')
        if len(parts) == 2:
            try:
                low = float(parts[0])
                high = float(parts[1])
                return (low + high) / 2
            except ValueError:
                pass  # not a numeric range; fall through to the checks below
    
    # Handle 'k' notation (thousands)
    if 'k' in cleaned.lower():
        cleaned = cleaned.lower().replace('k', '')
        try:
            return float(cleaned) * 1000
        except ValueError:
            pass  # not a plain number with a 'k' suffix
    
    # Try to convert to float
    try:
        return float(cleaned)
    except ValueError:
        return np.nan

# Apply salary cleaning
df_clean['salary_cleaned'] = df_clean[salary_col].apply(clean_salary)

print("Salary cleaning results:")
print(f"Original salary entries: {len(df_clean)}")
print(f"Successfully cleaned salaries: {df_clean['salary_cleaned'].notna().sum()}")
print(f"Failed to clean: {df_clean['salary_cleaned'].isna().sum()}")
print(f"\nSalary statistics after cleaning:")
print(df_clean['salary_cleaned'].describe())


Salary cleaning results:
Original salary entries: 28062
Successfully cleaned salaries: 28062
Failed to clean: 0

Salary statistics after cleaning:
count    2.806200e+04
mean     3.614244e+05
std      3.622275e+07
min      0.000000e+00
25%      5.400000e+04
50%      7.500000e+04
75%      1.100000e+05
max      6.000070e+09
Name: salary_cleaned, dtype: float64
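The statistics above show a mean (about 361,000) far above the median (75,000) and a maximum near 6 billion, which points to a few extreme entries. Fixed cutoffs are one way to handle them. Another option, sketched here on made-up numbers, is to derive bounds from the interquartile range. `iqr_bounds` and the multiplier `k=3.0` are assumptions for illustration, not a recommendation from the assignment.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 3.0) -> tuple[float, float]:
    """Return (low, high) fences placed k interquartile ranges outside Q1 and Q3."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


# Made-up salaries with one absurd entry.
salaries = pd.Series([54_000, 60_000, 75_000, 90_000, 110_000, 6_000_070_000])
low, high = iqr_bounds(salaries)
kept = salaries[salaries.between(low, high)]
print(len(kept), kept.max())  # 5 110000
```

Quantile-based fences adapt to the data, but on a right-skewed salary distribution they can also discard legitimate high earners, so the choice of `k` is a judgment call.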


In [30]:
# For now, let's focus on USD salaries only to simplify our analysis
print("Currency distribution:")
print(df_clean[currency_col].value_counts())

# Simplified approach: keep only rows reported in USD rather than converting other currencies.
# In a real scenario, you'd want to convert the remaining rows using exchange rates.
usd_mask = df_clean[currency_col] == 'USD'
print(f"\nUSD entries: {usd_mask.sum()}")
print(f"Non-USD entries: {(~usd_mask).sum()}")

# For this exercise, let's focus on USD salaries only
df_usd = df_clean[usd_mask].copy()
print(f"\nWorking with {len(df_usd)} USD salary entries")

# Remove extreme outliers (salaries < 10k or > 1M)
df_usd = df_usd[(df_usd['salary_cleaned'] >= 10000) & (df_usd['salary_cleaned'] <= 1000000)]
print(f"After removing outliers: {len(df_usd)} entries")


Currency distribution:
Please indicate the currency
USD        23374
CAD         1673
GBP         1591
EUR          643
AUD/NZD      504
Other        160
CHF           37
SEK           37
JPY           23
ZAR           16
HKD            4
Name: count, dtype: int64

USD entries: 23374
Non-USD entries: 4688

Working with 23374 USD salary entries
After removing outliers: 23257 entries
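Filtering to USD drops 4,688 responses. If those rows were needed, a conversion step could look roughly like the sketch below. The rates are placeholders invented for illustration (chosen to be easy to check by hand) and must be replaced with real exchange rates; `to_usd` is a hypothetical helper.

```python
import pandas as pd

# PLACEHOLDER rates in USD per one unit of currency. These values are invented
# for illustration only and are NOT real exchange rates.
PLACEHOLDER_USD_RATES = {"USD": 1.0, "CAD": 0.75, "GBP": 1.5}


def to_usd(frame: pd.DataFrame, amount_col: str, currency_col: str, rates: dict) -> pd.Series:
    """Convert amounts to USD; currencies missing from `rates` become NaN."""
    factors = frame[currency_col].map(rates)
    return frame[amount_col] * factors


# Made-up rows, including a currency label with no known rate.
demo = pd.DataFrame({
    "salary": [100_000.0, 50_000.0, 40_000.0, 30_000.0],
    "currency": ["USD", "CAD", "GBP", "Other"],
})
demo["salary_usd"] = to_usd(demo, "salary", "currency", PLACEHOLDER_USD_RATES)
print(demo)
```

Rows labelled `Other` or `AUD/NZD` have no single rate, so leaving them as NaN keeps the ambiguity visible instead of guessing.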


In [31]:
print("Country variations:")
print(df_usd[country_col].value_counts().head(10))

# Standardize US country entries
def standardize_country(country):
    if pd.isna(country):
        return np.nan
    
    country_lower = str(country).lower().strip()
    # Caveat: these are substring checks, so 'us' also matches names such as 'Australia' or 'Russia'
    if any(us_term in country_lower for us_term in ['united states', 'usa', 'us', 'america']):
        return 'United States'
    return country

df_usd['country_cleaned'] = df_usd[country_col].apply(standardize_country)

# Filter to US only
df_us = df_usd[df_usd['country_cleaned'] == 'United States'].copy()
print(f"\nUS entries: {len(df_us)}")

# Clean state data
print("\nState variations (first 10):")
print(df_us[state_col].value_counts().head(10))


Country variations:
What country do you work in?
United States               8960
USA                         7888
US                          2595
United States                666
U.S.                         577
USA                          464
Usa                          443
United States of America     425
United states                207
usa                          179
Name: count, dtype: int64

US entries: 22298

State variations (first 10):
If you're in the U.S., what state do you work in?
California              2491
New York                2099
Massachusetts           1453
Texas                   1211
Illinois                1166
Washington              1149
District of Columbia     945
Pennsylvania             914
Virginia                 756
Minnesota                690
Name: count, dtype: int64
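Because `'us'` is matched as a substring in `standardize_country`, a stricter variant could compare the whole normalized string against a set of known spellings. This is a sketch of that idea only: the alias set is an assumption, it is not exhaustive, and misspelled entries would still need separate handling.

```python
import re

# Assumed spellings of the United States after lowercasing and removing punctuation.
US_ALIASES = {"us", "usa", "united states", "united states of america", "america"}


def normalize_country(raw):
    """Map known US spellings to 'United States'; return other values stripped."""
    if raw is None or raw != raw:  # `raw != raw` is True only for NaN
        return None
    text = re.sub(r"[^a-z ]", "", str(raw).lower())  # drops dots, e.g. "U.S." -> "us"
    text = re.sub(r"\s+", " ", text).strip()
    return "United States" if text in US_ALIASES else str(raw).strip()


for value in ["U.S.", " usa ", "United States of America", "Australia", "Russia"]:
    print(repr(value), "->", normalize_country(value))
```

Exact matching trades recall for precision: it no longer sweeps in "Australia", but it also misses free-text variants that are not in the alias set.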


In [32]:
print("Industry distribution:")
print(df_us[industry_col].value_counts())

# Identify software engineers and tech workers
def is_software_engineer(job_title):
    if pd.isna(job_title):
        return False
    
    job_lower = str(job_title).lower()
    # Longer variants such as 'senior software engineer' already contain these substrings
    software_keywords = ['software engineer', 'software developer', 'sde']
    return any(keyword in job_lower for keyword in software_keywords)

def is_tech_worker(job_title, industry):
    if pd.isna(job_title):
        return False
    
    # Check industry first
    if not pd.isna(industry):
        industry_lower = str(industry).lower()
        if 'tech' in industry_lower or 'computing' in industry_lower:
            return True
    
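    # Check job title (caveat: substring matching, so short keywords such as 'it' also match inside words like 'auditor')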
    job_lower = str(job_title).lower()
    tech_keywords = [
        'software', 'engineer', 'developer', 'programmer', 'data scientist', 
        'data analyst', 'data engineer', 'product manager', 'technical',
        'devops', 'sre', 'sdet', 'qa engineer', 'systems analyst', 'it'
    ]
    return any(keyword in job_lower for keyword in tech_keywords)

# Apply classifications
df_us['is_software_engineer'] = df_us[job_title_col].apply(is_software_engineer)
df_us['is_tech_worker'] = df_us.apply(lambda row: is_tech_worker(row[job_title_col], row[industry_col]), axis=1)

print(f"\nSoftware engineers found: {df_us['is_software_engineer'].sum()}")
print(f"Tech workers found: {df_us['is_tech_worker'].sum()}")

# Show some examples
print("\nSample software engineer titles:")
software_engineers = df_us[df_us['is_software_engineer']]
print(software_engineers[job_title_col].value_counts().head(10))


Industry distribution:
What industry do you work in?
Computing or Tech                            3583
Nonprofits                                   2043
Education (Higher Education)                 2006
Health care                                  1572
Accounting, Banking & Finance                1424
                                             ... 
Aerospace/Defense                               1
Public Library (technically City Govt.?)        1
consumer products                               1
Small business/service company                  1
Wherever I'm assigned via the union             1
Name: count, Length: 996, dtype: int64

Software engineers found: 881
Tech workers found: 6922

Sample software engineer titles:
Job title
Software Engineer              222
Senior Software Engineer       161
Software Developer              52
Software engineer               36
Staff Software Engineer         31
Software Engineer II            29
Principal Software Engineer     25
Senior softwa
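Short keywords such as `'it'` match inside unrelated words when tested with `in`, which can inflate the tech-worker count. One possible tightening is to match keywords on word boundaries with a regular expression. In the sketch below, the keyword list is a shortened, illustrative subset and `matches_tech_keyword` is a hypothetical helper.

```python
import re

# Illustrative subset of keywords; not the full list used in the cell above.
TECH_KEYWORDS = ["software", "developer", "devops", "it", "sre", "qa engineer"]

# \b anchors each keyword to word boundaries so "it" does not match inside "Editor".
TECH_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in TECH_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def matches_tech_keyword(title) -> bool:
    """Return True when the title contains a keyword as a whole word or phrase."""
    if not isinstance(title, str):
        return False
    return TECH_PATTERN.search(title) is not None


for title in ["IT Manager", "Editor", "Senior Software Engineer", "Recruiter"]:
    print(title, "->", matches_tech_keyword(title))
```

Word boundaries also stop plurals from matching (for example "Web Developers" would not match `developer`), so a real keyword list would need plural forms or an optional `s`.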

## Step 3: Business Questions Analysis

Now answer those important business questions!


In [33]:
# Question 1: What is the median salary for Software Engineers in the United States?
software_engineers_us = df_us[df_us['is_software_engineer'] & df_us['salary_cleaned'].notna()]
print(f"Median salary for Software Engineers in the United States: ${software_engineers_us['salary_cleaned'].median():,.0f}")

Median salary for Software Engineers in the United States: $141,750


In [36]:
# Question 2: Which US state has the highest average salary for tech workers?
tech_workers_us = df_us[df_us['is_tech_worker'] & df_us['salary_cleaned'].notna() & df_us[state_col].notna()]
state_salaries = tech_workers_us.groupby(state_col)['salary_cleaned'].agg(['mean', 'median', 'count']).reset_index()
state_salaries = state_salaries[state_salaries['count'] >= 5]  # Only states with at least 5 tech workers
state_salaries = state_salaries.sort_values('mean', ascending=False)
highest_state = state_salaries.iloc[0]
print(f"\n{highest_state[state_col]} has the highest average salary for tech workers: ${highest_state['mean']:,.0f}")



California has the highest average salary for tech workers: $140,501


In [40]:
# Question 3: How much does salary increase on average for each year of experience in tech?
experience_col = 'How many years of professional work experience do you have overall?'

# Clean experience data - convert ranges to midpoints
def clean_experience(exp_str):
    if pd.isna(exp_str):
        return np.nan
    
    exp_str = str(exp_str).strip()
    
    # Handle ranges like "5-7 years" or "8 - 10 years"
    if '-' in exp_str:
        parts = exp_str.split('-')
        if len(parts) == 2:
            try:
                low = float(parts[0].strip().split()[0])  # Take first number
                high = float(parts[1].strip().split()[0])  # Take first number
                return (low + high) / 2
            except (ValueError, IndexError):
                pass  # not a numeric range; try the single-number path
    
    # Handle single numbers
    try:
        return float(exp_str.split()[0])
    except (ValueError, IndexError):
        return np.nan

# Apply experience cleaning
df_us['experience_cleaned'] = df_us[experience_col].apply(clean_experience)

# Filter to tech workers with both salary and experience data
tech_with_exp = df_us[(df_us['is_tech_worker'] & 
                      df_us['salary_cleaned'].notna() & 
                      df_us['experience_cleaned'].notna())]


# Calculate correlation and regression
from scipy import stats

# Simple linear regression: salary vs experience
slope, intercept, r_value, p_value, std_err = stats.linregress(
    tech_with_exp['experience_cleaned'], 
    tech_with_exp['salary_cleaned']
)
print(f"\nSalary increases by approximately ${slope:,.0f} per year of experience in tech")




Salary increases by approximately $1,422 per year of experience in tech
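The per-year figure is the slope of a straight line fitted to salary against experience, where each experience bucket has first been replaced by a single number. The sketch below shows both steps on toy data built to have a known slope. `experience_midpoint` is a hypothetical regex-based variant of the parser, and the open-ended labels are assumed examples.

```python
import re

import numpy as np


def experience_midpoint(label):
    """Midpoint of a bucket such as '5-7 years'; a lone number is returned as-is."""
    if not isinstance(label, str):
        return float("nan")
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", label)]
    if not numbers:
        return float("nan")
    bounds = numbers[:2]  # one number for open-ended buckets, two for ranges
    return sum(bounds) / len(bounds)


# Toy data constructed so that salary = 50,000 + 2,000 * years exactly.
buckets = ["1 year or less", "5-7 years", "8 - 10 years"]
years = np.array([experience_midpoint(b) for b in buckets])
salary = 50_000 + 2_000 * years

slope, intercept = np.polyfit(years, salary, deg=1)  # least-squares straight line
print(f"slope ~ {slope:,.0f} per year, intercept ~ {intercept:,.0f}")
```

Open-ended buckets collapse to their single number here, and midpoints are only approximations of each respondent's experience, so the fitted slope is a rough estimate rather than a precise yearly raise.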


In [None]:
# Question 4 (detailed view): top non-tech industries by median salary


Number of non-tech workers with salary and industry data: 15332

Top 10 industries by median salary (non-tech):
      What industry do you work in?    median  count
493                          Pharma  127500.0     14
510                 Pharmaceuticals  120000.0     15
498                  Pharmaceutical  100000.0     13
379                             Law   96000.0    871
105          Business or Consulting   93600.0    612
725  Utilities & Telecommunications   83385.0    205
219    Engineering or Manufacturing   81473.0    591
15    Accounting, Banking & Finance   80000.0   1237
430     Marketing, Advertising & PR   80000.0    708
431                 Media & Digital   80000.0    305

ANSWER: Pharma has the highest median salary among non-tech industries: $127,500
Based on 14 workers in that industry


In [None]:
# Question 4: Which industry (besides tech) has the highest median salary?
non_tech_workers = df_us[(~df_us['is_tech_worker']) & df_us['salary_cleaned'].notna() & df_us[industry_col].notna()]

# Calculate median salary by industry for non-tech workers
industry_salaries = non_tech_workers.groupby(industry_col)['salary_cleaned'].agg(['median', 'mean', 'count']).reset_index()
industry_salaries = industry_salaries[industry_salaries['count'] >= 10]  # Only industries with at least 10 workers
industry_salaries = industry_salaries.sort_values('median', ascending=False)

highest_industry = industry_salaries.iloc[0]
print(f"\n{highest_industry[industry_col]} has the highest median salary among non-tech industries: ${highest_industry['median']:,.0f}")


Pharma has the highest median salary among non-tech industries: $127,500
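The top-10 table lists "Pharma", "Pharmaceuticals" and "Pharmaceutical" as three separate industries with 13 to 15 respondents each, so the winning category depends on free-text spelling. One possible refinement is to merge such labels before grouping. The sketch below uses made-up salaries; the alias map is hypothetical and covers only the three spellings visible in that table.

```python
import pandas as pd

# Hypothetical alias map, keyed by lowercased label.
INDUSTRY_ALIASES = {
    "pharma": "Pharmaceuticals",
    "pharmaceutical": "Pharmaceuticals",
    "pharmaceuticals": "Pharmaceuticals",
}


def normalize_industry(label):
    """Collapse whitespace and map known spelling variants to one label."""
    if not isinstance(label, str):
        return label
    cleaned = " ".join(label.split())
    return INDUSTRY_ALIASES.get(cleaned.lower(), cleaned)


# Made-up rows for illustration; not survey values.
demo = pd.DataFrame({
    "industry": ["Pharma", "Pharmaceuticals", " pharmaceutical", "Law", "Law"],
    "salary": [120_000, 110_000, 100_000, 96_000, 90_000],
})
demo["industry_norm"] = demo["industry"].map(normalize_industry)
summary = demo.groupby("industry_norm")["salary"].agg(["median", "count"])
print(summary)
```

Merging variants raises the group's respondent count, which also makes a minimum-count filter behave more sensibly.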


In [6]:
# Bonus Questions:
# Question 5: What's the salary gap between men and women in tech roles?
# Question 6: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 7: Which company size (startup, medium, large) pays the most on average?

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $141,750
2. **Highest paying US state for tech:** California ($140,501 average)
3. **Salary increase per year of experience:** $1,422 per year
4. **Remote vs office percentage:** X% remote, Y% office
5. **Highest paying non-tech industry:** Pharma (median $127,500, based on 14 respondents)

**Key insights:**
- The data is really messy.
- Geographic location of a job is important.
- Job titles are highly varied.

**Challenges faced:**
- The agent did not do exactly what I asked, because my prompts were too vague.
- The agent wrote so much that I got lost and had to read all of it thoroughly.

**What you learned about vibe coding:**
- The agent produces a lot of work, but not all of it is necessary.
- You should be really specific in your prompts.
- It is still necessary to know what is going on in the code.

In [56]:
print(f"Median salary for Software Engineers in the US: ${software_engineers_us['salary_cleaned'].median():,.0f}")
print(f"Highest paying US state for tech: {highest_state[state_col]} ")
print(f"Salary increase per year of experience: ${slope:,.0f}")
print(f"Highest paying non-tech industry: {highest_industry[industry_col]}")


Median salary for Software Engineers in the US: $141,750
Highest paying US state for tech: California 
Salary increase per year of experience: $1,422
Highest paying non-tech industry: Pharma
