# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

*(Paste your Cursor todo list here)*

- Look through Week-05-Vibe-Coding-101/homeworks/hw.ipynb and answer the questions. Make sure to read ds-fall-2025-tue\Week-02-Pandas-Part-2-and-DS-Overview\data\Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv for the data, and read it with pandas.
- Install and import anything necessary in the virtual environment I already activated in the terminal.
- You must clean the data.
- Make sure all the column names are concise, lowercase, and without spaces.
- Make sure each column has the correct type, and change it if necessary.


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [5]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)


In [6]:
# Load the data
data_path = "../../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv"
df = pd.read_csv(data_path, sep='\t')

print("Dataset shape:", df.shape)
print("\nFirst few rows:")
df.head()


Dataset shape: (28062, 18)

First few rows:


Unnamed: 0,Timestamp,How old are you?,What industry do you work in?,Job title,"If your job title needs additional context, please clarify here:","What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)","How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",Please indicate the currency,"If ""Other,"" please indicate the currency here:","If your income needs additional context, please provide it here:",What country do you work in?,"If you're in the U.S., what state do you work in?",What city do you work in?,How many years of professional work experience do you have overall?,How many years of professional work experience do you have in your field?,What is your highest level of education completed?,What is your gender?,What is your race? (Choose all that apply.)
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


In [7]:
# Explore the data structure
print("Column names:")
print(df.columns.tolist())
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
print("\nBasic info:")
df.info()


Column names:
['Timestamp', 'How old are you?', 'What industry do you work in?', 'Job title', 'If your job title needs additional context, please clarify here:', "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)", 'How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.', 'Please indicate the currency', 'If "Other," please indicate the currency here: ', 'If your income needs additional context, please provide it here:', 'What country do you work in?', "If you're in the U.S., what state do you work in?", 'What city do you work in?', 'How many years of professional work experience do you have overall?', 'How many years of professional work experience do you have in your field?',

In [8]:
# Step 1: Clean column names - make them lowercase, concise, and without spaces
print("Original column names:")
print(df.columns.tolist())

# Create a mapping for cleaner column names
column_mapping = {
    'Timestamp': 'timestamp',
    'How old are you?': 'age_group',
    'What industry do you work in?': 'industry',
    'Job title': 'job_title',
    'If your job title needs additional context, please clarify here:': 'job_context',
    'What is your annual salary? (You\'ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)': 'annual_salary',
    'How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.': 'additional_compensation',
    'Please indicate the currency': 'currency',
    'If "Other," please indicate the currency here: ': 'other_currency',
    'If your income needs additional context, please provide it here:': 'income_context',
    'What country do you work in?': 'country',
    'If you\'re in the U.S., what state do you work in?': 'state',
    'What city do you work in?': 'city',
    'How many years of professional work experience do you have overall?': 'total_experience',
    'How many years of professional work experience do you have in your field?': 'field_experience',
    'What is your highest level of education completed?': 'education',
    'What is your gender?': 'gender',
    'What is your race? (Choose all that apply.)': 'race'
}

# Rename columns
df = df.rename(columns=column_mapping)
print("\nCleaned column names:")
print(df.columns.tolist())


Original column names:
['Timestamp', 'How old are you?', 'What industry do you work in?', 'Job title', 'If your job title needs additional context, please clarify here:', "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)", 'How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.', 'Please indicate the currency', 'If "Other," please indicate the currency here: ', 'If your income needs additional context, please provide it here:', 'What country do you work in?', "If you're in the U.S., what state do you work in?", 'What city do you work in?', 'How many years of professional work experience do you have overall?', 'How many years of professional work experience do you have in your

In [9]:
# Step 2: Clean salary data and handle currencies
print("Salary data cleaning...")
print("Unique currencies:", df['currency'].value_counts())
print("\nSample salary values:")
print(df['annual_salary'].head(10))

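# Clean salary data - remove commas, convert to numeric
import re

def clean_salary(salary_str):
    """Return the first numeric value found in a salary entry, or NaN."""
    if pd.isna(salary_str) or salary_str == '':
        return np.nan
    # Convert to string and strip thousands separators
    salary_str = str(salary_str).replace(',', '')
    # Extract the first number; the pattern requires at least one digit,
    # so a stray '.' in free text can never reach float()
    match = re.search(r'\d*\.?\d+', salary_str)
    if match:
        return float(match.group())
    return np.nan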

df['annual_salary_clean'] = df['annual_salary'].apply(clean_salary)
print(f"\nCleaned salary data - non-null values: {df['annual_salary_clean'].notna().sum()}")
print(f"Salary range: ${df['annual_salary_clean'].min():,.0f} - ${df['annual_salary_clean'].max():,.0f}")


Salary data cleaning...
Unique currencies: currency
USD        23374
CAD         1673
GBP         1591
EUR          643
AUD/NZD      504
Other        160
CHF           37
SEK           37
JPY           23
ZAR           16
HKD            4
Name: count, dtype: int64

Sample salary values:
0     55,000
1     54,600
2     34,000
3     62,000
4     60,000
5     62,000
6     33,000
7     50,000
8    112,000
9     45,000
Name: annual_salary, dtype: object

Cleaned salary data - non-null values: 28062
Salary range: $0 - $6,000,070,000
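
The currency counts above use a combined `AUD/NZD` label and include several currencies (CHF, SEK, JPY, ZAR, HKD, Other) for which the notebook defines no exchange rate. A minimal sketch of a conversion helper keyed on the survey's own labels is shown below. It returns NaN when no rate is known, so unconverted amounts are not silently counted as USD. The names `RATES_TO_USD` and `to_usd` are hypothetical, the rates are the approximate 2021 values used elsewhere in this notebook, and applying the AUD rate to the combined `AUD/NZD` label is an assumption.

```python
import math

# Approximate 2021 rates taken from this notebook's conversion cell.
# Assumption: the AUD rate is applied to the survey's combined 'AUD/NZD' label.
RATES_TO_USD = {
    "USD": 1.0,
    "GBP": 1.37,
    "CAD": 0.79,
    "EUR": 1.18,
    "AUD/NZD": 0.73,
}


def to_usd(amount, currency):
    """Convert amount to USD, or return NaN when the amount or rate is missing."""
    if amount is None or (isinstance(amount, float) and math.isnan(amount)):
        return math.nan
    rate = RATES_TO_USD.get(currency)
    if rate is None:
        # Unknown currency: flag it instead of passing the raw number through
        return math.nan
    return amount * rate
```

Applied row-wise, for example with `df.apply(lambda r: to_usd(r['annual_salary_clean'], r['currency']), axis=1)`, rows in unlisted currencies come out as NaN and drop out of later medians and means.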


In [10]:
# Step 3: Handle currency conversion and filter for US data
# Convert non-USD salaries to USD (approximate 2021 rates)
def convert_to_usd(salary, currency):
    if pd.isna(salary) or pd.isna(currency):
        return salary
    
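    # 2021 approximate exchange rates.
    # Caveat: the survey labels Australian and New Zealand dollars 'AUD/NZD',
    # so the 'AUD' key never matches, and currencies without a rate here
    # fall through to the final return unconverted.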
    conversion_rates = {
        'USD': 1.0,
        'GBP': 1.37,  # GBP to USD
        'CAD': 0.79,  # CAD to USD
        'EUR': 1.18,  # EUR to USD
        'AUD': 0.73,  # AUD to USD
    }
    
    if currency in conversion_rates:
        return salary * conversion_rates[currency]
    return salary

# Apply currency conversion
df['salary_usd'] = df.apply(lambda row: convert_to_usd(row['annual_salary_clean'], row['currency']), axis=1)

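# Filter for US data (substring match; with case=False the bare 'US' pattern
# also matches country names that contain 'us', such as 'Australia')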
us_data = df[df['country'].str.contains('United States|USA|US', case=False, na=False)]
print(f"Total records: {len(df)}")
print(f"US records: {len(us_data)}")
print(f"US salary range: ${us_data['salary_usd'].min():,.0f} - ${us_data['salary_usd'].max():,.0f}")


Total records: 28062
US records: 22786
US salary range: $0 - $10,000,000
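
Because `str.contains` does substring matching, a case-insensitive `US` alternative can also match countries whose names merely contain "us". One way to tighten this, sketched below, is to normalize each entry and compare it against an explicit set of spellings. The function name `is_us_country` and the alias set are assumptions for illustration. The real column is free text, so the set would need extending after a look at `df['country'].value_counts()`.

```python
import re

# Assumption: spellings treated as the United States after normalization.
US_ALIASES = {"us", "usa", "united states", "united states of america"}


def is_us_country(raw):
    """Return True when a free-text country entry normalizes to a US alias."""
    if not isinstance(raw, str):
        return False  # NaN and other non-string values
    # Lowercase, drop periods (U.S.A. -> usa), collapse repeated whitespace
    cleaned = re.sub(r"\s+", " ", raw.lower().replace(".", "")).strip()
    return cleaned in US_ALIASES
```

With pandas this could be used as `us_data = df[df['country'].map(is_us_country)]`. Exact matching trades recall for precision: misspellings are dropped until they are added to the alias set.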


In [11]:
# Question 1: What is the median salary for Software Engineers in the United States?
print("=== QUESTION 1: Software Engineers Median Salary ===")

# Filter for Software Engineers in US
software_engineers = us_data[
    (us_data['job_title'].str.contains('Software Engineer|Software Developer|Developer', case=False, na=False)) |
    (us_data['job_title'].str.contains('Engineer', case=False, na=False) & 
     us_data['job_title'].str.contains('Software', case=False, na=False))
]

# Remove outliers (salaries that seem unrealistic)
software_engineers_clean = software_engineers[
    (software_engineers['salary_usd'] >= 30000) & 
    (software_engineers['salary_usd'] <= 500000)
]

print(f"Software Engineers found: {len(software_engineers_clean)}")
print(f"Salary range: ${software_engineers_clean['salary_usd'].min():,.0f} - ${software_engineers_clean['salary_usd'].max():,.0f}")

if len(software_engineers_clean) > 0:
    median_salary = software_engineers_clean['salary_usd'].median()
    print(f"\nMEDIAN SALARY FOR SOFTWARE ENGINEERS IN US: ${median_salary:,.0f}")
else:
    print("No Software Engineers found in the data")


=== QUESTION 1: Software Engineers Median Salary ===
Software Engineers found: 1103
Salary range: $39,520 - $500,000

MEDIAN SALARY FOR SOFTWARE ENGINEERS IN US: $134,000
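
The title filter above accepts any title containing "Developer", which is broader than "Software Engineer". Which titles count is a business judgment call, and a stricter rule will change both the count and the median. As a hedged sketch of one alternative, the hypothetical helper below requires the whole word "software" plus the whole word "engineer" or "developer", in either order.

```python
import re

# Assumption: a Software Engineer title contains the word "software"
# and the word "engineer" or "developer" (whole words, any order).
HAS_SOFTWARE = re.compile(r"\bsoftware\b", re.IGNORECASE)
HAS_ROLE = re.compile(r"\b(engineer|developer)\b", re.IGNORECASE)


def is_software_engineer(title):
    """Return True for titles that match the stricter Software Engineer rule."""
    if not isinstance(title, str):
        return False  # NaN and other non-string values
    return bool(HAS_SOFTWARE.search(title) and HAS_ROLE.search(title))
```

Under this rule a title such as "Software Engineering Manager" is excluded because "Engineering" is not the whole word "engineer". Loosening the second pattern would bring such titles back in.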


In [12]:
# Question 2: Which US state has the highest average salary for tech workers?
print("\n=== QUESTION 2: Highest Average Salary by State for Tech Workers ===")

# Filter for tech workers in US
tech_workers = us_data[
    (us_data['industry'].str.contains('Computing or Tech|Tech', case=False, na=False)) |
    (us_data['job_title'].str.contains('Software|Developer|Engineer|Data|Analyst|Tech', case=False, na=False))
]

# Remove outliers
tech_workers_clean = tech_workers[
    (tech_workers['salary_usd'] >= 30000) & 
    (tech_workers['salary_usd'] <= 500000)
]

print(f"Tech workers found: {len(tech_workers_clean)}")

# Group by state and calculate average salary
state_avg_salary = tech_workers_clean.groupby('state')['salary_usd'].agg(['mean', 'count']).reset_index()
state_avg_salary = state_avg_salary[state_avg_salary['count'] >= 3]  # At least 3 people per state
state_avg_salary = state_avg_salary.sort_values('mean', ascending=False)

print(f"\nTop 10 states by average tech salary:")
print(state_avg_salary.head(10))

if len(state_avg_salary) > 0:
    highest_state = state_avg_salary.iloc[0]
    print(f"\nHIGHEST PAYING STATE FOR TECH WORKERS: {highest_state['state']}")
    print(f"Average salary: ${highest_state['mean']:,.0f}")
    print(f"Number of tech workers: {highest_state['count']}")
else:
    print("No sufficient tech worker data by state")



=== QUESTION 2: Highest Average Salary by State for Tech Workers ===
Tech workers found: 6435

Top 10 states by average tech salary:
                   state           mean  count
11            California  140059.102941    952
54  New Jersey, New York  140000.000000      3
75            Washington  132048.876623    462
57              New York  129375.556863    510
39         Massachusetts  118300.686192    478
68          South Dakota  115333.333333      3
63                Oregon  111266.804348    230
74              Virginia  111143.339286    224
20  District of Columbia  110674.976048    167
16              Colorado  109202.486957    230

HIGHEST PAYING STATE FOR TECH WORKERS: California
Average salary: $140,059
Number of tech workers: 952
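
The state table above contains a combined value, `New Jersey, New York`, which ranks second on only 3 respondents. A minimal sketch for handling such multi-state answers is to keep a row's state only when exactly one state was selected. The helper name `single_state` is hypothetical, and it assumes multi-state answers are comma-separated, as that one row suggests.

```python
def single_state(raw):
    """Return the state name if exactly one state was selected, else None."""
    if not isinstance(raw, str):
        return None  # NaN and other non-string values
    # Assumption: multiple selections are joined with commas
    parts = [part.strip() for part in raw.split(",") if part.strip()]
    return parts[0] if len(parts) == 1 else None
```

Grouping on `tech_workers_clean['state'].map(single_state)` would drop the combined rows. Raising the minimum group size above 3 is a separate choice that would also remove small groups such as South Dakota.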


In [14]:
!pip install scipy

Collecting scipy
  Using cached scipy-1.16.2-cp312-cp312-win_amd64.whl.metadata (60 kB)
Using cached scipy-1.16.2-cp312-cp312-win_amd64.whl (38.6 MB)
Installing collected packages: scipy
Successfully installed scipy-1.16.2





In [15]:
# Question 3: How much does salary increase on average for each year of experience in tech?
print("\n=== QUESTION 3: Salary Increase per Year of Experience in Tech ===")

# Use the tech workers data from previous question
tech_exp_data = tech_workers_clean.copy()

# Convert experience to numeric values
def convert_experience_to_numeric(exp_str):
    if pd.isna(exp_str):
        return np.nan
    
    exp_str = str(exp_str).lower()
    if '1 year or less' in exp_str:
        return 0.5
    elif '2 - 4 years' in exp_str:
        return 3
    elif '5-7 years' in exp_str:
        return 6
    elif '8 - 10 years' in exp_str:
        return 9
    elif '11 - 20 years' in exp_str:
        return 15.5
    elif '21 - 30 years' in exp_str:
        return 25.5
    elif '31 - 40 years' in exp_str:
        return 35.5
    elif '41 years or more' in exp_str:
        return 45
    return np.nan

tech_exp_data['experience_numeric'] = tech_exp_data['total_experience'].apply(convert_experience_to_numeric)

# Remove rows with missing experience or salary
tech_exp_clean = tech_exp_data.dropna(subset=['experience_numeric', 'salary_usd'])

print(f"Tech workers with experience data: {len(tech_exp_clean)}")

if len(tech_exp_clean) > 10:
    # Calculate correlation and linear regression
    correlation = tech_exp_clean['experience_numeric'].corr(tech_exp_clean['salary_usd'])
    
    # Simple linear regression
    from scipy import stats
    slope, intercept, r_value, p_value, std_err = stats.linregress(
        tech_exp_clean['experience_numeric'], 
        tech_exp_clean['salary_usd']
    )
    
    print(f"Correlation between experience and salary: {correlation:.3f}")
    print(f"R-squared: {r_value**2:.3f}")
    print(f"\nSALARY INCREASE PER YEAR OF EXPERIENCE IN TECH: ${slope:,.0f}")
    print(f"P-value: {p_value:.3f}")
    
    # Show experience groups
    exp_groups = tech_exp_clean.groupby('total_experience')['salary_usd'].agg(['mean', 'count']).reset_index()
    exp_groups = exp_groups.sort_values('mean')
    print(f"\nAverage salary by experience level:")
    print(exp_groups)
else:
    print("Insufficient data for experience analysis")



=== QUESTION 3: Salary Increase per Year of Experience in Tech ===
Tech workers with experience data: 6435
Correlation between experience and salary: 0.244
R-squared: 0.059

SALARY INCREASE PER YEAR OF EXPERIENCE IN TECH: $1,509
P-value: 0.000

Average salary by experience level:
   total_experience           mean  count
0    1 year or less   77023.202797    143
2       2 - 4 years   89543.564895    809
6         5-7 years   99023.145423   1169
7      8 - 10 years  109628.455479   1168
5  41 years or more  117799.064516     31
1     11 - 20 years  122342.287828   2095
4     31 - 40 years  127019.151899    158
3     21 - 30 years  129349.104408    862
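
The slope reported by `stats.linregress` is the ordinary least-squares estimate: the covariance of experience and salary divided by the variance of experience. The sketch below computes it by hand on a few made-up points (not survey values) to show what the "$ per year" figure means. The function name `ols_slope_intercept` is hypothetical.

```python
def ols_slope_intercept(xs, ys):
    """Return the least-squares slope and intercept for paired samples."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative points, not survey data
years = [0.5, 3, 6, 9]
salary = [60000, 65000, 71000, 77000]
slope, intercept = ols_slope_intercept(years, salary)
print(f"slope = {slope:,.0f} per year, intercept = {intercept:,.0f}")
```

Because experience is recorded in ranges, the slope also depends on the midpoint assigned to each range, for example the choice of 45 for "41 years or more".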


In [16]:
# Question 4: Which industry (besides tech) has the highest median salary?
print("\n=== QUESTION 4: Highest Median Salary Industry (Non-Tech) ===")

# Filter for non-tech industries
non_tech_data = us_data[
    ~us_data['industry'].str.contains('Computing or Tech|Tech', case=False, na=False)
]

# Remove outliers
non_tech_clean = non_tech_data[
    (non_tech_data['salary_usd'] >= 20000) & 
    (non_tech_data['salary_usd'] <= 500000)
]

print(f"Non-tech workers: {len(non_tech_clean)}")

# Group by industry and calculate median salary
industry_median = non_tech_clean.groupby('industry')['salary_usd'].agg(['median', 'count']).reset_index()
industry_median = industry_median[industry_median['count'] >= 5]  # At least 5 people per industry
industry_median = industry_median.sort_values('median', ascending=False)

print(f"\nTop 10 industries by median salary (non-tech):")
print(industry_median.head(10))

if len(industry_median) > 0:
    highest_industry = industry_median.iloc[0]
    print(f"\nHIGHEST PAYING NON-TECH INDUSTRY: {highest_industry['industry']}")
    print(f"Median salary: ${highest_industry['median']:,.0f}")
    print(f"Number of workers: {highest_industry['count']}")
else:
    print("No sufficient non-tech industry data")



=== QUESTION 4: Highest Median Salary Industry (Non-Tech) ===
Non-tech workers: 18799

Top 10 industries by median salary (non-tech):
             industry    median  count
195           Defense  148000.0      5
161        Consulting  135000.0      5
773          Software  132500.0      6
564            Pharma  127500.0     16
571   Pharmaceutical   122000.0      9
585   Pharmaceuticals  115000.0     19
240            Energy  110500.0     12
541       Oil and Gas  106500.0      7
570    Pharmaceutical  100000.0     15
586  Pharmaceuticals   100000.0      5

HIGHEST PAYING NON-TECH INDUSTRY: Defense
Median salary: $148,000
Number of workers: 5
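
The table above shows the same industry split across several free-text spellings (`Pharma`, `Pharmaceutical`, `Pharmaceuticals`, plus variants that differ only by trailing whitespace), so each group's median rests on a handful of rows. A hedged sketch of a normalization step follows. The helper `normalize_industry` and the alias map are hypothetical, and a real map would be built after inspecting `df['industry'].value_counts()`.

```python
import re

# Hypothetical alias map: lowercase spelling -> canonical label
INDUSTRY_ALIASES = {
    "pharma": "Pharmaceuticals",
    "pharmaceutical": "Pharmaceuticals",
    "pharmaceuticals": "Pharmaceuticals",
}


def normalize_industry(raw):
    """Trim and collapse whitespace, then map known spellings to one label."""
    if not isinstance(raw, str):
        return None  # NaN and other non-string values
    cleaned = re.sub(r"\s+", " ", raw).strip()
    if not cleaned:
        return None
    # Fall back to the cleaned text when no alias is defined
    return INDUSTRY_ALIASES.get(cleaned.lower(), cleaned)
```

Grouping on `non_tech_clean['industry'].map(normalize_industry)` would merge those rows before the median is taken. Free-text entries such as `Software` also appear in this non-tech list, so the tech filter itself may deserve the same treatment.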


## Step 2: Data Cleaning


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [None]:
# Question 1: What is the median salary for Software Engineers in the United States?


In [None]:
# Question 2: Which US state has the highest average salary for tech workers?


In [None]:
# Question 3: How much does salary increase on average for each year of experience in tech?

In [None]:

# Question 4: What percentage of respondents work remotely vs. in-office?


In [None]:
# Question 5: Which industry (besides tech) has the highest median salary?

In [None]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $134,000
2. **Highest paying US state for tech:** California
3. **Salary increase per year of experience:** $1,509 per year
4. **Remote vs office percentage:** Not answered; the survey columns loaded above include no remote-work question
5. **Highest paying non-tech industry:** Defense

**Key insights:**
- **Experience matters in tech, but explains little on its own:** Each year of experience is associated with roughly $1,509 more in salary, yet the fit is weak (correlation 0.244, R-squared 0.059)
- **Geographic location impacts tech salaries:** California leads in tech compensation
- **Software Engineering is well-compensated:** Median salary of $134,000 shows strong market value
- **Defense tops the non-tech list:** Its $148,000 median rivals tech salaries, but that figure rests on only 5 respondents
- **Data quality is crucial:** Proper cleaning and outlier removal significantly affected results

**Challenges Faced:**
- **Messy salary data:** Handled by creating a regex-based cleaning function to extract numeric values
- **Outlier detection:** Applied salary bounds ($30K-$500K for the tech questions, $20K-$500K for the non-tech industry comparison) to remove unrealistic values

**What you learned about vibe coding:**
- **Good for starting:** More than 90% of this file is vibe coded. It's great for getting started, but it's really easy to lose track of what is going on.
- **Prompting:** Writing clear prompts helped, and this time it did not mess with my environments, which I am happy about.
- **Far from perfect:** At the end of the day, someone still needs to review the work before it is production-ready.

