# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.
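
The 5% criterion can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper named `within_tolerance`; the numbers are made up for illustration, since the official results are not given here.

```python
def within_tolerance(actual: float, expected: float, tol: float = 0.05) -> bool:
    """Return True when `actual` differs from `expected` by at most `tol` (relative)."""
    if expected == 0:
        # Relative error is undefined for a zero target; require an exact match.
        return actual == 0
    return abs(actual - expected) / abs(expected) <= tol


# Hypothetical values for illustration only; these are not official answers.
print(within_tolerance(103_000, 100_000))  # 3% off
print(within_tolerance(94_000, 100_000))   # 6% off
```

The comparison is relative to the expected value, so the allowed absolute error grows with the size of the number being checked.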


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

### Data Exploration & Understanding
- Load and explore the TSV dataset to understand structure, columns, data types, and initial data quality issues

### Data Cleaning Tasks
- Clean salary data - handle different currencies, convert to USD, remove outliers, handle missing values and invalid entries
- Clean location data - standardize US state names, handle different country formats, identify US vs non-US workers
- Clean job titles - standardize tech job titles, identify software engineers, categorize tech vs non-tech roles
- Clean experience data - convert ranges to numeric values, handle missing experience data
- Clean demographic data - standardize gender categories, education levels, industry classifications

### Business Questions Analysis
- Answer Q1: Calculate median salary for Software Engineers in the United States
- Answer Q2: Find which US state has the highest average salary for tech workers
- Answer Q3: Calculate average salary increase per year of experience in tech
- Answer Q4: Find which industry (besides tech) has the highest median salary

### Final Deliverables
- Create comprehensive summary with findings, insights, challenges faced, and learnings

## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the dataset
df = pd.read_csv('../../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv', 
                 sep='\t')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print("\nFirst few rows:")
df.head()

Dataset loaded successfully!
Shape: (28062, 18)
Columns: ['Timestamp', 'How old are you?', 'What industry do you work in?', 'Job title', 'If your job title needs additional context, please clarify here:', "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)", 'How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.', 'Please indicate the currency', 'If "Other," please indicate the currency here: ', 'If your income needs additional context, please provide it here:', 'What country do you work in?', "If you're in the U.S., what state do you work in?", 'What city do you work in?', 'How many years of professional work experience do you have overall?', 'How many years of professional w

Unnamed: 0,Timestamp,How old are you?,What industry do you work in?,Job title,"If your job title needs additional context, please clarify here:","What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)","How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",Please indicate the currency,"If ""Other,"" please indicate the currency here:","If your income needs additional context, please provide it here:",What country do you work in?,"If you're in the U.S., what state do you work in?",What city do you work in?,How many years of professional work experience do you have overall?,How many years of professional work experience do you have in your field?,What is your highest level of education completed?,What is your gender?,What is your race? (Choose all that apply.)
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White
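
The column names are the full survey questions, which makes later filters hard to read. One option is to rename them to short aliases right after loading. This is a sketch only: the alias names are hypothetical, and the cells that follow keep the original column names.

```python
import pandas as pd

# Hypothetical short aliases; the rest of this notebook uses the original names.
SHORT_NAMES = {
    "What industry do you work in?": "industry",
    "Job title": "job_title",
    "Please indicate the currency": "currency",
    "What country do you work in?": "country",
    "If you're in the U.S., what state do you work in?": "state",
}

# Tiny stand-in frame built from the first row shown above.
sample = pd.DataFrame(
    {
        "What industry do you work in?": ["Education (Higher Education)"],
        "Job title": ["Research and Instruction Librarian"],
        "Please indicate the currency": ["USD"],
        "What country do you work in?": ["United States"],
        "If you're in the U.S., what state do you work in?": ["Massachusetts"],
    }
)

# rename() returns a new frame, so the original column names stay available.
short = sample.rename(columns=SHORT_NAMES)
print(list(short.columns))
```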


## Step 2: Data Cleaning

In [None]:
# Step 2: Data Cleaning - Salary Data
print("=== SALARY DATA CLEANING ===")

# First, let's examine the salary column more closely
salary_col = 'What is your annual salary? (You\'ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)'
currency_col = 'Please indicate the currency'

print(f"Original salary data types: {df[salary_col].dtype}")
print(f"Sample salary values:")
print(df[salary_col].head(10))

print(f"\nCurrency distribution:")
print(df[currency_col].value_counts())

# Convert salary to string for cleaning
df[salary_col] = df[salary_col].astype(str)

# Remove common non-numeric characters and clean salary data
import re

def clean_salary(salary_str):
    """Parse a raw salary entry into a float, or return None if it cannot be parsed."""
    if pd.isna(salary_str) or salary_str == 'nan':
        return None
    
    # Normalize: lowercase, then drop commas, dollar signs, and spaces
    salary_str = str(salary_str).lower()
    salary_str = salary_str.replace(',', '')  # Remove commas
    salary_str = salary_str.replace('$', '')  # Remove dollar signs
    salary_str = salary_str.replace(' ', '')  # Remove spaces
    
    # Handle 'k' notation (multiply by 1000)
    if 'k' in salary_str:
        salary_str = salary_str.replace('k', '')
        try:
            return float(salary_str) * 1000
        except ValueError:
            # Leftover text around the number, so treat the entry as invalid
            return None
    
    # Extract the first number (digits with an optional decimal part)
    numbers = re.findall(r'\d+\.?\d*', salary_str)
    if numbers:
        return float(numbers[0])
    
    return None

# Apply salary cleaning
df['salary_cleaned'] = df[salary_col].apply(clean_salary)

print(f"\nAfter cleaning:")
print(f"Valid salaries: {df['salary_cleaned'].notna().sum()}")
print(f"Invalid salaries: {df['salary_cleaned'].isna().sum()}")
print(f"Sample cleaned salaries: {df['salary_cleaned'].dropna().head(10).tolist()}")

# Currency conversion to USD (approximate 2021 rates)
print("\n=== CURRENCY CONVERSION TO USD ===")

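# Approximate 2021 exchange rates, keyed by the survey's currency labels
exchange_rates = {
    'USD': 1.0,
    'GBP': 1.37,  # British Pound to USD
    'CAD': 0.79,  # Canadian Dollar to USD
    'EUR': 1.18,  # Euro to USD
    'AUD/NZD': 0.73,  # Combined AUD/NZD option, approximated with the Australian Dollar rate
    'Other': 1.0   # Simplification: treated as USD; unlisted codes (CHF, SEK, JPY, ...) also fall back to 1.0 below
}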

def convert_to_usd(salary, currency):
    if pd.isna(salary) or pd.isna(currency):
        return None
    
    currency = str(currency).strip()
    rate = exchange_rates.get(currency, 1.0)  # Default to 1.0 if currency not found
    
    return salary * rate

# Convert salaries to USD
df['salary_usd'] = df.apply(lambda row: convert_to_usd(row['salary_cleaned'], row[currency_col]), axis=1)

print(f"Salaries converted to USD:")
print(f"Valid USD salaries: {df['salary_usd'].notna().sum()}")
print(f"Sample USD salaries: {df['salary_usd'].dropna().head(10).tolist()}")

# Check for outliers (salaries that seem too high or too low)
print(f"\nSalary statistics (USD):")
print(f"Min: ${df['salary_usd'].min():,.0f}")
print(f"Max: ${df['salary_usd'].max():,.0f}")
print(f"Median: ${df['salary_usd'].median():,.0f}")
print(f"Mean: ${df['salary_usd'].mean():,.0f}")

# Remove extreme outliers (salaries < $10k or > $500k)
df['salary_usd_clean'] = df['salary_usd']
df.loc[(df['salary_usd'] < 10000) | (df['salary_usd'] > 500000), 'salary_usd_clean'] = None

print(f"\nAfter removing outliers (< $10k or > $500k):")
print(f"Valid salaries: {df['salary_usd_clean'].notna().sum()}")
print(f"Removed outliers: {df['salary_usd'].notna().sum() - df['salary_usd_clean'].notna().sum()}")

# Location Data Cleaning
print("\n=== LOCATION DATA CLEANING ===")

country_col = 'What country do you work in?'
state_col = "If you're in the U.S., what state do you work in?"

# Clean country data - standardize US variations
def clean_country(country):
    if pd.isna(country):
        return None
    
    country = str(country).strip()
    country_lower = country.lower()
    
    # Standardize US variations. Short forms must match exactly: a substring
    # test on 'us' would also catch 'Australia', 'Russia', or 'Austria'.
    if country_lower in {'us', 'usa'} or 'united states' in country_lower:
        return 'United States'
    
    return country

df['country_clean'] = df[country_col].apply(clean_country)

print(f"Country cleaning results:")
print(f"US entries: {df[df['country_clean'] == 'United States'].shape[0]}")
print(f"Non-US entries: {df[df['country_clean'] != 'United States'].shape[0]}")

# Clean US state data
us_data = df[df['country_clean'] == 'United States'].copy()

# Common state name mappings
state_mappings = {
    'california': 'California',
    'ca': 'California',
    'new york': 'New York',
    'ny': 'New York',
    'texas': 'Texas',
    'tx': 'Texas',
    'florida': 'Florida',
    'fl': 'Florida',
    'washington': 'Washington',
    'wa': 'Washington',
    'massachusetts': 'Massachusetts',
    'ma': 'Massachusetts',
    'illinois': 'Illinois',
    'il': 'Illinois',
    'pennsylvania': 'Pennsylvania',
    'pa': 'Pennsylvania',
    'ohio': 'Ohio',
    'oh': 'Ohio',
    'georgia': 'Georgia',
    'ga': 'Georgia'
}

def clean_state(state):
    if pd.isna(state):
        return None
    
    state = str(state).strip().lower()
    return state_mappings.get(state, state.title())

# Apply state cleaning to US data
us_data['state_clean'] = us_data[state_col].apply(clean_state)

print(f"\nUS state cleaning:")
print(f"Unique states before cleaning: {us_data[state_col].nunique()}")
print(f"Unique states after cleaning: {us_data['state_clean'].nunique()}")
print(f"Sample cleaned states: {us_data['state_clean'].value_counts().head(10)}")

# Update main dataframe with cleaned location data
df.loc[df['country_clean'] == 'United States', 'state_clean'] = us_data['state_clean']

# Job Title Cleaning
print("\n=== JOB TITLE CLEANING ===")

job_title_col = 'Job title'
industry_col = 'What industry do you work in?'

# Define tech keywords to identify tech roles
tech_keywords = [
    'software', 'engineer', 'developer', 'programmer', 'coder',
    'data scientist', 'data analyst', 'data engineer', 'analyst',
    'architect', 'devops', 'sre', 'site reliability',
    'frontend', 'backend', 'full stack', 'fullstack',
    'mobile', 'ios', 'android', 'web developer',
    'machine learning', 'ml engineer', 'ai engineer',
    'cybersecurity', 'security engineer', 'infrastructure',
    'cloud', 'aws', 'azure', 'gcp', 'kubernetes',
    'product manager', 'technical product', 'scrum master',
    'qa', 'quality assurance', 'test engineer', 'automation'
]

# Define software engineer specific keywords
software_engineer_keywords = [
    'software engineer', 'software developer', 'software programmer',
    'senior software engineer', 'principal software engineer',
    'staff software engineer', 'lead software engineer',
    'software architect', 'senior developer', 'lead developer',
    'principal developer', 'staff developer'
]

def is_tech_role(job_title, industry):
    if pd.isna(job_title):
        return False
    
    job_title_lower = str(job_title).lower()
    industry_lower = str(industry).lower() if not pd.isna(industry) else ''
    
    # Check if industry is tech
    if 'computing' in industry_lower or 'tech' in industry_lower:
        return True
    
    # Check job title for tech keywords
    return any(keyword in job_title_lower for keyword in tech_keywords)

def is_software_engineer(job_title):
    if pd.isna(job_title):
        return False
    
    job_title_lower = str(job_title).lower()
    return any(keyword in job_title_lower for keyword in software_engineer_keywords)

# Apply tech role identification
df['is_tech_role'] = df.apply(lambda row: is_tech_role(row[job_title_col], row[industry_col]), axis=1)
df['is_software_engineer'] = df[job_title_col].apply(is_software_engineer)

print(f"Tech role identification:")
print(f"Total tech roles: {df['is_tech_role'].sum()}")
print(f"Software engineers: {df['is_software_engineer'].sum()}")

print(f"\nSample tech job titles:")
tech_jobs = df[df['is_tech_role'] == True][job_title_col].value_counts().head(15)
print(tech_jobs)

print(f"\nSample software engineer titles:")
se_jobs = df[df['is_software_engineer'] == True][job_title_col].value_counts().head(10)
print(se_jobs)


=== SALARY DATA CLEANING ===
Original salary data types: object
Sample salary values:
0     55,000
1     54,600
2     34,000
3     62,000
4     60,000
5     62,000
6     33,000
7     50,000
8    112,000
9     45,000
Name: What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.), dtype: object

Currency distribution:
Please indicate the currency
USD        23374
CAD         1673
GBP         1591
EUR          643
AUD/NZD      504
Other        160
CHF           37
SEK           37
JPY           23
ZAR           16
HKD            4
Name: count, dtype: int64

After cleaning:
Valid salaries: 28062
Invalid salaries: 0
Sample cleaned salaries: [55000.0, 54600.0, 34000.0, 62000.0, 60000.0, 62000.0, 33000.0, 50000.0, 112000.0, 45000.0]

=== CURRENCY CONVERSION TO USD ===
Salaries converted to USD:
Valid USD salaries: 28062
Sam
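
The conversion step above gives any currency missing from `exchange_rates` a rate of 1.0, so a salary in an unmapped currency is counted as if it were USD. One alternative is to leave such rows empty and list which currencies still need a rate. The sketch below uses toy data and only rates already listed in this notebook; the frame and column names are hypothetical.

```python
import pandas as pd

# Rates copied from the notebook's own approximations; nothing new is assumed here.
RATES_TO_USD = {"USD": 1.0, "GBP": 1.37, "CAD": 0.79, "EUR": 1.18}

# Toy rows for illustration; the JPY figure is made up.
toy = pd.DataFrame(
    {
        "salary_cleaned": [55000.0, 54600.0, 6_000_000.0],
        "currency": ["USD", "GBP", "JPY"],
    }
)

# Series.map returns NaN for keys missing from the dict, so an unmapped
# currency yields a NaN salary rather than a wrongly scaled one.
toy["salary_usd"] = toy["salary_cleaned"] * toy["currency"].map(RATES_TO_USD)

# Currencies that still need a rate before their rows can be used.
unmapped = toy.loc[toy["salary_usd"].isna(), "currency"].unique().tolist()
print(toy)
print("Currencies without a rate:", unmapped)
```

Rows left as NaN drop out of medians and means automatically, which is usually safer than including a value on the wrong scale.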

## Step 3: Business Questions Analysis

Now answer those important business questions!

In [None]:
# Question 1: What is the median salary for Software Engineers in the United States?

print("=== QUESTION 1: MEDIAN SALARY FOR SOFTWARE ENGINEERS IN US ===")

# Filter for Software Engineers in the United States
se_us_filter = (df['is_software_engineer'] == True) & (df['country_clean'] == 'United States') & (df['salary_usd_clean'].notna())

se_us_data = df[se_us_filter]

if len(se_us_data) > 0:
    median_salary = se_us_data['salary_usd_clean'].median()
    
    print(f"ANSWER: The median salary for Software Engineers in the United States is ${median_salary:,.0f}")
    
else:
    print("No Software Engineers found in US with valid salary data!")
    print("Let's check what we have:")
    print(f"Total Software Engineers: {df['is_software_engineer'].sum()}")
    print(f"US entries: {df[df['country_clean'] == 'United States'].shape[0]}")
    print(f"Valid salaries: {df['salary_usd_clean'].notna().sum()}")


=== QUESTION 1: MEDIAN SALARY FOR SOFTWARE ENGINEERS IN US ===
ANSWER: The median salary for Software Engineers in the United States is $140,000


In [None]:
# Question 2: Which US state has the highest average salary for tech workers?

print("=== QUESTION 2: HIGHEST AVERAGE SALARY FOR TECH WORKERS BY US STATE ===")

# Filter for tech workers in the United States with valid salary data
tech_us_filter = (
    (df['is_tech_role'] == True)
    & (df['country_clean'] == 'United States')
    & (df['salary_usd_clean'].notna())
    & (df['state_clean'].notna())
)

tech_us_data = df[tech_us_filter].copy()

if len(tech_us_data) > 0:
    # Require a minimum sample size per state to avoid small-sample noise
    min_samples = 20
    state_counts = tech_us_data.groupby('state_clean').size()
    valid_states = state_counts[state_counts >= min_samples].index
    tech_us_valid = tech_us_data[tech_us_data['state_clean'].isin(valid_states)]

    if len(tech_us_valid) == 0:
        tech_us_valid = tech_us_data  # fallback if all states are small

    # Calculate mean salary by state
    state_avg_salaries = (
        tech_us_valid.groupby('state_clean')['salary_usd_clean']
        .mean()
        .sort_values(ascending=False)
    )

    highest_paying_state = state_avg_salaries.index[0]
    highest_avg_salary = state_avg_salaries.iloc[0]

    print(
        f"ANSWER: {highest_paying_state} has the highest average salary for tech workers at ${highest_avg_salary:,.0f}"
    )
else:
    print("No tech workers found in US with valid salary and state data!")
    print("Let's check what we have:")
    print(f"Total tech roles: {df['is_tech_role'].sum()}")
    print(f"US entries: {df[df['country_clean'] == 'United States'].shape[0]}")
    print(f"Valid salaries: {df['salary_usd_clean'].notna().sum()}")
    print(f"Valid states: {df['state_clean'].notna().sum()}")


=== QUESTION 2: HIGHEST AVERAGE SALARY FOR TECH WORKERS BY US STATE ===
ANSWER: California has the highest average salary for tech workers at $141,498
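
The tech-worker filter behind this answer tests whether each keyword appears anywhere inside the job title, so short keywords can match unrelated words. A hedged sketch of whole-word matching with a regular expression follows; the job titles are hypothetical examples, and only four of the notebook's keywords are used.

```python
import re

# A few entries taken from the notebook's tech_keywords list.
keywords = ["ios", "mobile", "aws", "qa"]

# \b anchors each keyword to word boundaries; re.escape guards special characters.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
    re.IGNORECASE,
)


def matches_substring(title: str) -> bool:
    """Substring test, as used in the notebook's is_tech_role."""
    lowered = title.lower()
    return any(k in lowered for k in keywords)


def matches_whole_word(title: str) -> bool:
    """Whole-word test: the keyword must stand alone in the title."""
    return pattern.search(title) is not None


# Hypothetical titles chosen to show where the two tests disagree.
for title in ["Biostatistician", "iOS Developer", "Automobile Claims Adjuster", "QA Lead"]:
    print(f"{title!r}: substring={matches_substring(title)}, "
          f"whole_word={matches_whole_word(title)}")
```

"Biostatistician" contains the letters "ios" and "Automobile" contains "mobile", so the substring test flags both as tech while the whole-word test does not.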


In [None]:
# Question 3: How much does salary increase on average for each year of experience in tech?

print("=== QUESTION 3: SALARY INCREASE PER YEAR OF EXPERIENCE IN TECH ===")

# First, let's clean the experience data
experience_col = 'How many years of professional work experience do you have overall?'

def clean_experience(exp_str):
    if pd.isna(exp_str):
        return None
    
    exp_str = str(exp_str).lower().strip()
    
    # Handle common ranges
    if '1 year or less' in exp_str:
        return 0.5
    elif '2 - 4 years' in exp_str:
        return 3
    elif '5-7 years' in exp_str or '5 - 7 years' in exp_str:
        return 6
    elif '8 - 10 years' in exp_str:
        return 9
    elif '11 - 20 years' in exp_str:
        return 15.5
    elif '21 - 30 years' in exp_str:
        return 25.5
    elif '31 - 40 years' in exp_str:
        return 35.5
    elif '41 years or more' in exp_str:
        return 45
    
    # Try to extract numbers
    import re
    numbers = re.findall(r'\d+', exp_str)
    if numbers:
        return float(numbers[0])
    
    return None

# Apply experience cleaning
df['experience_clean'] = df[experience_col].apply(clean_experience)

# Filter for tech workers with valid salary and experience data
tech_exp_filter = (
    (df['is_tech_role'] == True) 
    & (df['salary_usd_clean'].notna()) 
    & (df['experience_clean'].notna())
)

tech_exp_data = df[tech_exp_filter]

if len(tech_exp_data) > 0:
    # Calculate correlation between experience and salary
    correlation = tech_exp_data['experience_clean'].corr(tech_exp_data['salary_usd_clean'])
    
    # Calculate average salary increase per year using linear regression
    # (numpy is already imported as np at the top of the notebook)
    from sklearn.linear_model import LinearRegression
    
    X = tech_exp_data[['experience_clean']].values
    y = tech_exp_data['salary_usd_clean'].values
    
    model = LinearRegression()
    model.fit(X, y)
    
    salary_increase_per_year = model.coef_[0]
    
    print(f"ANSWER: Salary increases by ${salary_increase_per_year:,.0f} on average for each year of experience in tech")
    
else:
    print("No tech workers found with valid salary and experience data!")
    print("Let's check what we have:")
    print(f"Total tech roles: {df['is_tech_role'].sum()}")
    print(f"Valid salaries: {df['salary_usd_clean'].notna().sum()}")
    print(f"Valid experience: {df['experience_clean'].notna().sum()}")


=== QUESTION 3: SALARY INCREASE PER YEAR OF EXPERIENCE IN TECH ===
ANSWER: Salary increases by $1,634 on average for each year of experience in tech
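
The figure above is the slope of a least-squares line fitted to salary against years of experience. As a sketch of what that slope means, the example below fits made-up data that lies exactly on a known line, using `np.polyfit` rather than scikit-learn, and cross-checks it against the closed form cov(x, y) / var(x).

```python
import numpy as np

# Made-up data lying exactly on salary = 50,000 + 2,000 * years.
years = np.array([0.5, 3.0, 6.0, 9.0, 15.5])
salary = 50_000 + 2_000 * years

# A degree-1 polyfit returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(years, salary, 1)

# The same slope from the closed form cov(x, y) / var(x).
slope_closed_form = np.cov(years, salary, ddof=1)[0, 1] / np.var(years, ddof=1)

print(f"slope: {slope:,.0f} per year, intercept: {intercept:,.0f}")
print(f"closed-form slope: {slope_closed_form:,.0f} per year")
```

Because the survey records experience as ranges and the notebook maps each range to a single representative value, the fitted slope is an approximation of the per-year increase.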


In [None]:
# Question 4: Which industry (besides tech) has the highest median salary?

print("=== QUESTION 4: HIGHEST MEDIAN SALARY BY INDUSTRY (NON-TECH) ===")

# Filter for non-tech workers with valid salary data
non_tech_filter = (
    (df['is_tech_role'] == False) 
    & (df['salary_usd_clean'].notna())
    & (df['What industry do you work in?'].notna())
)

non_tech_data = df[non_tech_filter]

if len(non_tech_data) > 0:
    # Require a minimum sample size per industry to avoid small-sample noise
    min_samples = 20
    industry_counts = non_tech_data.groupby('What industry do you work in?').size()
    valid_industries = industry_counts[industry_counts >= min_samples].index
    non_tech_valid = non_tech_data[non_tech_data['What industry do you work in?'].isin(valid_industries)]

    if len(non_tech_valid) == 0:
        non_tech_valid = non_tech_data  # fallback if all industries are small

    # Calculate median salary by industry
    industry_median_salaries = (
        non_tech_valid.groupby('What industry do you work in?')['salary_usd_clean']
        .median()
        .sort_values(ascending=False)
    )
    
    highest_paying_industry = industry_median_salaries.index[0]
    highest_median_salary = industry_median_salaries.iloc[0]
    
    print(f"ANSWER: {highest_paying_industry} has the highest median salary at ${highest_median_salary:,.0f}")
    
else:
    print("No non-tech workers found with valid salary and industry data!")
    print("Let's check what we have:")
    print(f"Total non-tech roles: {(df['is_tech_role'] == False).sum()}")
    print(f"Valid salaries: {df['salary_usd_clean'].notna().sum()}")
    print(f"Valid industries: {df['What industry do you work in?'].notna().sum()}")


=== QUESTION 4: HIGHEST MEDIAN SALARY BY INDUSTRY (NON-TECH) ===
ANSWER: Pharmaceuticals has the highest median salary at $115,000


In [None]:
# Question 5: What percentage of respondents work remotely vs. in-office?

print("=== QUESTION 5: REMOTE VS IN-OFFICE WORK PERCENTAGES ===")

# Let's first explore what data we have about work location
print("Available columns related to work location:")
location_columns = [col for col in df.columns if 'remote' in col.lower() or 'office' in col.lower() or 'location' in col.lower() or 'work' in col.lower()]
print(location_columns)

# Check if there's a specific remote work column
remote_col = None
for col in df.columns:
    if 'remote' in col.lower() and 'work' in col.lower():
        remote_col = col
        break

if remote_col:
    print(f"\nUsing column: {remote_col}")
    print(f"Unique values:")
    print(df[remote_col].value_counts())
    
    # Calculate percentages
    total_responses = df[remote_col].notna().sum()
    remote_responses = df[df[remote_col].str.contains('remote', case=False, na=False)].shape[0]
    office_responses = total_responses - remote_responses
    
    remote_percentage = (remote_responses / total_responses) * 100
    office_percentage = (office_responses / total_responses) * 100
    
    print(f"\nANSWER:")
    print(f"Remote work: {remote_percentage:.1f}% ({remote_responses:,} respondents)")
    print(f"In-office work: {office_percentage:.1f}% ({office_responses:,} respondents)")
    
else:
    print("\nNo specific remote work column found. Let's check city data for clues...")
    
    # Check if city data indicates remote work
    city_col = 'What city do you work in?'
    if city_col in df.columns:
        print(f"\nChecking {city_col} for remote indicators:")
        city_counts = df[city_col].value_counts()
        remote_indicators = city_counts[city_counts.index.str.contains('remote', case=False, na=False)]
        
        if len(remote_indicators) > 0:
            print("Found remote indicators in city data:")
            print(remote_indicators)
            
            total_responses = df[city_col].notna().sum()
            remote_responses = df[df[city_col].str.contains('remote', case=False, na=False)].shape[0]
            office_responses = total_responses - remote_responses
            
            remote_percentage = (remote_responses / total_responses) * 100
            office_percentage = (office_responses / total_responses) * 100
            
            print(f"\nANSWER:")
            print(f"Remote work: {remote_percentage:.1f}% ({remote_responses:,} respondents)")
            print(f"In-office work: {office_percentage:.1f}% ({office_responses:,} respondents)")
        else:
            print("No remote indicators found in city data.")
            print("Unable to determine remote vs in-office percentages from available data.")
    else:
        print("No city column found. Unable to determine remote vs in-office percentages.")


=== QUESTION 5: REMOTE VS IN-OFFICE WORK PERCENTAGES ===
Available columns related to work location:
['What industry do you work in?', "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)", 'What country do you work in?', "If you're in the U.S., what state do you work in?", 'What city do you work in?', 'How many years of professional work experience do you have overall?', 'How many years of professional work experience do you have in your field?']

No specific remote work column found. Let's check city data for clues...

Checking What city do you work in? for remote indicators:
Found remote indicators in city data:
What city do you work in?
Remote                                          133
Remote                                           14
remote                                           12
Remote worker     
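
The output above lists "Remote" more than once, presumably because the entries differ in whitespace or letter case. A minimal sketch of normalizing the city strings before counting, using made-up entries that imitate those variants:

```python
import pandas as pd

# Made-up city entries imitating the variants in the output above.
cities = pd.Series(["Remote", "Remote ", "remote", "Remote worker", "Boston", None])

# Strip whitespace and lowercase so spelling variants collapse into one label.
normalized = cities.str.strip().str.lower()

counts = normalized.value_counts()
mentions_remote = normalized.str.contains("remote", na=False)

print(counts)
print(f"Entries mentioning 'remote': {mentions_remote.sum()} of {cities.notna().sum()}")
```

Even after normalization, a city field only captures respondents who chose to write "remote" there, so the count is a lower bound rather than a measure of remote work.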

In [None]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $140,000
2. **Highest paying US state for tech:** California ($141,498 average)
3. **Salary increase per year of experience:** $1,634 per year
4. **Highest paying non-tech industry:** Pharmaceuticals ($115,000 median)
5. **Remote vs office percentage:** 1.1% remote, 98.9% office (based only on "remote" appearing in the city field)

**Key insights:**
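- Software Engineers in the US (median $140,000) earn more than the highest-paying non-tech industry (Pharmaceuticals, median $115,000)
- California is the best-paying state for tech roles, with an average salary of $141,498
- Only 1.1% of respondents wrote "remote" in the city field. The survey has no remote-work question, so this figure probably undercounts remote workers and does not show that most roles were in-office during Covid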

**Challenges faced:**
- Cursor frequently creates new cells instead of following the listed steps, which has to be fixed manually
- Cursor gives skewed results unless the prompt spells out constraints such as a minimum sample size per group

**What you learned about vibe coding:**
- It's more efficient in that you spend less time writing code, but you spend more time debugging
- Results still have to be checked manually, since it sometimes returns implausible answers such as 400,000
- Vibe coding can automate messy tasks like data cleaning with relative ease
