# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.
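
The 5% criterion can be checked with a small helper. This is a minimal sketch: `within_tolerance` is a hypothetical name, and the numbers below are placeholders, not the official results.

```python
def within_tolerance(actual: float, expected: float, tolerance: float = 0.05) -> bool:
    """Return True when `actual` is within a relative `tolerance` of `expected`."""
    if expected == 0:
        # Relative error is undefined for a zero target, so require an exact match
        return actual == 0
    return abs(actual - expected) / abs(expected) <= tolerance


# Placeholder values for illustration only
print(within_tolerance(104, 100))  # 4% off
print(within_tolerance(110, 100))  # 10% off
```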


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

*(Paste your Cursor todo list here)*

- [ ] Load TSV data and perform initial exploration (shape, columns, basic stats)
- [ ] Clean and standardize salary, currency, country, state, industry, job title, and experience columns
- [ ] Calculate median salary for Software Engineers in the US
- [ ] Find US state with highest average tech worker salary
- [ ] Calculate salary increase per year of experience in tech
- [ ] Identify non-tech industry with highest median salary


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [10]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
print("\n1. LOADING DATA...")
data = pd.read_csv('/workspaces/ds-fall-2025-fri-1230/Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv', sep='\t')

print(f"Data loaded successfully!")
print(f"Shape: {data.shape}")
print(f"Columns: {len(data.columns)}")

# Display the first few rows of the data
print("\nFirst 3 rows:")
print(data.head(3))

# Display the column names
print(f"\nColumn names ({len(data.columns)} total):")
for i, col in enumerate(data.columns):
    print(f"{i+1:2d}. {col}")

# Display basic info
print(f"\nData types:")
print(data.dtypes.value_counts())

# Display missing values summary
print(f"\nMissing values summary:")
missing_summary = data.isnull().sum()
print(f"Columns with missing values: {(missing_summary > 0).sum()}")
print(f"Total missing values: {missing_summary.sum()}")

print("\n" + "=" * 60)
print("2. DATA CLEANING")
print("=" * 60)



1. LOADING DATA...
Data loaded successfully!
Shape: (28062, 18)
Columns: 18

First 3 rows:
            Timestamp How old are you?  What industry do you work in?  \
0  4/27/2021 11:02:10            25-34   Education (Higher Education)   
1  4/27/2021 11:02:22            25-34              Computing or Tech   
2  4/27/2021 11:02:38            25-34  Accounting, Banking & Finance   

                                  Job title  \
0        Research and Instruction Librarian   
1  Change & Internal Communications Manager   
2                      Marketing Specialist   

  If your job title needs additional context, please clarify here:  \
0                                                NaN                 
1                                                NaN                 
2                                                NaN                 

  What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equiv

## Step 2: Data Cleaning


In [11]:
# Step 2.1: Handle Column Names
print("\n2.1 Creating cleaner column names...")
original_columns = data.columns.tolist()

# Create mapping for cleaner column names
column_mapping = {
    'Timestamp': 'timestamp',
    'How old are you?': 'age',
    'What industry do you work in?': 'industry',
    'Job title': 'job_title',
    'If your job title needs additional context, please clarify here:': 'job_context',
    'What is your annual salary? (You\'ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)': 'salary',
    'How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.': 'additional_compensation',
    'Please indicate the currency': 'currency',
    'If "Other," please indicate the currency here: ': 'currency_other',
    'If your income needs additional context, please provide it here:': 'income_context',
    'What country do you work in?': 'country',
    'If you\'re in the U.S., what state do you work in?': 'state',
    'What city do you work in?': 'city',
    'How many years of professional work experience do you have overall?': 'experience_overall',
    'How many years of professional work experience do you have in your field?': 'experience_field',
    'What is your highest level of education completed?': 'education',
    'What is your gender?': 'gender',
    'What is your race? (Choose all that apply.)': 'race'
}

# Rename columns
data = data.rename(columns=column_mapping)
print(f"Renamed {len(column_mapping)} columns for easier manipulation")

# Step 2.2: Clean Salary Column
print("\n2.2 Cleaning salary column...")
original_salary_count = len(data)
print(f"Original salary entries: {original_salary_count}")

# Convert salary to string first to handle any non-string types
data['salary'] = data['salary'].astype(str)

# Remove commas and strip whitespace
data['salary'] = data['salary'].str.replace(',', '').str.strip()

# Convert to numeric, invalid entries become NaN
data['salary'] = pd.to_numeric(data['salary'], errors='coerce')

# Check for suspiciously low salaries (likely data entry errors)
low_salary_mask = (data['salary'] < 10000) & (data['salary'].notna())
print(f"Entries with salary < $10,000: {low_salary_mask.sum()}")

# Check for suspiciously high salaries (likely data entry errors)
high_salary_mask = (data['salary'] > 1000000) & (data['salary'].notna())
print(f"Entries with salary > $1,000,000: {high_salary_mask.sum()}")

# Remove rows with invalid salaries
data = data[(data['salary'].notna()) & (data['salary'] >= 10000) & (data['salary'] <= 1000000)]
print(f"After salary cleaning: {len(data)} rows (removed {original_salary_count - len(data)})")

# Step 2.3: Clean Currency Column
print("\n2.3 Cleaning currency column...")
data['currency'] = data['currency'].astype(str).str.strip().str.upper()
print(f"Unique currencies: {data['currency'].nunique()}")
print(f"Currency distribution:")
print(data['currency'].value_counts().head(10))

# Step 2.4: Clean Country Column
print("\n2.4 Cleaning country column...")
data['country'] = data['country'].astype(str).str.strip()

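# Standardize US country names
# NOTE: dotted forms such as 'U.S.' are not in this list, so they stay unflagged
# (see the 595 'U.S.' rows in the output below)
us_variations = ['united states', 'usa', 'us', 'united states of america']
data['is_us'] = data['country'].str.lower().isin(us_variations) | data['country'].str.contains('united states', case=False, na=False)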

# Standardize country names
data.loc[data['is_us'], 'country'] = 'United States'
data.loc[data['country'].str.contains('united kingdom', case=False, na=False), 'country'] = 'United Kingdom'

print(f"US respondents: {data['is_us'].sum()}")
print(f"Top countries:")
print(data['country'].value_counts().head(10))

# Step 2.5: Clean US State Column
print("\n2.5 Cleaning US state column...")
data['state'] = data['state'].astype(str).str.strip()
# Keep states as-is for now, but note missing states for US respondents
us_missing_state = data['is_us'] & (data['state'].isin(['', 'nan', 'None']))
print(f"US respondents with missing state: {us_missing_state.sum()}")

# Step 2.6: Clean Industry Column
print("\n2.6 Cleaning industry column...")
data['industry'] = data['industry'].astype(str).str.strip()

# Create tech flag
tech_industries = ['computing or tech', 'computing', 'tech']
data['is_tech'] = data['industry'].str.lower().isin(tech_industries) | data['industry'].str.contains('computing or tech', case=False, na=False)

# Standardize tech industry name
data.loc[data['is_tech'], 'industry'] = 'Computing or Tech'

print(f"Tech industry respondents: {data['is_tech'].sum()}")
print(f"Top industries:")
print(data['industry'].value_counts().head(10))

# Step 2.7: Clean Job Title Column
print("\n2.7 Cleaning job title column...")
data['job_title'] = data['job_title'].astype(str).str.strip()

# Create software engineer flag (case-insensitive)
se_keywords = ['software engineer', 'software developer', 'swe', 'software eng']
data['is_software_engineer'] = data['job_title'].str.lower().str.contains('|'.join(se_keywords), na=False)

print(f"Software engineer respondents: {data['is_software_engineer'].sum()}")
print(f"Sample software engineer titles:")
se_titles = data[data['is_software_engineer']]['job_title'].unique()[:10]
for title in se_titles:
    print(f"  - {title}")

# Step 2.8: Clean Experience Columns
print("\n2.8 Cleaning experience columns...")

def convert_experience_to_numeric(exp_str):
    """Convert experience text to numeric midpoint"""
    if pd.isna(exp_str) or exp_str == '' or exp_str == 'nan':
        return np.nan
    
    exp_str = str(exp_str).lower().strip()
    
    if '1 year or less' in exp_str:
        return 0.5
    elif '2 - 4 years' in exp_str or '2-4 years' in exp_str:
        return 3
    elif '5-7 years' in exp_str or '5 - 7 years' in exp_str:
        return 6
    elif '8 - 10 years' in exp_str or '8-10 years' in exp_str:
        return 9
    elif '11 - 20 years' in exp_str or '11-20 years' in exp_str:
        return 15.5
    elif '21 - 30 years' in exp_str or '21-30 years' in exp_str:
        return 25.5
    elif '31 - 40 years' in exp_str or '31-40 years' in exp_str:
        return 35.5
    elif '41 years or more' in exp_str:
        return 41
    else:
        return np.nan

data['experience_overall_numeric'] = data['experience_overall'].apply(convert_experience_to_numeric)
data['experience_field_numeric'] = data['experience_field'].apply(convert_experience_to_numeric)

print(f"Experience conversion complete")
print(f"Valid overall experience entries: {data['experience_overall_numeric'].notna().sum()}")
print(f"Valid field experience entries: {data['experience_field_numeric'].notna().sum()}")

# Step 2.9: Remove Invalid Rows
print("\n2.9 Removing invalid rows...")
rows_before_cleaning = len(data)

# Remove rows missing critical data for our analysis
data = data.dropna(subset=['salary', 'currency', 'country'])

# For US-focused questions, keep only USD currency
data_us = data[data['is_us'] & (data['currency'] == 'USD')].copy()

print(f"Rows before final cleaning: {rows_before_cleaning}")
print(f"Rows after cleaning: {len(data)}")
print(f"US USD rows for analysis: {len(data_us)}")

# Step 2.10: Data Quality Report
print("\n2.10 DATA QUALITY REPORT")
print("-" * 40)
print(f"Final dataset size: {len(data)} rows")
print(f"US USD dataset size: {len(data_us)} rows")
print(f"Software engineers in US: {data_us['is_software_engineer'].sum()}")
print(f"Tech workers in US: {data_us['is_tech'].sum()}")
print(f"Salary range: ${data_us['salary'].min():,.0f} - ${data_us['salary'].max():,.0f}")
print(f"Median salary (all US): ${data_us['salary'].median():,.0f}")

print("\n" + "=" * 60)
print("3. ANALYSIS QUESTIONS")
print("=" * 60)



2.1 Creating cleaner column names...
Renamed 18 columns for easier manipulation

2.2 Cleaning salary column...
Original salary entries: 28062
Entries with salary < $10,000: 149
Entries with salary > $1,000,000: 71
After salary cleaning: 27842 rows (removed 220)

2.3 Cleaning currency column...
Unique currencies: 11
Currency distribution:
currency
USD        23257
CAD         1666
GBP         1581
EUR          616
AUD/NZD      503
OTHER        130
CHF           37
SEK           35
ZAR           12
HKD            4
Name: count, dtype: int64

2.4 Cleaning country column...
US respondents: 22256
Top countries:
country
United States     22256
Canada             1639
United Kingdom      641
U.S.                595
UK                  592
Australia           385
Germany             182
England             166
Ireland             120
New Zealand         117
Name: count, dtype: int64

2.5 Cleaning US state column...
US respondents with missing state: 163

2.6 Cleaning industry column...
Tech i
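
The country counts above still list `U.S.` (595 rows) separately from `United States`, so `is_us` misses some US respondents. One possible fix, sketched here on a toy Series, is to strip periods and extra whitespace before matching. `normalize_country` and `US_ALIASES` are hypothetical names, and the alias set is an assumption that should be extended after inspecting `value_counts()` on the real column.

```python
import pandas as pd

# Hypothetical alias set; extend it after looking at the real column
US_ALIASES = {"us", "usa", "united states", "united states of america", "america"}


def normalize_country(raw) -> str:
    """Lowercase, drop periods, collapse whitespace, then map US aliases to one label."""
    if pd.isna(raw):
        return "Unknown"
    key = " ".join(str(raw).lower().replace(".", "").split())
    return "United States" if key in US_ALIASES else str(raw).strip()


# Toy values for illustration only
sample = pd.Series(["U.S.", "usa", "United States ", "U.S.A.", "Canada", None])
print(sample.map(normalize_country).tolist())
```

Applying a change like this to the real data would raise the US respondent count, so the figures reported in the later cells would shift.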

## Step 3: Business Questions Analysis

Now answer those important business questions!


In [12]:
# Question 1: What is the median salary for Software Engineers in the United States?

print("\nQUESTION 1: What is the median salary for Software Engineers in the United States?")
print("-" * 70)

se_us = data_us[data_us['is_software_engineer']]
if len(se_us) > 0:
    median_se_salary = se_us['salary'].median()
    mean_se_salary = se_us['salary'].mean()
    count_se = len(se_us)
    
    print(f"Sample size: {count_se} Software Engineers in the US")
    print(f"Median salary: ${median_se_salary:,.0f}")
    print(f"Mean salary: ${mean_se_salary:,.0f}")
    print(f"Salary range: ${se_us['salary'].min():,.0f} - ${se_us['salary'].max():,.0f}")
    
    # Show some examples
    print(f"\nSample Software Engineer titles and salaries:")
    sample_se = se_us[['job_title', 'salary', 'state']].head(10)
    for _, row in sample_se.iterrows():
        print(f"  - {row['job_title']} in {row['state']}: ${row['salary']:,.0f}")
else:
    print("No Software Engineers found in the US dataset")



QUESTION 1: What is the median salary for Software Engineers in the United States?
----------------------------------------------------------------------
Sample size: 879 Software Engineers in the US
Median salary: $141,000
Mean salary: $147,900
Salary range: $11,000 - $875,000

Sample Software Engineer titles and salaries:
  - Principal Software Engineer in Pennsylvania: $187,500
  - Software engineer in Washington: $139,000
  - Software Engineer in Massachusetts: $156,000
  - Software Engineer in Florida: $112,000
  - Software Engineer in Minnesota: $80,000
  - Senior Software Engineer in Texas: $120,000
  - Embedded Software Engineer in Virginia: $80,000
  - Senior Software Engineer in Massachusetts: $160,367
  - Software Developer in Ohio: $100,000
  - Software Developer in Illinois: $72,000
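
Plain substring matching on a short keyword like `swe` can also match inside unrelated words (for example, "answering" contains "swe"). A hedged alternative is a regex with word boundaries. The pattern below is an assumption about which titles should count as software engineers, and switching to it may change the sample size reported above.

```python
import pandas as pd

# \b is a word boundary: "swe" must stand alone, and "software" must start a word
SE_PATTERN = r"\bsoftware (?:eng|developer)|\bswe\b"

# Toy titles for illustration only
titles = pd.Series([
    "Senior Software Engineer",
    "SWE II",
    "Answering Service Operator",
    "Software Developer",
])

is_se = titles.str.contains(SE_PATTERN, case=False, na=False)
print(is_se.tolist())
```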


In [13]:
# Question 2: Which US state has the highest average salary for tech workers?


print("\n\nQUESTION 2: Which US state has the highest average salary for tech workers?")
print("-" * 70)

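# After astype(str) in Step 2.5, missing states may be the literal strings 'nan' or 'None', so filter those too
tech_us = data_us[data_us['is_tech'] & data_us['state'].notna() & ~data_us['state'].isin(['', 'nan', 'None'])]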
if len(tech_us) > 0:
    state_avg_salaries = tech_us.groupby('state')['salary'].agg(['mean', 'count']).round(0)
    state_avg_salaries = state_avg_salaries[state_avg_salaries['count'] >= 5]  # At least 5 respondents
    
    if len(state_avg_salaries) > 0:
        top_state = state_avg_salaries['mean'].idxmax()
        top_salary = state_avg_salaries.loc[top_state, 'mean']
        top_count = state_avg_salaries.loc[top_state, 'count']
        
        print(f"State with highest average tech salary: {top_state}")
        print(f"Average salary: ${top_salary:,.0f}")
        print(f"Sample size: {int(top_count)} tech workers")
        
        print(f"\nTop 10 states by average tech salary:")
        top_10_states = state_avg_salaries.sort_values('mean', ascending=False).head(10)
        for state, row in top_10_states.iterrows():
            print(f"  - {state}: ${row['mean']:,.0f} (n={int(row['count'])})")
    else:
        print("No states with sufficient tech worker data (minimum 5 respondents)")
else:
    print("No tech workers found in the US dataset")




QUESTION 2: Which US state has the highest average salary for tech workers?
----------------------------------------------------------------------
State with highest average tech salary: California
Average salary: $155,876
Sample size: 632 tech workers

Top 10 states by average tech salary:
  - California: $155,876 (n=632)
  - Washington: $151,176 (n=332)
  - New York: $148,801 (n=332)
  - Nevada: $137,567 (n=9)
  - New Mexico: $136,420 (n=10)
  - District of Columbia: $131,179 (n=53)
  - New Jersey: $129,678 (n=37)
  - Massachusetts: $129,338 (n=279)
  - Connecticut: $128,375 (n=18)
  - Virginia: $127,822 (n=116)
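
The state filter in this cell depends on how missing values were stored. Depending on the pandas version, `astype(str)` can turn a missing value into the literal string `'nan'`, which `.notna()` no longer catches. One way to sidestep this, sketched here on a toy Series, is to clean with the `.str` accessor directly, which leaves missing values missing.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
states = pd.Series(["California", np.nan, " Texas "], dtype=object)

# .str methods skip missing values instead of converting them to text
states_clean = states.str.strip()

print(states_clean.tolist())
print(states_clean.notna().tolist())
```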


In [14]:
# Question 3: How much does salary increase on average for each year of experience in tech?

print("\n\nQUESTION 3: How much does salary increase on average for each year of experience in tech?")
print("-" * 70)

tech_with_exp = data_us[data_us['is_tech'] & data_us['experience_overall_numeric'].notna()]
if len(tech_with_exp) > 0:
    # Remove outliers for better regression
    Q1 = tech_with_exp['salary'].quantile(0.25)
    Q3 = tech_with_exp['salary'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    tech_clean = tech_with_exp[
        (tech_with_exp['salary'] >= lower_bound) & 
        (tech_with_exp['salary'] <= upper_bound)
    ]
    
    if len(tech_clean) > 10:  # Need sufficient data for regression
        X = tech_clean['experience_overall_numeric'].values
        y = tech_clean['salary'].values
        
        # Simple linear regression using numpy
        # y = mx + b, where m is slope (salary per year) and b is intercept
        # Using least squares: m = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
        n = len(X)
        sum_x = np.sum(X)
        sum_y = np.sum(y)
        sum_xy = np.sum(X * y)
        sum_x2 = np.sum(X * X)
        
        # Calculate slope (salary increase per year)
        salary_per_year = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x * sum_x)
        
        # Calculate intercept
        intercept = (sum_y - salary_per_year * sum_x) / n
        
        # Calculate R-squared
        y_pred = salary_per_year * X + intercept
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        r_squared = 1 - (ss_res / ss_tot)
        
        print(f"Sample size: {len(tech_clean)} tech workers with experience data")
        print(f"Average salary increase per year of experience: ${salary_per_year:,.0f}")
        print(f"R-squared: {r_squared:.3f} (explains {r_squared*100:.1f}% of salary variance)")
        
        # Show some examples
        print(f"\nExperience vs Salary examples:")
        sample_exp = tech_clean[['experience_overall_numeric', 'salary', 'job_title']].head(10)
        for _, row in sample_exp.iterrows():
            print(f"  - {row['experience_overall_numeric']:.1f} years: ${row['salary']:,.0f} ({row['job_title']})")
    else:
        print("Insufficient data for regression analysis")
else:
    print("No tech workers with experience data found")



QUESTION 3: How much does salary increase on average for each year of experience in tech?
----------------------------------------------------------------------
Sample size: 3512 tech workers with experience data
Average salary increase per year of experience: $1,356
R-squared: 0.053 (explains 5.3% of salary variance)

Experience vs Salary examples:
  - 25.5 years: $112,000 (Systems Analyst)
  - 9.0 years: $187,500 (Principal Software Engineer)
  - 9.0 years: $110,000 (Intelligence Analyst)
  - 6.0 years: $144,600 (Mobile developer)
  - 15.5 years: $200,850 (Product Design Director)
  - 15.5 years: $72,000 (Managed Services Sales Coordinator)
  - 6.0 years: $34,000 (Technical specialist)
  - 15.5 years: $88,000 (Tech Writer)
  - 15.5 years: $125,000 (Senior Business Analyst)
  - 25.5 years: $98,166 (senior web application developer)
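
The closed-form slope used above can be sanity-checked against `np.polyfit`. The sketch below uses synthetic data with an assumed true slope of 2,000 per year, not the survey, so the printed numbers say nothing about the real answer.

```python
import numpy as np

# Synthetic data for illustration only: salary = 60,000 + 2,000 * years + noise
rng = np.random.default_rng(0)
years = rng.choice([0.5, 3, 6, 9, 15.5, 25.5], size=200)
salary = 60_000 + 2_000 * years + rng.normal(0, 5_000, size=200)

# Closed-form least-squares slope, same formula as the cell above
n = len(years)
slope_manual = (n * np.sum(years * salary) - years.sum() * salary.sum()) / (
    n * np.sum(years ** 2) - years.sum() ** 2
)

# np.polyfit with deg=1 returns the coefficients as [slope, intercept]
slope_fit, intercept_fit = np.polyfit(years, salary, deg=1)

print(f"manual slope:  {slope_manual:,.2f}")
print(f"polyfit slope: {slope_fit:,.2f}")
```

If the two slopes disagree on the real data, the manual formula or the input arrays are the first things to check.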


In [15]:

# Question 4: What percentage of respondents work remotely vs. in-office?

print("\n\nQUESTION 4: What percentage of respondents work remotely vs. in-office?")
print("-" * 70)

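# Look for remote work indicators in free-text columns (city, job context, income context).
# NOTE: the survey has no remote-work question, so this only catches respondents who
# mention it themselves. Everyone else is counted as "in-office" below, which makes the
# remote share a lower bound.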
remote_indicators = ['remote', 'work from home', 'wfh', 'virtual', 'distributed', 'telecommute']

# Create remote work flag
data_us['is_remote'] = False

# Check city column for remote indicators
city_remote = data_us['city'].astype(str).str.lower().str.contains('|'.join(remote_indicators), na=False)
data_us.loc[city_remote, 'is_remote'] = True

# Check job context for remote indicators
if 'job_context' in data_us.columns:
    context_remote = data_us['job_context'].astype(str).str.lower().str.contains('|'.join(remote_indicators), na=False)
    data_us.loc[context_remote, 'is_remote'] = True

# Check income context for remote indicators
if 'income_context' in data_us.columns:
    income_remote = data_us['income_context'].astype(str).str.lower().str.contains('|'.join(remote_indicators), na=False)
    data_us.loc[income_remote, 'is_remote'] = True

# Calculate percentages
total_respondents = len(data_us)
remote_count = data_us['is_remote'].sum()
in_office_count = total_respondents - remote_count

remote_percentage = (remote_count / total_respondents) * 100
in_office_percentage = (in_office_count / total_respondents) * 100

print(f"Total US respondents: {total_respondents:,}")
print(f"Remote workers: {remote_count:,} ({remote_percentage:.1f}%)")
print(f"In-office workers: {in_office_count:,} ({in_office_percentage:.1f}%)")

# Show some examples of remote work entries
print(f"\nSample remote work entries:")
remote_examples = data_us[data_us['is_remote']][['city', 'job_title', 'industry']].head(10)
for _, row in remote_examples.iterrows():
    print(f"  - {row['job_title']} in {row['industry']}: {row['city']}")




QUESTION 4: What percentage of respondents work remotely vs. in-office?
----------------------------------------------------------------------
Total US respondents: 22,236
Remote workers: 344 (1.5%)
In-office workers: 21,892 (98.5%)

Sample remote work entries:
  - Supervisory Archivist in Nonprofits: NYC (remotely)
  - Customer Service Specialist in Hospitality & Events: Small city, remote, national company
  - Onboarding Partner in Recruitment or HR: buffalo
  - Visual Designer in Computing or Tech: Chicago (remote) 
  - Information Technology Specialist in Nonprofits: Huntington (remote, HQ is in Charleston)
  - IT Manager, Financial Systems in Computing or Tech: Remote
  - SVP, risk management in Accounting, Banking & Finance: Fully Remote (Greater Boston)
  - Counsel in Law: I work remotely
  - QA Analyst IT in Utilities & Telecommunications: Remote/ home
  - Senior Talent Acquisition Consultant in Recruitment or HR: Work from home 


In [16]:
# Question 5: Which industry (besides tech) has the highest median salary?


print("\n\nQUESTION 5: Which industry (besides tech) has the highest median salary?")
print("-" * 70)

non_tech = data_us[~data_us['is_tech']]
if len(non_tech) > 0:
    industry_medians = non_tech.groupby('industry')['salary'].agg(['median', 'count']).round(0)
    industry_medians = industry_medians[industry_medians['count'] >= 10]  # At least 10 respondents
    
    if len(industry_medians) > 0:
        top_industry = industry_medians['median'].idxmax()
        top_median = industry_medians.loc[top_industry, 'median']
        top_count = industry_medians.loc[top_industry, 'count']
        
        print(f"Non-tech industry with highest median salary: {top_industry}")
        print(f"Median salary: ${top_median:,.0f}")
        print(f"Sample size: {int(top_count)} workers")
        
        print(f"\nTop 10 non-tech industries by median salary:")
        top_10_industries = industry_medians.sort_values('median', ascending=False).head(10)
        for industry, row in top_10_industries.iterrows():
            print(f"  - {industry}: ${row['median']:,.0f} (n={int(row['count'])})")
    else:
        print("No industries with sufficient non-tech data (minimum 10 respondents)")
else:
    print("No non-tech workers found in the dataset")

print("\n" + "=" * 60)
print("ANALYSIS COMPLETE")
print("=" * 60)



QUESTION 5: Which industry (besides tech) has the highest median salary?
----------------------------------------------------------------------
Non-tech industry with highest median salary: Pharma
Median salary: $125,000
Sample size: 17 workers

Top 10 non-tech industries by median salary:
  - Pharma: $125,000 (n=17)
  - Pharmaceutical: $116,358 (n=23)
  - Pharmaceuticals: $115,000 (n=24)
  - Energy: $110,500 (n=12)
  - Biotech: $105,000 (n=47)
  - Law: $96,000 (n=915)
  - Engineering or Manufacturing: $91,000 (n=1376)
  - Business or Consulting: $91,000 (n=675)
  - Biotechnology: $89,000 (n=20)
  - Utilities & Telecommunications: $86,810 (n=257)

ANALYSIS COMPLETE
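
The ranking above treats `Pharma`, `Pharmaceutical`, and `Pharmaceuticals` as three separate industries (and likewise `Biotech` and `Biotechnology`), which splits their samples. A possible next step, sketched here on toy rows, is to map the variants to one label before grouping. `INDUSTRY_ALIASES` and `consolidate_industry` are hypothetical names, and which labels belong together is a judgment call.

```python
import pandas as pd

# Hypothetical mapping, built by hand from the variants visible in the output above
INDUSTRY_ALIASES = {
    "pharma": "Pharmaceuticals",
    "pharmaceutical": "Pharmaceuticals",
    "pharmaceuticals": "Pharmaceuticals",
    "biotech": "Biotechnology",
    "biotechnology": "Biotechnology",
}


def consolidate_industry(raw) -> str:
    """Map known spelling variants to one label; leave everything else unchanged."""
    text = str(raw).strip()
    return INDUSTRY_ALIASES.get(text.lower(), text)


# Toy rows for illustration only, not survey data
toy = pd.DataFrame({
    "industry": ["Pharma", "Pharmaceutical", "pharmaceuticals ", "Biotech", "Law", "Law"],
    "salary": [120_000, 110_000, 100_000, 90_000, 95_000, 105_000],
})
toy["industry_clean"] = toy["industry"].map(consolidate_industry)

summary = toy.groupby("industry_clean")["salary"].agg(["median", "count"])
print(summary)
```

On the real data, merging the variants would change both the median and the sample size for the combined group, so the top industry may differ from the one printed above.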


In [17]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $141,000
2. **Highest paying US state for tech:** California
3. **Salary increase per year of experience:** $1,356 per year
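4. **Remote vs office percentage:** 1.5% of US respondents mention remote work in free-text fields; the other 98.5% were counted as in-office, though the survey has no remote-work question
5. **Highest paying non-tech industry:** Pharma (median $125,000, n=17)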

**Key insights:**
- Organize your original project folder with all the necessary data files before you ask the AI anything, so that it can scan the actual data it will work on and base its plan on that.
- Talk to it like it's a kid: say what you want in a lot of detail and break it up into steps.
- Make sure to double-check your work, because it can always make mistakes.

**Challenges faced:**
- The AI would use completely different packages and take the project in a very different direction, so you have to tell it to stop and redo the work your way, and spell out that these are the only packages you want.
- It completely forgot a question and skipped it, so you have to ask again and remind it.

**What you learned about vibe coding:**
- Make sure you ask it to stay within certain constraints so it doesn't go crazy and write "unfamiliar" code.
- It's sure quick to get things done, but lowkey you don't learn as much as you do by struggling.
