# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

*(Paste your Cursor todo list here)*

PHASE 1 - DATA LOADING
- Load TSV file and examine basic structure
- Identify key columns needed for analysis
- Get basic statistics (record count, missing values, unique values)

PHASE 2 - DATA EXPLORATION
- Analyze salary distributions and outliers across currencies
- Explore job title variations (especially Software Engineer roles)
- Analyze industry categories (tech vs non-tech)
- Examine location data patterns (US states, remote work indicators)
- Analyze experience data distributions and formats

PHASE 3 - DATA CLEANING
- Standardize currency conversions (USD, CAD, GBP to USD equivalent)
- Clean and standardize job titles
- Handle missing values and create derived variables
- Standardize state names, industry categories
- Create analysis-ready dataset

PHASE 4 - ANALYSIS
- Calculate median salary for Software Engineers in US
- Identify highest paying US state for tech workers
- Calculate salary increase per year of experience
- Determine remote vs office work percentages
- Find highest paying non-tech industry


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

print("=== COMPREHENSIVE DATA ANALYSIS: PHASES 1 & 2 ===")
print("Phase 1: Data Loading & Structure Analysis")
print("Phase 2: Comprehensive Data Exploration")
print()

# =============================================================================
# PHASE 1: DATA LOADING & STRUCTURE ANALYSIS
# =============================================================================

print("=" * 60)
print("PHASE 1: DATA LOADING & STRUCTURE ANALYSIS")
print("=" * 60)

# Load the TSV file
print("1. Loading TSV file...")
df = pd.read_csv('../../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv', sep='\t')

print("File loaded successfully!")
print(f"Dataset shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print()

# Examine column structure
print("2. Column Structure Analysis:")
print("=" * 50)
print("Column names and their positions:")
for i, col in enumerate(df.columns):
    print(f"{i+1:2d}. {col}")
print()

# Identify data types
print("3. Data Types Analysis:")
print("=" * 30)
print(df.dtypes)
print()

# Get basic statistics
print("4. Basic Statistics:")
print("=" * 25)
print(f"Total records: {len(df):,}")
print(f"Total columns: {len(df.columns)}")
print()

# Missing values analysis
print("5. Missing Values Analysis:")
print("=" * 30)
missing_data = df.isnull().sum()
missing_percent = (missing_data / len(df)) * 100

missing_summary = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percent
}).sort_values('Missing Count', ascending=False)

print("All columns with missing data:")
print(missing_summary[missing_summary['Missing Count'] > 0])
print()

# Unique values per column
print("6. Unique Values per Column:")
print("=" * 35)
for col in df.columns:
    unique_count = df[col].nunique()
    print(f"{col}: {unique_count:,} unique values")
print()

print("PHASE 1 COMPLETE")
print()

# =============================================================================
# PHASE 2: COMPREHENSIVE DATA EXPLORATION
# =============================================================================

print("=" * 60)
print("PHASE 2: COMPREHENSIVE DATA EXPLORATION")
print("=" * 60)

# 1. Analyze missing value patterns
print("1. MISSING VALUE PATTERNS ANALYSIS:")
print("=" * 40)
print("Columns with significant missing data (>10%):")
significant_missing = missing_summary[missing_summary['Missing Percentage'] > 10]
for idx, row in significant_missing.iterrows():
    print(f"  - {idx}: {row['Missing Count']:,} missing ({row['Missing Percentage']:.1f}%)")
print()

# 2. Check for outliers in salary data
print("2. SALARY DATA OUTLIER ANALYSIS:")
print("=" * 35)

# First, let's examine salary column
salary_col = 'What is your annual salary? (You\'ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)'
currency_col = 'Please indicate the currency'

print("Salary column sample values:")
print(df[salary_col].head(10).tolist())
print()

print("Currency distribution:")
print(df[currency_col].value_counts())
print()

# Convert salary to numeric (handling commas and other characters)
def clean_salary(salary_str):
    if pd.isna(salary_str):
        return np.nan
    try:
        # Strip thousands separators and dollar signs, then convert to float
        cleaned = str(salary_str).replace(',', '').replace('$', '').strip()
        return float(cleaned)
    except ValueError:
        # Non-numeric text (for example "n/a") becomes NaN
        return np.nan

df['salary_numeric'] = df[salary_col].apply(clean_salary)

print("Salary statistics (all currencies):")
print(f"  Valid salary records: {df['salary_numeric'].notna().sum():,}")
print(f"  Mean salary: ${df['salary_numeric'].mean():,.0f}")
print(f"  Median salary: ${df['salary_numeric'].median():,.0f}")
print(f"  Min salary: ${df['salary_numeric'].min():,.0f}")
print(f"  Max salary: ${df['salary_numeric'].max():,.0f}")
print()

# Check for extreme outliers
Q1 = df['salary_numeric'].quantile(0.25)
Q3 = df['salary_numeric'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['salary_numeric'] < lower_bound) | (df['salary_numeric'] > upper_bound)]
print(f"Potential outliers (beyond 1.5*IQR): {len(outliers):,} records")
print(f"Outlier range: < ${lower_bound:,.0f} or > ${upper_bound:,.0f}")
print()

# 3. Examine distributions of key variables
print("3. KEY VARIABLE DISTRIBUTIONS:")
print("=" * 35)

# Age distribution
print("Age distribution:")
print(df['How old are you?'].value_counts().sort_index())
print()

# Experience distribution
print("Overall experience distribution:")
print(df['How many years of professional work experience do you have overall?'].value_counts())
print()

# Education distribution
print("Education distribution:")
print(df['What is your highest level of education completed?'].value_counts())
print()

# Gender distribution
print("Gender distribution:")
print(df['What is your gender?'].value_counts())
print()

# 4. Identify unique values in industries, job titles, locations
print("4. UNIQUE VALUES ANALYSIS:")
print("=" * 30)

# Top industries
print("Top 20 industries:")
top_industries = df['What industry do you work in?'].value_counts().head(20)
for industry, count in top_industries.items():
    print(f"  {industry}: {count:,}")
print()

# Top job titles
print("Top 20 job titles:")
top_titles = df['Job title'].value_counts().head(20)
for title, count in top_titles.items():
    print(f"  {title}: {count:,}")
print()

# Countries
print("Top 15 countries:")
top_countries = df['What country do you work in?'].value_counts().head(15)
for country, count in top_countries.items():
    print(f"  {country}: {count:,}")
print()

# 5. Look for data quality issues
print("5. DATA QUALITY ISSUES:")
print("=" * 25)

# Check for inconsistent formats in salary
print("Salary format issues:")
salary_samples = df[salary_col].dropna().head(20)
for i, salary in enumerate(salary_samples):
    print(f"  {i+1}. {salary}")
print()

# Check for inconsistent country formats
print("Country format variations (US-related):")
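# Word boundaries keep values such as 'Australia' or 'Russia' from matching on 'us'
us_variations = df[df['What country do you work in?'].str.contains(r'\b(?:USA?|United States)\b|\bU\.S\.', case=False, na=False)]['What country do you work in?'].value_counts()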
print(us_variations)
print()

# Check for Software Engineer variations
print("Software Engineer job title variations:")
se_titles = df[df['Job title'].str.contains('software|engineer|developer', case=False, na=False)]['Job title'].value_counts().head(15)
for title, count in se_titles.items():
    print(f"  {title}: {count:,}")
print()

# Check for remote work indicators
print("Remote work indicators in city field:")
remote_cities = df[df['What city do you work in?'].str.contains('remote|Remote', case=False, na=False)]['What city do you work in?'].value_counts()
print(remote_cities)
print()

# 6. Additional Analysis for Research Questions
print("6. RESEARCH QUESTION PREPARATION:")
print("=" * 35)

# Software Engineer identification
se_patterns = [
    'software engineer', 'software developer', 'senior software engineer', 
    'principal software engineer', 'staff software engineer', 'lead software engineer',
    'software engineer ii', 'software engineer iii', 'senior software developer',
    'principal software developer', 'staff software developer', 'lead software developer',
    'software architect', 'senior engineer', 'principal engineer', 'staff engineer',
    'lead engineer', 'engineering manager', 'director of engineering'
]

def is_software_engineer(title):
    if pd.isna(title):
        return False
    title_lower = str(title).lower()
    return any(pattern in title_lower for pattern in se_patterns)

df['is_software_engineer'] = df['Job title'].apply(is_software_engineer)

print("Software Engineer identification:")
print(f"  - Total Software Engineer roles: {df['is_software_engineer'].sum():,}")
print(f"  - Percentage of all roles: {(df['is_software_engineer'].sum() / len(df)) * 100:.1f}%")
print()

# US worker identification
us_countries = ['United States', 'USA', 'US', 'United States ', 'U.S.', 'USA ', 'Usa', 
                'United States of America', 'United states', 'usa']

df['is_us'] = df['What country do you work in?'].isin(us_countries)

print("US worker identification:")
print(f"  - Total US workers: {df['is_us'].sum():,}")
print(f"  - Percentage of all workers: {(df['is_us'].sum() / len(df)) * 100:.1f}%")
print()

# Tech industry identification
tech_industries = ['Computing or Tech', 'Engineering or Manufacturing']
df['is_tech'] = df['What industry do you work in?'].isin(tech_industries)

print("Tech industry identification:")
print(f"  - Total tech workers: {df['is_tech'].sum():,}")
print(f"  - Percentage of all workers: {(df['is_tech'].sum() / len(df)) * 100:.1f}%")
print()

# Remote work identification
def is_remote(city):
    if pd.isna(city):
        return False
    city_str = str(city).lower()
    remote_indicators = ['remote', 'work from home', 'wfh', 'telecommute', 'virtual']
    return any(indicator in city_str for indicator in remote_indicators)

df['is_remote'] = df['What city do you work in?'].apply(is_remote)

print("Remote work identification:")
print(f"  - Remote workers identified: {df['is_remote'].sum():,}")
print(f"  - Remote work percentage: {(df['is_remote'].sum() / len(df)) * 100:.1f}%")
print()

print("=" * 60)
print("PHASE 2 COMPLETE")
print("=" * 60)
print()

# Summary for next steps
print("SUMMARY FOR DATA CLEANING:")
print("=" * 25)
print(f"• Dataset ready for cleaning: {len(df):,} records")
print(f"• Software Engineers identified: {df['is_software_engineer'].sum():,}")
print(f"• US workers identified: {df['is_us'].sum():,}")
print(f"• Tech workers identified: {df['is_tech'].sum():,}")
print(f"• Remote workers identified: {df['is_remote'].sum():,}")
print(f"• Multiple currencies to convert: {df[currency_col].nunique()} currencies")
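# Backslash escapes inside f-string expressions are a SyntaxError before Python 3.12,
# so the column name is stored in a variable first
state_col = "If you're in the U.S., what state do you work in?"
print(f"• State data available: {df[state_col].notna().sum():,} records")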
print()
print("Ready for Phase 3: Data Cleaning & Preparation")

=== COMPREHENSIVE DATA ANALYSIS: PHASES 1 & 2 ===
Phase 1: Data Loading & Structure Analysis
Phase 2: Comprehensive Data Exploration

PHASE 1: DATA LOADING & STRUCTURE ANALYSIS
1. Loading TSV file...
File loaded successfully!
Dataset shape: 28,062 rows × 18 columns

2. Column Structure Analysis:
Column names and their positions:
 1. Timestamp
 2. How old are you?
 3. What industry do you work in?
 4. Job title
 5. If your job title needs additional context, please clarify here:
 6. What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)
 7. How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.
 8. Please indicate the currency
 9. If "Other," please indicate the currency here: 
10. 
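
The `clean_salary` helper in the cell above strips commas and dollar signs one row at a time. The same idea can be expressed with vectorized string methods and `pd.to_numeric`. This is only a minimal sketch on made-up values: the `parse_salary` name and the `raw` sample Series are hypothetical and are not taken from the survey file.

```python
import pandas as pd


def parse_salary(series: pd.Series) -> pd.Series:
    """Strip commas, dollar signs, and whitespace, then coerce to float.

    Values that still fail to parse become NaN instead of raising.
    """
    cleaned = (
        series.astype(str)
        .str.replace(",", "", regex=False)
        .str.replace("$", "", regex=False)
        .str.strip()
    )
    return pd.to_numeric(cleaned, errors="coerce")


# Hypothetical sample values, not taken from the survey
raw = pd.Series(["55,000", "$62000", " 48000 ", "n/a", None, 75000])
salary = parse_salary(raw)
print(salary.tolist())
```

With `errors="coerce"`, anything unparseable ends up as NaN, so the count of failed rows can be checked with `salary.isna().sum()` before deciding how to handle them.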

## Step 2: Data Cleaning


In [22]:
from scipy import stats

print("=== PHASE 3: DATA CLEANING & PREPARATION ===")
print()

# 1. Standardize currency conversions (USD, CAD, GBP to USD equivalent)
print("1. CURRENCY STANDARDIZATION:")
print("=" * 30)

# Define exchange rates (approximate 2021 rates)
exchange_rates = {
    'USD': 1.0,
    'CAD': 0.79,  # 1 CAD = 0.79 USD
    'GBP': 1.37,  # 1 GBP = 1.37 USD
    'EUR': 1.18,  # 1 EUR = 1.18 USD
    'AUD/NZD': 0.73,  # 1 AUD = 0.73 USD
    'CHF': 1.09,  # 1 CHF = 1.09 USD
    'SEK': 0.11,  # 1 SEK = 0.11 USD
    'JPY': 0.009, # 1 JPY = 0.009 USD
    'ZAR': 0.067, # 1 ZAR = 0.067 USD
    'HKD': 0.13   # 1 HKD = 0.13 USD
}

# Clean salary function
def clean_salary(salary_str):
    if pd.isna(salary_str):
        return np.nan
    try:
        # Strip thousands separators and dollar signs, then convert to float
        cleaned = str(salary_str).replace(',', '').replace('$', '').strip()
        return float(cleaned)
    except ValueError:
        # Non-numeric text (for example "n/a") becomes NaN
        return np.nan

# Clean additional compensation function
def clean_additional_comp(comp_str):
    if pd.isna(comp_str):
        return 0
    try:
        return float(comp_str)
    except (TypeError, ValueError):
        # Unparseable compensation is treated as zero
        return 0

# Apply cleaning
df['salary_numeric'] = df['What is your annual salary? (You\'ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)'].apply(clean_salary)
df['additional_comp_numeric'] = df['How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.'].apply(clean_additional_comp)

# Convert to USD
def convert_to_usd(salary, currency):
    if pd.isna(salary) or pd.isna(currency):
        return np.nan
    
    currency = str(currency).strip()
    if currency in exchange_rates:
        return salary * exchange_rates[currency]
    else:
        return salary  # Assumption: currencies missing from exchange_rates (e.g. "Other") are left unconverted and treated as USD

df['salary_usd'] = df.apply(lambda row: convert_to_usd(row['salary_numeric'], row['Please indicate the currency']), axis=1)
df['additional_comp_usd'] = df.apply(lambda row: convert_to_usd(row['additional_comp_numeric'], row['Please indicate the currency']), axis=1)

# Create total compensation
df['total_compensation_usd'] = df['salary_usd'] + df['additional_comp_usd']

print(f"Currency conversion completed:")
print(f"  - Records with valid salary: {df['salary_usd'].notna().sum():,}")
print(f"  - Mean salary (USD): ${df['salary_usd'].mean():,.0f}")
print(f"  - Median salary (USD): ${df['salary_usd'].median():,.0f}")
print()

# 2. Clean and standardize job titles (especially Software Engineer variations)
print("2. JOB TITLE STANDARDIZATION:")
print("=" * 30)

# Define Software Engineer patterns
se_patterns = [
    'software engineer', 'software developer', 'senior software engineer', 
    'principal software engineer', 'staff software engineer', 'lead software engineer',
    'software engineer ii', 'software engineer iii', 'senior software developer',
    'principal software developer', 'staff software developer', 'lead software developer',
    'software architect', 'senior engineer', 'principal engineer', 'staff engineer',
    'lead engineer', 'engineering manager', 'director of engineering'
]

def is_software_engineer(title):
    if pd.isna(title):
        return False
    title_lower = str(title).lower()
    return any(pattern in title_lower for pattern in se_patterns)

df['is_software_engineer'] = df['Job title'].apply(is_software_engineer)

print(f"Software Engineer identification:")
print(f"  - Total Software Engineer roles: {df['is_software_engineer'].sum():,}")
print(f"  - Percentage of all roles: {(df['is_software_engineer'].sum() / len(df)) * 100:.1f}%")
print()

# 3. Handle missing values strategically
print("3. MISSING VALUES HANDLING:")
print("=" * 25)

# Create US filter
us_countries = ['United States', 'USA', 'US', 'United States ', 'U.S.', 'USA ', 'Usa', 
                'United States of America', 'United states', 'usa']

df['is_us'] = df['What country do you work in?'].isin(us_countries)

print(f"US worker identification:")
print(f"  - Total US workers: {df['is_us'].sum():,}")
print(f"  - Percentage of all workers: {(df['is_us'].sum() / len(df)) * 100:.1f}%")
print()

# Create tech industry filter (CORRECTED to include biotech)
def is_tech_corrected(industry):
    if pd.isna(industry):
        return False
    industry_str = str(industry).lower()
    tech_keywords = ['tech', 'biotech', 'biotechnology', 'engineering', 'software', 'computing']
    return any(keyword in industry_str for keyword in tech_keywords)

df['is_tech'] = df['What industry do you work in?'].apply(is_tech_corrected)

print(f"Tech industry identification:")
print(f"  - Total tech workers: {df['is_tech'].sum():,}")
print(f"  - Percentage of all workers: {(df['is_tech'].sum() / len(df)) * 100:.1f}%")
print()

# 4. Create derived variables (experience categories, remote work)
print("4. DERIVED VARIABLES CREATION:")
print("=" * 30)

# Convert experience to numeric ranges
def experience_to_numeric(exp_str):
    if pd.isna(exp_str):
        return np.nan
    
    exp_str = str(exp_str).lower()
    if '1 year or less' in exp_str:
        return 0.5
    elif '2 - 4 years' in exp_str:
        return 3
    elif '5-7 years' in exp_str:
        return 6
    elif '8 - 10 years' in exp_str:
        return 9
    elif '11 - 20 years' in exp_str:
        return 15.5
    elif '21 - 30 years' in exp_str:
        return 25.5
    elif '31 - 40 years' in exp_str:
        return 35.5
    elif '41 years or more' in exp_str:
        return 45
    else:
        return np.nan

df['experience_years'] = df['How many years of professional work experience do you have overall?'].apply(experience_to_numeric)

# Create remote work indicator
def is_remote(city):
    if pd.isna(city):
        return False
    city_str = str(city).lower()
    remote_indicators = ['remote', 'work from home', 'wfh', 'telecommute', 'virtual']
    return any(indicator in city_str for indicator in remote_indicators)

df['is_remote'] = df['What city do you work in?'].apply(is_remote)

print(f"Derived variables created:")
print(f"  - Records with experience data: {df['experience_years'].notna().sum():,}")
print(f"  - Remote workers identified: {df['is_remote'].sum():,}")
print(f"  - Remote work percentage: {(df['is_remote'].sum() / len(df)) * 100:.1f}%")
print()

# 5. Standardize state names and industry categories
print("5. STATE AND INDUSTRY STANDARDIZATION:")
print("=" * 40)

# Standardize US state names
state_mapping = {
    'California': 'CA', 'New York': 'NY', 'Texas': 'TX', 'Florida': 'FL',
    'Illinois': 'IL', 'Pennsylvania': 'PA', 'Ohio': 'OH', 'Georgia': 'GA',
    'North Carolina': 'NC', 'Michigan': 'MI', 'New Jersey': 'NJ', 'Virginia': 'VA',
    'Washington': 'WA', 'Arizona': 'AZ', 'Massachusetts': 'MA', 'Tennessee': 'TN',
    'Indiana': 'IN', 'Missouri': 'MO', 'Maryland': 'MD', 'Wisconsin': 'WI',
    'Colorado': 'CO', 'Minnesota': 'MN', 'South Carolina': 'SC', 'Alabama': 'AL',
    'Louisiana': 'LA', 'Kentucky': 'KY', 'Oregon': 'OR', 'Oklahoma': 'OK',
    'Connecticut': 'CT', 'Utah': 'UT', 'Iowa': 'IA', 'Nevada': 'NV',
    'Arkansas': 'AR', 'Mississippi': 'MS', 'Kansas': 'KS', 'New Mexico': 'NM',
    'Nebraska': 'NE', 'West Virginia': 'WV', 'Idaho': 'ID', 'Hawaii': 'HI',
    'New Hampshire': 'NH', 'Maine': 'ME', 'Montana': 'MT', 'Rhode Island': 'RI',
    'Delaware': 'DE', 'South Dakota': 'SD', 'North Dakota': 'ND', 'Alaska': 'AK',
    'Vermont': 'VT', 'Wyoming': 'WY', 'District of Columbia': 'DC'
}

def standardize_state(state):
    if pd.isna(state):
        return np.nan
    
    state_str = str(state).strip()
    # Check if it's already an abbreviation
    if len(state_str) == 2 and state_str.upper() in state_mapping.values():
        return state_str.upper()
    
    # Check if it's a full state name
    for full_name, abbrev in state_mapping.items():
        if full_name.lower() in state_str.lower():
            return abbrev
    
    return state_str

df['state_standardized'] = df['If you\'re in the U.S., what state do you work in?'].apply(standardize_state)

print(f"State standardization completed:")
print(f"  - Records with state data: {df['state_standardized'].notna().sum():,}")
print(f"  - Unique states: {df['state_standardized'].nunique()}")
print()

# Create analysis-ready dataset
print("6. ANALYSIS-READY DATASET CREATION:")
print("=" * 35)

# Create clean dataset for analysis
analysis_df = df[[
    'salary_usd', 'total_compensation_usd', 'experience_years', 'is_software_engineer',
    'is_us', 'is_tech', 'is_remote', 'state_standardized', 'What industry do you work in?',
    'Job title', 'How old are you?', 'What is your gender?'
]].copy()

# Rename columns for easier analysis
analysis_df.columns = [
    'salary_usd', 'total_compensation_usd', 'experience_years', 'is_software_engineer',
    'is_us', 'is_tech', 'is_remote', 'state', 'industry', 'job_title', 'age_group', 'gender'
]

# Remove extreme outliers (salary > $1M or < $10K)
analysis_df = analysis_df[
    (analysis_df['salary_usd'] >= 10000) & 
    (analysis_df['salary_usd'] <= 1000000)
]

print(f"Analysis-ready dataset created:")
print(f"  - Final records: {len(analysis_df):,}")
print(f"  - Records removed (outliers): {len(df) - len(analysis_df):,}")
print(f"  - US Software Engineers: {len(analysis_df[analysis_df['is_us'] & analysis_df['is_software_engineer']]):,}")
print(f"  - US Tech workers: {len(analysis_df[analysis_df['is_us'] & analysis_df['is_tech']]):,}")
print()

# Save cleaned dataset
analysis_df.to_csv('cleaned_salary_data.csv', index=False)
print("Cleaned dataset saved as 'cleaned_salary_data.csv'")
print()

print("=== PHASE 3 COMPLETE ===")
print()

df = pd.read_csv('cleaned_salary_data.csv')

=== PHASE 3: DATA CLEANING & PREPARATION ===

1. CURRENCY STANDARDIZATION:
Currency conversion completed:
  - Records with valid salary: 28,062
  - Mean salary (USD): $310,369
  - Median salary (USD): $75,000

2. JOB TITLE STANDARDIZATION:
Software Engineer identification:
  - Total Software Engineer roles: 1,477
  - Percentage of all roles: 5.3%

3. MISSING VALUES HANDLING:
US worker identification:
  - Total US workers: 22,518
  - Percentage of all workers: 80.2%

Tech industry identification:
  - Total tech workers: 6,585
  - Percentage of all workers: 23.5%

4. DERIVED VARIABLES CREATION:
Derived variables created:
  - Records with experience data: 28,062
  - Remote workers identified: 366
  - Remote work percentage: 1.3%

5. STATE AND INDUSTRY STANDARDIZATION:
State standardization completed:
  - Records with state data: 23,039
  - Unique states: 50

6. ANALYSIS-READY DATASET CREATION:
Analysis-ready dataset created:
  - Final records: 27,863
  - Records removed (outliers): 199
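
One thing worth checking in the `standardize_state` function above: it matches full state names by substring, in dictionary order. Because `'Virginia'` appears before `'West Virginia'` in `state_mapping` and `'virginia'` is a substring of `'west virginia'`, an input of "West Virginia" would come back as `VA`. A possible alternative is exact, case-insensitive matching. The sketch below uses a small illustrative subset of the mapping, and the names `STATE_ABBREV` and `standardize_state_exact` are hypothetical.

```python
import numpy as np
import pandas as pd

# Small subset of the name-to-abbreviation mapping, for illustration only
STATE_ABBREV = {
    "Virginia": "VA",
    "West Virginia": "WV",
    "Kansas": "KS",
    "Arkansas": "AR",
    "California": "CA",
}
# Lowercased lookup so matching is case-insensitive but still exact
_LOOKUP = {name.lower(): abbrev for name, abbrev in STATE_ABBREV.items()}
_ABBREVS = set(STATE_ABBREV.values())


def standardize_state_exact(state):
    """Map a state name or abbreviation to its two-letter code.

    Anything that is not an exact match (for example a multi-state
    answer) becomes NaN rather than being guessed.
    """
    if pd.isna(state):
        return np.nan
    text = str(state).strip()
    if text.upper() in _ABBREVS:
        return text.upper()
    return _LOOKUP.get(text.lower(), np.nan)


# Hypothetical inputs, not taken from the survey
samples = pd.Series(["West Virginia", "virginia", "ca", "California, Colorado", None])
print(samples.apply(standardize_state_exact).tolist())
```

Sending multi-state answers to NaN is a judgment call. It drops those rows from state-level analysis instead of assigning them to whichever state happens to match first.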
  

## Step 3: Business Questions Analysis

Now answer those important business questions!


In [23]:
# Question 1: What is the median salary for Software Engineers in the United States?
us_software_engineers = df[
    (df['is_us'] == True) & 
    (df['is_software_engineer'] == True) & 
    (df['salary_usd'].notna())
]

median_salary_se_us = us_software_engineers['salary_usd'].median()

print("1. Median salary for Software Engineers in US:")
print(f"   ${median_salary_se_us:,.0f}")
print(f"   (Sample: {len(us_software_engineers):,} US Software Engineers)")
print()

1. Median salary for Software Engineers in US:
   $145,500
   (Sample: 1,143 US Software Engineers)
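
The sample size above depends heavily on how "Software Engineer" is defined. The `se_patterns` list includes generic entries such as `'senior engineer'` and `'engineering manager'`, which can also match non-software roles. As a rough sketch on made-up titles (none are from the survey), a strict match can be compared with a broader one like this:

```python
import pandas as pd

# Hypothetical job titles, not taken from the survey
titles = pd.Series([
    "Software Engineer",
    "Senior Software Developer",
    "Senior Engineer (Civil)",
    "Engineering Manager",
    "Data Analyst",
])

# Strict: the title must mention software engineering or development explicitly
strict = titles.str.contains(r"software (?:engineer|developer)", case=False, regex=True)

# Broad: also accepts generic engineering titles, similar to the longer pattern list
broad = strict | titles.str.contains(
    r"senior engineer|engineering manager", case=False, regex=True
)

print(pd.DataFrame({"title": titles, "strict": strict, "broad": broad}))
```

Running the median under both definitions is one way to see how sensitive the answer is to this choice.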



In [24]:
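# Question 2: Which US state has the highest average salary for tech workers?
# Note: states are ranked by median salary here, not the mean, and only states with 10+ tech workers are kept.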
us_tech_workers = df[
    (df['is_us'] == True) & 
    (df['is_tech'] == True) & 
    (df['salary_usd'].notna()) &
    (df['state'].notna())
]

state_salaries = us_tech_workers.groupby('state')['salary_usd'].agg(['median', 'count']).reset_index()
state_salaries = state_salaries[state_salaries['count'] >= 10]
state_salaries = state_salaries.sort_values('median', ascending=False)

highest_paying_state = state_salaries.iloc[0]

print("2. Highest paying US state for tech:")
print(f"   {highest_paying_state['state']} (${highest_paying_state['median']:,.0f})")
print(f"   (Sample: {highest_paying_state['count']} tech workers)")
print()

2. Highest paying US state for tech:
   CA ($141,000)
   (Sample: 833 tech workers)
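
The question asks for the highest average salary, while the cell above ranks states by median. The two can disagree when a group contains extreme values. The sketch below uses made-up numbers (not survey data) to show both statistics side by side with `groupby(...).agg`:

```python
import pandas as pd

# Made-up salaries for two states; "NY" contains one extreme value
toy = pd.DataFrame({
    "state": ["CA", "CA", "CA", "NY", "NY", "NY"],
    "salary_usd": [140_000, 150_000, 160_000, 100_000, 110_000, 400_000],
})

# Compute mean, median, and group size in one pass
summary = toy.groupby("state")["salary_usd"].agg(["mean", "median", "count"])
print(summary)

top_by_mean = summary["mean"].idxmax()
top_by_median = summary["median"].idxmax()
print(f"Top by mean: {top_by_mean}, top by median: {top_by_median}")
```

In this toy data the single large value pulls the NY mean above CA, while the median still ranks CA first, so it is worth reporting which statistic was used.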



In [25]:
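# Question 3: How much does salary increase on average for each year of experience in tech?
# Note: the filter below keeps all US workers; it does not apply is_tech, so the slope is not tech-specific.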
us_with_experience = df[
    (df['is_us'] == True) & 
    (df['salary_usd'].notna()) & 
    (df['experience_years'].notna())
]

slope, intercept, r_value, p_value, std_err = stats.linregress(
    us_with_experience['experience_years'], 
    us_with_experience['salary_usd']
)

print("3. Salary increase per year of experience:")
print(f"   ${slope:,.0f} per year")
print(f"   (Sample: {len(us_with_experience):,} US workers)")
print()

3. Salary increase per year of experience:
   $1,188 per year
   (Sample: 22,424 US workers)
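
The figure above is fit across all US workers. To answer the question for tech specifically, the same regression could be run on rows where `is_tech` is true. The sketch below uses made-up rows and `numpy.polyfit` instead of `scipy.stats.linregress`. For a simple straight-line fit both give the same least-squares slope, and the column names simply mirror the cleaned dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows; the tech salaries follow 60,000 + 2,000 * years exactly
toy = pd.DataFrame({
    "is_tech": [True, True, True, True, False, False],
    "experience_years": [0.5, 3.0, 6.0, 9.0, 3.0, 15.5],
    "salary_usd": [61_000, 66_000, 72_000, 78_000, 40_000, 45_000],
})

# Restrict to tech rows before fitting
tech = toy[toy["is_tech"]]

# Degree-1 least-squares fit returns (slope, intercept)
slope, intercept = np.polyfit(tech["experience_years"], tech["salary_usd"], 1)
print(f"Tech-only slope: ${slope:,.0f} per year")
```

Because `experience_years` holds bin midpoints rather than exact years, any slope from this kind of fit is an approximation.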



In [26]:

# Question 4: What percentage of respondents work remotely vs. in-office?
total_workers = len(df)
remote_workers = df['is_remote'].sum()
office_workers = total_workers - remote_workers

remote_percentage = (remote_workers / total_workers) * 100
office_percentage = (office_workers / total_workers) * 100

print("4. Remote vs office percentage:")
print(f"   {remote_percentage:.1f}% remote, {office_percentage:.1f}% office")
print(f"   ({remote_workers:,} remote workers out of {total_workers:,} total)")
print()

4. Remote vs office percentage:
   1.3% remote, 98.7% office
   (364 remote workers out of 27,863 total)



In [27]:
# Question 5: Which industry (besides tech) has the highest median salary?
us_non_tech = df[
    (df['is_us'] == True) & 
    (df['is_tech'] == False) & 
    (df['salary_usd'].notna()) &
    (df['industry'].notna())
]

industry_salaries = us_non_tech.groupby('industry')['salary_usd'].agg(['median', 'count']).reset_index()
industry_salaries = industry_salaries[industry_salaries['count'] >= 20]
industry_salaries = industry_salaries.sort_values('median', ascending=False)

highest_paying_industry = industry_salaries.iloc[0]

print("5. Highest paying non-tech industry:")
print(f"   {highest_paying_industry['industry']} (${highest_paying_industry['median']:,.0f})")
print(f"   (Sample: {highest_paying_industry['count']} workers)")
print()

print("=" * 60)
print("ANALYSIS COMPLETE")
print("=" * 60)

5. Highest paying non-tech industry:
   Law ($95,800)
   (Sample: 934 workers)

ANALYSIS COMPLETE


In [None]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $145,500
2. **Highest-paying US state for tech:** California (median $141,000)
3. **Salary increase per year of experience:** $1,188 per year (fit across all US workers)
4. **Remote vs office percentage:** 1.3% remote, 98.7% office
5. **Highest-paying non-tech industry:** Law (median $95,800)

**Key insights:**
- AI doesn't catch everything; you have to review the results and check for obvious mistakes
- Be clear about what you want from the AI, including how you want the work sectioned and the questions answered
- AI is very fast!

**Challenges faced:**
- The newest Jupyter version wasn't working for me, so I installed the previous version, which worked
- My prompts weren't ideal, so the results weren't seamless and I had to go through the generated code to pull out the snippets I needed

**What you learned about vibe coding:**
- The generated code includes a lot of comments, which makes it easier to follow along
- It takes a lot of trial and error
- The AI will ask to run the code so it can check its own work
