# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. There are no step-by-step hints: just you, your AI assistant, and messy real-world data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

Based on my analysis of the ManagerSalarySurvey2021.tsv file, here's my comprehensive data cleaning plan:

### Data Overview:
- **28,062 rows** of survey responses
- **18 columns** including salary, location, job title, experience, demographics
- **Mixed data quality** with inconsistent formatting, missing values, and various currencies

### Cleaning Steps:

1. **Data Exploration & Loading** ✅
   - Load dataset and examine structure
   - Identify data types and column meanings
   - Check for encoding issues

2. **Missing Value Analysis** 
   - Identify empty strings, nulls, and inconsistent representations
   - Standardize missing value formats (convert to NaN)
   - Document missing value patterns

3. **Salary Data Cleaning**
   - Remove commas from salary numbers
   - Handle different currencies (USD, GBP, CAD, etc.)
   - Convert bonus/compensation columns to numeric
   - Identify and flag suspicious salary values

4. **Location Data Standardization**
   - Standardize country names (USA vs United States vs US)
   - Clean state abbreviations and full names
   - Handle missing location data

5. **Job Title Standardization**
   - Identify and categorize tech roles
   - Standardize job title formatting
   - Handle additional context in job titles

6. **Experience Data Conversion**
   - Convert experience ranges to numeric midpoints
   - Handle "years" columns for analysis

7. **Demographic Data Cleaning**
   - Standardize gender categories
   - Clean race/ethnicity categories
   - Standardize education levels

8. **Data Quality Validation**
   - Remove obvious outliers and invalid entries
   - Cross-validate salary vs experience relationships
   - Final data quality summary

9. **Analysis-Ready Dataset**
   - Create cleaned dataset for business questions
   - Document all cleaning decisions and assumptions


## Step 1: Data Loading and Exploration



In [3]:
import sys
print(sys.executable)


/usr/local/bin/python3


In [4]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [79]:
# Load the data
df = pd.read_csv('/Users/thasmias/ds-fall-2025-fri-1230/Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv', sep='\t')

print("Dataset Shape:", df.shape)
print("\nColumn Names:")
print(df.columns.tolist())
print("\nFirst few rows:")
df.head()


Dataset Shape: (28062, 18)

Column Names:
['Timestamp', 'How old are you?', 'What industry do you work in?', 'Job title', 'If your job title needs additional context, please clarify here:', "What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)", 'How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.', 'Please indicate the currency', 'If "Other," please indicate the currency here: ', 'If your income needs additional context, please provide it here:', 'What country do you work in?', "If you're in the U.S., what state do you work in?", 'What city do you work in?', 'How many years of professional work experience do you have overall?', 'How many years of professional work experience 

Unnamed: 0,Timestamp,How old are you?,What industry do you work in?,Job title,"If your job title needs additional context, please clarify here:","What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)","How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",Please indicate the currency,"If ""Other,"" please indicate the currency here:","If your income needs additional context, please provide it here:",What country do you work in?,"If you're in the U.S., what state do you work in?",What city do you work in?,How many years of professional work experience do you have overall?,How many years of professional work experience do you have in your field?,What is your highest level of education completed?,What is your gender?,What is your race? (Choose all that apply.)
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


In [None]:
# Replace the long survey-question column names with short, code-friendly names
df.columns = [
    'timestamp', 'age', 'industry', 'job_title', 'job_context',
    'annual_salary', 'additional_compensation', 'currency', 'currency_other',
    'income_context', 'country', 'US_state', 'city',
    'experience_total_years', 'experience_field_years', 'education', 'gender', 'race'
]
print(df.columns.tolist())

['timestamp', 'age', 'industry', 'job_title', 'job_context', 'annual_salary', 'additional_compensation', 'currency', 'currency_other', 'income_context', 'country', 'US_state', 'city', 'experience_total_years', 'experience_field_years', 'education', 'gender', 'race']


## Step 2: Data Cleaning

In [7]:
# Missing Values Analysis
print("Data Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
print("\nEmpty Strings (checking for empty strings that might not be NaN):")
for col in df.columns:
    empty_count = (df[col] == '').sum()
    if empty_count > 0:
        print(f"{col}: {empty_count} empty strings")

# Let's also check for other common "missing" representations
print("\nChecking for other missing value representations:")
missing_representations = ['N/A', 'n/a', 'NA', 'na', 'NULL', 'null', 'None', 'none', '-', '--']
for rep in missing_representations:
    for col in df.columns:
        count = (df[col] == rep).sum()
        if count > 0:
            print(f"{col}: {count} instances of '{rep}'")


Data Types:
timestamp                   object
age                         object
industry                    object
job_title                   object
job_context                 object
annual_salary               object
additional_compensation    float64
currency                    object
currency_other              object
income_context              object
country                     object
US_state                    object
city                        object
experience_total_years      object
experience_field_years      object
education                   object
gender                      object
race                        object
dtype: object

Missing Values:
timestamp                      0
age                            0
industry                      74
job_title                      1
job_context                20800
annual_salary                  0
additional_compensation     7296
currency                       0
currency_other             27856
income_context             250
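The nested loops above compare every column against one placeholder at a time, and only exact matches count. A more compact check is sketched below on a toy frame. The `MISSING_TOKENS` list, the `count_missing_tokens` helper, and the toy values are illustrative assumptions, not part of the survey data. It trims and lowercases text first so that variants such as ` N/A ` are also counted.

```python
import pandas as pd

# Hypothetical placeholder list (lowercase); extend it after inspecting the real data.
MISSING_TOKENS = ["", "n/a", "na", "null", "none", "-", "--"]

def count_missing_tokens(frame: pd.DataFrame) -> pd.Series:
    """Count placeholder strings per non-numeric column, ignoring case and padding."""
    counts = {}
    for name, col in frame.items():
        if pd.api.types.is_numeric_dtype(col):
            continue  # numeric columns cannot hold text placeholders
        normalized = col.astype("string").str.strip().str.lower()
        counts[name] = int(normalized.isin(MISSING_TOKENS).sum())
    return pd.Series(counts)

toy = pd.DataFrame({
    "job_title": ["Analyst", " N/A ", "na", None],
    "city": ["Boston", "-", "", "Austin"],
    "salary": [1, 2, 3, 4],
})
token_counts = count_missing_tokens(toy)
print(token_counts)
```

True nulls (such as the `None` in `job_title`) are deliberately not counted here; `isnull().sum()` already reports those.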

In [8]:
#Salary Data Cleaning
print("Salary column analysis:")
print("Unique values in salary column (first 20):")
print(df['annual_salary'].value_counts().head(20))

print("\nCurrency column analysis:")
print("Unique currencies:")
print(df['currency'].value_counts())

print("\nBonus column analysis:")
print("Unique values in bonus column (first 20):")
print(df['additional_compensation'].value_counts().head(20))


Salary column analysis:
Unique values in salary column (first 20):
annual_salary
60,000     430
80,000     406
70,000     402
65,000     400
75,000     383
90,000     368
50,000     361
100,000    313
85,000     306
55,000     303
120,000    295
110,000    263
52,000     251
45,000     247
130,000    245
40,000     234
72,000     218
95,000     213
125,000    197
105,000    191
Name: count, dtype: int64

Currency column analysis:
Unique currencies:
currency
USD        23374
CAD         1673
GBP         1591
EUR          643
AUD/NZD      504
Other        160
CHF           37
SEK           37
JPY           23
ZAR           16
HKD            4
Name: count, dtype: int64

Bonus column analysis:
Unique values in bonus column (first 20):
additional_compensation
0.0        7949
5000.0     1112
10000.0     962
2000.0      775
1000.0      678
3000.0      635
15000.0     538
20000.0     508
4000.0      428
6000.0      349
500.0       333
8000.0      296
1500.0      290
12000.0     271
30000.0    
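The currency counts above show that roughly one response in six is not in USD, and the cleaning plan lists currency handling as a step. One possible approach is a lookup table of exchange rates, sketched below. The rates in `USD_PER_UNIT` are placeholders chosen for easy arithmetic, not real exchange rates, and the `to_usd` helper is hypothetical.

```python
import pandas as pd

# PLACEHOLDER rates (USD per one unit of the currency). Illustrative
# assumptions only; substitute rates appropriate for the survey period.
USD_PER_UNIT = {"USD": 1.0, "CAD": 0.8, "GBP": 1.4, "EUR": 1.2}

def to_usd(amount: pd.Series, currency: pd.Series) -> pd.Series:
    """Convert amounts to USD; currencies without a known rate become NaN."""
    return amount * currency.map(USD_PER_UNIT)

toy = pd.DataFrame({
    "salary_numeric": [50000.0, 40000.0, 60000.0, 1000000.0],
    "currency": ["USD", "GBP", "CAD", "Other"],
})
toy["salary_usd"] = to_usd(toy["salary_numeric"], toy["currency"])
print(toy)
```

Leaving unknown currencies as NaN, rather than guessing a rate, keeps rows such as `Other` out of USD statistics until they are handled deliberately.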

### Implementing Data Cleaning


In [9]:

# Create a copy of the original data for cleaning
df_clean = df.copy()

print("Starting data cleaning process...")
print(f"Original dataset shape: {df_clean.shape}")

print("Missing values before standardization:")
print(df_clean.isnull().sum())

# 1. Standardize missing values
print("\n1. Standardizing missing values...")

# Replace empty strings with NaN
df_clean = df_clean.replace('', np.nan)

# Replace common missing representations with NaN
missing_values = ['N/A', 'n/a', 'NA', 'na', 'NULL', 'null', 'None', 'none', '-', '--']
for val in missing_values:
    df_clean = df_clean.replace(val, np.nan)

print("Missing values after standardization:")
print(df_clean.isnull().sum())


Starting data cleaning process...
Original dataset shape: (28062, 18)
Missing values before standardization:
timestamp                      0
age                            0
industry                      74
job_title                      1
job_context                20800
annual_salary                  0
additional_compensation     7296
currency                       0
currency_other             27856
income_context             25020
country                        0
US_state                    5023
city                          82
experience_total_years         0
experience_field_years         0
education                    222
gender                       171
race                         177
dtype: int64

1. Standardizing missing values...
Missing values after standardization:
timestamp                      0
age                            0
industry                      74
job_title                      4
job_context                20802
annual_salary                  0
additional_c
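As an alternative to replacing placeholders after loading, `pd.read_csv` can treat them as missing while parsing. The sketch below uses a tiny in-memory stand-in for the survey file (the rows are made up) and the same placeholder list as the cell above.

```python
import io
import pandas as pd

# Tiny stand-in for the survey TSV so the example is self-contained.
raw = (
    "job_title\tcity\tsalary\n"
    "Analyst\tBoston\t55,000\n"
    "N/A\t-\t60,000\n"
    "none\tAustin\t--\n"
)

MISSING_TOKENS = ["N/A", "n/a", "NA", "na", "NULL", "null", "None", "none", "-", "--"]

toy = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    na_values=MISSING_TOKENS,  # treated as NaN in addition to pandas' defaults
    thousands=",",             # parse "55,000" as the number 55000
)
print(toy.isna().sum())
print(toy.dtypes)
```

Doing this at load time also lets the salary column arrive as a numeric dtype, which removes the separate comma-stripping step.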

In [10]:
# 2. Clean Salary Data
print("\n2. Cleaning salary data...")

# Clean the main salary column
salary_col = 'annual_salary'

# Remove commas from salary numbers
df_clean[salary_col] = df_clean[salary_col].astype(str).str.replace(',', '')

# Convert to numeric, errors='coerce' will convert invalid values to NaN
df_clean['salary_numeric'] = pd.to_numeric(df_clean[salary_col], errors='coerce')

print(f"Salary conversion results:")
print(f"Valid numeric salaries: {df_clean['salary_numeric'].notna().sum()}")
print(f"Invalid/NaN salaries: {df_clean['salary_numeric'].isna().sum()}")

# Clean bonus column
bonus_col = 'additional_compensation'
df_clean[bonus_col] = df_clean[bonus_col].astype(str).str.replace(',', '')
df_clean['bonus_numeric'] = pd.to_numeric(df_clean[bonus_col], errors='coerce')

print(f"\nBonus conversion results:")
print(f"Valid numeric bonuses: {df_clean['bonus_numeric'].notna().sum()}")
print(f"Invalid/NaN bonuses: {df_clean['bonus_numeric'].isna().sum()}")

# Show some salary statistics
print(f"\nSalary statistics (USD only):")
usd_mask = df_clean['currency'] == 'USD'
print(f"USD salaries count: {usd_mask.sum()}")
print(f"USD salary range: ${df_clean[usd_mask]['salary_numeric'].min():,.0f} - ${df_clean[usd_mask]['salary_numeric'].max():,.0f}")
print(f"USD median salary: ${df_clean[usd_mask]['salary_numeric'].median():,.0f}")



2. Cleaning salary data...
Salary conversion results:
Valid numeric salaries: 28062
Invalid/NaN salaries: 0

Bonus conversion results:
Valid numeric bonuses: 20766
Invalid/NaN bonuses: 7296

Salary statistics (USD only):
USD salaries count: 23374
USD salary range: $0 - $102,000,000
USD median salary: $78,000


In [11]:
#df['country'].unique()
df['US_state'].unique()

array(['Massachusetts', nan, 'Tennessee', 'Wisconsin', 'South Carolina',
       'New Hampshire', 'Arizona', 'Missouri', 'Florida', 'Pennsylvania',
       'Michigan', 'Minnesota', 'Illinois', 'California', 'Georgia',
       'Ohio', 'District of Columbia', 'Maryland', 'Texas', 'Virginia',
       'North Carolina', 'New York', 'New Jersey', 'Rhode Island',
       'Colorado', 'Oregon', 'Washington', 'Indiana', 'Iowa', 'Nebraska',
       'Oklahoma', 'Maine', 'Connecticut', 'South Dakota',
       'West Virginia', 'Idaho', 'Louisiana', 'Montana', 'Kentucky',
       'North Dakota', 'Kansas', 'Vermont', 'Arkansas', 'Alabama',
       'Nevada', 'Delaware', 'New Mexico', 'Hawaii', 'Utah',
       'Mississippi', 'Kentucky, Ohio', 'District of Columbia, Virginia',
       'District of Columbia, Maryland', 'Alaska', 'Arizona, Washington',
       'Georgia, New York', 'California, Colorado', 'California, Oregon',
       'District of Columbia, Maryland, Pennsylvania, Virginia',
       'Arizona, California'
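The unique values above include multi-state answers such as `'Kentucky, Ohio'`. A minimal sketch of one way to handle them follows: flag the row as multi-state and keep the first state listed. Taking the first state is an assumed tie-break rule, not something the survey prescribes; dropping those rows is an equally defensible choice.

```python
import pandas as pd

states = pd.Series(["Massachusetts", "Kentucky, Ohio", None,
                    "District of Columbia, Maryland"])

# Split on commas; missing values stay missing.
parts = states.str.split(",")
state_count = parts.str.len()
multi_state = state_count > 1            # True when several states were selected
first_state = parts.str[0].str.strip()   # assumed rule: keep the first one listed

result = pd.DataFrame({"raw": states, "first_state": first_state,
                       "multi_state": multi_state})
print(result)
```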

In [12]:
# 3. Standardize Location Data
print("\n3. Standardizing location data...")

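# Standardize country names: trim whitespace and lowercase before mapping.
# Because of the strip() here, mapping keys that end in a space (e.g. 'usa ')
# can never match; only their trimmed twins (e.g. 'usa') do.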
df_clean['country'] = df_clean['country'].str.strip().str.lower()

country_mapping = {
    # United States variants
    'us': 'United States', 'u.s.': 'United States', 'usa': 'United States', 
    'u.s': 'United States', 'u.s.a': 'United States', 'united states': 'United States',
    'united state': 'United States', 'united stated': 'United States', 
    'united statws': 'United States', 'united ststes': 'United States',
    'united statss': 'United States', 'united stares': 'United States',
    'united stattes': 'United States', 'united states of america': 'United States',
    'united state of america': 'United States', 'america': 'United States',
    'the us': 'United States', 'u.s.a.': 'United States', 'u. s.': 'United States',
    'unites states': 'United States', 'us ': 'United States', 'u.s>': 'United States',
    'usa ': 'United States', 'united  states': 'United States', 
    'united states of americas': 'United States', 'u.s.a ': 'United States',
    'u.s. ': 'United States', 'united statea': 'United States',
    'united statues': 'United States', 'united sates': 'United States',
    'united stted': 'United States', 'united statesp': 'United States',
    'usa-- virgin islands': 'United States', 'united states- puerto rico': 'United States',
    'united states is america': 'United States', 'u. s ': 'United States',
    'united states of american': 'United States', 'united states of american ': 'United States',
    'united states of american ': 'United States', 'u.s.a. ': 'United States',
    'u.sa': 'United States', 'u.s.a.': 'United States', 'uniteed states': 'United States',
    'untied states': 'United States', 'uniited states': 'United States',
    'unitied states': 'United States', 'unite states': 'United States',
    'united statws': 'United States', 'usaa': 'United States', 'usab': 'United States',
    'us of a': 'United States', 'united statss': 'United States',
    '🇺🇸': 'United States', 'unitedstates': 'United States',
    'us govt employee overseas, country withheld': 'United States',

    # United Kingdom variants
    'uk': 'United Kingdom', 'u.k.': 'United Kingdom', 'u.k': 'United Kingdom',
    'united kingdom': 'United Kingdom', 'united kingdom.': 'United Kingdom',
    'united kingdom ': 'United Kingdom', 'england': 'United Kingdom',
    'scotland': 'United Kingdom', 'wales': 'United Kingdom',
    'northern ireland': 'United Kingdom', 'great britain': 'United Kingdom',
    'britain': 'United Kingdom', 'england, uk': 'United Kingdom',
    'england/uk': 'United Kingdom', 'uk (england)': 'United Kingdom',
    'uk (northern ireland)': 'United Kingdom', 'scotland, uk': 'United Kingdom',
    'wales (uk)': 'United Kingdom', 'england, united kingdom': 'United Kingdom',

    # Canada variants
    'canada': 'Canada', 'canada ': 'Canada', 'can': 'Canada', 
    'canda': 'Canada', 'canadw': 'Canada', 'csnada': 'Canada',
    'canadá': 'Canada', 'canada, ottawa, ontario': 'Canada',
    'canada and usa': 'Canada',

    # Australia
    'australia': 'Australia', 'australia ': 'Australia', 'australi': 'Australia',
    'australian': 'Australia', 'new zealand aotearoa': 'New Zealand',

    # Other major countries
    'france': 'France', 'france ': 'France',
    'germany': 'Germany', 'germany ': 'Germany',
    'india': 'India', 'india ': 'India', 'ibdia': 'India',
    'ireland': 'Ireland', 'ireland ': 'Ireland',
    'spain': 'Spain', 'spain ': 'Spain',
    'netherlands': 'Netherlands', 'the netherlands': 'Netherlands',
    'the netherlands ': 'Netherlands', 'nl': 'Netherlands', 'nederland': 'Netherlands',
    'denmark': 'Denmark', 'danmark': 'Denmark',
    'switzerland': 'Switzerland', 'switzerland ': 'Switzerland',
    'sweden': 'Sweden', 'sweden ': 'Sweden',
    'italy': 'Italy', 'italy (south)': 'Italy',
    'japan': 'Japan', 'japan ': 'Japan',
    'china': 'China', 'mainland china': 'China',
    'hong kong': 'Hong Kong', 'hong kong ': 'Hong Kong',
    'new zealand': 'New Zealand', 'nz': 'New Zealand',
    'south africa': 'South Africa', 'south africa ': 'South Africa',
    'brazil': 'Brazil', 'brasil': 'Brazil',
    'mexico': 'Mexico', 'méxico': 'Mexico', 'mexico ': 'Mexico',
    'portugal': 'Portugal', 'portugal ': 'Portugal',
    'belgium': 'Belgium', 'belgium ': 'Belgium',
    'finland': 'Finland', 'finland ': 'Finland',
    'norway': 'Norway', 'norway ': 'Norway',
    'austria': 'Austria',
    'argentina': 'Argentina',
    'pakistan': 'Pakistan',
    'bangladesh': 'Bangladesh',
    'philippines': 'Philippines',
    'poland': 'Poland',
    'romania': 'Romania',
    'singapore': 'Singapore',
    'south korea': 'South Korea',
    'uae': 'United Arab Emirates', 'united arab emirates': 'United Arab Emirates',
    'malaysia': 'Malaysia', 'malaysia ': 'Malaysia',
    'israel': 'Israel',
    'nigeria': 'Nigeria', 'nigeria ': 'Nigeria',
    'czech republic': 'Czech Republic', 'czechia': 'Czech Republic',
    'ireland ': 'Ireland',
    'hungary': 'Hungary',
    'luxembourg': 'Luxembourg',
    'slovakia': 'Slovakia',
    'croatia': 'Croatia',
    'portugal ': 'Portugal',
    'wales, uk': 'United Kingdom',
    'scotland ': 'United Kingdom',
    'england, gb': 'United Kingdom',
    'englang': 'United Kingdom',
    'united kingdom (england)': 'United Kingdom',
}


df_clean['country_clean'] = df_clean['country'].replace(country_mapping)

print("Country standardization results:")
print(df_clean['country_clean'].value_counts().head(20))

# Clean state data for US respondents
us_mask = df_clean['country_clean'] == 'United States'
print(f"\nUS respondents: {us_mask.sum()}")

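# Normalize state names: trim, lowercase, and drop every character except a-z and spaces.
# Note: this also removes commas, so multi-state answers such as 'Kentucky, Ohio'
# become 'kentucky ohio' and pass through the mapping below unchanged.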
df_clean['US_state'] = df_clean['US_state'].str.strip().str.lower().str.replace(r'[^a-z ]', '', regex=True)

state_mapping = {
    # Full names
    'alabama': 'Alabama', 'alaska': 'Alaska', 'arizona': 'Arizona', 'arkansas': 'Arkansas',
    'california': 'California', 'colorado': 'Colorado', 'connecticut': 'Connecticut',
    'delaware': 'Delaware', 'florida': 'Florida', 'georgia': 'Georgia', 'hawaii': 'Hawaii',
    'idaho': 'Idaho', 'illinois': 'Illinois', 'indiana': 'Indiana', 'iowa': 'Iowa',
    'kansas': 'Kansas', 'kentucky': 'Kentucky', 'louisiana': 'Louisiana', 'maine': 'Maine',
    'maryland': 'Maryland', 'massachusetts': 'Massachusetts', 'michigan': 'Michigan',
    'minnesota': 'Minnesota', 'mississippi': 'Mississippi', 'missouri': 'Missouri',
    'montana': 'Montana', 'nebraska': 'Nebraska', 'nevada': 'Nevada',
    'new hampshire': 'New Hampshire', 'new jersey': 'New Jersey', 'new mexico': 'New Mexico',
    'new york': 'New York', 'north carolina': 'North Carolina', 'north dakota': 'North Dakota',
    'ohio': 'Ohio', 'oklahoma': 'Oklahoma', 'oregon': 'Oregon', 'pennsylvania': 'Pennsylvania',
    'rhode island': 'Rhode Island', 'south carolina': 'South Carolina', 'south dakota': 'South Dakota',
    'tennessee': 'Tennessee', 'texas': 'Texas', 'utah': 'Utah', 'vermont': 'Vermont',
    'virginia': 'Virginia', 'washington': 'Washington', 'west virginia': 'West Virginia',
    'wisconsin': 'Wisconsin', 'wyoming': 'Wyoming', 'district of columbia': 'District of Columbia',
    'dc': 'District of Columbia', 'washington dc': 'District of Columbia', 'd c': 'District of Columbia',

    # Common abbreviations
    'al': 'Alabama', 'ak': 'Alaska', 'az': 'Arizona', 'ar': 'Arkansas', 'ca': 'California',
    'co': 'Colorado', 'ct': 'Connecticut', 'de': 'Delaware', 'fl': 'Florida', 'ga': 'Georgia',
    'hi': 'Hawaii', 'id': 'Idaho', 'il': 'Illinois', 'in': 'Indiana', 'ia': 'Iowa', 'ks': 'Kansas',
    'ky': 'Kentucky', 'la': 'Louisiana', 'me': 'Maine', 'md': 'Maryland', 'ma': 'Massachusetts',
    'mi': 'Michigan', 'mn': 'Minnesota', 'ms': 'Mississippi', 'mo': 'Missouri', 'mt': 'Montana',
    'ne': 'Nebraska', 'nv': 'Nevada', 'nh': 'New Hampshire', 'nj': 'New Jersey', 'nm': 'New Mexico',
    'ny': 'New York', 'nc': 'North Carolina', 'nd': 'North Dakota', 'oh': 'Ohio', 'ok': 'Oklahoma',
    'or': 'Oregon', 'pa': 'Pennsylvania', 'ri': 'Rhode Island', 'sc': 'South Carolina',
    'sd': 'South Dakota', 'tn': 'Tennessee', 'tx': 'Texas', 'ut': 'Utah', 'vt': 'Vermont',
    'va': 'Virginia', 'wa': 'Washington', 'wv': 'West Virginia', 'wi': 'Wisconsin', 'wy': 'Wyoming',

    # Extra messy variants
    'calif': 'California', 'cal': 'California', 'wash dc': 'District of Columbia',
    'districtofcolumbia': 'District of Columbia', 'nyc': 'New York', 'newyork': 'New York',
    'mass': 'Massachusetts', 'penn': 'Pennsylvania', 'penna': 'Pennsylvania',
    'flor': 'Florida', 'virgina': 'Virginia', 'virgina ': 'Virginia', 'tex': 'Texas',
    'ill': 'Illinois', 'colo': 'Colorado', 'ore': 'Oregon'
}


df_clean['US_state_clean'] = df_clean['US_state']
df_clean.loc[us_mask, 'US_state_clean'] = df_clean.loc[us_mask, 'US_state_clean'].replace(state_mapping)

print(f"\nTop US states:")
print(df_clean[us_mask]['US_state_clean'].value_counts().head(15))



3. Standardizing location data...
Country standardization results:
country_clean
United States     23142
Canada             1686
United Kingdom     1586
Australia           391
Germany             195
New Zealand         130
Ireland             124
Netherlands          90
France               68
Spain                49
Sweden               41
Switzerland          38
Belgium              35
Japan                29
Denmark              24
India                22
South Africa         19
Singapore            19
Austria              17
Finland              16
Name: count, dtype: int64

US respondents: 23142

Top US states:
US_state_clean
California              2599
New York                2170
Massachusetts           1519
Texas                   1267
Illinois                1210
Washington              1181
District of Columbia     976
Pennsylvania             942
Virginia                 784
Minnesota                720
Ohio                     653
Colorado                 631
Oregon    
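The hand-written country mapping lists each typo explicitly. A complementary approach, sketched below with the standard-library `difflib`, is to propose the closest canonical name for values the mapping missed. The `CANONICAL` list, the `closest_country` helper, and the 0.85 cutoff are assumptions for illustration; fuzzy matches should be reviewed by hand before they are accepted.

```python
import difflib

# Hypothetical short list of canonical names (lowercase) to match against.
CANONICAL = ["united states", "united kingdom", "canada", "australia"]

def closest_country(raw, cutoff=0.85):
    """Return the closest canonical name, or None when nothing is similar enough."""
    cleaned = " ".join(str(raw).lower().split())  # trim and collapse whitespace
    matches = difflib.get_close_matches(cleaned, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for value in ["United Statws", " Canda ", "Germany"]:
    print(repr(value), "->", closest_country(value))
```

Values that return `None` are good candidates for a manual look, since they are either legitimate other countries or typos too severe to match automatically.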

In [13]:
# 4. Convert Experience Data to Numeric
print("\n4. Converting experience data to numeric...")

# Map experience ranges to numeric values: midpoints for the closed ranges,
# and representative values (1 and 45) for the open-ended first and last bins
experience_mapping = {
    '1 year or less': 1,
    '2 - 4 years': 3,
    '5-7 years': 6,
    '8 - 10 years': 9,
    '11 - 20 years': 15.5,
    '21 - 30 years': 25.5,
    '31 - 40 years': 35.5,
    '41 years or more': 45
}

# Convert overall experience
df_clean['experience_total_numeric'] = df_clean['experience_total_years'].map(experience_mapping)

# Convert field experience  
df_clean['experience_field_numeric'] = df_clean['experience_field_years'].map(experience_mapping)

print("Experience conversion results:")
print(f"Total experience - valid: {df_clean['experience_total_numeric'].notna().sum()}")
print(f"Field experience - valid: {df_clean['experience_field_numeric'].notna().sum()}")

print(f"\nTotal experience distribution:")
print(df_clean['experience_total_numeric'].value_counts().sort_index())
print(f"\nField experience distribution:")
print(df_clean['experience_field_numeric'].value_counts().sort_index())



4. Converting experience data to numeric...
Experience conversion results:
Total experience - valid: 28062
Field experience - valid: 28062

Total experience distribution:
experience_total_numeric
1.0      523
3.0     3026
6.0     4882
9.0     5377
15.5    9624
25.5    3637
35.5     869
45.0     124
Name: count, dtype: int64

Field experience distribution:
experience_field_numeric
1.0     1485
3.0     6249
6.0     6519
9.0     4982
15.5    6536
25.5    1868
35.5     382
45.0      41
Name: count, dtype: int64
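The experience labels are formatted inconsistently (`'5-7 years'` versus `'2 - 4 years'`), so the dictionary has to match each string exactly. A hedged alternative is to extract the numbers with a regular expression and compute the midpoint, as sketched below. The `experience_midpoint` helper is hypothetical, and for open-ended bins it falls back to the single bound, so `'41 years or more'` yields 41 rather than the 45 chosen in the mapping above.

```python
import re

def experience_midpoint(label):
    """Turn labels such as '5-7 years' or '2 - 4 years' into a numeric midpoint."""
    numbers = [int(n) for n in re.findall(r"\d+", str(label))]
    if len(numbers) == 2:      # closed range: average the two bounds
        return sum(numbers) / 2
    if len(numbers) == 1:      # open-ended bin: fall back to its single bound
        return float(numbers[0])
    return None                # unrecognized label

for label in ["1 year or less", "2 - 4 years", "5-7 years",
              "11 - 20 years", "41 years or more"]:
    print(label, "->", experience_midpoint(label))
```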


In [14]:
print(df_clean['job_title'].value_counts())
print(df_clean['industry'].value_counts())

job_title
Software Engineer               286
Project Manager                 230
Director                        198
Senior Software Engineer        196
Program Manager                 152
                               ... 
Grants Administrator              1
Supervisor, Communications        1
Procurement Category Manager      1
Director, Cloud Solutions         1
Clinical physiologist             1
Name: count, Length: 14346, dtype: int64
industry
Computing or Tech                          4699
Education (Higher Education)               2464
Nonprofits                                 2419
Health care                                1896
Government and Public Administration       1889
                                           ... 
Gaming (Gambling)                             1
Regulatory Affairs- nutraceuticals            1
Manufacturing : corporate admin support       1
Real Estate Investment Support                1
Wine & Spirits                                1
Name: count, Len

In [15]:
# 5. Identify Tech Roles and Clean Job Titles
print("\n5. Identifying tech roles and cleaning job titles...")

import re

def is_tech_role(row):
    job_title = str(row['job_title']).lower().strip()

    # 1️⃣ Definite tech phrases (safe to include)
    strong_tech = [
        'software engineer', 'software developer', 'devops', 'cloud engineer',
        'systems engineer', 'system administrator', 'sysadmin', 'it support',
        'network engineer', 'data engineer', 'data scientist', 'data analyst',
        'ml engineer', 'machine learning', 'ai engineer', 'cybersecurity',
        'security engineer', 'full stack', 'frontend developer', 'backend developer',
        'database administrator', 'dba', 'site reliability engineer', 'sre'
    ]
    if any(phrase in job_title for phrase in strong_tech):
        return True

    # 2️⃣ Definite non-tech contexts — if these appear, exclude
    strong_nontech = [
        'medical', 'vet', 'veterinary', 'nurse', 'lab', 'laboratory',
        'scientist', 'biology', 'chemist', 'pharmacy', 'psychology', 'clinical',
        'communications', 'marketing', 'policy', 'legal', 'attorney',
        'finance', 'audit', 'account', 'recruit', 'grant', 'writer',
        'teacher', 'education', 'coach', 'diversity', 'equity'
    ]
    if any(word in job_title for word in strong_nontech):
        return False

    # 3️⃣ Broader but contextual tech words
    if re.search(r'\b(it|technical|developer|engineer|architect|programmer|technology)\b', job_title):
        return True

    # 4️⃣ Technician → only keep IT-related ones
    # (caution: these are substring tests, so 'it' also matches inside words such as 'facilities')
    if 'technician' in job_title:
        if any(term in job_title for term in ['it', 'system', 'network', 'computer']):
            return True
        return False

    return False


df_clean['is_tech_role'] = df_clean.apply(is_tech_role, axis=1)

print(f"Tech roles identified: {df_clean['is_tech_role'].sum()}")
print(f"Non-tech roles: {(~df_clean['is_tech_role']).sum()}")

print(f"\nTech roles by industry:")
tech_roles = df_clean[df_clean['is_tech_role']]
print(tech_roles['industry'].value_counts().head(10))



5. Identifying tech roles and cleaning job titles...
Tech roles identified: 3991
Non-tech roles: 24071

Tech roles by industry:
industry
Computing or Tech                       2235
Engineering or Manufacturing             765
Education (Higher Education)             115
Government and Public Administration      97
Health care                               96
Accounting, Banking & Finance             87
Utilities & Telecommunications            65
Nonprofits                                59
Media & Digital                           55
Insurance                                 50
Name: count, dtype: int64
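Several checks in `is_tech_role` use plain substring tests (`phrase in job_title`), which can match inside unrelated words: `'it'` occurs in "facilities" and `'lab'` in "collaboration". The sketch below shows a whole-word alternative built on `\b` word boundaries. The `has_keyword` helper is hypothetical and is not the function used above.

```python
import re

def has_keyword(title, keywords):
    """True when any keyword appears in the title as a whole word or phrase."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, str(title).lower()) is not None

print("it" in "facilities technician")                 # True  (substring test)
print(has_keyword("Facilities Technician", ["it"]))    # False (whole-word test)
print(has_keyword("IT Support Technician", ["it"]))    # True
print(has_keyword("Collaboration Manager", ["lab"]))   # False
```

Switching the keyword lists to whole-word matching would change the tech-role counts reported above, so it is worth re-running the spot checks afterwards.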


In [16]:
# Spot-check a random sample of the job titles flagged as tech roles
tech_titles = df_clean.loc[df_clean['is_tech_role'], 'job_title']
tech_titles.sample(50)



25393                                     Data Analyst
18731                           Lead software engineer
25524                             Product Data Analyst
22927                               Frontend Developer
19875                         Senior Software Engineer
8626                          Senior Software Engineer
18811                             Electrical Engineer 
16364                                    Data Engineer
6079                         Senior Software Developer
15885                             Chief Data Scientist
10705                       Staff Developer Evangelist
19712                                 Systems Engineer
7581                                  Senior Developer
7385                                  Systems Engineer
18628                         Senior Software engineer
4808                      Technical Services Librarian
20429                     Optical development engineer
10426                              Mechanical Engineer
10199     

In [17]:
# 6. Data Quality Validation and Outlier Detection
print("\n6. Data quality validation and outlier detection...")

# Focus on USD salaries for outlier detection
usd_salaries = df_clean[(df_clean['currency'] == 'USD') & 
                        (df_clean['salary_numeric'].notna())]['salary_numeric']

print(f"USD salary analysis:")
print(f"Count: {len(usd_salaries)}")
print(f"Min: ${usd_salaries.min():,.0f}")
print(f"Max: ${usd_salaries.max():,.0f}")
print(f"Median: ${usd_salaries.median():,.0f}")
print(f"Mean: ${usd_salaries.mean():,.0f}")

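# Identify potential outliers (salaries < 10k or > 1M)
# Note: these masks are computed over every currency, so the counts printed
# below include non-USD rows; only the examples further down are USD-only
outlier_low = df_clean['salary_numeric'] < 10000
outlier_high = df_clean['salary_numeric'] > 1000000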

print(f"\nPotential outliers:")
print(f"Salaries < $10,000: {outlier_low.sum()}")
print(f"Salaries > $1,000,000: {outlier_high.sum()}")

# Show some examples of low outliers
print(f"\nExamples of low salary outliers:")
low_outliers = df_clean[outlier_low & (df_clean['currency'] == 'USD')]
if len(low_outliers) > 0:
    print(low_outliers[['job_title', 'industry', 'salary_numeric']].head(10))



6. Data quality validation and outlier detection...
USD salary analysis:
Count: 23374
Min: $0
Max: $102,000,000
Median: $78,000
Mean: $97,041

Potential outliers:
Salaries < $10,000: 149
Salaries > $1,000,000: 71

Examples of low salary outliers:
                      job_title                        industry  \
97       Quality Assurance Lead        Environmental regulation   
895   Special Education Teacher   Education (Primary/Secondary)   
968             Managing Editor                 Media & Digital   
1607       Chief Data Scientist         Agriculture or Forestry   
3887        freelance captioner                   accessibility   
4081         Operations Manager                          Retail   
4124     Engineering Supervisor    Engineering or Manufacturing   
5211        Assistant Registrar  Museum (University Affiliated)   
6124           Regional Manager                  Public Library   
6550                    Manager   Education (Primary/Secondary)   

      salary_n
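
The fixed $10,000 and $1,000,000 cutoffs above are a judgment call. As a cross-check, here is a minimal sketch of the common interquartile-range (IQR) rule, using only the standard library. The helper name `iqr_outliers` and the sample salaries are made up for illustration and are not part of the notebook. Salary data is right-skewed, so this rule can also flag legitimate high earners; treat it as a second opinion rather than a replacement.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up salaries for illustration, including one implausible entry
sample_salaries = [40_000, 50_000, 60_000, 70_000, 80_000, 90_000, 100_000, 102_000_000]
flagged = iqr_outliers(sample_salaries)
print(flagged)
```

On a DataFrame, the same fences could be derived from `Series.quantile(0.25)` and `Series.quantile(0.75)`.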

In [18]:
# 7. Create Analysis-Ready Dataset
print("\n7. Creating analysis-ready dataset...")

# Filter for reasonable salary range (10k - 1M USD)
reasonable_salaries = (df_clean['salary_numeric'] >= 10000) & (df_clean['salary_numeric'] <= 1000000)

# Create final cleaned dataset
df_final = df_clean[
    (df_clean['currency'] == 'USD') &  # USD only for analysis
    (df_clean['salary_numeric'].notna()) &                 # Valid salary
    reasonable_salaries &                                   # Reasonable salary range
    (df_clean['country_clean'] == 'United States')         # US only
].copy()

print(f"Final dataset shape: {df_final.shape}")
print(f"Data retention: {len(df_final)/len(df)*100:.1f}% of original data")

# Summary of cleaned data
print(f"\nFinal dataset summary:")
print(f"- Total records: {len(df_final):,}")
print(f"- Tech roles: {df_final['is_tech_role'].sum():,}")
print(f"- Non-tech roles: {(~df_final['is_tech_role']).sum():,}")
print(f"- Salary range: ${df_final['salary_numeric'].min():,.0f} - ${df_final['salary_numeric'].max():,.0f}")
print(f"- Median salary: ${df_final['salary_numeric'].median():,.0f}")

print(f"\nFinal dataset columns:")
print(df_final.columns.tolist())



7. Creating analysis-ready dataset...
Final dataset shape: (23023, 25)
Data retention: 82.0% of original data

Final dataset summary:
- Total records: 23,023
- Tech roles: 3,160
- Non-tech roles: 19,863
- Salary range: $10,000 - $954,000
- Median salary: $78,750

Final dataset columns:
['timestamp', 'age', 'industry', 'job_title', 'job_context', 'annual_salary', 'additional_compensation', 'currency', 'currency_other', 'income_context', 'country', 'US_state', 'city', 'experience_total_years', 'experience_field_years', 'education', 'gender', 'race', 'salary_numeric', 'bonus_numeric', 'country_clean', 'US_state_clean', 'experience_total_numeric', 'experience_field_numeric', 'is_tech_role']


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [19]:
# Question 1: What is the median salary for Software Engineers in the United States?
software_engineer_keywords = [
    'software engineer', 'software developer', 'developer', 'programmer',
    'application developer', 'application engineer', 'sde', 'backend',
    'frontend', 'full stack', 'platform engineer', 'infrastructure engineer',
    'site reliability engineer', 'sre', 'devops', 'cloud engineer',
    'data engineer', 'ml engineer', 'ai engineer', 'android developer',
    'ios developer', 'mobile developer', 'embedded engineer', 'firmware engineer',
    'web developer', 'ui engineer', 'ux engineer', 'api developer',
    'tools engineer', 'automation engineer', 'senior software engineer',
    'principal software engineer', 'staff software engineer', 'lead software engineer',
    'senior developer', 'principal developer', 'staff developer', 'lead developer',
    'senior programmer', 'principal programmer', 'staff programmer', 'lead programmer',
    'software architect', 'systems engineer', 'technical lead', 'engineering manager',
    'software development engineer', 'swe', 'sw engineer', 'software dev',
    'fullstack', 'full-stack', 'backend developer', 'frontend developer',
    'full stack developer', 'full stack engineer', 'web engineer', 'mobile engineer',
    'machine learning engineer', 'ai/ml engineer', 'data scientist', 'research engineer',
    'security engineer', 'cybersecurity engineer', 'network engineer', 'database engineer',
    'qa engineer', 'test engineer', 'quality assurance engineer', 'performance engineer',
    'game developer', 'game engineer', 'unity developer', 'react developer',
    'python developer', 'java developer', 'javascript developer', 'c++ developer',
    'ios engineer', 'android engineer', 'cross-platform developer'
]

def is_software_engineer(job_title):
    """Check if a job title indicates a software engineering role"""
    if pd.isna(job_title):
        return False
    
    job_title_lower = str(job_title).lower()
    
    for keyword in software_engineer_keywords:
        if keyword in job_title_lower:
            return True
    
    return False

# Apply the function to identify software engineers
df_final['is_software_engineer'] = df_final['job_title'].apply(is_software_engineer)


In [20]:
# Calculate median salary for software engineers

# Filter for software engineers with valid salaries
swe_data = df_final[
    (df_final['is_software_engineer'] == True) & 
    (df_final['salary_numeric'].notna()) &
    (df_final['salary_numeric'] > 0)
].copy()

print(f"\nSoftware engineers with valid salaries: {len(swe_data)}")

# Remove extreme outliers for more realistic analysis
swe_data_filtered = swe_data[
    (swe_data['salary_numeric'] >= 20000) & 
    (swe_data['salary_numeric'] <= 700000)
].copy()

print(f"Software engineers with realistic salaries ($20k-$700k): {len(swe_data_filtered)}")

# Calculate median salary
median_salary = swe_data_filtered['salary_numeric'].median()
mean_salary = swe_data_filtered['salary_numeric'].mean()

print(f"\n📊 SOFTWARE ENGINEER SALARY ANALYSIS")
print(f"=" * 50)
print(f"Sample size: {len(swe_data_filtered):,} software engineers")
print(f"Median salary: ${median_salary:,.0f}")
print(f"Mean salary: ${mean_salary:,.0f}")
print(f"Min salary: ${swe_data_filtered['salary_numeric'].min():,.0f}")
print(f"Max salary: ${swe_data_filtered['salary_numeric'].max():,.0f}")
print(f"25th percentile: ${swe_data_filtered['salary_numeric'].quantile(0.25):,.0f}")
print(f"75th percentile: ${swe_data_filtered['salary_numeric'].quantile(0.75):,.0f}")

# Analysis by seniority levels
def get_seniority_level(job_title):
    if pd.isna(job_title):
        return 'Unknown'
    
    title_lower = str(job_title).lower()
    
    if any(word in title_lower for word in ['principal', 'staff', 'distinguished']):
        return 'Principal/Staff'
    elif any(word in title_lower for word in ['senior', 'sr', 'lead']):
        return 'Senior'
    elif any(word in title_lower for word in ['junior', 'jr', 'entry', 'associate']):
        return 'Junior/Entry'
    else:
        return 'Mid-level'

swe_data_filtered['seniority'] = swe_data_filtered['job_title'].apply(get_seniority_level)

print(f"\n📈 SALARY BY SENIORITY LEVEL")
print(f"=" * 35)
seniority_analysis = swe_data_filtered.groupby('seniority')['salary_numeric'].agg([
    'count', 'median', 'mean'
]).round(0)
print(seniority_analysis)

# Final answer
print(f"\n🎯 FINAL ANSWER TO QUESTION 1")
print(f"=" * 50)
print(f"Question: What is the median salary for Software Engineers in the United States?")
print(f"Answer: ${median_salary:,.0f}")
print(f"Based on {len(swe_data_filtered):,} software engineers in the US with valid salary data.")




Software engineers with valid salaries: 1771
Software engineers with realistic salaries ($20k-$700k): 1763

📊 SOFTWARE ENGINEER SALARY ANALYSIS
Sample size: 1,763 software engineers
Median salary: $131,000
Mean salary: $138,559
Min salary: $28,800
Max salary: $630,000
25th percentile: $100,000
75th percentile: $165,000

📈 SALARY BY SENIORITY LEVEL
                 count    median      mean
seniority                                 
Junior/Entry        24   76500.0   76121.0
Mid-level         1097  120000.0  128018.0
Principal/Staff    113  180000.0  185020.0
Senior             529  148000.0  153327.0

🎯 FINAL ANSWER TO QUESTION 1
Question: What is the median salary for Software Engineers in the United States?
Answer: $131,000
Based on 1,763 software engineers in the US with valid salary data.


In [21]:
# Question 2: Which US state has the highest average salary for tech workers?
tech_keywords = [
    # Software Engineering (already defined above)
    'software engineer', 'software developer', 'developer', 'programmer',
    'application developer', 'application engineer', 'sde', 'backend',
    'frontend', 'full stack', 'platform engineer', 'infrastructure engineer',
    'site reliability engineer', 'sre', 'devops', 'cloud engineer',
    'data engineer', 'ml engineer', 'ai engineer', 'android developer',
    'ios developer', 'mobile developer', 'embedded engineer', 'firmware engineer',
    'web developer', 'ui engineer', 'ux engineer', 'api developer',
    'tools engineer', 'automation engineer',
    
    # Data Science and Analytics
    'data scientist', 'data analyst', 'analytics engineer', 'business analyst',
    'data architect', 'machine learning engineer', 'research scientist',
    'quantitative analyst', 'statistician',
    
    # Product and Design Tech Roles
    'product manager', 'product owner', 'scrum master', 'agile coach',
    'technical product manager', 'product analyst',
    
    # IT and Systems
    'systems administrator', 'sysadmin', 'network administrator', 'database administrator',
    'it manager', 'it director', 'technical support', 'help desk',
    'cybersecurity analyst', 'security analyst', 'information security',
    
    # Tech Sales and Marketing
    'sales engineer', 'technical sales', 'solution architect', 'pre-sales engineer',
    'technical marketing', 'developer advocate', 'technical writer',
    
    # Engineering Management
    'engineering manager', 'technical director', 'cto', 'vp engineering',
    'head of engineering', 'engineering lead', 'tech lead',
    
    # Quality Assurance
    'qa engineer', 'test engineer', 'quality assurance engineer', 'automation tester',
    'performance engineer', 'test automation engineer',
    
    # Other Tech Roles
    'solutions architect', 'enterprise architect', 'technical consultant',
    'implementation engineer', 'integration engineer', 'technical project manager'
]

def is_tech_worker(job_title):
    """Check if a job title indicates a tech worker role"""
    if pd.isna(job_title):
        return False
    
    job_title_lower = str(job_title).lower()
    
    for keyword in tech_keywords:
        if keyword.lower() in job_title_lower:
            return True
    
    return False

# Apply the function to identify tech workers
df_final['is_tech_worker'] = df_final['job_title'].apply(is_tech_worker)

# Filter for tech workers with valid salaries
tech_workers = df_final[
    (df_final['is_tech_worker'] == True) & 
    (df_final['salary_numeric'].notna()) &
    (df_final['salary_numeric'] > 0) & (df_final['country_clean'] == 'United States') &
    (df_final['US_state_clean'].notna())  # Must have state data
].copy()

print(f"Tech workers with valid salary and state data: {len(tech_workers)}")

# Remove extreme outliers for more realistic analysis
tech_workers_filtered = tech_workers[
    (tech_workers['salary_numeric'] >= 20000) & 
    (tech_workers['salary_numeric'] <= 700000)
].copy()

print(f"Tech workers with realistic salaries ($20k-$700k): {len(tech_workers_filtered)}")

# Calculate average salary by state
state_salaries = tech_workers_filtered.groupby('US_state_clean')['salary_numeric'].agg([
    'count', 'mean', 'median', 'std'
]).round(0)

# Filter for states with at least 10 tech workers for statistical significance
state_salaries_significant = state_salaries[state_salaries['count'] >= 10].copy()

# Sort by average salary (mean)
state_salaries_significant = state_salaries_significant.sort_values('mean', ascending=False)

print(f"\n📊 TECH WORKER SALARIES BY STATE")
print(f"=" * 50)
print(f"States with 10+ tech workers (sorted by average salary):")
print(f"{'State':<25} {'Count':<8} {'Avg Salary':<12} {'Median':<12} {'Std Dev':<12}")
print("-" * 70)

for state, row in state_salaries_significant.head(10).iterrows():
    print(f"{state:<25} {row['count']:<8.0f} ${row['mean']:<11,.0f} ${row['median']:<11,.0f} ${row['std']:<11,.0f}")

# Find the state with highest average salary
highest_paying_state = state_salaries_significant.index[0]
highest_avg_salary = state_salaries_significant.iloc[0]['mean']
state_count = state_salaries_significant.iloc[0]['count']

print(f"\n🎯 FINAL ANSWER TO QUESTION 2")
print(f"=" * 50)
print(f"Question: Which US state has the highest average salary for tech workers?")
print(f"Answer: {highest_paying_state}")
print(f"Average salary: ${highest_avg_salary:,.0f}")
print(f"Sample size: {state_count:.0f} tech workers")
print(f"Top 5 highest-paying states for tech workers:")
for i, (state, row) in enumerate(state_salaries_significant.head(5).iterrows()):
    print(f"{i+1}. {state}: ${row['mean']:,.0f} (n={row['count']:.0f})")

# Additional analysis - show some context
print(f"\n📈 ADDITIONAL INSIGHTS")
print(f"=" * 30)
print(f"• Total states analyzed: {len(state_salaries_significant)}")
print(f"• Tech worker salary range across all states: ${state_salaries_significant['mean'].min():,.0f} - ${state_salaries_significant['mean'].max():,.0f}")
print(f"• Salary difference between #1 and #2: ${highest_avg_salary - state_salaries_significant.iloc[1]['mean']:,.0f}")
print(f"• {highest_paying_state} has {state_count:.0f} tech workers, representing {state_count/len(tech_workers_filtered)*100:.1f}% of all tech workers in the dataset")


Tech workers with valid salary and state data: 5037
Tech workers with realistic salaries ($20k-$700k): 5022

📊 TECH WORKER SALARIES BY STATE
States with 10+ tech workers (sorted by average salary):
State                     Count    Avg Salary   Median       Std Dev     
----------------------------------------------------------------------
California                661      $147,461     $141,000     $67,037     
Washington                284      $136,140     $135,000     $49,994     
New York                  588      $130,706     $120,000     $63,037     
Massachusetts             414      $120,919     $112,000     $47,893     
District of Columbia      192      $119,302     $108,030     $54,758     
Oregon                    163      $118,499     $110,000     $49,446     
Virginia                  156      $117,898     $109,600     $50,861     
New Jersey                70       $117,064     $109,500     $54,397     
Colorado                  153      $116,480     $110,000     $59,
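
The keyword checks in `is_tech_worker`, `is_software_engineer`, and `is_tech_role` use plain substring matching (`keyword in job_title_lower`). Short keywords therefore match inside unrelated words: `'cto'` is a substring of `'director'`, and `'it'` is a substring of `'quality'`. A minimal sketch of whole-word matching with the standard `re` module is shown below. The names `compile_keyword_pattern`, `TECH_PATTERN`, and `matches_keyword` are hypothetical, and the keyword list is shortened for illustration. Switching the notebook to this approach would change the counts reported above.

```python
import re


def compile_keyword_pattern(keywords):
    """Build one case-insensitive regex matching any keyword as a whole word or phrase."""
    # Longest keywords first so multi-word phrases are preferred over their parts
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    # Lookarounds instead of \b so keywords ending in symbols (e.g. 'c++') still work
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


# Shortened, illustrative keyword list (not the notebook's full list)
TECH_PATTERN = compile_keyword_pattern(["cto", "sre", "it", "c++ developer", "software engineer"])


def matches_keyword(job_title, pattern=TECH_PATTERN):
    """Return True if the title contains any keyword as a standalone word or phrase."""
    return bool(pattern.search(str(job_title)))


for title in ["CTO", "Director of Marketing", "IT Technician", "Quality Technician"]:
    print(f"{title!r}: {matches_keyword(title)}")
```

With pandas, the compiled pattern can be passed to `Series.str.contains` to avoid a row-by-row `apply`.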

In [74]:
# Question 3: How much does salary increase on average for each year of experience in tech?
# Filter for tech roles with valid salary and experience
tech_df = df_final[
    (df_final['is_tech_role'] == True) &
    df_final['salary_numeric'].notna() &
    df_final['experience_total_numeric'].notna()
]

# Group by years of experience and calculate median salary
median_salary_by_exp = tech_df.groupby('experience_total_numeric')['salary_numeric'].median().sort_index()

# Compute year-over-year differences
salary_diff = median_salary_by_exp.diff().dropna()

# Average increase per year
avg_increase_per_year = salary_diff.mean()

# Estimated starting salary (median at 0 or 1 year)
starting_salary = median_salary_by_exp.iloc[0]

print(f"Estimated salary increase per year (tech roles): ${avg_increase_per_year:,.2f}")
print(f"Estimated starting salary (tech roles): ${starting_salary:,.0f}")



Estimated salary increase per year (tech roles): $9,278.57
Estimated starting salary (tech roles): $80,050
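
The calculation above averages the difference in median salary between consecutive experience levels. If those levels are bucket midpoints spaced more than one year apart (an assumption here, since the definition of `experience_total_numeric` is not shown in this section), the result is an increase per level rather than per year. A least-squares slope accounts for the spacing. The sketch below uses only the standard library; the helper `slope_per_year` and the numbers are made up for illustration.

```python
from statistics import fmean


def slope_per_year(years, salaries):
    """Ordinary least-squares slope of salary on years of experience."""
    mean_x, mean_y = fmean(years), fmean(salaries)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, salaries))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return sxy / sxx


# Hypothetical bucket midpoints (years) and median salaries, for illustration only
years = [1, 3, 6, 9]
medians = [70_000, 80_000, 95_000, 110_000]

per_year = slope_per_year(years, medians)
# Mean step between consecutive levels, which ignores how far apart the levels are
per_level_step = fmean(b - a for a, b in zip(medians, medians[1:]))

print(f"Slope per year: ${per_year:,.0f}")
print(f"Mean step between levels: ${per_level_step:,.0f}")
```

In this made-up example the two figures differ by more than a factor of two, which is why the spacing of the experience levels matters.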


In [54]:
# Question 4: What percentage of respondents work remotely vs. in-office?
from rapidfuzz import fuzz

def infer_work_type(row):
    title = str(row.get('job_title', '')).lower()
    context = str(row.get('job_context', '')).lower()
    text = title + ' ' + context  # combine both for context

    if not text.strip():
        return 'Unknown'
    
    remote_keywords = ['remote', 'work from home', 'wfh', 'telework', 'telecommute',
        'virtual work', 'distributed team', 'home office', 'remote-first',
        'fully remote', 'work remotely', 'offsite', 'off-site', 'anywhere',
        'global team', 'remote position', '100% remote', 'remote role',
        'remote environment', 'work at home', 'remote job',
        'online only', 'remote-based', 'location independent',
        'work anywhere', 'home-based', 'flexible location',
        'digital nomad', 'remote contractor', 'remote freelancer',
        'remote setup', 'virtual position', 'virtual role', 'in house', 'self-employed', 'customer service rep',
        'customer support specialist'
        ]
    if any(fuzz.partial_ratio(word, text) > 90 for word in remote_keywords):
        return 'Remote'
    hybrid_keywords = ['hybrid', 'part remote', 'partly remote', 'split schedule',
        '2 days remote', '3 days remote', 'couple days remote',
        'office + home', 'remote some days', 'flexible work', 'mixed remote',
        'in-person part time', 'partly in office', 'few days in office',
        'flex work', 'flexible schedule', 'alternate remote', 'rotating remote',
        'part office', 'half remote', 'some remote', 'blended work',
        'combo of office and remote', 'partial remote']
    if any(fuzz.partial_ratio(word, text) > 90 for word in hybrid_keywords):
        return 'Hybrid'
    office_keywords = [ 'in office', 'in-person','in person','onsite','on site', 'on-site', 'office-based',
        'at office', 'in the building', 'at the facility', 'on campus', 
        'in hospital', 'in school', 'in clinic', 'front desk', 'in warehouse',
        'on location', 'field work', 'at headquarters', 'office required',
        'must be onsite', 'shift work', 'facility-based', 'work location',
        'on the floor', 'in the store', 'site-based', 'rn']
    if any(fuzz.partial_ratio(word, text) > 90 for word in office_keywords):
        return 'In-Office'
    # --- 1️⃣ Explicit indicators ---
    if any(word in text for word in remote_keywords):
        return 'Remote'
    if any(word in text for word in hybrid_keywords ):
        return 'Hybrid'
    if any(word in text for word in office_keywords):
        return 'In-Office'
    
    # --- 3️⃣ Likely in-office roles ---
    office_roles = [
        'teacher', 'professor', 'educator', 'lecturer', 'faculty',
        'nurse', 'doctor', 'therapist', 'counselor', 'psychologist',
        'retail', 'cashier', 'store associate', 'sales clerk', 'barista',
        'server', 'waiter', 'cook', 'chef', 'housekeeper', 'janitor',
        'construction', 'mechanic', 'driver', 'warehouse', 'delivery',
        'security', 'maintenance', 'manufacturing', 'assembly', 'technician',
        'public library', 'government', 'healthcare', 'hospital',
        'school', 'university', 'campus', 'clinic', 'library', 'librarian', 'chemist', 'scientist', 'attorney',
        'police', 'firefighter', 'military', 'lab technician', 'groundskeeper', 'driving', 'receptionist', 'veterinarian', 'ogist', 'cist', 'ician',
        'lab', 'electrician', 'dental', 'surgeon', 'ortho', 'carpent', 'plumb' 
    ]
    if any(word in text for word in office_roles):
        return 'In-Office'
    if any(fuzz.partial_ratio(word, text) > 90 for word in office_roles):
        return 'In-Office'
    return 'Unknown'

df_final['work_type'] = df_final.apply(infer_work_type, axis=1)
work_type_percent = (df_final['work_type'].value_counts(normalize=True) * 100).round(2)
for work_type, pct in work_type_percent.items():
    print(f"{work_type}: {pct}%")



Unknown: 77.92%
In-Office: 21.36%
Remote: 0.67%
Hybrid: 0.05%


In [None]:
unknown_rows = df_final[df_final['work_type'] == 'Unknown'][['job_title', 'job_context']]


job_context
Fundraising                                                              19
In commercial real estate industry                                       10
                                                                          9
Compliance                                                                5
Human Resources                                                           4
                                                                         ..
I work on Digital Marketing                                               1
Applications and Data Management                                          1
Software developer, DevOps practitioner                                   1
It's a combination role of Project Manager and Client Success Manager     1
I build airplanes                                                         1
Name: count, Length: 100, dtype: int64


In [55]:
# Question 5: Which industry (besides tech) has the highest median salary?
# Filter out tech roles and remove missing industries
non_tech_df = df_final[(df_final['is_tech_role'] == False) & (df_final['industry'].notna())]

# Group by industry and calculate median salary
industry_median_salary = non_tech_df.groupby('industry')['salary_numeric'].median()

# Sort descending to get the highest median salaries
industry_median_salary_sorted = industry_median_salary.sort_values(ascending=False)

# Print top industry nicely
top_industry = industry_median_salary_sorted.head(1)
for industry, median_salary in top_industry.items():
    print(f"The non-tech industry with the highest median salary is '{industry}' with a median salary of ${median_salary:,}")

# Optional: Print top 5 industries for context
print("\nTop 5 non-tech industries by median salary:")
for industry, median_salary in industry_median_salary_sorted.head(5).items():
    print(f"{industry}: ${median_salary:,}")



The non-tech industry with the highest median salary is 'Commercial Building Material Distribution' with a median salary of $400,000.0

Top 5 non-tech industries by median salary:
Commercial Building Material Distribution: $400,000.0
Sports: $300,000.0
Corporate Training: $280,000.0
Energy (oil & gas & associated products, renewable power, etc): $253,300.0
Pharmaceutical/biotechnology: $227,500.0
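
The ranking above groups on the raw `industry` text, so a free-text category with very few respondents can top the list. Question 2 guarded against this by requiring at least 10 tech workers per state. The same guard can be sketched for industries with the standard library alone. The function name `top_groups_by_median`, the threshold, and the rows below are illustrative assumptions, not results from the survey.

```python
from collections import defaultdict
from statistics import median


def top_groups_by_median(rows, min_count=30):
    """Rank groups by median value, skipping groups with fewer than min_count rows."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    ranked = [
        (name, median(values), len(values))
        for name, values in groups.items()
        if len(values) >= min_count
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Made-up rows: one single-respondent category plus two larger ones
rows = (
    [("Niche category", 400_000)]
    + [("Law", 90_000 + i * 1_000) for i in range(5)]
    + [("Retail", 40_000 + i * 1_000) for i in range(5)]
)
ranking = top_groups_by_median(rows, min_count=5)
for name, med, n in ranking:
    print(f"{name}: ${med:,.0f} (n={n})")
```

In pandas, the equivalent is to aggregate `['median', 'count']` per industry and filter on `count` before sorting, as the state analysis does.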


In [None]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
salary_df = df_final[
    df_final['salary_numeric'].notna() & 
    df_final['gender'].notna() & 
    df_final['job_title'].notna()
]
mean_salary_by_role = salary_df.groupby(['job_title', 'gender'])['salary_numeric'].mean().unstack()
mean_salary_by_role['gap'] = mean_salary_by_role['Man'] - mean_salary_by_role['Woman']
mean_salary_by_role = mean_salary_by_role.sort_values('gap', ascending=False)

print("Top 10 roles with largest mean Man-Woman salary gaps:\n")
print(f"{'Job Title':40} {'Man Salary':>12} {'Woman Salary':>14} {'Gap':>12}")
print("-" * 80)
for job, row in mean_salary_by_role.head(10).iterrows():
    print(f"{job:40} ${row['Man']:>11,.0f} ${row['Woman']:>13,.0f} ${row['gap']:>11,.0f}")



Top 10 roles with largest mean Man-Woman salary gaps:

Job Title                                  Man Salary   Woman Salary          Gap
--------------------------------------------------------------------------------
Program Associate                        $    510,000 $      127,491 $    382,509
Managing director                        $    390,000 $       91,500 $    298,500
Senior Financial Analyst                 $    346,800 $       90,980 $    255,820
Physician                                $    488,000 $      300,000 $    188,000
Staff Engineer                           $    253,000 $       82,416 $    170,584
Design Strategist                        $    250,000 $       85,000 $    165,000
Director of Education                    $    205,000 $       51,500 $    153,500
Grants Director                          $    207,500 $       54,000 $    153,500
Head of School                           $    210,000 $       61,000 $    149,000
VP Engineering                           $  

In [None]:
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Filter relevant rows
edu_salary_df = df_final[
    df_final['salary_numeric'].notna() & 
    df_final['education'].notna()
]

# Select only Master's and College degrees
edu_salary_df = edu_salary_df[edu_salary_df['education'].isin(["Master's degree", "College degree"])]

# Group by education and compute median and mean salary
salary_by_edu = edu_salary_df.groupby('education')['salary_numeric'].agg(['median', 'mean', 'count']).sort_values('median', ascending=False)

# Print nicely
print("Salary comparison: Master's degree vs College degree\n")
for edu, row in salary_by_edu.iterrows():
    print(f"{edu:15} | Median: ${row['median']:,.0f} | Mean: ${row['mean']:,.0f} | Count: {int(row['count'])}")
print('Answer: NO')



Salary comparison: Master's degree vs College degree

Master's degree | Median: $80,000 | Mean: $91,780 | Count: 7399
College degree  | Median: $75,000 | Mean: $87,097 | Count: 11221
Answer: NO
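
The `Answer: NO` above is hard-coded; the cell compares the two medians but does not test whether the difference is statistically significant. One way to check is a permutation test on the difference in medians. The sketch below uses only the standard library; `permutation_p_value` and the toy inputs are hypothetical, and the result for the survey data is not computed here.

```python
import random
from statistics import median


def permutation_p_value(a, b, n_iter=2000, seed=0):
    """Two-sided permutation test for a difference in medians between a and b."""
    rng = random.Random(seed)
    observed = abs(median(a) - median(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        # Reassign group labels at random and recompute the median gap
        rng.shuffle(pooled)
        diff = abs(median(pooled[:len(a)]) - median(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_iter + 1)


# Toy inputs: identical groups versus clearly separated groups
p_same = permutation_p_value(list(range(20)), list(range(20)))
p_diff = permutation_p_value(list(range(100, 120)), list(range(20)))
print(f"identical groups: p = {p_same:.3f}")
print(f"separated groups: p = {p_diff:.4f}")
```

To apply it, pass the `salary_numeric` values for each education group as lists. A small p-value would indicate that a gap as large as the observed one is unlikely under random labeling. With thousands of respondents per group, even a modest gap may turn out to be significant, so practical size and statistical significance are worth reporting separately.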


In [72]:
# Question 8: Which company size (startup, medium, large) pays the most on average?

# --- 1️⃣ Function to infer company size ---
def infer_company_size(text):
    if pd.isna(text) or not str(text).strip():
        return 'Unknown'
    
    text = str(text).lower()
    
    # Keywords for small/startup
    small_keywords = ['startup', 'early-stage', 'small team', 'founder', 'small company', 'bootstrapped', 'seed round']
    if any(word in text for word in small_keywords):
        return 'Startup'
    
    # Keywords for medium
    medium_keywords = ['mid-size', 'medium', 'growing company', 'scaling company', 'series b', 'series c']
    if any(word in text for word in medium_keywords):
        return 'Medium'
    
    # Keywords for large
    large_keywords = ['large company', 'enterprise', 'fortune 500', 'corporation', 'multinational', 'public company', 'global company']
    if any(word in text for word in large_keywords):
        return 'Large'
    
    return 'Unknown'

# --- 2️⃣ Apply function ---
df_final['company_size'] = df_final['job_context'].apply(infer_company_size)

# --- 3️⃣ Compute average, median, and count ---
size_salary_stats = df_final[df_final['company_size'] != 'Unknown'].groupby('company_size')['salary_numeric'].agg(
    count='count',
    median='median',
    mean='mean'
).sort_values('median', ascending=False)

# --- 4️⃣ Print nicely ---
print(f"{'Company Size':12} | {'Count':>5} | {'Median Salary':>14} | {'Average Salary':>15}")
print("-" * 55)
for size, row in size_salary_stats.iterrows():
    # iterrows() upcasts each row to float, so cast the count back to int for display
    print(f"{size:12} | {int(row['count']):>5} | ${row['median']:>13,.0f} | ${row['mean']:>14,.0f}")


Company Size | Count |  Median Salary |  Average Salary
-------------------------------------------------------
Large        |    32 | $      101,250 | $       117,578
Startup      |    53 | $       89,000 | $       114,340
Medium       |    21 | $       75,000 | $        80,541


## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $131,000
2. **Highest paying US state for tech:** California (average salary $147,461)
3. **Salary increase per year of experience:** $9,278.57 (tech roles, estimated as the mean difference between median salaries at consecutive experience levels)
4. **Remote vs office percentage:** In-Office 21.36%, Remote 0.67%, Hybrid 0.05%; the remaining 77.92% could not be classified
5. **Highest paying non-tech industry:** Commercial Building Material Distribution (median salary $400,000)


**Key insights:**
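- Insight 1 - A Master's degree is associated with only a modest salary difference: a median of $80,000 versus $75,000 for a College degree.
- Insight 2 - Salary gaps between men and women persist even within the same job title; the largest mean gap was $382,509 (Program Associate).
- Insight 3 - Certain industries outside tech, such as Commercial Building Material Distribution and Sports, show high median salaries despite smaller representation in the dataset.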

**Challenges faced:**
- Challenge 1 - Making inferences from job context alone was hard and time consuming. I built a keyword list, checked it against the actual data, and refined the list repeatedly.
- Challenge 2 - Cursor deleted some of my previous work and I was unable to fully restore it. Next time, I will be more specific in my prompts.

**What you learned about vibe coding:**
- Learning 1 - While vibe coding can handle repetitive tasks, decisions like what counts as “remote” or “large company” still need human judgment.
- Learning 2 - The first output is rarely perfect, so you learn to refine instructions and outputs, much like debugging a conversation.
- Learning 3 - Sometimes it produces overly detailed code, so I have learned to extract only what’s relevant.
