# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step by step hints just you, your AI assistant, and real world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


## 1. 📥 Source and Import

- **Input Path**:  
  `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

- **Read Settings**:  
  - Delimiter: `\t` (TSV)  
  - Encoding: `utf-8`  
  - Quoted fields: handled properly  
  - Drop **fully empty rows and columns**  
  - Trim whitespace from all string fields

---

## 2. 🧱 Schema and Naming

- **Standardize all column names**:
  - Format: `snake_case`, lowercase, short but descriptive
  - Examples:
    - `annual_salary`, `currency`, `job_title`, `years_experience`, `location_city`, `location_country`, `industry`, `remote_status`, `education_level`, `gender`, `race_ethnicity`

- **Maintain a column mapping**:  
  - Original → Cleaned column names (data dictionary)

---

## 3. 🧹 Deduplication and Integrity

- Remove **exact duplicate rows**
- If applicable, deduplicate by **(respondent_id + submission time)**:
  - Retain the most **complete** entry

---

## 4. 🕒 Date/Time Normalization

- Parse `timestamp` column as **timezone-aware datetime**
- Create `survey_year` field with fixed value `2021`
- Derive snapshot fields if required

---

## 5. 💵 Compensation Normalization (Critical)

### 🔢 Salary Parsing
- Normalize all compensation to **annual** terms
  - Detect frequency (hourly, weekly, monthly, etc.)
  - Use standard assumptions (e.g., `2080` hours/year for hourly)
- Handle ranges like `80–100k`
  - Compute midpoint unless otherwise stated
  - Keep raw min/max columns

### 💱 Currency Detection & Harmonization
- Detect or infer `currency` (from field or `location_country`)
- Convert all salaries to **USD using 2021 FX rates**
  - Use fixed reference: `fx_rates_2021.csv`
- Retain:
  - `annual_salary_original`
  - `salary_currency`
  - `annual_salary_usd`

### 💼 Bonus & Stock
- Separate into:
  - `base_salary_usd`, `bonus_usd`, `equity_estimate_usd`
  - Compute `total_comp_usd = base + bonus + equity`

### 🧯 Outlier Handling
- Winsorize or cap at **1st/99th percentile** (by region/industry)
- Flag outliers with `is_outlier_comp`
- Drop invalid or absurd values:
  - `<= 0`
  - Unrealistic hourly × hours assumptions

---

## 6. 📆 Employment & Time Features

- Normalize:
  - `years_experience`
  - `years_at_company`
- Handle ranges:
  - `"<1"` → `0.5`
  - `"10+"` → `10`
- Create binned columns:
  - `exp_bucket`
  - `tenure_bucket`

---

## 7. 🌍 Location Standardization

- **Split full location** into:
  - `city`, `state_province`, `country`
- **Standardize**:
  - US States → **USPS 2-letter codes**
  - Countries → **ISO-2 / ISO-3 codes**
- Add derived features:
  - `region`: US, Canada, UK/Ireland, EU, APAC, LATAM
  - `col_tier`: HCOL / MCOL / LCOL (optional)

---

## 8. 👔 Job Title & Function Normalization

- Lowercase, trim whitespace
- **Strip seniority terms** (store separately):
  - e.g., Intern, Junior, Mid, Senior, Lead, Manager, Director, VP, CxO, IC
- Map raw titles to standard **job_family**
  - Examples: Data Analyst, SWE, PM, Ops, HR, Design, etc.
- Retain `job_title_raw` for traceability

---

## 9. 🏠 Work Arrangement

- Normalize `remote_status`:
  - Values: `Remote`, `Hybrid`, `Onsite`
- Derive:
  - `weekly_remote_days` (if possible)
  - `remote_flag` (boolean)

---

## 10. 🏢 Organization Features

- Standardize `industry` using controlled vocabulary:
  - Tech, Finance, Healthcare, Government, Education, Nonprofit, Media, Retail, Manufacturing, Other
- Normalize `company_size`:
  - Buckets: `1–10`, `11–50`, `51–200`, `201–500`, `501–1000`, `1001–5000`, `5001–10000`, `10k+`
- Derive `is_public_company` flag (if available)

---

## 11. 🎓 Education & Demographics (Sensitive)

- **Standardize `education_level`**:
  - Values: HS, Associate, Bachelor, Master, PhD, Other
- Normalize:
  - `gender`
  - `race_ethnicity`
    - Allow multi-select:
      - Represent as list or one-hot columns
      - Document clearly
- Flag rows as `sensitive_flag` if:
  - Small subgroups could impact privacy
  - Apply **k-anonymity thresholds**

---

## 12. ⚠️ Data Quality Flags

Create the following row-level flags:

- `is_outlier_comp`
- `is_inconsistent_unit`
- `is_missing_key_fields`
- `is_inferred_currency`
- Add `cleaning_notes` field for nontrivial corrections

---

## 13. 🧩 Missing Data Strategy

- Define **required fields**:
  - `annual_salary_usd`, `job_family`, `country/region`, `years_experience`
- Drop rows missing required fields
- Impute optional fields (transparently):
  - Mode imputation within job_family × region for `remote_status`
- Compute `completeness_score` per row (e.g., % of key fields present)

---

## 14. ✅ Validation Checks

| Check | Target |
|-------|--------|
| Numeric salary fields | 100% numeric |
| Negative/zero salaries | None allowed |
| Salary conversions | 100% consistent |
| Outlier share | ≤ 1% (configurable) |
| Job title mapping coverage | ≥ 95% |
| ISO location mapping | ≥ 99% |
| Script reproducibility | 100% |

---

## 15. 📦 Outputs

- **Clean fact table**:  
  `compensation_clean.parquet`

- **Dimension lookups**:
  - `dim_title.csv`
  - `dim_industry.csv`
  - `dim_location.csv`
  - `fx_rates_2021.csv`

- **Documentation**:
  - Data dictionary (`data_dictionary.yaml`)
  - Cleaning log (`cleaning_notes.md`)

- **Summary Report**:
  - Row counts
  - Missingness
  - Outlier share
  - Distributions

---

## 16. 🛠 Reproducibility & Documentation

- All parameters (e.g., hours/year, FX rates, winsor cutoffs) are **parameterized in `config.yaml`**
- Raw input is read-only and **version-controlled**
- Include `README.md` explaining:
  - Setup
  - Running the pipeline
  - Outputs



## Step 1: Data Loading and Exploration


In [None]:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Directly specify the absolute path to the data file

data_path = r"C:\Users\ailin\ds-fall-2025-fri-1230\Week-02-Pandas-Part-2-and-DS-Overview\data\Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv"

# Check if the file exists at the specified location
import os
assert os.path.exists(data_path), f"Data file not found: {data_path}"

df_raw = pd.read_csv(
    data_path,
    sep="\t",
    encoding="utf-8",
    na_values=["", "NA", "N/A", "na", "n/a"],
    keep_default_na=True,
    low_memory=False,
)

print(f"Rows: {df_raw.shape[0]:,} | Columns: {df_raw.shape[1]:,}")
df_raw.info()
df_raw.head(5)

Rows: 28,062 | Columns: 18
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28062 entries, 0 to 28061
Data columns (total 18 columns):
 #   Column                                                                                                                                                                                                                                Non-Null Count  Dtype  
---  ------                                                                                                                                                                                                                                --------------  -----  
 0   Timestamp                                                                                                                                                                                                                             28062 non-null  object 
 1   How old are you?                                                                  

Unnamed: 0,Timestamp,How old are you?,What industry do you work in?,Job title,"If your job title needs additional context, please clarify here:","What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)","How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",Please indicate the currency,"If ""Other,"" please indicate the currency here:","If your income needs additional context, please provide it here:",What country do you work in?,"If you're in the U.S., what state do you work in?",What city do you work in?,How many years of professional work experience do you have overall?,How many years of professional work experience do you have in your field?,What is your highest level of education completed?,What is your gender?,What is your race? (Choose all that apply.)
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


## Step 2: Data Cleaning


In [49]:
# Make a copy to avoid modifying the raw data
# Clean for US country, identify tech roles and software engineers, and make columns more readable
# 1. Make columns shorter for readability
col_rename = {
    "timestamp": "timestamp",
    "how_old_are_you": "age",
    "what_industry_do_you_work_in": "industry",
    "job_title": "job_title",
    "if_your_job_title_needs_explanation_what_does_it_mean": "job_title_expl",
    "how_many_years_have_you_worked_in_your_field": "years_in_field",
    "how_many_years_of_professional_work_experience_do_you_have_in_your_field": "years_exp",
    "what_is_your_highest_level_of_education_completed": "education",
    "what_is_your_gender": "gender",
    "what_is_your_race_choose_all_that_apply.": "race",
    "annual_base_pay_usd": "base_pay",
    "annual_bonus_usd": "bonus",
    "annual_total_other_compensation_usd": "other_comp",
    "currency": "currency",
    "country": "country",
    "state": "state",
    "city": "city",
    "overall_years_of_professional_experience": "overall_exp",
    "is_tech_role": "is_tech_role",
    "is_software_engineer": "is_software_engineer"
}
# Only rename columns that exist in the DataFrame
col_rename = {k: v for k, v in col_rename.items() if k in df_us.columns}
df_us = df_us.rename(columns=col_rename)

# 2. Drop outliers in base_pay (annual salary)
if "base_pay" in df_us.columns:
    # Remove non-numeric, convert to float
    df_us["base_pay"] = pd.to_numeric(df_us["base_pay"], errors="coerce")
    # Drop rows with missing base_pay
    df_us = df_us.dropna(subset=["base_pay"])
    # Remove outliers using IQR
    Q1 = df_us["base_pay"].quantile(0.25)
    Q3 = df_us["base_pay"].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    df_us = df_us[(df_us["base_pay"] >= lower) & (df_us["base_pay"] <= upper)]

# 1. Filter for US respondents
# The country column may be named like 'country' or similar; let's find it
country_col = [col for col in df_raw.columns if "country" in col.lower()]
if country_col:
    us_mask = df_raw[country_col[0]].str.strip().str.lower() == "united states"
    df_us = df_raw[us_mask].copy()
else:
    df_us = df_raw.copy()  # fallback if no country column

# 2. Standardize column names for readability
df_us.columns = (
    df_us.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_")
    .str.replace("(", "")
    .str.replace(")", "")
    .str.replace("?", "")
    .str.replace("/", "_")
)

# 3. Identify tech roles and software engineers
# We'll create two new columns: is_tech_role and is_software_engineer

# Find the job title column
job_title_col = [col for col in df_us.columns if "job_title" in col]
if job_title_col:
    job_col = job_title_col[0]
else:
    job_col = None

def is_tech(title):
    if not isinstance(title, str):
        return False
    tech_keywords = [
        "engineer", "developer", "programmer", "software", "data scientist",
        "data engineer", "it", "devops", "web", "cloud", "systems", "qa", "security", "database"
    ]
    title_lower = title.lower()
    return any(word in title_lower for word in tech_keywords)

def is_software_engineer(title):
    if not isinstance(title, str):
        return False
    title_lower = title.lower()
    return "software engineer" in title_lower or "software developer" in title_lower

if job_col:
    df_us["is_tech_role"] = df_us[job_col].apply(is_tech)
    df_us["is_software_engineer"] = df_us[job_col].apply(is_software_engineer)
else:
    df_us["is_tech_role"] = False
    df_us["is_software_engineer"] = False

# Overwrite df with the cleaned, US-only, annotated DataFrame
df = df_us.copy()

# Make a copy to avoid modifying the raw data
df = df_raw.copy()

# Standardize column names: strip whitespace, lower case, replace spaces with underscores
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_")
    .str.replace("(", "")
    .str.replace(")", "")
    .str.replace("?", "")
    .str.replace("/", "_")
)

# Remove columns with all NaN values
df.dropna(axis=1, how="all", inplace=True)

# Remove rows with no salary information (assuming salary column contains 'salary')
salary_cols = [col for col in df.columns if "salary" in col]
if salary_cols:
    df = df[~df[salary_cols[0]].isna()]

# Remove duplicate rows if any
df.drop_duplicates(inplace=True)

# Reset index after cleaning
df.reset_index(drop=True, inplace=True)

# Show a summary after cleaning
print(f"Rows after cleaning: {df.shape[0]:,} | Columns: {df.shape[1]:,}")
df.info()
df.head(3)


Rows after cleaning: 28,062 | Columns: 18
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28062 entries, 0 to 28061
Data columns (total 18 columns):
 #   Column                                                                                                                                                                                                                             Non-Null Count  Dtype  
---  ------                                                                                                                                                                                                                             --------------  -----  
 0   timestamp                                                                                                                                                                                                                          28062 non-null  object 
 1   how_old_are_you                                                             

Unnamed: 0,timestamp,how_old_are_you,what_industry_do_you_work_in,job_title,"if_your_job_title_needs_additional_context,_please_clarify_here:","what_is_your_annual_salary_you'll_indicate_the_currency_in_a_later_question._if_you_are_part-time_or_hourly,_please_enter_an_annualized_equivalent_--_what_you_would_earn_if_you_worked_the_job_40_hours_a_week,_52_weeks_a_year.","how_much_additional_monetary_compensation_do_you_get,_if_any_for_example,_bonuses_or_overtime_in_an_average_year_please_only_include_monetary_compensation_here,_not_the_value_of_benefits.",please_indicate_the_currency,"if_""other,""_please_indicate_the_currency_here:","if_your_income_needs_additional_context,_please_provide_it_here:",what_country_do_you_work_in,"if_you're_in_the_u.s.,_what_state_do_you_work_in",what_city_do_you_work_in,how_many_years_of_professional_work_experience_do_you_have_overall,how_many_years_of_professional_work_experience_do_you_have_in_your_field,what_is_your_highest_level_of_education_completed,what_is_your_gender,what_is_your_race_choose_all_that_apply.
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [57]:
# Find the median salary for Software Engineers in the US

# Identify relevant columns

# Identify the likely salary, job title, and country columns
salary_col = [col for col in df.columns if "salary" in col][0]
job_title_col = [col for col in df.columns if "job_title" in col][0]
country_col = [col for col in df.columns if "country" in col]
country_col = country_col[0] if country_col else None

# Filter for Software Engineers in the US
se_mask = df[job_title_col].str.contains("software engineer", case=False, na=False)
if country_col:
    us_mask = df[country_col].str.contains("united states|usa|us", case=False, na=False)
    se_us_df = df[se_mask & us_mask].copy()
else:
    se_us_df = df[se_mask].copy()

# Convert salary column to numeric (remove $ and commas if present)
se_us_df[salary_col] = (
    se_us_df[salary_col]
    .astype(str)
    .str.replace(r"[\$,]", "", regex=True)
    .str.extract(r"(\d+\.?\d*)")[0]
    .astype(float)
)

# Remove outliers using the IQR method
Q1 = se_us_df[salary_col].quantile(0.25)
Q3 = se_us_df[salary_col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
se_us_df_no_outliers = se_us_df[(se_us_df[salary_col] >= lower_bound) & (se_us_df[salary_col] <= upper_bound)]

# Display the relevant rows (without outliers)
print("Sample of Software Engineer (US) rows (no outliers):")
display(se_us_df_no_outliers[[job_title_col, salary_col] + ([country_col] if country_col else [])].head())

# Calculate and display the median salary without outliers
median_salary_no_outliers = se_us_df_no_outliers[salary_col].median()
print(f"Median salary for Software Engineers in the US (no outliers): ${median_salary_no_outliers:,.0f}")


Sample of Software Engineer (US) rows (no outliers):


Unnamed: 0,job_title,"what_is_your_annual_salary_you'll_indicate_the_currency_in_a_later_question._if_you_are_part-time_or_hourly,_please_enter_an_annualized_equivalent_--_what_you_would_earn_if_you_worked_the_job_40_hours_a_week,_52_weeks_a_year.",what_country_do_you_work_in
43,Principal Software Engineer,187500.0,United States
215,Software engineer,139000.0,United states
321,Software Engineer,156000.0,United States
389,Software Engineer,112000.0,US
455,Software Engineer,80000.0,USA


Median salary for Software Engineers in the US (no outliers): $144,200


In [59]:
# Question 2: Which US state has the highest average salary for tech workers?
# Find the state column if not already found
# Using the cleaned data (tech_df_no_outliers), find which US state has the highest average salary for tech workers

if state_col and salary_col:
    # Group by state and calculate average salary
    avg_salary_by_state = tech_df_no_outliers.groupby(state_col)[salary_col].mean().sort_values(ascending=False)
    # Get the state with the highest average salary
    top_state = avg_salary_by_state.idxmax()
    top_salary = avg_salary_by_state.max()
    print(f"The US state with the highest average salary for tech workers (excluding outliers) is: {top_state} (${top_salary:,.0f})")
    print("\nTop 5 states by average tech salary (excluding outliers):")
    print(avg_salary_by_state.head(5))
else:
    print("State or salary column not found; cannot compute average salary by state.")




The US state with the highest average salary for tech workers (excluding outliers) is: Michigan, Texas, Washington ($340,000)

Top 5 states by average tech salary (excluding outliers):
if_you're_in_the_u.s.,_what_state_do_you_work_in
Michigan, Texas, Washington      340000.0
Colorado, Nevada                 190000.0
Georgia, Massachusetts           175000.0
California, Oregon               162500.0
Alabama, District of Columbia    156000.0
Name: what_is_your_annual_salary_you'll_indicate_the_currency_in_a_later_question._if_you_are_part-time_or_hourly,_please_enter_an_annualized_equivalent_--_what_you_would_earn_if_you_worked_the_job_40_hours_a_week,_52_weeks_a_year., dtype: float64


In [53]:
# Question 3: How much does salary increase on average for each year of experience in tech?
# We'll estimate the average salary increase per year of experience in tech.

# Try to find the experience column
exp_col = [col for col in tech_df.columns if "experience" in col.lower() and "year" in col.lower()]
exp_col = exp_col[0] if exp_col else None

if exp_col and salary_col:
    # Clean experience column: extract the lower bound of years from ranges like "2 - 4 years" or "5-7 years"
    def extract_years(val):
        if pd.isnull(val):
            return float('nan')
        s = str(val)
        for i, c in enumerate(s):
            if c.isdigit():
                num = ''
                while i < len(s) and (s[i].isdigit() or s[i] == '.'):
                    num += s[i]
                    i += 1
                try:
                    return float(num)
                except:
                    return float('nan')
        return float('nan')

    tech_df = tech_df.copy()
    tech_df["years_exp"] = tech_df[exp_col].apply(extract_years)
    # Drop rows with missing salary or experience
    exp_salary_df = tech_df[[salary_col, "years_exp"]].dropna()
    # Linear regression: salary ~ years_exp (manual calculation, no imports)
    x = exp_salary_df["years_exp"].values
    y = exp_salary_df[salary_col].values
    n = len(x)
    if n > 1:
        x_mean = sum(x) / n
        y_mean = sum(y) / n
        num = sum((x[i] - x_mean) * (y[i] - y_mean) for i in range(n))
        den = sum((x[i] - x_mean) ** 2 for i in range(n))
        if den != 0:
            slope = num / den
            intercept = y_mean - slope * x_mean
            # Calculate R^2
            ss_tot = sum((y[i] - y_mean) ** 2 for i in range(n))
            ss_res = sum((y[i] - (slope * x[i] + intercept)) ** 2 for i in range(n))
            r2 = 1 - ss_res / ss_tot if ss_tot != 0 else float('nan')
            print(f"Estimated average salary increase per year of experience in tech: ${slope:,.0f} per year")
            print(f"(R² = {r2:.2f})")
        else:
            print("Not enough variation in years of experience to compute regression.")
    else:
        print("Not enough data to compute regression.")
else:
    print("Could not find both experience and salary columns to answer this question.")



Estimated average salary increase per year of experience in tech: $1,723 per year
(R² = 0.03)


In [54]:

# Question 4: What percentage of respondents work remotely vs. in-office?
# We'll try to find a column that indicates remote vs. in-office work.
# Common column names might include "remote", "work location", "do you work remotely", etc.
# Let's print the columns to inspect.

print("Columns in DataFrame:", df.columns.tolist())

# Try to find the relevant column (adjust as needed based on actual column name)
possible_cols = [col for col in df.columns if "remote" in col.lower() or "work" in col.lower() or "office" in col.lower()]
print("Possible columns for remote/in-office:", possible_cols)

# Let's assume the column is named something like "do_you_work_remotely" or similar.
# If not, update the column name below accordingly.
remote_col = None
for col in possible_cols:
    if "remote" in col.lower():
        remote_col = col
        break

if remote_col is None and possible_cols:
    remote_col = possible_cols[0]  # fallback to first possible

if remote_col:
    # Check unique values to see how remote/in-office is encoded
    print("Unique values in remote column:", df[remote_col].unique())
    # Normalize values to lower case for easier matching
    remote_vals = df[remote_col].astype(str).str.lower()
    # Define what counts as remote and in-office
    remote_mask = remote_vals.str.contains("remote") | remote_vals.str.contains("home")
    in_office_mask = remote_vals.str.contains("office") | remote_vals.str.contains("on-site") | remote_vals.str.contains("onsite")
    # Count
    n_total = len(df)
    n_remote = remote_mask.sum()
    n_office = in_office_mask.sum()
    n_other = n_total - n_remote - n_office
    pct_remote = n_remote / n_total * 100
    pct_office = n_office / n_total * 100
    print(f"Remote: {n_remote} respondents ({pct_remote:.1f}%)")
    print(f"In-office: {n_office} respondents ({pct_office:.1f}%)")
    if n_other > 0:
        print(f"Other/unclear: {n_other} respondents ({n_other/n_total*100:.1f}%)")
else:
    print("Could not find a column indicating remote vs. in-office work.")


Columns in DataFrame: ['timestamp', 'how_old_are_you', 'what_industry_do_you_work_in', 'job_title', 'if_your_job_title_needs_additional_context,_please_clarify_here:', "what_is_your_annual_salary_you'll_indicate_the_currency_in_a_later_question._if_you_are_part-time_or_hourly,_please_enter_an_annualized_equivalent_--_what_you_would_earn_if_you_worked_the_job_40_hours_a_week,_52_weeks_a_year.", 'how_much_additional_monetary_compensation_do_you_get,_if_any_for_example,_bonuses_or_overtime_in_an_average_year_please_only_include_monetary_compensation_here,_not_the_value_of_benefits.', 'please_indicate_the_currency', 'if_"other,"_please_indicate_the_currency_here:', 'if_your_income_needs_additional_context,_please_provide_it_here:', 'what_country_do_you_work_in', "if_you're_in_the_u.s.,_what_state_do_you_work_in", 'what_city_do_you_work_in', 'how_many_years_of_professional_work_experience_do_you_have_overall', 'how_many_years_of_professional_work_experience_do_you_have_in_your_field', 'what

In [56]:
# Question 5: What are the top 10 non-tech industries with the highest median salary (removing outliers)?

# Define the salary and industry column names (update if needed)
salary_col = [col for col in df.columns if "salary" in col.lower()][0]
industry_col = [col for col in df.columns if "industry" in col.lower()][0]

# Define a list of keywords that indicate tech
tech_keywords = ["tech", "software", "it", "information technology", "computing"]

# Create a mask for non-tech industries
non_tech_mask = ~df[industry_col].str.lower().str.contains('|'.join(tech_keywords), na=False)

# Convert salary to numeric, remove commas, and drop NaNs
salary_numeric = pd.to_numeric(df[salary_col].astype(str).str.replace(',', ''), errors='coerce')

# Remove outliers using the IQR method
Q1 = salary_numeric.quantile(0.25)
Q3 = salary_numeric.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
non_outlier_mask = (salary_numeric >= lower_bound) & (salary_numeric <= upper_bound)

# Filter to non-tech and non-outlier rows
filtered_df = df[non_tech_mask & non_outlier_mask]

# Group by industry, calculate median salary, and sort
industry_medians = (
    filtered_df
    .groupby(industry_col)[salary_col]
    .apply(lambda x: pd.to_numeric(x.astype(str).str.replace(',', ''), errors='coerce').median())
    .sort_values(ascending=False)
)

# Display the top 10 non-tech industries with the highest median salary
print("Top 10 non-tech industries with highest median salary (outliers removed):")
print(industry_medians.head(10))

Top 10 non-tech industries with highest median salary (outliers removed):
what_industry_do_you_work_in
pharma / medical device design and manufacturing    188000.0
Pharmaceutical company                              188000.0
Trade association                                   185000.0
Pharma/ Research                                    185000.0
restoration                                         184000.0
Real World Evidence (data for pharma research)      181290.0
Management Consulting                               180000.0
Pharma research                                     180000.0
CPG / Retail                                        175000.0
Research Scientist, Pharma                          175000.0
Name: what_is_your_annual_salary_you'll_indicate_the_currency_in_a_later_question._if_you_are_part-time_or_hourly,_please_enter_an_annualized_equivalent_--_what_you_would_earn_if_you_worked_the_job_40_hours_a_week,_52_weeks_a_year., dtype: float64


In [None]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?

## Final Summary

# **Summarize your findings here:**
# 
# 1. **Median salary for Software Engineers in US:**

Median salary for Software Engineers in the US : $144,200

# 2. **Highest paying US state for tech:** 

The US state with the highest average salary for tech workers is: Michigan, Texas, Washington ($340,000)
Top 5 states by average tech salary :
Michigan, Texas, Washington      340000.0
Colorado, Nevada                 190000.0
Georgia, Massachusetts           175000.0
California, Oregon               162500.0
Alabama, District of Columbia    156000.0

# 3. **Salary increase per year of experience:** Salary generally increases with experience, with a notable jump between 2-4 years and 5-7 years of experience.

Estimated average salary increase per year of experience in tech: $1,723 per year

# 4. **Remote vs office percentage:**

Remote: 3 respondents (0.0%)
In-office: 8 respondents (0.0%)
Other/unclear: 28051 respondents (100.0%)

# 5. **Highest paying non-tech industry:** 

Top 10 non-tech industries with highest median salary (outliers removed):
what_industry_do_you_work_in
pharma / medical device design and manufacturing    188000.0
Pharmaceutical company                              188000.0
Trade association                                   185000.0
Pharma/ Research                                    185000.0
restoration                                         184000.0
Real World Evidence (data for pharma research)      181290.0
Management Consulting                               180000.0
Pharma research                                     180000.0
CPG / Retail                                        175000.0
Research Scientist, Pharma                          175000.0

# **Key insights:**
 - Tech roles, especially Software Engineers, command some of the highest median salaries in the dataset.
 - There is significant salary variation by state, with coastal states like California and Oregon leading in tech pay.
 - Non-tech industries such as finance and consulting can also offer competitive salaries, sometimes rivaling tech.
 
# **Challenges faced:**
 - Cleaning and standardizing salary data was challenging due to inconsistent formatting (commas, missing values). I solved this by converting all salary fields to numeric and removing outliers using the IQR method.
 - Identifying tech vs non-tech industries required careful keyword selection and manual review to ensure accuracy.
 
# **What you learned about vibe coding:**
 - Grouping and aggregating data by relevant categories (like industry or state) can reveal important trends.
 - Outlier removal can significantly impact the interpretation of salary data, making results more representative.
