# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

*(Paste your Cursor todo list here)*

- [ ] Example todo item
- [ ] Another example
- [ ] ...


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
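
A first look at the data might resemble the sketch below. It uses a tiny invented frame (`sample`) rather than the real survey file, so the column values here are assumptions for illustration only; the same calls apply to the loaded DataFrame.

```python
import pandas as pd

# Toy stand-in for the survey; the values are invented for illustration.
sample = pd.DataFrame({
    "salary": ["55,000", "120000", None, "34,000"],
    "currency": ["USD", "USD", "GBP", "USD"],
    "country": ["United States", "USA", "UK", "us "],
})

# Size and column types: a numeric-looking column stored as text is a red flag.
print(sample.shape)
print(sample.dtypes)

# Missing values per column.
missing = sample.isna().sum()
print(missing)

# Distinct spellings in a categorical-looking column reveal inconsistent entries.
country_counts = sample["country"].value_counts(dropna=False)
print(country_counts)
```

Running `value_counts()` on each free-text column before filtering is a cheap way to see which spellings a later filter would miss.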


## Step 2: Data Cleaning


In [None]:
# Load the tab-separated survey export (hence sep="\t").
df = pd.read_csv('/workspaces/ds-fall-2025-fri-1230/Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv', sep="\t")

# Replace the long survey-question headers with short snake_case names.
# The assignment is positional, so the list must match the file's 18 columns in order.
df.columns = [
    "timestamp", "age", "industry", "job_title", "job_context",
    "salary", "additional_comp", "currency", "currency_other",
    "income_context", "country", "state", "city",
    "experience_overall", "experience_field", "education",
    "gender", "race"
]
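
One cleaning step worth considering before any analysis is salary parsing. The sketch below assumes, as an illustration only, that some salary entries contain thousands separators or stray whitespace; the frame `raw` and its values are invented. It compares a direct `pd.to_numeric` call with one that strips separators first.

```python
import pandas as pd

# Invented examples of how salary text might look in a survey export.
raw = pd.DataFrame({"salary": ["55,000", "120000", " 34,000 ", "n/a", None]})

# Direct conversion: anything that is not a plain number becomes NaN.
naive = pd.to_numeric(raw["salary"], errors="coerce")

# Strip thousands separators and whitespace first, then convert.
cleaned = raw["salary"].str.replace(",", "", regex=False).str.strip()
raw["salary_num"] = pd.to_numeric(cleaned, errors="coerce")

print(naive.notna().sum(), "values survive the direct conversion")
print(raw["salary_num"].notna().sum(), "values survive after stripping commas")
```

Comparing the two counts on the real column shows how many rows `errors="coerce"` would otherwise silently discard.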

## Step 3: Business Questions Analysis

Now answer those important business questions!


In [None]:
# Question 1: What is the median salary for Software Engineers in the United States?
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")
df["country"] = df["country"].astype(str).str.strip().str.lower()
q1 = df[
    (df["country"] == "united states") &
    (df["job_title"].str.contains("software engineer", case=False, na=False))
]
median_salary = q1["salary"].median()
print(f"Median Software Engineer Salary (U.S.): ${median_salary:,.0f}")

Median Software Engineer Salary (U.S.): $140,000
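
The filter above keeps only rows whose country is exactly `united states`. A hedged sketch of a broader match follows; the alias set `US_ALIASES` and the sample rows are hypothetical and would need to be built from `value_counts()` on the real column.

```python
import pandas as pd

# Invented rows showing spellings a free-text country field might contain.
people = pd.DataFrame({
    "country": ["United States", "USA", "us", "U.S.", "United States of America", "Canada"],
    "salary": [100000, 120000, 90000, 110000, 130000, 80000],
})

# Hypothetical alias set; extend it after inspecting the real column.
US_ALIASES = {"united states", "usa", "us", "united states of america", "america"}

normalized = (
    people["country"]
    .str.strip()
    .str.lower()
    .str.replace(".", "", regex=False)  # "u.s." becomes "us"
)
people["is_us"] = normalized.isin(US_ALIASES)

exact_match = int((normalized == "united states").sum())
alias_match = int(people["is_us"].sum())
print(f"Exact match keeps {exact_match} rows; alias match keeps {alias_match}")
```

If the two counts differ substantially on the real data, the median computed from the exact match rests on only part of the US respondents.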


In [None]:
# Question 2: Which US state has the highest average salary for tech workers?
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")
df = df.dropna(subset=["salary"])
df["state"] = df["state"].astype(str).str.strip()
tech_roles = df[
    (df["country"] == "united states") &
    (df["job_title"].str.contains("engineer|developer|programmer|data|it", case=False, na=False))
]

state_avg = tech_roles.groupby("state")["salary"].mean().sort_values(ascending=False)
top_state = state_avg.head(1)
print("Top State for Tech Salaries:")
print(top_state)

Top State for Tech Salaries:
state
California, Colorado    176000.0
Name: salary, dtype: float64
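
Two things in the result above deserve a second look: the pattern `it` matches any title containing those letters, and `California, Colorado` is a multi-state entry rather than a state. The sketch below shows one possible way to tighten both, using invented rows; the keyword list is the one from the cell above, and the single-state rule is an assumption about how such rows should be treated.

```python
import pandas as pd

# Invented rows for illustration.
jobs = pd.DataFrame({
    "job_title": ["IT Manager", "Senior Editor", "Data Engineer", "Grant Writer", "Web Developer"],
    "state": ["Ohio", "New York", "California, Colorado", "Texas", "California"],
    "salary": [90000, 70000, 176000, 60000, 130000],
})

# Bare "it" matches inside other words such as "Editor" and "Writer".
loose = jobs["job_title"].str.contains(
    "engineer|developer|programmer|data|it", case=False, na=False
)

# \b anchors each keyword to word boundaries, so only the standalone word "IT" matches.
strict_pattern = r"\b(?:engineer|developer|programmer|data|it)\b"
strict = jobs["job_title"].str.contains(strict_pattern, case=False, na=False)

# Keep rows naming exactly one state, so "California, Colorado" forms no group of its own.
single_state = ~jobs["state"].str.contains(",", na=False)

state_avg = (
    jobs[strict & single_state]
    .groupby("state")["salary"]
    .mean()
    .sort_values(ascending=False)
)
print(state_avg)
```

Word boundaries also exclude variants such as "Engineering", so the pattern may need extra keywords once the real titles have been inspected.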


In [None]:
# Question 3: How much does salary increase on average for each year of experience in tech?
from scipy.stats import linregress  # linregress was used without being imported

tech = tech_roles.copy()

# "experience_overall" is the column name assigned in Step 2. Its values are text,
# so the first number in each label is used as a rough proxy for years.
tech["years_numeric"] = (
    tech["experience_overall"].astype(str).str.extract(r"(\d+)", expand=False).astype(float)
)

tech = tech.dropna(subset=["years_numeric", "salary"])

# The slope of the least-squares line is the average salary change per additional year.
slope, intercept, r, p, se = linregress(tech["years_numeric"], tech["salary"])
print(f"Q3: Average salary increase per year of experience: ${slope:,.0f}")
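
If the experience column holds range labels, taking the midpoint of each range is one alternative to taking the first number. The sketch below assumes labels shaped like `"5-7 years"` and uses invented salaries; `range_midpoint` is a hypothetical helper, and `np.polyfit` stands in for `linregress` to avoid a SciPy dependency.

```python
import re

import numpy as np
import pandas as pd

# Invented rows; the label format is an assumption about the survey's answer options.
tech_sample = pd.DataFrame({
    "experience_overall": ["1 year or less", "2 - 4 years", "5-7 years", "8 - 10 years", "11 - 20 years"],
    "salary": [60000, 75000, 90000, 105000, 150000],
})


def range_midpoint(label: str) -> float:
    """Average the numbers found in a label; NaN when the label has none."""
    numbers = [float(n) for n in re.findall(r"\d+", label)]
    return sum(numbers) / len(numbers) if numbers else float("nan")


tech_sample["years_numeric"] = tech_sample["experience_overall"].map(range_midpoint)

# A degree-1 polynomial fit returns (slope, intercept) of the least-squares line.
slope, intercept = np.polyfit(tech_sample["years_numeric"], tech_sample["salary"], 1)
print(f"Estimated salary change per year of experience: ${slope:,.0f}")
```

Open-ended labels such as "or less" and "or more" contain a single number, so they map to that number; whether that is acceptable is a judgment call.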

In [None]:

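# Question 4: What percentage of respondents work remotely vs. in-office?
# None of the columns named in Step 2 records work location, so this searches the
# free-text context columns. Every row starts as "in-office" and stays that way
# unless a keyword matches.
df["work_mode"] = "in-office"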
df.loc[
    df["income_context"].str.contains("remote|home|hybrid|telecommute", case=False, na=False) |
    df["job_context"].str.contains("remote|home|hybrid|telecommute", case=False, na=False),
    "work_mode"
] = "remote"
remote_counts = df["work_mode"].value_counts(normalize=True) * 100
print(remote_counts)

work_mode
in-office    99.600515
remote        0.399485
Name: proportion, dtype: float64
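
Because every unmatched row defaults to `in-office`, the 99.6% figure mixes genuine office workers with respondents who simply said nothing about location. A sketch of a three-way split follows; the keyword patterns and sample rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows; most real respondents may leave both fields empty.
responses = pd.DataFrame({
    "job_context": ["Fully remote team", None, "On-site at the plant", None],
    "income_context": [None, "Work from home stipend", None, None],
})

# Hypothetical keyword patterns.
remote_words = r"remote|work from home|hybrid|telecommut"
onsite_words = r"on-?site|in[- ]office|in person"

# Combine both free-text columns so each row is searched once.
text = responses["job_context"].fillna("") + " " + responses["income_context"].fillna("")

is_remote = text.str.contains(remote_words, case=False)
is_onsite = text.str.contains(onsite_words, case=False)

# Rows matching neither pattern are labelled "unknown" instead of "in-office".
responses["work_mode"] = np.select(
    [is_remote, is_onsite], ["remote", "in-office"], default="unknown"
)
print(responses["work_mode"].value_counts(normalize=True) * 100)
```

A large `unknown` share would indicate that the data cannot answer the question reliably.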


In [None]:
# Question 5: Which industry (besides tech) has the highest median salary?
industry_salaries = (
    df.groupby("industry")["salary"]
    .median()
    .sort_values(ascending=False)
)
non_tech = industry_salaries[~industry_salaries.index.str.contains("tech", case=False, na=False)]
print("Q5 Highest non-tech industry median salary:")
print(non_tech.head(1))

Q5 Highest non-tech industry median salary:
industry
ESL Teacher    36000000.0
Name: salary, dtype: float64
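
A median of $36,000,000 for a single free-text industry suggests an unfiltered outlier, possibly a non-USD amount or a group with very few respondents. The sketch below shows one possible guard: restrict to one currency and require a minimum group size. The rows and the `MIN_RESPONSES` threshold are invented for illustration.

```python
import pandas as pd

# Invented rows for illustration.
survey = pd.DataFrame({
    "industry": ["Law", "Law", "Law", "Health care", "Health care", "Health care",
                 "ESL Teacher", "Computing or Tech"],
    "currency": ["USD", "USD", "USD", "USD", "USD", "USD", "Other", "USD"],
    "salary": [150000, 120000, 90000, 80000, 70000, 95000, 36000000, 140000],
})

MIN_RESPONSES = 3  # hypothetical threshold; choose one that suits the real group sizes

# Compare like with like: keep a single currency before ranking.
usd = survey[survey["currency"] == "USD"]
non_tech = usd[~usd["industry"].str.contains("tech", case=False, na=False)]

# Compute the median alongside the group size so tiny groups can be dropped.
stats = non_tech.groupby("industry")["salary"].agg(median="median", n="count")
ranked = stats[stats["n"] >= MIN_RESPONSES].sort_values("median", ascending=False)
print(ranked.head(3))
```

Printing the group size next to each median makes it easier to judge whether the top result is meaningful.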


In [None]:
# Bonus Questions:
# Question 6: What's the salary gap between men and women in similar roles?
# Question 7: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 8: Which company size (startup, medium, large) pays the most on average?

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $X
2. **Highest paying US state for tech:** State Name
3. **Salary increase per year of experience:** $X per year
4. **Remote vs office percentage:** X% remote, Y% office
5. **Highest paying non-tech industry:** Industry Name

**Key insights:**
- Insight 1
- Insight 2
- Insight 3

**Challenges faced:**
- Challenge 1 and how you solved it
- Challenge 2 and how you solved it

**What you learned about vibe coding:**
- Learning 1
- Learning 2
- Learning 3


In [None]:
# Summarize your findings here:
# Q1) Median Software Engineer salary (U.S.): $140,000
# Q2) Top state for tech salaries: "California, Colorado" at $176,000
# Q3) Average salary increase per year of experience: $5,000
# Q4) In-office: 99.60%, remote: 0.40%
# Q5) Highest non-tech industry median salary: ESL Teacher at $36,000,000

# Key insights:
# Cleaning data makes it easier to work with.
# Understanding the context of the data is crucial for accurate analysis.
# Understanding the mathematical concepts behind statistical methods is important.

# Challenges Faced:
# Handling missing or inconsistent data entries: I cleaned the data to make it clearer.
# Reading a TSV file instead of a CSV file: I specified the tab character as the separator when reading the file.

# What I learned:
# I learned that vibe coding is fun and makes coding more efficient than regular coding.
# I learned that AI should be used as a tool and not a crutch.
# I learned that having a strong foundation in math and statistics is crucial for data analysis.