# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.
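
The 5% criterion can be checked with a small helper. This is a minimal sketch: the function name `within_tolerance` and the numbers in the usage lines are made up for illustration, and the official expected values are not known here.

```python
def within_tolerance(answer, expected, tolerance=0.05):
    """Return True if `answer` is within `tolerance` (relative) of `expected`."""
    return abs(answer - expected) <= tolerance * abs(expected)


# Hypothetical values, only to show the arithmetic:
print(within_tolerance(104_000, 100_000))  # 4% off
print(within_tolerance(110_000, 100_000))  # 10% off
```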


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

# My Data Cleaning Plan - Step 0

Based on my analysis of the Ask A Manager Salary Survey dataset, here's my comprehensive data cleaning plan:

## Dataset Overview
- **Size**: ~28,000 responses from 2021 salary survey
- **Format**: TSV file with 18 columns
- **Key Challenges**: Multiple currencies, inconsistent formatting, messy job titles, various country/state formats

## Data Quality Issues Identified
1. **Salary Data**: Multiple currencies (USD, GBP, CAD), inconsistent formatting, potential outliers
2. **Location Data**: Inconsistent country labels for the United States ("US", "USA", "United States"), missing or inconsistent state values
3. **Job Titles**: Highly variable, need to identify "Software Engineers" and "tech workers"
4. **Experience**: Range format ("5-7 years", "8-10 years") needs conversion to numeric
5. **Industry**: Need to categorize tech vs non-tech industries
6. **Education/Gender**: For bonus questions, need clean categories

## Step-by-Step Cleaning Plan

### Phase 1: Data Exploration & Setup
1. Load dataset and examine structure, data types, missing values
2. Identify all unique currencies and conversion rates needed
3. Map out all unique job titles, industries, and location formats
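
A minimal sketch of Phase 1, assuming pandas is available. The inline sample stands in for the real TSV, and the short column names (`Job title`, `Annual salary`, `Currency`, `Country`) are placeholders rather than the survey's actual headers.

```python
import io

import pandas as pd

# Tiny stand-in for the real file; column names are shortened placeholders.
sample_tsv = (
    "Timestamp\tJob title\tAnnual salary\tCurrency\tCountry\n"
    "4/27/2021 11:02:10\tSoftware Engineer\t120,000\tUSD\tUSA\n"
    "4/27/2021 11:02:22\tLibrarian\t54000\tGBP\tUnited Kingdom\n"
    "4/27/2021 11:02:38\tData Analyst\t\tUSD\tus\n"
)

# For the real dataset, pass the file path instead of the StringIO buffer.
# Reading everything as strings avoids silent type guessing on messy columns.
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", dtype=str)

print(df.shape)                                   # rows, columns
print(df.isna().sum())                            # missing values per column
print(df["Currency"].value_counts(dropna=False))  # which currencies appear
print(df["Country"].str.strip().str.lower().value_counts())  # country spellings
```

Looking at `value_counts()` on the free-text columns first gives a quick picture of how many spellings need to be mapped before any filtering happens.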

### Phase 2: Core Data Cleaning
1. **Salary Standardization**:
   - Convert all salaries to USD using 2021 exchange rates
   - Handle missing/zero salaries appropriately
   - Remove extreme outliers (likely data entry errors)
   - Add bonus compensation to total salary

2. **Location Standardization**:
   - Standardize country names to "United States"
   - Clean and standardize US state names
   - Filter to US-only data for core questions

3. **Job Title Categorization**:
   - Create "Software Engineer" category (exact matches + variations)
   - Create "Tech Worker" category (all computing/tech industry roles)
   - Handle job title variations and context

4. **Experience Conversion**:
   - Convert experience ranges to midpoint numeric values
   - Handle edge cases and missing data
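
The salary, location, and job title steps above could be sketched as follows. The column names (`job_title`, `salary`, `bonus`, `country`), the alias set, and the regex are all assumptions for illustration; the real data will need a longer alias list and a judgment call on which titles count.

```python
import pandas as pd

# Made-up rows that mimic the kinds of mess described in the plan.
df = pd.DataFrame({
    "job_title": ["Software Engineer", "Senior software engineer II",
                  "Librarian", "Engineering Manager"],
    "salary": ["120,000", "95000", "54,000", None],
    "bonus": ["10,000", None, "", "5000"],
    "country": ["USA", " united states ", "U.S.", "Canada"],
})


def to_number(series):
    """Strip commas and whitespace, then coerce to float (bad values become NaN)."""
    cleaned = series.astype(str).str.replace(",", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")


# Salary standardization: base salary plus bonus, treating a missing bonus as 0.
df["total_comp"] = to_number(df["salary"]) + to_number(df["bonus"]).fillna(0)

# Location standardization: collapse common spellings of the United States.
US_ALIASES = {"us", "usa", "u.s.", "u.s.a.", "united states",
              "united states of america"}
df["is_us"] = df["country"].str.strip().str.lower().isin(US_ALIASES)

# Job title categorization: flag titles containing "software engineer".
df["is_swe"] = df["job_title"].str.contains(
    r"software\s+engineer", case=False, na=False
)

print(df[["job_title", "total_comp", "is_us", "is_swe"]])
```

A missing base salary stays `NaN` here on purpose, so those rows can be dropped explicitly later instead of being counted as zero.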

### Phase 3: Analysis Preparation
1. **Industry Classification**:
   - Identify tech vs non-tech industries
   - Standardize industry names

2. **Education & Gender Cleaning**:
   - Standardize education levels for bonus questions
   - Clean gender categories
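
A rough sketch of the Phase 3 classification, assuming a tech flag built from the "Computing or Tech" industry plus a keyword match on job titles. The keyword pattern, the column names, and the education labels in `EDU_MAP` are illustrative guesses and should be checked against the values actually present in the data.

```python
import pandas as pd

df = pd.DataFrame({
    "industry": ["Computing or Tech", "Education (Higher Education)",
                 "computing or tech ", "Nonprofits"],
    "job_title": ["Product Manager", "Web Developer",
                  "Data Scientist", "Program Director"],
    "education": ["Master's degree", "College degree", "PhD", None],
})

# Hypothetical keyword list; extend it after reviewing the real job titles.
TECH_TITLE_PATTERN = r"software|developer|data scientist|devops|programmer"

in_tech_industry = df["industry"].str.strip().str.lower().eq("computing or tech")
has_tech_title = df["job_title"].str.contains(
    TECH_TITLE_PATTERN, case=False, na=False
)
df["is_tech"] = in_tech_industry | has_tech_title

# Map raw education labels to the two groups needed for the bonus question.
# Labels not in the map (and missing values) become NaN.
EDU_MAP = {"college degree": "Bachelor's", "master's degree": "Master's"}
df["education_clean"] = df["education"].str.strip().str.lower().map(EDU_MAP)

print(df[["industry", "job_title", "is_tech", "education_clean"]])
```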

### Phase 4: Business Question Analysis
1. **Question 1**: Median salary for Software Engineers in US
2. **Question 2**: US state with highest average tech worker salary  
3. **Question 3**: Salary increase per year of experience in tech
4. **Question 4**: Highest median salary non-tech industry
5. **Question 5**: Gender salary gap in tech (bonus)
6. **Question 6**: Education level salary comparison (bonus)
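
Once the data is clean, the first three questions reduce to a median, a grouped mean, and a fitted slope. The sketch below uses a made-up cleaned table with assumed column names (`state`, `is_swe`, `is_tech`, `years_exp`, `salary_usd`). Using `np.polyfit` for Question 3 is one possible reading of "salary increase per year", not the only one.

```python
import numpy as np
import pandas as pd

# Synthetic, already-cleaned data (US-only, USD); values are invented.
clean = pd.DataFrame({
    "state": ["CA", "NY", "CA", "NY", "TX"],
    "is_swe": [True, True, True, False, False],
    "is_tech": [True, True, True, True, True],
    "years_exp": [2.0, 4.0, 6.0, 8.0, 10.0],
    "salary_usd": [100000.0, 110000.0, 120000.0, 130000.0, 140000.0],
})

# Question 1: median salary for Software Engineers.
median_swe = clean.loc[clean["is_swe"], "salary_usd"].median()

# Question 2: state with the highest average salary among tech workers.
tech = clean[clean["is_tech"]]
state_means = tech.groupby("state")["salary_usd"].mean().sort_values(ascending=False)
top_state = state_means.index[0]

# Question 3: slope of a straight-line fit of salary against years of experience.
slope, intercept = np.polyfit(tech["years_exp"], tech["salary_usd"], deg=1)

print(median_swe, top_state, round(slope, 2))
```

On real data, a state with only one or two respondents can top the ranking by chance, so a minimum group size per state is worth considering before reporting Question 2.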

## Key Business Decisions
- **Currency Conversion**: Use approximate 2021 average exchange rates (1 GBP ≈ 1.37 USD, 1 CAD ≈ 0.80 USD)
- **Outlier Handling**: Remove salaries <$20k or >$500k as likely errors
- **Tech Definition**: "Computing or Tech" industry + specific tech job titles
- **Experience Mapping**: Use range midpoints (e.g., "5-7 years" = 6 years)
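
These decisions translate fairly directly into code. A minimal sketch, assuming numeric `salary`, a `currency` code column, and an `experience` label column (all placeholder names). Open-ended labels such as "1 year or less" are left as `NaN` here because the plan does not say how to treat them.

```python
import re

import pandas as pd

# Approximate 2021 average rates from the plan: USD per one unit of currency.
USD_PER_UNIT = {"USD": 1.0, "GBP": 1.37, "CAD": 0.80}


def experience_midpoint(label):
    """Return the midpoint of a range label such as '5-7 years', else NaN."""
    if not isinstance(label, str):
        return float("nan")
    numbers = [int(n) for n in re.findall(r"\d+", label)]
    if len(numbers) == 2:
        return sum(numbers) / 2
    return float("nan")  # open-ended labels need a separate judgment call


df = pd.DataFrame({
    "salary": [100000, 50000, 15000, 60000, 900000],
    "currency": ["USD", "GBP", "USD", "CAD", "USD"],
    "experience": ["5-7 years", "8-10 years", "1 year or less",
                   "2 - 4 years", None],
})

# Currency conversion: currencies missing from the table become NaN.
df["salary_usd"] = df["salary"] * df["currency"].map(USD_PER_UNIT)

# Outlier handling: keep salaries from $20k to $500k (inclusive).
in_range = df["salary_usd"].between(20_000, 500_000)

# Experience mapping: range label to numeric midpoint.
df["years_exp"] = df["experience"].map(experience_midpoint)

filtered = df[in_range]
print(filtered[["salary_usd", "years_exp"]])
```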

## Success Criteria
- Final answers within 5% of expected values
- Clean, reproducible analysis
- Clear documentation of cleaning decisions

This plan addresses the real-world messiness of survey data while ensuring we can answer the specific business questions accurately.

## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## Step 2: Data Cleaning


## Step 3: Business Questions Analysis

Now answer those important business questions!


In [None]:
# Question 1: What is the median salary for Software Engineers in the United States?


In [None]:
# Question 2: Which US state has the highest average salary for tech workers?


In [None]:
# Question 3: How much does salary increase on average for each year of experience in tech?

In [None]:
# Question 4: Which industry (besides tech) has the highest median salary?

In [None]:
# Bonus Questions:
# Question 5: What's the salary gap between men and women in tech roles?
# Question 6: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $X
2. **Highest paying US state for tech:** State Name
3. **Salary increase per year of experience:** $X per year
4. **Highest paying non-tech industry:** Industry Name

**Key insights:**
- Insight 1
- Insight 2
- Insight 3

**Challenges faced:**
- Challenge 1 and how you solved it
- Challenge 2 and how you solved it

**What you learned about vibe coding:**
- Learning 1
- Learning 2
- Learning 3
