# Vibe Coding: Real-World Data Cleaning Challenge

## The Mission

You're a Data Analyst at **TechSalary Insights**. Your manager needs answers to critical business questions, but the data is messy. Your job is to clean it and provide accurate insights.

**The catch:** You must figure out how to clean the data yourself. No step-by-step hints: just you, your AI assistant, and real-world messy data.

---

## The Dataset: Ask A Manager Salary Survey 2021

**Location:** `../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`

This is **real survey data** from Ask A Manager's 2021 salary survey with over 28,000 responses from working professionals. The data comes from this survey: https://www.askamanager.org/2021/04/how-much-money-do-you-make-4.html

**Why this dataset is perfect for vibe coding:**
- Real human responses (inconsistent formatting)
- Multiple currencies and formats  
- Messy job titles and location data
- Missing and invalid entries
- Requires business judgment calls

---

## Your Business Questions

Answer these **exact questions** with clean data. There's only one correct answer for each:

### Core Questions (Required):
1. **What is the median salary for Software Engineers in the United States?** 
2. **Which US state has the highest average salary for tech workers?**
3. **How much does salary increase on average for each year of experience in tech?**
4. **Which industry (besides tech) has the highest median salary?**

### Bonus Questions (If time permits):
5. **What's the salary gap between men and women in tech roles?**
6. **Do people with Master's degrees earn significantly more than those with Bachelor's degrees?**

**Success Criteria:** Your final answers will be compared against the "official" results. Data cleaning approaches can vary, but final numbers should be within 5% of expected values.


---
# Your Work Starts Here

## Step 0: Create Your Plan
**Before writing any code, use Cursor to create your todo plan. Then paste it here:**

## My Data Cleaning Plan

**Dataset Overview**
- **File**: `Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv`
- **Records**: 28,062 responses
- **Columns**: 18 fields

**General Issues**
1. **Rename column names to short simple column names.**
2. **Get rid of duplicate rows.**
3. **Determine how to deal with missing values.**

**Top Issues by Feature**
1. **TIMESTAMP** (Column 1) 
    - **Convert to datetime**

2. **SALARY DATA** (Columns 6 & 7) 
    - **Fix concatenated values**: `6000070000` (should be `60,000` + `70,000`)
    - **Uniformize mixed formats**: Some with commas (55,000), some without (55000)
    - **Convert to proper data type**: Stored as text, needs numeric conversion

3. **CURRENCY DATA** (Columns 8 & 9) 
    - **Make currency labels consistent, including the free-text answers given for "Other" currency.**
    - **Create a new feature that expresses salary in USD so salaries can be compared. Decide how to handle missing currency information.**

4. **LOCATION DATA** (Columns 11, 12, 13)
    - **Ensure naming is consistent**: "United States", "US", "USA", "usa"

5. **DEMOGRAPHIC DATA + JOB TITLE** (Columns 2, 4, 16, 17)
    - **Ensure everything has a consistent format**
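
The salary item in the plan above (mixed comma formats stored as text) can be sketched on a few made-up strings. The helper name `parse_salary` and the sample values are assumptions for illustration, not part of the survey file or the notebook.

```python
import pandas as pd

# Made-up examples of the formats described in the plan, not survey rows
raw = pd.Series(["55,000", "55000", " 62000 ", "n/a", None], dtype=object)

def parse_salary(series: pd.Series) -> pd.Series:
    """Drop thousands separators and whitespace, then coerce to numbers."""
    cleaned = series.str.replace(",", "", regex=False).str.strip()
    # errors="coerce" turns anything unparseable (such as "n/a") into NaN
    return pd.to_numeric(cleaned, errors="coerce")

salary = parse_salary(raw)
print(salary)
```

Coercing instead of raising keeps the row count intact, so unparseable entries can be counted and reviewed afterwards rather than silently dropped.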


## Step 1: Data Loading and Exploration

Start here! Load the dataset and get familiar with what you're working with.


In [None]:
# Import necessary libraries
import pandas as pd
from scipy import stats


In [229]:
df = pd.read_csv(
    "../../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv",
    sep="\t"
)

In [230]:
df.shape

(28062, 18)

In [231]:
df.head()

Unnamed: 0,Timestamp,How old are you?,What industry do you work in?,Job title,"If your job title needs additional context, please clarify here:","What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)","How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.",Please indicate the currency,"If ""Other,"" please indicate the currency here:","If your income needs additional context, please provide it here:",What country do you work in?,"If you're in the U.S., what state do you work in?",What city do you work in?,How many years of professional work experience do you have overall?,How many years of professional work experience do you have in your field?,What is your highest level of education completed?,What is your gender?,What is your race? (Choose all that apply.)
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


In [232]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28062 entries, 0 to 28061
Data columns (total 18 columns):
 #   Column                                                                                                                                                                                                                                Non-Null Count  Dtype  
---  ------                                                                                                                                                                                                                                --------------  -----  
 0   Timestamp                                                                                                                                                                                                                             28062 non-null  object 
 1   How old are you?                                                                                             

## Step 2: Data Cleaning


In [None]:
# DATA CLEANING SCRIPT

import pandas as pd

print("=== STARTING COMPREHENSIVE DATA CLEANING ===")

# STEP 1: Load raw data
print("\n1. Loading raw data...")
df = pd.read_csv(
    "../../Week-02-Pandas-Part-2-and-DS-Overview/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.tsv",
    sep="\t"
)
print(f"   Raw data loaded: {df.shape[0]} rows, {df.shape[1]} columns")

# STEP 2: Rename columns to short, simple names
print("\n2. Renaming columns...")
df.columns = [
    'timestamp', 'age', 'industry', 'job_title', 'job_clarification',
    'annual_salary', 'additional_compensation', 'currency', 'other_currency',
    'income_context', 'country', 'state', 'city', 'total_experience',
    'field_experience', 'education', 'gender', 'race'
]
print("   Columns renamed successfully")

# STEP 3: Handle duplicates
print("\n3. Handling duplicates...")
original_shape = df.shape
df = df.drop_duplicates()
print(f"   Removed {original_shape[0] - df.shape[0]} exact duplicates")

# STEP 4: Convert timestamp to datetime
print("\n4. Converting timestamp to datetime...")
df['timestamp'] = pd.to_datetime(df['timestamp'])
print(f"   Date range: {df['timestamp'].min()} to {df['timestamp'].max()}")

# STEP 5: Clean Salary Data
print("\n5. Cleaning salary data...")

def fix_concatenated_salary(salary_str):
    """Repair run-together entries such as 6000070000 by keeping the first half."""
    if pd.isna(salary_str):
        return salary_str
    salary_str = str(salary_str)
    if len(salary_str) > 8 and salary_str.isdigit():
        # Assume two values were typed back to back (6000070000 -> 60000 + 70000)
        # and keep only the first one; the second half is discarded.
        mid = len(salary_str) // 2
        return salary_str[:mid]
    return salary_str

# Remove commas and fix concatenated values
df['annual_salary'] = df['annual_salary'].str.replace(',', '')
df['annual_salary'] = df['annual_salary'].apply(fix_concatenated_salary)
df['additional_compensation'] = df['additional_compensation'].apply(fix_concatenated_salary)

# Convert to numeric
df['annual_salary'] = pd.to_numeric(df['annual_salary'], errors='coerce')
df['additional_compensation'] = pd.to_numeric(df['additional_compensation'], errors='coerce')

print(f"   Salary range: {df['annual_salary'].min():,.0f} to {df['annual_salary'].max():,.0f}")
print(f"   Missing salaries: {df['annual_salary'].isnull().sum()}")

# STEP 6: Clean Currency Data
print("\n6. Cleaning currency data...")

# Comprehensive currency mapping
currency_name_to_code = {
    'usd': 'USD', 'cad': 'CAD', 'gbp': 'GBP', 'eur': 'EUR', 'aud': 'AUD',
    'chf': 'CHF', 'nzd': 'NZD', 'sek': 'SEK', 'sgd': 'SGD', 'inr': 'INR',
    'zar': 'ZAR', 'jpy': 'JPY', 'mxn': 'MXN', 'brl': 'BRL', 'hkd': 'HKD',
    'dkk': 'DKK', 'nok': 'NOK', 'pln': 'PLN', 'ils': 'ILS', 'cny': 'CNY',
    'krw': 'KRW', 'rub': 'RUB', 'try': 'TRY', 'aed': 'AED', 'sar': 'SAR',
    'usd (us dollar)': 'USD', 'us dollar': 'USD', 'us dollars': 'USD',
    'united states dollar': 'USD', 'united states dollars': 'USD',
    'canadian dollar': 'CAD', 'canadian dollars': 'CAD',
    'pound sterling': 'GBP', 'british pound': 'GBP', 'british pounds': 'GBP',
    'euro': 'EUR', 'euros': 'EUR', 'australian dollar': 'AUD',
    'swiss franc': 'CHF', 'new zealand dollar': 'NZD',
    'swedish krona': 'SEK', 'singapore dollar': 'SGD',
    'indian rupee': 'INR', 'south african rand': 'ZAR',
    'japanese yen': 'JPY', 'mexican peso': 'MXN', 'brazilian real': 'BRL',
    'hong kong dollar': 'HKD', 'danish krone': 'DKK', 'norwegian krone': 'NOK',
    'polish zloty': 'PLN', 'israeli new shekel': 'ILS', 'chinese yuan': 'CNY',
    'korean won': 'KRW', 'russian ruble': 'RUB', 'turkish lira': 'TRY',
    'uae dirham': 'AED', 'saudi riyal': 'SAR', 'peso argentino': 'ARS',
    'argentinian peso (ars)': 'ARS', 'argentine peso': 'ARS',
    'br$': 'BRL', 'brl (r$)': 'BRL', 'indian rupees': 'INR',
    'inr (indian rupee)': 'INR', 'rupees': 'INR', 'mexican pesos': 'MXN',
    'php (philippine peso)': 'PHP', 'philippine peso': 'PHP',
    'philippine peso (php)': 'PHP', 'philippine pesos': 'PHP',
    'american dollars': 'USD', 'australian dollars': 'AUD',
    'aud australian': 'AUD', 'pln (polish zloty)': 'PLN',
    'polish złoty': 'PLN', 'czech crowns': 'CZK', 'croatian kuna': 'HRK',
    'norwegian kroner (nok)': 'NOK', 'danish kroner': 'DKK',
    'ils/nis': 'ILS', 'nis (new israeli shekel)': 'ILS',
    'israeli shekels': 'ILS', 'ils (shekel)': 'ILS',
    'rmb (chinese yuan)': 'CNY', 'china rmb': 'CNY',
    'taiwanese dollars': 'TWD', 'krw (korean won)': 'KRW',
    'thai baht': 'THB', 'equity': 'EQUITY', 'rm': 'MYR',
    'singapore dollara': 'SGD'
}

# USD conversion rates (approximate 2021 rates)
conversion_rates = {
    'USD': 1.0, 'CAD': 0.79, 'GBP': 1.37, 'EUR': 1.18, 'AUD': 0.74,
    'CHF': 1.09, 'NZD': 0.70, 'SEK': 0.12, 'SGD': 0.75, 'INR': 0.013,
    'ZAR': 0.069, 'JPY': 0.0091, 'MXN': 0.050, 'BRL': 0.19, 'HKD': 0.13,
    'DKK': 0.16, 'NOK': 0.12, 'PLN': 0.26, 'ILS': 0.31, 'CNY': 0.15,
    'KRW': 0.00089, 'RUB': 0.013, 'TRY': 0.12, 'AED': 0.27, 'SAR': 0.27,
    'ARS': 0.010, 'PHP': 0.018, 'CZK': 0.045, 'HRK': 0.16, 'TWD': 0.036,
    'THB': 0.030, 'MYR': 0.24, 'AUD/NZD': 0.72, 'Unknown': 1.0
}

# Clean currency data
df["currency_clean"] = df["currency"]
df.loc[df['currency_clean'] == 'Other', 'currency_clean'] = df.loc[df['currency_clean'] == 'Other', 'other_currency']

def standardize_currency(val):
    if pd.isna(val):
        return 'Unknown'
    val_str = str(val).strip().lower()
    if val_str in currency_name_to_code:
        return currency_name_to_code[val_str]
    if len(val_str) == 3 and val_str.isalpha():
        return val_str.upper()
    return 'Unknown'

df['currency_clean'] = df['currency_clean'].apply(standardize_currency)

# Create USD equivalent salary
df['salary_usd'] = df['annual_salary'] * df['currency_clean'].map(conversion_rates)
df['additional_comp_usd'] = df['additional_compensation'] * df['currency_clean'].map(conversion_rates)

print(f"   Currencies standardized: {len(df['currency_clean'].value_counts())} unique currencies")
print(f"   USD conversion created")

# STEP 7: Clean Location Data
print("\n7. Cleaning location data...")

# Comprehensive country mapping
country_mapping = {
    'us': 'United States', 'usa': 'United States', 'u.s.': 'United States',
    'u.s.a.': 'United States', 'united states of america': 'United States',
    'america': 'United States', 'united states': 'United States',
    'uk': 'United Kingdom', 'u.k.': 'United Kingdom', 'england': 'United Kingdom',
    'scotland': 'United Kingdom', 'wales': 'United Kingdom',
    'northern ireland': 'United Kingdom', 'united kingdom': 'United Kingdom',
    'republic of ireland': 'Ireland', 'eire': 'Ireland',
    'south korea': 'Korea, South', 'korea': 'Korea, South',
    'korea, republic of': 'Korea, South', 'viet nam': 'Vietnam',
    'czechia': 'Czech Republic', 'russian federation': 'Russia',
    'peoples republic of china': 'China', 'prc': 'China',
    'mainland china': 'China', 'hong kong sar': 'Hong Kong',
    'uae': 'United Arab Emirates', 'u.a.e.': 'United Arab Emirates',
    'emirates': 'United Arab Emirates', 'brasil': 'Brazil',
    'the netherlands': 'Netherlands', 'nl': 'Netherlands',
    'deutschland': 'Germany', 'sverige': 'Sweden', 'suomi': 'Finland',
    'espana': 'Spain', 'españa': 'Spain', 'österreich': 'Austria',
    'schweiz': 'Switzerland', 'suisse': 'Switzerland', 'ch': 'Switzerland',
    'méxico': 'Mexico', 'mx': 'Mexico', 'ca': 'Canada', 'canda': 'Canada',
    'au': 'Australia', 'nz': 'New Zealand', 'aotearoa': 'New Zealand'
}

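def clean_country(val):
    """Map common spellings to one canonical country name; title-case anything unmapped."""
    if pd.isna(val) or not str(val).strip():
        return 'Unknown'
    val_text = str(val).strip()  # str() guards against non-string cells
    return country_mapping.get(val_text.lower(), val_text.title())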

df['country'] = df['country'].apply(clean_country)
df['state'] = df['state'].str.strip().str.title()
df['city'] = df['city'].str.strip().str.title()

print(f"   Countries standardized: {df['country'].nunique()} unique countries")
print(f"   US states represented: {df[df['country'] == 'United States']['state'].nunique()}")

# STEP 8: Clean Demographic Data and Job Titles
print("\n8. Cleaning demographic and job data...")

df['gender'] = df['gender'].str.strip().str.title()
df['education'] = df['education'].fillna('Not specified')
df['job_title'] = df['job_title'].str.strip().str.title()
df['industry'] = df['industry'].str.strip()

print(f"   Gender categories: {df['gender'].nunique()}")
print(f"   Education levels: {df['education'].nunique()}")
print(f"   Industries: {df['industry'].nunique()}")

# STEP 9: Final Data Quality Summary
print("\n=== FINAL DATA CLEANING SUMMARY ===")
print(f"Final dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

print("\n=== MISSING VALUES ===")
missing_data = df.isnull().sum()
missing_percent = (missing_data / len(df)) * 100
missing_summary = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing %': missing_percent
}).sort_values('Missing Count', ascending=False)
print(missing_summary[missing_summary['Missing Count'] > 0])

print("\n=== SALARY DATA QUALITY ===")
print(f"Valid salaries: {df['salary_usd'].notna().sum()}")
print(f"Salary range: ${df['salary_usd'].min():,.0f} to ${df['salary_usd'].max():,.0f}")
print(f"Median salary (USD): ${df['salary_usd'].median():,.0f}")

print("\n=== GEOGRAPHIC DISTRIBUTION ===")
print(f"Countries represented: {df['country'].nunique()}")
us_states = df.loc[df['country'] == 'United States', 'state']
single_states = us_states[(~us_states.isna()) & (~us_states.str.contains(',', na=False))]
num_unique_single_states = single_states.str.strip().nunique()
print(f"US states represented: {num_unique_single_states}")

print("\n=== CLEANING COMPLETE ===")
print("Dataset is now ready for analysis!")
print(f"All steps completed successfully")

=== STARTING COMPREHENSIVE DATA CLEANING ===

1. Loading raw data...
   Raw data loaded: 28062 rows, 18 columns

2. Renaming columns...
   Columns renamed successfully

3. Handling duplicates...
   Removed 0 exact duplicates

4. Converting timestamp to datetime...
   Date range: 2021-04-27 11:02:10 to 2024-08-20 01:06:13

5. Cleaning salary data...
   Salary range: 0 to 43,800,000
   Missing salaries: 0

6. Cleaning currency data...
   Currencies standardized: 39 unique currencies
   USD conversion created

7. Cleaning location data...
   Countries standardized: 238 unique countries
   US states represented: 129

8. Cleaning demographic and job data...
   Gender categories: 5
   Education levels: 7
   Industries: 1131

=== FINAL DATA CLEANING SUMMARY ===
Final dataset shape: (28062, 21)
Columns: ['timestamp', 'age', 'industry', 'job_title', 'job_clarification', 'annual_salary', 'additional_compensation', 'currency', 'other_currency', 'income_context', 'country', 'state', 'city', 'total
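
The log reports 39 unique currency codes, while `conversion_rates` in the script has 34 entries, so some standardized codes have no rate and their `salary_usd` becomes NaN. A minimal sketch of how to surface those codes follows. The miniature `rates` dict and `toy` frame are hypothetical stand-ins for the notebook's objects.

```python
import pandas as pd

# Hypothetical miniature versions of conversion_rates and df
rates = {"USD": 1.0, "CAD": 0.79, "GBP": 1.37}
toy = pd.DataFrame({
    "annual_salary": [100000, 80000, 50000, 900000],
    "currency_clean": ["USD", "CAD", "GBP", "IDR"],  # IDR has no rate in this toy dict
})

# Same conversion pattern as the cleaning script: unmapped codes yield NaN
toy["salary_usd"] = toy["annual_salary"] * toy["currency_clean"].map(rates)

# Codes that passed standardization but have no conversion rate
unmapped = sorted(set(toy["currency_clean"]) - set(rates))
n_lost = int(toy["salary_usd"].isna().sum())
print(unmapped, n_lost)
```

Running the same set difference against the real `df['currency_clean']` and `conversion_rates` would show how many rows drop out of any analysis based on `salary_usd`.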

## Step 3: Business Questions Analysis

Now answer those important business questions!


In [None]:
# Question 1: What is the median salary for Software Engineers in the United States?
# Match Software Engineer and similar titles (any title containing "software engineer").
# annual_salary is in the reported currency; salary_usd holds the converted value.
us_swe = df.loc[
    (df["country"] == "United States")
    & df["job_title"].str.contains("software engineer", case=False, na=False)
]
print(us_swe["annual_salary"].median())

145000.0


In [265]:
# Mode and mean for the same group, to compare against the median
us_swe_salary = df.loc[
    (df["country"] == "United States")
    & df["job_title"].str.contains("software engineer", case=False, na=False),
    "annual_salary",
]
print(us_swe_salary.mode())
print(us_swe_salary.mean())

0    150000
Name: annual_salary, dtype: int64
154428.52791878174
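
A reusable form of the Question 1 filter, sketched on made-up rows. It reads the converted `salary_usd` column rather than `annual_salary`, and passes `na=False` so missing titles count as non-matches. The helper name `us_title_salaries` and the toy data are assumptions for illustration.

```python
import pandas as pd

# Made-up rows, not survey data
toy = pd.DataFrame({
    "country": ["United States", "United States", "Canada", "United States", "United States"],
    "job_title": ["Software Engineer", "Senior Software Engineer Ii", None, "Librarian", "Software Engineer"],
    "salary_usd": [120000.0, 180000.0, 90000.0, 60000.0, 150000.0],
})

def us_title_salaries(df: pd.DataFrame, keyword: str) -> pd.Series:
    """USD salaries of US respondents whose title contains `keyword`."""
    is_us = df["country"].eq("United States")
    # Literal, case-insensitive substring match; NaN titles are treated as False
    has_kw = df["job_title"].str.contains(keyword, case=False, na=False, regex=False)
    return df.loc[is_us & has_kw, "salary_usd"]

swe = us_title_salaries(toy, "software engineer")
print(swe.median(), swe.mean())
```

Substring matching also picks up titles such as "Senior Software Engineer"; whether those belong in the answer is one of the judgment calls the assignment asks for.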


In [244]:
# Question 2: Which US state has the highest average salary for tech workers?
average_tech_salary_states = df.loc[df["industry"] == "Computing or Tech"].groupby("state")["annual_salary"].mean().reset_index()
average_tech_salary_states = average_tech_salary_states.loc[~average_tech_salary_states["state"].str.contains(",")].sort_values(by="annual_salary", ascending=False, ignore_index=True)
# Use pandas Styler to format only the 'annual_salary' column for display
styled_result = average_tech_salary_states.style.format({"annual_salary": "${:,.2f}"})
display(styled_result)

| | state | annual_salary |
|---|---|---|
| 0 | Florida | $157,457.23 |
| 1 | California | $154,857.22 |
| 2 | Washington | $151,423.67 |
| 3 | New York | $148,054.64 |
| 4 | Nevada | $141,310.00 |
| 5 | Massachusetts | $135,213.54 |
| 6 | District Of Columbia | $132,318.04 |
| 7 | New Mexico | $132,200.00 |
| 8 | New Jersey | $128,802.48 |
| 9 | Connecticut | $128,375.00 |
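
The ranking above averages `annual_salary` for every "Computing or Tech" row that has a state value, without checking the country or how many respondents each state has. Because a mean is sensitive to extreme values, one very large entry can lift a small state to the top. The sketch below shows one possible guard on made-up rows: restrict to the United States and require a minimum group size. The helper name `state_means` and the threshold are assumptions, not the official method.

```python
import pandas as pd

# Made-up rows; Florida has a single extreme entry
toy = pd.DataFrame({
    "country": ["United States"] * 6 + ["Canada"],
    "state": ["California", "California", "California", "Florida", "Texas", "Texas", None],
    "industry": ["Computing or Tech"] * 7,
    "salary_usd": [150000.0, 160000.0, 140000.0, 900000.0, 110000.0, 130000.0, 95000.0],
})

def state_means(df: pd.DataFrame, min_count: int = 2) -> pd.DataFrame:
    """Mean tech salary per US state, keeping states with enough respondents."""
    mask = df["country"].eq("United States") & df["industry"].eq("Computing or Tech")
    by_state = df.loc[mask].groupby("state")["salary_usd"].agg(["mean", "count"])
    return by_state[by_state["count"] >= min_count].sort_values("mean", ascending=False)

ranked = state_means(toy)
print(ranked)
```

In this toy frame the single-respondent state is filtered out before ranking. On the real data the right threshold is a judgment call and should be reported alongside the answer.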


In [248]:
# Question 3: How much does salary increase on average for each year of experience in tech?

# Filter for tech workers
tech_workers = df[df['industry'] == 'Computing or Tech'].copy()

# Experience mapping
experience_mapping = {
    '1 year or less': 0.5,
    '2 - 4 years': 3,
    '5-7 years': 6,
    '8 - 10 years': 9,
    '11 - 20 years': 15.5,
    '21 - 30 years': 25.5,
    '31 - 40 years': 35.5,
    '41 years or more': 45
}

tech_workers['total_experience_numeric'] = tech_workers['total_experience'].map(experience_mapping)

# Clean data
tech_workers_clean = tech_workers.dropna(subset=['total_experience_numeric', 'salary_usd'])

# Remove outliers
tech_analysis = tech_workers_clean[
    (tech_workers_clean['salary_usd'] >= 20000) & 
    (tech_workers_clean['salary_usd'] <= 500000)
].copy()

print(f'Tech workers for analysis: {len(tech_analysis)}')

# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(
    tech_analysis['total_experience_numeric'], 
    tech_analysis['salary_usd']
)

print(f'\n=== LINEAR REGRESSION RESULTS ===')
print(f'Salary increase per year of experience: ${slope:,.0f}')
print(f'Starting salary (0 years): ${intercept:,.0f}')
print(f'R-squared: {r_value**2:.3f}')
print(f'P-value: {p_value:.2e}')
print(f'Standard error: ${std_err:,.0f}')

Tech workers for analysis: 4627

=== LINEAR REGRESSION RESULTS ===
Salary increase per year of experience: $1,532
Starting salary (0 years): $99,935
R-squared: 0.053
P-value: 3.35e-57
Standard error: $95


In [252]:
# Question 4: Which industry (besides tech) has the highest median salary?
threshold = 100
industry_counts = df["industry"].value_counts()
industries_above_threshold = industry_counts[industry_counts >= threshold].index.to_list()
print("Industries with at least 100 survey entries:")
print(industries_above_threshold)

df.loc[df["industry"].isin(industries_above_threshold)].groupby("industry")["salary_usd"].median().nlargest(10)
# Law 

Industries with at least 100 survey entries:
['Computing or Tech', 'Education (Higher Education)', 'Nonprofits', 'Health care', 'Government and Public Administration', 'Accounting, Banking & Finance', 'Engineering or Manufacturing', 'Marketing, Advertising & PR', 'Law', 'Business or Consulting', 'Education (Primary/Secondary)', 'Media & Digital', 'Insurance', 'Retail', 'Recruitment or HR', 'Property or Construction', 'Art & Design', 'Utilities & Telecommunications', 'Transport or Logistics', 'Sales', 'Social Work', 'Hospitality & Events', 'Entertainment', 'Agriculture or Forestry']


industry
Computing or Tech                       112050.0
Law                                      91000.0
Business or Consulting                   88000.0
Engineering or Manufacturing             86900.0
Utilities & Telecommunications           79900.0
Accounting, Banking & Finance            78000.0
Health care                              76630.0
Government and Public Administration     75000.0
Insurance                                75000.0
Marketing, Advertising & PR              72500.0
Name: salary_usd, dtype: float64
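
The answer to Question 4 is read off the list above by skipping the first row. A small sketch of doing the exclusion in code, on made-up rows, follows. The helper name `top_industry` and the toy threshold of 3 are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; "Rare Industry" has one extreme respondent
toy = pd.DataFrame({
    "industry": ["Computing or Tech"] * 3 + ["Law"] * 3 + ["Nonprofits"] * 3 + ["Rare Industry"],
    "salary_usd": [110000.0, 120000.0, 130000.0, 85000.0, 90000.0, 100000.0,
                   50000.0, 60000.0, 70000.0, 500000.0],
})

def top_industry(df: pd.DataFrame, exclude: str, min_count: int):
    """Industry with the highest median salary, ignoring `exclude` and small groups."""
    counts = df["industry"].value_counts()
    keep = counts[counts >= min_count].index.difference([exclude])
    medians = df[df["industry"].isin(keep)].groupby("industry")["salary_usd"].median()
    return medians.idxmax(), medians.max()

name, median = top_industry(toy, exclude="Computing or Tech", min_count=3)
print(name, median)
```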

In [247]:
# Bonus Questions:
# Question 5: What's the salary gap between men and women in tech roles?
# Question 6: Do people with Master's degrees earn significantly more than those with Bachelor's degrees?
# Question 7: Which company size (startup, medium, large) pays the most on average?
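
The bonus questions are left unanswered in this notebook. As a starting point, a gap between two groups can be sketched as a difference of medians. The example uses made-up rows and assumes the survey's "College degree" label stands in for a Bachelor's degree; the helper name `median_gap` is hypothetical.

```python
import pandas as pd

# Made-up rows, not survey data
toy = pd.DataFrame({
    "education": ["Master's degree", "Master's degree",
                  "College degree", "College degree", "College degree"],
    "salary_usd": [100000.0, 120000.0, 80000.0, 90000.0, 100000.0],
})

def median_gap(df: pd.DataFrame, group_col: str, a: str, b: str):
    """Median salary of group `a` minus group `b`, in dollars and as a percent of `b`."""
    medians = df.groupby(group_col)["salary_usd"].median()
    diff = medians[a] - medians[b]
    return diff, diff / medians[b] * 100

diff, pct = median_gap(toy, "education", "Master's degree", "College degree")
print(f"Gap: ${diff:,.0f} ({pct:.1f}%)")
```

This reports the size of the gap only. Judging whether it is statistically significant would need a separate test, and the same helper could be pointed at the `gender` column for Question 5.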

## Final Summary

**Summarize your findings here:**

1. **Median salary for Software Engineers in US:** $145,000.00
2. **Highest paying US state for tech:** Florida
3. **Salary increase per year of experience:** $1,532 per year
4. **Highest paying non-tech industry:** Law

**Key insights:**
- Insight 1: The salary distribution for US Software Engineers is likely slightly right-skewed: the mean ($154,429) exceeds the median ($145,000), so a small number of large values are pulling the mean up.
- Insight 2: According to our linear regression model, tech workers start at around $99,935 with no experience, and salary increases by about $1,532 per year of experience. The fit is weak (R-squared of 0.053), so experience alone explains only about 5% of the variation in salary.
- Insight 3: Among industries with at least 100 responses in the dataset, the highest median salaries are: Computing or Tech ($112,050.00), Law ($91,000.00), Business or Consulting ($88,000.00), Engineering or Manufacturing ($86,900.00), and Utilities & Telecommunications ($79,900.00).

**Challenges faced:**
- Figuring out how to deal with different currencies. Solved by using GenAI to create a fairly comprehensive mapping between the various ill-formed values and their standardized codes.
- Answering Business Question 3, since years of experience is a categorical value. Mapped each category to the approximate midpoint of its range and ran a linear regression of salary on that proxy for years worked.

**What you learned about vibe coding:**
- Learning 1: It can still take a lot of time to get things done.
- Learning 2: Prompt specificity is important in getting the intended results.
- Learning 3: LLMs can help come up with different approaches for solving a problem.
