<a href="https://colab.research.google.com/github/rhodes-byu/cs180-winter25/blob/main/notebooks/06c-working-adult-eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><p><b>After clicking the "Open in Colab" link, copy the notebook to your own Google Drive before getting started, or it will not save your work</b></p>

# EDA Investigation: What Factors Predict High Income?

Your goal is to uncover patterns and relationships that help answer the central question: **What factors are most strongly associated with earning more than $50K per year?**

## Learning Goals
By completing this investigation, you will:
1. Practice systematic data exploration techniques
2. Learn to ask meaningful questions about data
3. Create visualizations that reveal insights
4. Discover patterns in real-world census data
5. Develop hypotheses based on evidence

## Investigation Roadmap
1. [Initial Data Discovery](#discovery)
2. [Data Quality Detective Work](#quality)
3. [Individual Variable Exploration](#individual)
4. [Relationship Investigation](#relationships)
5. [Advanced Pattern Analysis](#patterns)
6. [Your Independent Investigation](#independent)
7. [Evidence Summary and Reflection](#summary)

## The Big Questions to Answer
- Which demographic factors matter most for income?
- How does education influence earning potential?
- Are there gender disparities in income?
- What role do work patterns play?
- Can you identify surprising patterns in the data?

In [2]:
# SETUP QUESTION: What tools do we need for data exploration?
# Import the essential libraries for data analysis and visualization
# Hint: You'll need pandas, numpy, matplotlib.pyplot, and seaborn
# Bonus: Add any warnings filters and styling preferences

# Your code here:

# 1. Initial Data Discovery {#discovery}

## 🔍 **Investigation Questions:**

### Q1.1: What does this dataset contain?
**Your task:** Research and summarize the Adult Income dataset
- **Source:** 1994 Census database from UCI Machine Learning Repository
- **Purpose:** What were researchers trying to predict?
- **Variables:** How many variables are there and what do they represent?

### Q1.2: How do we load and get our first look at the data?
**Your task:** Load the dataset and examine its structure
- **Data URL:** `https://raw.githubusercontent.com/rhodes-byu/cs-stat-180/refs/heads/main/data/adult.csv`
- **Questions to answer:** What's the size? What are the column names? What data types do we have?
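
As a warm-up before loading the real file, here is a minimal sketch of a "first look" on a tiny inline CSV. The rows are made up and the column names are assumptions about what the Adult dataset contains; `pd.read_csv` accepts a URL string in the same position as the in-memory buffer used here.

```python
import io
import pandas as pd

# Hypothetical stand-in for adult.csv: made-up rows, assumed column names.
csv_text = """age,education,hours-per-week,income
39,Bachelors,40,<=50K
50,Bachelors,13,<=50K
38,HS-grad,40,<=50K
52,Masters,45,>50K
"""

# Reading from a buffer here; a URL string works in the same argument position.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)             # (number of rows, number of columns)
print(toy.columns.tolist())  # column names
print(toy.dtypes)            # data type of each column
print(toy.head())            # first few rows
```

The same four calls (`shape`, `columns`, `dtypes`, `head()`) answer the size, column-name, and data-type questions once the real data is loaded.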

In [None]:
import pandas as pd
url = 'https://raw.githubusercontent.com/porterjenkins/CS180/main/data/adult.csv'


In [None]:
# Q1.1: What does this dataset contain?
# Your code here:

In [4]:
# Q1.2: Load the dataset and examine its basic properties
# Your code here:




In [5]:
# Q1.3: What do the basic statistics tell us about numerical variables?
# Your code here:





# Additional questions to answer:
# - What's the age range of people in the dataset?
# - What's the typical number of hours worked per week?
# - Are there any obvious outliers in the numerical data?

# 2. Data Quality Detective Work {#quality}

### Q2.1: Is our data clean and reliable?
Before we can trust our analysis, we need to check for data quality issues:

**Your investigation tasks:**
- Are there missing values? How many and where?
- Are there duplicate records?
- Do the data types make sense?
- Are there any obvious errors or inconsistencies?
- Are there potential outliers that need investigation?
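
The checks above can be sketched on a toy frame as follows. The values are made up, and the `"?"` placeholder is an assumption: some copies of this dataset encode missing values that way, so it is worth checking whether yours does.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): one NaN, a "?" placeholder, one duplicated row.
toy = pd.DataFrame({
    "age": [25, 40, 40, np.nan],
    "workclass": ["Private", "?", "?", "Private"],
    "income": ["<=50K", ">50K", ">50K", "<=50K"],
})

missing_count = toy.isna().sum()                   # NaN values per column
missing_pct = (toy.isna().mean() * 100).round(1)   # percentage of rows missing
n_duplicates = int(toy.duplicated().sum())         # rows identical to an earlier row

# Count "?" strings in the non-numeric columns only (assumed placeholder).
placeholder_count = toy.select_dtypes(exclude="number").eq("?").sum()

# Columns with at most one distinct value carry no information.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=True) <= 1]

print(missing_count, missing_pct, n_duplicates, placeholder_count, constant_cols, sep="\n")
```

Note that `isna()` does not see string placeholders, which is why the `"?"` count is a separate step.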

In [None]:
# Q2.1: Write code that checks data quality comprehensively

# Suggested checks to include:
# 1. Dataset shape and memory usage
# 2. Missing values count and percentage
# 3. Duplicate rows
# 4. Data types summary
# 5. Potential issues (constant columns, all unique values, etc.)

# 3. Individual Variable Exploration {#individual}

### Q3.1: How can we categorize our variables?
**Your task:** Identify and categorize the variables in the dataset
- Which variables are categorical (object type)?
- Which variables are numerical (int/float type)?
- What does each variable represent?

### Q3.2: What patterns do we see in categorical variables?
**Your investigation:** For each categorical variable, explore:
- How many unique categories are there?
- What's the distribution of each category?
- Which categories are most/least common?
- Are there any surprising patterns?
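
A minimal sketch of the split-and-summarize pattern, on made-up toy rows. It uses `select_dtypes(exclude="number")` rather than listing `object` explicitly; that is a swapped-in alternative that also picks up string columns stored under other dtypes.

```python
import pandas as pd

# Toy frame (made-up values) with two categorical columns and one numeric.
toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "HS-grad", "Masters", "HS-grad"],
    "sex": ["Male", "Female", "Female", "Male", "Male"],
    "age": [22, 35, 41, 50, 29],
})

categorical_vars = toy.select_dtypes(exclude="number").columns.tolist()
numerical_vars = toy.select_dtypes(include="number").columns.tolist()

for col in categorical_vars:
    counts = toy[col].value_counts()            # sorted from most to least common
    print(f"{col}: {toy[col].nunique()} unique, most common = {counts.idxmax()}")
    print((counts / len(toy) * 100).round(1))   # share of rows per category, in %
```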

In [7]:
# Q3.1 & Q3.2: Explore categorical and numerical variables
# Step 1: Separate categorical and numerical variables
# Hint: Use select_dtypes(include=['object']) and select_dtypes(include=['int64', 'float64'])

# Your code here:




# Step 2: Analyze categorical variables
# For each categorical variable, find:
# - Number of unique values
# - Most common value
# - Value counts (if reasonable number of categories)

# Your code here:




# Questions to answer:
# - Which categorical variable has the most categories?
# - What's the most common education level?
# - How is the income variable distributed (our target variable)?
# - Are there any categories with very few observations?

### Q3.3: What do categorical variables look like visually?
**Your visualization challenge:** Create bar plots for the key categorical variables
- Select the 4-6 most important categorical variables
- Create subplots showing their distributions
- Add proper titles, labels, and value counts on bars
- **Focus especially on the 'income' variable - this is what we're trying to predict!**

**Suggested variables:** income, education, sex, marital-status, occupation, race
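
One possible shape for the plotting loop, sketched on made-up toy rows with two variables instead of six. `Axes.bar_label` (available in matplotlib 3.4 and later) is one way to put counts on the bars; the `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame (made-up values).
toy = pd.DataFrame({
    "income": ["<=50K", "<=50K", "<=50K", ">50K"],
    "sex": ["Male", "Female", "Male", "Male"],
})

plot_vars = ["income", "sex"]
fig, axes = plt.subplots(1, len(plot_vars), figsize=(8, 3))

for ax, col in zip(axes, plot_vars):
    counts = toy[col].value_counts()
    bars = ax.bar(list(counts.index), counts.values)
    ax.bar_label(bars)                         # write each count above its bar
    ax.set_title(f"Distribution of {col}")
    ax.set_xlabel(col)
    ax.set_ylabel("Count")
    ax.tick_params(axis="x", labelrotation=45)

fig.tight_layout()
```

For the real data, a 2x3 grid returns a 2-D array of axes, so `axes.flatten()` is needed before zipping.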

In [8]:
# Q3.3: Create bar plots for categorical variables
# Your task: Visualize the distribution of key categorical variables

# Hints:
# - Use plt.subplots() to create multiple plots
# - Use .value_counts() to get the data for plotting
# - Add titles, labels, and rotate x-axis labels if needed
# - Consider adding count labels on top of bars

# Suggested approach:
# 1. Select key categorical variables to plot
# 2. Create a subplot grid (2x3 works well)
# 3. For each variable, create a bar plot
# 4. Style your plots with titles and labels

# Your code here:




# What patterns do you observe?
# - Which categories are most common in each variable?
# - What's the income distribution? (This is our target variable!)
# - Are there any surprisingly uneven distributions?

### Q3.4: What stories do the numerical variables tell?
**Your analysis tasks:**
1. **Calculate summary statistics** for each numerical variable
2. **Identify potential outliers** using the IQR method
3. **Create histograms and box plots** to visualize distributions
4. **Look for patterns** like skewness, multiple peaks, or unusual values

**Questions to investigate:**
- What's the age distribution? Are most people young, old, or evenly distributed?
- How many hours do people typically work? Are there part-time vs full-time patterns?
- Are there extreme outliers that might indicate data errors?
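
The IQR rule can be sketched on a single made-up series as follows. The values are invented for illustration; the 1.5 multiplier is the conventional choice, not something specific to this dataset.

```python
import pandas as pd

# Made-up weekly hours, including two extreme values.
hours = pd.Series([40, 40, 38, 45, 42, 40, 99, 1], name="hours-per-week")

q1 = hours.quantile(0.25)
q3 = hours.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged as a potential outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = hours[(hours < lower) | (hours > upper)]

print(f"mean={hours.mean():.2f}, median={hours.median():.2f}")
print(f"fences=({lower:.2f}, {upper:.2f}), flagged={sorted(outliers.tolist())}")
```

A flagged value is a candidate for inspection, not automatically an error: very long or very short work weeks can be real.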

In [9]:
# Q3.4: Analyze numerical variables in detail
# Your task: Create a comprehensive analysis of each numerical variable

# For each numerical variable, calculate and print:
# - Count, mean, median, standard deviation
# - Min, max, range
# - Potential outliers using IQR method (Q1 - 1.5*IQR, Q3 + 1.5*IQR)

# Hint: Loop through numerical_vars and use pandas methods like .mean(), .median(), .quantile()

# Your code here:




# Questions to answer:
# - Which numerical variable has the most outliers?
# - Are the mean and median similar or different for each variable? What does this suggest?
# - What's the age range in the dataset?
# - What do you notice about the hours-per-week variable?

In [10]:
# Q3.5: Visualize numerical variable distributions
# Your task: Create histograms and box plots for numerical variables

# Create a visualization with:
# - Top row: Histograms with mean and median lines
# - Bottom row: Box plots to show outliers
# - Proper titles and labels

# Hints:
# - Use plt.subplots() with appropriate size
# - Use .axvline() to add mean/median lines to histograms
# - Use different colors for mean and median lines
# - Add legends to explain the lines

# Your code here:




# What do you observe?
# - Which variables look approximately normal, and which are skewed?
# - Where do you see the most outliers?
# - How do the mean and median compare for each variable?

# 4. Relationship Investigation {#relationships}

### Q4.1: Which categorical factors are most associated with high income?
**Your investigation:** For key categorical variables, explore their relationship with income:
- What percentage of people in each category earn >$50K?
- Which education levels have the highest income rates?
- Are there gender disparities in income?
- How does marital status relate to earning potential?
- Which occupations tend to have higher incomes?
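
A minimal sketch of the "percentage within each category" calculation, on made-up toy rows. The income labels `<=50K` and `>50K` are assumed; check the exact strings in your copy of the data (for example, stray whitespace) before filtering on them.

```python
import pandas as pd

# Toy frame (made-up values).
toy = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "HS-grad", "HS-grad",
                  "Bachelors", "Bachelors", "Masters", "Masters"],
    "income": ["<=50K", "<=50K", "<=50K", ">50K",
               "<=50K", ">50K", ">50K", ">50K"],
})

# normalize="index" makes every row sum to 1, so each cell is a within-category share.
rates = pd.crosstab(index=toy["education"], columns=toy["income"], normalize="index") * 100

# Percentage earning >50K in each category, highest first.
high_income_pct = rates[">50K"].sort_values(ascending=False)
print(high_income_pct.round(1))
```

Normalizing by row matters because categories differ in size: raw counts would mostly reflect which categories are common, not which have higher income rates.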

In [11]:
# Q4.1: Investigate categorical variables vs income
# Your task: Analyze how different categorical factors relate to income levels

# Suggested approach:
# 1. Select key categorical variables (education, sex, marital-status, occupation)
# 2. For each variable, create a crosstab with income
# 3. Normalize by rows to show percentages within each category
# 4. Create stacked bar plots
# 5. Calculate and print the percentage earning >50K for each category

# Hints:
# - Use pd.crosstab(index=df[var], columns=df['income'], normalize='index')
# - Multiply by 100 to get percentages
# - Use .plot(kind='bar', stacked=True) for visualization

# Your code here:




# Key questions to answer:
# - Which education level has the highest percentage earning >50K?
# - Is there a gender gap in high income earners?
# - Which marital status categories are associated with higher income?
# - What occupations have the best income prospects?

### Q4.2: How do numerical factors differ between income groups?
**Your investigation:** Compare numerical variables between people earning ≤$50K vs >$50K:
- Do high earners tend to be older or younger?
- Do high earners work more hours per week?
- Are there other numerical patterns that distinguish income groups?

**Method:** Create box plots comparing income groups and calculate summary statistics
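
The summary-statistics half of this comparison can be sketched with `groupby`, here on made-up toy rows with assumed column names and income labels.

```python
import pandas as pd

# Toy frame (made-up values).
toy = pd.DataFrame({
    "age": [25, 30, 35, 45, 50, 55],
    "hours-per-week": [35, 40, 40, 45, 50, 40],
    "income": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})
num_cols = ["age", "hours-per-week"]

# Mean and median of each numerical column within each income group.
group_stats = toy.groupby("income")[num_cols].agg(["mean", "median"])

# Difference in group means: high earners minus low earners.
mean_by_group = toy.groupby("income")[num_cols].mean()
mean_gap = mean_by_group.loc[">50K"] - mean_by_group.loc["<=50K"]

print(group_stats)
print(mean_gap.round(2))
```

Raw gaps depend on each variable's units, so when ranking variables by "biggest difference" it may help to divide each gap by that variable's overall standard deviation.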

In [12]:
# Q4.2: Compare numerical variables across income groups
# Your task: Create box plots and calculate statistics comparing income groups

# Steps:
# 1. Create box plots for each numerical variable, grouped by income
# 2. Calculate mean and median for each income group
# 3. Calculate the difference between groups
# 4. Identify which numerical factors show the biggest differences

# Hints:
# - Use plt.subplots() for multiple plots
# - Use boxplot() with data separated by income groups
# - Filter data like: adult[adult['income'] == '<=50K']['age']

# Your code here:




# Questions to answer:
# - Which numerical variable shows the biggest difference between income groups?
# - Do high earners tend to work more hours?
# - What's the age difference between income groups?
# - Are there any surprising patterns?

### Q4.3: How do numerical variables relate to each other?
**Your investigation:** Explore correlations and relationships between numerical variables:
- Which numerical variables are most strongly correlated?
- Are there any surprising relationships?
- How do these relationships appear in scatter plots?
- Can you identify any patterns when colored by income level?

**Method:** Create correlation matrix, heatmap, and scatter plots
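
A sketch of the correlation step on made-up toy rows (column names assumed). It builds the upper-triangle mask mentioned in the hints and lists each variable pair once; the 0.5 cutoff is only the rule of thumb used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values).
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "education-num": [9, 10, 13, 13, 14],
    "hours-per-week": [40, 38, 45, 40, 42],
})

corr = toy.corr()  # Pearson correlation by default

# True on and above the diagonal: these cells are redundant in a symmetric matrix.
mask = np.triu(np.ones_like(corr, dtype=bool))

# Keep only the lower triangle, then flatten to one row per variable pair.
pairs = corr.mask(mask).stack().dropna()
strong = pairs[pairs.abs() > 0.5]

print(pairs.round(2))
print(strong.round(2))
```

The same `mask` array can be passed to `sns.heatmap(corr, mask=mask, annot=True)` to hide the duplicated half of the heatmap.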

In [13]:
# Q4.3: Analyze correlations between numerical variables
# Your task: Create correlation analysis and scatter plots

# Part A: Correlation Matrix
# 1. Calculate correlation matrix for numerical variables
# 2. Create a heatmap to visualize correlations
# 3. Identify strong correlations (|r| > 0.5) and moderate ones (0.3 < |r| ≤ 0.5)

# Hints: Use .corr(), sns.heatmap(), and set up a mask for upper triangle

# Your code here:




# Part B: Scatter Plots
# Create scatter plots for key numerical variable pairs
# Color points by income level to see if patterns emerge
# Suggested plots: age vs hours-per-week, and others you find interesting

# Your code here:




# Questions to answer:
# - Which numerical variables are most strongly correlated?
# - Do you see different patterns for high vs low income earners in scatter plots?
# - Are there any unexpected relationships?

# 5. Advanced Pattern Analysis {#patterns}

### Q5.1: What's the education-income connection?
**Your sophisticated analysis:** Create a detailed heatmap showing income percentages by education level
- Order education levels from lowest to highest
- Show the percentage earning >$50K for each education level
- Identify the education "threshold" where income prospects improve dramatically
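
The ordering step can be sketched as follows, using made-up toy rows and a three-level stand-in for the full education list. Looking at the jump between consecutive levels with `diff()` is one simple, assumed way to search for a threshold, not the only one.

```python
import pandas as pd

# Short stand-in for the full ordered list of education levels.
order = ["HS-grad", "Bachelors", "Masters"]

# Toy frame (made-up values).
toy = pd.DataFrame({
    "education": ["HS-grad"] * 4 + ["Bachelors"] * 4 + ["Masters"] * 2,
    "income": ["<=50K", "<=50K", "<=50K", ">50K",
               "<=50K", ">50K", ">50K", ">50K",
               ">50K", ">50K"],
})

pct = pd.crosstab(toy["education"], toy["income"], normalize="index") * 100

# crosstab sorts labels alphabetically; reindex restores the logical order.
pct_ordered = pct.reindex(order)

# Change in the >50K percentage from one level to the next.
jumps = pct_ordered[">50K"].diff()
print(pct_ordered.round(1))
print("Largest jump arrives at:", jumps.idxmax())
```

Any label in the ordering list that never appears in the data (or is misspelled) shows up after `reindex` as a row of `NaN`, which is a quick way to catch mismatched category names.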

In [14]:
# Q5.1: Create sophisticated education-income analysis
# Your task: Build an advanced heatmap showing education levels and income

# Step 1: Define education levels in logical order (lowest to highest)
education_levels = [
    "Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th",
    "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm", "Bachelors", "Masters",
    "Prof-school", "Doctorate"  # Prof-school (15) precedes Doctorate (16) in the UCI education-num coding
]

# Step 2: Create crosstab showing income percentages by education
# Hint: Use normalize='index' to get percentages within each education level

# Your code here:




# Step 3: Reorder the data by education level and create heatmap
# Hint: Use .reindex() to order by education_levels

# Your code here:




# Questions to investigate:
# - At what education level does the income advantage become clear?
# - Which education levels have the highest percentage earning >50K?
# - Are there any surprising patterns in the education-income relationship?

In [15]:
# Q5.2: Occupation analysis challenge
# Your task: Create a comprehensive analysis of occupations

# Part A: Occupation frequency analysis
# - Which occupations are most common in the dataset?
# - Create a bar plot showing occupation counts

# Your code here:




# Part B: Occupation income analysis  
# - Which occupations have the highest percentage earning >50K?
# - Create a horizontal bar plot showing income percentages by occupation
# - Rank occupations by income potential

# Your code here:




# Challenge questions:
# - Which occupation has the best income prospects?
# - Are there occupations with high frequency but low income potential?
# - What occupations would you recommend for high earning potential?

# 6. Your Independent Investigation {#independent}

Now it's time for you to lead the investigation! Pick 2-3 questions that interest you most and conduct your own analysis.

### Investigation Option A: Age and Work Patterns
**Question:** How do age and work hours relate, and does this vary by income level?
- Create scatter plots of age vs hours-per-week
- Color by income level
- Look for age groups with different work patterns
- Calculate correlations within income groups
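
For the last bullet, a minimal sketch of computing a correlation separately within each income group. The rows are made up and deliberately constructed so the two groups trend in opposite directions; real data will be far noisier.

```python
import pandas as pd

# Toy frame (made-up values): hours rise with age in one group, fall in the other.
toy = pd.DataFrame({
    "age": [20, 30, 40, 30, 40, 50],
    "hours-per-week": [30, 40, 50, 50, 45, 40],
    "income": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

# Pearson correlation of age and hours, computed one income group at a time.
corr_by_group = {
    group: rows["age"].corr(rows["hours-per-week"])
    for group, rows in toy.groupby("income")
}
overall_corr = toy["age"].corr(toy["hours-per-week"])

print(corr_by_group)
print(round(overall_corr, 3))
```

Comparing `corr_by_group` with `overall_corr` shows why it is worth checking subgroups: a pooled correlation can hide patterns that differ between groups.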

### Investigation Option B: Gender Deep Dive  
**Question:** What are the detailed patterns of gender and income in this dataset?
- Calculate exact percentages of high earners by gender
- Create visualizations showing the gender-income relationship
- Investigate if gender patterns vary by other factors (education, occupation, etc.)

### Investigation Option C: Marital Status Investigation
**Question:** How does marital status relate to income, and why might this be?
- Analyze income percentages by marital status
- Create visualizations showing these patterns
- Hypothesize about the underlying reasons for patterns you find

### Investigation Option D: Work Hours Analysis
**Question:** What patterns exist in work hours, and how do they relate to income?
- Analyze the distribution of hours-per-week
- Look at median work hours by different demographic groups  
- Investigate outliers in work hours

### Investigation Option E: Create Your Own Question!
**Your question:** _____________________
Design and conduct your own analysis based on a question that interests you about this dataset.

In [16]:
# Your Independent Investigation - Investigation A: Age and Work Patterns
# Question: How do age and work hours relate, and does this vary by income level?

# Your analysis here:




# What did you discover?

In [17]:
# Your Independent Investigation - Investigation B: Gender Deep Dive
# Question: What are the detailed patterns of gender and income in this dataset?

# Your analysis here:




# What patterns did you find? What might explain these patterns?

In [18]:
# Your Independent Investigation - Investigation C, D, or E
# State your question:

# Your analysis here:




# Conclusions and insights:

# 7. Evidence Summary and Reflection {#summary}

## 🎯 **Investigation Conclusions:**

### Your Key Findings
Based on your analysis, answer these summary questions:

#### **Demographic Patterns:**
1. **Age:** What age patterns did you discover in relation to income?
2. **Gender:** What gender-income patterns did you find? How large are the differences?
3. **Education:** What's the relationship between education level and income? Where's the "turning point"?

#### **Work and Income Factors:**
4. **Hours Worked:** How do work hours relate to income levels?
5. **Occupation:** Which occupations offer the best income prospects?
6. **Marital Status:** How does marital status relate to income, and what might explain this?

#### **Most Important Discoveries:**
7. **Strongest Predictors:** Based on your analysis, what are the top 3 factors most associated with high income?
8. **Surprising Findings:** What patterns surprised you? What did you expect but didn't find?
9. **Limitations:** What questions couldn't you answer with this dataset? What additional data would be helpful?

### **EDA Skills Reflection:**
- **Data Quality:** What data quality issues did you encounter and how did you handle them?
- **Visualization:** Which types of plots were most effective for different types of analysis?
- **Insights:** What was the most interesting insight you discovered?

### **Real-World Implications:**
- **Policy:** How might these findings inform education or employment policy?
- **Individual Decisions:** What advice would you give someone trying to maximize their earning potential?
- **Bias Considerations:** What potential biases exist in this 1994 dataset? How might patterns be different today?

### **Next Steps in Analysis:**
If you were to continue this project, what would you do next?
- Additional variables to collect?
- Statistical tests to perform?
- Predictive models to build?
- Different time periods to compare?

---