# Week 6 In-Class Python Lab
---

## A Note About This Lab (And Real-World Data Work)


While this lab is a **guided EDA experience**, it is important to understand that in real data analyst and data scientist roles, you will often **not** be given step-by-step prompts like the ones you see here.


In professional settings, you are usually given:
- A dataset
- A business or research problem
- A general goal (e.g., reduce readmissions, improve patient outcomes, reduce cost)


From there, **you are responsible for:**
- Investigating the data structure
- Identifying data quality issues
- Deciding which questions are worth asking
- Determining which visualizations or summaries help answer those questions
- Translating findings into real-world recommendations


This lab is designed to help you **practice the thinking process** you will need later:


👉 What questions should I ask first?  
👉 What patterns should I check for?  
👉 What might leadership or stakeholders care about?  
👉 What might be misleading if I only look at high-level summaries?


As you progress through the course, prompts will become less specific, and you will be expected to:
- Formulate your own analytical questions 
- Choose appropriate methods independently 
- Justify your analytical decisions 


That transition is intentional and it reflects how real data work happens.


You are not just learning tools. 
You are learning how to **think like an analyst.**

---

## Advanced Pandas + Visual Exploratory Data Analysis (Healthcare Outcomes)

In this lab, you will work with a real-world healthcare dataset using pandas, matplotlib, and seaborn.

You will NOT be told exactly which charts or commands to use. Instead, you will answer healthcare questions using data summaries and visualizations.

---

## Scenario

You are a healthcare data analyst working with hospital leadership.

Leadership wants to understand:
- Which patients are most at risk of readmission
- Which hospital utilization factors relate to readmission
- Whether subgroup patterns change overall conclusions


In [1]:
# START HERE

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# Activity 1: Load the Data

## Healthcare Question
Is the dataset loaded correctly and ready for analysis?

In [None]:
# TODO
# Load dataset into a DataFrame named df
# Display first few rows
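# Fill in the path to your copy of the dataset. An earlier run failed with
# FileNotFoundError for 'diabetic_data', so check the full filename, including any extension.
df = pd.read_csv('')
# head() starts at row 0; iloc[1:20] would skip the first row
df.head()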

# Activity 2: Dataset Structure + Initial Visualization

## Healthcare Question
What types of patient and hospital variables exist in this dataset?

## Required Tasks
- Create summary tables
- Create at least ONE visualization showing distribution of an important variable
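
As a starting point, the summary-table task can be sketched on a tiny made-up frame. The column names and values below are hypothetical stand-ins, not the lab dataset; the same calls should carry over to `df` once it is loaded.

```python
import pandas as pd

# Toy data standing in for the lab dataset; column names and values are made up.
toy = pd.DataFrame({
    "age_group": ["[50-60)", "[60-70)", "[70-80)", "[60-70)"],
    "days_in_hospital": [3, 7, 2, 5],
    "readmitted": ["NO", "<30", "NO", ">30"],
})

print(toy.shape)    # (rows, columns)
print(toy.dtypes)   # storage type of each column
toy.info()          # non-null counts per column

# describe() summarizes numeric columns by default
numeric_summary = toy.describe()
print(numeric_summary)

# value_counts() is the usual summary for a categorical column
outcome_counts = toy["readmitted"].value_counts()
print(outcome_counts)
```

Comparing `dtypes` against what each column is supposed to represent is often the quickest way to spot numbers stored as text or codes stored as numbers.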


In [None]:
# TODO
# Inspect dataset structure
# Generate descriptive summary statistics


In [None]:
# TODO
# Create at least ONE visualization showing distribution of an important variable


# Activity 3: Data Cleaning

## Healthcare Question
What data quality issues must be addressed before making healthcare recommendations?

In [None]:
# TODO
# Standardize missing values
# Inspect missing value counts


Consider how your handling of outliers and missing values changes your analysis. These decisions are more art than science, so be sure to share your reasoning here, just as you would with stakeholders.
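
One possible way to standardize missing values is sketched below on made-up data. It assumes missing entries are encoded as text placeholders such as `"?"`; check which placeholders your dataset actually uses before reusing the list. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data; the placeholder strings and column names are assumptions.
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "?"],
    "weight": ["?", "?", "[75-100)", "?"],
    "days_in_hospital": [3, 7, 2, 5],
})

# Map every placeholder to a real missing value so pandas can count it
placeholders = ["?", "Unknown", ""]
cleaned = toy.replace(placeholders, np.nan)

missing_counts = cleaned.isna().sum()           # missing values per column
missing_share = cleaned.isna().mean().round(2)  # fraction of rows missing
print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```

A column that is mostly missing (like `weight` in this toy frame) may be better dropped than imputed, which is the kind of decision worth explaining to stakeholders.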

# Activity 4: Population + Outcome Visualization

## Healthcare Question
Which patient groups represent the largest share of readmissions?
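
Note that "share of readmissions" and "readmission rate" are different quantities. The sketch below, on made-up data with hypothetical column names, computes both; it assumes any `readmitted` value other than `"NO"` counts as a readmission.

```python
import pandas as pd

# Toy data; column names and category labels are assumptions.
toy = pd.DataFrame({
    "age_group": ["<60", "<60", "60+", "60+", "60+", "60+"],
    "readmitted": ["NO", "<30", "NO", ">30", "<30", "NO"],
})
was_readmitted = toy["readmitted"] != "NO"

# Share: of all readmitted patients, what fraction falls in each group?
share_of_readmissions = toy.loc[was_readmitted, "age_group"].value_counts(normalize=True)

# Rate: within each group, what fraction of patients was readmitted?
rate_within_group = was_readmitted.groupby(toy["age_group"]).mean()

print(share_of_readmissions)
print(rate_within_group)
```

In this toy example the 60+ group accounts for two thirds of readmissions only because it has more patients; the rate within each group is the same.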

In [None]:
# TODO
# Create at least ONE visualization showing readmission distribution


# Activity 5: Binning + Visualization

## Healthcare Question
Does readmission risk increase as hospital stay length increases?

## Required Tasks
- Create bins using cut()
- Create bins using qcut()
- Calculate readmission rate per bin
- Visualize readmission patterns across bins
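
The difference between the two binning functions can be sketched on made-up data: `pd.cut()` uses bin edges you choose, while `pd.qcut()` picks edges so each bin holds roughly the same number of rows. The column names, the bin edges, and the rule that anything other than `"NO"` counts as a readmission are all assumptions for illustration.

```python
import pandas as pd

# Toy data; column names and values are made up.
toy = pd.DataFrame({
    "days_in_hospital": [1, 2, 3, 4, 5, 6, 9, 14],
    "readmitted": ["NO", "NO", "<30", "NO", ">30", "NO", "<30", ">30"],
})
toy["was_readmitted"] = toy["readmitted"] != "NO"

# cut(): analyst-chosen edges; intervals are (0, 3], (3, 7], (7, 14]
toy["time_bin_cut"] = pd.cut(
    toy["days_in_hospital"], bins=[0, 3, 7, 14], labels=["1-3", "4-7", "8-14"]
)

# qcut(): quartile edges computed from the data, so bin counts are balanced
toy["time_bin_qcut"] = pd.qcut(toy["days_in_hospital"], q=4)

# The mean of a True/False column is the readmission rate in that bin
rate_by_cut = toy.groupby("time_bin_cut", observed=True)["was_readmitted"].mean()
rate_by_qcut = toy.groupby("time_bin_qcut", observed=True)["was_readmitted"].mean()

print(rate_by_cut)
print(rate_by_qcut)
```

Always report bin sizes alongside rates: a rate computed from a handful of patients in a sparse `cut()` bin is much less reliable than one from a well-populated bin.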


In [None]:
# TODO
# Create time_bin_cut using pd.cut()
# Create time_bin_qcut using pd.qcut()


In [None]:
# TODO
# Calculate readmission rates by bin


# 🔍 New Visualization Tool: Heatmaps

## What Is A Heatmap?

A heatmap uses color intensity to show the magnitude of values.

Heatmaps are commonly used to show:
- Risk levels across groups
- Rates or percentages
- Correlations

## Real Analyst Skill
Look up seaborn heatmap documentation before completing the next task.
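
After reading the documentation, a minimal sketch of the general pattern might look like the following. The rates, group labels, and column names are invented for illustration and say nothing about the real dataset; the key idea is that `sns.heatmap()` expects a rectangular grid, which `pivot()` can build from a long table.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rates in long format; every name and number here is hypothetical.
rates = pd.DataFrame({
    "stay_bin": ["1-3", "1-3", "4-7", "4-7", "8-14", "8-14"],
    "age_group": ["<60", "60+", "<60", "60+", "<60", "60+"],
    "readmit_rate": [0.30, 0.38, 0.41, 0.47, 0.45, 0.55],
})

# Reshape to a grid: one row per age group, one column per stay-length bin
grid = rates.pivot(index="age_group", columns="stay_bin", values="readmit_rate")

fig, ax = plt.subplots(figsize=(5, 3))
# annot=True writes each value in its cell; fmt controls how it is displayed
sns.heatmap(grid, annot=True, fmt=".0%", cmap="Reds", vmin=0, vmax=1, ax=ax)
ax.set_title("Readmission rate by stay length and age group (toy data)")
```

Fixing `vmin` and `vmax` keeps the color scale comparable across charts; leaving them out lets seaborn stretch the scale to the data, which can exaggerate small differences.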


In [None]:
# TODO
# Create heatmap showing readmission rates by bin


# Activity 6: Simpson’s Paradox Exploration Using lmplot

## Real Analyst Skill
Look up seaborn lmplot documentation before completing this section.

## Healthcare Question
Does the relationship between hospital stay length and medication count change when examining readmission groups?

## Simpson Investigation Questions

1. Does the overall data show a positive or negative relationship?
2. Do the subgroup regression lines differ?
3. Is this Simpson’s Paradox? Why or why not?
4. Why could relying only on overall healthcare data be dangerous?
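
To see what a reversal looks like before hunting for one in real data, here is a sketch on synthetic, noise-free data constructed so that the overall trend is positive while the trend inside each group is negative. The column names are hypothetical, and the real dataset may show no reversal at all.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data built to show a reversal; names and values are made up.
stay = np.arange(1, 11)
toy = pd.DataFrame({
    "days_in_hospital": stay,
    "readmitted": ["NO"] * 5 + ["YES"] * 5,
})
# Inside each group, medications fall as stay grows, but the YES group sits higher
toy["num_medications"] = np.where(toy["readmitted"] == "NO", 10 - stay, 25 - stay)

# Slope of a straight-line fit, overall and per group
overall_slope = np.polyfit(toy["days_in_hospital"], toy["num_medications"], 1)[0]
group_slopes = {
    name: np.polyfit(g["days_in_hospital"], g["num_medications"], 1)[0]
    for name, g in toy.groupby("readmitted")
}
print(overall_slope, group_slopes)

# One regression line for everyone, then one line per group via hue
overall_plot = sns.lmplot(data=toy, x="days_in_hospital", y="num_medications", ci=None)
group_plot = sns.lmplot(
    data=toy, x="days_in_hospital", y="num_medications", hue="readmitted", ci=None
)
```

If the subgroup lines merely differ in steepness but point the same way as the overall line, that is a subgroup difference, not Simpson’s Paradox; the paradox requires the direction to reverse.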


In [None]:
# TODO
# Create lmplot showing overall relationship


In [None]:
# TODO
# Create lmplot colored by readmission group


# Final Deliverable: Stakeholder Memo

Include:
- Highest risk patient groups
- Key utilization drivers
- Whether subgroup analysis changed conclusions
- Whether Simpson’s Paradox was present
- One recommended hospital action


In [None]:
# TODO
# Write stakeholder memo in markdown

If you've made it this far, consider other questions that may have arisen as you've worked through this guided EDA. What else would you like to explore and why?