# Exploratory Data Analysis & Funnel Flags

This notebook explores housing affordability data and constructs funnel-stage
indicators used in downstream SQL analysis and dashboards.

Key goals:
- Understand distributions and trends
- Define transparent funnel logic
- Validate assumptions before dashboarding


In [6]:
# Import libraries
import pandas as pd
import numpy as np

In [7]:
# Load dataset
df = pd.read_csv("data/housing_affordability_data.csv")
df.head()

,NAME,median_household_income,median_home_value,state,year,mortgage_rate,hpi,real_income,price_to_income_ratio
0,Alabama,59674,200900,1,2022,5.344038,607.9425,16287.583333,3.366625
1,Alaska,88121,336900,2,2022,5.344038,607.9425,16287.583333,3.823152
2,Arizona,74568,402800,4,2022,5.344038,607.9425,16287.583333,5.401781
3,Arkansas,55432,179800,5,2022,5.344038,607.9425,16287.583333,3.243614
4,California,91551,715900,6,2022,5.344038,607.9425,16287.583333,7.819685


In [8]:
# Basic Data Overview
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   NAME                     52 non-null     object 
 1   median_household_income  52 non-null     int64  
 2   median_home_value        52 non-null     int64  
 3   state                    52 non-null     int64  
 4   year                     52 non-null     int64  
 5   mortgage_rate            52 non-null     float64
 6   hpi                      52 non-null     float64
 7   real_income              52 non-null     float64
 8   price_to_income_ratio    52 non-null     float64
dtypes: float64(4), int64(4), object(1)
memory usage: 3.8+ KB


,median_household_income,median_home_value,state,year,mortgage_rate,hpi,real_income,price_to_income_ratio
count,52.0,52.0,52.0,52.0,52.0,52.0,52.0,52.0
mean,73477.557692,331092.307692,29.788462,2022.0,5.344038,607.9425,16287.58,4.385183
std,14043.682917,148750.900931,16.774557,0.0,4.484219e-15,3.44388e-13,9.18368e-12,1.304471
min,24112.0,122200.0,1.0,2022.0,5.344038,607.9425,16287.58,2.796459
25%,66518.75,223600.0,16.75,2022.0,5.344038,607.9425,16287.58,3.375004
50%,71884.0,291450.0,29.5,2022.0,5.344038,607.9425,16287.58,4.093289
75%,83221.75,398675.0,42.5,2022.0,5.344038,607.9425,16287.58,5.078374
max,101027.0,820100.0,72.0,2022.0,5.344038,607.9425,16287.58,8.869973


In [9]:
# Check for missing values
df.isna().mean().sort_values(ascending=False)

NAME                       0.0
median_household_income    0.0
median_home_value          0.0
state                      0.0
year                       0.0
mortgage_rate              0.0
hpi                        0.0
real_income                0.0
price_to_income_ratio      0.0
dtype: float64

# Trend Over Time

In [10]:
df.groupby("year")[[
        "median_household_income",
        "median_home_value",
        "mortgage_rate"
    ]].mean()

year,median_household_income,median_home_value,mortgage_rate
2022,73477.557692,331092.307692,5.344038


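The data covers a single year (2022), so this table is a one-row snapshot rather than a trend. `mortgage_rate` is also identical across all rows (its standard deviation in `df.describe()` is effectively zero). Comparing year-to-year variation in mortgage rates and income, and testing whether financing conditions drive affordability shocks, requires additional years of data.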
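Before reading a trend off a grouped table, it can help to confirm how many distinct years the data covers and which columns actually vary. The sketch below is a minimal illustration on three rows copied from the `df.head()` output, not on the full dataset; `constant_columns` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

# Three rows copied from the df.head() output; illustrative only.
sample = pd.DataFrame({
    "NAME": ["Alabama", "Alaska", "Arizona"],
    "median_household_income": [59674, 88121, 74568],
    "median_home_value": [200900, 336900, 402800],
    "year": [2022, 2022, 2022],
    "mortgage_rate": [5.344038, 5.344038, 5.344038],
})

def constant_columns(frame: pd.DataFrame) -> list:
    """Return the columns that hold a single distinct value (hypothetical helper)."""
    counts = frame.nunique()
    return counts[counts == 1].index.tolist()

n_years = sample["year"].nunique()
constant = constant_columns(sample)
print(n_years, constant)
```

If `n_years` is 1, a `groupby("year")` table has a single row and cannot show change over time.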

# FUNNEL LOGIC
## Funnel Definitions

Housing affordability is modeled as a four-stage funnel:

1. Income-qualified  
2. Savings-capable (down payment proxy)  
3. Mortgage-eligible  
4. Homeowner (observed outcome)

Thresholds are intentionally simple and transparent.
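As a rough sketch, the stages can be written as nested boolean flags with the thresholds held in named constants. The cutoffs (5, 3.8, 0.31) are the ones used in the stage cells of this notebook; the function name `add_funnel_flags`, the constant names, and the four demo rows are hypothetical and exist only for illustration.

```python
import pandas as pd

# Cutoffs as used in the stage cells of this notebook.
MAX_PTI_INCOME = 5.0    # Stage 1: price-to-income ratio
MAX_PTI_SAVINGS = 3.8   # Stage 2: stricter price-to-income ratio
MAX_TOTAL_DTI = 0.31    # Stage 3: total debt-to-income ratio

def add_funnel_flags(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame with 0/1 funnel flags (hypothetical helper)."""
    out = frame.copy()
    income = out["price_to_income_ratio"] <= MAX_PTI_INCOME
    savings = income & (out["price_to_income_ratio"] <= MAX_PTI_SAVINGS)
    mortgage = savings & (out["total_dti"] <= MAX_TOTAL_DTI)
    out["income_qualified"] = income.astype(int)
    out["savings_capable"] = savings.astype(int)
    out["mortgage_eligible"] = mortgage.astype(int)
    # Stage 4 is defined as passing all prior stages.
    out["homeowner_proxy"] = out["mortgage_eligible"]
    return out

# Made-up rows, one per possible outcome; not taken from the dataset.
demo = pd.DataFrame({
    "NAME": ["State A", "State B", "State C", "State D"],
    "price_to_income_ratio": [3.0, 3.5, 4.5, 6.0],
    "total_dti": [0.28, 0.33, 0.30, 0.45],
})
flags = add_funnel_flags(demo)
print(flags[["NAME", "income_qualified", "savings_capable", "mortgage_eligible"]])
```

Keeping the cutoffs in one place makes it easier to rerun the funnel under different assumptions.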


## Calculate Total Debt-to-Income (DTI) ratio

In [None]:
# Calculate gross monthly income from annual median household income
df["monthly_income"] = df["median_household_income"] / 12

# Convert annual mortgage rate to monthly decimal 
r = (df["mortgage_rate"] / 100) / 12
n = 360  

# Calculate monthly mortgage principal & interest (P&I) 
# Assumes a 20% down payment (financing 80% of the home value)
df["monthly_payment_proxy"] = (0.8 * df["median_home_value"]) * (r * (1 + r)**n) / ((1 + r)**n - 1)

# Calculate realistic housing costs (PITI: Principal, Interest, Taxes, and Insurance)
# Estimate annual Property Taxes + Home Insurance at 1.7% of the total home value
df["annual_tax_ins"] = df["median_home_value"] * 0.017
df["monthly_piti"] = df["monthly_payment_proxy"] + (df["annual_tax_ins"] / 12)

# Incorporate "Hidden Debt" (Car loans, credit cards)
# Assuming an average of $400/month
df["total_monthly_debt"] = df["monthly_piti"] + 400

# Calculate Total Debt-to-Income (DTI) ratio to evaluate mortgage eligibility
df["total_dti"] = df["total_monthly_debt"] / df["monthly_income"]
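The payment proxy above is the standard fixed-rate amortization formula, M = P * r(1+r)^n / ((1+r)^n - 1), with P equal to 80% of the home value. The sketch below reworks it as scalar functions and applies it to the Alabama row from `df.head()`. The function names `monthly_pi` and `total_dti` are hypothetical; the 20% down payment, 1.7% tax-and-insurance share, and $400 of other debt are the assumptions already stated in the cell.

```python
def monthly_pi(home_value, annual_rate_pct, down_payment_share=0.20, n_months=360):
    """Monthly principal and interest on a fixed-rate loan (hypothetical helper)."""
    principal = (1.0 - down_payment_share) * home_value
    r = annual_rate_pct / 100 / 12  # monthly rate as a decimal
    if r == 0:
        return principal / n_months  # zero-interest edge case
    growth = (1 + r) ** n_months
    return principal * r * growth / (growth - 1)

def total_dti(home_value, annual_income, annual_rate_pct,
              tax_ins_share=0.017, other_debt=400.0):
    """Total monthly debt (PITI plus other debt) over gross monthly income."""
    piti = monthly_pi(home_value, annual_rate_pct) + home_value * tax_ins_share / 12
    return (piti + other_debt) / (annual_income / 12)

# Alabama row from df.head(): home value 200900, income 59674, rate 5.344038.
alabama_pi = monthly_pi(200900, 5.344038)
alabama_dti = total_dti(200900, 59674, 5.344038)
print(round(alabama_pi, 2), round(alabama_dti, 3))
```

Under these assumptions the Alabama payment comes to roughly $897 a month and the total DTI to about 0.32, just above the 0.31 cutoff used in Stage 3.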

## Stage 1: Income-qualified

In [12]:
df["income_qualified"] = np.where(
    df["price_to_income_ratio"] <= 5, 1, 0)

df["income_qualified"].value_counts(normalize=True)

income_qualified
1    0.711538
0    0.288462
Name: proportion, dtype: float64

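A state counts as income-qualified when its median home value is at most five times its median household income (price-to-income ratio <= 5). Because the data are state-level medians, the flag describes the typical household in a state, not individual households.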

## Stage 2: Savings-capable (Down Payment Proxy)

In [13]:
df["savings_capable"] = np.where(
    (df["income_qualified"] == 1) & (df["price_to_income_ratio"] <= 3.8), 1, 0)

df["savings_capable"].value_counts(normalize=True)

savings_capable
0    0.634615
1    0.365385
Name: proportion, dtype: float64

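Assumption: a price-to-income ratio of 3.8 or lower stands in for the ability to save a down payment. The dataset has no direct savings measure, so this cutoff is a simplifying proxy rather than an observed quantity.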

## Stage 3: Mortgage-eligible

In [None]:
df["mortgage_eligible"] = np.where(
    (df["savings_capable"] == 1) & (df["total_dti"] <= 0.31), 1, 0)

df["mortgage_eligible"].value_counts(normalize=True)

mortgage_eligible
0    0.769231
1    0.230769
Name: proportion, dtype: float64

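A state is mortgage-eligible when it is savings-capable and its total DTI is at most 0.31. Because the monthly payment depends on the mortgage rate, this stage captures sensitivity to financing conditions.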

## Stage 4: Homeowner Proxy

In [15]:
# Since ACS homeownership is not directly available here, define ownership as passing all prior stages

df["homeowner_proxy"] = np.where(
    (df["income_qualified"] == 1) &
    (df["savings_capable"] == 1) &
    (df["mortgage_eligible"] == 1),
    1,
    0
)

df["homeowner_proxy"].mean()

0.23076923076923078

In [16]:
df[[
    "income_qualified",
    "savings_capable",
    "mortgage_eligible",
    "homeowner_proxy"
]].mean()

income_qualified     0.711538
savings_capable      0.365385
mortgage_eligible    0.230769
homeowner_proxy      0.230769
dtype: float64

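This proxy is used to illustrate structural bottlenecks rather than individual outcomes. Because `mortgage_eligible` already requires the two earlier flags, `homeowner_proxy` is identical to it by construction (both average 0.230769), so Stage 4 adds no information beyond Stage 3.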
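Since each flag is defined conditional on the one before it, the stage means should never increase from one stage to the next, and no row should pass a later stage while failing an earlier one. A small check along these lines can catch a mis-specified condition early. This is a sketch: `is_nested` is a hypothetical helper, the first frame reuses three rows of flags from the regional output in this notebook, and the second frame is deliberately broken to show a failing case.

```python
import pandas as pd

STAGES = ["income_qualified", "savings_capable", "mortgage_eligible", "homeowner_proxy"]

def is_nested(frame: pd.DataFrame, stages=STAGES) -> bool:
    """True if no row passes a later stage while failing an earlier one."""
    return all(
        (frame[later] <= frame[earlier]).all()
        for earlier, later in zip(stages, stages[1:])
    )

# Flags for Alabama, Montana and New Jersey as shown in the regional output;
# homeowner_proxy follows from its definition (all prior stages passed).
good = pd.DataFrame(
    {"income_qualified": [1, 0, 1],
     "savings_capable": [1, 0, 0],
     "mortgage_eligible": [0, 0, 0],
     "homeowner_proxy": [0, 0, 0]},
    index=["Alabama", "Montana", "New Jersey"],
)

# Deliberately inconsistent copy: savings-capable without being income-qualified.
broken = good.copy()
broken.loc["Montana", "savings_capable"] = 1

print(is_nested(good), is_nested(broken))
```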

# FUNNEL DIAGNOSTICS (EDA LEVEL)

## Funnel Conversion Rates

In [None]:
# Calculate Survival Rates - Used for the Funnel Chart visualization
# This measures the share of all states (rows) remaining at each stage
survival_rates = {
    "Income-qualified": df["income_qualified"].mean(),
    "Savings-capable": df["savings_capable"].mean(),
    "Mortgage-eligible": df["mortgage_eligible"].mean(),
    "Homeowner": df["homeowner_proxy"].mean()
}

# Calculate Step Conversion Rates - Used for generating Key Insights
# This measures the share of states that pass from one stage to the next
step_conv = {
    "Income-qualified": 1.0, # Base stage, fixed at 1.0 by convention
    "Savings-capable": survival_rates["Savings-capable"] / survival_rates["Income-qualified"],
    "Mortgage-eligible": survival_rates["Mortgage-eligible"] / survival_rates["Savings-capable"],
    "Homeowner": survival_rates["Homeowner"] / survival_rates["Mortgage-eligible"]
}

# Create a summary table
# Consolidates both total survival rates and step-by-step conversion for reporting
final_funnel = pd.DataFrame({
    "Survival_Rate_Total": survival_rates.values(),
    "Step_Conversion": step_conv.values()
}, index=survival_rates.keys())

print(final_funnel)

                   Survival_Rate_Total  Step_Conversion
Income-qualified              0.711538         1.000000
Savings-capable               0.365385         0.513514
Mortgage-eligible             0.230769         0.631579
Homeowner                     0.230769         1.000000
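The printed rates are consistent with 37, 19, and 12 of the 52 rows passing the first three stages; these counts are inferred from the proportions, not taken from a notebook cell. The sketch below recomputes both columns from counts. It differs from the cell above in one deliberate way: the first step is measured against all 52 rows instead of being fixed at 1.0, so the initial drop is visible as well.

```python
import pandas as pd

total_rows = 52  # rows in the dataset, per df.info()

# Counts inferred from the printed survival rates (assumption, see lead-in).
stage_counts = pd.Series({
    "Income-qualified": 37,
    "Savings-capable": 19,
    "Mortgage-eligible": 12,
    "Homeowner": 12,
})

# Share of all rows remaining at each stage.
survival = stage_counts / total_rows

# Share of the previous stage that passes; the first stage is compared
# with the full population via fill_value.
previous = stage_counts.shift(1, fill_value=total_rows)
step = stage_counts / previous

print(pd.DataFrame({"Survival_Rate_Total": survival, "Step_Conversion": step}))
```

Measured this way, the first step equals the income-qualified survival rate, and the remaining steps match the `Step_Conversion` column above.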


## Regional Bottleneck Check

In [18]:
region_bottleneck = (
    df.groupby("NAME")[[
        "income_qualified",
        "savings_capable",
        "mortgage_eligible"
    ]]
    .mean()
    .sort_values("mortgage_eligible")
)

region_bottleneck.head()


NAME,income_qualified,savings_capable,mortgage_eligible
Alabama,1.0,1.0,0.0
Montana,0.0,0.0,0.0
Nevada,0.0,0.0,0.0
New Hampshire,1.0,0.0,0.0
New Jersey,1.0,0.0,0.0


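Each state appears once, so the group means are simply that state's 0/1 flags. All five rows shown have `mortgage_eligible` equal to 0.0; they are the first of the many states tied at that value, not the five worst cases.

The breakdown still shows three distinct bottlenecks. Montana and Nevada fail at the first stage because their price-to-income ratio exceeds 5. New Hampshire and New Jersey fall into a "Savings Trap": they pass the income stage, but their ratio is above the 3.8 cutoff used as a down payment proxy. Alabama is a late-stage failure: it passes the first two stages but is blocked at the financing stage, where the 2022 mortgage rate and the debt assumptions push total DTI above 0.31.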
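One way to make the bottleneck explicit is to label each state with the first stage it fails. The sketch below does this with `np.select` on the five rows from the output above; the column name `first_failed_stage` and the label strings are hypothetical.

```python
import numpy as np
import pandas as pd

# Flags copied from the regional output above.
flags = pd.DataFrame(
    {"income_qualified": [1, 0, 0, 1, 1],
     "savings_capable": [1, 0, 0, 0, 0],
     "mortgage_eligible": [0, 0, 0, 0, 0]},
    index=["Alabama", "Montana", "Nevada", "New Hampshire", "New Jersey"],
)

# np.select takes the first condition that is True for each row,
# so the order of the list encodes the order of the funnel.
conditions = [
    flags["income_qualified"] == 0,
    flags["savings_capable"] == 0,
    flags["mortgage_eligible"] == 0,
]
labels = ["income", "savings", "mortgage"]
flags["first_failed_stage"] = np.select(conditions, labels, default="none")

print(flags["first_failed_stage"])
```

On the full dataset, `flags["first_failed_stage"].value_counts()` would give the number of states blocked at each stage.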

In [19]:
# Save to csv
df.to_csv("data/housing_funnel_ready.csv", index=False)
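Passing `index=False` matters for downstream consumers: with the default `index=True`, pandas writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference using in-memory buffers instead of real files; the two-row frame is illustrative only.

```python
import io
import pandas as pd

# Two rows copied from the df.head() output; illustrative only.
frame = pd.DataFrame({
    "NAME": ["Alabama", "Alaska"],
    "median_home_value": [200900, 336900],
})

def roundtrip_columns(data: pd.DataFrame, index: bool) -> list:
    """Write data to an in-memory CSV and return the columns read back."""
    buffer = io.StringIO()
    data.to_csv(buffer, index=index)
    buffer.seek(0)
    return pd.read_csv(buffer).columns.tolist()

with_index = roundtrip_columns(frame, index=True)
without_index = roundtrip_columns(frame, index=False)
print(with_index)
print(without_index)
```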