# CVS Health Community Access Analysis - Part 2: Exploratory Analysis

This notebook explores key questions about CVS clinic distribution, vulnerability, and health needs. We answer four main questions to understand access patterns.


In [None]:
# load data from previous notebook (run 06a first)
# if running independently, uncomment the lines below:
# import pandas as pd
# import numpy as np
# df = pd.read_csv(r"C:\Users\14122\OneDrive\Desktop\cvs_heath_project\data\processed\CVS_FINAL_DATASET.csv")
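# for col in df.columns:
#     try:
#         # convert numeric-looking columns only; errors='coerce' never raises,
#         # so it would silently turn text columns (county/state names) into NaN
#         df[col] = pd.to_numeric(df[col])
#     except (ValueError, TypeError):
#         pass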

print("data loaded - ready for analysis")


## Q1: How many CVS MinuteClinic locations per county?

we start by understanding the basic distribution of clinics across counties. this tells us how concentrated or spread out CVS clinics are.


In [None]:
# get descriptive statistics for clinic counts
# this shows us the distribution: mean, median, quartiles, min, max
clinic_stats = df['clinic_count'].describe()
print("clinic count statistics:")
print(clinic_stats)


### what we learned from Q1:

- at least 75% of counties have no clinic (the 75th percentile of `clinic_count` is 0)
- at most 25% of counties have 1+ clinics
- a few large counties hold a large share of the network (high max value)
- this suggests CVS clinics are highly concentrated in specific areas
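
the concentration claim can be made more concrete than the `describe()` output allows. below is a minimal sketch, using toy numbers rather than the real dataset, of how the share of all clinics held by the top k counties could be computed. the helper name `top_k_share` is hypothetical and only `clinic_count` comes from this notebook.

```python
import pandas as pd

def top_k_share(counts: pd.Series, k: int) -> float:
    """fraction of all clinics located in the k counties with the most clinics."""
    total = counts.sum()
    if total == 0:
        return 0.0
    return float(counts.nlargest(k).sum() / total)

# toy data for illustration only (not taken from CVS_FINAL_DATASET.csv)
toy = pd.DataFrame({"clinic_count": [0, 0, 0, 0, 0, 0, 1, 2, 3, 14]})

share_top1 = top_k_share(toy["clinic_count"], k=1)   # 14 of 20 clinics
share_top3 = top_k_share(toy["clinic_count"], k=3)   # 19 of 20 clinics
pct_zero = (toy["clinic_count"] == 0).mean() * 100   # share of counties with none

print(f"top 1 county holds {share_top1:.0%} of clinics")
print(f"top 3 counties hold {share_top3:.0%} of clinics")
print(f"counties with zero clinics: {pct_zero:.1f}%")
```

on the real data, `top_k_share(df['clinic_count'], k=50)` would report how much of the network sits in the 50 largest counties.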

### questions to explore next:

- do vulnerable, rural, low-income, or high burden counties have fewer clinics?
- are wealthy or predominantly white counties getting more coverage?


## Q2: What percentage of counties have zero clinics? Is CVS underserving counties with higher socioeconomic vulnerability scores?

we investigate whether there's a relationship between social vulnerability and clinic access. this helps us understand whether CVS clinics are systematically absent from vulnerable communities.


In [None]:
# calculate the percentage of counties with zero clinics
zero_clinic_pct = (df['clinic_count'] == 0).mean() * 100
print(f"percentage of counties with zero clinics: {zero_clinic_pct:.1f}%")
print(f"number of counties with zero clinics: {(df['clinic_count'] == 0).sum():,}")
print(f"number of counties with at least one clinic: {(df['clinic_count'] > 0).sum():,}")


In [None]:
# compare average SVI (social vulnerability index) for counties with and without clinics
# this tells us if more vulnerable counties are less likely to have clinics
svi_comparison = df.groupby(df['clinic_count'] > 0)['svi_overall'].mean()
print("average SVI by clinic presence:")
print(svi_comparison)
print(f"\ncounties without clinics: {svi_comparison[False]:.4f}")
print(f"counties with clinics: {svi_comparison[True]:.4f}")
print(f"difference: {svi_comparison[False] - svi_comparison[True]:.4f}")


### what this demonstrates:

- counties without clinics have higher average SVI (more vulnerable)
- counties with clinics have lower average SVI (less vulnerable)
- this suggests more vulnerable counties are less likely to have CVS clinics
- the difference indicates a potential access gap for vulnerable communities
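
a difference in group means does not say whether the gap is larger than chance would produce. one way to check, sketched here on toy arrays, is a two-sided permutation test on the difference in means. the function `permutation_pvalue` and the toy values are assumptions for illustration, not part of the original analysis.

```python
import numpy as np

def permutation_pvalue(a, b, n_perm=2000, seed=0):
    """two-sided permutation test for a difference in means between a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        # shuffle group labels by permuting the pooled values
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # add-one correction keeps the estimate away from exactly zero
    return observed, (hits + 1) / (n_perm + 1)

# toy SVI values for illustration only
svi_no_clinic = np.linspace(0.4, 0.8, 50)
svi_with_clinic = np.linspace(0.2, 0.6, 50)

diff, p_value = permutation_pvalue(svi_no_clinic, svi_with_clinic)
print(f"difference in means: {diff:.3f}, permutation p-value: {p_value:.4f}")
```

with the real data, the two inputs would be `df.loc[df['clinic_count'] == 0, 'svi_overall'].dropna()` and `df.loc[df['clinic_count'] > 0, 'svi_overall'].dropna()`.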


In [None]:
# compare socioeconomic vulnerability specifically
# socioeconomic SVI focuses on poverty, unemployment, income, and education
socioeconomic_comparison = df.groupby(df['clinic_count'] > 0)['svi_socioeconomic'].mean()
print("average socioeconomic SVI by clinic presence:")
print(socioeconomic_comparison)


### what this demonstrates:

- counties with 0 clinics have higher socioeconomic vulnerability (0.504)
- counties with 1+ clinics have lower socioeconomic vulnerability (0.427)
- this suggests that lower-income counties are less likely to have CVS clinics
- the difference in means (0.077) points to a socioeconomic gap in clinic access
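
comparing two group means hides how clinic presence changes across the vulnerability range. a rough sketch of a complementary view is to bin counties into SVI quartiles with `pd.qcut` and compute the share of counties with at least one clinic in each bin. the toy values below are made up for illustration; only the column names come from this notebook.

```python
import pandas as pd

# toy data for illustration only (not taken from CVS_FINAL_DATASET.csv)
toy = pd.DataFrame({
    "svi_socioeconomic": [0.05, 0.10, 0.20, 0.30, 0.40, 0.50,
                          0.60, 0.70, 0.80, 0.90, 0.95, 0.99],
    "clinic_count":      [2, 1, 1, 1, 0, 1,
                          0, 0, 1, 0, 0, 0],
})

toy["has_clinic"] = toy["clinic_count"] > 0

# Q1 = least vulnerable quarter of counties, Q4 = most vulnerable quarter
toy["svi_quartile"] = pd.qcut(
    toy["svi_socioeconomic"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

# share of counties with at least one clinic in each quartile
presence_rate = toy.groupby("svi_quartile", observed=True)["has_clinic"].mean()
print(presence_rate)
```

a steady drop from Q1 to Q4 on the real data would support the reading that clinic presence falls as socioeconomic vulnerability rises.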


In [None]:
# compare minority vulnerability
# this helps us understand if there's a racial/ethnic component to clinic distribution
minority_comparison = df.groupby(df['clinic_count'] > 0)['svi_minority'].mean()
print("average minority SVI by clinic presence:")
print(minority_comparison)


### what this demonstrates:

- counties with 0 clinics: 0.461 average minority SVI
- counties with 1+ clinics: 0.640 average minority SVI
- interestingly, counties WITH clinics have HIGHER minority vulnerability
- one plausible explanation is that CVS clinics are concentrated in diverse, urban areas (urban/rural status is not tested here)
- under that reading, the primary access gap would be rural low-income regions rather than racial exclusion
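
the urban explanation could be checked by comparing minority SVI within urban and rural counties separately. the sketch below assumes a hypothetical `area_type` column ("urban" / "rural") that does not appear in this notebook, and uses toy numbers chosen to show how a pooled gap can shrink or reverse once counties are stratified.

```python
import pandas as pd

# toy data for illustration only; `area_type` is a hypothetical column
toy = pd.DataFrame({
    "area_type":    ["urban", "urban", "urban", "urban",
                     "rural", "rural", "rural", "rural"],
    "clinic_count": [3, 1, 2, 0, 0, 0, 1, 0],
    "svi_minority": [0.70, 0.65, 0.75, 0.72, 0.40, 0.35, 0.38, 0.42],
})

# string labels are easier to read than True/False in the output table
toy["clinic_presence"] = toy["clinic_count"].gt(0).map(
    {True: "has clinic", False: "no clinic"}
)

# pooled comparison, same idea as the cell above
pooled = toy.groupby("clinic_presence")["svi_minority"].mean()

# same comparison within each area type
stratified = (
    toy.groupby(["area_type", "clinic_presence"])["svi_minority"]
       .mean()
       .unstack("clinic_presence")
)

print(pooled)
print(stratified)
```

in this toy example the pooled means differ by about 0.15, but within each area type the gap is close to zero, so the pooled difference mostly reflects where clinics are located.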


### summary of Q2 findings:

CVS clinic coverage appears to track socioeconomic and geographic differences. Counties without clinics have higher average socioeconomic vulnerability (0.504 vs. 0.427), suggesting that lower-income communities are underserved. However, counties with clinics have higher average minority vulnerability (0.640 vs. 0.461), which is consistent with CVS concentrating in diverse, urban areas. The primary access gap appears to be rural low-income regions rather than racial exclusion, although urban/rural status is not tested directly here and no significance test was run.


## Q3: Do high-health-need counties have fewer clinics?

we investigate whether counties with worse health outcomes (higher health burden) have less access to CVS clinics. this would indicate a mismatch between need and resources.


In [None]:
# create a health burden score using key health indicators
# we use stroke, physical inactivity, self-care disability, and social isolation
# these represent chronic disease, disability, and social health factors
health_vars = ['stroke', 'physical_inactivity', 'self_care_disability', 'social_isolation']

# check which variables are available
available_vars = [var for var in health_vars if var in df.columns]
print(f"using health variables: {available_vars}")

# calculate health burden as the average of these indicators
# higher score = worse health outcomes
df['health_burden_score'] = df[available_vars].mean(axis=1)

print("\nhealth burden score created")
print(f"mean: {df['health_burden_score'].mean():.2f}")
print(f"median: {df['health_burden_score'].median():.2f}")
print(f"range: {df['health_burden_score'].min():.2f} to {df['health_burden_score'].max():.2f}")


In [None]:
# compare health burden scores for counties with and without clinics
# this tells us if sicker counties have less access
health_burden_comparison = df.groupby(df['clinic_count'] > 0)['health_burden_score'].mean()
print("average health burden score by clinic presence:")
print(health_burden_comparison)
print(f"\ncounties without clinics: {health_burden_comparison[False]:.2f}")
print(f"counties with clinics: {health_burden_comparison[True]:.2f}")
print(f"difference: {health_burden_comparison[False] - health_burden_comparison[True]:.2f}")


### what this means:

- counties with 0 clinics have a higher average health burden (16.37)
- counties with 1+ clinics have a lower average health burden (15.08)
- the difference is 1.29 points: counties without CVS clinics have worse health outcomes on average
- this suggests a gap: sicker populations have less access to CVS services
- open question: are healthcare resources mismatched with the areas of highest need?


### summary of Q3 findings:

Counties without CVS clinics have higher average health burden scores (16.37 vs. 15.08), suggesting that sicker, higher-need populations are less likely to have access to CVS clinic services. This points to a mismatch between clinic distribution and community health needs, though no significance test was run on the difference.
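
one caveat on the score itself: averaging raw indicator values lets any indicator with a larger numeric range dominate the result. if the four indicators are on different scales, a standardized version may be more balanced. the sketch below averages column-wise z-scores instead. the helper `zscore_burden` and the toy values are assumptions for illustration, not the method used in this notebook.

```python
import pandas as pd

def zscore_burden(frame: pd.DataFrame, cols: list) -> pd.Series:
    """average of column-wise z-scores; higher = worse, indicators weighted equally."""
    z = (frame[cols] - frame[cols].mean()) / frame[cols].std(ddof=0)
    return z.mean(axis=1)

# toy data for illustration only: two indicators on different scales
toy = pd.DataFrame({
    "stroke":              [2.0, 3.0, 4.0, 9.0],
    "physical_inactivity": [30.0, 25.0, 28.0, 24.0],
})
cols = ["stroke", "physical_inactivity"]

raw_score = toy[cols].mean(axis=1)   # same approach as health_burden_score
z_score = zscore_burden(toy, cols)   # standardized alternative

print(pd.DataFrame({"raw_score": raw_score, "z_score": z_score}))
```

on the real data, `zscore_burden(df, available_vars)` could be compared against `health_burden_score` to see whether the list of high-burden counties changes.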


## Q4: What counties are underserved?

we identify specific counties that are most underserved - those with high health needs but no clinics. these are priority targets for expansion.


In [None]:
# define high burden threshold as top 25% (75th percentile)
# counties at or above this threshold have a higher burden score than roughly 75% of all counties
high_burden_threshold = df['health_burden_score'].quantile(0.75)
print(f"high burden threshold (75th percentile): {high_burden_threshold:.2f}")

# identify underserved counties: high health need but zero clinics
# these are the counties where expansion would have the biggest impact
underserved = df[
    (df['clinic_count'] == 0) &
    (df['health_burden_score'] >= high_burden_threshold)
]

print(f"\nnumber of underserved counties: {len(underserved)}")
print(f"percentage of all counties: {len(underserved) / len(df) * 100:.1f}%")


In [None]:
# show top 20 most underserved counties
# sorted by health burden score (highest need first)
top_underserved = underserved[['county_full', 'state_full', 'health_burden_score', 'svi_overall']].sort_values(
    by='health_burden_score', ascending=False
).head(20)

print("top 20 most underserved counties in the U.S.:")
print("(high health burden + zero clinics)")
top_underserved


### what this means:

these are the zero-clinic counties with the highest health burden scores. they represent the biggest opportunity for CVS to make an impact by expanding into areas with the greatest need. many of these counties appear to be in rural areas, particularly in the South and Midwest.
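
the regional pattern can be checked by counting underserved counties per state rather than reading it off the top-20 table. a minimal sketch, assuming the same definition of underserved as above and using placeholder state names:

```python
import pandas as pd

# toy data for illustration only; state names are placeholders
toy = pd.DataFrame({
    "state_full": ["State X"] * 4 + ["State Y"] * 4 + ["State Z"] * 4,
    "clinic_count": [0, 0, 0, 2, 0, 1, 0, 0, 3, 0, 0, 0],
    "health_burden_score": [19.0, 18.0, 17.0, 12.0, 16.0, 11.0,
                            10.0, 13.0, 20.0, 9.0, 8.0, 14.0],
})

# same rule as above: zero clinics and burden at or above the 75th percentile
threshold = toy["health_burden_score"].quantile(0.75)
is_underserved = (toy["clinic_count"] == 0) & (toy["health_burden_score"] >= threshold)

# count and share of underserved counties per state
by_state = (
    toy.assign(underserved=is_underserved.astype(int))
       .groupby("state_full")["underserved"]
       .agg(n_underserved="sum", n_counties="size", share="mean")
       .sort_values("n_underserved", ascending=False)
)
print(by_state)
```

reporting the share alongside the count helps avoid overstating states that simply have many counties.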
