# DATA 602 Final Project
### Naomi Buell, Richie Rivera, Alexander Simon

## Abstract

**Background**: Opioid addiction is a public health crisis that has affected countless lives. Medicaid, which is a health insurance program for low-income individuals in the US, plays an important role in helping people with opioid use disorder (OUD) get medical treatment. The Affordable Care Act expanded Medicaid coverage in 2014, but not all states have implemented it yet. 

**Research Question**: Is there a correlation between OUD prevalence rates in US states and the status of Medicaid expansion?

**Methods**: We downloaded data on the prevalence of pain reliever misuse in each state from the National Survey on Drug Use and Health, conducted by the Substance Abuse and Mental Health Services Administration, for all available years (2016–2019 and 2022). We obtained the status of each state’s decision on Medicaid expansion from KFF, a health policy organization. Data were cleaned and merged into a Pandas dataframe and visualized with scatter/line plots and choropleth maps created using matplotlib and Plotly Express. We also performed linear regression and a paired t-test for difference in means using Python.

**Results**: Forty states had post-expansion data, 10 had pre-expansion data, and 8 had both pre- and post-expansion data. Overall OUD prevalence rates ranged from 1.3% to 6.5% (mean 3.9%). Choropleth maps showed that OUD prevalence rates have trended downward nationwide from 2016 to 2022. Linear regression analysis suggested that the rate of decline in post-expansion states was slightly faster than in pre-expansion states (slope = -0.24 vs -0.19), but R² values were low (0.2–0.3). The paired t-test was significant (P=0.048); however, not all test assumptions were met.

**Conclusion**: Our results suggest that there is a correlation between OUD prevalence rates and Medicaid expansion status, and that expansion is associated with a faster decline. However, we did not have enough data points to conclude that the difference is statistically significant.


## Introduction

Opioid addiction is a public health crisis in the US that has affected countless lives. Medicaid is a joint federal and state health insurance program for low-income individuals. As a result of the Affordable Care Act (ACA), expanded coverage became available in 2014, but not all states have implemented it.

As professionals in public health and biology, we are interested in knowing whether improved access to medical treatment can reduce the prevalence of substance use disorders. 

We obtained data on the prevalence of pain reliever misuse in each state from the National Survey on Drug Use and Health (NSDUH), conducted by the [Substance Abuse and Mental Health Services Administration (SAMHSA)](https://www.samhsa.gov/), and the status of each state’s decision on Medicaid expansion from [KFF](https://www.kff.org/affordable-care-act/issue-brief/status-of-state-medicaid-expansion-decisions-interactive-map/), a health policy organization. 

Below, we show our data analysis and findings.


## Data Wrangling 

Our data are from the SAMHSA [National Survey on Drug Use and Health (NSDUH)](https://datatools.samhsa.gov/) 2-year restricted-use data sets for 2015-16, 2016-17, 2017-18, 2018-19, and 2021-22. We label each data set by its second year, which gives report years 2016-2019 and 2022. No data related to our research question were available prior to 2015 (the survey question of interest was not yet being asked) or for 2020 (likely due to COVID).

On the SAMHSA Data Tools webpage, we created "crosstabs" (data subsets) for the following variables and downloaded the CSV files:
-  PNRNMYR - Whether the respondent misused prescription pain relievers during the past 12 months
-  STUSAB - State US abbreviation

We also downloaded Medicaid expansion data (CSV) from [KFF](https://www.kff.org/affordable-care-act/issue-brief/status-of-state-medicaid-expansion-decisions-interactive-map/).

## Exploratory Data Analysis


### NSDUH Opioid Misuse Data

Below we import the NSDUH datasets, create dataframes, and explore this data.  

In [None]:
# Import libraries
import pandas as pd
import os
import re
import plotly.express as px
from matplotlib import pyplot as plt
from scipy.stats import norm
import numpy as np

# Set up filepaths
file_paths = [
    'data/STUSAB X PNRNMYR (2015-16).csv',
    'data/STUSAB X PNRNMYR (2016-17).csv',
    'data/STUSAB X PNRNMYR (2017-18).csv',
    'data/STUSAB X PNRNMYR (2018-19).csv',
    'data/STUSAB X PNRNMYR (2021-22).csv',
]

# Iterate over each path to add the CSV file to a list
df_collection = []
for path in file_paths:
    print(f'Reading in "{path}"')

    # Use regex to extract report year from path
    match = re.search(r'-(\d{2})\)', path)
    report_year = '20' + match.group(1)

    t_df = pd.read_csv(path)

    # Use January 1 of the report year as the report date
    t_df['rpt_yr'] = pd.to_datetime(f'{report_year}-01-01')

    df_collection.append(t_df)

# Combine the collection of dataframes into one, with a fresh index so row labels are unique
df = pd.concat(df_collection, ignore_index=True)

print(df.head())

Below, we print the list of columns, length, number of non-missing observations, and data types.

In [None]:
# Info
df.info()

All 780 observations of the column `Count` are missing, but we can use the `Weighted Count` column for our analysis instead, so this is OK.<sup>[1](#footnote1)</sup> Other columns have up to 260 missing observations; however, our main variables of interest, `STATE US ABBREVIATION`, `RC-PAIN RELIEVERS - PAST YEAR MISUSE`, and `Row %`, are complete. We may also use `Row % CI (lower)` and `Row % CI (upper)`, which are 67% complete in the full dataset but (as explored later) 100% complete after we filter the data down to observations of interest.

<sup id="footnote1">1</sup> Note that `Row %`s are rounded, so we may opt to calculate prevalence rates ourselves using `Weighted Count` for more precision.
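
As a rough sketch of the idea in the footnote, prevalence can be recomputed by dividing each state's weighted count of respondents who misused pain relievers by that state's total weighted count for the same report year. The numbers and the "did not misuse" label below are made up for illustration, and the sketch assumes that any "Overall" total rows have already been dropped so that each state-year has one row per response category.

```python
import pandas as pd

# Toy data (made-up numbers) mimicking the crosstab layout:
# one row per state, response category, and report year
toy = pd.DataFrame({
    'STATE US ABBREVIATION': ['AL', 'AL', 'WY', 'WY'],
    'RC-PAIN RELIEVERS - PAST YEAR MISUSE': [
        '0 - Did not misuse (hypothetical label)',
        '1 - Misused within the past year',
        '0 - Did not misuse (hypothetical label)',
        '1 - Misused within the past year',
    ],
    'Weighted Count': [950.0, 50.0, 485.0, 15.0],
    'rpt_yr': [2016, 2016, 2016, 2016],
})

keys = ['STATE US ABBREVIATION', 'rpt_yr']

# Total weighted count per state and year, repeated on every row of the group
toy['state_total'] = toy.groupby(keys)['Weighted Count'].transform('sum')

# Share of each response category within its state-year, as a percentage
toy['prevalence_pct'] = 100 * toy['Weighted Count'] / toy['state_total']

# Keep only the misuse rows to get unrounded prevalence rates
misuse = toy[toy['RC-PAIN RELIEVERS - PAST YEAR MISUSE']
             == '1 - Misused within the past year']
print(misuse[keys + ['prevalence_pct']])
```

The result can be compared against `Row %` as a sanity check before switching to the recomputed values.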

Below are the means, medians, and other summary statistics of numeric columns.

In [None]:
# Summary statistics
df.describe()

Here is a preview of our data after filtering down to just our columns and rows of interest.

In [None]:
# Selecting columns of interest from data
df_cols = df[['STATE US ABBREVIATION',
'RC-PAIN RELIEVERS - PAST YEAR MISUSE',
'Row %',
'Row % CI (lower)',
'Row % CI (upper)',
'Weighted Count',
'rpt_yr']]

# Subset the rows with states, removing the overall US observations
# Also removed DC (District of Columbia) because it's not a state
df_states = df_cols[(df_cols['STATE US ABBREVIATION'] != 'Overall') & 
                    (df_cols['STATE US ABBREVIATION'] != 'DC')]

# Subset the rows where RC-PAIN RELIEVERS - PAST YEAR MISUSE = "1 - Misused within the past year" to get prevalence of opioid misuse
df_filtered = df_states[df_states['RC-PAIN RELIEVERS - PAST YEAR MISUSE'] == "1 - Misused within the past year"]

# Preview filtered data
df_filtered.head(10)


Here are the missingness and summary statistics of the numeric variables in this filtered dataframe.

In [None]:
# Show missingness of filtered data
print(df_filtered.info())

# Show summary statistics of filtered data
df_filtered.describe()

After filtering data, we have 100% completeness. States have, on average, 3.9% prevalence of opioid misuse per year. 

### KFF State Medicaid Expansion Data

Below we import the KFF dataset and explore this data.  

In [None]:
# Import KFF data
path_kff = "data/raw_data_kff.xlsx"

df_kff = pd.read_excel(path_kff, skiprows=2)

# Remove District of Columbia because it's not a state
df_kff = df_kff[df_kff['Location'] != 'District of Columbia']

df_kff.head(10)

We convert the state names to abbreviations to match NSDUH data.

In [None]:
# Create a dictionary with state names and their abbreviations as key:value pairs
us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY"
    # "District of Columbia": "DC",
    # "American Samoa": "AS",  
    # "Guam": "GU",  
    # "Northern Mariana Islands": "MP",  
    # "Puerto Rico": "PR",
    # "United States": "US",
}

# Map the state names in the KFF dataframe to the corresponding abbreviation
df_kff['Abbrev'] = df_kff['Location'].map(us_state_to_abbrev)

df_kff.head(10)
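
Because `Series.map` returns `NaN` for any value that is not a key in the dictionary, it can be worth listing the locations that did not map (for example, a national total row, if the file contains one). The sketch below uses a shortened, hypothetical lookup and made-up rows rather than the KFF file.

```python
import pandas as pd

# Hypothetical two-state lookup and location names, for illustration only
name_to_abbrev = {'Alabama': 'AL', 'Alaska': 'AK'}
locations = pd.DataFrame({'Location': ['Alabama', 'Alaska', 'United States']})

# Names missing from the dictionary become NaN
locations['Abbrev'] = locations['Location'].map(name_to_abbrev)

# List the locations that have no abbreviation after mapping
unmapped = locations.loc[locations['Abbrev'].isna(), 'Location'].tolist()
print(unmapped)
```

Any name that appears in this list either needs an entry in the dictionary or should be dropped deliberately before merging.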

Below, we print the list of columns, length, number of non-missing observations, and data types.

In [None]:
# Info
df_kff.info()

We convert the `Implemented Expansion On` variable to a datetime datatype and summarize it below.

In [None]:
# Convert to datetime
df_kff['Implemented Expansion On'] = pd.to_datetime(df_kff['Implemented Expansion On'], errors='coerce')

# Range of dates
df_kff.describe()

The 'count' row shows that 40 states have expanded Medicaid so far (missing dates indicate that a state has not yet adopted Medicaid expansion). Most states that have expanded Medicaid did so on the first day of 2014. The last state to expand Medicaid, North Carolina, did so in December 2023.

Lastly, we'll combine the KFF and NSDUH data into one dataframe that we will perform our analysis on:

In [None]:
working_df = df_filtered.merge(
    df_kff,
    left_on = 'STATE US ABBREVIATION',
    right_on = 'Abbrev',
    how = 'left'
)

working_df.head()
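
A left merge keeps every NSDUH row even when no KFF row matches, so unmatched states pass through silently with missing values. One optional safeguard, sketched here on toy tables with a made-up `Status` column and a deliberately unmatched `ZZ` code, is to pass the `validate` and `indicator` arguments of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the NSDUH (left) and KFF (right) tables; values are made up
left = pd.DataFrame({
    'STATE US ABBREVIATION': ['AL', 'AL', 'WY', 'ZZ'],
    'Row %': [0.050, 0.048, 0.032, 0.040],
})
right = pd.DataFrame({
    'Abbrev': ['AL', 'WY'],
    'Status': ['toy status A', 'toy status B'],
})

merged = left.merge(
    right,
    left_on='STATE US ABBREVIATION',
    right_on='Abbrev',
    how='left',
    validate='many_to_one',  # raises MergeError if 'Abbrev' repeats on the right
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Left-hand rows that found no partner in the right-hand table
unmatched = merged.loc[merged['_merge'] == 'left_only',
                       'STATE US ABBREVIATION'].tolist()
print(unmatched)
```

An empty list means every state in the left table found exactly one match on the right.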

We perform some additional wrangling to facilitate the analyses:

In [None]:
# Add Boolean to indicate whether rates are post expansion (pre-expansion = False)
working_df['Post Expansion'] = working_df['rpt_yr'] >= working_df['Implemented Expansion On']
# Extract year from report datetime
working_df['rpt_yr'] = working_df['rpt_yr'].dt.year
# Scale the row % to make it easier to understand in the heatmaps
working_df['Scaled Row %'] = working_df['Row %'] * 100

working_df.head()
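
One detail of the `Post Expansion` comparison is worth spelling out: states that have not expanded Medicaid have a missing expansion date (`NaT`), and any comparison against `NaT` evaluates to `False`, so those rows are flagged as pre-expansion. The sketch below illustrates this behavior with made-up dates.

```python
import pandas as pd

# Made-up report dates and expansion dates; None stands for "not expanded"
report = pd.to_datetime(pd.Series(['2016-01-01', '2016-01-01', '2016-01-01']))
expanded_on = pd.to_datetime(pd.Series(['2014-01-01', '2019-01-01', None]))

# Element-wise comparison; the NaT row compares as False
post_expansion = report >= expanded_on
print(post_expansion.tolist())
```

The three rows correspond to a state that expanded before the report date, one that expanded after it, and one that never expanded.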

Pivot the 'Row %' column from long to wide so each report year is in its own column. This makes it easier to calculate the average OUD prevalence for each state.

In [None]:
# This cell contributed by Alex
# I divided working_df into expanded and not_expanded dataframes and pivoted them from long to wide

# States that expanded Medicaid
expanded_df = working_df[working_df['Implemented Expansion On'].notna()]
expanded_pivot_df = expanded_df.pivot(index = 'Abbrev', columns = 'rpt_yr', values = 'Scaled Row %').reset_index()
# Calculate average OUD rates during 2016-2019 (ie, before COVID)
expanded_pivot_df['Avg_2016_2019'] = expanded_pivot_df[[2016, 2017, 2018, 2019]].mean(axis = 1).round(1)
expanded_pivot_df['Expansion_status'] = 'Post-Expansion'

# States that had not expanded Medicaid
not_expanded_df = working_df[working_df['Implemented Expansion On'].isna()]
not_expanded_pivot_df = not_expanded_df.pivot(index = 'Abbrev', columns = 'rpt_yr', values = 'Scaled Row %').reset_index()
# Calculate average OUD rates during 2016-2019
not_expanded_pivot_df['Avg_2016_2019'] = not_expanded_pivot_df[[2016, 2017, 2018, 2019]].mean(axis = 1).round(1)
not_expanded_pivot_df['Expansion_status'] = 'Pre-Expansion'

# Merge the dataframes
all_pivot_df = pd.concat([not_expanded_pivot_df, expanded_pivot_df], axis = 0)
print(all_pivot_df.head())

## Data Analysis

The histogram below shows the distribution of OUD prevalence rates in our dataset. We also fit a normal distribution to the data and overlay it on the histogram; the close match suggests that the rates are approximately normally distributed.

In [None]:
# This cell contributed by Alex
# Create histogram of OUD rates with a normal distribution for comparison

# Fit normal distribution to the data
mu, std = norm.fit(working_df['Scaled Row %']) 

# Create the figure first so that the size applies to the histogram
plt.figure(figsize = (10, 6))

# Plot histogram of OUD rates
plt.hist(working_df['Scaled Row %'], bins = 30, density = True, 
         color = 'lightblue', edgecolor = 'darkgray')
plt.axvline(working_df['Scaled Row %'].mean(), color = 'red', 
            linestyle = 'dashed')
plt.axvline(working_df['Scaled Row %'].median(), color = 'blue', 
            linestyle = 'dashed')

# Overlay normal distribution
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k')

# Labels
min_ylim, max_ylim = plt.ylim()
plt.text(working_df['Scaled Row %'].mean() * 1.2, max_ylim * 0.9, 
         'Mean: {:.2f}'.format(working_df['Scaled Row %'].mean()), 
         color = 'red')
plt.text(working_df['Scaled Row %'].mean() * 1.2, max_ylim * 0.85, 
         'Median: {:.2f}'.format(working_df['Scaled Row %'].median()), 
         color = 'blue')
plt.xlabel('OUD Prevalence (%)')
# density = True normalizes the histogram, so the y-axis shows density
plt.ylabel('Density')
plt.title('Histogram of OUD Prevalence Rates')

# View plot
plt.show()
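
The visual comparison above could be supplemented with a formal normality check. The sketch below applies the Shapiro-Wilk test (`scipy.stats.shapiro`), which is not part of the analysis above, to simulated values rather than to the NSDUH rates; the sample size, mean, and spread are assumptions chosen only to resemble the dataset.

```python
import numpy as np
from scipy.stats import shapiro

# Simulated rates (not the NSDUH data): 250 draws centered near 3.9%,
# with an assumed standard deviation of 0.8
rng = np.random.default_rng(0)
simulated_rates = rng.normal(loc=3.9, scale=0.8, size=250)

# Null hypothesis: the sample comes from a normal distribution
statistic, p_value = shapiro(simulated_rates)
print(f'W = {statistic:.3f}, p = {p_value:.3f}')
```

A small p-value would be evidence against normality; a large one means only that the test did not detect a departure from it.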

We created heatmaps (technically choropleth maps) of the average rates of opioid misuse during 2016-2019 (ie, pre-COVID) in states pre- and post-Medicaid expansion.

Pre-expansion (left plot), Alabama had the highest average rate of misuse (5.0%) and Wyoming had the lowest average rate (3.2%). Post-expansion (right plot), Oregon and Nevada had the highest average rate of misuse (5.4% for both), and Maine and Nebraska had the lowest (3.0% for both).

In [None]:
# Author: Alex
import plotly.graph_objects as go

def draw_heatmap(df: pd.DataFrame, rate_column: str, 
                 facet_column: str, plots_per_row: int, title: str) -> go.Figure:
  '''
  This function takes a dataframe with OUD rates and 
  returns a choropleth map (ie, geographic heatmap)

  Args:
    df: Dataframe containing OUD prevalence rates
    rate_column: Name of column with OUD rates
    facet_column: Name of column with titles for each facet
    plots_per_row: Number of facets per row
    title: Main plot title
  
  Returns:
    heatmap: Plotly figure (plotly.graph_objects.Figure)
  '''
  heatmap = px.choropleth(df,
                        locations = 'Abbrev',
                        locationmode = "USA-states",
                        color = rate_column,
                        color_continuous_scale = 'rdbu_r',
                        labels = {rate_column : '% OUD'},
                        hover_name = 'Abbrev',
                        scope = 'usa',
                        facet_col = facet_column,
                        facet_col_wrap = plots_per_row)

  # Customize facets
  # Remove column name from title
  heatmap.for_each_annotation(lambda a: a.update(text = a.text.split("=")[-1]))
  # Adjust font size of title
  heatmap.update_annotations(font_size = 16)
  # Overall layout
  heatmap.update_layout(
      autosize = False,
      width = 1200,
      height = 600,
      title = {
          'text': title,
          'y' : 0.99,
          'x' : 0.5,
          'xanchor': 'center',
          'yanchor': 'top'},
      title_font_weight = 600)

  return heatmap

Note that the Plotly Express heatmaps are interactive. You can see information about an individual state by hovering over it, and you can zoom and pan each map.

In [None]:
# This cell contributed by Alex

# Compare average OUD rates pre vs post expansion
supertitle = 'Average OUD Prevalence During 2016-2019'
rate_column = 'Avg_2016_2019'
facet_column = 'Expansion_status'
plots_per_row = 2
avg_heatmap = draw_heatmap(all_pivot_df, rate_column, facet_column, plots_per_row, supertitle)
avg_heatmap.show()

We also created heatmaps for individual years to show the change in OUD prevalence over time pre- and post-expansion.

In general, OUD rates decreased from 2016 to 2022 in both pre-expansion states and post-expansion states, suggesting that the decline in OUD rates was not solely due to Medicaid expansion. In addition, a few pre-expansion states (eg, Alabama) and post-expansion states (eg, Nevada) had consistently high OUD rates over time.

In [None]:
# Author: Alex

# Annual changes in OUD rates pre-expansion
supertitle = 'OUD Prevalence Pre-Medicaid Expansion'
rate_column = 'Scaled Row %'
facet_column = 'rpt_yr'
plots_per_row = 3
pre_heatmap = draw_heatmap(not_expanded_df, rate_column, facet_column, plots_per_row, supertitle)
pre_heatmap.show()

In [None]:
# Author: Alex

# Annual changes in OUD rates post-expansion
supertitle = 'OUD Prevalence Post-Medicaid Expansion'
rate_column = 'Scaled Row %'
facet_column = 'rpt_yr'
plots_per_row = 3
post_heatmap = draw_heatmap(expanded_df, rate_column, facet_column, plots_per_row, supertitle)
post_heatmap.show()
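
The regression slopes and paired t-test reported in the abstract are not shown in the cells above. One way such an analysis could be run, assuming SciPy's `linregress` and `ttest_rel` and using made-up numbers rather than our data, is sketched below.

```python
import numpy as np
from scipy.stats import linregress, ttest_rel

# Made-up yearly mean prevalence (%) for one hypothetical group of states
years = np.array([2016, 2017, 2018, 2019, 2022])
mean_rate = np.array([4.6, 4.2, 3.8, 3.6, 3.0])

# Ordinary least-squares fit of prevalence on report year
fit = linregress(years, mean_rate)
print(f'slope = {fit.slope:.2f} per year, R^2 = {fit.rvalue ** 2:.2f}')

# Made-up paired rates (%) for the same hypothetical states
# before and after expansion
pre = np.array([4.8, 4.1, 5.0, 3.9, 4.4, 4.6, 4.0, 4.3])
post = np.array([4.2, 3.9, 4.5, 3.8, 4.1, 4.0, 3.7, 4.1])

# Paired t-test on the within-state differences
t_stat, p_value = ttest_rel(pre, post)
print(f't = {t_stat:.2f}, p = {p_value:.3f}')
```

A paired test assumes that the within-state differences are roughly normally distributed, which is hard to verify with only a handful of pairs.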

## Conclusion

Although we did not have enough data to draw a statistically significant conclusion, our analysis suggests that there is a correlation between OUD prevalence and Medicaid expansion status, and that Medicaid expansion is associated with a faster decline in OUD rates. If this association holds up with more data, policymakers may be able to reduce the burden of OUD in the US by extending Medicaid expansion to states that have not yet adopted it.
