# **Final Project Proposal**

Our research question is: how has Medicaid expansion affected opioid misuse in the US? To answer it, we plan to investigate whether the prevalence of pain reliever misuse is correlated with a state's Medicaid expansion status, and whether the rate of opioid misuse decreased after states expanded their Medicaid programs. We picked this topic because, as professionals in public health and biology, we are well aware of the toll that substance use disorders have taken and of the potential that Medicaid expansion under the Affordable Care Act (ACA) has to improve access to treatment for these disorders.

We will obtain data on the prevalence of pain reliever misuse in each state from the [National Survey on Drug Use and Health (NSDUH)](https://datatools.samhsa.gov/nsduh/2019/nsduh-2018-2019-rd02yr/crosstab?row=PNRNMYR&column=STUSAB&weight=DASWT_1), published by the Substance Abuse and Mental Health Services Administration (SAMHSA), and the status of each state’s decision on Medicaid expansion from [KFF](https://www.kff.org/affordable-care-act/issue-brief/status-of-state-medicaid-expansion-decisions-interactive-map/), a health policy organization. We will use the pandas and matplotlib libraries for our data analysis. Below are our exploratory data analysis and some summary statistics:

______________
## Data Exploration
### NSDUH

Below we import the NSDUH dataset into a pandas DataFrame and explore the data.

In [1]:
# import data
import pandas as pd

path = "https://raw.githubusercontent.com/naomibuell/DATA607/refs/heads/main/STUSAB%20x%20PNRNMYR.csv"
df = pd.read_csv(path)

df.head()

Unnamed: 0,STATE US ABBREVIATION,RC-PAIN RELIEVERS - PAST YEAR MISUSE,Total %,Total % SE,Total % CI (lower),Total % CI (upper),Row %,Row % SE,Row % CI (lower),Row % CI (upper),Column %,Column % SE,Column % CI (lower),Column % CI (upper),Weighted Count,Count SE
0,Overall,Overall,1.0,0.0,,,1.0,0.0,,,1.0,0.0,,,274487000.0,1650000.0
1,Overall,0 - Did not misuse in the past year,0.964,0.0008,0.963,0.966,1.0,0.0,,,0.964,0.0008,0.963,0.966,264651000.0,1612000.0
2,Overall,1 - Misused within the past year,0.036,0.0008,0.034,0.037,1.0,0.0,,,0.036,0.0008,0.034,0.037,9836000.0,216000.0
3,AK,Overall,0.002,0.0001,0.002,0.002,0.002,0.0001,0.002,0.002,1.0,0.0,,,585000.0,20000.0
4,AK,0 - Did not misuse in the past year,0.002,0.0001,0.002,0.002,0.002,0.0001,0.002,0.002,0.958,0.0051,0.947,0.967,560000.0,18000.0


Below we print the list of columns, the number of rows, the number of non-missing observations in each column, and the data types.

In [2]:
# Info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 16 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   STATE US ABBREVIATION                 156 non-null    object 
 1   RC-PAIN RELIEVERS - PAST YEAR MISUSE  156 non-null    object 
 2   Total %                               156 non-null    float64
 3   Total % SE                            156 non-null    float64
 4   Total % CI (lower)                    155 non-null    float64
 5   Total % CI (upper)                    155 non-null    float64
 6   Row %                                 156 non-null    float64
 7   Row % SE                              156 non-null    float64
 8   Row % CI (lower)                      153 non-null    float64
 9   Row % CI (upper)                      153 non-null    float64
 10  Column %                              156 non-null    float64
 11  Column % SE        

Some columns of this dataset contain missing values, with up to 52 missing observations in a single column.
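
The missing-value counts can also be read off directly rather than inferred from the `info()` output. The following is a minimal sketch, not part of our original analysis: to stay runnable without a download, it rebuilds the five rows shown by `df.head()` above (trimmed to a few columns) in a hypothetical `sample` DataFrame. On the full dataset the same call would be `df.isna().sum()`.

```python
import io
import pandas as pd

# The five rows shown by df.head() above, trimmed to a few columns.
# `sample` is an illustrative stand-in for the full `df`.
sample_csv = """STATE US ABBREVIATION,RC-PAIN RELIEVERS - PAST YEAR MISUSE,Column %,Column % CI (lower),Column % CI (upper)
Overall,Overall,1.0,,
Overall,0 - Did not misuse in the past year,0.964,0.963,0.966
Overall,1 - Misused within the past year,0.036,0.034,0.037
AK,Overall,1.0,,
AK,0 - Did not misuse in the past year,0.958,0.947,0.967
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Number of missing values in each column
missing = sample.isna().sum()
print(missing)

# Largest number of missing values found in any single column
print(missing.max())
```

In this sample, the confidence interval columns are empty only in the rows where the misuse category is "Overall", which suggests the missing values in the full dataset come from those total rows rather than from gaps in the survey estimates.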

Below are the means of numeric columns.

In [3]:
# Means
df.mean(numeric_only=True)

Total %                2.564103e-02
Total % SE             4.698718e-04
Total % CI (lower)     1.844516e-02
Total % CI (upper)     2.028387e-02
Row %                  3.848718e-02
Row % SE               1.231410e-03
Row % CI (lower)       1.734641e-02
Row % CI (upper)       2.230719e-02
Column %               6.666667e-01
Column % SE            3.447436e-03
Column % CI (lower)    4.897115e-01
Column % CI (upper)    5.102885e-01
Weighted Count         7.038128e+06
Count SE               1.511923e+05
dtype: float64

Below are the medians of the numeric columns.

In [4]:
# Median
df.median(numeric_only=True)

Total %                5.500000e-03
Total % SE             3.000000e-04
Total % CI (lower)     5.000000e-03
Total % CI (upper)     6.000000e-03
Row %                  1.450000e-02
Row % SE               7.500000e-04
Row % CI (lower)       1.200000e-02
Row % CI (upper)       1.500000e-02
Column %               9.640000e-01
Column % SE            3.950000e-03
Column % CI (lower)    4.840000e-01
Column % CI (upper)    5.160000e-01
Weighted Count         1.513500e+06
Count SE               7.350000e+04
dtype: float64
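
The means and medians above are taken over every row, so they pool the "Overall" rows with the per-state rows and with both misuse categories. A possible next step, sketched here under assumptions, is to keep only the "1 - Misused within the past year" rows and read `Column %`, which appears to be the share within each state. The hypothetical `sample` DataFrame again holds only the five rows shown by `df.head()`, so the result contains just the national row. On the full `df` the same filter should return one row per state.

```python
import io
import pandas as pd

# The five rows shown by df.head() above, trimmed to a few columns.
# `sample` is an illustrative stand-in for the full `df`.
sample_csv = """STATE US ABBREVIATION,RC-PAIN RELIEVERS - PAST YEAR MISUSE,Column %,Column % CI (lower),Column % CI (upper)
Overall,Overall,1.0,,
Overall,0 - Did not misuse in the past year,0.964,0.963,0.966
Overall,1 - Misused within the past year,0.036,0.034,0.037
AK,Overall,1.0,,
AK,0 - Did not misuse in the past year,0.958,0.947,0.967
"""
sample = pd.read_csv(io.StringIO(sample_csv))

state_col = "STATE US ABBREVIATION"
misuse_col = "RC-PAIN RELIEVERS - PAST YEAR MISUSE"

# Keep only the rows for respondents who misused pain relievers in the past year
misused = sample[sample[misuse_col] == "1 - Misused within the past year"]

# Index by state so each entry is that state's past-year misuse share
rates = misused.set_index(state_col)["Column %"]
print(rates)
```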

### KFF

Below we import the KFF dataset and explore this data.  

In [None]:
# Import