In [1]:
import pandas as pd
import numpy as np
import datetime as dt
import seaborn as sns
import matplotlib.pyplot as plt
import requests
from sodapy import Socrata

## EVALUATING IMPACT OF DOJ MEASURES IN OPIOID INTERDICTION
To evaluate the efficacy of different interventions in the fight against the opioid epidemic, I am reviewing prescription and enrollment information for Kentucky Medicaid. The data is publicly available from data.medicaid.gov through the Socrata API; sodapy is the Python client library for Socrata.

I obtained 5 years of prescription data (2015-2019) and the corresponding enrollment. I also obtained a dataset of only the medications prescribed for substance use disorder treatment, including Naloxone (Narcan). 

I will review the data from each year 2015-2018 to create forecasts, then compare the forecasts to the actual prescription values in 2019. 

This comparison is the basis of my evaluation of the impact of US DOJ interventions in KY starting in October 2018.

### OBTAINING PRESCRIPTION DATA FROM MEDICAID

In [2]:
client = Socrata("data.medicaid.gov", None)
#code from the medicaid.gov website for using Socrata with Python

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.medicaid.gov",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.

rx_2015 = client.get("hse5-m4bk", state_code='KY', limit=250000)
rx_2016 = client.get("dpqa-tc6u", state_code='KY', limit=250000)
rx_2017 = client.get("qams-sami", state_code='KY', limit=250000)
rx_2018 = client.get("ddz4-5k5v", state_code='KY', limit=250000)
rx_2019 = client.get("8fjh-49cj", state_code='KY', limit=250000)
rx_2020 = client.get("yuuq-gv5v", state_code='KY', limit=250000)

rx_mat = client.get("3mnv-bath", state_code='KY', limit=10000)
mcd_enrollment = client.get("nkdi-f9a2", limit=100)


# Convert to pandas DataFrame
df_2020 = pd.DataFrame.from_records(rx_2020)
df_2019 = pd.DataFrame.from_records(rx_2019)
df_2018 = pd.DataFrame.from_records(rx_2018)
df_2017 = pd.DataFrame.from_records(rx_2017)
df_2016 = pd.DataFrame.from_records(rx_2016)
df_2015 = pd.DataFrame.from_records(rx_2015)
df_mat = pd.DataFrame.from_records(rx_mat)
df_enrollment = pd.DataFrame.from_records(mcd_enrollment)



Once all the data is obtained, check the shape of each year's data set, consolidate years 2015-2018 into one data frame, and remove all rows with nulls in the 'number_of_prescriptions' field.

In [3]:
#df_2015.shape, df_2016.shape, df_2017.shape, df_2018.shape, df_2019.shape, df_mat.shape, df_enrollment.shape
#shapes of all data


In [4]:
#df_2015.columns, df_2016.columns, df_2017.columns, df_2018.columns, df_2019.columns -check columns in each frame

Define the fields that are consistent across all prescription data and identify the features to use in the model.

In [5]:
keepers = ['quarter', 'state_code', 
        'product_code', 'package_size', '_quarter_begin', 'number_of_prescriptions',
        'product_fda_list_name', 'labeler_code', 'total_amount_reimbursed',
        'units_reimbursed', 'period_covered', 'ndc']
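
Before selecting these fields from every yearly frame, it can help to confirm that each frame actually contains them. The following is a minimal sketch on made-up stand-in frames; the `frames` dictionary, the `wanted` list, and the helper `missing_columns` are hypothetical names for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for two yearly frames; the real ones come from the API.
frames = {
    "2015": pd.DataFrame(columns=["quarter", "state_code", "ndc", "extra_2015"]),
    "2016": pd.DataFrame(columns=["quarter", "state_code", "ndc"]),
}
wanted = ["quarter", "state_code", "ndc"]


def missing_columns(frames, wanted):
    """Map each frame label to the wanted columns that frame lacks."""
    return {
        label: [col for col in wanted if col not in df.columns]
        for label, df in frames.items()
    }


missing = missing_columns(frames, wanted)

# Columns present in every frame, as a cross-check on the hand-written list.
shared = set.intersection(*(set(df.columns) for df in frames.values()))
```

An empty list for every label means the selection `df[keepers]` should not raise a KeyError for that frame.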

Use pd.concat to create one data frame for years 2015-2018, and drop any rows with NaN in number_of_prescriptions.

In [6]:
rx_df = pd.concat([df_2015[keepers],df_2016[keepers], df_2017[keepers], df_2018[keepers]] )

In [7]:
rx_df.dropna(subset = ['number_of_prescriptions'], inplace =True)

In [8]:
rx_df.shape

(203597, 12)

### OBTAINING PRESCRIPTION DRUG INFORMATION FROM FDA

To limit the prescription data to opioid drugs, I'm using the openFDA API to query the FDA NDC drug database. This database is updated daily with the latest information from the FDA about all products with an NDC (National Drug Code) identifier. The API documentation indicates that a single query returns at most 1000 rows, and according to the metadata there are 1751 opioid drugs in the database, so two pulls of up to 1000 rows each (the second one skipping the first 1000) should retrieve all of them.
Documentation for the API is here: https://open.fda.gov/apis/

In [9]:
data1 = requests.get('https://api.fda.gov/drug/ndc.json?search=pharm_class:"Opioid"&limit=1000').json()
data2 = requests.get('https://api.fda.gov/drug/ndc.json?search=pharm_class:"Opioid"&limit=1000&skip=1000').json()
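
The two requests above hard-code the page boundaries. As a rough sketch of how the `skip`/`limit` pairs could be derived from a record count instead, the helpers below build the page URLs without making any network calls. `page_params` and `page_urls` are hypothetical helper names, and the total of 1751 is the figure quoted above; as far as I know the live total is reported in the response metadata, which should be checked before relying on it.

```python
# Base query used above; only the paging parameters vary between requests.
BASE = 'https://api.fda.gov/drug/ndc.json?search=pharm_class:"Opioid"'


def page_params(total, page_size=1000):
    """Return (skip, limit) pairs that together cover `total` records."""
    return [
        (skip, min(page_size, total - skip))
        for skip in range(0, total, page_size)
    ]


def page_urls(total, page_size=1000):
    """Build one request URL per page; no request is sent here."""
    return [
        f"{BASE}&limit={limit}&skip={skip}"
        for skip, limit in page_params(total, page_size)
    ]


pages = page_params(1751)
urls = page_urls(1751)
```

Each URL could then be passed to `requests.get(...).json()` and the `results` lists concatenated, as the next cell does for the two fixed pages.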


In [10]:
#code assistance from James
#concatenate the 2 results JSON streams and put in a dataframe

inner = data1['results'] + data2['results']
fda_df = pd.DataFrame.from_records(inner)
fda_df['route'] = fda_df['route'].str[0]

#define the fields I wish to keep in my data frame
fda_df = fda_df[['product_ndc','generic_name','dea_schedule','brand_name','active_ingredients','route','pharm_class']]


Split the list into 4 smaller dataframes, one per DEA schedule (CII, CIII, CIV, and CV). Most, but not all, partial opioid agonists used in medication-assisted substance abuse treatment are classified as Schedule III prescription medications.
This split may be useful if I decide to use the DEA schedule as a feature or to summarize total prescriptions by schedule. It is also a quick way to see which medications fall in which schedule using value_counts.

In [11]:
CII = fda_df.loc[fda_df['dea_schedule']== 'CII']
CIII = fda_df.loc[fda_df['dea_schedule'] == 'CIII']
CIV = fda_df.loc[fda_df['dea_schedule'] == 'CIV']
CV = fda_df.loc[fda_df['dea_schedule'] == 'CV']
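
As an alternative sketch, the same split can be produced in one pass with `groupby`, and `value_counts` gives the per-schedule totals. The rows below are made-up toy data (the `toy` frame and the `drug_*` names are placeholders), not records from the FDA directory.

```python
import pandas as pd

# Toy stand-in for fda_df; names and schedules are illustrative only.
toy = pd.DataFrame({
    "generic_name": ["drug_a", "drug_b", "drug_c", "drug_d", "drug_e"],
    "dea_schedule": ["CII", "CIII", "CIV", "CV", "CII"],
})

# Number of products listed under each schedule.
counts = toy["dea_schedule"].value_counts()

# One sub-frame per schedule, keyed by the schedule label.
by_schedule = {sched: grp for sched, grp in toy.groupby("dea_schedule")}
```

With this layout, `by_schedule["CIII"]` plays the role of the `CIII` frame above, and a new schedule value in the data would get its own entry without extra code.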


### DATA CLEANING AND TRANSFORMATIONS

Only methadone, butorphanol tartrate, and buprenorphine-containing medications are acceptable forms of medication-assisted treatment (MAT) for substance use disorder covered by Medicaid in KY (there are other non-opioid medications like Wellbutrin approved for substance abuse treatment, but I've limited my dataframe to only opioid medications). Create a dummy column for MAT: '1' if the first 10 characters of the lowercase name match methadone, butorphanol, or buprenorphine; '0' if not.

Likewise, create a dummy column for Opioid: '1' if the lowercase product name matches one of the base opioid names. I started by trying to match NDC codes, but discovered that the current directory does not include recently retired NDCs. Out of 203k transaction records, 6k registered as opioids when I matched on NDCs. With the dummy column based on name, I also capture opioid drugs historically prescribed under a now-expired NDC.

For simplicity, I created a list of root drug names from the FDA brand_name field (first word only, limited to the first 10 characters), so I can match with the product name in the Medicaid RX Claims dataframe.

I used this example from StackOverflow as my basis - https://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns-apply-a-function-o?rq=1 - with assistance from Alex in getting my syntax correct.

In [12]:
#Create list of opioid drugs using the FDA info. Truncate names to 10 characters. Use Brand Name, to cover all 
#label names that may appear on the Medicaid RX. Removed references to acetaminophen, promethazine, and carisoprodol 
#which are non-opioids commonly found in opioid-containing pain medications and cough syrup. 

notopioid = ['promethazi','acetaminop', 'tylenol', 'carisoprod']
drug = fda_df['brand_name'].str.split(' ').str[0].str[:10].str.lower()
drugs = list(drug.drop_duplicates())
drugs = [x for x in drugs if x not in notopioid] 

In [13]:
def mat(value):
    """Function that accepts a value, compares the first 10 characters of the string to the names of
    drugs recognized for medication-assisted treatment of substance use disorder. Returns 1 if match,
    0 if no match"""
    if value[:10].lower() in ['buprenorph','methadone','butorphano']:
        return 1
    else:
        return 0

In [14]:
def opioid(value):
    """Function that accepts a value and compares the lowercased string to the list of truncated
    (first word, max 10 characters) drug names classified as opioids by the FDA. Returns 1 if match, 0 if no match"""
    if value.lower() in drugs:
        return 1
    else:
        return 0

In [15]:
rx_df['mat'] = rx_df['product_fda_list_name'].apply(mat)
fda_df['mat'] = fda_df['generic_name'].apply(mat)

In [16]:
rx_df['opioid'] = rx_df['product_fda_list_name'].apply(opioid)
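
For reference, here is a vectorized sketch of the MAT flag on made-up product names. The first flag mirrors the `mat` function (exact match on the first 10 lowercase characters); the second is a looser variant of my own, not the notebook's method, that flags any name starting with one of the prefixes, which also catches names carrying a suffix such as "METHADONE H". The `toy` frame and the column names `mat_exact` and `mat_starts` are hypothetical.

```python
import pandas as pd

mat_prefixes = ["buprenorph", "methadone", "butorphano"]

# Made-up product names in the 10-character style of the Medicaid field.
toy = pd.DataFrame({
    "product_fda_list_name": ["BUPRENORPH", "METHADONE", "METHADONE H", "OXYCODONE"],
})

names = toy["product_fda_list_name"].str.lower()

# Same rule as mat(): the 10-character prefix must equal a listed name exactly.
toy["mat_exact"] = names.str[:10].isin(mat_prefixes).astype(int)

# Looser assumption: the name only has to start with a listed prefix.
pattern = "^(?:" + "|".join(mat_prefixes) + ")"
toy["mat_starts"] = names.str.contains(pattern, regex=True).astype(int)
```

Comparing the two columns on the real data would show how many rows the exact-match rule leaves unflagged; whether the looser rule is appropriate depends on what other product names share those prefixes.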

In [17]:
rx_df['opioid'].value_counts()

0    198610
1      4987
Name: opioid, dtype: int64

In [18]:
rx_df['mat'].value_counts()

0    203131
1       466
Name: mat, dtype: int64

Medicaid pads their NDC numbers. For example, NDC 0406-0540 is 004060540, and 76420-127 is 764200127. In other words, the digits to the left of the hyphen are zero-padded to a 5-digit 'labeler code' and the digits to the right of the hyphen are zero-padded to a 4-digit 'product code'.
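
The padding rule can be checked on the two examples with a small standalone function. This is only a sketch of the rule as described; `pad_product_ndc` is a hypothetical helper name and the notebook itself applies the same steps with pandas string methods.

```python
def pad_product_ndc(product_ndc):
    """Convert a 'labeler-product' NDC into the padded 9-digit form."""
    labeler, product = product_ndc.split("-")
    # Left-pad the labeler to 5 digits and the product code to 4 digits.
    return labeler.zfill(5) + product.zfill(4)


padded_a = pad_product_ndc("0406-0540")
padded_b = pad_product_ndc("76420-127")
```

Both results are 9 characters long, which is why the Medicaid `ndc` field is later cut to its first 9 characters before merging.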

In [19]:
##Use string methods to transform the NDC codes to have the same format. NDC is the key for merging the RX data 
## with the drug data from the FDA

fda_df['label_code']=fda_df['product_ndc'].str.split('-').str[0].str.zfill(5).astype("string")

fda_df['prod_code']=fda_df['product_ndc'].str.split('-').str[1].str.zfill(4).astype("string")

fda_df['ndc'] = fda_df['label_code']+fda_df['prod_code']

fda_df.drop_duplicates(subset = ['product_ndc'], inplace = True) #drop duplicate NDC listings

In [20]:
#Use only the first 9 characters of the ndc in the Medicaid data
rx_df['ndc'] = rx_df['ndc'].str[:9]

Create a smaller data frame of just 'ndc', 'mat', 'dea_schedule', and 'generic_name' from fda_df, then use a left merge to join it to rx_df on 'ndc'. Because opioid_df contains only opioids, a non-null 'mat' value from the FDA side indicates that the NDC in rx_df is an opioid, and the value itself indicates whether the drug is used for medication-assisted treatment.

In [21]:
opioid_df = fda_df[['ndc','mat','dea_schedule','generic_name']]


In [22]:
merged_df = rx_df.merge(opioid_df, how = "left", on = 'ndc')
print(merged_df['opioid'].value_counts())
print(merged_df['mat_x'].value_counts())     

0    198610
1      4987
Name: opioid, dtype: int64
0    203131
1       466
Name: mat_x, dtype: int64


Comparing label names misses some drugs that the NDC list identifies as opioids. To capture them, look for rows with '0' for 'opioid' (derived from the Medicaid RX drug records) that also have a non-null 'mat_y', that is, '0' or '1' (derived from the FDA records). Those rows matched an FDA opioid NDC, so they also need to be flagged as opioids. Apply a similar approach to the 'mat_x' column to incorporate the MAT drugs identified in the FDA list that did not match on name in the prescription list.

In [23]:
merged_df.loc[(merged_df['opioid'] == 0) & merged_df['mat_y'].notna(), 'opioid'] = 1
merged_df.loc[(merged_df['mat_x'] == 0) & (merged_df['mat_y'] == 1), 'mat_x'] = 1
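
To make the backfill logic easier to follow, here is a minimal sketch on made-up rows. The frames `rx` and `fda` and their NDC values are hypothetical; `validate="many_to_one"` is an optional guard I added so that the merge raises an error if an NDC appears more than once on the FDA side, which would otherwise duplicate prescription rows.

```python
import pandas as pd

# Toy prescription rows: only the first was flagged as an opioid by name.
rx = pd.DataFrame({
    "ndc": ["000000001", "000000002", "000000003"],
    "opioid": [1, 0, 0],
    "mat": [0, 0, 0],
})

# Toy FDA opioid list: one NDC, marked as a MAT drug.
fda = pd.DataFrame({"ndc": ["000000002"], "mat": [1]})

# Both frames have a 'mat' column, so pandas suffixes them as mat_x / mat_y.
merged = rx.merge(fda, how="left", on="ndc", validate="many_to_one")

# A non-null mat_y means the NDC was found in the FDA opioid list.
merged.loc[merged["mat_y"].notna(), "opioid"] = 1
merged.loc[merged["mat_y"] == 1, "mat_x"] = 1
```

Row 2 ends up flagged as both an opioid and a MAT drug, while row 3, which matched neither by name nor by NDC, stays at 0.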

In [24]:
print(merged_df['opioid'].value_counts())
print(merged_df['mat_x'].value_counts())     

0    196230
1      7367
Name: opioid, dtype: int64
0    202936
1       661
Name: mat_x, dtype: int64


While the number of prescriptions is tempting to use as the target variable, this dataset gives no indication of the number of doses dispensed per prescription. A better target is "units_reimbursed", as this accounts for each pill, injection, or liquid dose paid for by Medicaid.
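
A small made-up example shows why the two measures can diverge: the same prescription count can correspond to very different unit volumes. The sketch assumes the numeric fields arrive as strings from the API and converts them first; the `toy` frame, its `qtr` column, and its values are hypothetical.

```python
import pandas as pd

# Toy rows: two products in the same quarter with equal prescription counts.
toy = pd.DataFrame({
    "qtr": ["2015-04-01", "2015-04-01", "2015-07-01"],
    "number_of_prescriptions": ["10", "10", "5"],
    "units_reimbursed": ["300", "900.5", "150"],
})

# Assumption: values are strings, so convert before summing.
for col in ["number_of_prescriptions", "units_reimbursed"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Quarterly totals of both candidate targets.
per_quarter = toy.groupby("qtr")[["number_of_prescriptions", "units_reimbursed"]].sum()
```

In the first toy quarter both products have 10 prescriptions, yet one accounts for three times the units of the other, which is the information the prescription count hides.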

In [26]:
#select only prescriptions of opioid medications
#drop fields that are not necessary for modeling

merged_df = merged_df[['_quarter_begin','period_covered', 'dea_schedule',
                       'product_fda_list_name','total_amount_reimbursed',
                       'units_reimbursed','mat_x','opioid']]
merged_df = merged_df.loc[merged_df['opioid']== 1].reset_index(drop = True)

In [27]:
#concatenate _quarter_begin and period_covered, convert to datetime
merged_df["qtr_begin_dt"] = pd.to_datetime(merged_df["_quarter_begin"] + "/" + merged_df["period_covered"])
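
The sketch below illustrates the date construction on made-up values. It assumes, and this is only my guess at the layout, that `_quarter_begin` holds a month/day string such as "7/1" and `period_covered` holds the year; passing an explicit `format` makes the month-first interpretation unambiguous.

```python
import pandas as pd

# Hypothetical values; the real field layout should be checked in the data.
toy = pd.DataFrame({
    "_quarter_begin": ["4/1", "7/1"],
    "period_covered": ["2015", "2015"],
})

# Build "M/D/YYYY" strings, then parse them with an explicit format.
combined = toy["_quarter_begin"] + "/" + toy["period_covered"]
toy["qtr_begin_dt"] = pd.to_datetime(combined, format="%m/%d/%Y")
```

If the real fields use a different layout, the format string would need to change accordingly.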

In [29]:
#drop unnecessary columns
merged_df = merged_df[['qtr_begin_dt','mat_x','opioid','dea_schedule',
                      'product_fda_list_name','total_amount_reimbursed','units_reimbursed']]

In [30]:
merged_df.head()

   qtr_begin_dt  mat_x  opioid  dea_schedule  product_fda_list_name  total_amount_reimbursed  units_reimbursed
0    2015-07-01      0       1           CII              OXYCODONE                  3534.01              6415
1    2015-04-01      0       1           CII             HYDROMORPH                  9513.74             10502
2    2015-04-01      0       1           CIV             TRAMADOL H                  3147.76             22643
3    2015-10-01      0       1           CII               PERCOCET                   627.58               326
4    2015-10-01      0       1           CII             OXYCODONE-                 21844.56            114101


In [31]:
%store merged_df

Stored 'merged_df' (DataFrame)
