<h3>Data Description</h3>
<p>
    The `admissions_processed_morphine_sulfate.csv` file combines the `PRESCRIPTIONS.csv`, `ADMISSIONS.csv` and `PATIENTS.csv` files from the MIMIC-III database and is processed as follows:
    <ul>
        <li> There are `6618` unique patients. A patient may have had multiple hospital stays, but we consider only the first one, because we wanted a first impression of each patient. </li>
        <li> These 6618 patients comprise four ethnicities: [WHITE, BLACK, ASIAN, HISPANIC] </li>
        <li> Only diagnoses shared by all four ethnic groups were kept. The distribution of these diagnoses within each group is shown in the other Jupyter notebook. </li>
        <li> Ages were calculated as the difference between date of birth and admission time (ADMITTIME). Ages that came out negative, a side effect of the date obfuscation applied for HIPAA compliance, were all set to 89. </li>
        <li> 122 covariates are considered: [age, HOSPITAL_EXPIRE_FLAG, DIAGNOSIS:%s (114 of them), hosp_duration, INSURANCE (5 types)] </li>
        <li> Only patients who were administered morphine sulfate were kept. For each of them we take the total amount administered during that single hospital stay, summed from the drug's FORM_VAL_DISP values (the `TOTAL_FORM_VAL_DISP_MAX` column). </li>
    </ul>
</p>
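The first-stay and age rules from the list above can be sketched on toy data as follows. This is a minimal illustration, not the actual preprocessing code: the rows are made up, age is taken as a coarse difference of calendar years, and treating any age outside 0 to 89 as date-shifted is an assumption (MIMIC-III shifts the birth dates of patients older than 89, so depending on how the difference is computed such ages can appear negative or implausibly large).

```python
import pandas as pd

# Hypothetical toy rows; the real MIMIC-III tables have many more columns.
adm = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2, 3],
    "ADMITTIME": pd.to_datetime(["2150-03-01", "2151-07-15", "2130-01-10", "2180-05-05"]),
    "DOB": pd.to_datetime(["2100-02-01", "2100-02-01", "2070-06-30", "1880-05-05"]),
})

# Keep only each patient's earliest admission.
first = adm.sort_values("ADMITTIME").groupby("SUBJECT_ID", as_index=False).first()

# Coarse age at admission from calendar years (avoids timedelta overflow on shifted dates).
age = first["ADMITTIME"].dt.year - first["DOB"].dt.year

# Assumption: ages outside 0..89 are date-shifted and are all set to 89.
first["age"] = age.where(age.between(0, 89), 89)
print(first[["SUBJECT_ID", "age"]])
```

Patient 1 keeps only the 2150 admission, and patient 3, whose birth date is shifted far into the past, ends up with age 89.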
<p>Covariates: the 122 features described above, e.g. age and the diagnosis indicators.</p>
<p>Treatment: membership in one ethnic group versus the rest, e.g. (WHITE vs [ASIAN, BLACK, HISPANIC]) or (BLACK vs [ASIAN, WHITE, HISPANIC]).</p>
<p>Outcome: the total amount of morphine sulfate the patient is administered.</p>


In [None]:
import pandas as pd
import numpy as np
import sys
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

In [2]:
df = pd.read_csv("./data/admissions_processed_morphine_sulfate.csv")

In [3]:
def df_to_X(df):
    
    # include age and hospital expire flag
    covariates = ['age', 'HOSPITAL_EXPIRE_FLAG']
    X = df[covariates]
    
    # include onehots for diagnosis
    diagnosis = pd.get_dummies(df.DIAGNOSIS)
    diagnosis.columns = ['DIAGNOSIS:%s' %d for d in diagnosis.columns]
    X = pd.concat([X, diagnosis], axis=1)
    
    # include duration of hospital stay, in whole days
    hosp_duration = (pd.to_datetime(df['DISCHTIME']) - pd.to_datetime(df['ADMITTIME'])).dt.days
    X['hosp_duration'] = hosp_duration
    
    # include onehots for insurance
    insur = pd.get_dummies(df.INSURANCE)
    insur.columns = ['INSURANCE:%s' %i for i in insur.columns]
    X = pd.concat([X, insur], axis=1)  

    
    # standardize duration (zero mean, unit variance) because it is non-categorical
    X['hosp_duration'] = (X['hosp_duration'] - X['hosp_duration'].mean()) / X['hosp_duration'].std()

    # standardize age for the same reason
    X['age'] = (X['age'] - X['age'].mean()) / X['age'].std()

    return X

def df_to_T(df, eth):
    # one-vs-rest treatment indicator: 1 if the patient belongs to `eth`, else 0
    return (df['ETHNICITY'] == eth).astype(int)

def df_to_Y(df):
    return df['TOTAL_FORM_VAL_DISP_MAX']
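
To make the one-vs-rest treatment encoding concrete, here is a small sketch on a made-up ethnicity column. The toy values and the `indicators` dictionary are illustrative assumptions. The comparison itself is the same one `df_to_T` performs.

```python
import pandas as pd

# Hypothetical toy column; the real values come from df['ETHNICITY'].
eth = pd.Series(["WHITE", "BLACK", "ASIAN", "WHITE", "HISPANIC"])

# One indicator per group: 1 for members of that group, 0 for everyone else.
groups = ["WHITE", "BLACK", "ASIAN", "HISPANIC"]
indicators = {g: (eth == g).astype(int) for g in groups}

print(indicators["WHITE"].tolist())
# Each patient is "treated" in exactly one of the four comparisons.
print(sum(indicators.values()).tolist())
```

Because the four groups partition the patients, the four analyses that follow reuse the same covariates and outcome and differ only in which indicator is passed as `T`.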

In [10]:
X = df_to_X(df)
T = df_to_T(df, 'WHITE')
Y = df_to_Y(df)
print('X: ', X.shape)
print("T: ", T.shape)
print("Y: ", Y.shape)

X:  (6618, 122)
T:  (6618,)
Y:  (6618,)


In [9]:
T = df_to_T(df, 'WHITE')

# propensity model P(T=1 | X), fit and scored on the same data
clf = LogisticRegression().fit(X, T)
e = clf.predict_proba(X)[:, 1]

treated = (T == 1).to_numpy()
y = Y.to_numpy()

# inverse-propensity-weighted mean outcome for the treated group
ans = np.sum(y[treated] / e[treated]) / len(df)
print("treated ATE with inverse propensity: ", ans)

# inverse-propensity-weighted mean outcome for the rest, weighted by 1 / P(T=0 | X)
ans2 = np.sum(y[~treated] / (1 - e[~treated])) / len(df)
print("no treated ATE with inverse propensity: ", ans2)

# the ATE estimate is the difference between the two weighted means
print("difference between treated and no treated: ", ans - ans2)



treated ATE with inverse propensity:  1.9243670273957083
no treated ATE with inverse propensity:  1.7621489194068773
difference between treated and no treated:  0.16221810798883096
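
Written out, the cell above computes the quantities below. The notation is introduced here for illustration and does not appear in the notebook: $n$ is the number of patients (6618), $T_i$ the group indicator, $Y_i$ the morphine sulfate total, and $\hat{e}(X_i)$ the fitted logistic-regression probability that $T_i = 1$.

```latex
\hat{\mu}_1 = \frac{1}{n} \sum_{i:\,T_i = 1} \frac{Y_i}{\hat{e}(X_i)}, \qquad
\hat{\mu}_0 = \frac{1}{n} \sum_{i:\,T_i = 0} \frac{Y_i}{1 - \hat{e}(X_i)}, \qquad
\widehat{\mathrm{ATE}} = \hat{\mu}_1 - \hat{\mu}_0
```

For WHITE vs the rest this gives 1.924 - 1.762 ≈ 0.162, the printed difference. The first two printed values are therefore weighted mean outcomes, and only their difference is an estimate of the average treatment effect.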


In [11]:
T = df_to_T(df, 'BLACK')

# propensity model P(T=1 | X), fit and scored on the same data
clf = LogisticRegression().fit(X, T)
e = clf.predict_proba(X)[:, 1]

treated = (T == 1).to_numpy()
y = Y.to_numpy()

# inverse-propensity-weighted mean outcome for the treated group
ans = np.sum(y[treated] / e[treated]) / len(df)
print("treated ATE with inverse propensity: ", ans)

# inverse-propensity-weighted mean outcome for the rest, weighted by 1 / P(T=0 | X)
ans2 = np.sum(y[~treated] / (1 - e[~treated])) / len(df)
print("no treated ATE with inverse propensity: ", ans2)

# the ATE estimate is the difference between the two weighted means
print("difference between treated and no treated: ", ans - ans2)



treated ATE with inverse propensity:  1.7163226794520907
no treated ATE with inverse propensity:  1.9228039884806178
difference between treated and no treated:  -0.20648130902852713


In [12]:
T = df_to_T(df, 'ASIAN')

# propensity model P(T=1 | X), fit and scored on the same data
clf = LogisticRegression().fit(X, T)
e = clf.predict_proba(X)[:, 1]

treated = (T == 1).to_numpy()
y = Y.to_numpy()

# inverse-propensity-weighted mean outcome for the treated group
ans = np.sum(y[treated] / e[treated]) / len(df)
print("treated ATE with inverse propensity: ", ans)

# inverse-propensity-weighted mean outcome for the rest, weighted by 1 / P(T=0 | X)
ans2 = np.sum(y[~treated] / (1 - e[~treated])) / len(df)
print("no treated ATE with inverse propensity: ", ans2)

# the ATE estimate is the difference between the two weighted means
print("difference between treated and no treated: ", ans - ans2)



treated ATE with inverse propensity:  1.7837247968392604
no treated ATE with inverse propensity:  1.9154913152470474
difference between treated and no treated:  -0.13176651840778697


In [13]:
T = df_to_T(df, 'HISPANIC')

# propensity model P(T=1 | X), fit and scored on the same data
clf = LogisticRegression().fit(X, T)
e = clf.predict_proba(X)[:, 1]

treated = (T == 1).to_numpy()
y = Y.to_numpy()

# inverse-propensity-weighted mean outcome for the treated group
ans = np.sum(y[treated] / e[treated]) / len(df)
print("treated ATE with inverse propensity: ", ans)

# inverse-propensity-weighted mean outcome for the rest, weighted by 1 / P(T=0 | X)
ans2 = np.sum(y[~treated] / (1 - e[~treated])) / len(df)
print("no treated ATE with inverse propensity: ", ans2)

# the ATE estimate is the difference between the two weighted means
print("difference between treated and no treated: ", ans - ans2)



treated ATE with inverse propensity:  1.5354823597319571
no treated ATE with inverse propensity:  1.9241439205935043
difference between treated and no treated:  -0.38866156086154713


In [18]:
X.columns

Index(['age', 'HOSPITAL_EXPIRE_FLAG', 'DIAGNOSIS:abdominal aortic aneurysm',
       'DIAGNOSIS:abdominal aortic aneurysm/sda', 'DIAGNOSIS:abdominal pain',
       'DIAGNOSIS:acute coronary syndrome',
       'DIAGNOSIS:acute myocardial infarction',
       'DIAGNOSIS:acute renal failure', 'DIAGNOSIS:acute subdural hematoma',
       'DIAGNOSIS:airway obstruction',
       ...
       'DIAGNOSIS:upper gastrointestinal bleed', 'DIAGNOSIS:upper gi bleed',
       'DIAGNOSIS:urosepsis', 'DIAGNOSIS:weakness', 'hosp_duration',
       'INSURANCE:Government', 'INSURANCE:Medicaid', 'INSURANCE:Medicare',
       'INSURANCE:Private', 'INSURANCE:Self Pay'],
      dtype='object', length=122)

In [21]:
len(df.INSURANCE.value_counts())

5
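
The count of 122 columns can be broken down by group with a few lines like the sketch below. The column list here is a shortened, made-up sample in the same naming style as `X.columns`, and `group_of` is a hypothetical helper.

```python
from collections import Counter

def group_of(col):
    # Text before the first ':' names the one-hot group; plain columns form their own group.
    return col.split(":", 1)[0]

# Hypothetical, shortened sample of column names.
cols = ["age", "HOSPITAL_EXPIRE_FLAG", "DIAGNOSIS:urosepsis", "DIAGNOSIS:weakness",
        "hosp_duration", "INSURANCE:Medicaid", "INSURANCE:Private"]

counts = Counter(group_of(c) for c in cols)
print(counts)
```

Applied to the real `X.columns`, the description above implies 114 `DIAGNOSIS` columns and 5 `INSURANCE` columns, which together with `age`, `HOSPITAL_EXPIRE_FLAG` and `hosp_duration` gives 1 + 1 + 114 + 1 + 5 = 122.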