<h3>Data Description</h3>
<p>
    The `admissions_processed_morphine_sulfate.csv` file combines the `PRESCRIPTIONS.csv`, `ADMISSIONS.csv`, and `PATIENTS.csv` files from the MIMIC-III database and was processed as follows:
    <ul>
        <li> There are `6618` unique patients. A patient may have had multiple hospital stays, but we considered only the first one, because we wanted a first impression of each patient. </li>
        <li> These 6618 patients comprise four ethnicities: [WHITE, BLACK, ASIAN, HISPANIC] </li>
        <li> The diagnoses selected for consideration were only those shared by all four ethnic groups. The distribution of these diagnoses within each group is shown in the other Jupyter notebook. </li>
        <li> Ages were calculated as the difference between birthdate and admittime. Ages that came out negative because of HIPAA-compliance adjustments to the dates were all reset to 89. </li>
        <li> 122 covariates are considered: [age, HOSPITAL_EXPIRE_FLAG, DIAGNOSIS:%s (114 of them), hosp_duration, INSURANCE (5 types)] </li>
        <li> Only patients who were administered morphine sulfate were considered. For each such patient, we took the total amount administered over their single hospital stay, computed from the drug's form values in the prescriptions data and stored as `TOTAL_FORM_VAL_DISP_MAX`. </li>
    </ul>
</p>
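
The first-stay and age rules in the list above can be illustrated with a minimal sketch. The toy table, its values, and the choice to compute age from calendar-year components are assumptions made for illustration; they are not the preprocessing code that produced the CSV. Any age outside the 0 to 89 range is set to 89 here.

```python
import pandas as pd

# Toy stand-in for a merged ADMISSIONS/PATIENTS table (all values are made up).
adm = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2, 3],
    "HADM_ID": [100, 101, 200, 300],
    "ADMITTIME": pd.to_datetime(["2150-03-01", "2149-01-15", "2160-07-04", "2170-05-05"]),
    "DOB": pd.to_datetime(["2100-01-01", "2100-01-01", "2120-06-30", "1870-05-05"]),
})

# Keep only each patient's earliest admission.
first_stay = (
    adm.sort_values("ADMITTIME")
       .drop_duplicates("SUBJECT_ID", keep="first")
       .sort_values("SUBJECT_ID")
       .reset_index(drop=True)
)

# Approximate age in whole years from the calendar-year components
# (an assumption of this sketch), then reset implausible ages to 89.
age = first_stay["ADMITTIME"].dt.year - first_stay["DOB"].dt.year
first_stay["age"] = age.where(age.between(0, 89), 89)

print(first_stay[["SUBJECT_ID", "HADM_ID", "age"]])
```

In this toy table, patient 1 has two admissions and only the earlier one (`HADM_ID` 101) is kept, while patient 3 has a shifted birthdate and ends up with an age of 89.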
<p>The 122 covariates are described above, e.g. age and the different diagnosis types.</p>
<p>The treatment compares one ethnic group against the rest, e.g. (WHITE vs [ASIAN, BLACK, HISPANIC]) or (BLACK vs [ASIAN, WHITE, HISPANIC]).</p>
<p>The output is the amount of morphine sulfate the patient was administered.</p>


In [1]:
import pandas as pd
import numpy as np
import sys
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

In [2]:
df = pd.read_csv("./data/admissions_processed_morphine_sulfate.csv")

In [3]:
df.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DEATHTIME,ADMISSION_TYPE,ADMISSION_LOCATION,DISCHARGE_LOCATION,INSURANCE,...,MARITAL_STATUS,ETHNICITY,EDREGTIME,EDOUTTIME,DIAGNOSIS,HOSPITAL_EXPIRE_FLAG,HAS_CHARTEVENTS_DATA,age,TOTAL_FORM_VAL_DISP_MAX,drug
0,10,11,194540,2178-04-16 06:18:00,2178-05-11 19:00:00,,EMERGENCY,EMERGENCY ROOM ADMIT,HOME HEALTH CARE,Private,...,MARRIED,WHITE,2178-04-15 20:46:00,2178-04-16 06:53:00,brain mass,0,1,50,1.25,Morphine Sulfate
1,12,13,143045,2167-01-08 18:43:00,2167-01-15 15:15:00,,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,HOME HEALTH CARE,Medicaid,...,,WHITE,,,coronary artery disease,0,1,39,2.0,Morphine Sulfate
2,18,20,157681,2183-04-28 09:45:00,2183-05-03 14:45:00,,ELECTIVE,PHYS REFERRAL/NORMAL DELI,HOME,Medicare,...,WIDOWED,WHITE,,,coronary artery disease\coronary artery bypass...,0,1,75,3.0,Morphine Sulfate
3,19,21,109451,2134-09-11 12:17:00,2134-09-24 16:15:00,,EMERGENCY,EMERGENCY ROOM ADMIT,REHAB/DISTINCT PART HOSP,Medicare,...,MARRIED,WHITE,2134-09-11 09:22:00,2134-09-11 22:30:00,congestive heart failure,0,1,87,2.0,Morphine Sulfate
4,22,23,152223,2153-09-03 07:15:00,2153-09-08 19:10:00,,ELECTIVE,PHYS REFERRAL/NORMAL DELI,HOME HEALTH CARE,Medicare,...,MARRIED,WHITE,,,coronary artery disease\coronary artery bypass...,0,1,71,0.4,Morphine Sulfate


In [4]:
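def df_to_X(df):
    """Build the covariate matrix: age, expire flag, diagnosis and insurance one-hots, stay duration."""

    # include age and hospital expire flag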
    covariates = ['age', 'HOSPITAL_EXPIRE_FLAG']
    X = df[covariates]
    
    # include onehots for diagnosis
    diagnosis = pd.get_dummies(df.DIAGNOSIS)
    diagnosis.columns = ['DIAGNOSIS:%s' %d for d in diagnosis.columns]
    X = pd.concat([X, diagnosis], axis=1)
    
    # include duration of hospital stay (in whole days)
    hosp_duration = (df['DISCHTIME'].astype('datetime64[ns]') - df['ADMITTIME'].astype('datetime64[ns]')).dt.days
    X['hosp_duration'] = hosp_duration
    
    # include onehots for insurance
    insur = pd.get_dummies(df.INSURANCE)
    insur.columns = ['INSURANCE:%s' %i for i in insur.columns]
    X = pd.concat([X, insur], axis=1)  

    
    # standardize duration (z-score) because it is non-categorical
    d_mu = X['hosp_duration'].mean()
    d_std = X['hosp_duration'].std()
    X['hosp_duration'] = (X['hosp_duration'] - d_mu) / d_std

    # standardize age (z-score) because it is non-categorical
    age_mu = X['age'].mean()
    age_std = X['age'].std()
    X['age'] = (X['age'] - age_mu) / age_std

    return X

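def df_to_T(df, eth):
    # binary treatment indicator: 1 if the patient's ethnicity is `eth`, else 0
    return (df['ETHNICITY'] == eth).astype(int)

def df_to_Y(df):
    # outcome: total amount of morphine sulfate administered during the stay
    return df['TOTAL_FORM_VAL_DISP_MAX']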

In [5]:
X = df_to_X(df)
T = df_to_T(df, 'WHITE')
Y = df_to_Y(df)
print('X: ', X.shape)
print("T: ", T.shape)
print("Y: ", Y.shape)

X:  (6618, 122)
T:  (6618,)
Y:  (6618,)


### Computing ATE with Check for Balance using GBM

In [7]:
from sklearn.ensemble import GradientBoostingClassifier as GradBoost

In [41]:
def compute_ate_for_treatment_pair(t1, t2):
    T1 = df_to_T(df, t1)
    T2 = df_to_T(df, t2)

    clf1 = GradBoost(n_estimators=500).fit(X, T1)
    clf2 = GradBoost(n_estimators=500).fit(X, T2)
    
    # Two ways to normalize the inverse-propensity-weighted sum of outcomes:
    #   class: (sum_{treated} weight * outcome) / n, where n is the size of the dataset
    #   paper: (sum_{treated} weight * outcome) / (sum_{treated} weight)
    # We use the class formula here, though the paper's formula might give better results.
    # Checked: the class formula gives an ATE of about 0.6, the paper's formula only 0.02.
    # The weights below carry a factor len(X1) / len(X), so dividing by len(X1)
    # is the same as dividing sum(outcome / propensity) by n.
    treated1 = (T1 == 1)
    X1 = X[treated1]
    prop_weights1 = (len(X1) / len(X)) * np.reciprocal(clf1.predict_proba(X1)[:,1])
    normalizer1 = len(X1) # paper's formula: np.sum(prop_weights1)
    weighted_mean1 = sum(np.multiply(Y[treated1], prop_weights1)) / normalizer1

    treated2 = (T2 == 1)
    X2 = X[treated2]
    prop_weights2 = (len(X2) / len(X)) * np.reciprocal(clf2.predict_proba(X2)[:,1])
    normalizer2 = len(X2) # paper's formula: np.sum(prop_weights2)
    weighted_mean2 = sum(np.multiply(Y[treated2], prop_weights2)) / normalizer2

    print('weighted mean for treatment 1: {}'.format(weighted_mean1))
    print('weighted mean for treatment 2: {}'.format(weighted_mean2))
    print('ATE: {}'.format(weighted_mean1 - weighted_mean2))
    
    # compute unweighted mean and standard deviation of each covariate for the pooled sample across all treatments
    population_covariate_means = np.array(X.mean(axis=0))
    population_covariate_stds = np.array(X.std(axis=0))
    
    # compare the population that got treatment 1 after weighting to the unweighted full population
    covariates1 = np.array(X1)
    weights1 = prop_weights1.reshape((len(prop_weights1), 1))
    weighted_covariates1 = np.multiply(covariates1, weights1)
    covariate_means1 = np.array(weighted_covariates1.mean(axis=0))

    PSB1 = np.divide(np.abs(covariate_means1 - population_covariate_means), population_covariate_stds)
    bad_covariates1 = []
    for i in range(len(PSB1)):
        if PSB1[i] > 0.2:
            bad_covariates1.append((i, PSB1[i]))
    
    print("Covariates for Group 1 with Standard Bias > 0.2: {}".format(bad_covariates1))
    
    covariates2 = np.array(X2)
    weights2 = prop_weights2.reshape((len(prop_weights2), 1))
    weighted_covariates2 = np.multiply(covariates2, weights2)
    covariate_means2 = np.array(weighted_covariates2.mean(axis=0))
    
    PSB2 = np.divide(np.abs(covariate_means2 - population_covariate_means), population_covariate_stds)
    bad_covariates2 = []
    for i in range(len(PSB2)):
        if PSB2[i] > 0.2:
            bad_covariates2.append((i, PSB2[i]))
    
    print("Covariates for Group 2 with Standard Bias > 0.2: {}".format(bad_covariates2))
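
The comments in the function above contrast two normalizations of the inverse-propensity-weighted outcome sum. These are commonly called the Horvitz-Thompson estimator (divide by n) and the Hájek, or self-normalized, estimator (divide by the sum of the weights). The following sketch works through both on a handful of made-up numbers. The outcomes, treatment flags, and propensity scores are hypothetical and are not drawn from the MIMIC-III data.

```python
import numpy as np

# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
y = np.array([2.0, 4.0, 6.0, 1.0, 3.0])   # outcomes for all n = 5 patients
t = np.array([1, 1, 1, 0, 0])             # 1 = belongs to the treated group
p = np.array([0.5, 0.8, 0.4, 0.5, 0.2])   # assumed propensity scores P(T = 1 | X)

treated = t == 1
n, n_treated = len(y), int(treated.sum())
inv_p = 1.0 / p[treated]                  # inverse-propensity weights: 2, 1.25, 2.5

# Class formula: divide the weighted outcome sum by the full sample size n.
class_mean = np.sum(inv_p * y[treated]) / n

# Paper formula: divide by the sum of the weights instead.
paper_mean = np.sum(inv_p * y[treated]) / np.sum(inv_p)

# The notebook scales the weights by n_treated / n and divides by n_treated,
# which is algebraically the same as the class formula.
scaled_weights = (n_treated / n) * inv_p
notebook_mean = np.sum(scaled_weights * y[treated]) / n_treated

print(class_mean, paper_mean, notebook_mean)
```

The weighted outcome sum here is 24, so the class formula gives 24 / 5 and the paper formula gives 24 / 5.75. The two agree only when the inverse-propensity weights happen to sum to n, which is one reason the two approaches can produce noticeably different ATEs on real data.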

In [43]:
def compute_ate_for_all_treatment_pairs(treatments):
    # compute unweighted mean and standard deviation of each covariate for the pooled sample across all treatments
    population_covariate_means = np.array(X.mean(axis=0))
    population_covariate_stds = np.array(X.std(axis=0))
    
    all_means = {}
    
    for t in treatments:
        # compute the weighted mean outcome for this treated subpopulation
        T = df_to_T(df, t)
        clf = GradBoost(n_estimators=500).fit(X, T)
        
        treated = (T == 1)
        X_treated = X[treated]
        prop_weights = (len(X_treated) / len(X)) * np.reciprocal(clf.predict_proba(X_treated)[:,1])
        normalizer = len(X_treated) # paper's formula: np.sum(prop_weights)
        weighted_mean = sum(np.multiply(Y[treated], prop_weights)) / normalizer

        print('weighted mean for treatment {}: {}'.format(t, weighted_mean))
        all_means[t] = weighted_mean
    
        # compare the population that got this treatment after weighting to the unweighted full population
        covariates = np.array(X_treated)
        weights = prop_weights.reshape((len(prop_weights), 1))
        weighted_covariates = np.multiply(covariates, weights)
        covariate_means = np.array(weighted_covariates.mean(axis=0))

        PSB = np.divide(np.abs(covariate_means - population_covariate_means), population_covariate_stds)
        bad_covariates = []
        for i in range(len(PSB)):
            if PSB[i] > 0.2:
                bad_covariates.append((i, PSB[i]))

        print("Covariates for {} with Standard Bias > 0.2: {}".format(t, bad_covariates))
    
    # compute all pairwise ATEs
    for i in range(len(treatments)):
        for j in range(i + 1, len(treatments)):
            t1 = treatments[i]
            t2 = treatments[j]
            print('ATE {} - {}: {}'.format(t1, t2, all_means[t1] - all_means[t2]))

In [42]:
# for our first ate, let's compute the average treatment effect of white vs black
t1 = 'WHITE'
t2 = 'BLACK'

compute_ate_for_treatment_pair(t1, t2)

weighted mean for treatment 1: 1.882741468900802
weighted mean for treatment 2: 1.2359359246622876
ATE: 0.6468055442385143
Covariates for Group 1 with Standard Bias > 0.2: []
Covariates for Group 2 with Standard Bias > 0.2: [(119, 0.3716897646245216), (120, 0.2569842745485764)]


In [44]:
compute_ate_for_all_treatment_pairs(['WHITE', 'BLACK', 'HISPANIC', 'ASIAN'])

weighted mean for treatment WHITE: 1.882741468900802
Covariates for WHITE with Standard Bias > 0.2: []
weighted mean for treatment BLACK: 1.2245252087272338
Covariates for BLACK with Standard Bias > 0.2: [(119, 0.3803623025755178), (120, 0.26393998484133774)]
weighted mean for treatment HISPANIC: 0.9775012519644581
Covariates for HISPANIC with Standard Bias > 0.2: [(1, 0.3519019067740821), (119, 0.5408790938493004), (120, 0.39187445702862267)]
weighted mean for treatment ASIAN: 0.9314788026914298
Covariates for ASIAN with Standard Bias > 0.2: [(1, 0.25976600921747234), (119, 0.611231660362579), (120, 0.3367548537278799)]
ATE WHITE - WHITE: 0.0
ATE WHITE - BLACK: 0.6582162601735682
ATE WHITE - HISPANIC: 0.9052402169363438
ATE WHITE - ASIAN: 0.9512626662093722
ATE BLACK - BLACK: 0.0
ATE BLACK - HISPANIC: 0.24702395676277566
ATE BLACK - ASIAN: 0.293046406035804
ATE HISPANIC - HISPANIC: 0.0
ATE HISPANIC - ASIAN: 0.04602244927302834
ATE ASIAN - ASIAN: 0.0
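
The balance checks report unbalanced covariates by column index, which is hard to read. Given the column order built in `df_to_X`, index 1 is `HOSPITAL_EXPIRE_FLAG`, and with 114 diagnosis columns, indices 117 to 121 would be the `INSURANCE` one-hots, so 119 and 120 are presumably insurance types. A small sketch of mapping indices back to names follows. The miniature column list and the flagged pairs are hypothetical stand-ins for the real `X.columns` and `bad_covariates`.

```python
import pandas as pd

# Hypothetical miniature design matrix; the real X has 122 columns.
X_toy = pd.DataFrame(columns=[
    "age", "HOSPITAL_EXPIRE_FLAG", "hosp_duration", "INSURANCE:A", "INSURANCE:B",
])

# Made-up (column index, standardized bias) pairs in the notebook's output format.
flagged = [(1, 0.35), (3, 0.54)]

# Replace each index with the corresponding column name.
named = [(X_toy.columns[i], bias) for i, bias in flagged]
print(named)
```

Applied to the notebook's own data, the same lookup would be `X.columns[i]` for each index in `bad_covariates`.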
