## Tutorial on Treatment Effect

This tutorial introduces several ways to measure a treatment effect in data.
<br>The components of this tutorial are:
1. Introduction
2. Data Pre-processing
3. Measuring Treatment Effect
    * Naive method 
    * Nearest Neighbor Matching
    * Propensity Score Matching
4. Conclusion and Other Resources
5. Other References
    

### Introduction
In layman's terms, measuring a treatment effect means measuring the difference in outcomes when the same subject receives different treatments. This is especially important when analyzing data related to medical services or treatments. For example, we would like to know whether a medicine is effective in improving the survival rate of patients. Since one patient cannot both take and not take the medicine, one way is to compare the mean survival rate of two groups of people: the group that takes the medicine (treated) and the group that does not (control). If the mean survival rate of the treated group is higher, we might infer that the medicine is effective.

In this tutorial, I will use the Right heart catheterization dataset from http://biostat.mc.vanderbilt.edu/DataSets to analyze the treatment effect of patients receiving a treatment called RHC. The outcome is either death or survival of the patient. The detailed explanation regarding the dataset can be found [here](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/rhc.html).

First, we need to load the csv data into a pandas dataframe:

In [1]:
import pandas as pd
import numpy as np

In [47]:
data = pd.read_csv('rhc.csv')
print(data.head(5))

   Unnamed: 0               cat1           cat2   ca  sadmdte  dschdte  \
0           1               COPD            NaN  Yes    11142  11151.0   
1           2      MOSF w/Sepsis            NaN   No    11799  11844.0   
2           3  MOSF w/Malignancy  MOSF w/Sepsis  Yes    12083  12143.0   
3           4                ARF            NaN   No    11146  11183.0   
4           5      MOSF w/Sepsis            NaN   No    12035  12037.0   

    dthdte  lstctdte death  cardiohx  ...   meta  hema  seps  trauma  ortho  \
0      NaN     11382    No         0  ...     No    No    No      No     No   
1  11844.0     11844   Yes         1  ...     No    No   Yes      No     No   
2      NaN     12400    No         0  ...     No    No    No      No     No   
3  11183.0     11182   Yes         0  ...     No    No    No      No     No   
4  12037.0     12036   Yes         0  ...     No    No    No      No     No   

   adld3p   urin1   race      income  ptid  
0     0.0     NaN  white  Under $11

There are 63 variables and 5735 entries in total. 
The treatment that we would like to evaluate is in the column 'swang1', which takes the value "RHC" or "No RHC". The outcome is in the column 'death'. 

### Data Pre-processing
To simplify the example, and also to ensure that the code does not run for too long, I will only keep the first 1500 individuals. Also, for the analysis below, I will only use 7 variables as features/covariates for each individual. They are: mean blood pressure ('meanbp1'), heart rate ('hrt1'), respiration rate ('resp1'), temperature ('temp1'), pH ('ph1'), weight ('wtkilo1') and cancer ('ca'). Before evaluating the treatment effect, we first need to do some feature pre-processing. 

In [3]:
#use only the first 1500 data entries
data = data.iloc[0:1500]

Both the outcome (death = Yes/No) and the treatment (swang1 = RHC/No RHC) are binary categorical variables. In this case, a single 0/1 column is enough to represent the feature, instead of two dummy variable columns (one per value). The following function returns only the one column that we choose. 

In [52]:
# this function converts binary categorical column into a 0/1 column
def convert_cat_to_binary(col, chosen): 
    #get dummy variable
    dummies = pd.get_dummies(col)
    #choose the column that we need
    binary_col = dummies.loc[:, chosen]
    return binary_col

In [50]:
#Use 0/1 to represent the treatment: RHC = 1, No RHC = 0
treated = convert_cat_to_binary(data.loc[:,'swang1'], 'RHC')
#Use 0/1 to represent the outcome: death No (survived) = 1, death Yes = 0
survived = convert_cat_to_binary(data.loc[:,'death'], 'No')

In [51]:
print(survived.head(10))

0    1
1    0
2    1
3    0
4    0
5    1
6    1
7    0
8    1
9    1
Name: No, dtype: uint8


We can see that we have converted a two-value categorical column to a 0/1 binary column.
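
For a column with exactly two values, the same 0/1 encoding can also be obtained with a direct comparison. This avoids depending on the dtype that `get_dummies` returns (newer pandas versions may return boolean rather than uint8 columns). The snippet below is a minimal sketch on made-up toy values, not on the RHC data; the names `col` and `treated_flag` are only for illustration.

```python
import pandas as pd

# Toy stand-in for the 'swang1' column (values are made up for illustration)
col = pd.Series(["RHC", "No RHC", "RHC", "No RHC", "No RHC"], name="swang1")

# True/False comparison cast to integers gives the same 0/1 column
treated_flag = (col == "RHC").astype(int)
print(treated_flag.tolist())
```

Summing such a column gives the number of entries with the chosen value, which is how the group counts are obtained later in the tutorial.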

In [9]:
#selected numerical features
selected_numerical = data[['meanbp1', 'hrt1', 'resp1', 'temp1', 'ph1', 'wtkilo1']]
#selected categorical feature: cancer and convert to numerical
cancer_one_hot = pd.get_dummies(data.loc[:,'ca'], prefix = 'cancer')
#combine the features
feats = pd.concat([cancer_one_hot, selected_numerical], axis = 1)

Now, we are ready for the analysis!

### Measuring Treatment Effect
#### Naive Method
A naive way of evaluating the treatment effect is to compare the survival rate of the people in the dataset who received the treatment with that of the people who did not. 

In [10]:
#outcome of the treated group
treated_outcome = survived.loc[treated == 1]
#outcome of the control group
control_outcome = survived.loc[treated == 0]
print("Number of treated individuals is %d, and %d survived" %(len(treated_outcome), treated_outcome.sum()))
print("Number of un-treated individuals is %d and %d survived." %(len(control_outcome), control_outcome.sum()))

Number of treated individuals is 569, and 194 survived
Number of un-treated individuals is 931 and 353 survived.


In [11]:
#percentage survived of the treated group
treated_surv_rate = treated_outcome.sum()/len(treated_outcome)
#percentage survived of the control group
control_surv_rate = control_outcome.sum()/len(control_outcome)
print("treated survival rate %f, untreated survival rate %f" %(treated_surv_rate, control_surv_rate))

treated survival rate 0.340949, untreated survival rate 0.379162


The average treatment effect (ATE) refers to the difference in mean outcomes between the treatment and control groups. We can see that in this case, $ATE = 0.341-0.379 = -0.038$. However, how do we know whether this difference is significant, and not due to chance? We will use Pearson's chi-squared test for independence to check that.

According to the [Wikipedia page](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test), Pearson's chi-squared test is applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance. 

By hypothesis testing: 

H0: in the population, the two categories, survival and treatment, are independent. 
<br>
H1: in the population, the two categories are dependent. 

The value of the test statistic is $$\chi^2 = \sum_{i=1}^n \frac{(O_i - E_i)^2}{E_i}$$
where the sum runs over the $n$ cells of the table (here $n = 4$), $O_i$ are the observed counts, and $E_i$ are the expected counts if we assume that treatment and survival are independent of each other. For example, if we assume independence, the expected number of treated patients who survived is population \* percentage survived \* percentage treated, which is $\frac{547}{1500} * \frac{569}{1500} * 1500 \approx 207$. We can get the expected values for all 4 categories in this way.

The actual values are: 

|      | treated | untreated | total |
|--------|---------|--------|---------|
|survived | 194|353|547|
|dead   |    375  |     578  |   953  |
|total  |    569 |      931  |  1500|

and the expected values:

|      | treated | untreated | total |
|--------|---------|--------|---------|
|survived | 207|340|547|
|dead   |    362  |     591  |   953  |
|total  |    569 |      931  |  1500|

Therefore, according to the formula, the chi-squared value is 

$$ \frac{(194-207)^2}{207} + \frac{(353-340)^2}{340} + \frac{(375-362)^2}{362} + \frac{(578-591)^2}{591} = 2.066
$$
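
The expected counts used above can be reproduced from the row and column totals of the observed table. The following is a small sketch with NumPy; the variable names are only for illustration. Note that it keeps the unrounded expected counts, whereas the tables above round them to whole numbers.

```python
import numpy as np

# Observed counts: rows = survived/dead, columns = treated/untreated
observed = np.array([[194, 353],
                     [375, 578]])

row_totals = observed.sum(axis=1, keepdims=True)  # survived, dead
col_totals = observed.sum(axis=0, keepdims=True)  # treated, untreated
n = observed.sum()

# Under independence, expected count = row total * column total / grand total
expected = row_totals * col_totals / n
print(np.round(expected, 1))
```

The expected table has the same row and column totals as the observed one, which is why only one cell is free to vary.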


We also wrap this calculation in the following function, so that we can call it to calculate the chi-squared value later in this tutorial:

In [53]:
#This function takes 4 parameters: the number of people who survived and died in both the 
#treated and control groups. It returns the chi-squared test statistic.
def chi_sqr(treated_survived, treated_dead, control_survived, control_dead):
    #get the row, column and grand totals
    treated_total = treated_survived + treated_dead
    control_total = control_survived + control_dead
    survived = treated_survived + control_survived
    dead = treated_dead + control_dead
    total = survived + dead
    
    #get the 4 expected values; the first is rounded to a whole count to match the
    #hand calculation above, and the other three follow from the totals
    e_ts = int(round((survived / total) * (treated_total / total) * total))
    e_cs = survived - e_ts
    e_td = treated_total - e_ts
    e_cd = dead - e_td
    
    #apply the formula: sum of (observed - expected)^2 / expected over the 4 cells
    chi_sq = ((treated_survived - e_ts)**2 / e_ts
              + (control_survived - e_cs)**2 / e_cs
              + (treated_dead - e_td)**2 / e_td
              + (control_dead - e_cd)**2 / e_cd)
    return chi_sq

In [15]:
print(chi_sqr(194,375,353,578))

2.06629077979983


How does this value help us test our hypothesis? Recall that in order to do hypothesis testing, we need the p-value of the test. In this case, we can pass the statistic to the survival function of the chi-squared distribution to get its p-value. 

Before that, we also need to know the degrees of freedom (df) of this test. For the chi-squared test, the df is the number of cells that are actually free to vary given the totals: (rows - 1) \* (columns - 1), which is 1 for our 2x2 table (once you fill in 1 value, the other 3 can be calculated from the totals). 

With the chi-squared value and the df, we can now use the scipy chi-squared survival function to get the p-value: 

In [16]:
from scipy.stats import chi2
p_value = chi2.sf(2.066, 1)
print(p_value)

0.1506161007428045


As we learned in class, the p-value is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one we computed. With a cutoff of 0.05, a p-value of about 0.15 means we cannot reject the null hypothesis, so the observed difference in survival rates is not statistically significant.
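
As a cross-check, SciPy can run the whole test on the observed table in one call. The sketch below uses `scipy.stats.chi2_contingency` with `correction=False` (no continuity correction) so that it applies the same formula as above. Because it keeps the unrounded expected counts, while `chi_sqr` rounds them to whole numbers, it gives a slightly larger statistic (roughly 2.2), and the p-value remains above 0.05.

```python
from scipy.stats import chi2_contingency

# Observed counts: rows = survived/dead, columns = treated/untreated
observed = [[194, 353],
            [375, 578]]

# correction=False disables Yates' continuity correction for 2x2 tables
stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(stat, p, dof)
```

The returned degrees of freedom also confirm the value of 1 derived above for a 2x2 table.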

#### Nearest-Neighbor Matching

The above analysis seems reasonable, but there is an underlying problem with this naive comparison: the 'treatment' and 'control' groups are not 'standardized'. In a controlled experiment, we need to ensure that, other than the feature we analyze, all other features are kept the same. However, this is not the case here. The patients' physical conditions and medical histories can all be different. For example, a patient might receive RHC precisely because his condition is very serious. At the same time, the serious condition, rather than RHC, might be what results in his death. Therefore, the differences in survival rate might not be due to the treatment, but to the factors that determine who gets treated. 

A way to reduce this bias is through matching methods. We pair up individuals with similar attributes but opposite treatments to form two matched groups, and then compare the survival rates of the two groups. In this way, we have a 'pseudo' controlled experiment where, other than the factor we analyze, all other features are very similar.

A simple but effective matching method is nearest-neighbor matching. We choose a distance metric and, using it, find the matched pairs with the smallest distance between the two groups. If the two groups differ in size, we take each entry from the smaller group, find its match in the larger group, and discard the unmatched entries. 

In this example, we will use the squared Euclidean distance (the sum of squared differences across features) as the distance metric. Also, to ensure that the difference in each attribute contributes fairly, we will normalize the features first. 

In [54]:
#Normalize the features using mean and standard deviation
normalized_feats=(feats-feats.mean())/feats.std()

In [55]:
#get the features of the two groups
control_feats = normalized_feats.loc[treated == 0]
treated_feats = normalized_feats.loc[treated == 1]

In [19]:
# This function returns the number of survived individuals in both groups after 1-1 matching.
# Since the treated group is smaller, we find a 1-1 match for each 
# treated entry and discard the remaining unmatched entries in the control group
def nearest_neighbor_matching(treated_outcome, control_outcome, treated_feats, control_feats):
    #the number of survived individuals in the treated group
    treated_survived = treated_outcome.sum()
    control_survived = 0
    for i in range(len(treated_feats)):
        min_dist = float("inf")
        min_ind = -1
        # iterate through the control group to find the one with least distance
        for j in range(len(control_feats)):
            #calculate the squared distance between the treated and control entries
            cur_dist = ((treated_feats.iloc[i] - control_feats.iloc[j])**2).sum()
            if cur_dist < min_dist: 
                min_dist = cur_dist
                min_ind = j
        
        #add the outcome of the matched control entry to the control survivor count
        control_survived += control_outcome.iloc[min_ind]
        
        #remove the matched control entry from both the feature matrix and the outcome array
        control_feats = control_feats.drop(control_feats.index[min_ind], axis = 0)
        control_outcome = control_outcome.drop(control_outcome.index[min_ind], axis = 0)
    
    return treated_survived, control_survived

In [20]:
treated_survived, control_survived = nearest_neighbor_matching(treated_outcome, control_outcome, treated_feats, control_feats)

In [21]:
num_each_group = len(treated_outcome) #since the two groups now have the same size
ATE = treated_survived/num_each_group - control_survived/num_each_group
print("ATE is %f" %ATE)

treated_dead = num_each_group - treated_survived
control_dead = num_each_group - control_survived
chi = chi_sqr(treated_survived, treated_dead, control_survived, control_dead)
print(chi)
p_value_matched = chi2.sf(chi, 1)
print(p_value_matched)

ATE is -0.035149
1.528337362342197
0.21636214860657207


We can see that the ATE is slightly less negative than what we got without matching. However, the large p-value shows that this difference is still not statistically significant; it could plausibly have arisen by chance.
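
The nested Python loops above can be slow on larger samples. As a hedged sketch, the same greedy 1-1 matching without replacement can be written so that the distances from one treated entry to all control entries are computed in a single NumPy operation. The function name `greedy_match` and the toy arrays are assumptions for illustration, and the sketch assumes there are at least as many control rows as treated rows.

```python
import numpy as np

def greedy_match(treated_x, control_x):
    """Greedy 1-to-1 nearest-neighbor matching without replacement.

    Returns, for each treated row, the index of its matched control row.
    """
    treated_x = np.asarray(treated_x, dtype=float)
    control_x = np.asarray(control_x, dtype=float)
    available = np.ones(len(control_x), dtype=bool)
    matches = []
    for row in treated_x:
        # squared Euclidean distance from this treated row to every control row
        dists = ((control_x - row) ** 2).sum(axis=1)
        # controls that are already matched can no longer be chosen
        dists[~available] = np.inf
        j = int(np.argmin(dists))
        matches.append(j)
        available[j] = False
    return np.array(matches)

# Toy features (made up): 2 treated rows, 3 control rows
treated_x = np.array([[0.0, 0.0], [1.0, 1.0]])
control_x = np.array([[0.9, 1.1], [0.1, -0.1], [5.0, 5.0]])
matches = greedy_match(treated_x, control_x)
print(matches)
```

With the matched indices in hand, the control outcomes can be selected by position (for example with `.iloc`) and summed as in the function above.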

#### Propensity Score Matching (with Nearest Neighbor)

Besides matching directly on the covariate (feature) values, another effective matching method is propensity score matching. Since we suspect that some other features might affect both the treatment and the outcome, we will run a logistic regression to estimate the probability of getting the treatment given these other features. This probability is the propensity score. 

We then match entries across the two groups that have the same or similar propensity scores. In other words, both entries in a matched pair had the same (or a similar) probability of getting the treatment, yet they received different treatments. This makes it more plausible that the difference in their outcomes is due to the treatment and not to the other confounding features. 
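
To make the idea concrete, a propensity score is just the fitted probability from a logistic regression of the treatment indicator on the covariates. The sketch below fits such a model with plain NumPy using Newton's method on synthetic data. The function names `fit_logistic` and `propensity`, the synthetic data and the iteration count are all assumptions for illustration; this is not how the CausalInference package selects or fits its model.

```python
import numpy as np

def fit_logistic(X, t, n_iter=25):
    """Fit logistic regression coefficients (with intercept) by Newton's method."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))    # current fitted probabilities
        w = p * (1.0 - p)                       # weights for the Hessian
        grad = X1.T @ (t - p)                   # gradient of the log-likelihood
        hess = X1.T @ (X1 * w[:, None])         # negative Hessian
        beta += np.linalg.solve(hess, grad)     # Newton step
    return beta

def propensity(X, beta):
    """Estimated probability of treatment for each row of X."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-X1 @ beta))

# Synthetic data: treatment is more likely when the first covariate is large
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_logit = -0.5 + 1.0 * X[:, 0]
t = (rng.random(500) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

beta = fit_logistic(X, t)
scores = propensity(X, beta)
print(beta, scores[:5])
```

Each entry of `scores` lies between 0 and 1, and entries with similar scores are the candidates for matching.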

There is a Python package, "CausalInference", that can help us generate propensity scores. We first need to install the package using the following command.

$ pip install causalinference

In [23]:
#import the package
from causalinference import CausalModel

Now, we can construct the causal model using our data. Notice that the model takes NumPy arrays as parameters, not dataframes, so we need to convert them before passing them to the function.

In [24]:
#construct the causal model; .values converts each pandas object to a NumPy array
#(the older .as_matrix() method was removed in pandas 1.0)
causal = CausalModel(survived.values, treated.values, normalized_feats.values)

The 'summary_stats' attribute gives the summary of the statistics, including mean and standard deviation of both the outcome and all the features of each group. In the summary below, we can see that the raw-diff = -0.038, same as what we calculated before.

In [25]:
print(causal.summary_stats)


Summary Statistics

                       Controls (N_c=931)         Treated (N_t=569)             
       Variable         Mean         S.d.         Mean         S.d.     Raw-diff
--------------------------------------------------------------------------------
              Y        0.379        0.485        0.341        0.474       -0.038

                       Controls (N_c=931)         Treated (N_t=569)             
       Variable         Mean         S.d.         Mean         S.d.     Nor-diff
--------------------------------------------------------------------------------
             X0        0.023        1.038       -0.038        0.934       -0.062
             X1       -0.039        1.025        0.064        0.955        0.104
             X2        0.029        1.027       -0.047        0.953       -0.077
             X3        0.165        1.018       -0.270        0.908       -0.452
             X4       -0.071        1.002        0.116        0.986        0.188
      

There are two functions in this library that calculate propensity scores: est_propensity() and est_propensity_s(). For the first one, we need to specify the covariates to include linearly/quadratically; the second one decides that automatically. We will use the second one for simplicity.

In [28]:
causal.est_propensity_s()
print(causal.propensity)


Estimated Parameters of Propensity Score

                    Coef.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
     Intercept     -0.673      0.084     -8.031      0.000     -0.837     -0.509
            X3     -0.590      0.071     -8.374      0.000     -0.729     -0.452
            X8      0.258      0.059      4.337      0.000      0.141      0.374
            X4      0.309      0.063      4.940      0.000      0.187      0.432
            X5     -0.225      0.060     -3.751      0.000     -0.343     -0.107
            X1      0.144      0.059      2.449      0.014      0.029      0.260
            X6     -0.090      0.058     -1.560      0.119     -0.203      0.023
         X3*X1     -0.149      0.063     -2.364      0.018     -0.273     -0.025
         X3*X3      0.167      0.064      2.611      0.009      0.042      0.292
         X4*X5     -0.134      0.055     -2.449      0.014     -0.

We can see that there are some linear terms and some quadratic terms. The Coef. column shows how each covariate (feature) contributes to the propensity score. Accessing the 'fitted' entry below gives an array of size len(input), which holds the propensity score for each entry. 

In [32]:
prop_scores = causal.propensity['fitted']
print(prop_scores)

[0.43603559 0.35859778 0.17740404 ... 0.22427367 0.29045841 0.19459546]


So now we have the propensity score of each entry. We can again do nearest-neighbor matching, pairing entries that have similar scores but received different treatments, for our analysis of the treatment effect.

In [33]:
#Get the propensity score for the treated and control groups
propensity_treated = prop_scores[treated == 1]
propensity_control = prop_scores[treated == 0]

Since the score of each entry is pre-calculated, we can do the matching in a more efficient way. We sort the scores of both groups and do greedy matching along the two sorted arrays. Since two entries will rarely have exactly the same score, we provide an $\epsilon$ such that, as long as the difference between two scores is within this range, we treat them as a matched pair. 

In [41]:
def propensity_nearest_neighbor(eps, treated_outcome, control_outcome, propensity_treated, propensity_control):
    
    ts_total = 0 #sum of outcomes of matched treated entries
    cs_total = 0 #sum of outcomes of matched control entries
    ts_count = 0 #number of treated entries matched
    cs_count = 0 #number of control entries matched

    #sort the treated group by propensity score
    treated_sort_ind = np.argsort(propensity_treated)
    sorted_t_outcome = treated_outcome.iloc[treated_sort_ind]
    sorted_t_scores = propensity_treated[treated_sort_ind]

    #sort the control group by propensity score
    control_sort_ind = np.argsort(propensity_control)
    sorted_ctr_outcome = control_outcome.iloc[control_sort_ind]
    sorted_ctr_scores = propensity_control[control_sort_ind]
    
    i = 0 #ind of treated group
    j = 0 #ind of control group
    while (i<len(sorted_t_outcome)): 
        store_prev_j = j
        cur_score = sorted_t_scores[i]
        #for each treated entry, try to match an entry in control
        while(j<len(sorted_ctr_outcome)):
            dist = sorted_ctr_scores[j]- cur_score
            abs_dist = abs(dist)
            #as long as absolute difference between two scores is within epsilon, we will match the pair
            if abs_dist <= eps:
                #update the sum and number
                ts_total += sorted_t_outcome.iloc[i]
                cs_total += sorted_ctr_outcome.iloc[j]
                ts_count += 1
                cs_count += 1
                #increment j
                j+=1
                break
            else:
                #if the absolute distance is larger than epsilon
                if dist > 0:
                    #if score[i] < score[j], 
                    #this means that we cannot find a match for the current treated entry. 
                    #We will discard the entry, and return to the previous j position.
                    j = store_prev_j
                    break
                else:
                    #if score[i] > score[j]
                    j+=1
            
        i+=1
    return ts_total, ts_count, cs_total, cs_count

In [42]:
treated_outcome_sum, treated_count, control_outcome_sum, control_count = propensity_nearest_neighbor(0.001, treated_outcome, control_outcome, propensity_treated, propensity_control)

In [43]:
ATE = treated_outcome_sum/treated_count - control_outcome_sum/control_count
print(ATE)

-0.05200945626477543


In [46]:
chi = chi_sqr(treated_outcome_sum, treated_count-treated_outcome_sum, control_outcome_sum, control_count-control_outcome_sum)
print(chi)
p_value_matched = chi2.sf(chi, 1)
print(p_value_matched)

2.477995642701525
0.11544929422372713


Again, a p-value of 0.115 shows that the null hypothesis still cannot be rejected, so we conclude that the difference is not statistically significant.

Another useful feature of this library that makes use of propensity scores is stratification. With propensity scores, we can stratify the data into several sub-groups whose scores are more similar to each other, and measure the treatment effect within each sub-group.

In [45]:
causal.stratify_s()
print(causal.strata)


Stratification Summary

              Propensity Score         Sample Size     Ave. Propensity   Outcome
   Stratum      Min.      Max.  Controls   Treated  Controls   Treated  Raw-diff
--------------------------------------------------------------------------------
         1     0.074     0.253       307        69     0.196     0.197    -0.027
         2     0.253     0.309       140        48     0.281     0.280    -0.052
         3     0.309     0.374       114        73     0.341     0.339    -0.053
         4     0.374     0.482       217       158     0.422     0.426     0.005
         5     0.483     0.574        89        98     0.523     0.526    -0.079
         6     0.574     0.966        64       123     0.641     0.658    -0.057



After we split the data into sub-groups, we can see that the raw-diff varies considerably across groups. causal.strata is a list of CausalModel objects. Therefore, we can also use "for stratum in causal.strata" to access each model and get more insight into each group if necessary. 
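
One simple way to summarize the per-stratum results in a single number is a sample-size-weighted average of the raw differences. The sketch below copies the sample sizes and raw-diff values from the stratification summary above. The weighting scheme is an illustrative assumption and is not claimed to be what CausalInference computes internally.

```python
import numpy as np

# Values copied from the stratification summary above
controls = np.array([307, 140, 114, 217, 89, 64])
treated_n = np.array([69, 48, 73, 158, 98, 123])
raw_diff = np.array([-0.027, -0.052, -0.053, 0.005, -0.079, -0.057])

# Weight each stratum by its share of the full sample
stratum_size = controls + treated_n
weights = stratum_size / stratum_size.sum()
weighted_diff = float((weights * raw_diff).sum())
print(round(weighted_diff, 4))
```

Because each stratum contains entries with similar propensity scores, the within-stratum differences are less affected by the covariates than the overall raw difference.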

### Conclusion and Other Resources
In conclusion, measuring treatment effect is a very important topic in data science. The methods introduced in this tutorial are relatively simple. For example, instead of using squared distance in nearest-neighbor matching, there are more effective distance metrics. Also, there is ongoing discussion about whether propensity score matching is an unbiased and useful strategy. Nevertheless, these methods are a good starting point for understanding the concepts of causality and treatment effect. 

If you are interested, the following resources will be very useful for learning more about this topic. Also, the CausalInference library provides a wide range of other methods for estimating treatment effects. 

1. [More information on the CausalInference package](https://github.com/laurencium/causalinference/blob/master/docs/tex/vignette.pdf)
2. [A detailed explanation of the propensity score matching method](http://bayes.cs.ucla.edu/BOOK-09/ch11-3-5-final.pdf)
3. [A detailed introduction to Pearson's chi-squared test](http://www.ling.upenn.edu/~clight/chisquared.htm)
4. [A detailed introduction to various nearest-neighbor matching methods](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4285163/)
5. [An R package (MatchIt) that provides more matching functionalities](https://sejdemyr.github.io/r-tutorials/statistics/tutorial8.html)

### Other References
1. Connors, Alfred F., et al. "The effectiveness of right heart catheterization in the initial care of critically ill patients." JAMA 276.11 (1996): 889-897.
2. Wikipedia contributors. (2018, March 27). Pearson's chi-squared test. In Wikipedia, The Free Encyclopedia. Retrieved 15:03, March 29, 2018, from https://en.wikipedia.org/w/index.php?title=Pearson%27s_chi-squared_test&oldid=832669688
3. Huber, Chuck, and D. Drukker. "Introduction to treatment effects in Stata: Part 2." The STATA Blog 7 (2015).
4. Austin, Peter C. "A comparison of 12 algorithms for matching on the propensity score." Statistics in medicine 33.6 (2014): 1057-1069.
5. Wang, Laurence, CausalInference, (2017), GitHub repository, https://github.com/laurencium/Causalinference
6. Light, Caitlin. "Tutorial: Pearson's Chi-Square Test for Independence." Department of Linguistics, https://www.ling.upenn.edu/~clight/chisquared.htm.
