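Proof that the propensity score balances covariates

By Bayes' rule, among treated subjects with propensity score $\pi(X) = p$:

$$ P(X=x \mid \pi(X) = p, A=1) = \frac{P(A=1 \mid \pi(X)=p, X=x)P(X=x \mid \pi(X)=p)}{P(A=1 \mid \pi(X) = p)}$$

Both treatment probabilities on the right-hand side equal $p$. The first follows by averaging $P(A=1 \mid X) = \pi(X)$ over subjects with $\pi(X)=p$; the second holds because, once $X=x$ is known, $P(A=1 \mid X=x) = \pi(x) = p$:

$$ P(A = 1 \mid \pi(X)=p) = p $$
$$ P(A = 1 \mid \pi(X)=p, X=x) = p $$

The two factors of $p$ cancel, so

$$ P(X=x \mid \pi(X) = p, A=1) = P(X=x \mid \pi(X)=p) \text{, i.e. } X \text{ is independent of treatment given the propensity score}$$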
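The balancing property can also be checked numerically. The following is a minimal simulation sketch under assumed values: the two binary covariates `x1`, `x2` and the propensity function are made up for illustration and have nothing to do with the lalonde data. Within one propensity score stratum, the share of subjects with `x1 = 1` should be about the same for treated and control subjects, even though it differs markedly overall.

```r
set.seed(1)
n  <- 200000
x1 <- rbinom(n, 1, 0.6)
x2 <- rbinom(n, 1, 0.3)

# Assumed propensity score: depends on the covariates only through x1 + x2,
# so the covariate patterns (1,0) and (0,1) share the same score, 0.3
ps <- c(0.1, 0.3, 0.7)[x1 + x2 + 1]
a  <- rbinom(n, 1, ps)   # treatment assigned with probability ps

# Within the stratum ps == 0.3, compare P(x1 = 1) for treated and control
stratum <- ps == 0.3
p_x1_treated <- mean(x1[stratum & a == 1])
p_x1_control <- mean(x1[stratum & a == 0])
c(treated = p_x1_treated, control = p_x1_control)

# Ignoring the propensity score, x1 is clearly unbalanced
m_x1_treated <- mean(x1[a == 1])
m_x1_control <- mean(x1[a == 0])
c(treated = m_x1_treated, control = m_x1_control)
```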


In [None]:
options(repr.plot.width=6, repr.plot.height=4)

In [None]:
install.packages(c("tableone", "Matching", "MatchIt", "optmatch"))

In [None]:
library(tableone)
library(Matching)
library(MatchIt)
library(ggplot2)

In [None]:
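# Standardized mean difference of column `tcol` between the treated
# (gcol == 1) and control (gcol == 0) rows of `dfr`.
std_diff <- function(dfr, tcol, gcol = 'treat') {
    xt <- dfr[dfr[[gcol]] == 1, tcol]
    xc <- dfr[dfr[[gcol]] == 0, tcol]
    # nt = nc = 2 weights both group variances equally, so the denominator is
    # sqrt((var(xt) + var(xc)) / 2). For size-weighted pooling use
    # nt <- length(xt); nc <- length(xc) instead.
    nt <- 2; nc <- 2
    (mean(xt) - mean(xc)) / sqrt(((nt - 1) * var(xt) + (nc - 1) * var(xc)) / (nt + nc - 2))
}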
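Because `nt` and `nc` are both fixed at 2, the pooled variance in `std_diff` reduces to the plain average of the two group variances. Written out, with $\bar{x}_t, \bar{x}_c$ the group means and $s_t^2, s_c^2$ the group sample variances (notation introduced here for illustration), the function computes:

$$ \text{SMD} = \frac{\bar{x}_t - \bar{x}_c}{\sqrt{\dfrac{s_t^2 + s_c^2}{2}}} $$

As a quick worked example, group means of 2 (treated) and 3 (control) with both variances equal to 1 give $\text{SMD} = (2 - 3)/\sqrt{(1+1)/2} = -1$.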

In [None]:
data(lalonde)

The outcome is re78 – post-intervention income.

The treatment is treat – which is equal to 1 if the subject received the labor training and equal to 0 otherwise.

The potential confounding variables are: age, educ, black, hispan, married, nodegree, re74, re75.

#### q1
Find the standardized differences for all of the confounding variables (pre-matching). What is the standardized difference for married (to nearest hundredth)?

In [None]:
# Standardized difference (treated minus control) for every confounder, pre-matching
confounders <- c('age', 'educ', 'black', 'hispan', 'married', 'nodegree', 're74', 're75')
sapply(confounders, function(v) std_diff(lalonde, v, 'treat'))

> q1: -0.719 (treated minus control), i.e. 0.72 in magnitude

#### q2
What is the raw (unadjusted) mean of real earnings in 1978 for treated subjects minus the mean of real earnings in 1978 for untreated subjects?



In [None]:
mean(lalonde[lalonde$treat==1,]$re78) - mean(lalonde[lalonde$treat==0,]$re78)

> q2: -$635

#### q3
Fit a propensity score model. Use a logistic regression model, where the outcome is treatment. Include the 8 confounding variables in the model as predictors, with no interaction terms or non-linear terms (such as squared terms). Obtain the propensity score for each subject.

What are the minimum and maximum values of the estimated propensity score?

In [None]:
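data(lalonde)
# Logistic regression of treatment on the 8 confounders (main effects only)
fit <- glm(treat ~ age + educ + black + hispan + married + nodegree + re74 + re75,
           data = lalonde, family = binomial(link = "logit"))
# Estimated propensity score: fitted probability of treatment for each subject
prop_hat <- predict(fit, newdata = lalonde, type = "response")
lalonde$pscore <- prop_hat
range(prop_hat)  # minimum and maximum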

> q3: pscore (0.009, 0.853)
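The minimum and maximum are most informative when looked at per treatment group, since matching relies on the two groups overlapping. The sketch below illustrates this on simulated data so that it runs without any package; the data frame `sim`, the single covariate `x`, and the treatment model are assumptions made up for the example, not part of the exercise.

```r
set.seed(42)
n <- 500
x <- rnorm(n)
treat <- rbinom(n, 1, plogis(-0.5 + x))   # assumed true treatment model
sim <- data.frame(treat = treat, x = x)

# Same recipe as above: logistic regression, then fitted probabilities
fit_sim <- glm(treat ~ x, data = sim, family = binomial(link = "logit"))
sim$pscore <- predict(fit_sim, type = "response")

# Range of the estimated score within each group
r1 <- range(sim$pscore[sim$treat == 1])
r0 <- range(sim$pscore[sim$treat == 0])
rbind(treated = r1, control = r0)

# Region of common support: larger of the minima to smaller of the maxima
common <- c(max(r0[1], r1[1]), min(r0[2], r1[2]))
common
```

Subjects whose score falls outside `common` have no counterpart with a comparable score in the other group, which is one reason to consider a caliper later on.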

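```
# Histogram of the estimated propensity score by treatment group
# (multiplot is the helper defined at the end of this notebook)
g1 <- ggplot(lalonde, aes(x = pscore, fill = factor(treat))) +
    geom_histogram(position = "dodge", binwidth = 0.025) + theme_bw()
multiplot(g1, cols = 2)
```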

#### q4

Now carry out propensity score matching using the Match function.

Before using the Match function, first do:

> set.seed(931139)

Setting the seed will ensure that you end up with a matched data set that is the same as the one used to create the solutions.

Use options to specify pair matching, without replacement, no caliper.

Match on the propensity score itself, not logit of the propensity score. Obtain the standardized differences for the matched data.

What is the standardized difference for married?

An alternative would be nearest-neighbor matching with `matchit` from the MatchIt package:
```
m.out <- matchit(treat ~ age + educ + black + hispan + married + nodegree + re74 + re75,
                 data=lalonde, method = "nearest")
p1<- plot(m.out, type='jitter')
p2<- plot(m.out, type='hist')
```


In [None]:
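set.seed(931139)
# 1:1 matching on the propensity score, without replacement
pmatch <- Match(Tr = lalonde$treat, M=1, X=lalonde$pscore, replace=FALSE, caliper=NaN)
# Stack the matched treated rows first, then their matched controls in the same pair order
matched <- lalonde[unlist(pmatch[c('index.treated', 'index.control')]), ]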

In [None]:
std_diff(matched, 'married', 'treat')

> q4: -0.027

#### q5
For the propensity score matched data:
Which variable has the largest standardized difference?

In [None]:
lapply(c('age', 'nodegree', 're74', 'black'), function(v) std_diff(matched, v, 'treat'))

> q5: black, 0.85

#### q6
Re-do the matching, but use a caliper this time. Set the caliper=0.1 in the options in the Match function.
Again, before running the Match function, set the seed: 931139

How many matched pairs are there?

In [None]:
set.seed(931139)
pmatch <- Match(Tr = lalonde$treat, M=1, X=lalonde$pscore, replace=FALSE, caliper=0.1)
matched <- lalonde[unlist(pmatch[c('index.treated', 'index.control')]), ]

In [None]:
nrow(matched)/2

> q6: 111
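To see why a caliper reduces the number of pairs, here is a small base-R sketch of greedy 1:1 nearest-neighbor matching without replacement. It is not the algorithm used by `Match`; the function `greedy_pair_match` and the toy scores are hypothetical. It does follow the Matching documentation in one respect: the caliper is expressed in standard deviations of the matching variable, so `caliper=0.1` means 0.1 standard deviations of the propensity score, not 0.1 on the probability scale.

```r
# Greedy 1:1 nearest-neighbor matching on a score, without replacement.
# caliper_sd: largest allowed distance, in standard deviations of ps
#             (NULL means no caliper).
greedy_pair_match <- function(ps, treat, caliper_sd = NULL) {
  max_dist <- if (is.null(caliper_sd)) Inf else caliper_sd * sd(ps)
  controls <- which(treat == 0)
  m_t <- integer(0); m_c <- integer(0)
  for (i in which(treat == 1)) {
    if (length(controls) == 0) break
    d <- abs(ps[controls] - ps[i])
    j <- which.min(d)                  # closest remaining control
    if (d[j] <= max_dist) {            # keep the pair only if within the caliper
      m_t <- c(m_t, i)
      m_c <- c(m_c, controls[j])
      controls <- controls[-j]         # without replacement
    }
  }
  data.frame(treated = m_t, control = m_c)
}

# Toy scores: two treated subjects (rows 1-2), three controls (rows 3-5)
ps    <- c(0.80, 0.50, 0.78, 0.45, 0.10)
treat <- c(1, 1, 0, 0, 0)

no_cal   <- greedy_pair_match(ps, treat)
with_cal <- greedy_pair_match(ps, treat, caliper_sd = 0.1)
nrow(no_cal)    # 2 pairs
nrow(with_cal)  # 1 pair: the second pair is 0.05 apart, beyond 0.1 * sd(ps)
```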

#### q7
Use the matched data set (from propensity score matching with caliper=0.1) to carry out the outcome analysis.

For the matched data, what is the mean of real earnings in 1978 for treated subjects minus the mean of real earnings in 1978 for untreated subjects?

In [None]:
mean(matched[matched$treat==1,]$re78) - mean(matched[matched$treat==0,]$re78)

> q7: 1246.81

#### q8

Use the matched data set (from propensity score matching with caliper=0.1) to carry out the outcome analysis.

Carry out a paired t-test for the effect of treatment on earnings. What are the values of the 95% confidence interval?

In [None]:
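# Rows of `matched` are ordered treated first, then controls in the same pair order,
# so the i-th treated value is paired with the i-th control value
t.test(matched[matched$treat==1, 're78'],
       matched[matched$treat==0, 're78'],
       conf.level = 0.95, paired=TRUE)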

> q8: (-420.03, 2913.64)
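A paired t-test is the same as a one-sample t-test on the within-pair differences, which is why the row order of the matched data matters. The sketch below shows the equivalence on made-up outcomes for five hypothetical pairs (`y_t` and `y_c` are illustrative values, not lalonde data).

```r
# Hypothetical outcomes for five matched pairs:
# element i of y_t is paired with element i of y_c
y_t <- c(10, 12, 9, 15, 11)
y_c <- c(8, 11, 10, 12, 9)

paired  <- t.test(y_t, y_c, paired = TRUE, conf.level = 0.95)
one_smp <- t.test(y_t - y_c, conf.level = 0.95)

# Both calls give the same interval for the mean difference
paired$conf.int
one_smp$conf.int
```

Shuffling one of the two vectors would leave the difference in means unchanged but alter the pair differences, and with them the confidence interval.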

In [None]:
tbone <- CreateTableOne(data = matched, strata = 'treat', test = TRUE, smd = TRUE)
print(tbone, smd = TRUE)

--- 

In [None]:
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}