<img src="images/econ140R_logo.png" width="200" />

<h1>ECON 140R Class 24</h1>

<b>Regression discontinuity (RD)</b> is an elegant, artful method of causal inference. RD methods also require

* Good data wrangling skills

* Luck with public data sources, or

* Cultivated relationships with guardians of restricted data


Learning objectives:

1. A successful RD often requires lots of data, and lots of data wrangling
2. Running an RD regression might be straightforward
3. Graphing an RD plot might be harder

Special thanks to [Baylor Economics Prof. Scott Cunningham](https://www.scunning.com/), who provides a very nice overview of RD methods with applications in [Chapter 6 of *Causal Inference: The Mixtape*](https://mixtape.scunning.com/06-regression_discontinuity). Below I have adapted the __R__ code he lists, which I believe he wrote himself, perhaps inspired by the Stata code supplied by [CSPH Prof. Marcelo Perraillon](https://coloradosph.cuanschutz.edu/resources/directory/directory-profile/Perraillon-Marcelo-UCD6000072694).

<h3>Broader discussion points</h3>

* [Card, Dobkin, and Maestas (2008)](https://www-aeaweb-org.libproxy.berkeley.edu/articles?id=10.1257/aer.98.5.2242) report significant effects of Medicare eligibility on survival among severely ill patients
* How should we think about these results alongside those of the RAND Health Insurance Experiment?
* Hint: the key concept rhymes with "Focal Bandage Sweetened Insect"

<h2>Medicare eligibility at age 65</h2>

[Card, Dobkin, and Maestas (2008)](https://www-aeaweb-org.libproxy.berkeley.edu/articles?id=10.1257/aer.98.5.2242) examine health effects associated with the discontinuous jump in Medicare eligibility on one's 65th birthday in the United States. As they remark in their opening paragraph, however, population-level or average outcome measures do not show much of a jump at age 65, despite the jump in Medicare eligibility:

> Medicare pays nearly one-fifth of total health care costs in the United States. Yet evidence on the health effects of the program is limited. Studies of aggregate death rates before and after the introduction of Medicare show little indication of a program impact (Finkelstein and McKnight 2005). The age profiles of mortality and self-reported health in the population as a whole are likewise remarkably smooth around the eligibility threshold at age 65 (Card, Dobkin, and Maestas 2004; Dow 2004). Although existing research has shown that the utilization of health care services increases once people become eligible for Medicare (e.g., Decker and Rapaport [2002], McWilliams et al. [2003, 2007], Card, Dobkin, and Maestas [2004]), the health impact of these additional services remains uncertain.

First, CDM show the stark discontinuity in Medicare coverage using pooled extracts of the 1999-2003 waves of the National Health Interview Survey (NHIS), which are public and can be drawn from [IPUMS](https://healthsurveys.ipums.org/). Then they examine restricted-access datasets covering hospital admissions via emergency departments in California and elsewhere. We can examine the former but not the latter.

<h3>The RD graphic</h3>

Below is their Figure 1, which shows big discontinuous jumps in Medicare coverage and in coverage by any insurance, a slump in multiple-policy coverage, and a big drop in managed-care coverage. 

(For more on "managed care," take ECON 157. But the short of it is that most Americans under age 65 are covered by employer-based insurance, which typically takes the form of "managed care," meaning the insurance company intermediates and sets policies in addition to setting rates of coinsurance (out-of-pocket payment) per treatment or service.)

<img src="images/cdm-fig1.png" width="700" />

In [None]:
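library(tidyverse)
library(haven)
# Install rdrobust and estimatr only if they are not already available
if (!requireNamespace("rdrobust", quietly = TRUE)) install.packages("rdrobust")
library(rdrobust)
if (!requireNamespace("estimatr", quietly = TRUE)) install.packages("estimatr")
library(estimatr)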

Here is the NHIS data extract that CDM provide in the paper's replication package on the AEA website. The extract contains a subset of the 1999-2003 waves of the NHIS. A key variable is age measured in years and quarters, because the NHIS reported month of birth in public files prior to 2015, and it also reported quarter of interview.

In [None]:
his99p <- read_dta("data/his99p.dta")

# What are the dimensions of this dataset?
# Big, although not superhumanly large. Small enough for datahub
dim(his99p)

Many AEA replication packages contain Stata or SAS datasets, and a big pitfall can be the way different programs treat missing values. Here, it is useful to remove the NA's. There might be a way to do this for only a subset of variables and thus keep more rows, but I leave it to the careful reader to ascertain whether that's possible. (My first ask of ChatGPT resulted in bad advice!)

In [None]:
# Remove the NAs
his99p0 <- na.omit(his99p)
# The code below was what ChatGPT suggested,
# but it only keeps the mcare column, which is no good
#his99p0 <- na.omit(his99p[, "mcare", drop = FALSE])
dim(his99p0)
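
On the question of keeping more rows: one option is to drop only the rows that are missing in the variables a given analysis actually uses. Here is a minimal sketch on a made-up toy data frame (the values and the `income` column are hypothetical, not from the CDM extract), using base R's `complete.cases()`.

```r
# Toy data frame standing in for his99p; values are invented for illustration
toy <- data.frame(
  mcare  = c(1, 0, NA, 1, 0),
  age    = c(2.5, -1.0, 0.5, NA, -3.25),
  income = c(NA, 40, 55, 60, NA)   # hypothetical extra column with its own NAs
)

# na.omit() drops a row if ANY column is missing
all_complete <- na.omit(toy)

# Keep every column, but only require mcare and age to be non-missing
keep_vars <- c("mcare", "age")
subset_complete <- toy[complete.cases(toy[, keep_vars]), ]

nrow(all_complete)     # rows with no NA anywhere
nrow(subset_complete)  # rows with mcare and age both observed
```

With the tidyverse loaded, `tidyr::drop_na(his99p, mcare, age)` should do the same thing in one call. The trade-off is that each regression then runs on a slightly different sample unless you fix the variable list up front.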

Let's generate an RD plot that replicates part of CDM's Figure 1. There is a slicker way of doing this using the `rdplot()` function inside the `rdrobust` package, but a brute force way of proceeding is generating averages of the $y$-variable over a collapsed or "cut" $x$-variable. 

In [None]:
# Cut age into 201 equal-width bins, matching the 201 grid points
# used in the next cell, and take the mean of mcare within each bin
mcaremeans <- split(his99p0$mcare, 
                  cut(his99p0$age, 201)) %>% 
  lapply(mean) %>% 
  unlist()

In [None]:
# Create a data frame with the average levels of mcare (Y)
# and the age variable running from -10 to 10 with increments of 0.1,
# which are 201 cuts, matching what is in the call above
agg_his99p0_data <- data.frame(mcare = mcaremeans, 
                           age = seq(-10,10, by = 0.1))

In [None]:
dim(agg_his99p0_data)
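
Because age is recorded in discrete steps, another way to get the dots is to average the outcome within each distinct value of the running variable rather than within equal-width cuts. The sketch below does that with base R's `aggregate()` on simulated data. The variable names mirror the ones above, but the data-generating process (coverage jumping from about 0.15 to about 0.90 at zero, quarter-year steps) is an assumption made up for illustration, not CDM's data.

```r
set.seed(140)

# Simulated stand-in: age relative to 65 in quarter-year steps from -10 to 10
n   <- 20000
age <- sample(seq(-10, 10, by = 0.25), n, replace = TRUE)

# Assumed coverage probabilities: low before the cutoff, high after
p_mcare <- ifelse(age >= 0, 0.90, 0.15)
mcare   <- rbinom(n, size = 1, prob = p_mcare)
sim     <- data.frame(age = age, mcare = mcare)

# One mean per distinct age value; the age column of the result is the
# actual bin value, so no separate seq() is needed to label the x-axis
bin_means <- aggregate(mcare ~ age, data = sim, FUN = mean)

head(bin_means)
nrow(bin_means)   # one row per distinct age value
```

The resulting data frame can be handed to `geom_point()` in the same way as `agg_his99p0_data`. The design choice is that the bins follow the data's own grid, so there are no empty bins and no need to keep a bin count in sync across cells.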

In [None]:
# Generate a column gg_group that measures being past the cutoff
his99p <- his99p %>% 
  mutate(gg_group = case_when(age > 0 ~ 1, TRUE ~ 0))

In [None]:
# Overlay the binned means (points) on linear fits estimated separately
# on each side of the cutoff (grouped by gg_group)
ggplot(his99p, aes(age, mcare)) +
  geom_point(aes(x = age, y = mcare), data = agg_his99p0_data) +
  stat_smooth(aes(age, mcare, group = gg_group), method = "lm") +
  xlim(-10, 10) + ylim(0, 1) +
  geom_vline(xintercept = 0) +
  xlab("Age relative to 65 (years, measured in quarters)") +
  ylab("Share with Medicare")

That was a lot of work, and it turns out `rdplot()` inside `rdrobust` can do it for us. A problem with off-the-shelf routines, however, is that it's hard to know what they're doing. By default, `rdplot()` fits a fourth-order global polynomial on each side of the cutoff (`p = 4`), which is why what it spits out here is curved rather than a straight line; setting `p = 1` asks for a linear fit instead.

In [None]:
rdplot(his99p$mcare, his99p$age,
       x.lim = c(-10, 10),
       y.lim = c(0, 1),
       x.label = "Age relative to 65 (years, measured in quarters)",
       y.label = "Share with Medicare", title = "")
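
For comparison with the hand-rolled plot, here is a sketch of asking `rdplot()` for a straight-line fit via its `p` argument. It runs on simulated data (the jump size and noise are made up) so that it is self-contained; `hide = TRUE` just suppresses drawing so the returned object can be inspected first.

```r
library(rdrobust)

set.seed(140)

# Simulated running variable and outcome with an assumed jump of 0.6 at zero
n <- 4000
x <- runif(n, min = -10, max = 10)
y <- 0.2 + 0.01 * x + 0.6 * (x >= 0) + rnorm(n, sd = 0.1)

# p = 1 requests a linear fit on each side of the cutoff c = 0
rd_fig <- rdplot(y, x, c = 0, p = 1, hide = TRUE,
                 x.label = "Running variable",
                 y.label = "Outcome",
                 title   = "")

# Fitted polynomial coefficients, one column per side of the cutoff
rd_fig$coef
```

Swapping in `his99p$mcare` and `his99p$age` for `y` and `x`, and dropping `hide = TRUE`, should give the linear analogue of the figure above.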

<h3>The RD regression</h3>
    
In the Minimum Legal Drinking Age (MLDA) example of section 4.1 in *Mastering Metrics*, we ran regressions using collapsed data on death rates within bins defined by months of age, within ± 2 years of age 21. That ± 2-year window is what the textbook calls the *bandwidth*; with monthly bins, it contains 48 bins.

Here, in the replication package, Card, Dobkin, and Maestas (2008) run regressions with the original data. 

An open issue is how to estimate the standard errors in a situation like this. CDM cluster on the running variable. But later work, as described by [Cunningham in Section 6.2.6 of the *Mixtape*](https://mixtape.scunning.com/06-regression_discontinuity#inference), calls that into question. In *Mastering Metrics*, Angrist and Pischke seem to suggest only that clustering on the unit in panel data might be important. Here, we do not have repeated observations.

I recommend proceeding with caution. Let's look at how the standard errors can wiggle around.

In [None]:
# Linear model with heteroscedasticity-robust standard errors (Stata)
mcare_reg1r <- lm_robust(mcare ~ d65 +
                         age,
                         data = his99p,
                         se_type = "stata"
                        )
summary(mcare_reg1r)

<font color = "blue">
    What do you see here? What is the effect of turning 65 on the probability of reporting Medicare coverage?
    </font>

<hr>

Here is a specification with "dueling quadratics" like we saw in Class 23, with math similar to what is shown at the bottom of p. 156 in *Mastering Metrics*:

$$
Medicare_{a} = \alpha + \rho \ D_a + \gamma_1 (a - a_0) + \gamma_2 (a - a_0)^2
+ \delta_1 \left[ (a - a_0) D_a
\right]
+ \delta_2 \left[ (a - a_0)^2 D_a
\right]
+ e_a
$$

In [None]:
# Dueling quadratics with heteroscedasticity-robust standard errors (Stata)
mcare_reg2r <- lm_robust(mcare ~ d65 +
                         age + agesq +
                         age_d65 + agesq_d65,
                         data = his99p,
                         se_type = "stata"
                        )
summary(mcare_reg2r)
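
The extract appears to already include `d65`, `agesq`, `age_d65`, and `agesq_d65`, so the cell above can use them directly. As a sketch of how such terms could be built from scratch, and of where $\rho$ shows up in the output, here is the same specification on simulated data using base R's `lm()`. The definitions below (for example, `d65 = 1` when `age >= 0`) are my assumptions about how the columns are constructed, and the true jump of 0.6 is invented.

```r
set.seed(140)

# Simulated data with an assumed jump of 0.6 at the cutoff a0 = 0
n   <- 4000
age <- runif(n, min = -10, max = 10)      # age relative to the cutoff
d65 <- as.numeric(age >= 0)               # assumed treatment definition
y   <- 0.15 + 0.004 * age + 0.0002 * age^2 + 0.6 * d65 + rnorm(n, sd = 0.1)

sim <- data.frame(y, age, d65)

# Build the quadratic and interaction terms by hand
sim$agesq     <- sim$age^2
sim$age_d65   <- sim$age * sim$d65
sim$agesq_d65 <- sim$agesq * sim$d65

# Dueling quadratics: separate quadratic in age on each side of the cutoff
fit <- lm(y ~ d65 + age + agesq + age_d65 + agesq_d65, data = sim)

# The coefficient on d65 is the estimated jump (rho) at age = a0
rho_hat <- unname(coef(fit)["d65"])
rho_hat
```

Because age is centered at the cutoff, the interaction terms vanish at `age = 0`, which is what lets the coefficient on `d65` be read as the jump itself.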

In [None]:
# Dueling quadratics with Stata-style cluster-robust standard errors, clustered on age
mcare_reg2rc <- lm_robust(mcare ~ d65 +
                         age + agesq +
                         age_d65 + agesq_d65,
                         data = his99p,
                         se_type = "stata",
                         clusters = age
                        )
summary(mcare_reg2rc)

Hm. Well, the standard errors clustered on the running variable are mostly larger, but not across the board.
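
To make that comparison easier to eyeball, one option is to line the two sets of standard errors up side by side. Here is a minimal sketch on simulated data, relying on the `std.error` element that `lm_robust()` objects carry. The data and the quarter-year rounding of age are invented for illustration.

```r
library(estimatr)

set.seed(140)

# Simulated data: running variable rounded to quarter-year steps
n   <- 4000
age <- round(runif(n, min = -10, max = 10) * 4) / 4
d65 <- as.numeric(age >= 0)
y   <- 0.15 + 0.004 * age + 0.6 * d65 + rnorm(n, sd = 0.1)
sim <- data.frame(y, age, d65, agesq = age^2,
                  age_d65 = age * d65, agesq_d65 = age^2 * d65)

spec <- y ~ d65 + age + agesq + age_d65 + agesq_d65

# Same regression, two ways of computing the standard errors
fit_robust  <- lm_robust(spec, data = sim, se_type = "stata")
fit_cluster <- lm_robust(spec, data = sim, se_type = "stata", clusters = age)

# Side-by-side standard errors and their ratio
se_table <- cbind(robust    = fit_robust$std.error,
                  clustered = fit_cluster$std.error)
se_table <- cbind(se_table, ratio = se_table[, "clustered"] / se_table[, "robust"])
round(se_table, 4)
```

The point estimates are identical across the two fits; only the standard errors move. Passing `mcare_reg2r` and `mcare_reg2rc` through the same `cbind()` step would give the table for the actual NHIS regressions.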

We can also include covariates on the right-hand side. We could also do this with averages of these $x$-variables within the bins, but it is considerably easier here with the microdata.

In [None]:
# Dueling quadratics plus covariates, clustered on age.
# Stored under a new name so the earlier model is not overwritten.
mcare_reg3rc <- lm_robust(mcare ~ d65 +
                          age + agesq +
                          age_d65 + agesq_d65 +
                          female + bnh + onh + hispanic +
                          dropout + somecoll + college +
                          region2 + region3 + region4 +
                          y2000 + y2001 + y2002 + y2003,
                          data = his99p,
                          se_type = "stata",
                          clusters = age
                         )
summary(mcare_reg3rc)

<font color = "blue">
What has inclusion of these covariates done, if anything, to the main result regarding the effect of Medicare eligibility on Medicare coverage?
    Do you see anything striking about the effects of the covariates? Do you see racial/ethnic inequality in Medicare uptake?
</font>

<hr>

Finally, here is a call to `rdrobust()`, the main estimation function in `rdrobust`, one of the off-the-shelf packages for RD estimation. I'm not a huge fan because I like to know what's going on. But the results are qualitatively similar, I guess.

In [None]:
rdr_model <- rdrobust(his99p$mcare,his99p$age)
summary(rdr_model)

Meh. Where's the instruction manual? ChatGPT?!?!?!?!
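
In the absence of a friendly manual, one way to see what `rdrobust()` is doing is to spell out its main defaults in the call. As I understand the package documentation, the defaults are a local linear fit (`p = 1`) with a triangular kernel (`kernel = "tri"`) and a data-driven MSE-optimal bandwidth (`bwselect = "mserd"`), so it is not running the global quadratic we ran above. The sketch below uses simulated data with a made-up jump of 0.6.

```r
library(rdrobust)

set.seed(140)

# Simulated data with an assumed jump of 0.6 at zero
n <- 4000
x <- runif(n, min = -10, max = 10)
y <- 0.15 + 0.004 * x + 0.0002 * x^2 + 0.6 * (x >= 0) + rnorm(n, sd = 0.1)

# Spell out the main defaults instead of relying on them implicitly
rd_fit <- rdrobust(y, x, c = 0, p = 1, kernel = "tri", bwselect = "mserd")
summary(rd_fit)

# Conventional point estimate of the jump, and the bandwidths that were chosen
rd_fit$coef[1]
rd_fit$bws
```

`rdrobust()` also accepts a `cluster` argument, so the clustering question raised earlier carries over here too.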

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>