<img src="images/econ140R_logo.png" width="200" />

<h1>ECON 140R Class 23</h1>

<b>Regression discontinuity (RD)</b> is an elegant, artful method of causal inference that relies on visualization and the search for a <i>discontinuous jump</i> in some outcome when a treatment switches on at one or more <b>cutoff</b> points in a <b>running variable</b>. It is the subject of Chapter 4 in <i>Mastering Metrics</i>.

Learning objectives:

1. Run a basic RD on the minimum legal drinking age (MLDA) data, using a dataset that has already been set up
2. Add some extensions, like quadratic terms and interactions

In [None]:
library(haven)
library(dplyr)

Now let us load in the dataset `AEJfigs.dta` that Angrist and Pischke examine in Section 4.1. These data are similar to the minimum legal drinking age (MLDA) panel data we saw in Chapter 5.

In [None]:
AEJfigs <- read_dta("data/AEJfigs.dta")
AEJfigs

I think those weird rows with no data but with `-fitted` variables are a clever means to an end, at least in Stata: they create fitted segments that run all the way up to the jump at the cutoff.

Further below, they become troublesome in __R__, and I created a cleaned dataset using the `na.omit()` function.

Note how the `agecell` variable is kind of funny looking. It is close to age in years plus half of 1/12, or something like a midpoint of a month. But its decimal value jumps around, where we see `19.06849` in row 1 and then `20.05479` in row 13.

I suspect the authors constructed the age variable by measuring the average age within each of 12 "monthly" bins of individuals observed in their dataset, which are deaths measured between 1997 and 2003 (MM p. 149). Small deviations in the sizes of birth cohorts and deaths year-to-year are probably what shift the average age slightly up or down.

We will need to generate some variables for the RD estimation.

In [None]:
# Create a recentered "age" variable that measures
# years (in roughly monthly steps) before or after age 21
AEJfigs <- mutate(AEJfigs, 
                  age = agecell - 21)

# Create an indicator variable for over age 21
AEJfigs <- mutate(AEJfigs, 
                  over21 = as.integer(agecell >= 21))

# Age-squared, a quadratic term
AEJfigs <- mutate(AEJfigs, 
                  age2 = age^2)

# Age interacted with over-21
AEJfigs <- mutate(AEJfigs, 
                  over_age = over21*age)

# Age-squared interacted with over-21
AEJfigs <- mutate(AEJfigs, 
                  over_age2 = over21*age2)

# "Other external causes," a residual shown in the 5th row
# of Table 4.1
AEJfigs <- mutate(AEJfigs, 
                  ext_oth = external - homicide - suicide - mva)

head(AEJfigs)

The dataset already appears to contain fitted values for the "dueling quadratic" specification, where the ages below and above the cutoff are allowed to follow separate quadratics. We will discuss this further below. These fitted values are in the `allfitted` column.

<h2>Linear specification</h2>

Figure 4.2 is literally the "killer chart" of Section 4.1 in *Mastering Metrics.* Let's reproduce it:

<img src="images/MMfig42.png" width="500" />

Below is the basic RD estimation, of equation (4.2) appearing on page 152:

$$
\bar{M}_{a} = \alpha + \rho \ D_a + \gamma \ a + e_a
$$
where
* $\bar{M}_{a}$ is the total death rate in month $a$
* $D_a = 1$ when age $a \geq 21$ and $0$ when $a < 21$

In addition to $D_a$, the indicator for age being 21 and over, we have a constant term $\alpha$ and a linear term in age. Because age enters here uncentered, $\alpha$ is the fitted death rate extrapolated back to age zero with $D_a = 0$, so it has no natural interpretation on its own. In the text on page 152, Angrist and Pischke cite $\rho = 7.7$ around an average death rate of about 95.

In [None]:
rd_reg1 <- lm(all ~ over21 
              + agecell,
             data = AEJfigs)
summary(rd_reg1)

A call to `summary` reveals the average death rate in the sample, which here is 95.67 per 100,000, within spitting distance of the 95 cited by Angrist and Pischke in the middle of page 152.

In [None]:
summary(AEJfigs$all)

We get essentially the same results, except that the constant term is different, if we switch from `agecell` to `age`, which equals age minus the cutoff age $a_0$:

In [None]:
rd_reg2 <- lm(all ~ over21 
              + age,
             data = AEJfigs)
summary(rd_reg2)
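
To see why only the constant changes, rewrite the linear model in terms of the recentered age $a - a_0$:

$$
\alpha + \rho \ D_a + \gamma \ a = (\alpha + \gamma \ a_0) + \rho \ D_a + \gamma \ (a - a_0)
$$

The coefficients on $D_a$ and on age are untouched, while the constant becomes $\alpha + \gamma \ a_0$, which is the fitted death rate just to the left of the cutoff.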

With either regression, we can create fitted values of the linear RD model using the code below. The first thing I'll do is remove the lines with `NA` values. Then I'll call `predict()`, and finally, I'll call `lines()` twice in order to show two separate linear segments on the graph.

(There is probably also a way to coax __R__ into creating fitted values for those funny rows with missing data. That can be an extension for intrepid students to explore.)
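
One possible route, sketched here on simulated data rather than on `AEJfigs`: `predict()` accepts a `newdata` argument, so we can ask for fitted values exactly at the cutoff, once with `over21 = 0` and once with `over21 = 1`. The data frame `sim`, the helper data frame `at_cutoff`, and the parameter values (constant 92, jump 8, slope -1) are all invented for illustration.

```r
# Simulated stand-in for the MLDA data (every number here is made up)
set.seed(140)
sim <- data.frame(age = (seq_len(48) - 24.5) / 12)  # month midpoints, in years from the cutoff
sim$over21 <- as.integer(sim$age >= 0)
sim$all <- 92 + 8 * sim$over21 - 1 * sim$age + rnorm(48, sd = 2)

fit <- lm(all ~ over21 + age, data = sim)

# Two hypothetical rows exactly at the cutoff:
# the end of the left segment (over21 = 0) and the start of the right one (over21 = 1)
at_cutoff <- data.frame(over21 = c(0, 1), age = c(0, 0))
at_cutoff$fitted <- as.numeric(predict(fit, newdata = at_cutoff))
at_cutoff

# The gap between the two fitted values is the estimated jump
diff(at_cutoff$fitted)
coef(fit)["over21"]
```

Appending rows like these to each segment before calling `lines()` would extend both fitted lines all the way to the cutoff, which seems to be what the extra rows in the Stata file are for.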

In [None]:
# Remove the lines without data
AEJfigs0 <- na.omit(AEJfigs)

# Create a new column with predicted values from rd_reg1,
# evaluated on the cleaned rows so the lengths always match
AEJfigs0$allfittedline <- predict(rd_reg1, newdata = AEJfigs0)

In [None]:
plot(AEJfigs0$agecell, 
     AEJfigs0$all, ylim = c(80,115))
lines(AEJfigs0$agecell[1:24], 
      AEJfigs0$allfittedline[1:24], col = "red")
lines(AEJfigs0$agecell[25:48], 
      AEJfigs0$allfittedline[25:48], col = "red")

This compares nicely with Figure 4.2 above.

<font color="blue">Our eyes definitely see a jump in the death rate from all causes at age 21 here. How big is this jump? Is it statistically significant?</font>
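
To read the size and significance of a jump off a fitted model without scanning the whole `summary()` printout, one option is to pull out the `over21` row of the coefficient table along with its confidence interval. The sketch below runs on simulated data (the data frame `sim` and its parameter values are invented), so the numbers it prints are not the MLDA estimates.

```r
# Simulated stand-in for the MLDA data (every number here is made up)
set.seed(21)
sim <- data.frame(age = (seq_len(48) - 24.5) / 12)
sim$over21 <- as.integer(sim$age >= 0)
sim$all <- 92 + 8 * sim$over21 - 1 * sim$age + rnorm(48, sd = 2)

fit <- lm(all ~ over21 + age, data = sim)

# Row of the coefficient table for the jump:
# estimate, standard error, t value, and p-value
jump <- coef(summary(fit))["over21", ]
jump

# 95% confidence interval for the jump
jump_ci <- confint(fit, "over21", level = 0.95)
jump_ci
```

The same two calls should work on `rd_reg1` or `rd_reg2`, since both include a term named `over21`.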

<hr>

<h2>Quadratic specification</h2>

Here is the "dueling quadratics" or "quadratic on each side" specification, which is quite an eyesore:

$$
\bar{M}_{a} = \alpha + \rho \ D_a + \gamma_1 (a - a_0) + \gamma_2 (a - a_0)^2
+ \delta_1 \left[ (a - a_0) D_a
\right]
+ \delta_2 \left[ (a - a_0)^2 D_a
\right]
+ e_a
$$

This beast appears at the bottom of p. 156 in *Mastering Metrics*.

Breaking it down, what we have above are basically two (complicated) pieces. The first is a quadratic in the age variable $a$, with parameters $\alpha$, $\gamma_1$, and $\gamma_2$:

$$
\alpha + \gamma_1 (a - a_0) + \gamma_2 (a - a_0)^2
$$

And the second is another quadratic, with parameters $\rho$, $\delta_1$, and $\delta_2$, which switches on only when $D_a = 1$, that is, at or after age $a_0 = 21$:

$$
\rho \ D_a +
\delta_1 \left[ (a - a_0) D_a
\right]
+ \delta_2 \left[ (a - a_0)^2 D_a
\right]
$$
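
Adding the two pieces together side by side makes the role of $\rho$ explicit. Ignoring the error term, the fitted curve is

$$
\bar{M}_{a} =
\begin{cases}
\alpha + \gamma_1 (a - a_0) + \gamma_2 (a - a_0)^2 & \text{if } a < a_0 \\
(\alpha + \rho) + (\gamma_1 + \delta_1) (a - a_0) + (\gamma_2 + \delta_2) (a - a_0)^2 & \text{if } a \geq a_0
\end{cases}
$$

At $a = a_0$ every term in $(a - a_0)$ vanishes, so the gap between the two curves at the cutoff is exactly $\rho$, just as in the linear model.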

Here is the code that runs that model. The familiar $D_a$ indicator is `over21`, the recentered age and its square are `age` and `age2`, and the interaction terms `over_age` and `over_age2`, which we generated earlier, switch on only at age 21 and over.

In [None]:
rd_reg_q1 <- lm(all ~ over21
                + age + age2
                + over_age + over_age2,
                data = AEJfigs)
summary(rd_reg_q1)
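
As a cross-check on the hand-built interaction columns, R's formula syntax can generate the same terms on the fly: `over21 * (age + I(age^2))` expands to the main effects plus both interactions. A minimal sketch on simulated data (the data frame `sim` and its coefficients are invented for illustration) suggests the two approaches give identical estimates:

```r
# Simulated stand-in for the MLDA data (every number here is made up)
set.seed(4)
sim <- data.frame(age = (seq_len(48) - 24.5) / 12)
sim$over21    <- as.integer(sim$age >= 0)
sim$age2      <- sim$age^2
sim$over_age  <- sim$over21 * sim$age
sim$over_age2 <- sim$over21 * sim$age2
sim$all <- 92 + 8 * sim$over21 - 1 * sim$age + 0.5 * sim$age2 -
  2 * sim$over_age + 1 * sim$over_age2 + rnorm(48, sd = 2)

# Hand-built columns, as in the cell above
fit_manual <- lm(all ~ over21 + age + age2 + over_age + over_age2, data = sim)

# Same model written with formula operators
fit_formula <- lm(all ~ over21 * (age + I(age^2)), data = sim)

# Coefficients line up term by term (only the labels differ)
cbind(manual = coef(fit_manual), formula = unname(coef(fit_formula)))
```

The hand-built version keeps the link to the equation visible, which is probably why it is used here; the formula version saves a few `mutate()` calls.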

Finally, here are the fitted values from the dueling quadratics, and then we'll plot them:

In [None]:
# Evaluate on the cleaned rows so the lengths always match
AEJfigs0$allfittedquad <- predict(rd_reg_q1, newdata = AEJfigs0)

In [None]:
plot(AEJfigs0$agecell, 
     AEJfigs0$all, ylim = c(80,115))
lines(AEJfigs0$agecell[1:24], 
      AEJfigs0$allfittedquad[1:24], col = "red")
lines(AEJfigs0$agecell[25:48], 
      AEJfigs0$allfittedquad[25:48], col = "red")

This compares favorably with Figure 4.4 on page 158, reproduced below:

<img src="images/MMfig44.png" width="500" />

<font color="blue">As before, our eyes definitely see a jump in the death rate from all causes at age 21 here. How big is this jump? Is it statistically significant? How does it compare to the estimate from the linear RD model above?</font>

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>