<img src="images/econ140R_logo.png" width="200" />

<h1>ECON 140R Class 04</h1>

Learning objectives:

1.  Get more experience with real data

<p>
    
2. Notice that ordinary least squares regression with `lm()` can test average differences across MULTIPLE subgroups    
<p>

3. For example, with an outcome variable $y_i$ and group indicator variables $D^d_i$, $D^c_i$, and $D^f_i$, this regression:
$$
y_i = \alpha + \beta^d \cdot D^d_i + \beta^c \cdot D^c_i + \beta^f \cdot D^f_i + \epsilon_i
$$
provides a very convenient way of testing the average differences in $y$:
<ul>
    <li> between the control group and group $d$: $\beta^d$
    <li> between the control group and group $c$: $\beta^c$
    <li> between the control group and group $f$: $\beta^f$
</ul>

<p>
   
4. In <i>Mastering Metrics</i> Table 1.4, Angrist and Pischke use multivariate OLS to model 5 measures of health care use and 4 health outcomes (9 separate $y$ variables) on three indicator variables $D$ that measure assignment to treatment arms in the RAND Health Insurance Experiment, all of which have more generous insurance than the control group ("catastrophic plan")
    
<p>

5. The results in Table 1.4 show that the <i>causal effect</i> of health insurance on health care use appears to be positive and statistically significant, but the causal effect of health insurance on health outcomes appears to be statistically insignificant

<p>
    
6. A "fine print" detail is that Angrist and Pischke are doing what's called <i>clustering standard errors at the family level</i>. This last point will definitely not be on any exams.
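
To see why the coefficients in objective 3 are differences in average outcomes, here is a short sketch. It assumes that each person is in exactly one of the four plans (so at most one indicator equals one, and the control group has all three equal to zero) and that $E[\epsilon_i \mid D^d_i, D^c_i, D^f_i] = 0$. Taking conditional expectations of the regression equation group by group gives:

$$
\begin{aligned}
E[y_i \mid \text{control}] &= \alpha \\
E[y_i \mid D^d_i = 1] &= \alpha + \beta^d \\
E[y_i \mid D^c_i = 1] &= \alpha + \beta^c \\
E[y_i \mid D^f_i = 1] &= \alpha + \beta^f
\end{aligned}
$$

Subtracting the first line from each of the others shows that, for example, $\beta^d = E[y_i \mid D^d_i = 1] - E[y_i \mid \text{control}]$, and likewise for $\beta^c$ and $\beta^f$.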

Data from the 1974-1982 RAND Health Insurance Experiment (HIE) were unearthed by Aviva Aron-Dine, Liran Einav, and Amy Finkelstein (J. Econ. Perspect., 2013). Josh Angrist and J&#246;rn-Steffen Pischke provide an extract online at [Mastering Metrics](https://www.masteringmetrics.com/resources/).

Let's examine the data behind Panel A in Table 1.4, which reveals average levels of health care utilization across 5 types of care (the rows) for the "control group," people with catastrophic health insurance only (the leftmost column). In subsequent columns, the authors show us the average difference in the utilization measure in that row between one of the three "treatment arms" they argue are useful to consider (deductible, coinsurance, free), and the control group.

Here's a clean PNG of Table 1.4:

<img src="MMtbl14.png" width="800" />

Let's load up <b>haven</b> and <b>tidyverse</b>

In [None]:
library(haven)
library(tidyverse)

I have prepared an extract of the RAND HIE data underneath Table 1.4 Panel A in <i>Mastering Metrics</i>. These data include health care utilization outcomes across the four groups that Angrist and Pischke argue are usefully distinguishable, ordered here from least generous to most generous:

* Catastrophic plan
* Deductible plan
* Coinsurance plan
* Free plan

We have the five utilization measures shown in Table 1.4A here: `ftf` is face-to-face visits; `out_inf` is outpatient expenses; `totadm` is total hospital admissions; `inpdol_inf` is inpatient expenses; and `tot_inf` is total expenses.

In [None]:
table1_4a <- read_dta("data/table1_4.dta")

In [None]:
head(table1_4a, n = 100)

Let's create new data frames for each of the four groups using `filter()`. The shortened group names are:

* `plan_catas` = Catastrophic plan 
* `plan_deduc` = Deductible plan   
* `plan_coins` = Coinsurance plan  
* `plan_free`  = Free plan   

Copy and paste this code below and run it:

`table1_4a_catas <- filter(table1_4a, plan_catas == 1)`

`table1_4a_deduc <- filter(table1_4a, plan_deduc == 1)`

`table1_4a_coins <- filter(table1_4a, plan_coins == 1)`

`table1_4a_free  <- filter(table1_4a, plan_free  == 1)`

In [None]:
table1_4a_catas <- filter(table1_4a, plan_catas == 1)

table1_4a_deduc <- filter(table1_4a, plan_deduc == 1)

table1_4a_coins <- filter(table1_4a, plan_coins == 1)

table1_4a_free  <- filter(table1_4a, plan_free  == 1)

What we now have are 4 separate data frames for the 4 groups assigned to different insurance plans.

In STAT 20, you might have used `t.test()` to run a comparison between two groups. Let's run `t.test()` on the face-to-face visits `ftf` in the deductible group versus the catastrophic group. This should get us something like the two numbers in the table at upper left.

`t.test(table1_4a_deduc$ftf, table1_4a_catas$ftf)`

In [None]:
t.test(table1_4a_deduc$ftf, table1_4a_catas$ftf)

Not exactly clear, is it? The $t$-statistic is 1.53, which in words means that this difference is about 1.5 times its standard error. That's not big enough for us to reject the null hypothesis that the true difference is zero at the conventional 5% level, which requires a $t$-statistic of about 2 or more in absolute value.

There's probably an option to `t.test()` that will show us this, but we can also just type it into __R__. Here is the difference between those last two numbers in the output:

In [None]:
2.976766 - 2.784103

This is indeed the point estimate (0.19) of the average difference that appears at the upper left of Table 1.4A.

And then this, the difference divided by the $t$-stat, has to be the estimated standard error:

In [None]:
(2.976766 - 2.784103)/1.5318
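
Instead of retyping numbers from the printout, we can pull the same pieces out of the object that `t.test()` returns. The sketch below uses simulated data, not the RAND extract; the vectors `visits_deduc` and `visits_catas` and their means and spreads are made up for illustration. It backs the standard error out of the $t$-statistic and checks it against the unequal-variance (Welch) formula that `t.test()` uses by default.

```r
# Simulated data only: these are NOT the RAND HIE numbers.
set.seed(140)
visits_deduc <- rnorm(400, mean = 3.0, sd = 5)  # hypothetical "deductible" group
visits_catas <- rnorm(600, mean = 2.8, sd = 5)  # hypothetical "catastrophic" group

# By default t.test() runs the Welch test, which does not assume equal variances
tt <- t.test(visits_deduc, visits_catas)

# Point estimate: difference between the two sample means stored in the result
diff_means <- unname(tt$estimate[1] - tt$estimate[2])

# Standard error backed out of the t-statistic: difference divided by t
se_from_t <- diff_means / unname(tt$statistic)

# Standard error computed directly from the Welch formula
se_welch <- sqrt(var(visits_deduc) / length(visits_deduc) +
                 var(visits_catas) / length(visits_catas))

print(c(diff = diff_means, se_from_t = se_from_t, se_welch = se_welch))
```

The last two numbers printed should agree, which confirms that dividing the difference by the $t$-statistic recovers the standard error that `t.test()` used. On the RAND extract, the corresponding number is the one computed in the cell above.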

Unfortunately this is not the standard error (.25) that appears under the .19 at the upper left of Table 1.4A. What's going on? Stay tuned. Let's load in a new library, which will let us run a special version of `lm()` that will help reveal what's going on.

In [None]:
library(estimatr)

First, let's run `lm()` and then `lm_robust()` with options set to the baseline. The syntax is the same for both, and we should recover the same results as long as we set the standard errors to the "classical" type.

`reg_toprow <- lm(ftf ~ plan_deduc + plan_coins + plan_free, data = table1_4a)`

`summary(reg_toprow)`

`reg_toprowrob <- lm_robust(ftf ~ plan_deduc + plan_coins + plan_free, 
                           data = table1_4a, se_type = "classical")`

`summary(reg_toprowrob)`

In [None]:
reg_toprow <- lm(ftf ~ plan_deduc + plan_coins + plan_free, data = table1_4a)

summary(reg_toprow)

reg_toprowrob <- lm_robust(ftf ~ plan_deduc + plan_coins + plan_free, 
                           data = table1_4a, se_type = "classical")

summary(reg_toprowrob)
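
Why do these coefficients line up with differences in group averages? Here is a minimal sketch on simulated data (not the RAND extract). The data frame `sim`, the outcome `y`, and the numbers used to generate it are made up for illustration; only the indicator names mirror the ones above. It compares the `lm()` coefficients with group means computed by hand.

```r
# Simulated data only: these are NOT the RAND HIE numbers.
set.seed(140)
n <- 400

# Hypothetical plan assignment; "catas" plays the role of the omitted control group
plan <- factor(sample(c("catas", "deduc", "coins", "free"), n, replace = TRUE),
               levels = c("catas", "deduc", "coins", "free"))

sim <- data.frame(
  plan_deduc = as.numeric(plan == "deduc"),
  plan_coins = as.numeric(plan == "coins"),
  plan_free  = as.numeric(plan == "free")
)

# Made-up outcome: a control mean plus a shift for each treatment arm, plus noise
sim$y <- 2.8 + 0.2 * sim$plan_deduc + 0.5 * sim$plan_coins +
  1.7 * sim$plan_free + rnorm(n, sd = 5)

fit <- lm(y ~ plan_deduc + plan_coins + plan_free, data = sim)

# Group means by hand: split y by plan, then average within each group
group_means <- sapply(split(sim$y, plan), mean)

# Intercept = control mean; each slope = that group's mean minus the control mean
by_hand <- c(group_means[["catas"]],
             group_means[c("deduc", "coins", "free")] - group_means[["catas"]])

print(cbind(lm = unname(coef(fit)), by_hand = unname(by_hand)))
```

The two columns should match: the intercept is the control group's average, and each coefficient is the gap between one treatment arm's average and the control group's average.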

Now let's explore what <i>clustering our standard errors at the family level</i> does to our estimates of the standard errors. Because there are families in these data, indexed by the `famid` variable, we might expect that the $\epsilon$'s that shock a person one way or another within a family might shock the rest of the family as well. Imagine a family car that breaks down, so nobody keeps their checkup appointments.

`reg_toprowcluster <- lm_robust(ftf ~ plan_deduc + plan_coins + plan_free, 
                                data = table1_4a, clusters = famid)`
                              
`summary(reg_toprowcluster)`

In [None]:
reg_toprowcluster <- lm_robust(ftf ~ plan_deduc + plan_coins + plan_free, 
                               data = table1_4a, clusters = famid)

summary(reg_toprowcluster)

Compare these results to the top row in Table 1.4A. What do you see?

<font color="red">
    This is exactly what we see in the top row of Table 1.4A, with the exception of the bracketed standard deviation [5.50] below the far left-hand side number, 2.78, which is the intercept term here. We could get the SD by running `sd()` on `ftf` in just the control group (`plan_catas == 1`). The rest of the numbers in the row are the estimates shown here and their standard errors.
    </font>

Compare these results to the results without clustering standard errors at the family level. Which ones are larger?

<span style="color: red;">These standard errors, obtained when clustering errors at the family level, are larger than what we saw when we didn't cluster. This is all you need to be able to do for ECON 140: answer a question like this about something that you're observing.
It's also OK to speculate a little. Clustering like this is similar to, but of course not exactly the same as, reducing the sample size: people in the same family do not provide fully independent observations. Here it raises the standard errors, just as a smaller sample would have.
    </span> 
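
The intuition about clustering can be checked on simulated data. The sketch below is not the RAND extract: the data frame `sim`, the variable `treat`, the family size, and the size of the shared family shock are all assumptions chosen for illustration. Each made-up family shares one shock and one treatment assignment, and we compare classical standard errors with standard errors clustered on `famid`, using `lm_robust()` from <b>estimatr</b> as above.

```r
library(estimatr)  # same package the notebook loads for lm_robust()

# Simulated data only: these are NOT the RAND HIE numbers.
set.seed(140)
n_fam    <- 300  # hypothetical number of families
fam_size <- 4    # hypothetical family size (same for every family)

famid        <- rep(seq_len(n_fam), each = fam_size)
treated_fam  <- rbinom(n_fam, 1, 0.5)  # treatment is assigned to whole families
family_shock <- rnorm(n_fam, sd = 2)   # shock shared by everyone in a family

sim <- data.frame(
  famid = famid,
  treat = treated_fam[famid],
  y     = 3 + 0.5 * treated_fam[famid] + family_shock[famid] +
          rnorm(n_fam * fam_size, sd = 1)  # individual-level noise
)

# Same regression twice: only the standard error calculation differs
fit_classical <- lm_robust(y ~ treat, data = sim, se_type = "classical")
fit_cluster   <- lm_robust(y ~ treat, data = sim, clusters = famid)

se_classical <- fit_classical$std.error[["treat"]]
se_cluster   <- fit_cluster$std.error[["treat"]]

print(c(classical = se_classical, clustered = se_cluster))
```

The point estimate on `treat` is the same in both fits. Because family members share a shock in this simulation, the clustered standard error should come out larger than the classical one, much as if we had fewer independent observations.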

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>