<img src="images/econ157.png" width="200" />

<h1>ECON 157 Class 05</h1>

Data from the 1974-1982 RAND Health Insurance Experiment (HIE) were unearthed by Aviva Aron-Dine, Liran Einav, and Amy Finkelstein (J. Econ. Perspect., 2013). Josh Angrist and Jörn-Steffen Pischke provide an extract online at [Mastering Metrics](https://www.masteringmetrics.com/resources/).

Earlier we looked at Panel A of Table 1.4; here, let's examine the data behind <b>Panel B in Table 1.4</b>, which reports average health outcomes (the rows) across insurance plan groups (the columns). The "control group" consists of people with catastrophic health insurance only (the leftmost column). In subsequent columns, the authors show the average difference in that row's health measure between each of the three "treatment arms" they argue are useful to consider (deductible, coinsurance, free) and the control group.

Learning objectives
* Get more experience with real data
* Notice that OLS regression with `lm()` and useful $x$ variables can test average differences across subgroups
* Health status varies with individual characteristics; these differences are known as health inequalities

With an outcome variable $y_i$ and treatment group indicator variables $D^d_i$, $D^c_i$, and $D^f_i$, this regression:

$$
y_i = \alpha + \beta^d \cdot D^d_i + \beta^c \cdot D^c_i + \beta^f \cdot D^f_i + \epsilon_i
$$

provides a very convenient way of testing the average differences:
* between the control group and the "deductible" group $d$: $\beta^d$
* between the control group and the "coinsurance" group $c$: $\beta^c$
* between the control group and the "free care" group $f$: $\beta^f$

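This works because we have omitted the indicator variable for the control group, those in the "catastrophic" plan. The intercept $\alpha$ is then the control-group mean, and each $\beta$ is the corresponding group's mean minus that control mean.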
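To see the omitted-category logic in action, here is a minimal sketch on simulated data. The sample size, effect sizes, and the data frame `sim` are made up for illustration and are not taken from the HIE extract; only the `plan_*` indicator names mirror the variables used later. It checks that the `lm()` coefficients match the raw differences in group means.

```r
# Simulated data only: group sizes and effects below are invented for illustration.
set.seed(157)
n <- 400
plan <- sample(c("catastrophic", "deductible", "coinsurance", "free"),
               n, replace = TRUE)

sim <- data.frame(
  plan       = plan,
  plan_deduc = as.numeric(plan == "deductible"),
  plan_coins = as.numeric(plan == "coinsurance"),
  plan_free  = as.numeric(plan == "free")
)
sim$y <- 70 + 1.0 * sim$plan_deduc - 0.5 * sim$plan_coins +
  2.0 * sim$plan_free + rnorm(n, sd = 5)

# OLS with the catastrophic (control) indicator omitted
fit <- lm(y ~ plan_deduc + plan_coins + plan_free, data = sim)

# Raw group means, as a plain named numeric vector
group_means <- sapply(split(sim$y, sim$plan), mean)

# Side-by-side: each coefficient vs. (group mean - control mean)
comparison <- data.frame(
  arm             = c("deductible", "coinsurance", "free"),
  ols_coefficient = unname(coef(fit)[c("plan_deduc", "plan_coins", "plan_free")]),
  mean_difference = unname(group_means[c("deductible", "coinsurance", "free")] -
                             group_means[["catastrophic"]])
)
print(comparison)
```

The two numeric columns should agree up to floating-point error, and the intercept should equal the catastrophic group's mean. Plain `lm()` is used here so the sketch needs no extra packages; it does not reproduce the clustered standard errors used below.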

Here's a clean PNG of Table 1.4:

<img src="MMtbl14.png" width="800" />

In [None]:
library(haven)
library(tidyverse)
library(estimatr)

Here is an extract containing the information underneath Table 1.4B.

In [None]:
table1_4b <- read_dta("table1_4b.dta")

In [None]:
head(table1_4b)

Variables that measure "ending" characteristics are labeled with a "-x" suffix. Here we have:
* `ghindxx` = the general health index, where more is better health. Probably self-reported
* `cholstx` = total cholesterol in mg/dL, where more is bad
* `systolx` = systolic blood pressure (when your heart beats) in mm Hg
* `mhix` = a mental health index, where more is better mental health. Probably based on self-reports

Sometimes __R__ prints numbers in awkward formats. The `scipen` option controls how readily R switches between fixed and scientific notation (0 is R's default); setting it helps for the next command:

In [None]:
options(scipen=0)
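For a feel of what `scipen` does, here is a small sketch comparing the default setting with a large penalty on scientific notation. The value `1e-10` and the setting `999` are arbitrary choices for illustration, not something the notebook itself uses.

```r
# Remember the current setting so we can restore it afterwards
old <- options(scipen = 0)        # 0 is R's default

small <- 1e-10                    # arbitrary tiny number for illustration
default_fmt <- format(small)      # default: scientific notation is shorter, so R uses it

options(scipen = 999)             # heavily penalize scientific notation
fixed_fmt <- format(small)        # now R prefers fixed notation

options(old)                      # restore the original setting

print(c(default = default_fmt, fixed = fixed_fmt))
```

Larger `scipen` values push R toward fixed notation, and negative values push it toward scientific notation.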

The call below estimates our simple OLS model with the $y$-variable set to be the general health index. Results should parallel the first row in Table 1.4B above.

In [None]:
# Drop rows with a missing family ID, then cluster standard errors by family
reg_t1_4b_ghindxx <- lm_robust(ghindxx ~ plan_deduc + plan_coins + plan_free,
                               data = subset(table1_4b, !is.na(famid)), clusters = famid)
summary(reg_t1_4b_ghindxx)

<hr>

We can also look at the pre-study characteristics of the people in the sample using the data underneath Table 1.3, which is shown below.

<img src="MMtbl13.png" width="800" />

The data are in this extract. (In principle, we could merge these various files using the `person` variable, which is the individual identifier.)

In [None]:
table1_3 <- read_dta("table1_3.dta")

In [None]:
head(table1_3)
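The merge on `person` mentioned above could be sketched as follows. The two small data frames are toy stand-ins with invented values, not the actual extracts; only the column names `person`, `ghindx`, and `ghindxx` come from this notebook.

```r
# Toy stand-ins for the baseline and ending extracts (values are invented)
baseline <- data.frame(person = c(1, 2, 3, 4),
                       ghindx = c(70.1, 65.3, 80.2, 72.5))
ending   <- data.frame(person  = c(2, 3, 4, 5),
                       ghindxx = c(66.0, 79.5, 71.0, 68.4))

# Inner join: keep only people who appear in both files
merged <- merge(baseline, ending, by = "person")
print(merged)
```

Base R's `merge()` keeps only matched rows by default; passing `all = TRUE` would instead keep unmatched people from either file, with `NA` in the missing columns.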

Here is a replication of the top row in Table 1.3B, an examination of how the general health index measured at baseline, before the study, may have varied across insurance groups assigned in the study.

In [None]:
# Drop rows with a missing family ID, then cluster standard errors by family
reg_t1_3b_ghindx <- lm_robust(ghindx ~ plan_deduc + plan_coins + plan_free,
                              data = subset(table1_3, !is.na(famid)), clusters = famid)
summary(reg_t1_3b_ghindx)

We see scant evidence of any statistically significant differences in baseline health, which is what we would expect if random assignment balanced the groups.

Now let us run a different inquiry. <i>How does baseline health vary with individual characteristics</i>, like age, gender, race/ethnicity, education, and income? To answer that question, we can model the baseline general health index as a function of those right-hand side variables:

$$
ghindx_i = \alpha + \beta_a age_i + \beta_f female_i + \beta_b blackhisp_i + \beta_e educ_i + \beta_i income1cpi_i + \nu_i
$$

In [None]:
# Drop rows with a missing family ID, then cluster standard errors by family
reg_ghindx <- lm_robust(ghindx ~ age + female + blackhisp + educper + income1cpi,
                        data = subset(table1_3, !is.na(famid)), clusters = famid)
summary(reg_ghindx)

We see telltale signs of health inequalities here at baseline, in the full sample of individuals. Older people report worse health, as do females, people who are Black or Hispanic, and people with less education or income. 

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>