<img src="images/econ140R_logo.png" width="200" />

<h1>ECON 140R Class 02</h1>

Learning objectives:

1. Examine the real data underneath <i>Mastering Metrics</i> Table 1.1
<p>

2. Use `lm()` to run ordinary least squares, regressing $y$ on an $x$ where $x$ is a binary 0/1 indicator of group status
<ul>
  <li> Members of the group get a 1
  <li> Others (the "default category") get a 0
</ul>

<p>
    
3. Under these conditions, `lm()` shows you
<ul>
    <li>$y = \alpha + \beta x$
    <li>The average $y$ for the default category = constant term $\alpha$
    <li>The difference in average $y$ between groups = slope term $\beta$
    <li>Standard errors for each of these, in addition to point estimates
</ul>
    
<h3>For example</h3>

Suppose $y$ is a numeric self-reported health index with these values:

* $y = 1$ is poor health
* $y = 2$ is fair health
* $y = 3$ is good health
* $y = 4$ is very good health
* $y = 5$ is excellent health

(Note that in most survey data, the values associated with these answers are switched so that they run from 5 to 1 rather than 1 to 5.)

Suppose also that $x$ is a binary measure of whether a person has any health insurance:

* $x = 0$ means the person reports NO health insurance
* $x = 1$ means the person reports any health insurance

Running `lm()` with the formula `y ~ x` estimates this equation:

$$
y = \alpha + \beta x + \epsilon
$$

Here the constant term $\alpha$ is the average level of health ($y$) among people who have NO health insurance ($x = 0$). The slope term $\beta$ is the difference in average levels of health ($y$) between people with any health insurance ($x = 1$) and people with no health insurance ($x = 0$). The standard error of the estimated $\beta$ helps us judge whether that difference is statistically significant.
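
To see this on a tiny scale, here is a minimal sketch using made-up numbers. The data frame `toy` and its values are invented for illustration and do not come from the NHIS.

```r
# Toy data, invented for illustration: 3 people without insurance, 3 with
toy <- data.frame(
  x = c(0, 0, 0, 1, 1, 1),   # 1 = any health insurance, 0 = none
  y = c(2, 3, 4, 3, 4, 5)    # health index, 1 = poor ... 5 = excellent
)

# Group averages computed directly
mean_x0 <- mean(toy$y[toy$x == 0])   # average y when x = 0
mean_x1 <- mean(toy$y[toy$x == 1])   # average y when x = 1

# The same information from a regression of y on x
fit <- lm(y ~ x, data = toy)
coef(fit)

# The intercept matches mean_x0; the slope matches mean_x1 - mean_x0
c(mean_x0 = mean_x0, difference = mean_x1 - mean_x0)
```

The intercept reported by `coef(fit)` should line up with `mean_x0`, and the coefficient on `x` with the difference between the two group averages.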

<hr>

Here is Table 1.1 in <i>Mastering Metrics</i>:
<img src="images/MMtbl11.png" width="800" />

Let's load <b>haven</b>, which reads Stata <code>.dta</code> files, and <b>tidyverse</b>.

In [None]:
library(haven)
library(tidyverse)

Angrist and Pischke provide us these two data files, drawn from the 2009 wave of the National Health Interview Survey (NHIS).

In [None]:
husbands <- read_dta("data/table_1_1_husbands.dta")
wives <- read_dta("data/table_1_1_wives.dta")

In [None]:
head(husbands)

In [None]:
head(wives)

There is a lot here. Let's focus on a few small parts of these datasets.

<h2>Look at differences in self-reported health</h2>

There are lots of ways of coding this. Here is an elegant one that ChatGPT suggested to me:

In [None]:
# Use tidyverse "pipes." Thanks, ChatGPT
# 
# Use filter() to look at the average health among husbands with any health insurance
avg_hlth_h_hi <- husbands %>%  
    filter(hi == 1) %>% 
    summarize(average = mean(hlth, na.rm = TRUE))
avg_hlth_h_hi

In [None]:
# Use filter() to look at the average health among husbands with NO health insurance
avg_hlth_h_nohi <- husbands %>%  
    filter(hi == 0) %>% 
    summarize(average = mean(hlth, na.rm = TRUE))
avg_hlth_h_nohi

In [None]:
# Difference between these averages
# = health benefits associated with health insurance
avg_hlth_h_hi - avg_hlth_h_nohi

This is the difference in average health between the two groups we're considering: husbands with health insurance and husbands without.
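
As one more of the many ways to code this, here is a minimal base-R sketch that computes both group averages in a single step with `tapply()`. The data frame `toy_h` is invented for illustration and is not real NHIS data. The same pattern would presumably apply to the `hlth` and `hi` columns of `husbands`.

```r
# Invented stand-in for the husbands data (not real NHIS values)
toy_h <- data.frame(
  hi   = c(1, 1, 1, 0, 0, NA),
  hlth = c(5, 4, NA, 3, 4, 2)
)

# Average hlth within each value of hi.
# Rows with a missing hi do not belong to either group and are left out;
# na.rm = TRUE drops missing hlth values within each group.
avg_by_hi <- tapply(toy_h$hlth, toy_h$hi, mean, na.rm = TRUE)
avg_by_hi

# Difference in averages: insured minus uninsured
diff_hi <- unname(avg_by_hi["1"] - avg_by_hi["0"])
diff_hi
```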

<hr>

<h3>Some unfortunate details</h3>

Unfortunately, this difference, roughly $0.28$, is not what appears in row 1, column (3) of Table 1.1. The number there is $0.31$, which turns out to be the difference in <i>weighted</i> averages.

<h3>Survey weights and sampling</h3>

The NHIS is a survey of people in households that is designed to be representative of the civilian noninstitutionalized population. It typically includes about 85,000 people, or in the ballpark of 0.02 percent of the population.

In order to provide good measurements of key subgroups, modern surveys typically include what are called <b>oversamples</b> of interesting subpopulations. For this and other reasons, modern surveys include <b>survey weights</b> or <b>sample weights</b> that translate the measured sample into something that is nationally representative. It gets more complicated than this, but the basic idea is that you might want to sample twice as many people from an interesting minority group, and then those observations get half the weight of the other observations. 
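
A minimal sketch of that idea, with invented numbers: suppose group B makes up 20 percent of the population but is sampled at twice its share, so each B observation gets half the weight. The data frame `samp`, the group labels, and all values here are assumptions made up for illustration.

```r
# Invented example: a proportional sample of 10 would have 8 from A and 2 from B.
# Here B is oversampled, so the sample has 8 from A and 4 from B.
samp <- data.frame(
  group = c(rep("A", 8), rep("B", 4)),
  y     = c(rep(4, 8),   rep(2, 4)),    # everyone in A reports 4, everyone in B reports 2
  w     = c(rep(1, 8),   rep(0.5, 4))   # oversampled group gets half the weight
)

# True population average by construction: 80% report 4, 20% report 2
pop_mean <- 0.8 * 4 + 0.2 * 2

mean(samp$y)                    # unweighted: pulled toward the oversampled group
weighted.mean(samp$y, samp$w)   # weighted: matches pop_mean
```

The unweighted average understates the population average because group B is overrepresented in the sample; applying the weights undoes that.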

<h3>What is the bottom line for ECON 140?</h3>

Angrist and Pischke use sample weights throughout their textbook, so when we look at their examples, chances are we will have to use survey weights to reproduce their results exactly.

We will include ``R`` code that does this.

But you will not have to learn or use sample weights to answer exam questions or complete the term project.

<h3>Redoing the averages with weights</h3>

It turns out that the variable `perweight` in the dataset is exactly what we need to adjust our estimates, and the ``R`` function `weighted.mean()` applies sample weights for us. Its second argument is the vector of weights, here `perweight`:

In [None]:
# Use filter() to look at the WEIGHTED average health among husbands with any health insurance
avg_hlth_h_hi_w <- husbands %>%  
    filter(hi == 1) %>% 
    summarize(average = weighted.mean(hlth, perweight, na.rm = TRUE))
avg_hlth_h_hi_w

In [None]:
# Use filter() to look at the WEIGHTED average health among husbands with NO health insurance
avg_hlth_h_nohi_w <- husbands %>%  
    filter(hi == 0) %>% 
    summarize(average = weighted.mean(hlth, perweight, na.rm = TRUE))
avg_hlth_h_nohi_w

In [None]:
# Difference between these WEIGHTED averages
# = health benefits associated with health insurance
avg_hlth_h_hi_w - avg_hlth_h_nohi_w

With the weights applied, we see exactly the same statistics shown in the top row of Table 1.1, in columns (1), (2), and (3).
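
For reference, `weighted.mean(x, w)` computes $\sum_i w_i x_i / \sum_i w_i$. The small sketch below, with invented numbers, checks this by hand and shows one caveat: `na.rm = TRUE` drops observations whose `x` is missing, but a missing weight still produces `NA`.

```r
x <- c(5, 4, NA, 3)   # invented health values, one missing
w <- c(2, 1, 1, 1)    # invented weights

# na.rm = TRUE drops the observation with missing x, along with its weight
wm <- weighted.mean(x, w, na.rm = TRUE)

# Same calculation by hand, using only the rows where x is observed
ok <- !is.na(x)
by_hand <- sum(w[ok] * x[ok]) / sum(w[ok])
c(weighted_mean = wm, by_hand = by_hand)

# A missing weight is not removed by na.rm, so the result is NA
w_bad <- c(2, NA, 1, 1)
weighted.mean(x, w_bad, na.rm = TRUE)
```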

<hr>

<h2>Regression with indicator variables gives you conditional averages</h2>

Hold on to your hats.

It turns out that a linear regression of $y$ on a dichotomous $x$ (equal to 0 or 1) tells us two things. The constant term $\alpha$ is the average $y$ for the group with $x = 0$, and the slope term $\beta$ is the difference between the average $y$ for the group with $x = 1$ and the average $y$ for the group with $x = 0$.

In other words, when we run
$$y = \alpha + \beta x + \epsilon$$

and $x$ is a dichotomous "indicator variable" (sometimes called a "dummy variable"), $\alpha$ is the average $y$ for the $x = 0$ group, and $\alpha + \beta$ is the average $y$ for the $x = 1$ group.
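
One way to see why, sketched under the assumption that the error term $\epsilon$ averages to zero within each group: take the average of the regression equation separately for $x = 0$ and $x = 1$.

$$
\begin{aligned}
E[y \mid x = 0] &= \alpha + \beta \cdot 0 = \alpha \\
E[y \mid x = 1] &= \alpha + \beta \cdot 1 = \alpha + \beta \\
E[y \mid x = 1] - E[y \mid x = 0] &= \beta
\end{aligned}
$$

With a constant term and a single 0/1 regressor, the ordinary least squares residuals average to zero within each group, so the same relationships hold for the estimates and the sample averages.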

Observe. Recall that `husbands` is the data frame that includes folks with and without health insurance. Let's run this regression:
$$hlth = \alpha + \beta \cdot hi + \epsilon$$

where `hlth` is shorthand for self-reported health (where 1 = poor and 5 = excellent), and `hi` is the indicator variable for having any health insurance.

In [None]:
health_h <- lm(hlth ~ hi, data = husbands)
summary(health_h)

Again, the unfortunate detail here is that we haven't used the sample weights and thus haven't fully replicated Angrist and Pischke. Here is code that replicates Table 1.1, using the "weights" option inside `lm()`:

In [None]:
health_h_w <- lm(hlth ~ hi, data = husbands,
                            weights = perweight)
summary(health_h_w)

We have now replicated columns (2) and (3) in the first row of Table 1.1:

* The estimate of the intercept term is $\hat{\alpha} = 3.70$ when rounding to the nearest hundredth, and that is the average level of health among the husbands with `hi` equal to $0$.

* The estimate of the slope term is $\hat{\beta} = 0.31$, and that is the difference in average levels of health between the husbands with `hi` equal to $1$ and the husbands with `hi` equal to $0$.

* The key innovation here is that we also have a standard error of $\hat{\beta}$, which we can write as $SE[\hat{\beta}] = 0.03$. 

The key question is whether $\hat{\beta} = 0.31$ is greater than twice its standard error. If it is, then the 95% confidence interval around $\hat{\beta}$ does NOT include the number zero, and thus we reject the <b>null hypothesis</b> that $\beta = 0$ and conclude that $\hat{\beta}$ is statistically significant. Here $0.31$ is roughly ten times $0.03$, so it is.
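
The rule of thumb can be worked through as simple arithmetic. This sketch uses the rounded values quoted above, so the results are approximate; the names `beta_hat`, `se_beta`, `ratio`, and `ci` are just illustrative.

```r
# Rounded values reported in the text above
beta_hat <- 0.31
se_beta  <- 0.03

# Is the estimate more than twice its standard error?
ratio <- beta_hat / se_beta
ratio > 2

# Approximate 95% confidence interval: estimate plus or minus 2 standard errors
ci <- c(lower = beta_hat - 2 * se_beta, upper = beta_hat + 2 * se_beta)
ci
```

The interval runs from about 0.25 to about 0.37 and excludes zero. With the fitted model object, `confint(health_h_w)` would report an interval based on the unrounded estimates and the exact critical value rather than the factor of 2.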

<h3>Null hypotheses</h3>

Remember that a <b>null hypothesis</b> is easily conceptualized as <i>an effect that is null or zero</i>.

<hr>

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>