<img src="images/econ140R_logo.png" width="200" />

<h1>ECON 140R Class 06a</h1>

Let's spend time to cement a few key takeaways from Chapter 1 of <i>Mastering Metrics</i>. In the book, Angrist and Pischke show us a simple example with 2 individuals. Here, let's examine a simple example with 20 individuals, 10 each in the control and treatment groups.

<h2>Learning Objectives</h2>

* Run an ordinary least squares (OLS) regression $y_i = \alpha + \beta \ D_i + \epsilon_i$ using `lm()`
* See that when $y$ is an outcome and the indicator variable $D_i = 1$ marks assignment to the treatment group in an RCT, then OLS reveals:
    * $\alpha$ = average $y$ for the control group
    * $\beta$ = average difference in outcomes between treatment and control
* See a brief example of a "recode" in __R__ using `ifelse()`
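
Why the second objective holds can be sketched in a few lines. This assumes that randomization makes the error term unrelated to treatment, so that $E[\epsilon_i \mid D_i] = 0$ (an assumption stated here for the derivation, not something the regression output itself shows):

$$
\begin{aligned}
E[y_i \mid D_i = 0] &= \alpha \\
E[y_i \mid D_i = 1] &= \alpha + \beta \\
E[y_i \mid D_i = 1] - E[y_i \mid D_i = 0] &= \beta
\end{aligned}
$$

So $\alpha$ is the control group average, and $\beta$ is the treatment average minus the control average.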

In [None]:
library(tidyverse)

I've learned some __R__ to create a fictional dataset containing study participants in a randomized controlled trial (RCT). Here is code that ultimately creates a data frame for the 10-person control group that shows their first names; an (old-school) binary gender identity$^{\dagger}$; RCT group membership; and a <u>bad health outcome</u>, first coded numerically and then again as a string.

Zeros and ones are common codings for bad health outcomes in medicine and in health economics. You could think of $y = 1$ meaning that the participant catches COVID-19, for example. Another, more extreme example is that the bad health outcome could be death. Here, I've coded "poor health" as `outcome == 1` (equivalently, `outcomestr == "poor"`), with "good health" being the other state. (This is a common way of collapsing what is usually a 5-point scale for self-reported health: "excellent," "very good," "good," "fair," and "poor," with the first three categories usually mapped to "good" and the latter two categories mapped to "poor.")
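
As a rough sketch of that collapsing step, here is one way to recode a 5-point self-reported health scale into the binary coding used in this notebook. The vector `health5` and the `_example` names are made up for illustration and are not part of the fictional RCT data.

```r
# Hypothetical 5-point self-reported health responses (illustration only)
health5 <- c("excellent", "fair", "very good", "good", "poor")

# Map "fair" and "poor" to the bad outcome (1); everything else to 0
outcome_example <- ifelse(health5 %in% c("fair", "poor"), 1, 0)

# Build the matching string label from the numeric coding
outcomestr_example <- ifelse(outcome_example == 1, "poor", "good")

data.frame(health5, outcome_example, outcomestr_example)
```

The numeric and string columns carry the same information, which is why either one can identify the "poor health" state.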

Because of this coding, note that we are looking for a treatment that has a <b>negative</b> or protective effect: $\beta < 0$. A positive effect in this context would mean that the treatment is actually worsening health.

In [None]:
names   <- c("Alison", "Bradley", "Catherine", "David", "Esme", 
             "Frank", "Georgina", "Henry", "Inez", "James")

gender  <- c("female", "male", "female", "male", "female",
             "male", "female", "male", "female", "male")

group   <- c("control", "control", "control", "control", "control",
             "control", "control", "control", "control", "control")

outcome <- c(0,1,0,0,1,
             1,1,0,1,0)

outcomestr <- c("good", "poor", "good", "good", "poor",
                "poor", "poor", "good", "poor", "good")

# data.frame() constructs the data frame and labels the columns with the variable names
# Parentheses around the command also ask R to show it to us

(control_df <- data.frame(names, gender, group, outcome, outcomestr))


Can you eyeball the average of `outcome` here for the control group? There are 10 people, and 5 of them have `outcome == 1`, so ...

In [None]:
# List the proportion in the control group who have outcome == 1 below:
outcome_avg_control = 

# If you can't eyeball it, perhaps uncomment and try this code:
#outcome_avg_control = mean(control_df$outcome)

outcome_avg_control

Run the code below to generate the treatment group:

In [None]:
names   <- c("Kate", "Larry", "Mallory", "Niles", "Olivia", 
             "Peter", "Quincy", "Rutger", "Stephanie", "Troy")

gender  <- c("female", "male", "female", "male", "female",
             "male", "female", "male", "female", "male")

group   <- c("treatment", "treatment", "treatment", "treatment", "treatment",
             "treatment", "treatment", "treatment", "treatment", "treatment")

outcome <- c(0,0,0,1,0,
             1,1,0,0,0)

outcomestr <- c("good", "good", "good", "poor", "good",
                "poor", "poor", "good", "good", "good")

(treatment_df <- data.frame(names, gender, group, outcome, outcomestr))


Can you eyeball the average of `outcome` here for the treatment group? There are 10 people, and 3 of them have `outcome == 1`, so ...



In [None]:
# List the proportion in the treatment group who have outcome == 1 below:
outcome_avg_treatment = 

# If you can't eyeball it, perhaps uncomment and try this code:
#outcome_avg_treatment = mean(treatment_df$outcome)

outcome_avg_treatment

The randomization and the placebo might be rocket science, but nothing else here is. All we are really looking for is the average difference in outcomes between treatment and control, which you can eyeball in this simple example. Remember that if the treatment is protective against bad health, we expect to find a <i>negative</i> treatment effect here:

In [None]:
treatment_effect = outcome_avg_treatment - outcome_avg_control
treatment_effect

Now we have two separate data frames for treatment and control. In order to run OLS using `lm()`, with a new indicator variable `treatment` for $D_i$, we need to append or add the datasets to one another. In your mind's eye, what we want to do is create a new matrix from these two existing matrices by stacking them vertically. Here's a way to do that with data frames in __R__:

In [None]:
fake_rct_df <- rbind(control_df, treatment_df)
fake_rct_df

Now let's create that indicator variable `treatment` that will serve as the right-hand side variable $D_i$ in the regression equation shown at the top of this notebook. Here is one way to do that by using `mutate()` to add a column for the variable `treatment`, which we create with a call to `ifelse()`. Here, `ifelse()` is told to return a 1 if `group == "treatment"` and a 0 otherwise.
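
To see what `ifelse()` does on its own, here is a small base-__R__ sketch on a made-up vector (the names `group_example`, `d_ifelse`, and `d_logical` are hypothetical). It works element by element, and for a two-group variable like this one, converting the logical comparison directly gives the same zeros and ones.

```r
# Hypothetical group labels (illustration only)
group_example <- c("control", "treatment", "treatment", "control")

# ifelse() checks the condition for each element in turn
d_ifelse <- ifelse(group_example == "treatment", 1, 0)

# Equivalent shortcut: TRUE/FALSE converted to 1/0
d_logical <- as.numeric(group_example == "treatment")

d_ifelse
all(d_ifelse == d_logical)
```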

In [None]:
fake_rct_df <- mutate(fake_rct_df, treatment = ifelse(group == "treatment", 1, 0))
fake_rct_df

Now let's run the OLS regression from above. I'll write it in its generic form first, then with variable names, and then the code cell will show its equivalent in __R__ using `lm()`:

$$
\begin{aligned}
y_i &= \alpha + \beta \ D_i + \epsilon_i \\
\text{outcome}_i &= \alpha + \beta \ \text{treatment}_i + \epsilon_i
\end{aligned}
$$

In [None]:
fake_rct_reg <- lm(outcome ~ treatment, data = fake_rct_df)
summary(fake_rct_reg)

Examine these results and compare to what you have seen earlier. Below is another way of extracting this information from the data, without using OLS:

In [None]:
mean(treatment_df$outcome)

mean(control_df$outcome)

mean(treatment_df$outcome) - mean(control_df$outcome)


<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>

<hr>

<i>ABANDON ALL HOPE, ye who enter here.</i>

<h2>Logit models</h2>

When the $y$ variable is binary, researchers often use `logit` or `probit` models to estimate parameters rather than ordinary least squares (OLS), which is what `lm()` runs. Often, OLS returns very similar results, so the difference between them can be [pedantic](https://www.merriam-webster.com/dictionary/pedantic) sometimes. But if somebody tells you to use a `logit` instead of OLS, go for it!

The code below estimates the simple model using a `logit` instead of OLS:

In [None]:
fake_rct_logit <- glm(outcome ~ treatment, 
                      data = fake_rct_df, 
                      family = binomial(link = "logit")
                      )
summary(fake_rct_logit)

Confused by the output? That's because the logit coefficient here is the change in the log odds of success, not the percentage point change in the probability of success. Only for small coefficients is it approximately the percentage change in the odds. The exponentiated logit coefficient is the odds ratio.

The <b>odds</b> of an event (O) and the probability of an event (P) are related in the following way:

$$
O = \frac{P}{1-P}
$$
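
Plugging in the proportions from the two groups above (5 of 10 in control and 3 of 10 in treatment have `outcome == 1`) gives a worked example of the odds and the odds ratio. The subscripts $C$ and $T$ are labels introduced here for control and treatment:

$$
\begin{aligned}
O_C &= \frac{0.5}{1 - 0.5} = 1 \\
O_T &= \frac{0.3}{1 - 0.3} \approx 0.43 \\
\frac{O_T}{O_C} &\approx 0.43, \qquad \ln\left(\frac{O_T}{O_C}\right) \approx -0.85
\end{aligned}
$$

Because this model has only one binary regressor, the logit coefficient on `treatment` should match that log odds ratio, and its exponential should match the odds ratio.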

In [None]:
fake_rct_logit_coef = coef(fake_rct_logit)
fake_rct_logit_coef[2]
exp(fake_rct_logit_coef[2])

Often the probability of an event is a little easier to discuss than the odds of an event. That might be one reason to use OLS, simply because it makes the exposition clearer. But it's also true that one can convert the `logit` results back into a more familiar marginal effect on the probability of success. To do so, we need another ``R`` package:

In [None]:
install.packages("mfx")
library(mfx)

In [None]:
(fake_rct_logitmfx <- logitmfx(outcome ~ treatment, data = fake_rct_df))

Lo and behold, we find the same marginal effect here that we found earlier using OLS, only after many contortions.
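
One way to see where that marginal effect comes from, without the `mfx` package, is to turn the logit coefficients back into predicted probabilities with base __R__'s `plogis()` (the inverse of the logit link) and take the difference. This is only a sketch: it rebuilds the outcome data from the cells above, and the names ending in `_example` are introduced here for illustration.

```r
# Outcomes copied from the control and treatment groups above
outcome_example   <- c(0,1,0,0,1, 1,1,0,1,0,   # control
                       0,0,0,1,0, 1,1,0,0,0)   # treatment
treatment_example <- rep(c(0, 1), each = 10)

logit_example <- glm(outcome_example ~ treatment_example,
                     family = binomial(link = "logit"))
b <- unname(coef(logit_example))

# Predicted probability of the bad outcome in each group
p_control   <- plogis(b[1])
p_treatment <- plogis(b[1] + b[2])

# Difference in predicted probabilities, in percentage-point terms
p_treatment - p_control

# OLS slope on the same data, for comparison
ols_slope <- unname(coef(lm(outcome_example ~ treatment_example))[2])
ols_slope
```

With a single binary regressor, the two numbers should agree up to numerical precision, which is the sense in which the logit and OLS tell the same story here.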

The takeaways: 
* Logit and probit are reasonable alternatives to OLS. So are generalized linear models (GLM), which might be where logit sits in a package, like it does in ``R``
* For ECON 140, all this is strictly "bonus" material

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived somewhat less happily ever after. The End.</span></div>

<hr>

${\dagger}$ To learn more about 21st-century methods of measuring gender identity and related concepts, see the National Academies of Sciences, Engineering, and Medicine. 2022. <i>Measuring Sex, Gender Identity, and Sexual Orientation.</i> Washington, DC: The National Academies Press. [https://doi.org/10.17226/26424](https://doi.org/10.17226/26424).