In [2]:
library(tidyverse)
library(nhanesA)

# STATS 504
## Week 9: Causal inference

### Smoking and lung cancer: observational data

In [12]:
demo <- nhanes("DEMO_J")
mcq <- nhanes("MCQ_J")
smq <- nhanes("SMQ_J")

In [14]:
table(mcq$MCQ230A)


                    Bladder                       Blood 
                         12                           1 
                       Bone                       Brain 
                          1                           4 
                     Breast           Cervix (cervical) 
                         89                          26 
                      Colon      Esophagus (esophageal) 
                         46                           6 
                Gallbladder                      Kidney 
                          1                          19 
           Larynx/ windpipe                    Leukemia 
                          1                           6 
                      Liver                        Lung 
                          3                          18 
Lymphoma/ Hodgkin's disease                    Melanoma 
                         15                          35 
           Mouth/tongue/lip             Ovary (ovarian) 
                          4   

In [20]:
nhanesCodebook("SMQ", "SMQ020")

Code or Value,Value Description,Count,Cumulative,Skip to Item
<chr>,<chr>,<int>,<int>,<chr>
1,Yes,2299,2299,
2,No,2566,4865,SMQ120
7,Refused,4,4869,SMQ120
9,Don't know,8,4877,SMQ120
.,Missing,3,4880,


In [18]:
# Smoking history (SMQ020) among respondents who reported lung cancer
mcq %>%
    filter(MCQ230A == "Lung") %>%
    left_join(smq) %>%
    xtabs(~ SMQ020, data = .)

Joining with `by = join_by(SEQN)`


SMQ020
Yes  No 
 17   1 

In [41]:
# Column proportions of household income, by lung cancer status
demo %>% left_join(mcq) %>% mutate(lung_cancer = MCQ230A == "Lung") %>%
    replace_na(list(lung_cancer = FALSE)) %>%
    xtabs(~ INDHHIN2 + lung_cancer, data = .) %>% prop.table(2)

Joining with `by = join_by(SEQN)`


                    lung_cancer
INDHHIN2                  FALSE       TRUE
  $ 0 to $ 4,999     0.03224331 0.00000000
  $ 5,000 to $ 9,999 0.02881317 0.00000000
  $10,000 to $14,999 0.04539218 0.11764706
  $15,000 to $19,999 0.06105648 0.11764706
  $20,000 to $24,999 0.06037046 0.05882353
  $25,000 to $34,999 0.10919277 0.29411765
  $35,000 to $44,999 0.10187514 0.11764706
  $45,000 to $54,999 0.06928882 0.05882353
  $55,000 to $64,999 0.06551566 0.00000000
  $65,000 to $74,999 0.05030871 0.05882353
  $20,000 and Over   0.03750286 0.00000000
  Under $20,000      0.01349188 0.11764706
  $75,000 to $99,999 0.09478619 0.00000000
  $100,000 and Over  0.18557055 0.05882353
  Refused            0.01943746 0.00000000
  Don't know         0.02515436 0.00000000

## Cholera in 19th century England
- Cholera first arrived in England in 1831.

- 31,000 people died of cholera between 1831 and 1832.

- A second outbreak in 1848 killed 62,000.

![father thames](https://scpoecon.github.io/ScPoEconometrics/images/father-thames.jpg)

Source: Punch (1858), cited in https://scpoecon.github.io/ScPoEconometrics/IV.html.

![cholera caricature](https://www.sciencemuseum.org.uk/sites/default/files/styles/smg_carousel_zoom/public/2065329612.jpg)

Source: https://www.sciencemuseum.org.uk/objects-and-stories/medicine/cholera-victorian-london

## What causes Cholera?

- Germs had not yet been identified as a cause of disease; the germ theory was only one (unpopular) hypothesis.

* Miasmas (poisonous particles floating in the air):
    - Rapid industrialisation created filthy, unsanitary neighborhoods that tended to be the focal points of disease and epidemics.
    - By improving sanitation and cleanliness, levels of disease were seen to fall, which seemed to support the miasma theory.

- Other theories
    * Imbalances in the humors of the body (black bile, yellow bile, blood, phlegm)
    * Poison in the ground

<img src="images/advice.jpg" width="80%">

## John Snow

* John Snow was a physician in London who, by watching the course of the disease, came to believe that cholera was caused by a living organism that is ingested (with water or food), multiplies within the body, and is expelled back into the environment.

* Snow developed arguments to support his theory, for example:
  * Cholera spreads along trading routes.
  * When a ship called at a port with cholera, the sailors did not get sick if no one left the boat.

* Nonetheless, there remained considerable skepticism.

<img src="https://upload.wikimedia.org/wikipedia/en/3/30/Jon_Snow_Season_8.png" style="float: right" width=300/>

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/cc/John_Snow.jpg/1200px-John_Snow.jpg" style="float: right" width=300/>

## Epidemic of 1854

<img src="images/map.jpg" width="80%">

## Snow's methodology

* Snow identified the first case in London: a man named John Harnold.

* Snow also identified the second case: the man who took Harnold's room after Harnold died.

* John Harnold had newly arrived by the *Elbe* steamer from Hamburg, where there was an outbreak.

* Snow also found several adjacent apartment buildings, with one hit by cholera, one not.  He showed that in each case, the affected building had water contaminated by sewage, but the other building had relatively pure water.

### Exceptions

* There was a brewery near the Broad Street pump, but none of its workers got sick. It turned out the brewery had its own private pump on-site.

* A woman in Hampstead got cholera. Snow discovered that she had water from the Broad Street pump delivered to her, because she liked the taste.

According to legend, Snow lobbied the local council to remove the pump handle, after which the epidemic receded.

![Removal of the Broad Street pump handle](https://scpoecon.github.io/ScPoEconometrics/ScPoEconometrics_files/figure-html/snow-TS-1.png)

(Cited in: https://scpoecon.github.io/ScPoEconometrics/IV.html)

![broad street pump](https://i0.wp.com/livinglondonhistory.com/wp-content/uploads/2022/07/Optimized-IMG_2622.jpg?w=800&ssl=1)

(Source: https://livinglondonhistory.com/the-story-of-the-john-snow-pump-in-soho/)

## A natural experiment

> Although the facts shown in the above table afford very strong evidence of the powerful influence which the drinking of water containing the sewage of a town exerts over the spread of cholera, when that disease is present, **yet the question does not end here**; for the intermixing of the water supply of the Southwark and Vauxhall Company with that of the Lambeth Company, over an extensive part of London, admitted of the subject being sifted in such a way as to yield the most incontrovertible proof on one side or the other.  In the subdistricts enumerated in the above table as being supplied by both Companies, the mixing of the supply is of the most intimate kind.  The pipes of each Company go down all the streets, and into nearly all the courts and alleys.  **A few houses are supplied by one Company and a few by the other**, according to the decision of the owner or occupier at that time when the Water Companies were in active competition.  In many cases a single house has a supply different from that on either side.  **Each company supplies both rich and poor, both large houses and small**; there is no difference either in the condition or occupation of the persons receiving the water of the different Companies.

![Snow's map of water supply](https://scpoecon.github.io/ScPoEconometrics/images/snow-supply.jpg)


(Cited in: https://scpoecon.github.io/ScPoEconometrics/IV.html)

## Snow's reasoning

* There were several water companies in London, and their service areas overlapped.

* On a single street, some people would have one company, others another company.

* The pipes had been laid many years before, when the water companies were still in active competition.

* In 1852, one of the companies (Lambeth) moved its intake pipe upstream to get purer water.

* Snow compared the death rates between those who got water from Lambeth and those who got water from Southwark and Vauxhall.

## Snow's data
<div class="xtable"><table frame="hsides" rules="groups" class="rendered small default_table"><thead><tr><th rowspan="1" colspan="1">Water Supply Company</th><th rowspan="1" colspan="1">Number of Houses</th><th rowspan="1" colspan="1">Deaths From Cholera</th><th rowspan="1" colspan="1">Cholera Deaths per 10,000 Houses</th></tr></thead><tbody><tr><td rowspan="1" colspan="1">Southwark and Vauxhall</td><td rowspan="1" colspan="1">40,046</td><td rowspan="1" colspan="1">1,263</td><td rowspan="1" colspan="1">315</td></tr><tr><td rowspan="1" colspan="1">Lambeth</td><td rowspan="1" colspan="1">26,107</td><td rowspan="1" colspan="1">98</td><td rowspan="1" colspan="1">37</td></tr><tr><td rowspan="1" colspan="1">Rest of London</td><td rowspan="1" colspan="1">256,423</td><td rowspan="1" colspan="1">1,422</td><td rowspan="1" colspan="1">59</td></tr></tbody></table></div>
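
The last column of the table can be recomputed from the first two. The sketch below does this in base R; the data frame name `snow_tab` is made up for illustration, and the counts are simply transcribed from the table above.

```r
# Counts transcribed from the table above (snow_tab is a hypothetical name)
snow_tab <- data.frame(
  company = c("Southwark and Vauxhall", "Lambeth", "Rest of London"),
  houses  = c(40046, 26107, 256423),
  deaths  = c(1263, 98, 1422)
)

# Cholera deaths per 10,000 houses
snow_tab$rate <- 1e4 * snow_tab$deaths / snow_tab$houses
print(transform(snow_tab, rate = round(rate, 1)))

# Ratio of the Southwark and Vauxhall rate to the Lambeth rate
rate_ratio <- snow_tab$rate[1] / snow_tab$rate[2]
round(rate_ratio, 1)
```

The first two recomputed rates agree with the printed column up to rounding, and the Southwark and Vauxhall rate is roughly eight times the Lambeth rate. The recomputed figure for the rest of London comes out near 55 rather than the printed 59; the table is reproduced here as given.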

In [46]:
# test of significance
library(causaldata)
data(snow)
snow

year,supplier,treatment,deathrate
<dbl>,<chr>,<chr>,<dbl>
1849,Non-Lambeth Only,Dirty,134.9
1849,Lambeth + Others,Mix Dirty and Clean,130.1
1854,Non-Lambeth Only,Dirty,146.6
1854,Lambeth + Others,Mix Dirty and Clean,84.9


## Snow's analysis in modern terms

- $y_i=1$ if individual $i$ dies of cholera, 0 otherwise.

- $w_i=1$ if $i$'s water supply is impure, 0 otherwise.

- Just compute $\operatorname{cor}(y, w)$?

### As regression
- Equivalently, fit the model: $y_i = \alpha + \beta w_i + u_i$
    - $u_i$ are all the other unobservable factors that influence death (poverty, lifestyle, competing hypotheses like miasma, etc.)
    - $\beta$ is the increase in mortality if $w_i:0\to 1$.
- What could go wrong?

$$\mathbb{E}(y_i \mid w_i=1) - \mathbb{E}(y_i \mid w_i=0) = ?$$
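
One way to fill in the right-hand side, as a sketch that takes the model $y_i = \alpha + \beta w_i + u_i$ at face value, is to substitute it into each conditional expectation:

$$\mathbb{E}(y_i \mid w_i=1) - \mathbb{E}(y_i \mid w_i=0) = \beta + \big[\mathbb{E}(u_i \mid w_i=1) - \mathbb{E}(u_i \mid w_i=0)\big]$$

The comparison recovers $\beta$ only if the unobserved factors have the same mean in both groups. That is doubtful if, for example, poorer households were more likely to have impure water.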

### Second attempt

- Let $z_i=1$ if the person drank water from Southwark/Vauxhall and $z_i=0$ if they drank water from Lambeth.
- Now condition on $z_i$ instead of $w_i$, and compare expected mortality for $z_i=1$ and $z_i=0$.

$$\mathbb{E}(y_i \mid z_i=1) - \mathbb{E}(y_i \mid z_i=0) = ?$$
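
As a sketch under the same model $y_i = \alpha + \beta w_i + u_i$, substituting into each conditional expectation gives:

$$\mathbb{E}(y_i \mid z_i=1) - \mathbb{E}(y_i \mid z_i=0) = \beta\,\big[\mathbb{E}(w_i \mid z_i=1) - \mathbb{E}(w_i \mid z_i=0)\big] + \big[\mathbb{E}(u_i \mid z_i=1) - \mathbb{E}(u_i \mid z_i=0)\big]$$

If the second bracket is zero and the first is not, dividing through isolates $\beta$:

$$\beta = \frac{\mathbb{E}(y_i \mid z_i=1) - \mathbb{E}(y_i \mid z_i=0)}{\mathbb{E}(w_i \mid z_i=1) - \mathbb{E}(w_i \mid z_i=0)}$$

This ratio is commonly called the Wald estimator.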

### Assumptions

What assumptions did we make (implicitly or explicitly) when deriving this estimator?

- $\mathbb{E}(w_i \mid z_i=1) \neq \mathbb{E}(w_i \mid z_i=0)$
- $\mathbb{E}(u_i \mid z_i=1) = \mathbb{E}(u_i \mid z_i=0)$
- Water company only affects mortality through water purity.

Concretely, what do these assumptions mean for the data that Snow analyzed?
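
To see what the assumptions buy, here is a small simulation sketch. Everything in it is hypothetical: the probabilities, the effect size `beta`, and the confounder `u` are invented for illustration and are not estimates from Snow's data. The water company `z` is drawn independently of `u` and affects death `y` only through water purity `w`, so all three assumptions hold by construction.

```r
set.seed(504)
n    <- 200000
beta <- 0.10   # hypothetical causal effect of impure water on death risk

# u: unobserved factor (e.g. poverty) that raises both exposure and mortality
u <- rbinom(n, 1, 0.5)
# z: water company, assigned independently of u (as-if random)
z <- rbinom(n, 1, 0.5)
# w: impure water, more likely when z = 1 and when u = 1
w <- rbinom(n, 1, 0.1 + 0.6 * z + 0.2 * u)
# y: death, depends on w (causal effect) and u (confounding), but not on z directly
y <- rbinom(n, 1, 0.02 + beta * w + 0.10 * u)

# Naive comparison by water purity
naive <- mean(y[w == 1]) - mean(y[w == 0])

# Comparison by water company, rescaled by the difference in exposure
wald <- (mean(y[z == 1]) - mean(y[z == 0])) /
        (mean(w[z == 1]) - mean(w[z == 0]))

c(truth = beta, naive = naive, wald = wald)
```

With these made-up parameters the naive difference should overshoot `beta`, because `u` raises both `w` and `y`, while the ratio based on `z` should land close to `beta`.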

## Yule on pauperism

- Poverty rates in England increased dramatically during the Victorian era.

- The emerging industrial economy lowered wages, increased population growth, and decreased the prospects for stable employment.

### Victorian poor houses

- Poor people/"paupers" in England were supported either:
    - Inside “poor-houses”;
    - Outside poor-houses, according to local policy.

- There was a debate about whether poor houses increased or decreased pauperism.

- Yule studied how these policies affected rates of pauperism.

In [47]:
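# Yule's regression
# (assumes a data frame `yule` with columns paup, outrelief, old, pop is already loaded)
summary(lm(paup ~ outrelief + old + pop, data = yule))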


Call:
lm(formula = paup ~ outrelief + old + pop, data = yule)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.475  -5.311  -1.829   3.132  25.335 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 63.18774   27.14388   2.328   0.0274 *  
outrelief    0.75209    0.13499   5.572 5.83e-06 ***
old          0.05560    0.22336   0.249   0.8052    
pop         -0.31074    0.06685  -4.648 7.25e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.547 on 28 degrees of freedom
Multiple R-squared:  0.6972,	Adjusted R-squared:  0.6647 
F-statistic: 21.49 on 3 and 28 DF,  p-value: 2.001e-07


## Discussion 🙋‍♀️
- Does Yule establish causation or association?
- What sort of factors could confound the causal interpretation?
- Can you think of a way to measure the *causal* effect?