---
title: "A Brief Overview of Causal Inference"
subtitle: ""
author: 'Todd R. Johnson, PhD <br> Professor<br><br>'
institute: "The University of Texas School of Biomedical Informatics at Houston"
date: "`r Sys.Date()`<br><span class=footnote>Use arrow keys to navigate, press h for help</span>"
output:
xaringan::moon_reader:
css: ["default", "my-fonts.css", "my-theme.css"]
lib_dir: libs
self_contained: false
nature:
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
ratio: '16:9'
beforeInit: "https://platform.twitter.com/widgets.js"
navigation:
scroll: false # disable slide transitions by scrolling
includes:
in_header: header.html
---
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
library(dagitty)
library(ggdag)
library(tidyverse)
library(ggpmisc)
#library(hexbin)
library(plotly)
library(widgetframe)
library(DT)
library(kableExtra)
library(pcalg)
library(Rgraphviz)
library(rcausal)
library(DOT)
library(DiagrammeR)
library(corrplot)
library(lavaan)
```
# Notes
Use the arrow keys to navigate. Press `h` for help. Press `p` to see speaker notes. These notes often elaborate on slide content and include additional material and references.
My primary objective for creating this is to help me understand the basics of causal inference. I hope it helps others as well, especially machine learning researchers, who tend to build predictive associative models rather than causal models.
This presentation is hosted on Github Pages at: https://tjohnson250.github.io/overview_causal_infererence/overview_causal_infererence.html#1
All of these slides, including model outcomes, are generated in R using R Markdown and the [xaringan](https://github.com/yihui/xaringan) presentation package. The source code to generate the presentation is at: https://github.com/tjohnson250/overview_causal_infererence
A shorter version, suitable for a one hour seminar, may be checked out from the `short` branch: https://github.com/tjohnson250/overview_causal_inference/tree/short
If you find errors or have suggestions for improvement, open an issue on this project's Github repo.
---
# Pearl's Three Layer Causal Hierarchy.fn[1]
```{r, echo=FALSE}
text_tbl <- data.frame(
Level = c("1. Association<br>\\(\\mathbf{P(y|x)}\\)", "2. Intervention<br>\\(\\mathbf{P(y|do(x), z)}\\)", "3. Counterfactuals<br>\\(\\mathbf{P(y_x|x',y')}\\)"),
"Typical Activity" = c(
"Seeing,<br>Observing",
"Doing,<br>Intervening",
"Imagining,<br>Retrospecting"
),
"Typical Questions" = c("How would seeing X change my belief in Y?", "What if I do X?", "Was it X that caused Y?<br>What if I had acted differently?"),
Example = c("What is my chance of having prostate cancer if my PSA test is 3 ng/ml?", "Will taking aspirin daily reduce my risk of stroke?", "If I had stopped smoking 10 years earlier, could I have prevented lung cancer?")
)
ladder <- text_tbl %>% knitr::kable(format = "html", escape = FALSE, booktabs = TRUE) %>%
kable_styling(full_width = F) %>%
column_spec(1, bold = T, color = "#1c5253") %>%
column_spec(2, width = "8em") %>%
column_spec(3, width = "8em")
ladder
```
.footnote[[1] Pearl, 2019]
???
I have modified this with what I think are more illustrative questions. This is based on the paper below, but also appears in other publications by Pearl.
Pearl J. The seven tools of causal inference, with reflections on machine learning. Communications of the ACM. 2019 Feb 21;62(3):54-60.
---
class: middle
### Part I: Causation
### Part II: Identifying and Estimating Causal Effects from Observational Data
### Part III: Causal Model Evaluation
### Part IV: Causal Discovery from Observational Data
### Part V: Counterfactual Reasoning
---
class: center, middle
# Part I: Causation
---
# Part I Learning Objectives: Causation
- Define and identify interventions, causal effects, and average causal effects
- Given a generative model, draw the corresponding causal DAG
- Given a causal DAG describe the DAG that would result from an intervention on one of the variables
- Given data on two interventions, calculate the average causal effect
---
# Example: Age, Exercise and Cholesterol
Suppose that the following model accurately describes how age, exercise, and cholesterol are related
.pull-left[
```{r codeCAEObs, tidy=FALSE}
n <- 10000
a <- rnorm(n, mean = 50, sd = 10)
e <- 0.3*a + rnorm(n)
{{c <- 0.5*a + -1*e + rnorm(n, mean = 100, sd = 5)}}
```
```{r, echo=FALSE, fig.height=4}
dobs <- tibble(E = e, A = a, C=c)
gobs <- dagitty('dag{
A [pos="1,1"]
E [pos="0,0"]
C [pos="2,0"]
A -> C A->E E->C }')
pgobs <- ggdag(gobs, text_size=11) + theme_dag_gray()
pgobs
```
]
.pull-right[
From the model, the causal effect of exercise on cholesterol is $-1e$:
- For every unit increase in exercise, cholesterol decreases by $1$
Causal DAG (Directed Acyclic Graph)
- Graphical representation of the data generating model
- Links represent *possible* direct causal effects
- Missing links indicate the strong assumption of no direct causal effect
- Must include all common causes of any pair of variables already in the DAG
]
---
## The Observational Model Produces Observational Data
```{r, echo=FALSE, out.width="80%", comment=NA}
DT::datatable(head(dobs, n=8), fillContainer = FALSE, class = "nowrap hover", options = list(pageLength=8, dom = "t")) %>% formatRound(c("E", "A", "C"), 2)
```
---
.pull-left[
### The observational data show that exercise increases cholesterol!
```{r, echo=FALSE, out.width="70%"}
formula <- y ~ x
pobs <- ggplot(dobs, aes(x = E, y = C)) + geom_point() + geom_smooth(method = "lm", formula = formula)
pobs
fitobs = lm(C ~ E, dobs)
Cobs <- coef(fitobs)
```
$c = `r round(Cobs[1], digits=2)`+`r round(Cobs[2], digits=2)`e \space `r ifelse(summary(fitobs)$coefficients[2,4] < .001, "(p<.001)", "")`$
]
--
.pull-right[
### But in the model that generated the data exercise decreases cholesterol!
```{r, eval=FALSE}
n <- 10000
a <- rnorm(n, mean = 50, sd = 10)
e <- 0.3*a + rnorm(n)
{{c <- 0.5*a + -1*e + rnorm(n, mean = 100, sd = 5)}}
```
```{r, echo=FALSE, fig.height=3}
pgobs
```
Why does the data show the opposite association?
]
---
# Definitions
- **Causal Inference** is the process of inferring the **causal effect** of one random variable, $X$, on a second random variable, $Y$
- The **Causal Effect** of $X$ on $Y$ is the expected value of $Y$ when we **intervene** to directly set the value of $X$
- **Intervening** on $X$ means to force $X$ to take a specific value, independent of any of the variables that normally influence the value of $X$
- For example, if the decision to treat is normally influenced only by gender, *intervening* means ignoring gender when deciding on treatment, such as by randomly assigning whether to treat or not
- This is why Randomized Controlled Trials (RCTs) are considered the gold standard for identifying causal effects
---
## Intervening Changes the Generative Model
Intervene by randomly setting $E$ from $0 - 20$
.pull-left[
### Observational Model
```{r, eval=FALSE}
n <- 10000
a <- rnorm(n, mean = 50, sd = 10)
{{e <- 0.3*a + rnorm(n)}}
c <- 0.5*a + -1*e + rnorm(n, mean = 100, sd = 5)
```
```{r, echo=FALSE, fig.height=3}
pgobs
```
]
.pull-right[
### Interventional Model
```{r codeCAEint}
n <- 10000
a <- rnorm(n, mean = 50, sd = 10)
{{e <- sample(0:20, n, replace = TRUE)}}
c <- 0.5*a + -1*e + rnorm(n, mean = 100, sd = 5)
```
```{r, echo=FALSE, fig.height=3}
dint <- tibble(E = e, A = a, C=c)
gint <- dagitty('dag{
A [pos="1,1"]
E [pos="0,0"]
C [pos="2,0"]
A -> C E->C }')
pgint <- ggdag(gint, text_size=11) + theme_dag_gray()
pgint
```
]
???
Intervening on $E$ means replacing the expression for $e$ from the observational model with a specific value, here one randomly assigned between $0$ and $20$. This changes the generative model and also changes the causal DAG by removing the link $A \rightarrow E$, since in the interventional model, $e$'s value no longer directly depends on $a$.
---
## Intervening Changes The Distribution
.pull-left[
### Observational Distribution
```{r, echo=FALSE, out.width="70%"}
formula <- y ~ x
pobs <- ggplot(dobs, aes(x = E, y = C)) + geom_point() + geom_smooth(method = "lm", formula = formula)
pobs
#p <- ggplotly(p)
#frameWidget(p, width='350', height='350')
fitobs = lm(C ~ E, dobs)
Cobs <- coef(fitobs)
```
$c = `r round(Cobs[1], digits=2)`+`r round(Cobs[2], digits=2)`e \space `r ifelse(summary(fitobs)$coefficients[2,4] < .001, "(p<.001)", "")`$
]
--
.pull-right[
### Interventional Distribution
```{r, echo=FALSE, out.width="70%"}
formula <- y ~ x
pint <- ggplot(dint, aes(x = E, y = C)) + geom_point() + geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(aes(label = paste(stat(eq.label), stat(rr.label), sep = "~~~~")),
formula = formula, parse = TRUE)
pint
fitint = lm(C ~ E, dint)
Cint <- coef(fitint)
```
$c = `r round(Cint[1], digits=2)`+`r round(Cint[2], digits=2)`e \space `r ifelse(summary(fitint)$coefficients[2,4] < .001, "(p<.001)", "")`$
]
???
The coefficient for $e$ in the observational distribution is way off (indicating that exercising increases cholesterol), but the coefficient for the interventional distribution is close to the true value. Both coefficients are statistically significant at the p = .001 level.
---
# Difference Between Seeing and Doing
.pull-left[
### Observational Association
$E(C|e)$: The expected value of Cholesterol given that we observe a specific level of exercise.
$E(C|e)$ is confounded, because age $A$ affects both the amount of exercise $E$ a person gets and cholesterol levels $C$.
In the observational data
$$E(C|e) \neq E(C|do(e))$$
]
.pull-right[
### Interventional Association (Causal Effect)
$E(C|do(e))$: The expected value of Cholesterol given that we intervene and set exercise to a specific value, independent of any of the variables that normally influence the value of $E$.
$do()$ is called the do-operator.
In the interventional data
$$E(C|e) = E(C|do(e))$$
]
---
### Average Causal Effect (ACE) for a Population
Contrast of the mean values of $Y$ (outcome) given two specific interventions on $X$ (treatment)
- Dichotomous (binary) treatment and outcomes:
$$P(Y=1|do(X=1))-P(Y=1|do(X=0))$$
$$P(\text{Survival}|do(\text{Treat}))-P(\text{Survival}|do(\text{Don't Treat}))$$
- Continuous treatment and outcomes:
$$E(Y|do(X=x)) - E(Y|do(X=x^{\prime}))$$
First equation is a special case, since for dichotomous outcomes
$$E(Y|do(X=x)) = P(Y=1|do(X = x))$$
???
ACE is also called the Average Treatment Effect (ATE)
Contrast ACE with the individual causal effect: the difference between counterfactual outcomes for a single patient--the outcome if treated vs. the outcome if not treated. Lack of an average causal effect does not imply lack of an individual causal effect. However, in most cases it is impossible to estimate individual causal effects from observational data because we only ever observe the effect of one level of the treatment for an individual.
A crossover randomized experiment in which we give different treatments over time to the same person can sometimes measure individual causal effects when treatment and outcome are short term and do not affect the next treatment period. This is commonly used in cognitive science and psychology in within-subject designs. For example, we might measure a subject's ability to monitor different numbers of patients using a telemonitoring system to determine the effect of patient load on time to respond to events, error rate, etc.
See Hernán and Robins, Causal Inference, Chapter 1 and Fine Point 2.1 for more details on individual vs. population causal effects. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
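A minimal base-R sketch of estimating the ACE for a dichotomous outcome from two simulated interventions (my illustration, not part of the deck; the outcome model and probabilities are invented):

```r
# Sketch: ACE = P(Y=1|do(X=1)) - P(Y=1|do(X=0)) estimated from two
# simulated interventional datasets (outcome model is made up)
set.seed(42)
n <- 100000
u <- rnorm(n)                            # background variation
y_do1 <- rbinom(n, 1, plogis(0.8 + u))   # intervene: set X = 1
y_do0 <- rbinom(n, 1, plogis(0.2 + u))   # intervene: set X = 0
ace <- mean(y_do1) - mean(y_do0)
ace  # positive: treatment increases P(Y = 1)
```

Because each unit is simulated under both interventions, this also illustrates why real observational or experimental data cannot deliver individual causal effects: there we see only one of `y_do1`, `y_do0` per unit.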
---
## Estimating ACE of Exercise on Cholesterol Using the Interventional Dataset
Step 1: Estimate functional relationship between C and E using linear regression on the **interventional dataset**:
```{r, echo=FALSE}
fit = lm(C ~ E, dint)
new <- data.frame(E = c(1, 2))
est = predict(fit, new)
```
$$c = `r round(Cint[1], digits=2)`+`r round(Cint[2], digits=2)`e \space `r ifelse(summary(fit)$coefficients[2,4] < .001, "(p<.001)", "")`$$
$$E(C|do(E=e))=`r round(Cint[1], digits=2)`+`r round(Cint[2], digits=2)`e \space `r ifelse(summary(fit)$coefficients[2,4] < .001, "(p<.001)", "")`$$
Step 2: Use regression equation to calculate ACE
$$
\begin{align}
ACE &= E(Y|do(E=1))-E(Y|do(E=2)) \\
&= (`r round(Cint[1], digits=2)`+`r round(Cint[2], digits=2)`*1) - (`r round(Cint[1], digits=2)`+`r round(Cint[2], digits=2)`*2) \\
&= `r round(est[2], 2)` - `r round(est[1], 2)` \\
&= `r round(est[2]-est[1], 2)`
\end{align}
$$
Equivalently: the regression coefficient on $e$ directly tells us how much $C$ will change for each unit increase in $E$.
---
class: center, middle, Large
We say that $\mathbf{X}$ **causes** $\mathbf{Y}$ in a population when the **average causal effect** of $X$ on $Y$ is non-zero.
---
class: center, middle
## Part II: Identifying and Estimating Causal Effects from
## Observational Data
---
# Part II Learning Objectives: Identifying and Estimating Causal Effects from
## Observational Data
- Distinguish between total and direct causal effects with respect to a Causal DAG
- Define adjustment and show how to adjust for a variable when doing linear regression
- Given a causal DAG identify the causal and non-causal paths between two variables
- Apply the backdoor criterion to identify which variables must be adjusted for in order to identify direct and total causal effects
- Write a regression expression to estimate the causal effect of $X$ on $Y$ given a set of variables that meet the backdoor criterion
- Given the results of a linear regression that satisfies the backdoor criterion for the causal effect of $X$ on $Y$ give the causal effect estimate of $X$ on $Y$
---
## With Observational Data...
.pull-left[
#### using E to predict C gives an incorrect estimate,
```{r, echo=FALSE, out.width="70%"}
#formula <- y ~ x
#p <- ggplot(dobs, aes(x = E, y = C)) + geom_point() + geom_smooth(method = "lm", formula = formula)
pobs
#p <- ggplotly(p)
#frameWidget(p, width='350', height='350')
fit = lm(C ~ E, dobs)
Cobs <- coef(fit)
```
$c = `r round(Cobs[1], digits=2)`+`r round(Cobs[2], digits=2)`e \space `r ifelse(summary(fit)$coefficients[2,4] < .001, "(p<.001)", "")`$
]
--
.pull-right[
#### but controlling for A gives the correct estimate
```{r, echo=FALSE, out.width="70%"}
#dobsstrat <- dobs %>% mutate(abin=cut(a, 10))
#ggplot(d, aes(E, C)) + geom_point(aes(color=abin))
ggplot(dobs, aes(E, C)) + geom_point() + geom_smooth(method="lm", aes(group = cut(A, 10), color=cut(A, 10)), se = FALSE)
fitobsstrat = lm(C ~ E + A, dobs)
Cintadjusted <- coef(fitobsstrat)
# Save coefficients for C = ci + ce*E + ca*A
ca <- round(Cintadjusted[3], 2)
ce <- round(Cintadjusted[2], 2)
ci <- round(Cintadjusted[1], 2) # intercept
```
$c = `r round(Cintadjusted[1], 2)`+`r round(Cintadjusted[2], 2)`e + `r round(Cintadjusted[3], 2)`a \space `r ifelse(summary(fitobsstrat)$coefficients[2,4] < .001, "(p<.001)", "")`$
]
???
The graph on the right shows the observational data with a regression line for each of several age groups. As you can see, most of the lines show the appropriate negative correlation between $C$ and $E$. Lines near the left and right are incorrect due to sparse data. Stratifying by age (here age group) is one method for controlling or adjusting for a confounder, such as $A$.
The regression equation also shows a very close estimate of the correct causal effect of $E$ on $C$. Regressing $C$ on both $E$ and $A$ controls or adjusts for $A$, returning the close estimate of the true causal effect of $E$ on $C$.
---
## Adjustment
Stratifying by age $(A)$ or including $A$ as a covariate in the regression $(C = {\beta_1}+{\beta_2}A + {\beta_3}E)$ is called **adjusting**, **conditioning**, or **controlling for A**
In R code, regression using the observational dataset (`dobs`):
.pull-left[
Unadjusted for $A$
```{r}
lm(C ~ E, dobs)
```
]
.pull-right[
Adjusted for $A$
```{r}
lm(C ~ E + A, dobs)
```
]
---
### What is the Causal Effect of Age on Cholesterol?
Using the observational dataset, regressing $C$ on $A$ and $E$ gave us: $c = `r round(Cintadjusted[1], 2)`+`r round(Cintadjusted[2], 2)`e + `r round(Cintadjusted[3], 2)`a$
.pull-left[
```{r, tidy=FALSE, eval=FALSE}
n <- 10000
a <- rnorm(n, mean = 50, sd = 10)
e <- 0.3*a + rnorm(n)
{{c <- 0.5*a + -1*e + rnorm(n, mean = 100, sd = 5)}}
```
```{r, echo=FALSE, fig.height=4}
pgobs
```
]
.pull-right[
From the model, the **direct causal effect** of $A$ on $C$ is $0.5a$
- Regressing $C$ on $A$ and $E$ closely estimates this effect.
- However $A$ also affects $C$ through $E$
The **total causal effect** of $A$ on $C$ includes the effect of $A$ through both paths, which from the model is:
$$
\begin{align}
c &= .5a + -1\mathbf{e} \\
&= 0.5a + -1\mathbf{(0.3a)} \\
&= 0.5a\mathbf{- 0.3a} \\
&= 0.2a \\
\end{align}
$$
]
???
To get the total causal effect of A on C, we must regress C on A alone (`lm(C ~ A, dobs)`), controlling for E in the regression (`lm(C ~ A + E, dobs)`) gives us the direct effect of A on C.
---
### A Single Analysis is Usually Insufficient to Measure All Effects
.pull-left[
```{r, echo=FALSE, out.width="70%"}
formula <- y ~ x
p <- ggplot(dobs, aes(x = A, y = C)) + geom_point() + geom_smooth(method = "lm", formula = formula)
p
fitCAobs = lm(C ~ A, dobs)
CAobs <- coef(fitCAobs)
```
`lm(C ~ A, dobs)`
$c = `r round(CAobs[1], digits=2)`+`r round(CAobs[2], digits=2)`a \space `r ifelse(summary(fitCAobs)$coefficients[2,4] < .001, "(p<.001)", "")`$
]
.pull-right[
This simple 3 variable dataset requires two different regression analyses to estimate the causal effects of $A$ on $C$ and $E$ on $C$
- Total effect of $E$, Direct effect of $A$:
`lm(C ~ A + E, dobs)`
$c = `r round(Cintadjusted[1], 2)`+`r round(Cintadjusted[2], 2)`e + `r round(Cintadjusted[3], 2)`a \space `r ifelse(summary(fitobsstrat)$coefficients[2,4] < .001, "(p<.001)", "")`$
- Total effect of $A$:
`lm(C ~ A, dobs)`
$c = `r round(CAobs[1], digits=2)`+`r round(CAobs[2], digits=2)`a \space `r ifelse(summary(fitCAobs)$coefficients[2,4] < .001, "(p<.001)", "")`$
Simply throwing all covariates into a single regression is insufficient... but this is what most machine learning models do
]
???
To get the direct effect of $A$ and $E$ (thereby estimating all path coefficients) we would have to regress $E$ on $A$. Later slides show that we don't need to adjust for $C$ because it is a collider. Adjusting for $C$ would give an incorrect estimate of the effect of $A$ on $E$.
---
class: center, middle
.Large[How do we know when and what to adjust for?]
---
## Paths in Causal DAGs
.pull-left[
Observational Causal DAG
```{r echo=FALSE, out.width="70%"}
pgobs
```
]
.pull-right[
A **path** is a sequence of non-intersecting adjacent edges
- Non-intersecting means a path cannot cross a node more than once
- Direction of the arrows doesn't matter
Examples:
- $A \rightarrow E$
- $A \rightarrow E \rightarrow C$
- $E \leftarrow A \rightarrow C$
All associations are transmitted along paths,
but not all paths transmit association!
]
???
Source: https://www.ssc.wisc.edu/~felwert/causality/wp-content/uploads/2013/06/2-elwert_dags.pdf
See here for how to cite sources using refmanager package: https://github.com/yihui/xaringan/wiki/Bibliography-and-citations
---
class: center, middle
.Large[Two Kinds of Paths: **Causal** and **Non-Causal**]
---
## Causal Paths
.pull-left[
Observational Causal DAG
```{r echo=FALSE, out.width="70%"}
pgobs
```
]
.pull-right[
**Causal Path:** A path in which all arrows point away from the treatment to the outcome
- Causal path for Treatment $E$ and Outcome $C$
* $E \rightarrow C$
- Causal paths for Treatment $A$ and Outcome $C$
* $A \rightarrow E \rightarrow C$
* $A \rightarrow C$
The **total causal effect** of a treatment on an outcome consists of association transmitted along all causal paths connecting them
]
---
## Non-Causal Paths
.pull-left[
Observational Causal DAG
```{r echo=FALSE, out.width="70%"}
pgobs
```
]
.pull-right[
**Non-Causal Path:** A path from the treatment to the outcome in which at least one arrow points back to the treatment
- Non-Causal path for Treatment E and Outcome C
* $E \leftarrow A \rightarrow C$
- Non-Causal paths for Treatment $A$ and Outcome C
* None
Non-causal paths are potential spurious sources of association between treatment and outcome
]
???
Technically $A$ cannot be a treatment, because we cannot externally manipulate age. Body mass index, obesity, and cholesterol levels are similar kinds of variables, because none are directly manipulable. This has led to a bit of a feud between different groups of causal inference researchers over whether it even makes sense to estimate the causal effect of non-manipulable causes. For some background, see the papers below, but also check out Pearl's and Hernán's Twitter feeds for their different perspectives. Much of this centers on the need to test well-defined interventions. Obesity can only be manipulated indirectly, so why not just define a precise intervention and measure the causal effect of that intervention on some outcome of interest? Pearl argues that we still want to know the effect of obesity itself, because if we find that indirect changes in obesity have no effect on, say, life expectancy, then we might as well stop worrying about obesity, at least with respect to life expectancy.
Pearl J. Does obesity shorten life? Or is it the soda? On non-manipulable causes. Journal of Causal Inference. 2018 Sep 25;6(2).
Hernán MA, Taubman SL. Does obesity shorten life? The importance of well-defined interventions to answer causal questions. International journal of obesity. 2008 Aug 11;32(S3):S8.
Hernán MA. Does water kill? A call for less casual causal inferences. Annals of epidemiology. 2016 Oct 1;26(10):674-80.
---
### Causal and Non-Causal Paths for Treatment E and Outcome C
.pull-left[
Observational Causal DAG
```{r echo=FALSE, out.width="70%"}
pgobs
```
]
.pull-right[
Causal path:
- $E \rightarrow C$
Non-Causal path:
- $E \leftarrow A \rightarrow C$
The non-causal path is a source of confounding because it transmits spurious association between $E$ and $C$
Adjusting for $A$ in the observational data blocks this path, allowing us to use the observational data to estimate the causal effect of $E$ on $C$
]
---
## Identifiability and Estimation
A causal effect is **identifiable** if the controlled (post-intervention) distribution can be estimated from data drawn from the observational (pre-intervention) distribution.
- The **Backdoor Criterion** is one set of constraints for determining whether an effect is identifiable:
- The causal effect of $T$ on $Y$ is **identifiable** if
we can adjust for a set of variables that
- blocks all non-causal paths between $T$ and $Y$
- without blocking any causal paths between $T$ and $Y$
**Estimation** is the act of quantifying an identifiable causal effect from finite data
- Uses statistical estimation techniques
---
### Blocking and Unblocking Paths
All Causal DAGs are a combination of these three patterns:
.pull-left[
```{r, echo=FALSE, out.width="70%", fig.height=2}
glinear <- dagitty('dag{
A [pos="0,0"]
E [pos="1,0"]
C [pos="2,0"]
A -> E E->C}')
ggdag(glinear, text_size=11) + theme_dag_gray()
```
```{r, echo=FALSE, out.width="70%", fig.height=3}
gdiverging <- dagitty('dag{
E [pos="1,2"]
A [pos="0,1"]
C [pos="1,0"]
A -> E A->C}')
ggdag(gdiverging, text_size=11) + theme_dag_gray()
```
```{r, echo=FALSE, out.width="70%", fig.height=3}
gconverging <- dagitty('dag{
A [pos="0,2"]
E [pos="0,0"]
C [pos="1,1"]
A -> C E->C}')
ggdag(gconverging, text_size=11) + theme_dag_gray()
```
]
.pull-right[
**Linear Path (or Chain)**
- Adjusting for $E$ (a mediator) *blocks* the path between $A$ and $C$
**Common Cause**
- Adjusting for $A$ *blocks* the path between $E$ and $C$
$$\space$$
**Common Effect**
- $C$ is called a collider
- Unadjusted colliders *block* the path
- Adjusting for $C$ (or a descendant of $C$) **unblocks or opens** the path
]
???
In the linear chain, adjusting for a child of the mediator $E$ will block part of the causal path from $A$ to $C$. A child of a mediator is sometimes called a descending proxy. Here $F$ is a descending proxy:
```{r, echo=FALSE, out.width="70%", fig.height=3}
gdescproxy <- dagitty('dag{
A [pos="0,0"]
E [pos="1,0"]
F [pos="1, -1"]
C [pos="2,0"]
A -> E E->C E->F}')
ggdag(gdescproxy, text_size=11) + theme_dag_gray()
```
If $F$ were a parent of $E$, adjusting for $F$ would not affect the path from $A$ to $C$.
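A quick base-R simulation (my sketch, not part of the original example) shows the collider behavior numerically: $A$ and $E$ are independent by construction, yet adjusting for their common effect $C$ induces a strong spurious association:

```r
# Collider sketch: A and E are independent causes of C
set.seed(1)
n <- 100000
A <- rnorm(n)
E <- rnorm(n)
C <- A + E + rnorm(n)
cor(A, E)                 # approximately 0: unadjusted collider blocks the path
coef(lm(E ~ A + C))["A"]  # strongly negative: adjusting for the
                          # collider C opens the path A -> C <- E
```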
---
## Identifying the Total and Direct Effects of $A$ on $C$
.pull-left[
Observational Causal DAG
```{r echo=FALSE, out.width="70%"}
pgobs
```
]
.pull-right[
**Total Effect of $A$ on $C$**
- No adjustments needed
- All paths from $A$ to $C$ are causal
- $A \rightarrow C$
- $A \rightarrow E \rightarrow C$
- `lm(C ~ A, dobs)`
**Direct Effect of $A$ on $C$**
- Adjusting for $E$ blocks the path $A \rightarrow E \rightarrow C$ leaving only the direct path $A \rightarrow C$
- `lm(C ~ A + E, dobs)`
]
```{r, echo=FALSE}
AEadjustedfit = lm(E ~ A, dobs)
AEadjusted <- coef(AEadjustedfit)
# Save coefficients for E = ei + ea*A
ea <- round(AEadjusted[2], 2)
ei <- round(AEadjusted[1], 2) # intercept
```
---
## Identifying the Total Effect of $A$ on $E$
.pull-left[
Observational Causal DAG
```{r echo=FALSE, out.width="70%"}
pgobs
```
]
.pull-right[
**Total (and Direct) Effect of $A$ on $E$**
- No adjustments needed
- Causal path
- $A \rightarrow E$
- Non-causal path, blocked by collider $C$
- $A \rightarrow C \leftarrow E$
- Adjusting for $C$ would distort the effect estimate
- `lm(E ~ A, dobs)`
]
???
Show the effects and how adjusting distorts the effect
---
## Backdoor Criterion (Pearl 1995)
A set of variables $\{Z\}$ (possibly empty) satisfies the **backdoor criterion** relative to an ordered pair of variables $\{X, Y\}$ in a DAG if:
1. no node in $\{Z\}$ is a descendant of $X$, and
2. $\{Z\}$ blocks every path between $X$ and $Y$ that contains an arrow into $X$ (the "backdoor paths")
If $\{Z\}$ meets the backdoor criterion, the total causal effect of $X$ on $Y$ is non-parametrically identifiable given $\{Z\}$, such that:
$$ P(Y|do(X)) = \sum_{Z} P(Y|X,Z)P(Z)$$
The backdoor criterion recognizes that all paths that descend from $X$ are either causal or blocked
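As a numeric sketch of the adjustment formula (all probabilities below are invented for illustration), with a single binary confounder $Z$ the interventional probability is just a weighted average over the strata of $Z$:

```r
# Numeric sketch of P(Y=1|do(X=1)) = sum_z P(Y=1 | X=1, Z=z) P(Z=z)
# with one binary confounder Z (probabilities invented)
p_z <- c(0.4, 0.6)       # P(Z=0), P(Z=1)
p_y_x1z <- c(0.3, 0.7)   # P(Y=1 | X=1, Z=0), P(Y=1 | X=1, Z=1)
p_y_do_x1 <- sum(p_y_x1z * p_z)
p_y_do_x1  # 0.3*0.4 + 0.7*0.6 = 0.54
```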
---
## Multiple Adjustment Sets
.pull-left[
```{r, echo=FALSE, out.width="70%"}
g1 <- dagitty( "dag {
Y <- X <- Z1 <- V -> Z2 -> Y
Z1 <- W1 -> W2 -> Z2
X <- W1 -> Y
X <- W2 -> Y
}")
pg1 <- ggdag(g1, text_size=11) + theme_dag_gray()
pg1
```
]
.pull-right[
Adjustment sets for $X \rightarrow Y$:
```{r, echo=FALSE}
print( adjustmentSets(g1, 'X', 'Y'))
```
Adjusting for any one of these sets of variables eliminates confounding for the causal effect of $X$ on $Y$
For a linear model this means regressing $Y$ on $X$, including as covariates only one set of these variables
In R, any of these statements will estimate the total effect of $X$ on $Y$ from observational data `obsdata`:
```{r, eval=FALSE}
lm(Y ~ X + W1 + W2 + Z2, obsdata)
lm(Y ~ X + V + W1 + W2, obsdata)
lm(Y ~ X + W1 + W2 + Z1, obsdata)
```
]
---
# Adjustment Sets for $X \rightarrow Y$
```{r, echo=FALSE, fig.width=15}
ggdag_adjustment_set(g1, exposure='X', outcome='Y') + theme_dag_gray()
```
???
Bold edges are unadjusted, light edges are adjusted
The backdoor criterion implicitly prevents adjusting for children of mediators. For instance, suppose we want to determine the causal effect of $X$ on $Y$ in the model:
$$X \rightarrow Z \rightarrow Y$$
$$Z \rightarrow Z_2$$
where $Z$, a mediator, causes $Z_2$. It is clear that adjusting for $Z$ would block the path between $X$ and $Y$, but adjusting for $Z_2$ would partially block the path.
---
exclude: FALSE
# Frontdoor Adjustment Criterion
.pull-left[
U unmeasured: Cannot block backdoor path from $X \rightarrow Y$
```{r, echo=FALSE, out.width="70%"}
gfront <- dagitty( 'dag {
X [pos="0,0"]
Z [pos="1,0"]
Y [pos="2,0"]
U [pos="1,1"]
X -> Z -> Y
U -> X
U -> Y
}')
pgfront <- ggdag(gfront, text_size=11) + theme_dag_gray()
pgfront
```
]
.pull-right[
$X \rightarrow Y$ is still identifiable using a two step process:
Step 1. Identify $X \rightarrow Z$:
* Backdoor path blocked by collider $Y$
$$X \leftarrow U \rightarrow Y \leftarrow Z$$
* In R: `lm(Z ~ X, data = obsdata)`
Step 2. Identify $Z \rightarrow Y$:
* Backdoor path blocked by adjusting for $X$
$$Z \leftarrow X \leftarrow U \rightarrow Y$$
* In R: `lm(Y ~ Z + X, data = obsdata)`
]
???
Since $U$ is unmeasured, we cannot adjust for it. However, in this case, the frontdoor adjustment criterion allows us to identify the causal effect of $X$ on $Y$ in two steps.
---
exclude: false
## Frontdoor Adjustment Criterion (Pearl 1995)
A set of variables $\{Z\}$ satisfies the **frontdoor criterion** relative to an ordered pair of variables $\{X, Y\}$ in a DAG if:
1. $\{Z\}$ intercepts all directed paths from $X$ to $Y$,
2. there is no backdoor path between $X$ and $Z$,
3. every backdoor path between $Z$ and $Y$ is blocked by $X$
If $\{Z\}$ meets the frontdoor criterion, the causal effect of $X$ on $Y$ is identifiable given $\{Z\}$, such that:
$$P(y|do(x)) = \sum_{z} P(z|x)\sum_{x'}P(y|x', z)P(x')$$
$x$ is the value of $X$ in $do(x)$, $x'$ refers to all values of $X$
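In the linear case the two-step process reduces to multiplying two regression coefficients. A simulated sketch (not from the slides; the coefficients are arbitrary, and $U$ is generated but never used in the regressions):
```{r, eval=FALSE}
set.seed(1)
n <- 1e5
U <- rnorm(n)                      # unmeasured confounder
X <- 0.9 * U + rnorm(n)
Z <- 0.7 * X + rnorm(n)            # X -> Z
Y <- 0.6 * Z + 1.1 * U + rnorm(n)  # Z -> Y; true X -> Y effect is 0.7 * 0.6 = 0.42

b_xz <- coef(lm(Z ~ X))["X"]       # step 1: identify X -> Z
b_zy <- coef(lm(Y ~ Z + X))["Z"]   # step 2: identify Z -> Y, adjusting for X
unname(b_xz * b_zy)                # close to 0.42, even though U is unmeasured
```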
---
exclude: true
## Instrumental Variables
Instrumental variables affect the outcome only through their effect on the treatment and share no common causes with the outcome
---
exclude: false
# Identifying Covariate Specific Effects
Suppose a drug $D$ affects $C$, and that gender $G$ affects who takes the drug.
.pull-left[
What is the effect of $E$ on $C$ when we *observe* $D = d$?
```{r, echo=FALSE, out.width="70%"}
gaecdg <- dagitty('dag{
A [pos="1,1"]
E [pos="0,0"]
C [pos="2,0"]
D [pos="1, -1"]
G [pos="2, -1"]
A -> C A->E E->C D->C G->D G->C}')
pgaecdg <- ggdag(gaecdg, text_size=11) + theme_dag_gray()
pgaecdg
```
]
.pull-right[
Estimate $P(C=c|do(E=e), D=d)$
Must use an adjustment set $S$, such that $S \cup D$ blocks all backdoor paths from $E$ to $C$.
No need to block backdoor path from $D$ to $C$.
For this graph, $S = \{A\}$ and $S \cup D = \{A, D\}$
Regression: `lm(C ~ E + A + D, data = obsdata)`
Mathematically:
$$\begin{aligned} P(C=c&|do(E=e), D=d) = \\ &\sum_S P(C=c|E=e, D=d, S=s)P(S=s|D=d) \end{aligned}$$
]
???
Note that the summation in the adjustment formula is over $S$, not $S \cup D$. $S$ is the set of all the other variables needed to block the backdoor paths from $E$ to $C$.
As a notational note, when you see a statement such as $P(S=s)$ and $S$ contains more than one variable, it is just shorthand for $P(S_1=s_1, S_2=s_2...)$. In this example, because $S=\{A,D\}$ $P(S=s) = P(A=a, D=d)$.
We don't want to block the backdoor path from $D$ to $C$ here, because the query is asking for the effect of $E$ on $C$ when we intervene on $E$ and then in the resulting interventional distribution observe $D=d$.
In contrast, if we just want to identify $P(C=c|do(E=e))$ we have:
$$P(C=c|do(E=e)) = \sum_S P(C=c|E=e, S=s)P(S=s)$$
where $S$ is a valid adjustment set that blocks the backdoor paths from $E$ to $C$.
The basic idea is that when we look at the interventional distribution (resulting from adjusting for $S$), we must stratify by $D$ to compute the covariate-specific effect, which means we are also adjusting for $D$. Thus we have to make sure that adjusting for $D$ does not open any backdoor paths for the effect from $E$ to $C$.
To see this, consider adding high blood pressure $B$ to the graph with the paths $E \rightarrow B \leftarrow D$, meaning that both $E$ and $D$ affect $B$. If we ask how $D$ affects $C$ among those with high blood pressure, we must adjust for $G$ to block the backdoor path $D \leftarrow G \rightarrow C$; but conditioning on $B$, a collider, opens the path $D \rightarrow B \leftarrow E \rightarrow C$. As a result, we need to add $E$ to the adjustment set, giving
$$S = \{G, E\}$$
and
$$\begin{aligned} P(C=c&|do(D=d), B=b) = \\ &\sum_S P(C=c|D=d, B=b, G=g, E=e)P(G=g, E=e|B=b) \end{aligned}$$
---
exclude: false
## Identifying the Effects of Multiple Interventions
What is the effect of simultaneously intervening on $E$ and $D$?
.pull-left[
```{r, echo=FALSE, out.width="70%"}
pgaecdg
```
]
.pull-right[
Estimate $P(C=c|do(E=e, D=d))$
Must use an adjustment set $S$ that blocks all backdoor paths from $E$ to $C$ and $D$ to $C$. For this graph:
```{r echo=FALSE}
s <- adjustmentSets(gaecdg, c("E", "D"), "C")
s <- s[[1]]
```
$$S = \{`r s`\}$$
Regression: `lm(C ~ E + A + D + G, data = obsdata)`
Mathematically:
$$\begin{aligned} P(C=c&|do(E=e, D=d)) = \\ &\sum_S P(C=c|E=e, D=d, S=s)P(S=s) \end{aligned}$$
]
---
## Confounding is a Causal Concept
* The data alone are insufficient to identify confounders or control for confounding
* Without a causal model, observational data is limited to associational models that do not tell us how to intervene
+ This includes all common machine learning and statistical approaches
+ Example: If an associational predictive model finds a few highly predictive "risk factors" we cannot assume that intervening on those risk factors will decrease risk.
* Associational predictive models are still useful if we already have interventions that are known to work
+ If we know that a treatment works if given earlier in the course of a disease, better predictive models can tell us who to treat
+ Variables that increase the predictive power of an associational model may include effects of the outcome, such as symptoms of a disease you are trying to predict
---
class: center, middle, Large
Causal Inference Requires a Causal Model,
But...
How Do we Know if Our Model is Correct?
---
class: center, middle
# Part III: Model Evaluation
---
# Part III Learning Objectives: Model Evaluation
- Identify two different approaches for evaluating causal models
- Describe the advantages and disadvantages of each approach
- Define what constitutes an equivalence class of causal DAGs
---
# Two Approaches to Model Evaluation
Local fit tests
- Tests of model-implied conditional independence assumptions
Global fit tests
- Model chi-square: comparison of model vs. observed covariance matrices
- RMSEA: Root Mean Square Error of Approximation
- CFI: Comparative fit index
- SRMR: Standardized Root Mean Square Residual
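As a sketch of a local fit test (simulated data, not from the slides): the chain $X \rightarrow Z \rightarrow Y$ implies $X \perp\!\!\!\perp Y | Z$, so in the linear case the coefficient of $X$ in a regression of $Y$ on $Z$ and $X$ should be near zero. The `dagitty` functions `impliedConditionalIndependencies()` and `localTests()` automate such tests for a whole DAG.
```{r, eval=FALSE}
set.seed(7)
n <- 1e5
X <- rnorm(n)
Z <- 0.8 * X + rnorm(n)
Y <- 0.6 * Z + rnorm(n)   # Y depends on X only through Z

summary(lm(Y ~ Z + X))$coefficients["X", ]  # estimate near 0: implied independence holds
```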
???
From: Kline RB. Principles and practice of structural equation modeling. Guilford publications; 2015 Nov 3.
---
## Model Evaluation: Local Fit Tests
Causal DAGs imply independence relations
.pull-left[
```{r, echo=FALSE, out.width="70%", fig.height=2}
ggdag(glinear, text_size=11) + theme_dag_gray()
```
```{r, echo=FALSE, out.width="70%", fig.height=3}
ggdag(gdiverging, text_size=11) + theme_dag_gray()
```
```{r, echo=FALSE, out.width="70%", fig.height=3}
ggdag(gconverging, text_size=11) + theme_dag_gray()
```
]
.pull-right[
**Linear Path (or Chain)**
- $A \perp\!\!\!\perp C | E$