%% ;;; -*- mode: Rnw; -*-
\synctex=1
\documentclass[a4paper,11pt]{article}
\usepackage{graphics}
\usepackage{amssymb,amsfonts,amsmath,amsbsy}
\usepackage{geometry}
\geometry{verbose,a4paper,tmargin=28mm,bmargin=28mm,lmargin=30mm,rmargin=30mm}
\usepackage{setspace}
\singlespacing
\usepackage{url}
\usepackage{nameref}
\usepackage[english]{babel}
%% FIXME: I should be using UTF-8. Fix this annoying thing!
\usepackage[latin1]{inputenc}
\usepackage{times}
\usepackage[T1]{fontenc}
\usepackage{cancel}
\usepackage{MnSymbol} %% for upmodels, cond, indep.
\usepackage{wasysym} %% smileys
%% see
%% https://tex.stackexchange.com/questions/3631/is-there-a-standard-symbol-for-conditional-independence
%% for alternatives
\usepackage{enumitem}
\usepackage[small]{caption}
\usepackage{hyperref}
\hypersetup{
colorlinks = true,
citecolor= black,
linkcolor = {blue},
filecolor = cyan %% controls color of external ref, if used
}
%% I do not understand why I keep using Burl. Oh well.
\usepackage{color}
\newcommand{\cyan}[1]{{\textcolor {cyan} {#1}}}
\newcommand{\blu}[1]{{\textcolor {blue} {#1}}}
\newcommand{\Burl}[1]{\blu{\url{#1}}}
\newcommand{\red}[1]{{\textcolor {red} {#1}}}
\newcommand{\green}[1]{{\textcolor {green} {#1}}}
\newcommand{\mg}[1]{{\textcolor {magenta} {#1}}}
\newcommand{\og}[1]{{\textcolor {PineGreen} {#1}}}
\newcommand{\code}[1]{\texttt{#1}} %From B. Bolker
\newcommand{\myverb}[1]{{\footnotesize\texttt {\textbf{#1}}}}
\newcommand{\Rnl}{\ +\qquad\ }
\newcommand{\Emph}[1]{\emph{\mg{#1}}}
\usepackage[begintext=\textquotedblleft,endtext=\textquotedblright]{quoting}
\newcommand{\activities}{{\vspace*{10pt}\LARGE \textcolor {red} {Activities:\ }}}
\newcommand{\R}{R}
\newcommand{\flspecific}[1]{{\textit{#1}}}
\newcommand*{\qref}[1]{\hyperref[{#1}]{\textit{``\nameref*{#1}'' (section \ref*{#1})}}}
\newcounter{exercise}
\numberwithin{exercise}{section}
\newcommand{\exnumber}{\addtocounter{exercise}{1} \theexercise \thinspace}
\usepackage[copyright]{ccicons}
%% The color of links (plain colored text, rather than the kind of
%% box with light blue) is given by hypersetup
\usepackage[authoryear, round, sort]{natbib}
%% \usepackage[square,numbers,sort&compress]{natbib}
\usepackage{gitinfo2}
\setlength{\parskip}{0.35em}
%% For using listings, so as to later produce HTML
%% uncommented by the make-knitr-hmtl.sh script
%% listings-knitr-html%%\usepackage{listings}
%% listings-knitr-html%%\lstset{language=R}
<<setup,include=FALSE,cache=FALSE>>=
require(knitr)
opts_knit$set(concordance = TRUE)
opts_knit$set(stop_on_error = 2L)
## next are for listings, to produce HTML
##listings-knitr-html%%options(formatR.arrow = TRUE)
##listings-knitr-html%%render_listings()
@
% %% BiocStyle needs to be 1.2.0 or above
% <<packages,echo=FALSE,results='hide',message=FALSE>>=
% require(BiocStyle, quietly = TRUE)
% @
% <<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
% BiocStyle::latex()
% ## or latex(use.unsrturl = FALSE)
% ## to use arbitrary biblio styles
% @
\begin{document}
% \bioctytle
\title{Choosing covariates, interpreting coefficients, and causal inference}
\author{Ramon Diaz-Uriarte\\
Dept. Biochemistry, Universidad Aut\'onoma de Madrid \\
Instituto de Investigaciones Biom\'edicas Sols-Morreale, IIBM (UAM-CSIC)\\
Madrid, Spain{\footnote{r.diaz@uam.es, rdiaz02@gmail.com}} \\
%% {\footnote{rdiaz02@gmail.com}} \\
{\small \Burl{https://ligarto.org/rdiaz}} \\
}
\date{\gitAuthorDate\ {\footnotesize (Rev: \gitAbbrevHash)}}
\maketitle
\tableofcontents
\clearpage
%% ZZ: FIXME
%% add numerical examples of everything, including categorical data ones
%% of Simpson's and Berkson's paradoxes
%% And probably teach this before the categorical data analysis part
\section*{License and copyright}\label{license}
This work is Copyright \copyright\ 2025 Ramon Diaz-Uriarte, and is
licensed under a \textbf{Creative Commons} Attribution-ShareAlike 4.0
International License:
\Burl{http://creativecommons.org/licenses/by-sa/4.0/}.
\centerline{\ccbysa}
All the original files for the document are available (again, under a Creative
Commons license) from \Burl{https://github.com/rdiaz02/BM-1}. (Note that in the
GitHub repo you will not see the PDF or R files, nor many of the data files,
since those are derived from the Rnw file.) This file is called \texttt{covars-interpr-causal.Rnw}.
\vspace*{10pt}
Please, \textbf{respect the copyright and license}. This material is
provided freely. If you use it, I only ask that you use it according to the
(very permissive) terms of the license: acknowledging the author and
redistributing copies and derivatives under the same license. If you have any
doubts, ask me.
\clearpage
%% \part{The main stuff}
\section{Introduction}
\subsection{What is this about}
\label{sec:what-this-about}
The aim of these notes is to try to clarify questions about ``what
variables should we add to our models when our aim is interpretation''\footnote{The paper by \cite{keele2020} has a very similar objective: ``(...) we ask when we can justify interpreting two or more coefficients in a regression model as causal parameters. We demonstrate that analysts must appeal to causal identification assumptions to give estimates causal interpretations.'' After going through these notes, reading that paper should be easy.}.
As discussed in class, if our purpose when fitting statistical models is only
prediction, reversal of regression coefficients when a covariate is in the model
with or without other covariates, and other similar counterintuitive phenomena,
are not a problem. The problem can arise if we want to interpret what the
coefficients mean.
The interpretation we often want to give is a causal one. For example, in a model
where the dependent or outcome variable is cardiovascular health and one of the
predictor variables is red wine consumption we are, arguably, trying to
understand if consuming red wine affects (i.e., has an effect on) cardiovascular
health. We might want to do this so that we can make public health
recommendations or take personal action (should I start drinking a little bit of
wine?). Intuitively, if a variable, X, has a causal effect on a variable, Y,
manipulating X will change the value of Y\footnote{The philosopher James Woodward, in \citet{woodward_making_2003}, defends a manipulationist or interventionist account of causality; the informal ideas introduced in these notes fall under that umbrella.}.
And what do we mean by \textbf{causal effects}? ``(...) causal effects in
populations, that is, \textbf{numerical quantities} that measure changes in the
distribution of an outcome under different \textbf{interventions}.'' \citep[p.\
v]{hernan2020} (emphasis is mine)
\subsection{There is no such thing as ``spurious association''. But when people use this term, it suggests they are trying to obtain causal estimates}
In the above study, there is a possible obvious problem. Suppose you observe an
association between moderate red wine consumption and better cardiovascular
health. Maybe what is happening is that, in the sample you are using, people who
consume moderate amounts of red wine are also people who consume olive oil and
lots of fresh vegetables (the Mediterranean diet?). The observed association
between red wine consumption and better cardiovascular health is not a causal
association: the association is the result of both red wine consumption and
better cardiovascular health being effects (or consequences) of the type of
diet. But red wine has no direct effect on cardiovascular health (we will see
examples of this pattern below, in section \ref{dag-structs}). A similar well-known
phenomenon is the very real association between ice cream sales and homicide
rates (ice cream sales are positively correlated with homicide rates, at
least in some parts of the world: more ice cream, more homicides).
Some authors would say there is a ``spurious association'' or ``spurious
correlation''\footnote{Instead of ``spurious'' you might read ``illusory'' or
``fictitious'' or ``apparent''.} between red wine consumption
and health (or ice creams and homicides). Well, not really: the association is
not spurious. It is quite real. And there is nothing wrong with computing it, nor
is there anything false, invalid, or fake with it showing up as
positive\footnote{This point is emphasized, for example, by \citet[][section 7.1]{hernan2020} or \citet{sober2024}}. But the association is not causal.
And the reason we can tell ``there is something wrong going on here if we infer
an effect of red wine on health'' is, precisely, because we are trying to use a
model that has causal interpretations. We are trying to answer questions such as
``Is red wine really good for your health?'' or ``Is switching from not
consuming red wine to consuming moderate amounts of it a good idea?''. See
\citet[][p.\ 84]{hernan2020} for similar comments.
So someone using expressions like ``spurious associations'' is actually revealing
that they really want to say something about causes and effects.
\activities Think of at least two examples, ideally at least one related to your
own research, where you might have thought of ``spurious associations''.
\subsection{What should you read of this long pdf?}\label{whattoread}
I expect you to read all of it, except the Appendix. You can read section
\ref{modified-basic} very, very quickly, except for structure f). The main
objective of this document is to allow you to understand why we should or
shouldn't adjust for covariates in the examples we give (\ref{chol-exercise},
\ref{fungus-ex}, \ref{fungus-collider}, \ref{pretreat-no},
\ref{cause-of-cause}, \ref{amplification}, \ref{bwparadox}).
Sections \ref{dag-structs}, \ref{backdoor-basics} are there just to explain why.
More complex examples are provided in \ref{direct-indirect}, but you can skip
this if you want.
The Appendix (section \ref{appendix}) is optional (though not harmful).
<<load_libs, echo=FALSE, results='hide',message=FALSE>>=
## Plots, with dagitty
library(dagitty)
## library(rethinking) ## for drawdag
## Installing rethinking can be complicated just for a few graphs
## So have a fallback if rethinking not available
if(!suppressWarnings(require("rethinking", quietly = TRUE))) {
drawdag <- plot
}
library(car)
@
\section{Graphs, DAGs, notation}
Graphs, such as the one in Figure \ref{fig:plot_dag_1}, are very useful for representing
causal concepts. In that figure, Age is a \textbf{common cause} of Exercise and
Cholesterol, and Exercise has a direct effect on Cholesterol too. These are
\textbf{DAG}s, for Directed Acyclic Graphs: Directed because the edges have a direction, and thus we see arrows,
not just edges; Acyclic because there are no cycles: following the arrows, you can never pass through the same variable twice.
% (Yes, we do know that we should recommend people exercise if they want to lower
% their cholesterol)
<<dag_1, results='hide', echo=FALSE>>=
common_cause <- dagitty("dag {
Age -> Exercise
Age -> Cholesterol
Exercise -> Cholesterol
}")
@
%% <<plot_dags_1_2, out.width = '14cm', out.height='7cm'>>=
<<plot_dag_1, fig.width = 4, fig.height=2, fig.lp='fig:', results = 'hide', echo=FALSE, fig.align = 'center', fig.cap='Cholesterol, Exercise, Age example',out.width='10cm', fig.pos="!h">>=
coordinates(common_cause) <- list(x = c(Exercise = 1, Age = 2, Cholesterol = 3),
y = c(Exercise = 0, Age = -1, Cholesterol = 0))
drawdag(common_cause, xlim = c(0.5, 3.5), ylim = c(0, 1.3))
## text(x = 2, y = 1.2, labels ="a)", cex = 1.2)
@
Sometimes we will use variables denoted as $U$ (or $U$ with subindices): these
are unobserved variables.
% Total, direct and indirect effect
\activities Draw a DAG for the wine, diet, cardiovascular health example
discussed before.
\section{An introductory example of adjusting for covariates that are common
causes: Cholesterol, Exercise, Age}\label{chol-exercise}
Suppose we sample subjects from a population where, as people get older, they
both exercise more and have higher cholesterol levels. At the same time, for a
given age, the more people exercise, the lower their cholesterol level. Given a
sample of data where we have collected age, exercise patterns, and cholesterol
levels, how should we analyze the data? (This example is taken from chapter 1,
pp.\ 3 to 5, of \citealp{pearl_causal_2016}). The relationships between the
variables are shown in Figure \ref{fig:plot_dag_1}.
Here I simulate some data that follow the above relationships.
<<simul_1, results='hide', echo = TRUE>>=
N <- 1e4
################## Common_cause
common_cause <- data.frame(Age = rnorm(N, 30, 5))
common_cause$Exercise <- 2 * common_cause$Age + rnorm(N)
common_cause$Cholesterol <- 3 * common_cause$Age -
common_cause$Exercise +
rnorm(N)
@
Before turning the page, think about what kind of relationship you expect between
Cholesterol and Exercise in the whole population, and between Cholesterol and
Exercise for people of age 10, for people of age 20, \ldots
\clearpage
The process above generates data that look like this:
\begin{figure}[h!]
\centering
\begin{minipage}{0.45\textwidth}
\centering
\includegraphics[width=0.9\textwidth]{chol-ex1-crop}% first figure itself
% \caption{first figure}
\end{minipage}\hfill
\begin{minipage}{0.45\textwidth}
\centering
\includegraphics[width=0.9\textwidth]{chol-ex2-crop} %second figure itself
% \caption{second figure}
\end{minipage}
% \includegraphics[width=0.50\paperwidth,keepaspectratio]{chol-ex1.pdf}
% \includegraphics[width=0.50\paperwidth,keepaspectratio]{chol-ex2.pdf}
\caption{\label{fig:chol-exercise-pearl-et-al} Relationships between
Cholesterol and Exercise, by age and over the complete population. From
\citet{pearl_causal_2016}, Figures 1.1 and 1.2 (which are the same as
Figure 6.6 in \citealp{pearl2018}; a preview of chapter 1 of
\citealp{pearl_causal_2016} is available from Pearl's page: \Burl{http://bayes.cs.ucla.edu/PRIMER/}). }
\end{figure}
\clearpage
Let's analyze the data with and without adjusting for Age.
<<code_1, results='show', echo = TRUE>>=
m_common_cause_adjust <- lm(Cholesterol ~ Exercise + Age,
data = common_cause)
m_common_cause_no_adjust <- lm(Cholesterol ~ Exercise,
data = common_cause)
## Function "S" is from the car library
S(m_common_cause_adjust)
S(m_common_cause_no_adjust)
@
We must adjust (or control) for Age, the common cause of Exercise and
Cholesterol: if we don't adjust for Age, the estimate of the effect of Exercise on
Cholesterol is confounded by Age having effects on both Exercise and
Cholesterol\footnote{When dealing with categorical variables, this pattern,
where the association of two variables changes, or even reverts its sign, when
we account for another, is also called Simpson's paradox.}. Age is a
\textbf{confounder}. And notice that ``confounder'' is something we define with respect to measuring causal effects, so the causal perspective is essential for identifying (and correcting for) confounders\footnote{Pearl elaborates on this at length, when he discusses the so-called ``Simpson's paradox'': it only seems paradoxical when viewed purely as a statistical phenomenon, without causal reasoning.}.
% : there is \textbf{confounding} due to Age in the estimate of
% the effect of Exercise on Cholesterol.
% Or, unless we adjust for Age, there will be \textbf{confounding}: the
% true causal relationship (or its absence) between Exercise and Cholesterol will
% be confounded by Cholesterol and Exercise both having Age as a common cause.
%% FIXME: maybe this lesson after the one on chi-squares?
\section{Should we always adjust for covariates? Fungus and the post-treatment variable}
\label{fungus-ex}
The following example is modified from \citet[][pp.\
170--174]{rethinking_2020}. An experiment is conducted to assess the effects of an
antifungal treatment (Treatment) on plant final size (Size). The amount of fungus
(a post-treatment variable, Fungus) is also measured. Since plots varied in
quality, in ways that could affect final plant Size, a variable (or variables)
Plot measures the plot quality (pretreatment, though this is irrelevant since this
variable, or variables, are not affected by treatment). The DAG is shown in Fig.~\ref{fig:plot_fungus_1}.
<<dag_fungus_1, results='hide', echo=FALSE>>=
fungus1 <- dagitty("dag {
Treatment -> Fungus
Fungus -> Size
Plot -> Size
}")
@
<<plot_fungus_1, fig.width = 4, fig.height=1.4, fig.lp='fig:', results = 'hide', echo=FALSE, fig.align = 'center', fig.cap='Fungus, first example',out.width='9cm', fig.pos="!h">>=
coordinates(fungus1) <- list(x = c(Treatment = 1, Fungus = 2,
Plot = 2.5,
Size = 3),
y = c(Treatment = -0.5, Fungus = 0,
Plot = -1,
Size = 0))
drawdag(fungus1, xlim = c(0.5, 3.5), ylim = c(-.1, 1.3))
## text(x = 2, y = 1.2, labels ="a)", cex = 1.2)
@
%% Write R code for this!
We definitely want to adjust for Plot to reduce the variability in our
estimates.
What about Fungus: no, we do not want to adjust for it, as adjusting
for Fungus would actually prevent us from estimating the effect of Treatment:
Treatment affects plant size through Fungus. Once we know about Fungus, Treatment
says nothing about Size\footnote{Size is conditionally independent of Treatment
given Fungus, $Size \upmodels Treatment | Fungus$. The $\upmodels$ symbol
means independence. You can ignore this notation if it does not help you.
You might also read the expression ``Fungus screens-off Size from Treatment'': \(P(S|F \& T) = P(S|F) \ne P(S|T)\), where \(S, F, T\) denote Size, Fungus, Treatment, respectively; the ``screening-off'' expression is common in the philosophical literature.}.
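A quick simulation can make this concrete. This is only a sketch: the coefficient values below are arbitrary, chosen for illustration, and are not taken from \citet{rethinking_2020}; the DAG is the one in Figure \ref{fig:plot_fungus_1}, where Treatment affects Size only through Fungus.
<<simul_fungus1, results='show', echo=TRUE>>=
## Simulate data following Figure \ref{fig:plot_fungus_1}:
## Treatment reduces Fungus; Fungus reduces Size; Plot affects Size.
## Coefficient values are arbitrary, for illustration only.
N <- 1e4
fungus_d <- data.frame(Treatment = rbinom(N, 1, 0.5),
                       Plot = rnorm(N))
fungus_d$Fungus <- 2 - 1.5 * fungus_d$Treatment + rnorm(N)
fungus_d$Size <- 10 - fungus_d$Fungus + 0.5 * fungus_d$Plot + rnorm(N)
## Without Fungus: Treatment coefficient estimates the total effect,
## about (-1.5) * (-1) = 1.5
round(coef(lm(Size ~ Treatment + Plot, data = fungus_d)), 2)
## Adjusting for Fungus closes the pipe: Treatment coefficient near 0
round(coef(lm(Size ~ Treatment + Fungus + Plot, data = fungus_d)), 2)
@
Without Fungus in the model, the Treatment coefficient recovers the total effect; once Fungus is included, the Treatment coefficient is essentially zero, even though Treatment really works.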
Even if the true relationship was as shown in Figure \ref{fig:plot_fungus_2} we would not
want to use Fungus as a covariate. Here Treatment is not independent of Size
given Fungus, but Fungus mediates in the relationship. If we added Fungus in the
statistical model, we would not be estimating the total effect of Treatment on
Size (if you want more details, go to Appendix, \qref{direct-indirect}).
<<dag_fungus_2, results='hide', echo=FALSE>>=
fungus2 <- dagitty("dag {
Treatment -> Fungus
Treatment -> Size
Fungus -> Size
Plot -> Size
}")
@
<<plot_fungus_2, fig.width = 4, fig.height=1.4, fig.lp='fig:', results = 'hide', echo=FALSE, fig.align = 'center', fig.cap='Fungus, second example',out.width='9cm',fig.pos="!h">>=
coordinates(fungus2) <- list(x = c(Treatment = 1, Fungus = 2,
Plot = 2.5,
Size = 3),
y = c(Treatment = -0.5, Fungus = 0,
Plot = -1,
Size = 0))
drawdag(fungus2, xlim = c(0.5, 3.5), ylim = c(-.1, 1.3))
## text(x = 2, y = 1.2, labels ="a)", cex = 1.2)
@
\section{More fungus: a collider}\label{fungus-collider}
This example is modified from \citet[][p.\ 175]{rethinking_2020}. Suppose
now that Fungus does not affect final Size. But both Fungus and Size are
affected by moisture (moisture and Plot quality are different
variables). Moreover, and this is crucial, moisture has not been measured, so
there is no way for you to adjust for it, and it is shown as U in the DAG below,
Figure \ref{fig:fungus_collider}.
<<dag_fungus_3, results='hide', echo=FALSE>>=
fungus3 <- dagitty("dag {
Treatment -> Fungus
U -> Fungus
U -> Size
Plot -> Size
}")
@
<<fungus_collider, fig.width = 4, fig.height=1.4, fig.lp='fig:', results = 'hide', echo=FALSE, fig.align = 'center', fig.cap='Fungus, collider example',out.width='9cm',fig.pos="!h">>=
coordinates(fungus3) <- list(x = c(Treatment = 1,
Fungus = 2,
Plot = 1,
U = 1,
Size = 3),
y = c(Treatment = -1.5,
Fungus = -1,
Plot = 0,
U = -0.5,
Size = 0))
drawdag(fungus3, xlim = c(0.5, 3.5), ylim = c(-.1, 1.6))
## text(x = 2, y = 1.2, labels ="a)", cex = 1.2)
@
In Figure \ref{fig:fungus_collider}, Fungus is a descendant of both U and
Treatment. This is called a \textbf{collider}. Conditioning on Fungus will lead
to Treatment and moisture (U) being associated, even when they are really
independent of each other. And that would lead to our mistakenly estimating that
Treatment has an effect on Size (U affects Size and U is associated with
Treatment when we condition on Fungus), when Treatment really does not have an
effect on Size.
OK, this is getting complicated. Let us see the three basic DAG structures,
and a few derived ones.
(But before we leave this example: what should we have done? Include in the model
Treatment, which is the variable we are interested in, and also Plot, to decrease
the variability of the estimates. If we could have measured U, we would have
adjusted for it, but we cannot; leaving U out of the model increases
the variability, but does not lead to bias. The bad idea was adjusting for
Fungus.)
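Again, a simulation may help; the coefficient values are made up, for illustration only, following the DAG of Figure \ref{fig:fungus_collider}, where Treatment has no effect on Size.
<<simul_fungus_collider, results='show', echo=TRUE>>=
## Simulate data following Figure \ref{fig:fungus_collider}:
## Treatment does NOT affect Size; unmeasured U affects both
## Fungus and Size. Coefficients are arbitrary, for illustration only.
N <- 1e4
fc <- data.frame(Treatment = rbinom(N, 1, 0.5),
                 Plot = rnorm(N),
                 U = rnorm(N))
fc$Fungus <- 2 - 1.5 * fc$Treatment + fc$U + rnorm(N)
fc$Size <- 10 + fc$U + 0.5 * fc$Plot + rnorm(N)
## Correct model: Treatment coefficient near 0, as it should be
round(coef(lm(Size ~ Treatment + Plot, data = fc)), 2)
## Conditioning on the collider Fungus: a spurious, non-zero
## Treatment "effect" appears
round(coef(lm(Size ~ Treatment + Fungus + Plot, data = fc)), 2)
@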
\clearpage
\section{Basic DAG structures}\label{dag-structs}
\subsection{The three basic DAG structures}\label{three-basic-structs}
The basic DAG structures with their names\footnote{This is fairly standard material; you can find it in,
for example, Figure 3.3 in \citet{morgan_counterfactuals_2015}, section 6.3 in
\citet{hernan2020}, sections 2.2. and 2.3 in \citet{pearl_causal_2016}, Figure
8.1 in \citet{kline2015}.} are presented in Figure
\ref{fig:plot_dag_struct}.
<<dag_struct, results='hide', echo=FALSE>>=
chain <- dagitty("dag {
X -> Z
Z -> Y
}")
fork <- dagitty("dag {
Z -> X
Z -> Y
}")
collider <- dagitty("dag {
X -> Z
Y -> Z
}")
@
<<plot_dag_struct, fig.width = 5, fig.height=2.5, fig.lp='fig:', results = 'hide', echo=FALSE, fig.align = 'center', fig.cap='Basic DAG structures',out.width='14cm',fig.pos="!h">>=
op <- par(mfrow = c(1, 3))
coordinates(chain) <- list(x = c(X = 1,
Y = 3,
Z = 2),
y = c(X = 0,
Y = 0,
Z = 0))
coordinates(fork) <- list(x = c(X = 1,
Y = 3,
Z = 2),
y = c(X = 0,
Y = 0,
Z = -.50))
coordinates(collider) <- list(x = c(X = 1,
Y = 3,
Z = 2),
y = c(X = -.5,
Y = -.5,
Z = 0))
drawdag(chain, xlim = c(0.5, 3.5), ylim = c(-1, 1.3))
text(x = 2, y = -.4, labels ="a) Chain.\n Z mediates", cex = 1.1)
drawdag(fork, xlim = c(0.5, 3.5), ylim = c(-1, 1.3))
text(x = 2, y = -.4, labels ="b) Fork.\n Z common cause", cex = 1.1)
drawdag(collider, xlim = c(0.5, 3.5), ylim = c(-1, 1.3))
text(x = 2.1, y = -.4, labels ="c) Inverted fork with collider.\n Z common effect: collider", cex = 1.1)
@
Each of those structures determines the pattern of association between variables.
For example, in the Chain case, we know that X causes Z that causes Y. Those are
the causal relationships. Now, what \textbf{associations} will we observe? You
can think of the DAGs as pipes, and association flows (or not) through these
pipes (see \citealp[][ch.~6]{hernan2020}; in Miguel Hern\'an's edX
course ``Causal Diagrams: Draw Your Assumptions Before Your Conclusions''\\
\Burl{https://www.edx.org/course/causal-diagrams-draw-your-assumptions-before-your},
you can see animations of this process). Both a mediator and a common cause, if
conditioned upon, block the flow of information because they close the pipe; in
contrast, a collider, if conditioned upon, opens the pipe. Make sure to play with the pipe analogy, applying it to the figures and thinking about the effect of closing an imaginary tap at Z.
Let us reword the above. In the chain example in Figure \ref{fig:plot_dag_struct} a),
if we do not condition on Z, the pipe is open, so X and Y are
associated\footnote{Strictly: ``are very likely associated'' as it could happen
that they are not, but this would be rare.}. But if we condition on Z, we close
the pipe, the flow of association, and now X and Y are conditionally independent
given Z. The same thing happens in Figure \ref{fig:plot_dag_struct} b). But the
opposite thing happens in Figure \ref{fig:plot_dag_struct} c).
More systematically:
\begin{enumerate}[label=\alph*)]
\item Chain: X and Y are associated. But conditioning on Z will render X and Y
independent (there will be no association). The conditional and unconditional
dependencies are (you can ignore this if it does not help you):
$X \upmodels Y \ | \ Z$, \quad $X \nupmodels Y$.
\item Fork: X and Y are associated. But conditioning on Z will render X and Y
independent (there will be no association). $X \upmodels Y\ | \ Z$, \quad
$X \nupmodels Y$. %%$X \cancel{\upmodels} Y$
This is the usual example of \textbf{confounding}, where we see association
because of a common cause.
\item Inverted fork with collider: X and Y are independent. But conditioning on Z
will make X and Y associated\footnote{If these were numerical variables and we
assume a linear model, they would be correlated. Association and
non-independence are more general concepts than correlation.}:
$X \nupmodels Y\ | \ Z$, \quad $X \upmodels Y$.
\end{enumerate}
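We can check these (in)dependencies with the \texttt{dagitty} package loaded above: its \texttt{dseparated} function tests d-separation, which corresponds to the conditional independencies implied by a DAG (this reuses the \texttt{chain}, \texttt{fork}, and \texttt{collider} objects defined for Figure \ref{fig:plot_dag_struct}).
<<dsep_basic_structs, results='show', echo=TRUE>>=
## d-separation checks with dagitty, using the DAGs defined above.
## TRUE means d-separated (independent); FALSE means d-connected.
## Chain and fork: associated marginally, independent given Z
dseparated(chain, "X", "Y", list())
dseparated(chain, "X", "Y", "Z")
dseparated(fork, "X", "Y", list())
dseparated(fork, "X", "Y", "Z")
## Collider: the opposite pattern
dseparated(collider, "X", "Y", list())
dseparated(collider, "X", "Y", "Z")
@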
More about \textbf{confounding}. In Figure \ref{fig:plot_dag_struct} b) we must adjust for Z, or condition on Z, to
avoid confounding the estimate of the relationship between X and Y. It is the
presence of confounding that is behind the warning that ``correlation is not
necessarily causation'' \citep[][p.\ 58]{westreich2019}. (For more details about
confounding see chapter 7 in \citealp{hernan2020}; there is a brief discussion in
section 3.5.1, pp.\ 58 and ff., of \citealp{westreich2019}.)
%% ZZ FIXME : give the example of Y must compensate whatever X has
And more about \textbf{colliders}. The consequences of the last structure, Figure \ref{fig:plot_dag_struct} c), where
we have a collider, sometimes seem counterintuitive. This structure is behind
``Berkson's paradox''\footnote{
\url{https://en.wikipedia.org/wiki/Berkson\%27s_paradox}, and it might explain
``Why are handsome men such jerks?'' ---and, I guess, a similar phenomenon with
women:
\url{http://www.slate.com/blogs/how_not_to_be_wrong/2014/06/03/berkson_s_fallacy_why_are_handsome_men_such_jerks.html}
.}. A simple example: suppose there is no association between bone fracture
and pneumonia in the general population. But if you only look at people who go to
the emergency room in a hospital, you are likely to find a negative association
between pneumonia and bone fracture (think about why people go to hospitals
---there must be some reason to be in the hospital to begin with, and either
severe pneumonia or a broken bone are enough to take you there).
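A quick simulation of the emergency-room example; the probabilities are made up, for illustration only.
<<berkson_sim, results='show', echo=TRUE>>=
## A sketch of Berkson's paradox: fracture and pneumonia are
## independent in the population; either can send you to the hospital.
## All probabilities are arbitrary, for illustration only.
N <- 1e5
fracture <- rbinom(N, 1, 0.1)
pneumonia <- rbinom(N, 1, 0.1)
hospital <- (fracture == 1) | (pneumonia == 1) | (runif(N) < 0.05)
## Essentially zero association in the whole population:
round(cor(fracture, pneumonia), 3)
## Negative association among the hospitalized
## (conditioning on the collider):
round(cor(fracture[hospital], pneumonia[hospital]), 3)
@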
Another way to understand it: suppose you only look at a value, or small set of
values, of Z (that is what conditioning is); now, if X has any value, the value
of Y has to compensate the value of X, so that the value of Z is the one you
conditioned on.
If these examples are not clear, look at the Wikipedia entry linked in the
footnotes (or the ``Why are handsome men such jerks'', linked in the footnote
too). Conditioning (and restricting) on colliders leads to \textbf{selection
bias} (see chapters 7 and 8 in \citealp{hernan2020}, and a brief account in
section 3.5.3, pp.\ 64--66, in \citealp{westreich2019}).
\activities Think of at least one example for each of the structures
above. \textbf{Really, do it}. Ideally, think of two examples, one from
``everyday life'' and one from your scientific work/TFM/etc.
\subsection{Terminology: ``condition on'', ``given'', ``adjusting for''}\label{condition-given}
The following three expressions are equivalent, and you will find them in the literature:
\begin{itemize}
\item X and Y are independent \textit{given} Z.
\item X and Y are independent \textit{if we condition on} Z (or
\textit{conditioning on} Z).
\item X and Y are independent \textit{if we adjust for} Z (or \textit{adjusting
for} Z).\footnote{X and Y are independent \textit{if we hold Z constant} can
be equivalent, though this is not as common as the above expressions with
``condition on'', ``given'', ``adjusting'', and we would need to be more
precise about the meaning of \textit{holding constant}: are we really,
physically holding Z constant via intervention, or adjusting statistically
for it, as in conditioning?}
\end{itemize}
\subsection{Descendants and ancestors in the basic DAG structures}\label{modified-basic}
To make sure we understand how the flow of association is blocked or unblocked,
let us add some descendants and ancestors to the previous structures and see what
happens. If you want, you can read this section quickly and skip over all
the ``in addition'' lists. The \textbf{most important structure to pay attention
to is f)}. The rest are here for completeness; following them only requires
applying the rules we have explained above.
We have modified the basic structures as follows:
\begin{itemize}
\item In the top row, we have added a descendant of Z
\item In the bottom row, we have added an ancestor of Z
\end{itemize}
<<dag_struct2, results='hide', echo=FALSE>>=
c1 <- dagitty("dag {
X -> Z
Z -> Y
Z -> W
}")
c2 <- dagitty("dag {
X -> Z
Z -> Y
W -> Z
}")
f1 <- dagitty("dag {
Z -> X
Z -> Y
Z -> W
}")
f2 <- dagitty("dag {
Z -> X
Z -> Y
W -> Z
}")
co1 <- dagitty("dag {
X -> Z
Y -> Z
Z -> W
}")
co2 <- dagitty("dag {
X -> Z
Y -> Z
W -> Z
}")
@
<<plot_dag_struct2, fig.width = 5, fig.height=4.5, fig.lp='fig:', results = 'hide', echo=FALSE, fig.align = 'center', fig.cap='Descendants and ancestors in the basic DAG structures',out.width='12cm'>>=
op <- par(mfrow = c(2, 3))
coordinates(c1) <- list(x = c(X = 1,
Y = 3,
Z = 2,
W = 2),
y = c(X = 0,
Y = 0,
Z = 0,
W = -0.5))
coordinates(c2) <- list(x = c(X = 1,
Y = 3,
Z = 2,
W = 2),
y = c(X = 0,
Y = 0,
Z = 0,
W = -.5))
coordinates(f1) <- list(x = c(X = 1,
Y = 3,
Z = 2,
W = 2),
y = c(X = 0,
Y = 0,
Z = -.50,
W = -1))
coordinates(f2) <- list(x = c(X = 1,
Y = 3,
Z = 2,
W = 2),
y = c(X = 0,
Y = 0,
Z = -.50,
W = -1))
coordinates(co1) <- list(x = c(X = 1,
Y = 3,
Z = 2,
W = 2),
y = c(X = -1,
Y = -1,
Z = -.5,
W = 0))
coordinates(co2) <- list(x = c(X = 1,
Y = 3,
Z = 2,
W = 2),
y = c(X = -1,
Y = -1,
Z = -0.5,
W = 0))
drawdag(c1, xlim = c(0.5, 3.5), ylim = c(-1, 1.3))
text(x = 2, y = -.4, labels ="d)", cex = 1.4)
drawdag(f1, xlim = c(0.5, 3.5), ylim = c(-1, 1.3))
text(x = 2, y = -.4, labels ="e)", cex = 1.4)
drawdag(co1, xlim = c(0.5, 3.5), ylim = c(-1, 1.3))
text(x = 2, y = -.4, labels ="f)", cex = 1.4)
drawdag(c2, xlim = c(0.5, 3.5), ylim = c(-1, 1.3))
text(x = 2, y = -.4, labels ="g)", cex = 1.4)
drawdag(f2, xlim = c(0.5, 3.5), ylim = c(-1, 1.3))
text(x = 2, y = -.4, labels ="h)", cex = 1.4)
drawdag(co2, xlim = c(0.5, 3.5), ylim = c(-1, 1.3))
text(x = 2, y = -.4, labels ="i)", cex = 1.4)
@
These are the consequences:
\begin{enumerate}[label=\alph*)]
\setcounter{enumi}{3}
\item Conditioning on W does not make X and Y independent, so X and Y are still
associated if you condition on W\footnote{If you cannot
measure Z, but you can measure W, and W and Z are very strongly associated,
conditioning on W can help you remove some bias if you need to make X and Y
conditionally independent. This is more advanced material.}. Why? Because the flow through Z has not been
interrupted.
In addition (but this is not new):
\begin{enumerate}[label=\arabic*.]
\item X and W are associated.
\item Y and W are associated.
\item X and W are independent if we condition on Z.
\item Y and W are independent if we condition on Z.
\end{enumerate}
% $X \nupmodels Y | W$.
\item As above: conditioning on W does not make X and Y independent.
In addition (but this is not new):
\begin{enumerate}[label=\arabic*.]
\item X and W are associated.
\item Y and W are associated.
\item X and W are independent if we condition on Z.
\item Y and W are independent if we condition on Z.
\end{enumerate}
% $X
% \nupmodels Y | W$.
\item \textbf{Pay attention here}: conditioning on W will make X and Y
associated. This is the general rule: \textbf{conditioning on a collider} (as
in c)) or \textbf{a descendant of a collider will make the ancestors
associated}. Why? Think about the hospital, broken bones and pneumonia;
instead of looking at people at the door of the hospital you look at people
downstream (e.g., people who have been admitted to the hospital). Or think again about the pipes: put a tap on W, and close it.
In addition (but this is not new):
\begin{enumerate}[label=\arabic*.]
\item X and W are associated.
\item Y and W are associated.
\item X and W are independent if we condition on Z.
\item Y and W are independent if we condition on Z.
\end{enumerate}
% $X
% \nupmodels Y | W$ (and $X
% \nupmodels Y | Z$, \quad $X \upmodels
% Y$).
\item Conditioning on W will not render X and Y independent. Notice also that \textbf{now Z is
a collider with respect to X and W}. So conditioning on Z will induce an
association between X and W.
In addition:
\begin{enumerate}[label=\arabic*.]
\item X and W are independent.
\item Y and W are associated.
\item X and W are associated if we condition on Z (we just said this).
\item Y and W are independent if we condition on Z.
\end{enumerate}
\item Conditioning on W will not render X and Y independent.
In addition:
\begin{enumerate}[label=\arabic*.]
\item X and W are associated.
\item Y and W are associated.
\item X and W are independent if we condition on Z.
\item Y and W are independent if we condition on Z.
\end{enumerate}
\item Now Z is a collider with respect to all three variables X, Y, W.
\begin{enumerate}[label=\arabic*.]
\item X and Y are independent (we saw this before).
\item X and W are independent.
\item Y and W are independent.
\item X and W are associated if we condition on Z (Z is a collider).
\item Y and W are associated if we condition on Z (Z is a collider).
\item X and Y are associated if we condition on Z (Z is a collider), as it was before.
\end{enumerate}
\end{enumerate}
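All of the (in)dependence claims above can be verified mechanically with
dagitty's \texttt{dseparated} function, which tests d-separation. A brief check
on structure f) (the object \texttt{co1} defined in the hidden chunk above):

<<check_dsep, echo=TRUE>>=
## Structure f): X -> Z <- Y, with W a descendant of the collider Z
dseparated(co1, "X", "Y")        ## TRUE: X and Y are marginally independent
dseparated(co1, "X", "Y", "Z")   ## FALSE: conditioning on the collider
dseparated(co1, "X", "Y", "W")   ## FALSE: conditioning on a descendant of
                                 ## the collider also opens the path
@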
\activities Go back to the examples (\qref{chol-exercise}, \qref{fungus-ex},
\qref{fungus-collider}): it should now be clear why we should/should not adjust
for the different variables (i.e., pay attention to whether they are common
causes, mediators, or colliders).
\clearpage
\section{A systematic procedure to find what variables to condition on. An
overview of the backdoor criterion}
\label{backdoor-basics}
% \subsection{The backdoor criterion: basics}
Is there a systematic way to find what variables we should condition on when we
want to estimate the causal effect of X on Y, and avoid being confounded? Yes,
there is. The key ideas are:
\begin{itemize}
\item Block all paths that induce a non-causal association between X and Y (such
as those created by common ancestors).
\item Do not block directed paths from X to Y (the legitimate causal
connections); blocking them would ``rob'' some of the true causal flow.
\item Do not create associations between X and Y by conditioning on colliders or
descendants of colliders.
\end{itemize}
And this can be done using a systematic procedure. In the appendix, \qref{backdoor-crit}, I give the
procedure in full detail.
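The dagitty package implements this search: \texttt{adjustmentSets} returns the
sets of covariates that satisfy the backdoor criterion. A minimal sketch on a
simple confounded DAG (this DAG is illustrative, not one from the figures
above):

<<backdoor_example, echo=TRUE>>=
## Z is a common cause of X and Y (a confounder); X -> Y is the causal path
g_conf <- dagitty("dag {
Z -> X
Z -> Y
X -> Y
}")
adjustmentSets(g_conf, exposure = "X", outcome = "Y")  ## the set { Z }
@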
% \subsection{Variance and bias, or random and systematic error}
% Westreich, pp. 24 and 25. (relationship to notions of validity and precision,
% though I prefer to use bias and variance)
% Bias: ``sesgo'', in Spanish
\clearpage
\section{Additional examples}\label{add-examples}
\subsection{A pretreatment collider variable we should not adjust on}\label{pretreat-no}
Suppose the relationships in Figure \ref{fig:plot_dag_ptr} (from Figure 18.6,
section 18.2, p.\ 227 in \citealp{hernan2020}), and you are interested in the
effect of X on Y ($U_1$ and $U_2$ are two different unmeasured variables). Should
you adjust for L, which is a pretreatment variable?
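As a check (this chunk is a sketch; it rebuilds the same DAG under another name
so as not to interfere with the figure code below), \texttt{adjustmentSets}
confirms that no adjustment is needed: the only backdoor path,
X $\leftarrow$ $U_2$ $\rightarrow$ L $\leftarrow$ $U_1$ $\rightarrow$ Y, is
already blocked at the collider L.

<<ptr_check, echo=TRUE>>=
ptr_check <- dagitty("dag {
X -> Y
U_1 -> L
U_2 -> L
U_1 -> Y
U_2 -> X
}")
latents(ptr_check) <- c("U_1", "U_2")  ## U_1 and U_2 are unmeasured
adjustmentSets(ptr_check, exposure = "X", outcome = "Y")
## Returns the empty set: adjust for nothing; adjusting for L
## would open the collider path
@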
<<dag_ptr, results='hide', echo=FALSE>>=
ptr <- dagitty("dag {
X -> Y
U_1 -> L
U_2 -> L
U_1 -> Y
U_2 -> X
}")
@
<<plot_dag_ptr, fig.width = 4, fig.height=2.73, fig.lp='fig:', results = 'hide', echo=FALSE, fig.align = 'center', fig.cap='A pretreatment collider.',out.width='9cm', fig.pos= "!h">>=
coordinates(ptr) <- list(x = c(L = 1, X = 2, Y = 3, U_1 = 0.8, U_2 = 0.8),
y = c(L = 0, X = 0, Y = 0, U_1 = -.5, U_2 = .5))
drawdag(ptr, xlim = c(0.5, 3.5), ylim = c(-.51, .6))
## text(x = 2, y = 1.2, labels ="a)", cex = 1.2)