-
Notifications
You must be signed in to change notification settings - Fork 1
/
stats-analyses-basic.qmd
997 lines (828 loc) · 38 KB
/
stats-analyses-basic.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
# A general approach to doing statistical analyses {#sec-stats-analyses-basic}
```{r setup}
#| include: false
source(here::here("R/functions.R"))
extract_functions_from_qmd()
source(here::here("R/project-functions.R"))
library(tidyverse)
library(tidymodels)
lipidomics <- read_csv(here::here("data/lipidomics.csv"))
```
Running statistical analyses is a relatively methodical and well-defined
process, even if there is often a lot of trial and error involved.
Sometimes it may feel overwhelming and complicated, which it definitely
can be, but it doesn't have to be. By taking a structured approach to
running statistical analyses, you can make it easier on yourself and
feel more in control.
In R, statistical methods are often created and developed by researchers
with little to no training in software development and who often use
them differently. This has some advantages, like having the cutting edge
statistical methods available to us, but has a major disadvantage of
often having to learn a completely different way of running a
statistical analysis, even if it is fairly similar to ones you've used
before. So having a framework for running statistical analyses,
regardless of who created them, can provide that needed structure and
vastly simplify the analysis. This session will be covering a general
framework for running statistical analyses, regardless of the exact
statistical method.
## Learning objectives
The overall objective for this session is to:
1. Describe the basic framework underlying most statistical analyses
and use R to generate statistical results using this framework.
More specific objectives are to:
1. Describe the general "workflow" and steps involved in stating the
research question, constructing a model to help answer the question,
preparing the data to match the requirements of the model, fitting
it to the data, and finally extracting the results from the fitted
model.
2. Categorize the model definition step as a distinct, theory-driven
step, separate from the data, and use `{parsnip}` functions to help
with defining your model.
3. Identify various data transformation techniques and evaluate which
are good options given the data. Use functions in the `{recipes}`
package to apply these transformations to your data.
4. Use the `{broom}` package to extract the model results to later
present them in graphical format (with `{ggplot2}`).
5. Continue applying the concepts and functions used from the previous
sessions.
Specific "anti"-objectives:
- Will **not** know how to choose and apply the appropriate
statistical model or test, nor understand any statistical theory,
nor interpret the statistical results correctly, nor determine the
relevant data transformations for the statistical approach. What we
show, we show *only as demonstration purposes only*, they could be
entirely wrong in how to do them correctly if an expert were to
review them.
::: {.callout-warning appearance="default"}
We will be making *a lot* of function throughout this session and the
next. This is just a fair warning!
:::
## Exercise: What does a "model" mean?
> Time: \~6 minutes.
In science and especially statistics, we talk a lot about "models". But
what does model actually mean? What different types of definitions can
you think of? Is there a different understanding of model in statistics
compared to other areas?
1. Take 1 minute to think about your understanding of a model.
2. Then, over the next 3 minutes, discuss with your neighbour about the
meaning of "model" and see if you can come to a shared
understanding.
3. Finally, over the next 2 minutes, we will share all together what a
model is in the context of data analysis.
## Theory and "workflow" on statistical modeling
::: {.callout-note collapse="true"}
## Instructor note
Let them read it over than go over it briefly, focusing on what a model
is, that we should create research questions while thinking in the
framework of models, and the general workflow for doing statistical
analyses.
:::
<!-- TODO: Create a revealjs presentation on this? -->
::: callout-note
## Reading task: \~15 minutes
Almost all fields of research, and definitely more heavily-quantitative
and scientific fields like biomedicine and health, have math and
statistics at the core of taking data to draw inferences or general
observations about the world.
Any time we collect data and need to interpret what it means, we need
statistics. And anytime we want to make inferences about the world from
the data, we need to use statistics to determine the likelihood, or
rather the uncertainty, in those inferences. Statistics is meant to
quantify uncertainty.
How do we quantify uncertainty? By first creating a "theoretical model"
that expresses mathematically our research question. For instance, we
have a theoretical model that outdoor plants grow (non-linearly) with
water and sunlight, but that more sunlight likely means less water (less
rain). While *any* research question could be translated into a
theoretical model, not all theoretical models can be *tested* against
the real world. And that's where the second part comes in: Our
theoretical model needs to be structured in a way that allows us to
measure the items ("parameters") in our model, so that we can build a
mathematical model from the data ("test it in the real world").
```{mermaid fig-model-plant-growth}
%%| label: fig-model-plant-growth
%%| fig-cap: Simple example of a theoretical model of plant growth.
%%| echo: false
%%| eval: true
%%{init:{'flowchart':{'nodeSpacing': 20, 'rankSpacing': 30}}}%%
graph LR
Sunlight --> Growth
Water --> Growth
Sunlight --- Water
linkStyle 0,1,2 stroke-width:1px;
```
Great research questions are designed in a way to fit a theoretical
model on measurable parameters, so we can ultimately quantify the
uncertainty in our observations (the data) and in the model. And the
basic simplified math of a statistical model looks mostly the same:
$$y = intercept + x + error $$
So if we use the example of plant growth, it would look like:
$$Growth = Sunlight + Water$$
It's a bit more complicated than this, but this is enough to describe it
for this course. Throughout the rest of the session, we use specific
terms to describe each item in this formula. "Outcome" (also called
"dependent variable") refers to the $y$, "predictor" (or "independent
variable") refers to the $x$. The error and intercept are calculated for
us when we fit the model to the data. The intercept is when x is equal
to zero (the "y-intercept" on a plot). The error is the difference
between what the model estimates and the real value. It plays a role in
quantifying the model's uncertainty.
Considering the mathematical nature of statistical models, there is also
a logic and "workflow" to making these models!
1. Write a research question, designed in a way that might look like
the diagram above. Usually this step needs to be revisited, revising
the question after trying to construct the theoretical model, that
describes the measurable (and unmeasurable) parameters in the model,
and vice versa.
2. Based on the model and the type of measured parameters used
(continuous or binary), select the best mathematical model "type".
Nearly all models in statistics start from the base of a linear
regression (e.g. ANOVA is a special form of regression, t-test is a
simpler version of ANOVA), so the model "type" will probably be a
form of regression.
3. Measure your parameters (in the plant growth example, that might be
the amount of water given in liters per day, amount of plant growth
in weight, and amount of sunlight in hrs per day). Usually, this
measured data need to be processed in a special way to fit the
specifics of the model, research question, and practices of the
field.
4. Fit the data to the theoretical model in order to estimate the
values ("coefficients") of the model parameters as well as the
uncertainty in those values.
5. Extract the values and their uncertainty from the model and present
them in relation to your research questions.
```{mermaid fig-model-building-workflow}
%%| label: fig-model-building-workflow
%%| fig-cap: Simple schematic of the workflow for conducting statistical analysis.
%%| echo: false
%%| eval: true
%%{init:{'flowchart':{'nodeSpacing': 40, 'rankSpacing': 20}}}%%
graph TD
A[Research question] --> B((Statistical model))
B --> A
B --> C[Data collection]
C --> D[Data transformation]
D --> E[Scientific output]
E --> A
linkStyle 0,1,2,3,4,5 stroke-width:1px;
```
::: {.callout-caution appearance="default"}
The entire workflow for building statistical models requires highly
specific domain knowledge on not only the statistics themselves, but
also how the data was collected, what the values mean, what type of
research questions to ask and how to ask them, how to interpret the
results from the models, and how to process the data to fit the question
and model.
For instance, in our `lipidomics` dataset, if we were to actually use
this data, we would need someone familiar with -omic technologies, how
the data are measured, what the values actually mean, how to prepare
them for the modeling, the specific modeling methods used for this
field, and how we would actually interpret the findings. We have
**none** of these things, so very likely we are doing things quite wrong
here. We're only doing this modeling to highlight how to use the R
packages.
:::
Going back to our own `lipidomics` dataset, we need to do the first
step: Creating the question. While we don't have much data, there are a
surprising number of questions we could ask. But we will keep it very
simple, very basic, and very exploratory.
1. What is the estimated relationship of each metabolite with T1D
compared to the controls, adjusting for the influence of age and
gender?
Next, because we are working within a "reproducible analysis" framework
specifically with the use of `{targets}`, let's convert this question
into outputs to include as pipeline targets, along with a basic idea for
the final functions that will make up these targets and their inputs and
outputs. These targets will probably be quite different by the end, but
it's a good start to think about what it should look like in the end.
- All results for estimated relationships (in case we want to use it
for other output)
- Plot of statistical estimate for each relationship
Potential function names might be:
- `calculate_estimates()`
- `plot_estimates()`
```{mermaid fig-model-possible-targets}
%%| label: fig-model-possible-targets
%%| fig-cap: Potential inputs, outputs, and functions for the targets pipeline.
%%| echo: false
%%| eval: true
%%{init:{'flowchart':{'nodeSpacing': 20, 'rankSpacing': 30}}}%%
graph TB
lipidomics -- "calculate_estimates()" --> model_est[Model estimates]
model_est -- "plot_estimates()" --> plot_est[Plot of estimates]
plot_est -- "tar_read()" --> qmd[Quarto]
linkStyle 0,1,2 stroke-width:1px;
```
:::
::: {.callout-note collapse="true"}
## Instructor note
A few things to repeat and reinforce:
1. The workflow of the image and that it all starts with the research
question.
2. The fact that almost all statistical methods are basically special
forms of linear regression.
3. That this model creation stage requires a variety of domain
expertise, not just statistical expertise.
Also repeat the question to ask, the outputs, as well as the functions
and their flow that we can translate into the `{targets}` pipeline.
:::
## Defining the model
::: {.callout-note collapse="true"}
## Instructor note
Let them read it first and then briefly verbally walk through this
section, describing the theoretical model both graphically and
mathematically. Go through why we use `{tidymodels}` rather than other
approaches.
:::
::: callout-note
## Reading task: \~10 minutes
Now that we've talked about the workflow around making models and have
already written out some research questions, let's make a basic,
graphical theoretical model:
```{mermaid fig-model-research-question}
%%| label: fig-model-research-question
%%| fig-cap: A simple theoretical model of the research question about T1D status.
%%| echo: false
%%| eval: true
%%{init:{'flowchart':{'nodeSpacing': 20, 'rankSpacing':30}, 'themeVariables': { 'edgeLabelBackground': 'transparent'}}}%%
graph TB
Metabolite --> T1D
Age & Gender --> T1D & Metabolite
linkStyle 0,1,2,3,4 stroke-width:1px;
```
Or mathematically:
$$T1D = metabolite + age + gender$$
So, T1D status (or `class` in the `lipidomics` dataset) is our
**outcome** and the individual metabolite, age, and gender are our
**predictors**. Technically, age and gender would be "confounders" or
"covariates", since we include them only because we think they influence
the relationship between the metabolite and T1D.
If we convert the formula into a form with the variables we have in the
dataset as well as selecting only one metabolite for now (the
cholesterol metabolite, which we add "metabolite" to differentiate it
from other potential variables), it would be:
$$class = metabolite\_cholesterol + age + gender$$
Now that we have a theoretical model, we need to choose our model type.
Since T1D is binary (either you have it or you don't), the most likely
choice is logistic regression, which requires a binary outcome variable.
So we have the theoretical model and the type of model to use - how do
we express this as code in R? There are many ways of doing the same
thing in R, but some are a bit easier than others. One such approach,
that is quite generic and fits with the ideals of the `{tidyverse}`, is
a similar universe of packages called the `{tidymodels}`.
Why do we teach `{tidymodels}`? Because they are built by software
developers, employed by Posit (who also employs the people who build the
`{tidyverse}` and RStudio), and they have a strong reputation for
writing good documentation. Plus, the `{tidymodels}` set of packages
also make creating and using models quite generic, so by teaching you
these sets of tools, you can relatively easily change the model type, or
how you process the data, or other specifications without having to
learn a whole new package or set of tools.
The reason `{tidymodels}` can do that is because they designed it in a
way that makes a clear separation in the components of the model
building workflow that was described above, through the use of specific
packages for each component.
| Package | Description |
|---------------|---------------------------------------------------------------------------------------------------|
| `{parsnip}` | Model definition, such as type (e.g. `linear_reg()`) and "engine" (e.g. `glm()`). |
| `{recipes}` | Model-specific data transformations, such as removing missing values, or standardizing the data. |
| `{workflows}` | Combining model definition, data, and transformations to calculate the estimates and uncertainty. |
: Core packages within `{tidymodels}`.
We'll start with the `{parsnip}` package. Functions in this package are
used to set the details of the model you want to use. Specifically,
[functions](https://parsnip.tidymodels.org/reference/index.html#models)
to indicate the model *"type"* (e.g. linear regression) and the
`set_engines()` function to determine the "engine" to run the type
(which R-based algorithm to use, like `glm()` compared to `lm()`). Check
out the
[Examples](https://parsnip.tidymodels.org/articles/Examples.html) page
for code you might use depending on the model you want. The most
commonly used model types would be `linear_reg()`, `logistic_reg()`, and
`multinom_reg()`.
:::
Let's switch to working in the `doc/learning.qmd` file to create those
logistic regression models. Since we've already created some items in
the `{targets}` pipeline, we'll need to tell the Quarto file where to
find this "store" of outputs, and import the data using `tar_read()`. We
also need to add `library(tidymodels)` to the `setup` code chunk. Copy
and paste this code chunk below into the Quarto file.
```{{r setup}}
targets::tar_config_set(store = here::here("_targets"))
library(tidyverse)
library(targets)
library(tidymodels)
source(here::here("R/functions.R"))
lipidomics <- tar_read(lipidomics)
```
Since we will be using `{tidymodels}`, we need to install it, as well as
explicitly add the `{parsnip}`, `{recipes}`, and `{workflows}` packages.
Like `{tidyverse}`, we need to set `{tidymodels}` differently because it
is a "meta-package". We might need to force installing it with
`pak::pak("tidymodels")` so `{renv}` recognizes it.
```{r tidymodels-to-deps}
#| purl: true
#| eval: false
use_package("tidymodels", "depends")
# pak::pak("tidymodels")
use_package("parsnip")
use_package("recipes")
use_package("workflows")
```
Before continuing, let's **commit** the changes to the Git history with
{{< var keybind.git >}}. Next, in the `doc/learning.qmd` file, on the
bottom of the document create a new header and code chunk:
````
## Building the model
```{{r}}
```
````
In the new code chunk, we will set up the model specs:
```{r logistic-reg-specs}
log_reg_specs <- logistic_reg() %>%
set_engine("glm")
log_reg_specs
```
Running this on it's own doesn't show much, as you can see. But we've
now set the model we want to use.
## Data transformations specific to modeling
Setting the model type was pretty easy right? The more difficult part
comes next with the data transformations. `{recipes}` functions are
almost entirely used to apply transformations that a model might
specifically need, like mean-centering, removing missing values, and
other aspects of data processing.
Let's consider our `lipidomics` dataset. In order for us to start our
statistical analysis, we need the data to be structured in a certain way
to be able to smoothly use it as input in our model. We have at least
three easy observations on necessary transformations of the data, two of
which can be fixed with a single `{tidyr}` function, while the third one
can be fixed with `{recipes}`. Can you spot them?
```{r print-lipidomics-data}
#| column: page-inset-right
lipidomics
```
::: {.callout-note collapse="true"}
## Instructor note
Ask them if they can spot these differences. Give them a few minutes to
think and respond.
:::
The first observation isn't always an issue and depends heavily on the
model type you use. Since we are using logistic regression, the model
assumes that each row is an individual person. But our data is in the
long format, so each person has multiple rows. The second observation is
that there seems to be a data input error, since there are three
`Cholesterol` values, while all other metabolites only have one:
```{r too-many-cholesterols}
lipidomics %>%
count(code, metabolite) %>%
filter(n > 1)
```
We can fix both the long format and multiple cholesterol issues by using
`tidyr::pivot_wider()`. Before we do, the last issue is that each
metabolite has quite large differences in the values and ranges of data.
Again, whether this is an issue depends on what we want to do, but in
our research question we want to know how each metabolite influences
T1D. In order to best interpret the results and compare across
metabolites, we should ideally have all the metabolites with a similar
range and distribution of values.
Let's fix the first two issues first. While we probably only need to use
`pivot_wider()`, we should probably first tidy up the metabolite names
first so they make better column names. We do that by combining
`mutate()` with `snakecase::to_snake_case()`. In the `doc/learning.qmd`
file, we rename the metabolite names before using `pivot_wider()`. Since
we want an easy way of identifying columns that are metabolites, we will
add a `"metabolite_"` prefix using the argument `names_prefix`. To
actually fix the multiple cholesterol issue, we should look more into
the data documentation or contact the authors. But for this course, we
will merge the values by calculating a mean before pivoting. We do this
by setting the `values_fn` with `mean`.
```{r lipidomic-to-wider}
#| column: page-inset-right
lipidomics_wide <- lipidomics %>%
mutate(metabolite = snakecase::to_snake_case(metabolite)) %>%
pivot_wider(
names_from = metabolite,
values_from = value,
values_fn = mean,
names_prefix = "metabolite_"
)
lipidomics_wide
```
Since we're using a function-oriented workflow and since we will be
using this code again later on, let's convert both the "metabolite to
snakecase" and "pivot to wider" code into their own functions, before
moving them over into the `R/functions.R` file.
```{r first-snakecase-fn}
#| column: page-inset-right
column_values_to_snake_case <- function(data) {
data %>%
dplyr::mutate(metabolite = snakecase::to_snake_case(metabolite))
}
lipidomics %>%
column_values_to_snake_case()
```
This on its own should work. *However*, the column we want to change
might not always be called `metabolite`, or we might want to change it
later. So, to make this function a bit more generic, we can use
something called "curly-curly" (it looks like `{{}}` when used) and
"non-standard evaluation" (NSE).
::: callout-note
## Reading task: \~10 minutes
When you write your own functions that make use of functions in the
`{tidyverse}`, you may eventually encounter an error that might not be
very easy to figure out. Here's a very simple example using `select()`,
where one of your function's arguments is to select columns:
```{r test-nse, error=TRUE}
test_nse <- function(data, columns) {
data %>%
dplyr::select(columns)
}
lipidomics %>%
test_nse(class)
```
The error occurs because of something called "[non-standard
evaluation](http://adv-r.had.co.nz/Computing-on-the-language.html)" (or
NSE). NSE is a major feature of R and is used quite a lot throughout R.
NSE is used a lot in the `{tidyverse}` packages. It's one of the first
things computer scientists complain about when they use R, because it is
not a common thing in other programming languages. But NSE is what
allows you to use formulas (e.g. `y ~ x + x2` in modeling, which we will
show shortly) or allows you to type out `select(class, age)` or
`library(purrr)`. In "standard evaluation", these would instead be
`select("Gender", "BMI")` or `library("purrr")`. So NSE gives
flexibility and ease of use for the user (we don't have to type quotes
every time) when doing data analysis, but can give some headaches when
programming in R, like when making functions. There's more detail about
this on the [dplyr
website](https://dplyr.tidyverse.org/articles/programming.html#warm-up),
which lists some options to handle NSE while programming. The easiest
approach is to wrap the argument with "curly-curly" (`{{}}`).
```{r test-nse-fixed}
test_nse <- function(data, columns) {
data %>%
dplyr::select({{ columns }})
}
lipidomics %>%
test_nse(class)
lipidomics %>%
test_nse(c(class, age))
```
:::
::: {.callout-note collapse="true"}
## Instructor note
You don't need to go over what they read, you can continue with making
the function below. Unless learners have some questions.
:::
We can use curly-curly (combined with `across()`) to apply
`snakecase::to_snake_case()` to columns of our choice.
```{r second-snakecase-fn}
#| column: page-inset-right
column_values_to_snake_case <- function(data, columns) {
data %>%
dplyr::mutate(dplyr::across({{ columns }}, snakecase::to_snake_case))
}
lipidomics %>%
column_values_to_snake_case(metabolite)
```
Move this new function over into the `R/functions.R` file, add Roxygen
documentation with {{< var keybind.roxygen >}}, style using
{{< var keybind.styler >}}, `source()` the modified `R/functions.R` file
with {{< var keybind.source >}}, and add the new function above the
`pivot_wider()` code in the `doc/learning.qmd` file.
```{r new-function-column-values-to-snakecase}
#' Convert a column's character values to snakecase format.
#'
#' @param data The lipidomics dataset.
#' @param columns The column you want to convert into the snakecase format.
#'
#' @return A data frame.
#'
column_values_to_snake_case <- function(data, columns) {
data %>%
dplyr::mutate(dplyr::across({{ columns }}, snakecase::to_snake_case))
}
```
## Exercise: Convert the pivot code into a function
> Time: \~10 minutes.
Just like with the `mutate()`, take the `pivot_wider()` code and convert
it into a new function.
1. Name the new function `metabolites_to_wider`.
2. Include one argument in the new `function()`: `data`.
3. Use `data %>%` at the beginning, like we did with the
`column_values_to_snake_case()`.
4. Use `tidyr::` before the `pivot_wider()` function.
5. Add the Roxygen documentation with {{< var keybind.roxygen >}}.
6. Move the function into the `R/functions.R` file.
7. Replace the code in the `doc/learning.qmd` file to make use of the
new functions.
```{r solution-new-function-metabolite-to-wider}
#| eval: true
#| code-fold: true
#| code-summary: "**Click for the solution**. Only click if you are struggling or are out of time."
#' Convert the metabolite long format into a wider one.
#'
#' @param data The lipidomics dataset.
#'
#' @return A wide data frame.
#'
metabolites_to_wider <- function(data) {
data %>%
tidyr::pivot_wider(
names_from = metabolite,
values_from = value,
values_fn = mean,
names_prefix = "metabolite_"
)
}
```
## Using recipes to manage transformations
We've used `{dplyr}` and `{tidyr}` to start fixing some of the issues
with the data. But we still have the third issue: How to make the
results between metabolites comparable. That's where we use `{recipes}`.
The first function is `recipe()` and it takes two forms: with or without
a formula. Remember the model formula we mentioned previously? Well,
here is where we can use it to tell `{recipes}` about the model formula
we intend to use so it knows on what variables to apply the chosen
transformations.
For a few reasons that will be clearer later, we won't ultimately use
the formula form of `recipe()`, but will show how it works. The
```{r recipes-with-formula}
recipe(class ~ metabolite_cholesterol + age + gender, data = lipidomics_wide)
```
The alternative approach is to set "roles" using `update_roles()`.
Instead of using a formula and letting `recipe()` infer the outcome and
predictors, we can explicitly select which variables are which. This has
some nice features that we will use later on.
```{r recipes-without-formula}
recipe(lipidomics_wide) %>%
update_role(metabolite_cholesterol, age, gender, new_role = "predictor") %>%
update_role(class, new_role = "outcome")
```
::: {.callout-note collapse="true"}
## Instructor note
Describe this next paragraph by switching to this website and showing
the list of some of the steps available in the code chunk below.
:::
The next "step" is to select a transformation function. There are many
`step_*` functions (some shown below) that are available in `{recipes}`
and which one you decide to use depends heavily on your data and your
research question. So take your time thinking through which
transformations you actually need for your analysis, rather than quickly
using one or more of them.
```{r list-step-transforms}
#| eval: false
recipes::step_log()
recipes::step_scale()
recipes::step_normalize()
recipes::step_center()
recipes::step_sqrt()
```
::: {.callout-tip appearance="default"}
There are so many useful transformation functions available. For
instance, if you often have to impute data, there are functions for
that. You can check them out in the Console by typing
`recipes::step_impute_` then hit the Tab key to see a list of them. Or,
if you have some missing values, there's also the
`recipes::step_naomit()`.
:::
There are many transformations we could use for the `lipidomics`
dataset, but we will use `step_normalize()` for this course. This
function is useful because it makes each variable centered to zero and a
value of 1 unit is translated to 1 standard deviation of the original
distribution. This means we can more easily compare values between
variables. We can add this to the end of the recipe:
```{r recipes-with-step-normalize}
recipe(lipidomics_wide) %>%
update_role(metabolite_cholesterol, age, gender, new_role = "predictor") %>%
update_role(class, new_role = "outcome") %>%
step_normalize(starts_with("metabolite_"))
```
Next thing to do is convert this into a function, using the same
workflow we've been using (which means this needs to be in the
`R/functions.R` script). We'll also use the curly-curly again, since we
might use a different metabolite later. Note, when adding all the
`packagename::` to each function, the `starts_with()` function comes
from the `{tidyselect}` package.
```{r new-function-create-recipe-spec}
#' A transformation recipe to pre-process the data.
#'
#' @param data The lipidomics dataset.
#' @param metabolite_variable The column of the metabolite variable.
#'
#' @return
#'
create_recipe_spec <- function(data, metabolite_variable) {
recipes::recipe(data) %>%
recipes::update_role({{ metabolite_variable }}, age, gender, new_role = "predictor") %>%
recipes::update_role(class, new_role = "outcome") %>%
recipes::step_normalize(tidyselect::starts_with("metabolite_"))
}
```
And test it out:
```{r use-create-recipe-specs-fn}
#| column: page-inset-right
recipe_specs <- lipidomics_wide %>%
create_recipe_spec(metabolite_cholesterol)
recipe_specs
```
Style using {{< var keybind.styler >}}, then commit the changes made to
the Git history with {{< var keybind.git >}}.
## Fitting the model by combining the recipe, model definition, and data {#sec-fitting-model}
We've now defined the model we want to use and created a transformation
`{recipes}` specification. Now we can start putting them together and
finally fit them to the data. This is done with the `{workflows}`
package.
Why use this package, rather than simply run the statistical analysis
and process the data as normal? When running multiple models (like we
will do in the next section) that may require different data structures,
the data transformation steps have to happen *right before* the data is
fit to the model and need to be done on exactly the data used by the
model. So if we have one data frame that we run multiple models on, but
the transformation happens to the whole data frame, we could end up with
issues due to how the transformations were applied. The `{workflows}`
package keeps track of those things for us, so we can focus on the
higher level thinking rather than on the small details of running the
models.
The `{workflows}` package has a few main functions for combining the
recipe with the model specs, as well as several for updating an existing
workflow (which might be useful if you need to run many models of
slightly different types). All model workflows need to start with
`workflow()`, followed by two main functions: `add_model()` and
`add_recipe()`. Can you guess what they do?
```{r use-workflow-for-model}
#| column: page-inset-right
workflow() %>%
add_model(log_reg_specs) %>%
add_recipe(recipe_specs)
```
While this code is already pretty concise, let's convert it into a
function to make it simplified. We'll use the same function-oriented
workflow that we've used before, where the function should ultimately be
inside the `R/functions.R` file.
```{r new-function-create-model-workflow}
#' Create a workflow object of the model and transformations.
#'
#' @param model_specs The model specs
#' @param recipe_specs The recipe specs
#'
#' @return A workflow object
#'
create_model_workflow <- function(model_specs, recipe_specs) {
workflows::workflow() %>%
workflows::add_model(model_specs) %>%
workflows::add_recipe(recipe_specs)
}
```
<!-- TODO: image showing pieces at play that we need to eventually fit into targets/questions -->
Instead of using the previously created objects, let's start the model
creation from scratch:
```{r full-model-workflow-from-almost-scratch}
#| column: page-inset-right
model_workflow <- create_model_workflow(
logistic_reg() %>%
set_engine("glm"),
lipidomics_wide %>%
create_recipe_spec(metabolite_cholesterol)
)
model_workflow
```
Now, we can do the final thing: Fitting the data to the model with
`fit()`! :raised_hands:
```{r fit-model-workflow-to-data}
#| column: page-inset-right
fitted_model <- model_workflow %>%
fit(lipidomics_wide)
fitted_model
```
This gives us a lot of information, but what we are mostly interested in
is the model estimates themselves. While this `fitted_model` object
contains a lot additional information inside, `{workflows}` thankfully
has a function to extract the information we want. In this case, it is
the `extract_fit_parsnip()` function.
```{r extract-model-fit}
#| column: page-inset-right
fitted_model %>%
extract_fit_parsnip()
```
To get this information in a tidier format, we use another function:
`tidy()`. This function comes from the `{broom}` package, which is part
of the `{tidymodels}`. But we should explicitly add it to the
dependencies:
```{r broom-to-deps}
#| purl: true
#| eval: false
use_package("broom")
```
Then, we add the `tidy()` function to our model using the `%>%` pipe.
Since we are using a logistic regression model, we need to consider how
we want the estimates to be presented, probably depending on how we want
to visualize our results. If we set `exponentiate = TRUE` in `tidy()`,
the output estimates will be odds ratios, if we set
`exponentiate = FALSE`, we will get the log odds ratios or the beta
coefficient. Here we choose `exponentiate = TRUE`:
```{r tidy-up-model-results}
#| column: page-inset-right
fitted_model %>%
extract_fit_parsnip() %>%
tidy(exponentiate = TRUE)
```
We now have a data frame of our model results! Like we did with the
`workflows()` code that we converted into a function, we do the same
thing here: Make another function (and move it to `R/functions.R`)!
:stuck_out_tongue:
```{r new-function-tidy-model-output}
#' Create a tidy output of the model results.
#'
#' @param workflow_fitted_model The model workflow object that has been fitted.
#'
#' @return A data frame.
#'
tidy_model_output <- function(workflow_fitted_model) {
workflow_fitted_model %>%
workflows::extract_fit_parsnip() %>%
broom::tidy(exponentiate = TRUE)
}
```
Replacing the code in the `doc/learning.qmd` file to use the function.
```{r use-tidy-model-output-fn}
#| column: page-inset-right
fitted_model %>%
tidy_model_output()
```
If we revise the code so it is one pipe, it would look like:
```{r single-pipe-model-results}
#| column: page-inset-right
create_model_workflow(
logistic_reg() %>%
set_engine("glm"),
lipidomics_wide %>%
create_recipe_spec(metabolite_cholesterol)
) %>%
fit(lipidomics_wide) %>%
tidy_model_output()
```
Let's briefly cover what these columns and values mean.
::: {.callout-note collapse="true"}
## Instructor note
If you want, you can go over these details briefly or in more detail,
depending on how comfortable you are. Or you can get them to read it
only.
:::
::: callout-note
## Reading task: \~10 minutes
Let's explain this output a bit, each column at a time:
- `term`: If you recall the formula
$class = metabolite + sex + gender$, you'll see all but the `class`
object there in the column `term`. This column contains all the
predictor variables, including the intercept (from the original
model).
- `estimate`: This column is the "coefficient" linked to the term in
the model. The final mathematical model here looks like:
$$ \displaylines{class = Intercept + (metabolite\_estimate \times metabolite\_value) + \\ (gender\_estimate \times gender\_value) + ...}$$
In our example, we chose to get the odds ratios. In the mathematical
model above, the estimate is represented as the log odds ratio or
beta coefficient - the constant value you multiply the value of the
term with. Interpreting each of these values can be quite tricky and
can take a surprising amount of time to conceptually break down, so
we won't do that here, since this isn't a statistics course. The
only thing you need to understand here is that the `estimate` is the
value that tells us the *magnitude* of association between the term
and `class`. This value, along with the `std.error` are the most
important values we can get from the model and we will be using them
when presenting the results.
- `std.error`: This is the uncertainty in the `estimate` value. A
higher value means there is less certainty in the value of the
`estimate`.
- `statistic`: This value is used to, essentially, calculate the
`p.value`.
- `p.value`: This is the infamous value we researchers go crazy for
and think nothing else of. While there is a lot of attention to this
single value, we tend to give it more attention than warranted. The
interpretation of the p-value is even more difficult than the
`estimate` and again, we won't cover this in this course. We won't
be using this value at all in presenting the results.
:::
Before ending, open the Git interface and commit the changes you made
with {{< var keybind.git >}}. Then push your changes up to GitHub.
## Summary
- Create research questions that (ideally) are structured in a way to
mimic how the statistical analysis will be done, preferably in a
"formula" style like $y = x1 + x2 + ... + error$ and in a diagram
style with links connecting variables.
- Statistical analyses, while requiring some trial and error, are
surprisingly structured in the workflow and steps taken. Use this
structure to help guide you in completing tasks related to running
analyses.
- Use `{parsnip}` functions to define the model you want to use, like
`logistic_reg()` for logistic regression, and set the computational
"engine" with `set_engine()`.
- Use `{recipes}` functions to set up the data transformation steps
necessary to effectively run the statistical analysis, like adding
variable "roles" (outcome vs predictor) using `update_roles()` and
adding transformation steps using any of the dozen different `step_`
functions.
- Use `{workflows}` functions to develop an analysis `workflow()` that
combines the defined model with `add_model()`, the transformation
steps with `add_recipe()`, and the data with `fit()`.
- Use `{broom}` to `tidy()` the model output, extracted using
`extract_fit_parsnip()` to get a data frame of the estimates and
standard error for the variables in the model.