---
title: "Reproducible science <br>using ![Rlogo](../../img/slides/Rlogo-small.png)<br><br>"
author: Thibaut Jombart
date: "2019-11-19"
output: ioslides_presentation
---
```{r setup, include=FALSE}
## This code defines the 'verbatim' option for chunks
## which will include the chunk with its header and the
## trailing "```".
require(knitr)
hook_source_def = knit_hooks$get('source')
knit_hooks$set(source = function(x, options){
if (!is.null(options$verbatim) && options$verbatim){
opts = gsub(",\\s*verbatim\\s*=\\s*TRUE\\s*.*$", "", options$params.src)
bef = sprintf('\n\n ```{r %s}\n', opts)
stringr::str_c(bef, paste(knitr:::indent_block(x, " "), collapse = '\n'), "\n ```\n")
} else {
hook_source_def(x, options)
}
})
```
# On reproducibility
## What is reproducibility in science?
<center>
<img src="../../img/slides/printing-press.jpg" width="60%">
</center>
<br>
> - ability of a peer to reproduce results
> - requires <font color="#99004d">data</font>, <font color="#99004d">methods</font>, and <font color="#99004d">procedures</font>
> - increasingly, science is supposed to be reproducible
## Why does it not happen, in practice?
Some opinions on whether reproducibility is needed:
> - *Ideally, yes but we don't have time for this.*
> - *If it gets published, yes.*
> - *If it gets published, yes; unless it is in PLoS One...*
> - *No need: I work on my own.*
> - *For others to copy us? You crazy?!*
> - *No way! We rigged the data, the method does not work, and we ran the analyses in Excel.*
## Main obstacles to reproducibility {.columns-2}
<center><img src="../../img/slides/wecandoit.jpg" width="65%"></center>
> - lack of time: ultimately, reproducibility is faster
> - fear of plagiarism: low risks in practice
> - internal work, no need to share: almost never true
<br>
> - one good reason: <font color="#99004d">lack of tools to facilitate reproducibility</font>
## You never work alone
<center>
<img src="../../img/slides/looper.jpg" width="85%">
<br>
Be nice to your future selves!
</center>
## Two aspects of reproducibility using <img src="../../img/slides/Rlogo-small.png" width="50px">
<center>
<img src="../../img/slides/2pills.jpg" width="85%">
</center>
<br>
> - implementing methods as <img src="../../img/slides/Rlogo-small.png" width="30px"> packages
> - making <font color="#99004d">transparent</font> and <font color="#99004d">reproducible</font> analyses
# <img src="../../img/slides/Rlogo.png" width="50px">eproducibility in practice
## Literate programming
<center>
<img src="../../img/slides/knuth.jpg" width="55%">
</center>
> *Let us change our traditional attitude to the construction of programs: instead
of imagining that our main task is to instruct a computer what to do, let us
concentrate rather on <font color="#99004d">explaining to humans what we want
the computer to do</font>.* (Donald E. Knuth, Literate Programming,
1984)
## A data-centred approach to programming
<center>
<img src="../../img/slides/literate-prog.png" width="85%">
</center>
## Literate programming in <img src="../../img/slides/Rlogo.png" width="50px">
Current workflows use the following equation:
**markdown** (`.md`) + <img src="../../img/slides/Rlogo.png" width="40px"> =
<font color="#99004d"> **Rmarkdown** </font> (`.Rmd`)
<br><br>Example:<br>
`knitr::knit2html("foo.Rmd")` $\rightarrow$ `foo.html`<br>
`rmarkdown::render("foo.Rmd", output_format = "pdf_document")` $\rightarrow$ `foo.pdf`<br>
`rmarkdown::render("foo.Rmd", output_format = "word_document")` $\rightarrow$ `foo.docx`<br>
`...`
## **Rmarkdown**: <img src="../../img/slides/Rlogo.png" width="50px"> chunks in markdown {.smaller}
```{r chunk-title, ..., verbatim = TRUE, eval = FALSE}
a <- rnorm(1000)
hist(a, col = terrain.colors(15), border = "white", main = "Normal distribution")
```
results in:
```{r rmarkdown, out.width = "80%", fig.width = 12, echo = c(2,3)}
set.seed(1)
a <- rnorm(1000)
hist(a, col = terrain.colors(15), border = "white", main = "Normal distribution")
```
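## Why fix the random seed? {.smaller}

The histogram on the previous slide is built from random draws, so it only looks identical on every render if the random seed is fixed first. A minimal sketch in base R, where the seed value `1` is arbitrary:

```r
## Without a seed, two draws give different numbers
a1 <- rnorm(5)
a2 <- rnorm(5)
identical(a1, a2)

## Resetting the same seed before each draw makes them repeatable
set.seed(1)
b1 <- rnorm(5)
set.seed(1)
b2 <- rnorm(5)
identical(b1, b2)
```

Setting the seed once, near the top of a report, is usually enough for the whole document to give the same results on every run.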
## Formatting outputs
```{r another-chunk-title, ..., verbatim = TRUE, eval = FALSE}
[some R code here]
```
where `...` are options for processing and formatting, e.g.:
- `eval` (`TRUE`/`FALSE`): evaluate code?
- `echo` (`TRUE`/`FALSE`): show code input?
- `results` (`"markup"/"hide"/"asis"`): show/format code output
- `message/warning/error`: show messages, warnings, errors?
- `cache` (`TRUE`/`FALSE`): cache analyses?
<br>
See [http://yihui.name/knitr/options](http://yihui.name/knitr/options) for details on all options.
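## Setting options for all chunks {.smaller}

Rather than repeating the same options in every chunk header, defaults can be set once with `knitr::opts_chunk$set()`, typically in a setup chunk at the top of the document. The values below are only one possible choice, not a recommendation:

```r
library(knitr)

## Defaults applied to every chunk that follows;
## a chunk can still override them in its own header
opts_chunk$set(
  echo = TRUE,      # show the code
  message = FALSE,  # hide messages
  warning = FALSE,  # hide warnings
  cache = FALSE     # re-run analyses on every render
)

## Check one of the current defaults
opts_chunk$get("echo")
```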
## One format, several outputs
**`rmarkdown`** can generate different types of documents:
- standardised reports (`html`, `pdf`)
- journal articles, using the `rticles` package (`.pdf`)
- Tufte handouts (`.pdf`)
- Word documents (`.docx`)
- slides for presentations (`html`, `pdf`)
- ...
See: [http://rmarkdown.rstudio.com/gallery.html](http://rmarkdown.rstudio.com/gallery.html).
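## One source, several outputs: a sketch {.smaller}

As a rough illustration of the list above, the same `.Rmd` file can be passed to `rmarkdown::render()` with different `output_format` values. The file written here is a hypothetical stand-in; the sketch assumes `rmarkdown` and pandoc are installed, and `pdf_document` would additionally need a LaTeX installation:

```r
## Write a tiny, hypothetical R Markdown file to a temporary folder
rmd_file <- file.path(tempdir(), "foo.Rmd")
writeLines(
  c("---",
    "title: \"A toy example\"",
    "---",
    "",
    "This is a toy document."),
  rmd_file
)

## Render the same source to several formats
formats <- c("html_document", "word_document")
rendered <- character(0)
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  for (fmt in formats) {
    out <- rmarkdown::render(rmd_file, output_format = fmt, quiet = TRUE)
    rendered <- c(rendered, out)
  }
}
basename(rendered)
```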
## **`rmarkdown`**: toy example 1/2 {.smaller}
Let us consider the file `foo.Rmd`:
<pre><code>
---
title: "A toy example of rmarkdown"
author: "John Snow"
date: "`r Sys.Date()`"
output: html_document
---
This is some nice R code:
</code></pre>
```{r rnorm-example, verbatim = TRUE, eval = FALSE, echo = 2:4}
set.seed(1)
x <- rnorm(100)
x[1:6]
hist(x, col = "grey", border = "white")
```
## **`rmarkdown`**: toy example 2/2 {.smaller}
```{r toy-rmd, eval = FALSE}
rmarkdown::render("foo.Rmd")
```
<center>
<img src="../../img/slides/rmarkdown-toy.png" width="70%">
</center>
# Good practices
## **`rmarkdown`** is just the beginning {.columns-2}
<center>
<img src="../../img/slides/tablets.png" width="90%">
</center>
<br>
> - alter your original data
> - have a messy project
> - write non-portable code
> - write horrible code
> - lose work permanently
## How to treat your original data
<center>
<img src="../../img/slides/gold.jpg" width="50%">
</center>
> - **do not touch your original data**
> - save it as <font color="#99004d">read-only</font>
> - <font color="#99004d">make copies</font> - you can play with these
> - <font color="#99004d">track the changes</font> made to the original data
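## Protecting original data: a sketch {.smaller}

The rules above can be approximated in base R. This is a minimal sketch: the folder layout and the file name `linelist.csv` are hypothetical, and a checksum is used here as one simple way to detect changes to the original file.

```r
## Hypothetical raw data file, created here only for illustration
project_dir <- tempfile("project_")
dir.create(file.path(project_dir, "data", "raw"), recursive = TRUE)
raw_file <- file.path(project_dir, "data", "raw", "linelist.csv")
write.csv(data.frame(id = 1:3, age = c(34, 12, 57)), raw_file,
          row.names = FALSE)

## Make the original read-only
Sys.chmod(raw_file, mode = "0444")

## Record a checksum so that any later change can be detected
raw_md5 <- tools::md5sum(raw_file)

## Work on a copy; copy.mode = FALSE so the copy is not read-only too
dir.create(file.path(project_dir, "data", "clean"))
work_file <- file.path(project_dir, "data", "clean", "linelist_clean.csv")
file.copy(raw_file, work_file, copy.mode = FALSE)

## Later: confirm that the original is untouched
identical(unname(tools::md5sum(raw_file)), unname(raw_md5))
```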
## How to avoid messy projects
<center>
<img src="../../img/slides/messy-office.jpg" width="50%">
</center>
> - **1 project = 1 folder**
> - subfolders for: data, analyses, figures, manuscripts, ...
> - document the project using a `README` file
> - use RStudio projects (if you use RStudio)
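## A project skeleton: a sketch {.smaller}

The layout above can be created in a few lines of base R. The project name and the README wording are hypothetical, and the folder is created in a temporary location only so that the sketch runs anywhere:

```r
## Hypothetical project root; replace with your own folder
project_root <- file.path(tempdir(), "my_project")

## One subfolder per type of content
subfolders <- c("data", "analyses", "figures", "manuscripts")
for (folder in subfolders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

## A short README documenting the project
readme_file <- file.path(project_root, "README.md")
writeLines(
  c("my_project",
    "",
    "- data/: original data (read-only) and working copies",
    "- analyses/: Rmd reports",
    "- figures/: figures produced by the reports",
    "- manuscripts/: drafts and submissions"),
  readme_file
)

list.files(project_root)
```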
## How to write portable code?
<center>
<img src="../../img/slides/communication.png" width="50%">
</center>
> - avoid absolute paths e.g.:<br>
`my_file <- "C:\\project1\\data\\data.csv"`<br>
> - use the package <font color="#99004d">`here`</font> for portable paths e.g.:<br>
`my_file <- here("data/data.csv")`
> - avoid special characters and spaces in all names e.g.:<br> `éèçêäÏ*%~!?&`
> - assume case sensitivity: <br>`FooBar` $\neq$ `foobar` $\neq$ `FOOBAR`
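## Portable paths and names: a sketch {.smaller}

The points above can be illustrated with base R alone. `file.path()` builds a relative path from its components, in the same spirit as `here::here("data", "data.csv")`. The helper `is_portable_name()` is hypothetical, and the set of characters it accepts is one conservative choice rather than a standard:

```r
## Build paths from components instead of hard-coding separators
my_file <- file.path("data", "data.csv")
my_file

## Hypothetical helper: TRUE for names made only of
## unaccented letters, digits, '_', '-' and '.'
is_portable_name <- function(x) {
  grepl("^[A-Za-z0-9._-]+$", x, perl = TRUE)
}

## Only the first name passes: the others contain a space
## or an accented character
is_portable_name(c("data_2019.csv", "my data.csv", "donn\u00e9es.csv"))
```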
## How to write better code?
<center>
<img src="../../img/slides/readable.jpg" width="50%">
</center>
> - name things explicitly
> - settle for one <font color="#99004d">naming convention</font>; `snake_case` is currently recommended for <img src="../../img/slides/Rlogo.png" width="40px"> packages
> - document your code using <font color="#99004d">comments</font> (`##`)
> - write <font color="#99004d">simple code</font>, in short sections
> - use current coding standards -- see the <font color="#99004d">`lintr`</font> package
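## Better code: a before / after sketch {.smaller}

To make the advice above concrete, here is the same computation written twice. Both functions are hypothetical and exist only for illustration. The `lintr` check at the end runs only if the package is installed, and uses the default linters.

```r
## Harder to read: cryptic names, no comments
f <- function(x, y) x / y * 100

## Easier to read: explicit snake_case names and a short comment
compute_percentage <- function(n_cases, n_total) {
  ## proportion of cases, expressed as a percentage
  n_cases / n_total * 100
}

compute_percentage(n_cases = 5, n_total = 20)

## Optional style check with lintr, if installed
if (requireNamespace("lintr", quietly = TRUE)) {
  script <- tempfile(fileext = ".R")
  writeLines("myVariable = 1", script)
  print(lintr::lint(script))
}
```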
## Example of `lintr`
<center>
<img src="../../img/slides/lintr.png" width="80%"><br>
<small>source: [https://github.com/jimhester/lintr](https://github.com/jimhester/lintr)</small>
</center>
## Structuring analysis reports: question-driven report
<div style="float: left; width: 60%;">
<img src="../../img/slides/report_question_driven.png" width="100%">
</div>
<div style="float: left; width: 40%;">
<br>
> - organised by questions / analysis topics
> - <font color="#99004d">pros</font>: better narrative
> - <font color="#99004d">cons</font>: harder code to follow / review
</div>
## Structuring analysis reports: code-driven report
<div style="float: left; width: 60%;">
<img src="../../img/slides/report_code_driven.png" width="100%">
</div>
<div style="float: left; width: 40%;">
<br>
> - organised by type of code
> - <font color="#99004d">pros</font>: easier to read / review code
> - <font color="#99004d">cons</font>: narrative harder to follow
</div>
## Structuring analysis reports: hybrid report
<div style="float: left; width: 60%;">
<img src="../../img/slides/report_hybrid.png" width="100%">
</div>
<div style="float: left; width: 40%;">
> - differentiates **infrastructure** *vs* **analysis** code
> - makes question-specific code *simple*, and *repetitive*
> - <font color="#99004d">pros</font>: narrative and code easier to read
> - <font color="#99004d">cons</font>: harder to design (need frequent re-factoring)
</div>
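## Hybrid report: a sketch {.smaller}

One way to picture the split between infrastructure and analysis code: a helper is defined once, then each question needs only one short call. The helper `tabulate_column()` and the toy `linelist` data are hypothetical.

```r
## --- Infrastructure code: written once, near the top of the report ---

## Hypothetical helper: tabulate a column as counts and percentages
tabulate_column <- function(data, column) {
  counts <- table(data[[column]], useNA = "ifany")
  data.frame(
    value = names(counts),
    n = as.integer(counts),
    percent = round(100 * as.integer(counts) / sum(counts), 1)
  )
}

## Toy data standing in for the real dataset
linelist <- data.frame(
  sex = c("f", "m", "f", "f"),
  outcome = c("recovered", "died", "recovered", "recovered")
)

## --- Analysis code: one short, repetitive call per question ---

## How are cases distributed by sex?
tabulate_column(linelist, "sex")

## What are the outcomes?
tabulate_column(linelist, "outcome")
```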
## Do not lose your work!
Because you never know what can happen...
<center>
<img src="../../img/slides/smashing-panda.gif" width="50%">
</center>
## How to avoid losing work?
<center>
<img src="../../img/slides/lost.jpg" width="40%">
</center>
> - **never rely on a single computer** to store your work
> - <font color="#99004d">backups</font> are good, <font color="#99004d">syncing</font> with a server is better (e.g. Dropbox)
> - use <font color="#99004d">version numbers</font> to track progress
> - use <a href="https://github.com/reconhub/reportfactory"><font color="#99004d">reportfactory</font></a> for repeated analysis updates
> - use <font color="#99004d">version control systems</font> (e.g. Git) for serious
coding projects
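## Keeping successive versions: a sketch {.smaller}

A lightweight complement to the tools above is to stamp each output with a date, so that a new version never overwrites an older one. Using dates instead of version numbers is an assumption made here for simplicity, and the helper `stamp_file()` is hypothetical:

```r
## Hypothetical helper: add a date stamp to a file name
stamp_file <- function(name, date = Sys.Date()) {
  ext <- tools::file_ext(name)
  base <- tools::file_path_sans_ext(name)
  sprintf("%s_%s.%s", base, format(date, "%Y-%m-%d"), ext)
}

## With a fixed date, this gives "report_2019-11-19.html"
stamp_file("report.html", as.Date("2019-11-19"))
```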
## Going further
<center>
<img src="../../img/slides/road.jpg" width="70%">
</center>
<br>
> - check our <a href="https://github.com/reconhub/guides"><font color="#99004d">golden rules</font></a> for writing analysis reports
> - use <a href="https://github.com/reconhub/report_factories_templates"><font color="#99004d">report factory templates</font></a> as starting points
> - use <a href="https://r4epis.netlify.com"><font color="#99004d">R4epis templates</font></a> as starting points
##
<br>
<center>
<img src="../../img/slides/the-end.jpg" width="100%">
</center>