-
Notifications
You must be signed in to change notification settings - Fork 0
/
sw-talk.Rmd
304 lines (243 loc) · 10.7 KB
/
sw-talk.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
---
title: "Stop Writing Bad Science Code"
author: "Robert Link"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
revealjs::revealjs_presentation:
transition: none
theme: night
highlight: zenburn
center: false
self_contained: false
reveal_plugins: ["notes"]
reveal_options:
slideNumber: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
##
<img src="figs/rube-goldberg.png" height=600>
<aside class="notes">
First step is to make sure you have code. Even simple data prep and
analysis should be implemented in software.
</aside>
## The Big Pile of Scripts
```R
rm(list = ls())
library('ggplot2')
library('reshape2')
library('ggthemes')
library('dplyr')
srcdir <- dirname(sys.frame(1)$ofile)
source(file.path(srcdir,'food-demand.R'))
source(file.path(srcdir,'food-demand-plots.R'))
source(file.path(srcdir,'mcpar-analysis.R'))
## read the observational data
print('Reading data.')
datadir <- normalizePath(file.path(srcdir, '../../data'))
setwd(datadir)
alldata <- read.csv('food-dmnd-price-allrgn.csv')
## read the mc results: we assume that the data has all been gzipped
setwd('../runs')
mcrslt.all <- read.mc.data('mc-food-dmnd.allrgn.dat.gz')
```
## The Big Pile of Scripts
```R
## find the maximum likelihood parameter sets
pv.all <- mcparam.ML(mcrslt.all) # vector for passing to mc.likelihood
p.all <- mc2param(pv.all) # parameter struct for passing to food.dmnd
pp.all <- mcrslt.all[which.max(mcrslt.all$LL),] # data frame for pretty-printing
### plots that require just the observational data
print('Running: obs plots (output = hist.sigma)')
## histogram of sigma values
pltdata <- mutate(alldata, Nonstaples=sqrt(sig2Qn), Staples=sqrt(sig2Qs)) %>%
select(GCAM_region_name, year, Nonstaples, Staples) %>%
melt(id=c('GCAM_region_name', 'year'))
hist.sigma <- ggplot(data=pltdata, aes(x=value)) +
facet_grid(variable ~ .) + geom_histogram(binwidth=0.02) + theme_minimal() +
ggtitle('Imputed sigma values') + xlab('sigma (1000 cal/person/day)')
ggsave('hist-sigma-v23.png')
```
## The Big Pile of Scripts (Aftermath)
```
> ls()
[1] "alldata" "apply.bc.rgn" "apply.bias.corrections" "calc.elas.actual"
[5] "calc.hicks.actual" "calc1eps" "calc1q" "chisq.all"
[9] "chisq.rgn" "chisq.rgn.trn" "chisq.yr" "chisq.yr.trn"
[13] "compute.bc.rgn" "compute.bias.corrections" "datadir" "eta.constant"
[17] "eta.n" "eta.s" "fit.with.err" "food.dmnd"
[21] "food.dmnd.byincome" "food.dmnd.byyear" "hist.sigma" "lamks2epsy0"
[25] "make.byincome.plot" "make.byyear.plot" "make.demand.plot" "make.paper1.mc.plots"
[29] "make.paper1.obs.plots" "make.paper1.param.plots" "mc.chunksize" "mc.eval.fd.likelihood"
[33] "mc.food.dmnd.byyear" "mc.likelihood" "mc.likelihood.1" "mc.logfile"
[37] "mc.make.byyear.plot" "mc.obsdata" "mc.setup" "mc.splitvec"
[41] "mc2param" "mcparam.clip.tails" "mcparam.density" "mcparam.itercount"
[45] "mcparam.ML" "mcparam.sample" "mcrslt.all" "mcrslt.rgn"
[49] "mcrslt.yr" "merge.trn.tst" "namemc" "p.all"
[53] "p.rgn" "p.yr" "paper1.bc.plots" "paper1.chisq"
[57] "paper1.fix.notation" "paper1.gen.residual.data" "paper1.residual.analysis" "paper1.rmse.all"
[61] "path" "plot.elas" "plot.hicks" "plot.qty"
[65] "plts.bc" "plts.mc.all" "plts.mc.rgn" "plts.mc.yr"
[69] "plts.param.all" "Pm.vals" "pmax8" "pmax9"
[73] "pmin8" "pmin9" "Pn.vals" "pnscl"
[77] "pp.all" "pp.rgn" "pp.yr" "prepare.obs"
[81] "process.gcam.data" "Ps.vals" "psscl" "pv.all"
[85] "pv.rgn" "pv.yr" "read.mc.data" "report.param.values"
[89] "resid.stats" "rmse.all" "samp.params" "simplify.region"
[93] "srcdir" "validate.params" "vec2param" "x0"
[97] "x1" "xval.rgn.trn" "xval.rgn.tst" "xval.yr.trn"
[101] "xval.yr.tst" "y.vals"
```
<aside class="notes">
What are the problems here?
* confusing
* side effects
* not reuseable
* awkward workflow
* non-portable
* not documented
</aside>
## A Good Analysis Script
```R
run_analysis <- function(inputfile) {. . .}
make_plots <- function(results) {. . .}
## Run analysis. Results are a named list with the following elements:
## (. . .)
results <- run_analysis('input-data.csv')
## Make common plots. Plots are returned in a list with the following elements:
## (. . .)
plots <- make_plots(results)
```
<aside class="notes">
Minimal standard that fixes the worst problems.
* Somewhat portable
* Creates two functions and two objects
* No side effects
* Explains what the results are
Still has problems.
* Not completely portable
* Inflexible workflow
* Not reuseable
BTW, much of this advice is really hard to put into practice in low-level languages like C++.
</aside>
## Use Your Language's Package System
- R
+ _Writing R Packages_ ([http://r-pkgs.had.co.nz/](http://r-pkgs.had.co.nz/))
- Python
+ _How To Package Your Python Code_ <br/> ([https://python-packaging.readthedocs.io](https://python-packaging.readthedocs.io))
- C++ (and other compiled languages)
+ CMake ([https://cmake.org](https://cmake.org)) ...**or**...
## Use Your Language's Package System
- R
+ _Writing R Packages_ ([http://r-pkgs.had.co.nz/](http://r-pkgs.had.co.nz/))
- Python
+ _How To Package Your Python Code_ <br/> ([https://python-packaging.readthedocs.io](https://python-packaging.readthedocs.io))
- C++ (and other compiled languages)
+ CMake ([https://cmake.org](https://cmake.org)) ...**or**...
+ Why not try Rcpp? ([http://www.rcpp.org/](http://www.rcpp.org/))
<aside class="notes">
What does packaging do for you?
* True portability
* Package namespace
* System knows where to find supporting files
* Makes a lot of the other advice easier to follow
</aside>
## Refactor Your Code
```R
read_and_clean_data <- function(filename) {. . .}
calc_frobnitz_coef <- function(frobdata) {. . .}
make_sankey_diagram <- function(pltdata) {. . .}
```
<aside class="notes">
Each function should do _one_ conceptually unified thing.
</aside>
## Refactor Your Packages
* expt1-data
* my-utilities
* my-analysis
* expt1-paper
## Use a Version Control System
<aside class="notes">
What version control does for you:
* Reproducibility: can recover (and record!) the exact version of code/data used in a run
* Rollback: good for comparisons, bug hunting, and reverting to known good state
* History: *when* and *why* you did the things you did
* Coordination: branching for parallel development
</aside>
## Use a Version Control System
...and learn how to use it well
[
<img src="figs/git_2x.png" alt="Don't be this guy!" height=400 style="float: right">
](https://xkcd.com/1597)
* **Git** <br> _Pro Git_ (https://git-scm.com/book)
* **Mercurial** <br>_Hg Init: A Mercurial Tutorial_ (http://hginit.com)
<aside class="notes">
Git, especially, has a reputation for being difficult to learn. Do not be put off by the ghost
stories. The scientific knowledge you have mastered is almost certainly more complex than that
needed to use git skillfully.
</aside>
## Develop Good Documentation
```R
#' Calculate the Frobnitz coefficient
#'
#' This version of the Frobnitz calculation follows the treatment in ...
#'
#' There are three possible modes for the calculation. The 'normal' mode ...
#'
#' @param bflux Vector of bogon flux values in counts/m2
#' @param mode String indicating the mode (see details). The default is 'normal'
#' @return Vector of Frobnitz coefficients (unitless)
calc_frob <- function(bflux, mode = 'normal') {
...
}
```
Use a documentation generator:
* **R**: roxygen2 (http://roxygen.org)
* **Python**: sphinx (http://www.sphinx-doc.org)
<aside class="notes">
Another example: if you have an argument called "file", tell the user whether it's a string (file name) or
a file handle.
</aside>
## Write a Test Suite
Testing science codes is hard because often you don't know the correct answer.
```sh
#!/usr/bin/env bash
# Run the basic RCPs
$HECTOR input/hector_rcp45.ini
$HECTOR input/hector_rcp85.ini
# Make sure the model handles year changes
sed 's/startDate=1745/startDate=1740/' input/hector_rcp45.ini > input/hector_rcp45_time.ini
$HECTOR input/hector_rcp45_time.ini
# Turn on the constraint settings one by one and run the model
# CO2
sed 's/;Ca_constrain=csv:constraints\/lawdome_co2.csv/Ca_constrain=csv:constraints\/lawdome_co2.csv/' input/hector_rcp45.ini > input/hector_rcp45_co2.ini
$HECTOR input/hector_rcp45_co2.ini
```
* Verify that the code runs in all of its modes.
* Where possible, compare to reference output.
* Generate a set of diagnostic figures.
* Thoroughly test utility functions.
## Use a Profiler to Figure Out Why Your Code is Slow
<img src="figs/profile.png" height=500>
## Miscellaneous Tips
* Use a public repository
+ GitHub (git only)
+ BitBucket (git or mercurial)
* Practice frequent code review
* Critique yourself retrospectively
* Do periodic spring cleaning
* Use good judgement
<aside class="notes">
* GitHub and BitBucket
+ ... provide tools that make following a lot of the other advice easier
+ ... make collaboration easier
+ People worry too much about getting "scooped" by making their code public. Your competitors almost certainly
have better things to do than scraping GitHub for project ideas.
* Start your code reviews early. A reviewer looking at 100 lines of code will likely do a thorough job. A reviewer
looking at 1000 lines of code will get bogged down understanding the code and focus on trivia.
* Spring cleaning: especially, don't keep obsolete code/data around "just in case". That sort of cruft has a way
of making it into production results unintentionally.
</aside>
## Slides and Notes
https://rplzzz.github.io/sw-talk