search_index.json
[
["index.html", "The drake R Package User Manual Chapter 1 Introduction 1.1 The drake R package 1.2 Installation 1.3 Why drake? 1.4 With Docker 1.5 Documentation 1.6 Help and troubleshooting", " The drake R Package User Manual Will Landau, Kirill Müller, Alex Axthelm, Jasper Clarkberg, Lorenz Walthert, Ellis Hughes, Matthew Mark Strasiotto Copyright Eli Lilly and Company Chapter 1 Introduction The video above is the recording from the rOpenSci Community Call from 2019-09-24. Visit the call’s page for links to additional resources, and chime in here to propose and vote for ideas for new Community Call topics and speakers. 1.1 The drake R package Data analysis can be slow. A round of scientific computation can take several minutes, hours, or even days to complete. After it finishes, if you update your code or data, your hard-earned results may no longer be valid. How much of that valuable output can you keep, and how much do you need to update? How much runtime must you endure all over again? For projects in R, the drake package can help. It analyzes your workflow, skips steps with up-to-date results, and orchestrates the rest with optional distributed computing. At the end, drake provides evidence that your results match the underlying code and data, which increases your ability to trust your research. 1.2 Installation You can choose among different versions of drake. The latest CRAN release may be more convenient to install, but this manual is kept up to date with the GitHub version, so some features described here may not yet be available on CRAN. # Install the latest stable release from CRAN. install.packages("drake") # Alternatively, install the development version from GitHub. install.packages("devtools") library(devtools) install_github("ropensci/drake") 1.3 Why drake? 1.3.1 What gets done stays done. Too many data science projects follow a Sisyphean loop: Launch the code. Wait while it runs. Discover an issue. Restart from scratch. 
For projects with long runtimes, people tend to get stuck. But with drake, you can automatically Launch the parts that changed since last time. Skip the rest. 1.3.2 Reproducibility with confidence The R community emphasizes reproducibility. Traditional themes include scientific replicability, literate programming with knitr, and version control with git. But internal consistency is important too. Reproducibility carries the promise that your output matches the code and data you say you used. With the exception of any triggers suppressed by the user, drake strives to keep this promise. 1.3.2.1 Evidence Suppose you are reviewing someone else’s data analysis project for reproducibility. You scrutinize it carefully, checking that the datasets are available and the documentation is thorough. But could you re-create the results without the help of the original author? With drake, it is quick and easy to find out. make(plan) # See also r_make(). outdated(plan) # See also r_outdated(). With everything already up to date, you have tangible evidence of reproducibility. Even though you did not re-create the results, you know the results are re-creatable. They faithfully show what the code is producing. Given the right package environment and system configuration, you have everything you need to reproduce all the output by yourself. 1.3.2.2 Ease When it comes time to actually rerun the entire project, you have much more confidence. Starting over from scratch is trivially easy. clean() # Remove the original author's results. make(plan) # Independently re-create the results from the code and input data. 1.3.2.3 Independent replication With even more evidence and confidence, you can invest the time to independently replicate the original code base if necessary. Up until this point, you relied on basic drake functions such as make(), so you may not have needed to peek at any substantive author-defined code in advance. 
In that case, you can stay usefully ignorant as you reimplement the original author’s methodology. In other words, drake could potentially improve the integrity of independent replication. 1.3.2.4 Big data efficiency Select a specialized data format to increase speed and reduce memory consumption. In version 7.5.2.9000 and above, the available formats are “fst” for data frames (example below) and “keras” for Keras models (example here). library(drake) n <- 1e8 # Each target is 1.6 GB in memory. plan <- drake_plan( data_fst = target( data.frame(x = runif(n), y = runif(n)), format = "fst" ), data_old = data.frame(x = runif(n), y = runif(n)) ) make(plan) #> target data_fst #> target data_old build_times(type = "build") #> # A tibble: 2 x 4 #> target elapsed user system #> <chr> <Duration> <Duration> <Duration> #> 1 data_fst 13.93s 37.562s 7.954s #> 2 data_old 184s (~3.07 minutes) 177s (~2.95 minutes) 4.157s 1.3.2.5 History As of version 7.5.0, drake tracks the history of your analysis: what you built, when you built it, how you built it, the arguments you used in your function calls, and how to get the data back. (Disable with make(history = FALSE)) drake_history(analyze = TRUE) #> # A tibble: 7 x 8 #> target time hash exists command runtime latest quiet #> <chr> <chr> <chr> <lgl> <chr> <dbl> <lgl> <lgl> #> 1 data 2019-06-23… e580e… TRUE raw_data %>% muta… 0.001 TRUE NA #> 2 fit 2019-06-23… 62a16… TRUE lm(Ozone ~ Temp +… 0.00300 TRUE NA #> 3 hist 2019-06-23… 10bcd… TRUE create_plot(data) 0.00500 FALSE NA #> 4 hist 2019-06-23… 00fad… TRUE create_plot(data) 0.00300 TRUE NA #> 5 raw_da… 2019-06-23… 63172… TRUE "readxl::read_exc… 0.00900 TRUE NA #> 6 report 2019-06-23… dd965… TRUE "rmarkdown::rende… 0.476 FALSE TRUE #> 7 report 2019-06-23… dd965… TRUE "rmarkdown::rende… 0.369 TRUE TRUE The history has arguments like quiet (because of the call to knit(quiet = TRUE)) and hashes to help you recover old data. 
To learn more, see the end of the walkthrough chapter and the drake_history() help file. 1.3.2.6 Reproducible recovery drake’s data recovery feature is another way to avoid rerunning commands. It is useful if: You want to revert to your old code, maybe with git reset. You accidentally clean()ed a target and you want to get it back. You want to rename an expensive target. See the walkthrough chapter for details. 1.3.2.7 Readability and transparency Ideally, independent observers should be able to read your code and understand it. drake helps in several ways. The drake plan explicitly outlines the steps of the analysis, and vis_drake_graph() visualizes how those steps depend on each other. drake takes care of the parallel scheduling and high-performance computing (HPC) for you. That means the HPC code is no longer tangled up with the code that actually expresses your ideas. You can generate large collections of targets without necessarily changing your code base of imported functions, another nice separation between the concepts and the execution of your workflow. 1.3.3 Scale up and out. Not every project can complete in a single R session on your laptop. Some projects need more speed or computing power. Some require a few local processor cores, and some need large high-performance computing systems. But parallel computing is hard. Your tables and figures depend on your analysis results, and your analyses depend on your datasets, so some tasks must finish before others even begin. drake knows what to do. Parallelism is implicit and automatic. See the high-performance computing guide for all the details. # Use the spare cores on your local machine. options(clustermq.scheduler = "multicore") make(plan, parallelism = "clustermq", jobs = 4) # Or scale up to a supercomputer. 
drake_hpc_tmpl_file("slurm_clustermq.tmpl") # https://slurm.schedmd.com/ options( clustermq.scheduler = "slurm", clustermq.template = "slurm_clustermq.tmpl" ) make(plan, parallelism = "clustermq", jobs = 100) 1.4 With Docker drake and Docker are compatible and complementary. Here are some examples that run drake inside a Docker image. drake-gitlab-docker-example: A small pedagogical example workflow that leverages drake, Docker, GitLab, and continuous integration in a reproducible analysis pipeline. Created by Noam Ross. pleurosoriopsis: The workflow that supports Ebihara et al. 2019. “Growth Dynamics of the Independent Gametophytes of Pleurosoriopsis makinoi (Polypodiaceae)” Bulletin of the National Science Museum Series B (Botany) 45:77-86. Created by Joel Nitta. Alternatively, it is possible to run drake outside Docker and use the future package to send targets to a Docker image. drake’s Docker-psock example demonstrates how. Download the code with drake_example(\"Docker-psock\"). 1.5 Documentation 1.5.1 Core concepts The following resources explain what drake can do and how it works. The learndrake workshop devotes particular attention to drake’s mental model. The user manual. drakeplanner, an R/Shiny app to help learn drake and create new projects. Run locally with drakeplanner::drakeplanner() or access it at https://wlandau.shinyapps.io/drakeplanner. learndrake, an R package for teaching an extended drake workshop. It contains notebooks, slides, and Shiny apps, the latter two of which are publicly deployed. See the README for instructions and links. 1.5.2 In practice Miles McBain’s excellent blog post explains the motivating factors and practical issues {drake} addresses for most projects, how to set up a project as quickly and painlessly as possible, and how to overcome common obstacles. Miles’ dflow package generates the file structure for a boilerplate drake project. It is a more thorough alternative to drake::use_drake(). 
drake is heavily function-oriented by design, and Miles’ fnmate package automatically generates boilerplate code and docstrings for functions you mention in drake plans. 1.5.3 Use cases The official rOpenSci use cases and associated discussion threads describe applications of drake in the real world. Many of these use cases are linked from the drake tag on the rOpenSci discussion forum. Here are some additional applications of drake in real-world projects. efcaguab/demografia-del-voto efcaguab/great-white-shark-nsw IndianaCHE/Detailed-SSP-Reports joelnitta/pleurosoriopsis pat-s/pathogen-modeling sol-eng/tensorflow-w-r tiernanmartin/home-and-hope 1.5.4 drake projects as R packages Some folks like to structure their drake workflows as R packages. Examples are below. In your own analysis packages, be sure to supply the namespace of your package to the envir argument of make() and friends (e.g. make(envir = getNamespace(\"yourPackage\"))) so drake can watch your package’s functions for changes and rebuild downstream targets accordingly. b-rodrigues/coolmlproject tiernanmartin/drakepkg 1.5.5 Frequently asked questions The FAQ page is an index of links to appropriately-labeled issues on GitHub. To contribute, please submit a new issue and ask that it be labeled as a frequently asked question. 1.5.6 Reference The reference website. The official repository of example code. Download an example workflow from here with drake_example(). Presentations and workshops by Will Landau, Kirill Müller, Amanda Dobbyn, Karthik Ram, Sina Rüeger, Christine Stawitz, and others. See specific links at https://books.ropensci.org/drake/index.html#presentations The FAQ page, which links to appropriately-labeled issues on GitHub. 1.5.7 Function reference The reference section lists all the available functions. Here are the most important ones. drake_plan(): create a workflow data frame (like my_plan). make(): build your project. 
drake_history(): show what you built, when you built it, and the function arguments you used. loadd(): load one or more built targets into your R session. readd(): read and return a built target. vis_drake_graph(): show an interactive visual network representation of your workflow. outdated(): see which targets will be built in the next make(). deps_code(): check the dependencies of a command or function. drake_failed(): list the targets that failed to build in the last make(). diagnose(): return the full context of a build, including errors, warnings, and messages. 1.5.8 Tutorials Thanks to Kirill for constructing two interactive learnr tutorials: one supporting drake itself, and a prerequisite walkthrough of the cooking package. 1.5.9 Examples The official rOpenSci use cases and associated discussion threads describe applications of drake in action. Here are some more real-world sightings of drake in the wild. ecohealthalliance/drake-gitlab-docker-example efcaguab/demografia-del-voto efcaguab/great-white-shark-nsw IndianaCHE/Detailed-SSP-Reports joelnitta/pleurosoriopsis pat-s/pathogen-modeling sol-eng/tensorflow-w-r tiernanmartin/home-and-hope There are also multiple drake-powered example projects available here, ranging from beginner-friendly stubs to demonstrations of high-performance computing. You can generate the files for a project with drake_example() (e.g. drake_example(\"gsp\")), and you can list the available projects with drake_examples(). You can contribute your own example project with a fork and pull request. 
1.5.10 Presentations Author Venue Date Materials Bruno Rodrigues YouTube 2020-05-11 video, source Matt Dray Bioinformatics London Meetup 2020-01-30 slides, source Matt Dray Coffee & Coding, UK Dept for Transport 2019-10-02 slides Patrick Schratz whyR Conference 2019-09-27 workshop, slides, source Will Landau rOpenSci Community Calls 2019-09-24 Video recording and resource links Will Landau R/Pharma 2019 2019-08-21 slides, workspace, source Garrick Aden-Buie Bio-Data Club at Moffitt Cancer Center 2019-07-19 slides, workspace, source Tiernan Martin Cascadia R Conference 2019-06-08 slides Dominik Rafacz satRday Gdansk 2019-05-18 slides, source Amanda Dobbyn R-Ladies NYC 2019-02-12 slides, source Will Landau Harvard DataFest 2019-01-22 slides, source Karthik Ram RStudio Conference 2019-01-18 video, slides, resources Sina Rüeger Geneva R User Group 2018-10-04 slides, example code Will Landau R in Pharma 2018-08-16 video, slides, source Christine Stawitz R-Ladies Seattle 2018-06-25 materials Kirill Müller Swiss Institute of Bioinformatics 2018-03-05 workshop, slides, source, exercises 1.5.11 Context and history For context and history, check out this post on the rOpenSci blog and episode 22 of the R Podcast. 1.6 Help and troubleshooting The GitHub issue tracker is the best place to request help with your use case. Please search both open and closed ones before posting a new issue. Don’t be afraid to open a new issue, just please take 30 seconds to search for existing threads that could solve your problem. "],
["similar-work.html", "Chapter 2 Similar work 2.1 Pipeline tools 2.2 Memoization 2.3 Literate programming 2.4 Acknowledgements", " Chapter 2 Similar work drake enhances reproducibility and high-performance computing, but not in all respects. Literate programming, local library managers, containerization, and strict session managers offer more robust solutions in their respective domains. And for the problems drake does solve, it stands on the shoulders of the giants that came before. 2.1 Pipeline tools 2.1.1 GNU Make The original idea of a time-saving reproducible build system extends back at least as far as GNU Make, which still aids the work of data scientists as well as the original user base of compiled language programmers. In fact, the name “drake” stands for “Data Frames in R for Make”. Make is used widely in reproducible research. Below are some examples from Karl Broman’s website. Bostock, Mike (2013). “A map of flowlines from NHDPlus.” https://github.com/mbostock/us-rivers. Powered by the Makefile at https://github.com/mbostock/us-rivers/blob/master/Makefile. Broman, Karl W (2012). “Haplotype Probabilities in Advanced Intercross Populations.” G3 2(2), 199-202. Powered by the Makefile at https://github.com/kbroman/ailProbPaper/blob/master/Makefile. Broman, Karl W (2012). “Genotype Probabilities at Intermediate Generations in the Construction of Recombinant Inbred Lines.” Genetics 190(2), 403-412. Powered by the Makefile at https://github.com/kbroman/preCCProbPaper/blob/master/Makefile. Broman, Karl W and Kim, Sungjin and Sen, Saunak and Ane, Cecile and Payseur, Bret A (2012). “Mapping Quantitative Trait Loci onto a Phylogenetic Tree.” Genetics 192(2), 267-279. Powered by the Makefile at https://github.com/kbroman/phyloQTLpaper/blob/master/Makefile. Whereas GNU Make is language-agnostic, drake is fundamentally designed for R. Instead of a Makefile, drake supports an R-friendly domain-specific language for declaring targets. 
Targets in GNU Make are files, whereas targets in drake are arbitrary variables in memory. (drake does have opt-in support for files via file_out(), file_in(), and knitr_in().) drake caches these objects in its own storage system so R users rarely have to think about output files. 2.1.2 Remake remake itself is no longer maintained, but its founding design goals and principles live on through drake. In fact, drake is a direct reimagining of remake with enhanced scalability, reproducibility, high-performance computing, visualization, and documentation. 2.1.3 Factual’s Drake Factual’s Drake is similar in concept, but the development effort is completely unrelated to the drake R package. 2.1.4 Other pipeline tools There are countless other successful pipeline toolkits. The drake package distinguishes itself with its R-focused approach, Tidyverse-friendly interface, and a thorough selection of parallel computing technologies and scheduling algorithms. 2.2 Memoization Memoization is the strategic caching of the return values of functions. It is a lightweight approach to the core problem that drake and other pipeline tools are trying to solve. Every time a memoized function is called with a new set of arguments, the return value is saved for future use. Later, whenever the same function is called with the same arguments, the previous return value is salvaged, and the function call is skipped to save time. The memoise package is the primary implementation of memoization in R. Memoization saves time for small projects, but it arguably does not go far enough for large reproducible pipelines. In reality, the return value of a function depends not only on the function body and the arguments, but also on any nested functions and global variables, the dependencies of those dependencies, and so on upstream. drake tracks this deeper context, while memoise does not. 2.3 Literate programming Literate programming is the practice of narrating code in plain vernacular. 
The goal is to communicate the research process clearly, transparently, and reproducibly. Whereas commented code is still mostly code, literate knitr / R Markdown reports can become websites, presentation slides, lecture notes, serious scientific manuscripts, and even books. 2.3.1 knitr and R Markdown drake and knitr are symbiotic. drake’s job is to manage large computation and orchestrate the demanding tasks of a complex data analysis pipeline. knitr’s job is to communicate those expensive results after drake computes them. knitr / R Markdown reports are small pieces of an overarching drake pipeline. They should focus on communication, and they should do as little computation as possible. To insert a knitr report in a drake pipeline, use the knitr_in() function inside your drake plan, and use loadd() and readd() to refer to targets in the report itself. See an example here. 2.3.2 Version control drake is not a version control tool. However, it is fully compatible with git, svn, and similar software. In fact, it is good practice to use git alongside drake for reproducible workflows. However, data poses a challenge. The datasets created by make() can get large and numerous, and it is not recommended to put the .drake/ cache or the .drake_history/ logs under version control. Instead, it is recommended to use a data storage solution such as DropBox or OSF. 2.3.3 Containerization and R package environments drake does not track R packages or system dependencies for changes. Instead, it defers to tools like Docker, Singularity, renv, and packrat, which create self-contained portable environments to reproducibly isolate and ship data analysis projects. drake is fully compatible with these tools. 2.3.4 workflowr The workflowr package is a project manager that focuses on literate programming, sharing over the web, file organization, and version control. Its brand of reproducibility is all about transparency, communication, and discoverability. 
For an example of workflowr and drake working together, see this machine learning project by Patrick Schratz (source). 2.4 Acknowledgements Special thanks to Jarad Niemi, my advisor from graduate school, for first introducing me to the idea of Makefiles for research. He originally set me down the path that led to drake. Many thanks to Julia Lowndes, Ben Marwick, and Peter Slaughter for reviewing drake for rOpenSci, and to Maëlle Salmon for such active involvement as the editor. Thanks also to the following people for contributing early in development. Alex Axthelm Chan-Yub Park Daniel Falster Eric Nantz Henrik Bengtsson Ian Watson Jasper Clarkberg Kendon Bell Kirill Müller Credit for images is attributed here. "],
["walkthrough.html", "Chapter 3 Walkthrough 3.1 Set the stage. 3.2 Make your results. 3.3 Go back and fix things. 3.4 History and provenance 3.5 Reproducible data recovery and renaming 3.6 Try the code yourself! 3.7 Thanks", " Chapter 3 Walkthrough A typical data analysis workflow is a sequence of data transformations. Raw data becomes tidy data, then turns into fitted models, summaries, and reports. Other analyses are usually variations of this pattern, and drake can easily accommodate them. 3.1 Set the stage. To set up a project, load your packages, library(drake) library(dplyr) library(ggplot2) library(tidyr) #> #> Attaching package: 'tidyr' #> The following objects are masked from 'package:drake': #> #> expand, gather load your custom functions, create_plot <- function(data) { ggplot(data) + geom_histogram(aes(x = Ozone)) + theme_gray(24) } check any supporting files (optional), ## Get the files with drake_example("main"). file.exists("raw_data.xlsx") #> [1] TRUE file.exists("report.Rmd") #> [1] TRUE and plan what you are going to do. plan <- drake_plan( raw_data = readxl::read_excel(file_in("raw_data.xlsx")), data = raw_data %>% mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))), hist = create_plot(data), fit = lm(Ozone ~ Wind + Temp, data), report = rmarkdown::render( knitr_in("report.Rmd"), output_file = file_out("report.html"), quiet = TRUE ) ) plan #> # A tibble: 5 x 2 #> target command #> <chr> <expr_lst> #> 1 raw_data readxl::read_excel(file_in("raw_data.xlsx")) … #> 2 data raw_data %>% mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TR… #> 3 hist create_plot(data) … #> 4 fit lm(Ozone ~ Wind + Temp, data) … #> 5 report rmarkdown::render(knitr_in("report.Rmd"), output_file = file_out("re… Optionally, visualize your workflow to make sure you set it up correctly. The graph is interactive, so you can click, drag, hover, zoom, and explore. 
vis_drake_graph(plan) 3.2 Make your results. So far, we have just been setting the stage. Use make() or r_make() to do the real work. Targets are built in the correct order regardless of the row order of plan. make(plan) # See also r_make(). #> ▶ target raw_data #> ▶ target data #> ▶ target fit #> ▶ target hist #> ▶ target report Except for output files like report.html, your output is stored in a hidden .drake/ folder. Reading it back is easy. readd(data) %>% # See also loadd(). head() #> # A tibble: 6 x 6 #> Ozone Solar.R Wind Temp Month Day #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 41 190 7.4 67 5 1 #> 2 36 118 8 72 5 2 #> 3 12 149 12.6 74 5 3 #> 4 18 313 11.5 62 5 4 #> 5 42.1 NA 14.3 56 5 5 #> 6 28 NA 14.9 66 5 6 The graph shows everything up to date. vis_drake_graph(plan) # See also r_vis_drake_graph(). 3.3 Go back and fix things. You may look back on your work and see room for improvement, but it’s all good! The whole point of drake is to help you go back and change things quickly and painlessly. For example, we forgot to give our histogram a bin width. readd(hist) #> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. So let’s fix the plotting function. create_plot <- function(data) { ggplot(data) + geom_histogram(aes(x = Ozone), binwidth = 10) + theme_gray(24) } drake knows which results are affected. vis_drake_graph(plan) # See also r_vis_drake_graph(). The next make() just builds hist and report. No point in wasting time on the data or model. make(plan) # See also r_make(). 
#> ▶ target hist #> ▶ target report loadd(hist) hist 3.4 History and provenance As of version 7.5.2, drake tracks the history and provenance of your targets: what you built, when you built it, how you built it, the arguments you used in your function calls, and how to get the data back. history <- drake_history(analyze = TRUE) history #> # A tibble: 7 x 11 #> target current built exists hash command seed runtime na.rm quiet #> <chr> <lgl> <chr> <lgl> <chr> <chr> <int> <dbl> <lgl> <lgl> #> 1 data TRUE 2020… TRUE 11e2… "raw_d… 1.29e9 0.0130 TRUE NA #> 2 fit TRUE 2020… TRUE 3c87… "lm(Oz… 1.11e9 0.00500 NA NA #> 3 hist FALSE 2020… TRUE 88ae… "creat… 2.10e8 0.014 NA NA #> 4 hist TRUE 2020… TRUE 0304… "creat… 2.10e8 0.004 NA NA #> 5 raw_d… TRUE 2020… TRUE 855d… "readx… 1.20e9 0.014 NA NA #> 6 report TRUE 2020… TRUE f900… "rmark… 1.30e9 1.35 NA TRUE #> 7 report TRUE 2020… TRUE f900… "rmark… 1.30e9 0.866 NA TRUE #> # … with 1 more variable: output_file <chr> Remarks: The quiet column appears above because one of the drake_plan() commands has knit(quiet = TRUE). The hash column identifies all the previous versions of your targets. As long as exists is TRUE, you can recover old data. Advanced: if you use make(cache_log_file = TRUE) and put the cache log file under version control, you can match the hashes from drake_history() with the git commit history of your code. Let’s use the history to recover the oldest histogram. 
hash <- history %>% filter(target == "hist") %>% pull(hash) %>% head(n = 1) cache <- drake_cache() cache$get_value(hash) #> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. 3.5 Reproducible data recovery and renaming Remember how we made that change to our histogram? What if we want to change it back? If we revert create_plot(), make(plan, recover = TRUE) restores the original plot. create_plot <- function(data) { ggplot(data) + geom_histogram(aes(x = Ozone)) + theme_gray(24) } # The report still needs to run in order to restore report.html. make(plan, recover = TRUE) #> ℹ unloading 1 targets from environment #> ✔ recover hist #> ▶ target report readd(hist) # old histogram #> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. drake’s data recovery feature is another way to avoid rerunning commands. It is useful if: You want to revert to your old code, maybe with git reset. You accidentally clean()ed a target and you want to get it back. You want to rename an expensive target. In version 7.5.2 and above, make(recover = TRUE) can salvage the values of old targets. Before building a target, drake checks if you have ever built something else with the same command, dependencies, seed, etc. that you have right now. If appropriate, drake assigns the old value to the new target instead of rerunning the command. Caveats: This feature is still experimental. Recovery may not be a good idea if your external dependencies have changed a lot over time (R version, package environment, etc.). 3.5.1 Undoing clean() # Is the data really gone? clean() # garbage_collection = FALSE # Nope! make(plan, recover = TRUE) # The report still builds since report.md is gone. #> ✔ recover raw_data #> ✔ recover data #> ✔ recover fit #> ✔ recover hist #> ✔ recover report # When was the raw data *really* first built? 
diagnose(raw_data)$date #> [1] "2020-07-12 13:10:37.463289 +0000 GMT" 3.5.2 Renaming You can use recovery to rename a target. The trick is to supply the random number generator seed that drake used with the old target name. Also, renaming a target unavoidably invalidates downstream targets. # Get the old seed. old_seed <- diagnose(data)$seed # Now rename the data and supply the old seed. plan <- drake_plan( raw_data = readxl::read_excel(file_in("raw_data.xlsx")), # Previously just named "data". airquality_data = target( raw_data %>% mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))), seed = !!old_seed ), # `airquality_data` will be recovered from `data`, # but `hist` and `fit` have changed commands, # so they will build from scratch. hist = create_plot(airquality_data), fit = lm(Ozone ~ Wind + Temp, airquality_data), report = rmarkdown::render( knitr_in("report.Rmd"), output_file = file_out("report.html"), quiet = TRUE ) ) make(plan, recover = TRUE) #> ✔ recover airquality_data #> ▶ target fit #> ▶ target hist #> ▶ target report 3.6 Try the code yourself! Use drake_example(\"main\") to download the code files for this example. 3.7 Thanks Thanks to Kirill Müller for originally providing this example. "],
["plans.html", "Chapter 4 drake plans 4.1 Functions 4.2 Intro to plans 4.3 A strategy for building up plans 4.4 How to choose good targets 4.5 Special data formats for targets 4.6 Special columns 4.7 Static files 4.8 Dynamic files 4.9 Large plans", " Chapter 4 drake plans Most data analysis workflows consist of several steps, such as data cleaning, model fitting, visualization, and reporting. A drake plan is the high-level catalog of all these steps for a single workflow. It is the centerpiece of every drake-powered project, and it is always required. However, the plan is almost never the first thing we write. A typical plan rests on a foundation of carefully-crafted custom functions. 4.1 Functions A function is a reusable instruction that accepts some inputs and returns a single output. After we define a function once, we can easily call it any number of times. root_square_term <- function(l, w, h) { half_w <- w / 2 l * sqrt(half_w ^ 2 + h ^ 2) } root_square_term(1, 2, 3) #> [1] 3.162278 root_square_term(4, 5, 6) #> [1] 26 In practice, functions are vocabulary. They are concise references to complicated ideas, and they help us write instructions of ever increasing complexity. # right rectangular pyramid volume_pyramid <- function(length_base, width_base, height) { area_base <- length_base * width_base term1 <- root_square_term(length_base, width_base, height) term2 <- root_square_term(width_base, length_base, height) area_base + term1 + term2 } volume_pyramid(3, 5, 7) #> [1] 73.09366 The root_square_term() function is custom shorthand that makes volume_pyramid() easier to write and understand. volume_pyramid(), in turn, helps us crudely approximate the total square meters of stone eroded from the Great Pyramid of Giza (dimensions from Wikipedia). 
volume_original <- volume_pyramid(230.4, 230.4, 146.5) volume_current <- volume_pyramid(230.4, 230.4, 138.8) volume_original - volume_current # volume eroded #> [1] 2760.183 This function-oriented code is concise and clear. Contrast it with the cumbersome mountain of imperative arithmetic that would have otherwise daunted us. # Don't try this at home! width_original <- 230.4 length_original <- 230.4 height_original <- 146.5 # We supply the same lengths and widths, # but we use different variable names # to illustrate the general case. width_current <- 230.4 length_current <- 230.4 height_current <- 138.8 area_original <- length_original * width_original term1_original <- length_original * sqrt((width_original / 2) ^ 2 + height_original ^ 2) term2_original <- width_original * sqrt((length_original / 2) ^ 2 + height_original ^ 2) volume_original <- area_original + term1_original + term2_original area_current <- length_current * width_current term1_current <- length_current * sqrt((width_current / 2) ^ 2 + height_current ^ 2) term2_current <- width_current * sqrt((length_current / 2) ^ 2 + height_current ^ 2) volume_current <- area_current + term1_current + term2_current volume_original - volume_current # volume eroded #> [1] 2760.183 Unlike imperative scripts, functions break down complex ideas into manageable pieces, and they gradually build up bigger and bigger pieces until an elegant solution materializes. This process of building up functions helps us think clearly, understand what we are doing, and explain our methods to others. 4.2 Intro to plans A drake plan is a data frame with columns named target and command. Each row represents a step in the workflow. Each command is a concise expression that makes use of our functions, and each target is the return value of the command. (The target column has the names of the targets, not the values. These names must not conflict with the names of your functions or other global objects.) 
We create plans with the drake_plan() function. plan <- drake_plan( raw_data = readxl::read_excel(file_in("raw_data.xlsx")), data = raw_data %>% mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))), hist = create_plot(data), fit = lm(Ozone ~ Wind + Temp, data), report = rmarkdown::render( knitr_in("report.Rmd"), output_file = file_out("report.html"), quiet = TRUE ) ) plan #> # A tibble: 5 x 2 #> target command #> <chr> <expr_lst> #> 1 raw_data readxl::read_excel(file_in("raw_data.xlsx")) … #> 2 data raw_data %>% mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TR… #> 3 hist create_plot(data) … #> 4 fit lm(Ozone ~ Wind + Temp, data) … #> 5 report rmarkdown::render(knitr_in("report.Rmd"), output_file = file_out("re… The plan makes use of a custom create_plot() function to produce target hist. Functions make the plan more concise and easier to read. create_plot <- function(data) { ggplot(data) + geom_histogram(aes(x = Ozone)) + theme_gray(24) } drake automatically understands the relationships among targets in the plan. It knows data depends on raw_data because the symbol raw_data is mentioned in the command for data. drake represents this dependency relationship with an arrow from raw_data to data in the graph. vis_drake_graph(plan) We can write the targets in any order and drake still understands the dependency relationships. # Same targets as above, listed in reverse order. reordered_plan <- drake_plan( report = rmarkdown::render( knitr_in("report.Rmd"), output_file = file_out("report.html"), quiet = TRUE ), fit = lm(Ozone ~ Wind + Temp, data), hist = create_plot(data), data = raw_data %>% mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))), raw_data = readxl::read_excel(file_in("raw_data.xlsx")) ) vis_drake_graph(reordered_plan) The make() function runs the correct targets in the correct order and stores the results in a hidden cache. 
library(drake) library(glue) library(purrr) library(rlang) library(tidyverse) make(plan) #> ▶ target raw_data #> ▶ target data #> ▶ target fit #> ▶ target hist #> ▶ target report readd(hist) #> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. The purpose of the plan is to identify steps we can skip in our workflow. If we change some code or data, drake saves time by running some steps and skipping others. create_plot <- function(data) { ggplot(data) + geom_histogram(aes(x = Ozone), binwidth = 10) + # new bin width theme_gray(24) } vis_drake_graph(plan) make(plan) #> ▶ target hist #> ▶ target report readd(hist) 4.3 A strategy for building up plans Building a drake plan is a gradual process. You do not need to write out every single target to start with. Instead, start with just one or two targets: for example, raw_data in the plan above. Then, make() the plan and inspect the results with readd(). If the target’s return value seems correct to you, go ahead and write another target in the plan (data), make() the bigger plan, and repeat. These repeated make()s should skip previous work each time, and you will have an intuitive sense of the results as you go. 4.4 How to choose good targets Defining good targets is more of an art than a science, and it requires personal judgement and context specific to your use case. Generally speaking, a good target is Long enough to eat up a decent chunk of runtime, and Small enough that make() frequently skips it, and Meaningful to your project, and A well-behaved R object compatible with saveRDS(). For example, data frames behave better than database connection objects, R6 classes, and xgboost matrices. Above, “long” and “small” refer to computational runtime, not the size of the target’s value. The more data you return to the targets, the more data drake puts in storage, and the slower your workflow becomes. 
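The storage cost described above can be estimated before a value becomes a target. The sketch below uses only base R. serialized_size() is a hypothetical helper, not part of drake, and its byte counts only approximate what drake would store, because the real cost depends on the target's format and compression.

```r
# Hypothetical helper (not a drake function): approximate size in bytes
# of an object after serialization, before any compression.
serialized_size <- function(x) {
  length(serialize(x, connection = NULL))
}

# Two illustrative return values of very different sizes.
small <- data.frame(x = runif(1e3))
large <- data.frame(x = runif(1e5))

serialized_size(small)
serialized_size(large)
```

Comparing such estimates for candidate targets gives a rough idea of which return values are worth trimming inside a function before they reach the cache.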
If you have a large dataset, it may not be wise to copy it over several targets. bad_plan <- drake_plan( raw = get_big_raw_dataset(), # We write this ourselves. selection = select(raw, column1, column2), filtered = filter(selection, column2 == "abc"), analysis = my_analysis_function(filtered) # Same here. ) In the above sketch, the dataset is super large, and selection and filtering are fast by comparison. It is much better to wrap up these steps in a data cleaning function and reduce the number of targets. munged_dataset <- function() { get_big_raw_dataset() %>% select(column1, column2) %>% filter(column2 == "abc") } good_plan <- drake_plan( dataset = munged_dataset(), analysis = my_analysis_function(dataset) ) 4.5 Special data formats for targets drake supports custom formats for saving and loading large objects and highly specialized objects. For example, the \"fst\" and \"fst_tbl\" formats use the fst package to save data.frame and tibble targets faster. Simply enclose the command and the format together with the target() function. library(drake) n <- 1e8 # Each target is 1.6 GB in memory. plan <- drake_plan( data_fst = target( data.frame(x = runif(n), y = runif(n)), format = "fst" ), data_old = data.frame(x = runif(n), y = runif(n)) ) make(plan) #> target data_fst #> target data_old build_times(type = "build") #> # A tibble: 2 x 4 #> target elapsed user system #> <chr> <Duration> <Duration> <Duration> #> 1 data_fst 13.93s 37.562s 7.954s #> 2 data_old 184s (~3.07 minutes) 177s (~2.95 minutes) 4.157s There are several formats, each with their own system requirements. These system requirements, such as the fst R package for the \"fst\" format, do not come pre-installed with drake. You will need to install them manually. \"file\": Dynamic files. To use this format, simply create local files and directories yourself and then return a character vector of paths as the target’s value. Then, drake will watch for changes to those files in subsequent calls to make(). 
This is a more flexible alternative to file_in() and file_out(), and it is compatible with dynamic branching. See https://github.com/ropensci/drake/pull/1178 for an example. \"fst\": save big data frames fast. Requires the fst package. Note: this format strips non-data-frame attributes. \"fst_tbl\": Like \"fst\", but for tibble objects. Requires the fst and tibble packages. Strips away non-data-frame non-tibble attributes. \"fst_dt\": Like \"fst\" format, but for data.table objects. Requires the fst and data.table packages. Strips away non-data-frame non-data-table attributes. \"diskframe\": Stores disk.frame objects, which could potentially be larger than memory. Requires the fst and disk.frame packages. Coerces objects to disk.frames. Note: disk.frame objects get moved to the drake cache (a subfolder of .drake/ for most workflows). To ensure this data transfer is fast, it is best to save your disk.frame objects to the same physical storage drive as the drake cache, e.g. with as.disk.frame(your_dataset, outdir = drake_tempfile()). \"keras\": save Keras models as HDF5 files. Requires the keras package. \"qs\": save any R object that can be properly serialized with the qs package. Requires the qs package. Uses qsave() and qread(). Uses the default settings in qs version 0.20.2. \"rds\": save any R object that can be properly serialized. Requires R version >= 3.5.0 due to ALTREP. Note: the \"rds\" format uses gzip compression, which is slow. \"qs\" is a superior format. 4.6 Special columns With target(), you can define any kind of special column in the plan. 
drake_plan( x = target((1 + sqrt(5)) / 2, golden = "ratio"), y = target(pi * 3 ^ 2, area = "circle") ) #> # A tibble: 2 x 4 #> target command golden area #> <chr> <expr_lst> <chr> <chr> #> 1 x (1 + sqrt(5))/2 ratio NA #> 2 y pi * 3^2 NA circle The following columns have special meanings, and make() reads and interprets them. format: already described above. dynamic: See the chapter on dynamic branching. transform: Automatically processed by drake_plan() except for drake_plan(transform = FALSE). See the chapter on static branching. trigger: rule to decide whether a target needs to run. See the trigger chapter to learn more. elapsed and cpu: number of seconds to wait for the target to build before timing out (elapsed for elapsed time and cpu for CPU time). hpc: logical values (TRUE/FALSE/NA) indicating whether to send each target to parallel workers. See the parallel computing chapter to learn more. resources: target-specific lists of resources for a computing cluster. See the advanced options in the parallel computing chapter for details. caching: overrides the caching argument of make() for each target individually. Only supported in drake version 7.6.1.9000 and above. Possible values: “master”: tell the master process to store the target in the cache. “worker”: tell the HPC worker to store the target in the cache. NA: default to the caching argument of make(). retries: number of times to retry building a target in the event of an error. seed: For statistical reproducibility, drake automatically assigns a unique pseudo-random number generator (RNG) seed to each target based on the target name and the global seed argument to make(). With the seed column of the plan, you can override these default seeds and set your own. Any non-missing seeds in the seed column override drake’s default target seeds. max_expand: for dynamic branching only. 
Same as the max_expand argument of make(), but on a target-by-target basis. Limits the number of sub-targets created for a given target. Only supported in drake >= 7.11.0. 4.7 Static files drake has special functions to declare relationships between targets and external storage on disk. file_in() is for input files and directories, file_out() is for output files and directories, and knitr_in() is for R Markdown reports and knitr source files. If you use one of these functions inline in the plan, it tells drake to rerun a target when a file changes (or any of the files in a directory). All three functions appear in this plan. plan #> # A tibble: 5 x 2 #> target command #> <chr> <expr_lst> #> 1 raw_data readxl::read_excel(file_in("raw_data.xlsx")) … #> 2 data raw_data %>% mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TR… #> 3 hist create_plot(data) … #> 4 fit lm(Ozone ~ Wind + Temp, data) … #> 5 report rmarkdown::render(knitr_in("report.Rmd"), output_file = file_out("re… If we break the file_out() file, drake automatically repairs it. unlink("report.html") make(plan) #> ▶ target report file.exists("report.html") #> [1] TRUE As for knitr_in(), recall what happened when we changed create_plot(). Not only did hist rerun, report ran as well. Why? Because knitr_in() is special. It tells drake to look for mentions of loadd() and readd() in the code chunks. drake finds the targets you mention in those loadd() and readd() calls and treats them as dependencies of the report. This lets you choose to run the report either inside or outside a drake pipeline. cat(readLines("report.Rmd"), sep = "\\n") #> --- #> title: "Example R Markdown drake file target" #> author: Will Landau and Kirill Müller #> output: html_document #> --- #> #> Run `make.R` to generate the output `report.pdf` and its dependencies. 
Because we use `loadd()` and `readd()` below, `drake` knows `report.pdf` depends on targets `fit`, and `hist`. #> #> ```{r content} #> library(drake) #> loadd(fit) #> print(fit) #> readd(hist) #> ``` #> #> More: #> #> - Walkthrough: [this chapter of the user manual](https://books.ropensci.org/drake/walkthrough.html) #> - Code: `drake_example("main")` That is why we have an arrow from hist to report in the graph. vis_drake_graph(plan) 4.7.1 URLs file_in() understands URLs. If you supply a string beginning with http://, https://, or ftp://, drake watches the HTTP ETag, file size, and timestamp for changes. drake_plan( external_data = download.file(file_in("http://example.com/file.zip")) ) #> # A tibble: 1 x 2 #> target command #> <chr> <expr_lst> #> 1 external_data download.file(file_in("http://example.com/file.zip")) 4.7.2 Limitations of static files 4.7.2.1 Paths must be literal strings file_in(), file_out(), and knitr_in() require you to mention file and directory names explicitly. You cannot use a variable containing the name of a file. The reason is that drake detects dependency relationships with static code analysis. In other words, drake needs to know the names of all your files ahead of time (before we start building targets in make()). Here is an example of a bad plan. prefix <- "eco_" bad_plan <- drake_plan( data = read_csv(file_in(paste0(prefix, "data.csv"))) ) vis_drake_graph(bad_plan) #> Warning: Detected file_in(paste0(prefix, "data.csv")). File paths in #> file_in(), file_out(), and knitr_in() must be literal strings, not #> variables. For example, file_in("file1.csv", "file2.csv") is legal, but #> file_in(paste0(filename_variable, ".csv")) is not. Details: https:// #> books.ropensci.org/drake/plans.html#static-files Instead, write this: good_plan <- drake_plan( file = read_csv(file_in("eco_data.csv")) ) vis_drake_graph(good_plan) Or even the one below, which uses the !! 
(“bang-bang”) tidy evaluation unquoting operator. prefix <- "eco_" drake_plan( file = read_csv(file_in(!!paste0(prefix, "data.csv"))) ) #> # A tibble: 1 x 2 #> target command #> <chr> <expr_lst> #> 1 file read_csv(file_in("eco_data.csv")) 4.7.2.2 Do not use inside functions file_out() and knitr_in() should not be used inside imported functions because drake does not know how to deal with functions that depend on targets. Instead of this: f <- function() { render(knitr_in("report.Rmd"), output_file = file_out("report.html")) } plan <- drake_plan( y = f() ) Write this: plan <- drake_plan( y = render(knitr_in("report.Rmd"), output_file = file_out("report.html")) ) Or this: f <- function(input, output) { render(input, output_file = output) } plan <- drake_plan( y = f(input = knitr_in("report.Rmd"), output = file_out("report.html")) ) file_in() can be used inside functions, but only for files that exist before you call make(). 4.7.2.3 Incompatible with dynamic branching file_out() and knitr_in() deal with static output files, so they must not be used with dynamic branching. As an alternative, consider dynamic files (described below). You can still use file_in(), but only for files that all the dynamic sub-targets depend on. (Changing a static input file dependency will invalidate all the sub-targets.) 4.7.2.4 Database connections file_in() and friends do not help us manage database connections. If you work with a database, the most general best practice is to always trigger a snapshot to make sure you have the latest data. plan <- drake_plan( data = target( get_data_from_db("my_table"), # Define yourself. trigger = trigger(condition = TRUE) # Always runs. ), preprocess = my_preprocessing(data) # Runs when the data change. ) In specific use cases, you may be able to watch database metadata for changes, but this information is situation-specific. library(DBI) # Connection objects are brittle, so they should not be targets. 
# We define them up front, and we use ignore() to prevent # drake from rerunning targets when the connection object changes. con <- dbConnect(...) plan <- drake_plan( data = target( dbReadTable(ignore(con), "my_table"), # Use ignore() for db connection objects. trigger = trigger(change = somehow_get_db_timestamp()) # Define yourself. ), preprocess = my_preprocessing(data) # runs when the data change ) 4.8 Dynamic files drake >= 7.11.0 supports dynamic files through a specialized format. With dynamic files, drake can watch local files without knowing them in advance. This is a more flexible alternative to file_out() and file_in(), and it is fully compatible with dynamic branching. 4.8.1 How to use dynamic files Set format = \"file\" in target() within drake_plan(). Return the paths to local files from the target. To link targets together in dependency relationships, reference target names and not literal character strings. 4.8.2 Example of dynamic files bad_plan <- drake_plan( upstream = target({ writeLines("one line", "my file") # Make sure the file exists. "my file" # Must return the file path. }, format = "file" # Necessary for dynamic files ), downstream = readLines("my file") # Oops! ) plot(bad_plan) good_plan <- drake_plan( upstream = target({ writeLines("one line", "my file") # Make sure the file exists. "my file" # Must return the file path. }, format = "file" # Necessary for dynamic files ), downstream = readLines(upstream) # Use the target name. ) plot(good_plan) make(good_plan) #> ▶ target upstream #> ▶ target downstream # Change how the file is generated. good_plan <- drake_plan( upstream = target({ writeLines("different line", "my file") # Change the file. "my file" }, format = "file" ), downstream = readLines(upstream) ) # The downstream target automatically reruns. make(good_plan) #> ▶ target upstream #> ▶ target downstream 4.8.3 Limitations of dynamic files Unlike file_in(), dynamic files cannot handle URLs. 
All files and directories must have valid local paths. 4.9 Large plans drake has special interfaces to concisely define large numbers of targets. See the chapters on static branching and dynamic branching for details. "],
["static.html", "Chapter 5 Static branching 5.1 Why static branching? 5.2 Grouping variables 5.3 Tidy evaluation 5.4 Static transformations 5.5 Target names 5.6 Tags", " Chapter 5 Static branching 5.1 Why static branching? Static branching helps us write large plans compactly. Instead of typing out every single target by hand, we use a special shorthand to declare entire batches of similar targets. To practice static branching in a controlled setting, try the interactive exercises at https://wlandau.shinyapps.io/learndrakeplans (from the workshop at https://github.com/wlandau/learndrake). Without static branching, plans like this one become too cumbersome to type by hand. # Without static branching: drake_plan( data = get_data(), analysis_fast_1_main = main(data, mean = 1, tuning = "fast"), analysis_slow_1_main = main(data, mean = 1, tuning = "slow"), analysis_fast_2_main = main(data, mean = 2, tuning = "fast"), analysis_slow_2_main = main(data, mean = 2, tuning = "slow"), analysis_fast_3_main = main(data, mean = 3, tuning = "fast"), analysis_slow_3_main = main(data, mean = 3, tuning = "slow"), analysis_fast_4_main = main(data, mean = 4, tuning = "fast"), analysis_slow_4_main = main(data, mean = 4, tuning = "slow"), analysis_fast_1_altv = altv(data, mean = 1, tuning = "fast"), analysis_slow_1_altv = altv(data, mean = 1, tuning = "slow"), analysis_fast_2_altv = altv(data, mean = 2, tuning = "fast"), analysis_slow_2_altv = altv(data, mean = 2, tuning = "slow"), analysis_fast_3_altv = altv(data, mean = 3, tuning = "fast"), analysis_slow_3_altv = altv(data, mean = 3, tuning = "slow"), analysis_fast_4_altv = altv(data, mean = 4, tuning = "fast"), analysis_slow_4_altv = altv(data, mean = 4, tuning = "slow"), summary_analysis_fast_1_main = summarize_model(analysis_fast_1_main), summary_analysis_slow_1_main = summarize_model(analysis_slow_1_main), summary_analysis_fast_2_main = summarize_model(analysis_fast_2_main), summary_analysis_slow_2_main = 
summarize_model(analysis_slow_2_main), summary_analysis_fast_3_main = summarize_model(analysis_fast_3_main), summary_analysis_slow_3_main = summarize_model(analysis_slow_3_main), summary_analysis_fast_4_main = summarize_model(analysis_fast_4_main), summary_analysis_slow_4_main = summarize_model(analysis_slow_4_main), summary_analysis_fast_1_altv = summarize_model(analysis_fast_1_altv), summary_analysis_slow_1_altv = summarize_model(analysis_slow_1_altv), summary_analysis_fast_2_altv = summarize_model(analysis_fast_2_altv), summary_analysis_slow_2_altv = summarize_model(analysis_slow_2_altv), summary_analysis_fast_3_altv = summarize_model(analysis_fast_3_altv), summary_analysis_slow_3_altv = summarize_model(analysis_slow_3_altv), summary_analysis_fast_4_altv = summarize_model(analysis_fast_4_altv), summary_analysis_slow_4_altv = summarize_model(analysis_slow_4_altv), model_summary_altv = dplyr::bind_rows( summary_analysis_fast_1_altv, summary_analysis_slow_1_altv, summary_analysis_fast_2_altv, summary_analysis_slow_2_altv, summary_analysis_fast_3_altv, summary_analysis_slow_3_altv, summary_analysis_fast_4_altv, summary_analysis_slow_4_altv ), model_summary_main = dplyr::bind_rows( summary_analysis_fast_1_main, summary_analysis_slow_1_main, summary_analysis_fast_2_main, summary_analysis_slow_2_main, summary_analysis_fast_3_main, summary_analysis_slow_3_main, summary_analysis_fast_4_main, summary_analysis_slow_4_main ) ) Static branching makes it easier to write and understand plans. To activate static branching, use the transform argument of target(). # With static branching: model_functions <- rlang::syms(c("main", "altv")) # We need symbols. model_functions # List of symbols. #> [[1]] #> main #> #> [[2]] #> altv plan <- drake_plan( data = get_data(), analysis = target( model_function(data, mean = mean_value, tuning = tuning_setting), # Define an analysis target for each combination of # tuning_setting, mean_value, and model_function. 
transform = cross( tuning_setting = c("fast", "slow"), mean_value = !!(1:4), # Why `!!`? See "Tidy Evaluation" below. model_function = !!model_functions # Why `!!`? See "Tidy Evaluation" below. ) ), # Define a new summary target for each analysis target defined previously. summary = target( summarize_model(analysis), transform = map(analysis) ), # Group together the summary targets by the corresponding value # of model_function. model_summary = target( dplyr::bind_rows(summary), transform = combine(summary, .by = model_function) ) ) plan #> # A tibble: 35 x 2 #> target command #> <chr> <expr_lst> #> 1 analysis_fast_1L_main main(data, mean = 1L, tuning = "fast") #> 2 analysis_slow_1L_main main(data, mean = 1L, tuning = "slow") #> 3 analysis_fast_2L_main main(data, mean = 2L, tuning = "fast") #> 4 analysis_slow_2L_main main(data, mean = 2L, tuning = "slow") #> 5 analysis_fast_3L_main main(data, mean = 3L, tuning = "fast") #> 6 analysis_slow_3L_main main(data, mean = 3L, tuning = "slow") #> 7 analysis_fast_4L_main main(data, mean = 4L, tuning = "fast") #> 8 analysis_slow_4L_main main(data, mean = 4L, tuning = "slow") #> 9 analysis_fast_1L_altv altv(data, mean = 1L, tuning = "fast") #> 10 analysis_slow_1L_altv altv(data, mean = 1L, tuning = "slow") #> # … with 25 more rows Always check the graph to make sure the plan makes sense. plot(plan) # a quick and dirty alternative to vis_drake_graph() If the graph is too complicated to look at or too slow to load, downsize the plan with max_expand. Then, when you are done debugging and testing, remove max_expand to scale back up to the full plan. 
model_functions <- rlang::syms(c("main", "altv")) plan <- drake_plan( max_expand = 2, data = get_data(), analysis = target( model_function(data, mean = mean_value, tuning = tuning_setting), transform = cross( tuning_setting = c("fast", "slow"), mean_value = !!(1:4), # Why `!!`? See "Tidy Evaluation" below. model_function = !!model_functions # Why `!!`? See "Tidy Evaluation" below. ) ), summary = target( summarize_model(analysis), transform = map(analysis) ), model_summary = target( dplyr::bind_rows(summary), transform = combine(summary, .by = model_function) # defined in "analysis" ) ) # Click and drag the nodes in the graph to improve the view. plot(plan) 5.2 Grouping variables A grouping variable contains iterated values for a single instance of map() or cross(). mean_value and tuning_par are grouping variables below. Notice how they are defined inside cross(). Grouping variables are not targets, and they must be declared inside static transformations. drake_plan( data = get_data(), model = target( fit_model(data, mean_value, tuning_par), transform = cross( mean_value = c(1, 2), tuning_par = c("fast", "slow") ) ) ) #> # A tibble: 5 x 2 #> target command #> <chr> <expr_lst> #> 1 data get_data() #> 2 model_1_fast fit_model(data, 1, "fast") #> 3 model_2_fast fit_model(data, 2, "fast") #> 4 model_1_slow fit_model(data, 1, "slow") #> 5 model_2_slow fit_model(data, 2, "slow") Each model has its own mean_value and tuning_par. To see this correspondence, set trace = TRUE. 
drake_plan( trace = TRUE, data = get_data(), model = target( fit_model(data, mean_value, tuning_par), transform = cross( mean_value = c(1, 2), tuning_par = c("fast", "slow") ) ) ) #> # A tibble: 5 x 5 #> target command mean_value tuning_par model #> <chr> <expr_lst> <chr> <chr> <chr> #> 1 data get_data() NA NA NA #> 2 model_1_fast fit_model(data, 1, "fast") 1 "\\"fast\\"" model_1_fast #> 3 model_2_fast fit_model(data, 2, "fast") 2 "\\"fast\\"" model_2_fast #> 4 model_1_slow fit_model(data, 1, "slow") 1 "\\"slow\\"" model_1_slow #> 5 model_2_slow fit_model(data, 2, "slow") 2 "\\"slow\\"" model_2_slow If we summarize those models, each summary has its own mean_value and tuning_par. In other words, grouping variables have a natural nesting, and they propagate forward so we can use them in downstream targets. Notice how mean_value and tuning_par appear in summarize_model() and combine() below. plan <- drake_plan( trace = TRUE, data = get_data(), model = target( fit_model(data, mean_value, tuning_par), transform = cross( mean_value = c(1, 2), tuning_par = c("fast", "slow") ) ), summary = target( # mean_value and tuning_par are old grouping variables from the models summarize_model(model, mean_value, tuning_par), transform = map(model) ), summary_by_tuning = target( dplyr::bind_rows(summary), # tuning_par is an old grouping variable from the models. transform = combine(summary, .by = tuning_par) ) ) plot(plan) 5.2.1 Limitations of grouping variables Each grouping variable should be defined only once. In the plan below, there are multiple conflicting definitions of a1, a2, and a3 in the dependencies of c1, so drake does not know which definitions to use. 
drake_plan( b1 = target(1, transform = map(a1 = 1, a2 = 1, .id = FALSE)), b2 = target(1, transform = map(a1 = 1, a3 = 1, .id = FALSE)), b3 = target(1, transform = map(a2 = 1, a3 = 1, .id = FALSE)), c1 = target(1, transform = map(a1, a2, a3, .id = FALSE)), trace = TRUE ) #> Warning in min(vapply(out, length, FUN.VALUE = integer(1))): no non-missing #> arguments to min; returning Inf #> Error: A grouping variable for target c1 is either undefined or improperly invoked. Details: https://books.ropensci.org/drake/static.html#grouping-variables Workarounds include bind_plans() (on separate sub-plans) and dynamic branching. Always check your plans before you run them (vis_drake_graph() etc.). 5.3 Tidy evaluation In earlier plans, we used the “bang-bang” operator !! from tidy evaluation, e.g. model_function = !!model_functions in cross(). But why? Why not just type model_function = model_functions? Consider the following incorrect plan. model_functions <- rlang::syms(c("main", "altv")) plan <- drake_plan( data = get_data(), analysis = target( model_function(data, mean = mean_value, tuning = tuning_setting), transform = cross( tuning_setting = c("fast", "slow"), mean_value = 1:4, # without !! model_function = model_functions # without !! ) ) ) drake_plan_source(plan) #> drake_plan( #> analysis_fast_1_model_functions = model_functions(data, mean = 1, tuning = "fast"), #> analysis_slow_1_model_functions = model_functions(data, mean = 1, tuning = "slow"), #> analysis_fast_4_model_functions = model_functions(data, mean = 4, tuning = "fast"), #> analysis_slow_4_model_functions = model_functions(data, mean = 4, tuning = "slow"), #> data = get_data() #> ) Because we omit !!, we create two problems: The commands use model_functions() instead of the desired main() and altv(). We are missing the targets with mean = 2 and mean = 3. Why? To make static branching work properly, drake does not actually evaluate the arguments to cross(). It just uses the raw symbols and expressions. 
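The difference between a raw expression and its value can be inspected outside drake. The sketch below uses rlang::expr() as a stand-in for the way a transformation argument is captured; it is an illustration under that assumption, not drake's internal code. Here cross is only a symbol inside the quoted call and is never executed.

```r
library(rlang)

# Without unquoting, 1:4 is captured as an unevaluated call to `:`
# whose only arguments are the literals 1 and 4.
quoted <- expr(cross(mean_value = 1:4))
as.list(quoted[["mean_value"]])

# With !!, the value of 1:4 is computed first and inlined,
# so all four integers are present in the expression.
unquoted <- expr(cross(mean_value = !!(1:4)))
unquoted[["mean_value"]]
```

The captured call holds only the endpoints 1 and 4, which plausibly explains why the incorrect plan has no targets for mean = 2 and mean = 3.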
To force drake to use the values instead, we need !!. model_functions <- rlang::syms(c("main", "altv")) plan <- drake_plan( data = get_data(), analysis = target( model_function(data, mean = mean_value, tuning = tuning_setting), transform = cross( tuning_setting = c("fast", "slow"), mean_value = !!(1:4), # with !! model_function = !!model_functions # with !! ) ) ) drake_plan_source(plan) #> drake_plan( #> analysis_fast_1L_main = main(data, mean = 1L, tuning = "fast"), #> analysis_slow_1L_main = main(data, mean = 1L, tuning = "slow"), #> analysis_fast_2L_main = main(data, mean = 2L, tuning = "fast"), #> analysis_slow_2L_main = main(data, mean = 2L, tuning = "slow"), #> analysis_fast_3L_main = main(data, mean = 3L, tuning = "fast"), #> analysis_slow_3L_main = main(data, mean = 3L, tuning = "slow"), #> analysis_fast_4L_main = main(data, mean = 4L, tuning = "fast"), #> analysis_slow_4L_main = main(data, mean = 4L, tuning = "slow"), #> analysis_fast_1L_altv = altv(data, mean = 1L, tuning = "fast"), #> analysis_slow_1L_altv = altv(data, mean = 1L, tuning = "slow"), #> analysis_fast_2L_altv = altv(data, mean = 2L, tuning = "fast"), #> analysis_slow_2L_altv = altv(data, mean = 2L, tuning = "slow"), #> analysis_fast_3L_altv = altv(data, mean = 3L, tuning = "fast"), #> analysis_slow_3L_altv = altv(data, mean = 3L, tuning = "slow"), #> analysis_fast_4L_altv = altv(data, mean = 4L, tuning = "fast"), #> analysis_slow_4L_altv = altv(data, mean = 4L, tuning = "slow"), #> data = get_data() #> ) 5.4 Static transformations There are four transformations in static branching: map(), cross(), split(), and combine(). They are not actual functions, just special language to supply to the transform argument of target() in drake_plan(). Each transformation is similar to a function from the Tidyverse. 
drake Tidyverse analogue map() pmap() from purrr cross() crossing() from tidyr split() group_map() from dplyr combine() summarize() from dplyr 5.4.1 map() map() creates a new target for each row in a grid. drake_plan( x = target( simulate_data(center, scale), transform = map(center = c(2, 1, 0), scale = c(3, 2, 1)) ) ) #> # A tibble: 3 x 2 #> target command #> <chr> <expr_lst> #> 1 x_2_3 simulate_data(2, 3) #> 2 x_1_2 simulate_data(1, 2) #> 3 x_0_1 simulate_data(0, 1) You can supply the grid directly with the .data argument. Note the use of !! below. (See the tidy evaluation section.) my_grid <- tibble( sim_function = c("rnorm", "rt", "rcauchy"), title = c("Normal", "Student t", "Cauchy") ) my_grid$sim_function <- rlang::syms(my_grid$sim_function) drake_plan( x = target( simulate_data(sim_function, title, center, scale), transform = map( center = c(2, 1, 0), scale = c(3, 2, 1), .data = !!my_grid, # In `.id`, you can select one or more grouping variables # for pretty target names. # Set to FALSE to use short numeric suffixes. .id = sim_function # Try `.id = c(sim_function, center)` yourself. ) ) ) #> # A tibble: 3 x 2 #> target command #> <chr> <expr_lst> #> 1 x_rnorm simulate_data(rnorm, "Normal", 2, 3) #> 2 x_rt simulate_data(rt, "Student t", 1, 2) #> 3 x_rcauchy simulate_data(rcauchy, "Cauchy", 0, 1) 5.4.2 cross() cross() creates a new target for each combination of argument values. 
drake_plan( x = target( simulate_data(nrow, ncol), transform = cross(nrow = c(1, 2, 3), ncol = c(4, 5)) ) ) #> # A tibble: 6 x 2 #> target command #> <chr> <expr_lst> #> 1 x_1_4 simulate_data(1, 4) #> 2 x_2_4 simulate_data(2, 4) #> 3 x_3_4 simulate_data(3, 4) #> 4 x_1_5 simulate_data(1, 5) #> 5 x_2_5 simulate_data(2, 5) #> 6 x_3_5 simulate_data(3, 5) 5.4.3 split() The split() transformation distributes a dataset as uniformly as possible across multiple targets. plan <- drake_plan( large_data = get_data(), slice_analysis = target( large_data %>% analyze(), transform = split(large_data, slices = 4) ), results = target( dplyr::bind_rows(slice_analysis), transform = combine(slice_analysis) ) ) plan #> # A tibble: 6 x 2 #> target command #> <chr> <expr_lst> #> 1 large_data get_data() … #> 2 results dplyr::bind_rows(slice_analysis_1, slice_analysis_2, slice_ana… #> 3 slice_analysi… drake_slice(data = large_data, slices = 4, index = 1) %>% anal… #> 4 slice_analysi… drake_slice(data = large_data, slices = 4, index = 2) %>% anal… #> 5 slice_analysi… drake_slice(data = large_data, slices = 4, index = 3) %>% anal… #> 6 slice_analysi… drake_slice(data = large_data, slices = 4, index = 4) %>% anal… plot(plan) At runtime, drake_slice() takes a single subset of the data. It supports data frames, matrices, and arbitrary arrays. drake_slice(mtcars, slices = 32, index = 1) #> mpg cyl disp hp drat wt qsec vs am gear carb #> Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4 drake_slice(mtcars, slices = 32, index = 2) #> mpg cyl disp hp drat wt qsec vs am gear carb #> Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 5.4.4 combine() combine() aggregates targets. The closest comparison is the unquote-splice operator !!! from tidy evaluation. 
plan <- drake_plan( data_group1 = target( sim_data(mean = x, sd = y), transform = map(x = c(1, 2), y = c(3, 4)) ), data_group2 = target( pull_data(url), transform = map(url = c("example1.com", "example2.com")) ), larger = target( bind_rows(data_group1, data_group2, .id = "id") %>% arrange(sd) %>% head(n = 400), transform = combine(data_group1, data_group2) ) ) drake_plan_source(plan) #> drake_plan( #> data_group1_1_3 = sim_data(mean = 1, sd = 3), #> data_group1_2_4 = sim_data(mean = 2, sd = 4), #> data_group2_example1.com = pull_data("example1.com"), #> data_group2_example2.com = pull_data("example2.com"), #> larger = bind_rows(data_group1_1_3, data_group1_2_4, data_group2_example1.com, #> data_group2_example2.com, #> .id = "id" #> ) %>% #> arrange(sd) %>% #> head(n = 400) #> ) To create multiple combined groups, use the .by argument. plan <- drake_plan( data = target( sim_data(mean = x, sd = y, skew = z), transform = cross(x = c(1, 2), y = c(3, 4), z = c(5, 6)) ), combined = target( bind_rows(data, .id = "id") %>% arrange(sd) %>% head(n = 400), transform = combine(data, .by = c(x, y)) ) ) drake_plan_source(plan) #> drake_plan( #> combined_1_3 = bind_rows(data_1_3_5, data_1_3_6, .id = "id") %>% #> arrange(sd) %>% #> head(n = 400), #> combined_2_3 = bind_rows(data_2_3_5, data_2_3_6, .id = "id") %>% #> arrange(sd) %>% #> head(n = 400), #> combined_1_4 = bind_rows(data_1_4_5, data_1_4_6, .id = "id") %>% #> arrange(sd) %>% #> head(n = 400), #> combined_2_4 = bind_rows(data_2_4_5, data_2_4_6, .id = "id") %>% #> arrange(sd) %>% #> head(n = 400), #> data_1_3_5 = sim_data(mean = 1, sd = 3, skew = 5), #> data_2_3_5 = sim_data(mean = 2, sd = 3, skew = 5), #> data_1_4_5 = sim_data(mean = 1, sd = 4, skew = 5), #> data_2_4_5 = sim_data(mean = 2, sd = 4, skew = 5), #> data_1_3_6 = sim_data(mean = 1, sd = 3, skew = 6), #> data_2_3_6 = sim_data(mean = 2, sd = 3, skew = 6), #> data_1_4_6 = sim_data(mean = 1, sd = 4, skew = 6), #> data_2_4_6 = sim_data(mean = 2, sd = 4, skew = 6) #> 
) 5.5 Target names drake releases after 7.12.0 let you define your own custom names with the optional .names argument of transformations. analysis_names <- c("experimental", "thorough", "minimal", "naive") plan <- drake_plan( dataset = target( get_dataset(data_index), transform = map(data_index = !!seq_len(2), .names = c("new", "old")) ), analysis = target( apply_method(method_name, dataset), transform = cross( method_name = c("method1", "method2"), dataset, .names = !!analysis_names ) ), summary = target( summarize(analysis), transform = combine(analysis, .by = dataset, .names = c("table1", "table2")) ) ) plan #> # A tibble: 8 x 2 #> target command #> <chr> <expr_lst> #> 1 experimental apply_method("method1", new) #> 2 thorough apply_method("method2", new) #> 3 minimal apply_method("method1", old) #> 4 naive apply_method("method2", old) #> 5 new get_dataset(1L) #> 6 old get_dataset(2L) #> 7 table1 summarize(experimental, thorough) #> 8 table2 summarize(minimal, naive) plot(plan) The disadvantage of .names is that you need to know in advance the number of targets a transformation will generate. As an alternative, all transformations have an optional .id argument to control the names of targets. Use it to select the grouping variables that go into the names, as well as the order they appear in the suffixes. 
drake_plan( data = target( get_data(param1, param2), transform = map( param1 = c(123, 456), param2 = c(7, 9), param2 = c("abc", "xyz"), .id = param2 ) ) ) #> # A tibble: 2 x 2 #> target command #> <chr> <expr_lst> #> 1 data_7 get_data(123, 7) #> 2 data_9 get_data(456, 9) drake_plan( data = target( get_data(param1, param2), transform = map( param1 = c(123, 456), param2 = c(7, 9), param2 = c("abc", "xyz"), .id = c(param2, param1) ) ) ) #> # A tibble: 2 x 2 #> target command #> <chr> <expr_lst> #> 1 data_7_123 get_data(123, 7) #> 2 data_9_456 get_data(456, 9) drake_plan( data = target( get_data(param1, param2), transform = map( param1 = c(123, 456), param2 = c(7, 9), param2 = c("abc", "xyz"), .id = c(param1, param2) ) ) ) #> # A tibble: 2 x 2 #> target command #> <chr> <expr_lst> #> 1 data_123_7 get_data(123, 7) #> 2 data_456_9 get_data(456, 9) Set .id to FALSE to ignore the grouping variables altogether. drake_plan( data = target( get_data(param1, param2), transform = map( param1 = c(123, 456), param2 = c(7, 9), param2 = c("abc", "xyz"), .id = FALSE ) ) ) #> # A tibble: 2 x 2 #> target command #> <chr> <expr_lst> #> 1 data get_data(123, 7) #> 2 data_2 get_data(456, 9) Finally, drake supports a special .id_chr symbol in commands to let you refer to the name of the current target as a character string. 
as_chr <- function(x) { deparse(substitute(x)) } plan <- drake_plan( data = target( get_data(param), transform = map(param = c(123, 456)) ), keras_model = target( save_model_hdf5(fit_model(data), file_out(!!sprintf("%s.h5", .id_chr))), transform = map(data, .id = param) ), result = target( predict(load_model_hdf5(file_in(!!sprintf("%s.h5", as_chr(keras_model))))), transform = map(keras_model, .id = param) ) ) plan #> # A tibble: 6 x 2 #> target command #> <chr> <expr_lst> #> 1 data_123 get_data(123) … #> 2 data_456 get_data(456) … #> 3 keras_model_123 save_model_hdf5(fit_model(data_123), file_out("keras_model_12… #> 4 keras_model_456 save_model_hdf5(fit_model(data_456), file_out("keras_model_45… #> 5 result_123 predict(load_model_hdf5(file_in("keras_model_123.h5"))) … #> 6 result_456 predict(load_model_hdf5(file_in("keras_model_456.h5"))) … drake_plan_source(plan) #> drake_plan( #> data_123 = get_data(123), #> data_456 = get_data(456), #> keras_model_123 = save_model_hdf5(fit_model(data_123), file_out("keras_model_123.h5")), #> keras_model_456 = save_model_hdf5(fit_model(data_456), file_out("keras_model_456.h5")), #> result_123 = predict(load_model_hdf5(file_in("keras_model_123.h5"))), #> result_456 = predict(load_model_hdf5(file_in("keras_model_456.h5"))) #> ) 5.6 Tags A tag is a custom grouping variable for a transformation. There are two kinds of tags: In-tags, which contain the target name you start with, and Out-tags, which contain the target names generated by the transformations. 
drake_plan( x = target( command, transform = map(y = c(1, 2), .tag_in = from, .tag_out = c(to, out)) ), trace = TRUE ) #> # A tibble: 2 x 7 #> target command y x from to out #> <chr> <expr_lst> <chr> <chr> <chr> <chr> <chr> #> 1 x_1 command 1 x_1 x x_1 x_1 #> 2 x_2 command 2 x_2 x x_2 x_2 Subsequent transformations can use tags as grouping variables and add to existing tags. plan <- drake_plan( prep_work = do_prep_work(), local = target( get_local_data(n, prep_work), transform = map(n = c(1, 2), .tag_in = data_source, .tag_out = data) ), online = target( get_online_data(n, prep_work, port = "8080"), transform = map(n = c(1, 2), .tag_in = data_source, .tag_out = data) ), summary = target( summarize(bind_rows(data, .id = "data")), transform = combine(data, .by = data_source) ), munged = target( munge(bind_rows(data, .id = "data")), transform = combine(data, .by = n) ) ) drake_plan_source(plan) #> drake_plan( #> local_1 = get_local_data(1, prep_work), #> local_2 = get_local_data(2, prep_work), #> munged_1 = munge(bind_rows(local_1, online_1, .id = "data")), #> munged_2 = munge(bind_rows(local_2, online_2, .id = "data")), #> online_1 = get_online_data(1, prep_work, port = "8080"), #> online_2 = get_online_data(2, prep_work, port = "8080"), #> prep_work = do_prep_work(), #> summary_local = summarize(bind_rows(local_1, local_2, .id = "data")), #> summary_online = summarize(bind_rows(online_1, online_2, .id = "data")) #> ) plot(plan) "],
["dynamic.html", "Chapter 6 Dynamic branching 6.1 A note about versions 6.2 Motivation 6.3 Which kind of branching should I use? 6.4 Dynamic targets 6.5 Dynamic transformations 6.6 Trace 6.7 max_expand", " Chapter 6 Dynamic branching 6.1 A note about versions The first release of dynamic branching was in drake version 7.8.0. In subsequent versions, dynamic branching behaves differently. This manual describes how dynamic branching works in development drake (to become version 7.9.0 in early January 2020). If you are using version 7.8.0, please refer to this version of the chapter instead. 6.2 Motivation In large workflows, you may need more targets than you can easily type in a plan, and you may not be able to fully specify all targets in advance. Dynamic branching is an interface to declare new targets while make() is running. It lets you create more compact plans and graphs, it is easier to use than static branching, and it improves the startup speed of make() and friends. 6.3 Which kind of branching should I use? With dynamic branching, make() is faster to initialize, and you have far more flexibility. With static branching, you have meaningful target names, and it is easier to predict what the plan is going to do in advance. There is a ton of room for overlap and personal judgement, and you can even use both kinds of branching together. 6.4 Dynamic targets A dynamic target is a vector of sub-targets. We let make() figure out which sub-targets to create and how to aggregate them. As an example, let’s fit a regression model to each continent in Gapminder data. To activate dynamic branching, use the dynamic argument of target(). library(broom) library(drake) library(gapminder) library(tidyverse) # Split the Gapminder data by continent. gapminder_continents <- function() { gapminder %>% mutate(gdpPercap = scale(gdpPercap)) %>% split(f = .$continent) } # Fit a model to a continent. 
fit_model <- function(continent_data) { data <- continent_data[[1]] data %>% lm(formula = gdpPercap ~ year) %>% tidy() %>% mutate(continent = data$continent[1]) %>% select(continent, term, statistic, p.value) } plan <- drake_plan( continents = gapminder_continents(), model = target(fit_model(continents), dynamic = map(continents)) ) make(plan) #> ▶ target continents #> ▶ dynamic model #> ❯ subtarget model_c56e5407 #> ❯ subtarget model_706a1529 #> ❯ subtarget model_da843806 #> ❯ subtarget model_862f8003 #> ❯ subtarget model_ebb41f51 #> ■ finalize model The data type of every sub-target is the same as the dynamic target it belongs to. In other words, model and model_c56e5407 are both data frames, and readd(model) and friends automatically concatenate all the model_* sub-targets. readd(model) #> # A tibble: 10 x 4 #> continent term statistic p.value #> <fct> <chr> <dbl> <dbl> #> 1 Africa (Intercept) -4.44 1.08e- 5 #> 2 Africa year 4.04 5.90e- 5 #> 3 Americas (Intercept) -5.56 6.10e- 8 #> 4 Americas year 5.55 6.16e- 8 #> 5 Asia (Intercept) -2.74 6.39e- 3 #> 6 Asia year 2.75 6.23e- 3 #> 7 Europe (Intercept) -14.4 3.12e-37 #> 8 Europe year 14.5 7.06e-38 #> 9 Oceania (Intercept) -11.3 1.32e-10 #> 10 Oceania year 11.5 9.48e-11 This behavior is powered by the vctrs package. A dynamic target like model above is really a “vctr” of sub-targets. Under the hood, the aggregated value of model is what you get from calling vec_c() on all the model_* sub-targets. 
When you dynamically map() over a non-dynamic object, you are taking slices with vec_slice(). (When you map() over a dynamic target, each element is a sub-target and vec_slice() is not necessary.) library(vctrs) # same as readd(model) s <- subtargets(model) vec_c( readd(s[1], character_only = TRUE), readd(s[2], character_only = TRUE), readd(s[3], character_only = TRUE), readd(s[4], character_only = TRUE), readd(s[5], character_only = TRUE) ) #> # A tibble: 10 x 4 #> continent term statistic p.value #> <fct> <chr> <dbl> <dbl> #> 1 Africa (Intercept) -4.44 1.08e- 5 #> 2 Africa year 4.04 5.90e- 5 #> 3 Americas (Intercept) -5.56 6.10e- 8 #> 4 Americas year 5.55 6.16e- 8 #> 5 Asia (Intercept) -2.74 6.39e- 3 #> 6 Asia year 2.75 6.23e- 3 #> 7 Europe (Intercept) -14.4 3.12e-37 #> 8 Europe year 14.5 7.06e-38 #> 9 Oceania (Intercept) -11.3 1.32e-10 #> 10 Oceania year 11.5 9.48e-11 loadd(model) # Second slice if you were to map() over mtcars. vec_slice(mtcars, 2) #> mpg cyl disp hp drat wt qsec vs am gear carb #> Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 # Fifth slice if you were to map() over letters. vec_slice(letters, 5) #> [1] "e" You can use vec_c() and vec_slice() to anticipate edge cases in dynamic branching. # If you map() over a list, each sub-target is a single-element list. vec_slice(list(1, 2), 1) #> [[1]] #> [1] 1 # If each sub-target has multiple elements, # the aggregated target (e.g. from readd()) # will have more elements than sub-targets. 
subtarget1 <- c(1, 2) subtarget2 <- c(3, 4) vec_c(subtarget1, subtarget2) #> [1] 1 2 3 4 Back in our plan, target(fit_model(continents), dynamic = map(continents)) is equivalent to commands fit_model(continents[1]) through fit_model(continents[5]). Since continents is really a list of data frames, continents[1] through continents[5] are also lists of data frames, which is why we need the line data <- continent_data[[1]] in fit_model(). To post-process our models, we can work with either the individual sub-targets or the whole vector of all the models. Below, year uses the former and intercept uses the latter. plan <- drake_plan( continents = gapminder_continents(), model = target(fit_model(continents), dynamic = map(continents)), # Filter each model individually: year = target(filter(model, term == "year"), dynamic = map(model)), # Aggregate all the models, then filter the whole vector: intercept = filter(model, term != "year") ) make(plan) #> ℹ unloading 1 targets from environment #> ▶ target intercept #> ▶ dynamic year #> ❯ subtarget year_20cb8ecb #> ❯ subtarget year_f7502c3e #> ❯ subtarget year_a22d53f2 #> ❯ subtarget year_1facb02b #> ❯ subtarget year_399fff25 #> ■ finalize year readd(year) #> # A tibble: 5 x 4 #> continent term statistic p.value #> <fct> <chr> <dbl> <dbl> #> 1 Africa year 4.04 5.90e- 5 #> 2 Americas year 5.55 6.16e- 8 #> 3 Asia year 2.75 6.23e- 3 #> 4 Europe year 14.5 7.06e-38 #> 5 Oceania year 11.5 9.48e-11 readd(intercept) #> # A tibble: 5 x 4 #> continent term statistic p.value #> <fct> <chr> <dbl> <dbl> #> 1 Africa (Intercept) -4.44 1.08e- 5 #> 2 Americas 
(Intercept) -5.56 6.10e- 8 #> 3 Asia (Intercept) -2.74 6.39e- 3 #> 4 Europe (Intercept) -14.4 3.12e-37 #> 5 Oceania (Intercept) -11.3 1.32e-10 If automatic concatenation of sub-targets is confusing (e.g. if some sub-targets are NULL, as in https://github.com/ropensci-books/drake/issues/142) you can read the dynamic target as a named list (only in drake version 7.10.0 and above). readd(model, subtarget_list = TRUE) # Requires drake >= 7.10.0. #> $model_c56e5407 #> # A tibble: 2 x 4 #> continent term statistic p.value #> <fct> <chr> <dbl> <dbl> #> 1 Africa (Intercept) -4.44 0.0000108 #> 2 Africa year 4.04 0.0000590 #> #> $model_706a1529 #> # A tibble: 2 x 4 #> continent term statistic p.value #> <fct> <chr> <dbl> <dbl> #> 1 Americas (Intercept) -5.56 0.0000000610 #> 2 Americas year 5.55 0.0000000616 #> #> $model_da843806 #> # A tibble: 2 x 4 #> continent term statistic p.value #> <fct> <chr> <dbl> <dbl> #> 1 Asia (Intercept) -2.74 0.00639 #> 2 Asia year 2.75 0.00623 #> #> $model_862f8003 #> # A tibble: 2 x 4 #> continent term statistic p.value #> <fct> <chr> <dbl> <dbl> #> 1 Europe (Intercept) -14.4 3.12e-37 #> 2 Europe year 14.5 7.06e-38 #> #> $model_ebb41f51 #> # A tibble: 2 x 4 #> continent term 
statistic p.value #> <fct> <chr> <dbl> <dbl> #> 1 Oceania (Intercept) -11.3 1.32e-10 #> 2 Oceania year 11.5 9.48e-11 Alternatively, you can work with the individual sub-targets. subtargets(model) #> [1] "model_c56e5407" "model_706a1529" "model_da843806" "model_862f8003" #> [5] "model_ebb41f51" readd(model, subtargets = 1) # equivalent to readd() on a single model_* sub-target #> # A tibble: 2 x 4 #> continent term statistic p.value #> <fct> <chr> <dbl> <dbl> #> 1 Africa (Intercept) -4.44 0.0000108 #> 2 Africa year 4.04 0.0000590 6.5 Dynamic transformations Dynamic branching supports transformations map(), cross(), and group(). These transformations tell drake how to create sub-targets. 6.5.1 map() map() iterates over the vector slices of the targets you supply as arguments. We saw above how map() iterates over lists. If you give it a data frame, it will map over the rows. plan <- drake_plan( subset = head(gapminder), row = target(subset, dynamic = map(subset)) ) make(plan) #> ▶ target subset #> ▶ dynamic row #> ❯ subtarget row_9939cae3 #> ❯ subtarget row_e8047114 #> ❯ subtarget row_2ef3db10 #> ❯ subtarget row_f9171bbe #> ❯ subtarget row_7d6002e9 #> ❯ subtarget row_509468b3 #> ■ finalize row readd(row_9939cae3) #> # A tibble: 1 x 6 #> country continent year lifeExp pop gdpPercap #> <fct> <fct> <int> <dbl> <int> <dbl> #> 1 Afghanistan Asia 1952 28.8 8425333 779. If you supply multiple targets, map() iterates over the slices of each. 
plan <- drake_plan( numbers = seq_len(2), letters = c("a", "b"), zipped = target(paste0(numbers, letters), dynamic = map(numbers, letters)) ) make(plan) #> ▶ target numbers #> ▶ target letters #> ▶ dynamic zipped #> ❯ subtarget zipped_8ac3968c #> ❯ subtarget zipped_4a7a9b07 #> ■ finalize zipped readd(zipped) #> [1] "1a" "2b" 6.5.2 cross() cross() creates a new sub-target for each combination of targets you supply as arguments. plan <- drake_plan( numbers = seq_len(2), letters = c("a", "b"), combo = target(paste0(numbers, letters), dynamic = cross(numbers, letters)) ) make(plan) #> ▶ dynamic combo #> ❯ subtarget combo_8ac3968c #> ❯ subtarget combo_ed1d2e7b #> ❯ subtarget combo_ef37ab56 #> ❯ subtarget combo_4a7a9b07 #> ■ finalize combo readd(combo) #> [1] "1a" "1b" "2a" "2b" 6.5.3 group() With group(), you can create multiple aggregates of a given target. Use the .by argument to set a grouping variable. plan <- drake_plan( data = gapminder, by = data$continent, gdp = target( tibble(median = median(data$gdpPercap), continent = by[1]), dynamic = group(data, .by = by) ) ) make(plan) #> ▶ target data #> ▶ target by #> ▶ dynamic gdp #> ❯ subtarget gdp_9adfc39f #> ❯ subtarget gdp_d9f30951 #> ❯ subtarget gdp_958a2f81 #> ❯ subtarget gdp_962b03c8 #> ❯ subtarget gdp_dc1cff81 #> ■ finalize gdp readd(gdp) #> # A tibble: 5 x 2 #> median continent #> <dbl> <fct> #> 1 2647. Asia #> 2 12082. Europe #> 3 1192. Africa #> 4 5466. Americas #> 5 17983. Oceania 6.6 Trace All dynamic transforms have a .trace argument to record optional metadata for each sub-target. In the example from group(), the trace is another way to keep track of the continent of each median GDP value. 
plan <- drake_plan( data = gapminder, by = data$continent, gdp = target( median(data$gdpPercap), dynamic = group(data, .by = by, .trace = by) ) ) make(plan) #> ▶ dynamic gdp #> ❯ subtarget gdp_7e88fb1c #> ❯ subtarget gdp_a61b8e1b #> ❯ subtarget gdp_278ff532 #> ❯ subtarget gdp_6f3facea #> ❯ subtarget gdp_73037e69 #> ■ finalize gdp The gdp target no longer contains any explicit reference to continent. readd(gdp) #> [1] 2646.787 12081.749 1192.138 5465.510 17983.304 However, we can look up the continents in the trace. read_trace("by", gdp) #> [1] Asia Europe Africa Americas Oceania #> Levels: Africa Americas Asia Europe Oceania 6.7 max_expand Suppose we want a model for each country. gapminder_countries <- function() { gapminder %>% mutate(gdpPercap = scale(gdpPercap)) %>% split(f = .$country) } plan <- drake_plan( countries = gapminder_countries(), model = target(fit_model(countries), dynamic = map(countries)) ) The Gapminder dataset has 142 countries, which can get overwhelming. In the early stages of the workflow when we are still debugging and testing, we can limit the number of sub-targets using the max_expand argument of make(). make(plan, max_expand = 2) #> ▶ target countries #> ▶ dynamic model #> ❯ subtarget model_ab009698 #> ❯ subtarget model_cc031a6d #> ■ finalize model readd(model) #> # A tibble: 4 x 4 #> continent term statistic p.value #> <fct> <chr> <dbl> <dbl> #> 1 Asia (Intercept) -1.48 0.170 #> 2 Asia year -0.233 0.821 #> 3 Europe (Intercept) -4.76 0.000773 #> 4 Europe year 4.59 0.000998 Then, when we are confident and ready, we can scale up to the full number of models. make(plan) "],
["projects.html", "Chapter 7 drake projects 7.1 External resources 7.2 Code files 7.3 Safer interactivity 7.4 Script file pitfalls 7.5 Workflows as R packages 7.6 Other tools", " Chapter 7 drake projects drake’s design philosophy is extremely R-focused. It embraces in-memory configuration, in-memory dependencies, interactivity, and flexibility. This scope leaves project setup and file management decisions mostly up to the user. This chapter tries to fill in the blanks and address practical hurdles when it comes to setting up projects. 7.1 External resources Miles McBain’s excellent blog post explains the practical issues {drake} solves for most projects, how to set up a project as quickly and painlessly as possible, and how to overcome common obstacles. Miles’ dflow package generates the file structure for a boilerplate drake project. It is a more thorough alternative to drake::use_drake(). drake is heavily function-oriented by design, and Miles’ fnmate package automatically generates boilerplate code and docstrings for functions you mention in drake plans. 7.2 Code files The names and locations of the files are entirely up to you, but this pattern is particularly useful to start with. make.R R/ ├── packages.R ├── functions.R └── plan.R Here, make.R is a master script that Loads your packages, functions, and other in-memory data. Creates the drake plan. Calls make(). Let’s consider the main example, which you can download with drake_example(\"main\"). Here, our master script is called make.R: source("R/packages.R") # loads packages source("R/functions.R") # defines the create_plot() function source("R/plan.R") # creates the drake plan # options(clustermq.scheduler = "multicore") # optional parallel computing. Also needs parallelism = "clustermq" make( plan, # defined in R/plan.R verbose = 2 ) We have an R folder containing our supporting files. packages.R typically includes all the packages you will use in the workflow. 
# packages.R library(drake) library(dplyr) library(ggplot2) Your functions.R typically has the supporting custom functions you write for the workflow. If there are many functions, you could split them up into multiple files. # functions.R create_plot <- function(data) { ggplot(data) + geom_histogram(aes(x = Ozone), binwidth = 10) + theme_gray(24) } Finally, it is good practice to define a plan.R that defines the plan. # plan.R plan <- drake_plan( raw_data = readxl::read_excel(file_in("raw_data.xlsx")), data = raw_data %>% mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))), hist = create_plot(data), fit = lm(Ozone ~ Wind + Temp, data), report = rmarkdown::render( knitr_in("report.Rmd"), output_file = file_out("report.html"), quiet = TRUE ) ) To run the example project above, Start a clean new R session. Run the make.R script. On Mac and Linux, you can do this by opening a terminal and entering R CMD BATCH --no-save make.R. On Windows, restart your R session and call source(\"make.R\") in the R console. Note: this part of drake does not inherently focus on your script files. There is nothing magical about the names make.R, packages.R, functions.R, or plan.R. Different projects may require different file structures. drake has other functions to inspect your results and examine your workflow. Before invoking them interactively, it is best to start with a clean new R session. # Restart R. interactive() #> [1] TRUE source("R/packages.R") source("R/functions.R") source("R/plan.R") vis_drake_graph(plan) 7.3 Safer interactivity 7.3.1 Motivation A serious drake workflow should be consistent and reliable, ideally with the help of a master R script. Before it builds your targets, this script should begin in a fresh R session and load your packages and functions in a dependable manner. Batch mode makes sure all this goes according to plan. 
If you use a single persistent interactive R session to repeatedly invoke make() while you develop the workflow, then over time, your session could grow stale and accidentally invalidate targets. For example, if you interactively tinker with a new version of create_plot(), targets hist and report will fall out of date without warning, and the next make() will build them again. Even worse, the outputs from hist and report will be wrong if they depend on a half-finished create_plot(). The quickest workaround is to restart R and source() your setup scripts all over again. However, a better solution is to use r_make() and friends. r_make() runs make() in a new transient R session so that accidental changes to your interactive environment do not break your workflow. 7.3.2 Usage To use r_make(), you need a configuration R script. Unless you supply a custom file path (e.g. r_make(source = \"your_file.R\") or options(drake_source = \"your_file.R\")) drake assumes this configuration script is called _drake.R. (So the file name really is magical in this case). The suggested file structure becomes: _drake.R R/ ├── packages.R ├── functions.R └── plan.R Like our previous make.R script, _drake.R runs all our pre-make() setup steps. But this time, rather than calling make(), it ends with a call to drake_config(). drake_config() is the initial preprocessing stage of make(), and it accepts all the same arguments as make(). Example _drake.R: source("R/packages.R") source("R/functions.R") source("R/plan.R") # options(clustermq.scheduler = "multicore") # optional parallel computing drake_config(plan, verbose = 2) Here is what happens when you call r_make(). drake launches a new transient R session using callr::r(). The remaining steps all happen within this transient session. Run the configuration script (e.g. _drake.R) to Load the packages, functions, global options, drake plan, etc. 
into the session’s environment, and Run the call to drake_config() and store the results in a variable called config. Execute make_impl(config = config), an internal drake function. The purpose of drake_config() is to collect and sanitize all the parameters and settings that make() needs to do its job. In fact, if you do not set the config argument explicitly, then make() invokes drake_config() behind the scenes. make(plan, parallelism = \"clustermq\", jobs = 2, verbose = 6) is equivalent to config <- drake_config(plan, parallelism = "clustermq", jobs = 2, verbose = 6) make_impl(config = config) There are many more r_*() functions besides r_make(), each of which launches a fresh session and runs an inner drake function on the config object from _drake.R. Outer function call Inner function call r_make() make_impl(config = config) r_drake_build(...) drake_build_impl(config, ...) r_outdated(...) outdated_impl(config, ...) r_missed(...) missed_impl(config, ...) r_vis_drake_graph(...) vis_drake_graph_impl(config, ...) r_sankey_drake_graph(...) sankey_drake_graph_impl(config, ...) r_drake_ggraph(...) drake_ggraph_impl(config, ...) r_drake_graph_info(...) drake_graph_info_impl(config, ...) r_predict_runtime(...) predict_runtime_impl(config, ...) r_predict_workers(...) predict_workers_impl(config, ...) 
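The reason the configuration script must end with the call to drake_config() can be illustrated with base R alone. The sketch below is not drake's implementation. It only assumes that the transient session picks up the configuration object as the value of the last expression in the script, which is how base R's source() reports a script's result. The script contents and the name fake_config are hypothetical stand-ins.

```r
# Write a stand-in configuration script to a temporary file.
# In a real project this would be _drake.R, ending in drake_config(plan, ...).
script <- tempfile(fileext = ".R")
writeLines(
  c(
    "targets <- c('raw_data', 'data', 'fit')",  # setup steps
    "list(targets = targets, verbose = 2)"      # last expression: the stand-in config
  ),
  script
)

# source() invisibly returns a list whose `value` element holds the value
# of the last expression evaluated in the script.
fake_config <- source(script, local = new.env())$value

fake_config$verbose
fake_config$targets
```

Under this assumption, any statement placed after the drake_config() call would have its value captured instead, so drake_config() should stay the final expression in the script.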
clean() r_outdated(r_args = list(show = FALSE)) #> [1] "data" "fit" "hist" "raw_data" "report" r_make() #> Loading required package: dplyr #> #> Attaching package: ‘dplyr’ #> #> The following objects are masked from ‘package:stats’: #> #> filter, lag #> #> The following objects are masked from ‘package:base’: #> #> intersect, setdiff, setequal, union #> #> Loading required package: ggplot2 #> #> Attaching package: ‘tidyr’ #> #> The following objects are masked from ‘package:drake’: #> #> expand, gather #> #> ▶ target raw_data #> ▶ target data #> ▶ target fit #> ▶ target hist #> ▶ target report r_outdated(r_args = list(show = FALSE)) #> character(0) r_vis_drake_graph(targets_only = TRUE, r_args = list(show = FALSE)) Remarks: You can run r_make() in an interactive session, but the transient process it launches will not be interactive. Thus, any browser() statements in the commands in your drake plan will be ignored. You can select and configure the underlying callr function using arguments r_fn and r_args, respectively. For example code, you can download the updated main example (drake_example(\"main\")) and experiment with files _drake.R and interactive.R. 7.4 Script file pitfalls Despite the above discussion of R scripts, drake plans rely more on in-memory functions. You might be tempted to write a plan like the following, but then drake cannot tell that my_analysis depends on my_data. bad_plan <- drake_plan( my_data = source(file_in("get_data.R")), my_analysis = source(file_in("analyze_data.R")), my_summaries = source(file_in("summarize_data.R")) ) vis_drake_graph(bad_plan, targets_only = TRUE) When it comes to plans, use functions instead. source("my_functions.R") # defines get_data(), analyze_data(), etc. good_plan <- drake_plan( my_data = get_data(file_in("data.csv")), # External files need to be in commands explicitly. 
# nolint my_analysis = analyze_data(my_data), my_summaries = summarize_results(my_data, my_analysis) ) vis_drake_graph(good_plan, targets_only = TRUE) In drake >= 7.6.2.9000, code_to_function() leverages existing imperative scripts for use in a drake plan. get_data <- code_to_function("get_data.R") do_analysis <- code_to_function("analyze_data.R") do_summary <- code_to_function("summarize_data.R") good_plan <- drake_plan( my_data = get_data(), my_analysis = do_analysis(my_data), my_summaries = do_summary(my_data, my_analysis) ) vis_drake_graph(good_plan, targets_only = TRUE) 7.5 Workflows as R packages The R package structure is a great way to organize and quality-control a data analysis project. If you write a drake workflow as a package, you will need to: Supply the namespace of your package to the envir argument of make() or drake_config() (e.g. make(envir = getNamespace(\"yourPackage\"))) so drake can watch your package’s functions for changes and rebuild downstream targets accordingly. If you load the package with devtools::load_all(), set the prework argument of make() (e.g. make(prework = \"devtools::load_all()\")) and set the packages argument explicitly so your package name is not included. (Everything in packages is loaded with library().) For a minimal example, see Tiernan Martin’s drakepkg. 7.6 Other tools drake enhances reproducibility, but not in all respects. Local library managers, containerization, and session management tools offer more robust solutions in their respective domains. Reproducibility encompasses a wide variety of tools and techniques all working together. Comprehensive overviews: PLOS article by Wilson et al. RStudio Conference 2019 presentation by Karthik Ram. rrtools by Ben Marwick. "],
["scripts.html", "Chapter 8 Script-based workflows 8.1 Function-oriented workflows 8.2 Traditional and legacy workflows 8.3 Overcoming Technical Debt 8.4 Dependencies 8.5 Building the connections 8.6 Run the workflow 8.7 Keeping the results up to date 8.8 Final thoughts", " Chapter 8 Script-based workflows 8.1 Function-oriented workflows drake works best when you write functions for data analysis. Functions break down complicated ideas into manageable pieces. # R/functions.R get_data <- function(file){ readxl::read_excel(file) } munge_data <- function(raw_data){ raw_data %>% mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))) } fit_model <- function(munged_data){ lm(Ozone ~ Wind + Temp, munged_data) } When we express computational steps as functions like get_data(), munge_data(), and fit_model(), we create special shorthand to make the rest of our code easier to read and understand. # R/plan.R plan <- drake_plan( raw_data = get_data(file_in("raw_data.xlsx")), munged_data = munge_data(raw_data), model = fit_model(munged_data) ) This function-oriented approach is elegant, powerful, testable, scalable, and maintainable. However, it can be challenging to convert pre-existing traditional script-based analyses to function-oriented drake-powered workflows. This chapter describes a stopgap to retrofit drake to existing projects. Custom functions are still better in the long run, but the following workaround is quick and painless, and it does not require you to change your original scripts. 8.2 Traditional and legacy workflows It is common to express data analysis tasks as numbered scripts. 01_data.R 02_munge.R 03_histogram.R 04_regression.R 05_report.R The numeric prefixes indicate the order in which these scripts need to run. # run_everything.R source("01_data.R") source("02_munge.R") source("03_histogram.R") source("04_regression.R") source("05_report.R") # Calls rmarkdown::render() on report.Rmd. 
8.3 Overcoming Technical Debt code_to_function() creates drake_plan()-ready functions from scripts like these. # R/functions.R load_data <- code_to_function("01_data.R") munge_data <- code_to_function("02_munge.R") make_histogram <- code_to_function("03_histogram.R") do_regression <- code_to_function("04_regression.R") generate_report <- code_to_function("05_report.R") Each function contains all the code from its corresponding script, along with a special final line to make sure we never return the same value twice. print(load_data) #> function (...) #> { #> raw_data <- readxl::read_excel("raw_data.xlsx") #> saveRDS(raw_data, "data/loaded_data.RDS") #> list(time = Sys.time(), tempfile = tempfile()) #> } 8.4 Dependencies drake pays close attention to dependencies. In drake, a target’s dependencies are the things it needs in order to build. Dependencies can include functions, files, and other targets upstream. Any time a dependency changes, the target is no longer valid. The make() function automatically detects when dependencies change, and it rebuilds the targets that need to rebuild. To leverage drake’s dependency-watching capabilities, we create a drake plan. This plan should include all the steps of the analysis, from loading the data to generating a report. To write the plan, we plug in the functions we created from code_to_function(). simple_plan <- drake_plan( data = load_data(), munged_data = munge_data(), hist = make_histogram(), fit = do_regression(), report = generate_report() ) It’s a start, but right now, drake has no idea which targets to run first and which need to wait for dependencies! In the following graph, there are no edges (arrows) connecting the targets! vis_drake_graph(simple_plan) 8.5 Building the connections Just as our original scripts had to run in a certain order, so do our targets now. We pass targets as function arguments to express this execution order. 
For example, when we write munged_data = munge_data(data), we are signaling to drake that the munged_data target depends on the function munge_data() and the target data. script_based_plan <- drake_plan( data = load_data(), munged_data = munge_data(data), hist = make_histogram(munged_data), fit = do_regression(munged_data), report = generate_report(hist, fit) ) vis_drake_graph(script_based_plan) 8.6 Run the workflow We can now run the workflow with the make() function. The first call to make() runs all the data analysis tasks we got from the scripts. make(script_based_plan) #> ▶ target data #> ▶ target munged_data #> ▶ target fit #> ▶ target hist #> ▶ target report 8.7 Keeping the results up to date Any time we change a script, we need to run code_to_function() again to keep our function up to date. drake notices when this function changes, and make() reruns the updated function and all the downstream targets that rely on its output. For example, let’s fine-tune our histogram. We open 03_histogram.R, change the binwidth argument, and call code_to_function(\"03_histogram.R\") all over again. # We need to rerun code_to_function() to tell drake that the script changed. make_histogram <- code_to_function("03_histogram.R") Targets hist and report depend on the code we modified, so drake marks those targets as outdated. outdated(script_based_plan) #> [1] "hist" "report" vis_drake_graph(script_based_plan, targets_only = TRUE) When you call make(), drake runs make_histogram() because the underlying script changed, and it runs generate_report() because the report depends on hist. make(script_based_plan) #> ▶ target hist #> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. #> ▶ target report All the targets are now up to date! 
vis_drake_graph(script_based_plan, targets_only = TRUE) 8.8 Final thoughts Countless data science workflows consist of numbered imperative scripts, and code_to_function() lets drake accommodate script-based projects too big and cumbersome to refactor. However, for new projects, we strongly recommend that you write functions. Functions help organize your thoughts, and they improve portability, readability, and compatibility with drake. For a deeper discussion of functions and their role in drake, consider watching the webinar recording of the 2019-09-23 rOpenSci Community Call. Even old projects are sometimes pliable enough to refactor into functions, especially with the new Rclean package. "],
["churn.html", "Chapter 9 Customer churn and deep learning 9.1 Packages 9.2 Functions 9.3 Plan 9.4 Dependency graph 9.5 Run the models 9.6 Inspect the results 9.7 Add models 9.8 Inspect the results again 9.9 Update your code 9.10 History and provenance", " Chapter 9 Customer churn and deep learning drake is designed for workflows with long runtimes, and a major use case is deep learning. This chapter demonstrates how to leverage drake to manage a deep learning workflow. The original example comes from a blog post by Matt Dancho, and the chapter’s content itself comes directly from this R notebook, part of an RStudio Solutions Engineering example demonstrating TensorFlow in R. The notebook is modified and redistributed under the terms of the Apache 2.0 license, copyright RStudio (details here). 9.1 Packages First, we load our packages into a fresh R session. library(drake) library(keras) library(tidyverse) library(rsample) library(recipes) library(yardstick) 9.2 Functions drake is R-focused and function-oriented. We create functions to preprocess the data, prepare_recipe <- function(data) { data %>% training() %>% recipe(Churn ~ .) 
%>% step_rm(customerID) %>% step_naomit(all_outcomes(), all_predictors()) %>% step_discretize(tenure, options = list(cuts = 6)) %>% step_log(TotalCharges) %>% step_mutate(Churn = ifelse(Churn == "Yes", 1, 0)) %>% step_dummy(all_nominal(), -all_outcomes()) %>% step_center(all_predictors(), -all_outcomes()) %>% step_scale(all_predictors(), -all_outcomes()) %>% prep() } define a keras model, exposing arguments to set the dimensionality and activation functions of the layers, define_model <- function(rec, units1, units2, act1, act2, act3) { input_shape <- ncol( juice(rec, all_predictors(), composition = "matrix") ) keras_model_sequential() %>% layer_dense( units = units1, kernel_initializer = "uniform", activation = act1, input_shape = input_shape ) %>% layer_dropout(rate = 0.1) %>% layer_dense( units = units2, kernel_initializer = "uniform", activation = act2 ) %>% layer_dropout(rate = 0.1) %>% layer_dense( units = 1, kernel_initializer = "uniform", activation = act3 ) } train a model, train_model <- function( rec, units1 = 16, units2 = 16, act1 = "relu", act2 = "relu", act3 = "sigmoid" ) { model <- define_model( rec = rec, units1 = units1, units2 = units2, act1 = act1, act2 = act2, act3 = act3 ) compile( model, optimizer = "adam", loss = "binary_crossentropy", metrics = c("accuracy") ) x_train_tbl <- juice( rec, all_predictors(), composition = "matrix" ) y_train_vec <- juice(rec, all_outcomes()) %>% pull() fit( object = model, x = x_train_tbl, y = y_train_vec, batch_size = 32, epochs = 32, validation_split = 0.3, verbose = 0 ) model } compare predictions against reality, confusion_matrix <- function(data, rec, model) { testing_data <- bake(rec, testing(data)) x_test_tbl <- testing_data %>% select(-Churn) %>% as.matrix() y_test_vec <- testing_data %>% select(Churn) %>% pull() yhat_keras_class_vec <- model %>% predict_classes(x_test_tbl) %>% as.factor() %>% fct_recode(yes = "1", no = "0") yhat_keras_prob_vec <- model %>% predict_proba(x_test_tbl) %>% as.vector() 
test_truth <- y_test_vec %>% as.factor() %>% fct_recode(yes = "1", no = "0") estimates_keras_tbl <- tibble( truth = test_truth, estimate = yhat_keras_class_vec, class_prob = yhat_keras_prob_vec ) estimates_keras_tbl %>% conf_mat(truth, estimate) } and compare the performance of multiple models. compare_models <- function(...) { name <- match.call()[-1] %>% as.character() df <- map_df(list(...), summary) %>% filter(.metric %in% c("accuracy", "sens", "spec")) %>% mutate(name = rep(name, each = n() / length(name))) %>% rename(metric = .metric, estimate = .estimate) ggplot(df) + geom_line(aes(x = metric, y = estimate, color = name, group = name)) + theme_gray(24) } 9.3 Plan Next, we define our workflow in a drake plan. We will prepare the data, train different models with different activation functions, and compare the models in terms of performance. activations <- c("relu", "sigmoid") plan <- drake_plan( data = read_csv(file_in("customer_churn.csv"), col_types = cols()) %>% initial_split(prop = 0.3), rec = prepare_recipe(data), model = target( train_model(rec, act1 = act), format = "keras", # Supported in drake > 7.5.2 to store models properly. transform = map(act = !!activations) ), conf = target( confusion_matrix(data, rec, model), transform = map(model, .id = act) ), metrics = target( compare_models(conf), transform = combine(conf) ) ) The plan is a data frame with the steps we are going to do. 
plan #> # A tibble: 7 x 3 #> target command format #> <chr> <expr_lst> <chr> #> 1 conf_relu confusion_matrix(data, rec, model_relu) … <NA> #> 2 conf_sigmoid confusion_matrix(data, rec, model_sigmoid) … <NA> #> 3 data read_csv(file_in("customer_churn.csv"), col_types = cols(… <NA> #> 4 metrics compare_models(conf_relu, conf_sigmoid) … <NA> #> 5 model_relu train_model(rec, act1 = "relu") … keras #> 6 model_sigmo… train_model(rec, act1 = "sigmoid") … keras #> 7 rec prepare_recipe(data) … <NA> 9.4 Dependency graph The graph visualizes the dependency relationships among the steps of the workflow. vis_drake_graph(plan) 9.5 Run the models Call make() to actually run the workflow. make(plan) #> ▶ target data #> ▶ target rec #> ▶ target model_relu #> ▶ target model_sigmoid #> ▶ target conf_relu #> ▶ target conf_sigmoid #> ▶ target metrics 9.6 Inspect the results The two models performed about the same. readd(metrics) # see also loadd() 9.7 Add models Let’s try the softmax activation function. activations <- c("relu", "sigmoid", "softmax") plan <- drake_plan( data = read_csv(file_in("customer_churn.csv"), col_types = cols()) %>% initial_split(prop = 0.3), rec = prepare_recipe(data), model = target( train_model(rec, act1 = act), format = "keras", # Supported in drake > 7.5.2 to store models properly. transform = map(act = !!activations) ), conf = target( confusion_matrix(data, rec, model), transform = map(model, .id = act) ), metrics = target( compare_models(conf), transform = combine(conf) ) ) vis_drake_graph(plan) # see also outdated() and predict_runtime() make() skips the relu and sigmoid models because they are already up to date. (Their dependencies did not change.) Only the softmax model needs to run. make(plan) #> ▶ target model_softmax #> ▶ target conf_softmax #> ▶ target metrics 9.8 Inspect the results again readd(metrics) # see also loadd() 9.9 Update your code If you change upstream functions, even nested ones, drake automatically refits the affected models. 
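A rough intuition for how edits to functions, including nested ones, can be detected is sketched below in base R. This is a simplified illustration and not drake's actual algorithm: it assumes a function can be fingerprinted by the text of its body and that nested dependencies can be found by scanning the body for names. The helpers fingerprint() and calls_function() and the toy functions are hypothetical.

```r
# Two versions of a hypothetical inner function that differ in one constant.
define_old <- function(x) x * 0.1
define_new <- function(x) x * 0.15

# A hypothetical outer function that calls the inner one by name.
train <- function(x) define_model(x) + 1

# Fingerprint a function by the text of its body.
fingerprint <- function(f) paste(deparse(body(f)), collapse = "\n")

# Check whether a function's body mentions a given name.
calls_function <- function(f, name) name %in% all.names(body(f))

changed <- !identical(fingerprint(define_old), fingerprint(define_new))
nested <- calls_function(train, "define_model")

changed  # TRUE: the edited constant changes the fingerprint
nested   # TRUE: train() mentions define_model(), so it is affected too
```

In this sketch, a target whose command calls train() would be considered out of date after the edit, because train() refers to the function whose fingerprint changed.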
Let’s increase dropout in both layers. define_model <- function(rec, units1, units2, act1, act2, act3) { input_shape <- ncol( juice(rec, all_predictors(), composition = "matrix") ) keras_model_sequential() %>% layer_dense( units = units1, kernel_initializer = "uniform", activation = act1, input_shape = input_shape ) %>% layer_dropout(rate = 0.15) %>% # Changed from 0.1 to 0.15. layer_dense( units = units2, kernel_initializer = "uniform", activation = act2 ) %>% layer_dropout(rate = 0.15) %>% # Changed from 0.1 to 0.15. layer_dense( units = 1, kernel_initializer = "uniform", activation = act3 ) } All the models and downstream results are affected. make(plan) #> ▶ target model_relu #> ▶ target model_softmax #> ▶ target model_sigmoid #> ▶ target conf_relu #> ▶ target conf_softmax #> ▶ target conf_sigmoid #> ▶ target metrics 9.10 History and provenance drake version 7.5.0 and above tracks history and provenance. You can see which models you ran, when you ran them, how long they took, and which settings you tried (i.e. named arguments to function calls in your commands). 
history <- drake_history() history #> # A tibble: 17 x 10 #> target current built exists hash command seed runtime prop act1 #> <chr> <lgl> <chr> <lgl> <chr> <chr> <int> <dbl> <dbl> <chr> #> 1 conf_r… FALSE 2020-07… TRUE 3384… "confusion_… 4.05e8 0.451 NA <NA> #> 2 conf_r… TRUE 2020-07… TRUE 7c60… "confusion_… 4.05e8 0.489 NA <NA> #> 3 conf_s… FALSE 2020-07… TRUE a652… "confusion_… 1.93e9 0.453 NA <NA> #> 4 conf_s… TRUE 2020-07… TRUE cfa8… "confusion_… 1.93e9 0.411 NA <NA> #> 5 conf_s… FALSE 2020-07… TRUE c901… "confusion_… 1.80e9 0.735 NA <NA> #> 6 conf_s… TRUE 2020-07… TRUE 889c… "confusion_… 1.80e9 0.667 NA <NA> #> 7 data TRUE 2020-07… TRUE ca84… "read_csv(f… 1.29e9 0.062 0.3 <NA> #> 8 metrics FALSE 2020-07… TRUE 066c… "compare_mo… 1.21e9 0.057 NA <NA> #> 9 metrics FALSE 2020-07… TRUE cbce… "compare_mo… 1.21e9 0.0740 NA <NA> #> 10 metrics TRUE 2020-07… TRUE 552c… "compare_mo… 1.21e9 0.0610 NA <NA> #> 11 model_… FALSE 2020-07… TRUE c2d8… "train_mode… 1.47e9 19.0 NA relu #> 12 model_… TRUE 2020-07… TRUE e21d… "train_mode… 1.47e9 4.40 NA relu #> 13 model_… FALSE 2020-07… TRUE 8752… "train_mode… 1.26e9 4.46 NA sigm… #> 14 model_… TRUE 2020-07… TRUE c435… "train_mode… 1.26e9 4.29 NA sigm… #> 15 model_… FALSE 2020-07… TRUE 1c94… "train_mode… 8.05e8 4.62 NA soft… #> 16 model_… TRUE 2020-07… TRUE 0b11… "train_mode… 8.05e8 4.44 NA soft… #> 17 rec TRUE 2020-07… TRUE 4a87… "prepare_re… 6.29e8 0.299 NA <NA> And as long as you did not run clean(garbage_collection = TRUE), you can get the old data back. Let’s find the oldest run of the relu model. 
hash <- history %>% filter(act1 == "relu") %>% pull(hash) %>% head(n = 1) drake_cache()$get_value(hash) #> Model #> Model: "sequential" #> ________________________________________________________________________________ #> Layer (type) Output Shape Param # #> ================================================================================ #> dense (Dense) (None, 16) 576 #> ________________________________________________________________________________ #> dropout (Dropout) (None, 16) 0 #> ________________________________________________________________________________ #> dense_1 (Dense) (None, 16) 272 #> ________________________________________________________________________________ #> dropout_1 (Dropout) (None, 16) 0 #> ________________________________________________________________________________ #> dense_2 (Dense) (None, 1) 17 #> ================================================================================ #> Total params: 865 #> Trainable params: 865 #> Non-trainable params: 0 #> ________________________________________________________________________________ "],
["stan.html", "Chapter 10 Validating a small hierarchical model with Stan 10.1 The drake project 10.2 Functions 10.3 Plan 10.4 Try it out!", " Chapter 10 Validating a small hierarchical model with Stan The goal of this example workflow is to validate a small Bayesian hierarchical model. y_i ~ iid Normal(alpha + x_i * beta, sigma^2) alpha ~ Normal(0, 1) beta ~ Normal(0, 1) sigma ~ Uniform(0, 1) We simulate multiple datasets from the model and fit the model on each dataset. For each model fit, we determine if the 50% credible interval of the regression coefficient beta contains the true value of beta used to generate the data. If we implemented the model correctly, roughly 50% of the models should recapture the true beta in 50% credible intervals. 10.1 The drake project Because of the long computation time involved, this chapter of the manual does not actually run the analysis code. The complete code can be found at https://github.com/wlandau/drake-examples/tree/master/stan and downloaded with drake::drake_example(\"stan\"), and we encourage you to try out the code yourself. This chapter serves to walk through the functions and plan and explain the overall thought process. The file structure is that of a typical drake project with some additions to allow optional high-performance computing on a cluster. ├── run.sh ├── run.R ├── _drake.R ├── sge.tmpl ├── R/ ├──── packages.R ├──── functions.R ├──── plan.R ├── stan/ ├──── model.stan └── report.Rmd File Purpose run.sh Shell script to run run.R in a persistent background process. Works on Unix-like systems. Helpful for long computations on servers. run.R R script to run r_make(). _drake.R The special R script that powers functions r_make() and friends (details here). sge.tmpl A clustermq template file to deploy targets in parallel to a Sun Grid Engine cluster. R/packages.R A custom R script loading the packages we need. R/functions.R A custom R script with user-defined functions. 
R/plan.R A custom R script that defines the drake plan. stan/model.stan The specification of our Stan model. report.Rmd An R Markdown report summarizing the results of the analysis. The following sections walk through the functions and plan. 10.2 Functions Good functions have meaningful inputs and outputs that are easy to generate. For data analysis, good inputs and outputs are typically datasets, models, and summaries of fitted models. The functions below for our Stan workflow follow this pattern. First, we need a function to compile the model. It accepts a Stan model specification file (a *.stan text file) and returns the paths to both the model file and the compiled RDS file. (We need to set rstan_options(auto_write = TRUE) to make sure stan_model() generates the RDS file.) We return the file paths because the target that uses this function will be a dynamic file target. drake will reproducibly watch these files for changes and automatically recompile and run the model if changes are detected. compile_model <- function(model_file) { rstan_options(auto_write = TRUE) stan_model(model_file) c(model_file, path_ext_set(model_file, "rds")) } Next, we need a function to simulate a dataset from the hierarchical model. simulate_data <- function() { alpha <- rnorm(1, 0, 1) beta <- rnorm(1, 0, 1) sigma <- runif(1, 0, 1) x <- rbinom(100, 1, 0.5) y <- rnorm(100, alpha + x * beta, sigma) tibble(x = x, y = y, beta_true = beta) } Lastly, we write a function to fit the compiled model to a simulated dataset. We pass in the *.stan model specification file, but rstan is smart enough to use the compiled RDS model file instead if available. In Bayesian data analysis workflows with many runs of the same model, we need to make a conscious effort to conserve computing resources. That means we should not save all the posterior samples from every single model fit. 
Instead, we compute summary statistics on the chains such as posterior quantiles, coverage in credible intervals, and convergence diagnostics. fit_model <- function(model_file, data) { output <- stan( file = model_file, data = list(x = data$x, y = data$y, n = nrow(data)), refresh = 0 ) mcmc_list <- As.mcmc.list(output) samples <- as.data.frame(as.matrix(mcmc_list)) beta_25 <- quantile(samples$beta, 0.25) beta_median <- quantile(samples$beta, 0.5) beta_75 <- quantile(samples$beta, 0.75) beta_true <- data$beta_true[1] beta_cover <- beta_25 < beta_true && beta_true < beta_75 psrf <- max(gelman.diag(mcmc_list, multivariate = FALSE)$psrf[, 1]) ess <- min(effectiveSize(mcmc_list)) tibble( beta_cover = beta_cover, beta_true = beta_true, beta_25 = beta_25, beta_median = beta_median, beta_75 = beta_75, psrf = psrf, ess = ess ) } 10.3 Plan Our drake plan is defined in the R/plan.R script. plan <- drake_plan( model_files = target( compile_model("stan/model.stan"), format = "file", hpc = FALSE ), index = target( seq_len(10), # Change the number of simulations here. hpc = FALSE ), data = target( simulate_data(), dynamic = map(index), format = "fst_tbl" ), fit = target( fit_model(model_files[1], data), dynamic = map(data), format = "fst_tbl" ), report = target( render( knitr_in("report.Rmd"), output_file = file_out("report.html"), quiet = TRUE ), hpc = FALSE ) ) The following subsections describe the strategy and practical adjustments behind each target. 10.3.1 Model The model_files target is a dynamic file target to reproducibly track our Stan model specification file (stan/model.stan) and compiled model file (stan/model.rds). Below, format = \"file\" indicates that the target is a dynamic file target, and hpc = FALSE tells drake not to run the target on a parallel worker in high-performance computing scenarios. 
model_files = target( compile_model("stan/model.stan"), format = "file", hpc = FALSE ), 10.3.2 Index The index target is simply a numeric vector from 1 to the number of simulations. To fit our model multiple times, we are going to dynamically map over index. This is a small target and we do not want to waste expensive computing resources on it, so we set hpc = FALSE. index = target( seq_len(1000), # Change the number of simulations here. hpc = FALSE ) 10.3.3 Data data is a dynamic target with one sub-target per simulated dataset, so we write dynamic = map(index) below. In addition, these datasets are data frames, so we choose format = \"fst_tbl\" below to increase read/write speeds and conserve storage space. Read here for more on specialized storage formats. data = target( simulate_data(), dynamic = map(index), format = "fst_tbl" ) 10.3.4 Fit We want to fit our model once for each simulated dataset, so our fit target dynamically maps over the datasets with dynamic = map(data). We also pass in the path to the stan/model.stan specification file, but rstan is smart enough to use the compiled model stan/model.rds instead if available. And since fit_model() returns a data frame, we also choose format = \"fst_tbl\" here. fit = target( fit_model(model_files[1], data), dynamic = map(data), format = "fst_tbl" ) 10.3.5 Report R Markdown reports should never do any heavy lifting in drake pipelines. They should simply leverage the computationally expensive work done in the previous targets. If we follow this good practice and our report renders quickly, we should not need heavy computing resources to process it, and we can set hpc = FALSE below. The report.Rmd file itself has loadd() and readd() statements to refer to these targets, and with the knitr_in() keyword below, drake knows that it needs to update the report when the models or datasets change. Similarly, file_out(\"report.html\") tells drake to rerun the report if the output file gets corrupted. 
report = target( render( knitr_in("report.Rmd"), output_file = file_out("report.html"), quiet = TRUE ), hpc = FALSE ) 10.4 Try it out! The complete code can be found at https://github.com/wlandau/drake-examples/tree/master/stan and downloaded with drake::drake_example(\"stan\"), and we encourage you to try out the code yourself. "],
["packages.html", "Chapter 11 An analysis of R package download trends 11.1 Get the code. 11.2 Overview 11.3 Analysis 11.4 Other ways to trigger downloads", " Chapter 11 An analysis of R package download trends This chapter explores R package download trends using the cranlogs package, and it shows how drake’s custom triggers can help with workflows with remote data sources. 11.1 Get the code. Write the code files to your workspace. drake_example("packages") The new packages folder now includes a file structure of a serious drake project, plus an interactive-tutorial.R to narrate the example. The code is also online here. 11.2 Overview This small data analysis project explores some trends in R package downloads over time. The datasets are downloaded using the cranlogs package. library(cranlogs) cran_downloads(packages = "dplyr", when = "last-week") Above, each count is the number of times dplyr was downloaded from the RStudio CRAN mirror on the given day. To stay up to date with the latest download statistics, we need to refresh the data frequently. With drake, we can bring all our work up to date without restarting everything from scratch. 11.3 Analysis First, we load the required packages. drake detects the packages you install and load. library(cranlogs) library(drake) library(dplyr) library(ggplot2) library(knitr) library(rvest) We will want custom functions to summarize the CRAN logs we download. make_my_table <- function(downloads){ group_by(downloads, package) %>% summarize(mean_downloads = mean(count)) } make_my_plot <- function(downloads){ ggplot(downloads) + geom_line(aes(x = date, y = count, group = package, color = package)) } Next, we generate the plan. We want to explore the daily downloads from the knitr, Rcpp, and ggplot2 packages. We will use the cranlogs package to get daily logs of package downloads from RStudio’s CRAN mirror. In our drake_plan(), we declare targets older and recent to contain snapshots of the logs. 
The drake_plan() syntax below is described here and is supported in drake 7.0.0 and above. plan <- drake_plan( older = cran_downloads( packages = c("knitr", "Rcpp", "ggplot2"), from = "2016-11-01", to = "2016-12-01" ), recent = target( command = cran_downloads( packages = c("knitr", "Rcpp", "ggplot2"), when = "last-month" ), trigger = trigger(change = latest_log_date()) ), averages = target( make_my_table(data), transform = map(data = c(older, recent)) ), plot = target( make_my_plot(data), transform = map(data) ), report = knit( knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE ) ) Notice the custom trigger for the target recent. Here, we are telling drake to rebuild recent whenever a new day’s log is uploaded to http://cran-logs.rstudio.com. In other words, drake keeps track of the return value of latest_log_date() and recomputes recent (during make()) if that value changed since the last make(). Here, latest_log_date() is one of our custom imported functions. We use it to scrape http://cran-logs.rstudio.com using the rvest package. latest_log_date <- function(){ read_html("http://cran-logs.rstudio.com/") %>% html_nodes("li:last-of-type") %>% html_nodes("a:last-of-type") %>% html_text() %>% max } Now, we run the project to download the data and analyze it. The results will be summarized in the knitted report, report.md, but you can also read the results directly from the cache. make(plan) readd(averages_recent) readd(averages_older) readd(plot_recent) readd(plot_older) If we run make() again right away, we see that everything is up to date. But if we wait until a new day’s log is uploaded, make() will update recent and everything that depends on it. make(plan) To visualize the build behavior, you can plot the dependency network. vis_drake_graph(plan) 11.4 Other ways to trigger downloads Sometimes, our remote data sources get revised, and web scraping may not be the best way to detect changes. 
We may want to look at our remote dataset’s modification time or HTTP ETag. To see how this works, consider the CRAN log file from February 9, 2018. url <- "http://cran-logs.rstudio.com/2018/2018-02-09-r.csv.gz" We can track the modification date using the httr package. library(httr) # For querying websites. HEAD(url)$headers[["last-modified"]] In our drake plan, we can track this timestamp and trigger a download whenever it changes. plan <- drake_plan( logs = target( get_logs(url), trigger = trigger(change = HEAD(url)$headers[["last-modified"]]) ) ) plan where library(R.utils) # For unzipping the files we download. library(curl) # For downloading data. get_logs <- function(url){ curl_download(url, "logs.csv.gz") # Get a big file. gunzip("logs.csv.gz", overwrite = TRUE) # Unzip it. out <- read.csv("logs.csv", nrows = 4) # Extract the data you need. unlink(c("logs.csv.gz", "logs.csv")) # Remove the big files out # Value of the target. } When we are ready, we run the workflow. make(plan) readd(logs) If the log file at the url ever changes, the timestamp will update remotely, and make() will download the file again. "],
["gsp.html", "Chapter 12 Finding the best model of gross state product 12.1 Get the code. 12.2 Objective and methods 12.3 Data 12.4 Analysis 12.5 Results 12.6 Comparison with GNU Make 12.7 References", " Chapter 12 Finding the best model of gross state product The following data analysis workflow shows off drake’s ability to generate lots of reproducibly-tracked tasks with ease. The same technique would be cumbersome, even intractable, with GNU Make. 12.1 Get the code. Write the code files to your workspace. drake_example("gsp") The new gsp folder now includes a file structure of a serious drake project, plus an interactive-tutorial.R to narrate the example. The code is also online here. 12.2 Objective and methods The goal is to search for factors closely associated with the productivity of states in the USA around the 1970s and 1980s. For the sake of simplicity, we use gross state product as a metric of productivity, and we restrict ourselves to multiple linear regression models with three variables. For each of the 84 possible models, we fit the data and then evaluate the root mean squared prediction error (RMSPE). \\[ \\begin{aligned} \\text{RMSPE} = \\sqrt{\\frac{1}{n}(y - \\widehat{y})^T(y - \\widehat{y})} \\end{aligned} \\] Here, \\(y\\) is the vector of observed gross state products in the data, \\(\\widehat{y}\\) is the vector of predicted gross state products under one of the models, and \\(n\\) is the number of observations. We take the best variables to be the triplet in the model with the lowest RMSPE. 12.3 Data The Produc dataset from the Ecdat package contains data on the Gross State Product from 1970 to 1986. Each row is a single observation on a single state for a single year. The dataset has the following variables as columns. See the references later in this report for more details. gsp: gross state product. state: the state. year: the year. pcap: private capital stock. hwy: highway and streets. water: water and sewer facilities. util: other public buildings and structures. 
pc: public capital. emp: labor input measured by the employment in non-agricultural payrolls. unemp: state unemployment rate. library(Ecdat) data(Produc) head(Produc) 12.4 Analysis First, we load the required packages. drake is aware of all the packages you load with library() or require(). library(biglm) # lightweight models, easier to store than with lm() library(drake) library(Ecdat) # econometrics datasets library(ggplot2) library(knitr) library(purrr) library(tidyverse) Next, we construct our plan. The following code uses drake’s special new language for generating plans (learn more here). predictors <- setdiff(colnames(Produc), "gsp") # We will try all combinations of three covariates. combos <- combn(predictors, 3) %>% t() %>% as.data.frame(stringsAsFactors = FALSE) %>% setNames(c("x1", "x2", "x3")) head(combos) # We need to list each covariate as a symbol. for (col in colnames(combos)) { combos[[col]] <- rlang::syms(combos[[col]]) } # Requires drake >= 7.0.0 or the development version # at github.com/ropensci/drake. # Install with remotes::install_github("ropensci/drake"). plan <- drake_plan( model = target( biglm(gsp ~ x1 + x2 + x3, data = Ecdat::Produc), transform = map(.data = !!combos) # Remember the bang-bang!! ), rmspe_i = target( get_rmspe(model, Ecdat::Produc), transform = map(model) ), rmspe = target( bind_rows(rmspe_i, .id = "model"), transform = combine(rmspe_i) ), plot = ggsave( filename = file_out("rmspe.pdf"), plot = plot_rmspe(rmspe), width = 8, height = 8 ), report = knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE) ) plan We also need to define functions for summaries and plots. 
get_rmspe <- function(model_fit, data){ y <- data$gsp yhat <- as.numeric(predict(model_fit, newdata = data)) terms <- attr(model_fit$terms, "term.labels") tibble( rmspe = sqrt(mean((y - yhat)^2)), # nolint X1 = terms[1], X2 = terms[2], X3 = terms[3] ) } plot_rmspe <- function(rmspe){ ggplot(rmspe) + geom_histogram(aes(x = rmspe), bins = 15) } We have a report.Rmd file to summarize our results at the end. drake_example("gsp") file.copy(from = "gsp/report.Rmd", to = ".", overwrite = TRUE) We can inspect the project before we run it. vis_drake_graph(plan) Now, we can run the project. make(plan, verbose = 0L) 12.5 Results Here are the root mean squared prediction errors of all the models. results <- readd(rmspe) library(ggplot2) plot_rmspe(rmspe = results) And here are the best models. The best variables are in the top row under X1, X2, and X3. head(results[order(results$rmspe, decreasing = FALSE), ]) 12.6 Comparison with GNU Make If we were using Make instead of drake with the same set of targets, the analogous Makefile would look something like this pseudo-code sketch. models = model_state_year_pcap.rds model_state_year_hwy.rds ... # 84 of these model_% Rscript -e 'saveRDS(lm(...), ...)' rmspe_%: model_% Rscript -e 'saveRDS(get_rmspe(...), ...)' rmspe.rds: rmspe_% Rscript -e 'saveRDS(dplyr::bind_rows(...), ...)' rmspe.pdf: rmspe.rds Rscript -e 'ggplot2::ggsave(plot_rmspe(readRDS(\"rmspe.rds\")), \"rmspe.pdf\")' report.md: report.Rmd Rscript -e 'knitr::knit(\"report.Rmd\")' There are three main disadvantages to this approach. Every target requires a new call to Rscript, which means that more time is spent initializing R sessions than doing the actual work. The user must micromanage nearly one hundred output files (in this case, *.rds files), which is cumbersome, messy, and inconvenient. drake, on the other hand, automatically manages storage using a storr cache. 
The user needs to write the names of the 84 models near the top of the Makefile, which is less convenient than maintaining a data frame in R. 12.7 References Baltagi, Badi H (2003). Econometric analysis of panel data, John Wiley and sons, http://www.wiley.com/legacy/wileychi/baltagi/. Baltagi, B. H. and N. Pinnoi (1995). “Public capital stock and state productivity growth: further evidence”, Empirical Economics, 20, 351-359. Munnell, A. (1990). “Why has productivity growth declined? Productivity and public investment”, New England Economic Review, 3-22. Yves Croissant (2016). Ecdat: Data Sets for Econometrics. R package version 0.3-1. https://CRAN.R-project.org/package=Ecdat. "],
["hpc.html", "Chapter 13 High-performance computing 13.1 Start small 13.2 Let make() schedule your targets. 13.3 The master process 13.4 Parallel backends 13.5 The clustermq backend 13.6 The future backend 13.7 Advanced options", " Chapter 13 High-performance computing This chapter provides guidance on time-consuming drake workflows and high-level parallel computation. 13.1 Start small Before you jump into high-performance computing with a large workflow, consider running a downsized version to debug and test things first. That way, you can avoid consuming lots of computing resources until you are reasonably sure everything works. Create a test plan with drake_plan(max_expand = SMALL_NUMBER) before scaling up to the full set of targets, and take temporary shortcuts in your commands so your targets build more quickly for test mode. See this section on plans for details. 13.2 Let make() schedule your targets. When it comes time to activate high-performance computing, drake launches its own parallel workers and sends targets to those workers. The workers can be local processes or jobs on a cluster. drake uses your project’s implicit dependency graph to figure out which targets can run in parallel and which ones need to wait for dependencies. load_mtcars_example() # from https://github.com/wlandau/drake-examples/tree/master/mtcars vis_drake_graph(my_plan) You do not need to micromanage how targets are scheduled, and you do not need to run simultaneous instances of make(). 13.3 The master process make() takes care of the jobs it launches, but make() itself is a job too, and it is your responsibility to manage it. 13.3.1 Master on a cluster Most clusters will let you submit make() as a job on a compute node. Let’s consider the Sun Grid Engine (SGE) as an example. First, we create a script that calls make() (or r_make()). 
# make.R source("R/packages.R") source("R/functions.R") source("R/plan.R") options( clustermq.scheduler = "sge", # Created by drake_hpc_template_file("sge_clustermq.tmpl"): clustermq.template = "sge_clustermq.tmpl" ) make( plan, parallelism = "clustermq", jobs = 8, console_log_file = "drake.log" ) Then, we create a shell script (say, run.sh) to call make.R. This script may look different if you use a different scheduler such as SLURM. # run.sh #!/bin/bash #$ -j y # combine stdout/error in one file #$ -o log.out # output file #$ -cwd # use pwd as work dir #$ -V # use environment variable module load R # Uncomment if R is an environment module. R --no-save CMD BATCH make.R Finally, to run the whole workflow, we call qsub. qsub run.sh And here is what happens: A new job starts on the cluster with the configuration flags next to #$ in run.sh. run.sh opens R and runs make.R. make.R invokes drake using the make() function. make() launches 8 new jobs on the cluster. So 9 simultaneous jobs run on the cluster, and we avoid bothering the headnode / login node. 13.3.2 Local master Alternatively, you can run make() in a persistent background process. The following should work in the Mac/Linux terminal/shell. nohup nice -19 R --no-save CMD BATCH make.R & where: nohup: Keep the job running even if you log out of the machine. nice -19: This is a low-priority job that should not consume many resources. Other processes should take priority. R CMD BATCH: Run the R script in a fresh new R session. --no-save: do not save the workspace in a .RData file. &: Run this job in the background so you can do other stuff in the terminal window. Alternatives to nohup include screen and Byobu. 13.4 Parallel backends Choose the parallel backend with the parallelism argument and set the jobs argument to scale the work appropriately. make(my_plan, parallelism = "future", jobs = 2) The two primary backends with long term support are clustermq and future. 
If you can install ZeroMQ, the best choice is usually clustermq. (It is faster than future.) However, future is more accessible: it does not require ZeroMQ, it supports parallel computing on Windows, it can work with more restrictive wall time limits on clusters, and it can deploy targets to Docker images (drake_example(\"Docker-psock\")). 13.5 The clustermq backend 13.5.1 Persistent workers make(parallelism = \"clustermq\", jobs = 2) launches 2 parallel persistent workers. The master process assigns targets to workers, and the workers simultaneously traverse the dependency graph. 13.5.2 Installation Persistent workers require the clustermq R package, which in turn requires ZeroMQ. Please refer to the clustermq installation guide for specific instructions. 13.5.3 On your local machine To run your targets in parallel over the cores of your local machine, set the global option below and run make(). options(clustermq.scheduler = "multicore") make(plan, parallelism = "clustermq", jobs = 2) 13.5.4 On a cluster Set the clustermq global options to register your computing resources. For SLURM: options(clustermq.scheduler = "slurm", clustermq.template = "slurm_clustermq.tmpl") Here, slurm_clustermq.tmpl is a template file with configuration details. Use drake_hpc_template_file() to write one of the available examples. drake_hpc_template_file("slurm_clustermq.tmpl") # Write the file slurm_clustermq.tmpl. After modifying slurm_clustermq.tmpl by hand to meet your needs, call make() as usual. make(plan, parallelism = "clustermq", jobs = 4) 13.6 The future backend 13.6.1 Transient workers make(parallelism = \"future\", jobs = 2) launches transient workers to build your targets. When a target is ready to build, the master process creates a fresh worker to build it, and the worker terminates when the target is done. jobs = 2 means that at most 2 transient workers are allowed to run at a given time. 13.6.2 Installation Install the future package. 
install.packages("future") # CRAN release # Alternatively, install the GitHub development version. devtools::install_github("HenrikBengtsson/future", ref = "develop") If you intend to use a cluster, be sure to install the future.batchtools package too. The future ecosystem contains even more packages that extend future’s parallel computing functionality, such as future.callr. 13.6.3 On your local machine First, select a future plan to tell future how to create the workers. See this table for descriptions of the core options. future::plan(future::multiprocess) Next, run make(). make(plan, parallelism = "future", jobs = 2) 13.6.4 On a cluster Install the future.batchtools package and use this list to select a future plan that matches your resources. You will also need a compatible template file with configuration details. As with clustermq, drake can generate some examples: drake_hpc_template_file("slurm_batchtools.tmpl") # Edit by hand. Next, register the template file with a plan. library(future.batchtools) future::plan(batchtools_slurm, template = "slurm_batchtools.tmpl") Finally, run make(). make(plan, parallelism = "future", jobs = 2) 13.7 Advanced options 13.7.1 Selectivity Some targets build so quickly that it is not worth sending them to parallel workers. To run these targets locally in the master process, define a special hpc column of your drake plan. Below, NA and TRUE are treated the same, and make(plan, parallelism = \"clustermq\") only sends model_1 and model_2 to parallel workers. drake_plan( model = target( crazy_long_computation(index), transform = map(index = c(1, 2)) ), accuracy = target( summarize_accuracy(model), transform = combine(model), hpc = FALSE ), specificity = target( summarize_specificity(model), transform = combine(model), hpc = FALSE ), report = target( render(knitr_in("results.Rmd"), output_file = file_out("results.html")), hpc = FALSE ) ) 13.7.2 Memory options By default, make() keeps targets in memory during runtime. 
Some targets are dependencies of other targets downstream, while others may no longer need to be in memory. The memory_strategy argument to make() allows you to choose the tradeoff that best suits your project. Options: \"speed\": Once a target is loaded in memory, just keep it there. This choice maximizes speed and hogs memory. \"memory\": Just before building each new target, unload everything from memory except the target’s direct dependencies. This option conserves memory, but it sacrifices speed because each new target needs to reload any previously unloaded targets from storage. \"lookahead\": Just before building each new target, search the dependency graph to find targets that will not be needed for the rest of the current make() session. In this mode, targets are only in memory if they need to be loaded, and we avoid superfluous reads from the cache. However, searching the graph takes time, and it could even double the computational overhead for large projects. 13.7.3 Storage options In make(caching = \"master\"), the workers send the targets to the master process, and the master process stores them one by one in the cache. caching = \"master\" is compatible with all storr cache formats, including the more esoteric ones like storr_dbi() and storr_environment(). In make(caching = \"worker\"), the parallel workers are responsible for writing the targets to the cache. Some output-heavy projects can benefit from this form of parallelism. However, it can sometimes add slowness on clusters due to lag from network file systems. And there are additional restrictions: All the workers must have the same file system and the same working directory as the master process. Only the default storr_rds() cache may be used. Other formats like storr_dbi() and storr_environment() cannot accommodate parallel cache operations. See the storage chapter for details. 
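As a rough illustration of the memory and storage options above, the sketch below passes memory_strategy and caching to make() for a toy plan. The plan and its target names (data, result) are hypothetical and exist only for this example. The commented-out call assumes a working clustermq setup and is not run here.

```r
library(drake)

# Hypothetical toy plan, for illustration only.
plan <- drake_plan(
  data = rnorm(100),
  result = mean(data)
)

# Unload targets that are no longer needed in this make() session,
# and let the master process write every target to the cache.
make(plan, memory_strategy = "lookahead", caching = "master")

# With a parallel backend, the workers could write to the cache instead.
# This assumes the default storr_rds() cache and a shared file system:
# make(plan, parallelism = "clustermq", jobs = 2, caching = "worker")

# Read the stored value back from the cache.
result <- readd(result)
```

For a plan this small, the choice makes no practical difference. The arguments start to matter when targets are large or when many workers write output at once.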
13.7.4 The template argument for persistent workers For more control and flexibility in the clustermq backend, you can parameterize your template file and use the template argument of make(). For example, suppose you want to programmatically set the number of “slots” (basically cores) per job on an SGE system (clustermq guide to SGE setup here). Begin with a parameterized template file sge_clustermq.tmpl with a custom n_slots placeholder. # File: sge_clustermq.tmpl # Modified from https://github.com/mschubert/clustermq/wiki/SGE #$ -N {{ job_name }} # job name #$ -t 1-{{ n_jobs }} # submit jobs as array #$ -j y # combine stdout/error in one file #$ -o {{ log_file | /dev/null }} # output file #$ -cwd # use pwd as work dir #$ -V # use environment variable #$ -pe smp {{ n_slots | 1 }} # request n_slots cores per job module load R ulimit -v $(( 1024 * {{ memory | 4096 }} )) CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")' Then when you run make(), use the template argument to set n_slots. options(clustermq.scheduler = "sge", clustermq.template = "sge_clustermq.tmpl") library(drake) load_mtcars_example() make( my_plan, parallelism = "clustermq", jobs = 16, template = list(n_slots = 4) # Request 4 cores per persistent worker. ) Custom placeholders like n_slots are processed with the infuser package. 13.7.5 The resources column for transient workers Different targets may need different resources. For example, plan <- drake_plan( data = download_data(), model = big_machine_learning_model(data) ) The model needs a GPU and multiple CPU cores, and the data only needs the bare minimum resources. Declare these requirements with target(), as below. This is equivalent to adding a new list column to the plan, where each element is a named list for the resources argument of future::future(). 
plan <- drake_plan( data = target( download_data(), resources = list(cores = 1, gpus = 0) ), model = target( big_machine_learning_model(data), resources = list(cores = 4, gpus = 1) ) ) plan str(plan$resources) Next, plug the names of your resources into the brew patterns of your batchtools template file. The following sge_batchtools.tmpl file shows how to do it, but the file itself probably requires modification before it will work with your own machine. #!/bin/bash #$ -cwd #$ -j y #$ -o <%= log.file %> #$ -V #$ -N <%= job.name %> #$ -pe smp <%= resources[["cores"]] %> # CPU cores #$ -l gpu=<%= resources[["gpus"]] %> # GPUs. Rscript -e 'batchtools::doJobCollection("<%= uri %>")' exit 0 Finally, register the template file and run your project. library(drake) library(future.batchtools) future::plan(batchtools_sge, template = "sge_batchtools.tmpl") make(plan, parallelism = "future", jobs = 2) 13.7.6 Parallel computing within targets To recruit parallel processes within individual targets, we recommend the future.callr and furrr packages. Usage details depend on the parallel backend you choose for make(). If you must write custom code with mclapply(), please read the subsection below on locked bindings/environments. 13.7.6.1 Locally Use future.callr and furrr normally. library(drake) # The targets just collect the process IDs of the callr processes. plan <- drake_plan( x = furrr::future_map_int(1:2, function(x) Sys.getpid()), y = furrr::future_map_int(1:2, function(x) Sys.getpid()) ) # Tell the drake targets to fork up to 4 callr processes. future::plan(future.callr::callr) # Build the targets. make(plan) # Process IDs of the local workers of x: readd(x) 13.7.6.2 Persistent workers Each persistent worker needs its own future::plan(), which we set with the prework argument of make(). The following example uses SGE. To learn about templates for other clusters, please consult the clustermq documentation. 
library(drake) # The targets just collect the process IDs of the callr processes. plan <- drake_plan( x = furrr::future_map_int(1:2, function(x) Sys.getpid()), y = furrr::future_map_int(1:2, function(x) Sys.getpid()) ) # Write a template file for clustermq. writeLines( c( "#!/bin/bash", "#$ -N {{ job_name }} # job name", "#$ -t 1-{{ n_jobs }} # submit jobs as array", "#$ -j y # combine stdout/error in one file", "#$ -o {{ log_file | /dev/null }} # output file", "#$ -cwd # use pwd as work dir", "#$ -V # use environment variables", "#$ -pe smp 4 # request 4 cores per job", "module load R-qualified/3.5.2 # if loading R from an environment module", "ulimit -v $(( 1024 * {{ memory | 4096 }} ))", "CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker(\\"{{ master }}\\")'" ), "sge_clustermq.tmpl" ) # Register the scheduler and template file with clustermq. options( clustermq.scheduler = "sge", clustermq.template = "sge_clustermq.tmpl" ) # Build the targets. make( plan, parallelism = "clustermq", jobs = 2, # Each of the two workers can spawn up to 4 local processes. prework = quote(future::plan(future.callr::callr)) ) # Process IDs of the local workers of x: readd(x) 13.7.6.3 Transient workers As explained in the future vignette, we can nest our future::plans(). Each target gets its own remote job, and each job can spawn up to 4 local callr processes. The following example uses SGE. To learn about templates for other clusters, please consult the future.batchtools documentation. library(drake) # The targets just collect the process IDs of the callr processes. plan <- drake_plan( x = furrr::future_map_int(1:2, function(x) Sys.getpid()), y = furrr::future_map_int(1:2, function(x) Sys.getpid()) ) # Write a template file for future.batchtools. 
writeLines( c( "#!/bin/bash", "#$ -cwd # use pwd as work dir", "#$ -j y # combine stdout/error in one file", "#$ -o <%= log.file %> # output file", "#$ -V # use environment variables", "#$ -N <%= job.name %> # job name", "#$ -pe smp 4 # 4 cores per job", "module load R # if loading R from an environment module", "Rscript -e 'batchtools::doJobCollection(\\"<%= uri %>\\")'", "exit 0" ), "sge_batchtools.tmpl" ) # In our nested plans, each target gets its own remote SGE job, # and each worker can spawn up to 4 `callr` processes. future::plan( list( future::tweak( future.batchtools::batchtools_sge, template = "sge_batchtools.tmpl" ), future.callr::callr ) ) # Build the targets. make(plan, parallelism = "future", jobs = 2) # Process IDs of the local workers of x: readd(x) 13.7.6.4 Number of local workers per target By default, future::availableCores() determines the number of local callr workers. To better manage resources, you may wish to further restrict the number of callr workers for all targets in the plan, e.g. future::plan(future.callr::callr, workers = 4L) or: future::plan( list( future::tweak( future.batchtools::batchtools_sge, template = "sge_batchtools.tmpl" ), future::tweak(future.callr::callr, workers = 4L) ) ) Alternatively, you can use chunking to prevent individual targets from using too many workers, e.g. furrr::future_map(.options = furrr::future_options(scheduling = 4)). Here, the scheduling argument sets the average number of futures per worker. 13.7.6.5 Locked binding/environment errors Some workflows unavoidably use mclapply(), which is known to modify the global environment against drake’s will. If you are stuck, there are two workarounds. Use make(lock_envir = FALSE). Use the envir argument of make(). That way, drake locks your special custom environment instead of the global environment. 
# Load the main example: https://github.com/wlandau/drake-examples library(drake) drake_example("main") setwd("main") # Define and populate a special custom environment. envir <- new.env(parent = globalenv()) source("R/packages.R", local = envir) source("R/functions.R", local = envir) source("R/plan.R", local = envir) # Check the contents of your environments. ls(envir) # Should have your functions and plan ls() # The global environment should only have what you started with. # Build the targets using your custom environment make(envir$plan, envir = envir) "],
["time.html", "Chapter 14 Time: speed, time logging, prediction, and strategy 14.1 Why is drake so slow? 14.2 Time logging 14.3 Predict total runtime 14.4 Strategize your high-performance computing", " Chapter 14 Time: speed, time logging, prediction, and strategy 14.1 Why is drake so slow? 14.1.1 Help us find out! If you encounter slowness, please report it to https://github.com/ropensci/drake/issues and we will do our best to speed up drake for your use case. Please include a reproducible example and tell us about your operating system and version of R. In addition, flame graphs from the proffer package really help us identify bottlenecks. 14.1.2 Too many targets? make() and friends tend to slow down if you have a huge number of targets. There are unavoidable overhead costs from storing each single target and checking if it is up to date, so please read this advice on choosing good targets and consider dividing your work into a manageably small number of meaningful targets. Dynamic branching can also help in many cases. 14.1.3 Big data? drake saves the return value of each target to on disk storage. So in addition to dividing your work into a smaller number of targets, specialized storage formats can help speed things up. It may also be worth reflecting on how much data you really need to store. And if the cache is too big, the storage chapter has advice for downsizing it. 14.1.4 Aggressive shortcuts If your plan still needs tens of thousands of targets, you can take aggressive shortcuts to make things run faster. make( plan, verbose = 0L, # Console messages can pile up runtime. log_progress = FALSE, # drake_progress() will be useless. log_build_times = FALSE, # build_times() will be useless. recoverable = FALSE, # make(recover = TRUE) cannot be used later. history = FALSE, # drake_history() cannot be used later. session_info = FALSE, # drake_get_session_info() cannot be used later. 
lock_envir = FALSE, # See https://docs.ropensci.org/drake/reference/make.html#self-invalidation. ) 14.2 Time logging Thanks to Jasper Clarkberg, drake records how long it takes to build each target. For large projects that take hours or days to run, this feature becomes important for planning and execution. library(drake) load_mtcars_example() # from https://github.com/wlandau/drake-examples/tree/master/mtcars make(my_plan) build_times(digits = 8) # From the cache. ## `dplyr`-style `tidyselect` commands build_times(starts_with("coef"), digits = 8) 14.3 Predict total runtime drake uses these times to predict the runtime of the next make(). At this moment, everything is up to date in the current example, so the next make() should ideally take no time at all (except for preprocessing overhead). predict_runtime(my_plan) Suppose we change a dependency to make some targets out of date. Now, the next make() should take longer since some targets are out of date. reg2 <- function(d){ d$x3 <- d$x ^ 3 lm(y ~ x3, data = d) } predict_runtime(my_plan) And what if you plan to delete the cache and build all the targets from scratch? predict_runtime(my_plan, from_scratch = TRUE) 14.4 Strategize your high-performance computing Let’s say you are scaling up your workflow. You just put bigger data and heavier computation in your custom code, and the next time you run make(), your targets will take much longer to build. In fact, you estimate that every target except for your R Markdown report will take two hours to complete. Let’s write down these known times in seconds. known_times <- rep(7200, nrow(my_plan)) names(known_times) <- my_plan$target known_times["report"] <- 5 known_times How many parallel jobs should you use in the next make()? The predict_runtime() function can help you decide. predict_runtime(jobs = n) simulates persistent parallel workers and reports the estimated total runtime of make(jobs = n). (See also predict_workers().) 
time <- c() for (jobs in 1:12){ time[jobs] <- predict_runtime( my_plan, jobs_predict = jobs, from_scratch = TRUE, known_times = known_times ) } library(ggplot2) ggplot(data.frame(time = time / 3600, jobs = ordered(1:12), group = 1)) + geom_line(aes(x = jobs, y = time, group = group)) + scale_y_continuous(breaks = 0:10 * 4, limits = c(0, 29)) + theme_gray(16) + xlab("jobs argument of make()") + ylab("Predicted runtime of make() (hours)") We see serious potential speed gains up to 4 jobs, but beyond that point, we have to double the jobs to shave off another 2 hours. Your choice of jobs for make() ultimately depends on the runtime you can tolerate and the computing resources at your disposal. A final note on predicting runtime: the output of predict_runtime() and predict_workers() also depends on the optional workers column of your drake_plan(). If you micromanage which workers are allowed to build which targets, you may minimize reads from disk, but you could also slow down your workflow if you are not careful. See the high-performance computing guide for more. "],
["memory.html", "Chapter 15 Memory management 15.1 Garbage collection and custom files 15.2 Memory strategies 15.3 Data splitting", " Chapter 15 Memory management The default settings of drake prioritize speed over memory efficiency. For projects with large data, this default behavior can cause problems. Consider the following hypothetical workflow, where we simulate several large datasets and summarize them. reps <- 10 # Serious workflows may have several times more. # Reduce `n` to lighten the load if you want to try this workflow yourself. # It is super high in this chapter to motivate the memory issues. generate_large_data <- function(rep, n = 1e8) { tibble(x = rnorm(n), y = rnorm(n), rep = rep) } get_means <- function(...) { out <- NULL for (dataset in list(...)) { out <- bind_rows(out, colMeans(dataset)) } out } plan <- drake_plan( large_data = target( generate_large_data(rep), transform = map(rep = !!seq_len(reps), .id = FALSE) ), means = target( get_means(large_data), transform = combine(large_data) ), summ = summary(means) ) print(plan) vis_drake_graph(plan) If you call make(plan) with no additional arguments, drake will try to load all the datasets into the same R session. Each dataset from generate_large_data(n = 1e8) occupies about 2.4 GB of memory, and most machines cannot handle all the data at once. We should use memory more wisely. 15.1 Garbage collection and custom files make() has a garbage_collection argument, which tells drake to periodically unload data objects that no longer belong to variables. You can also run garbage collection manually with the gc() function. For more on garbage collection, please refer to the memory usage chapter of Advanced R. Let’s reduce the memory consumption of our example workflow: Call gc() after every loop iteration of get_means(). Avoid drake’s caching system with custom file_out() files in the plan. Call make(plan, garbage_collection = TRUE). reps <- 10 # Serious workflows may have several times more. 
files <- paste0(seq_len(reps), ".rds") generate_large_data <- function(file, n = 1e8) { out <- tibble(x = rnorm(n), y = rnorm(n)) # a hundred million rows saveRDS(out, file) } get_means <- function(files) { out <- NULL for (file in files) { x <- colMeans(readRDS(file)) out <- bind_rows(out, x) gc() # Use the gc() function here to make sure each x gets unloaded. } out } plan <- drake_plan( large_data = target( generate_large_data(file = file_out(file)), transform = map(file = !!files, .id = FALSE) ), means = get_means(file_in(!!files)), summ = summary(means) ) print(plan) vis_drake_graph(plan) make(plan, garbage_collection = TRUE) 15.2 Memory strategies make() has a memory_strategy argument to customize how drake loads and unloads targets. With the right memory strategy, you can rely on drake’s built-in caching system without having to bother with messy file_out() files. Each memory strategy follows three stages for each target: Initial discard: before building the target, optionally discard some other targets from the R session. The choice of discards depends on the memory strategy. (Note: we do not actually get the memory back until we call gc().) Initial load: before building the target, optionally load any dependencies that are not already in memory. Final discard: optionally discard or keep the return value after the target finishes building. Either way, the return value is still stored in the cache, so you can load it with loadd() and readd(). The implementation of these steps varies from strategy to strategy. Memory strategy Initial discard Initial load Final discard “speed” Discard nothing Load any missing dependencies. Keep the return value loaded. “autoclean”1 Discard all targets which are not dependencies of the current target. Load any missing dependencies. Discard the return value. “preclean” Discard all targets which are not dependencies of the current target. Load any missing dependencies. Keep the return value loaded. 
“lookahead” Discard all targets which are not dependencies of either (1) the current target or (2) other targets waiting to be checked or built. Load any missing dependencies. Keep the return value loaded. “unload”2 Unload all targets. Load nothing. Discard the return value. “none”3 Unload nothing. Load nothing. Discard the return value. With the \"speed\", \"autoclean\", \"preclean\", and \"lookahead\" strategies, you can simply call make(plan, memory_strategy = YOUR_CHOICE, garbage_collection = TRUE) and trust that your targets will build normally. For the \"unload\" and \"none\" strategies, there is extra work to do: you will need to manually load each target’s dependencies with loadd() or readd(). This manual bookkeeping lets you aggressively optimize your workflow, and it is less cumbersome than swarms of file_out() files. It is particularly useful when you have a large combine() step. Let’s redesign the workflow to reap the benefits of make(plan, memory_strategy = \"none\", garbage_collection = TRUE). The trick is to use match.call() inside get_means() so we can load and unload dependencies one at a time instead of all at once. reps <- 10 # Serious workflows may have several times more. generate_large_data <- function(rep, n = 1e8) { tibble(x = rnorm(n), y = rnorm(n), rep = rep) } # Load targets one at a time get_means <- function(...) { arg_symbols <- match.call(expand.dots = FALSE)$... arg_names <- as.character(arg_symbols) out <- NULL for (arg_name in arg_names) { dataset <- readd(arg_name, character_only = TRUE) out <- bind_rows(out, colMeans(dataset)) gc() # Run garbage collection. } out } plan <- drake_plan( large_data = target( generate_large_data(rep), transform = map(rep = !!seq_len(reps), .id = FALSE) ), means = target( get_means(large_data), transform = combine(large_data) ), summ = { loadd(means) # Annoying, but necessary with the "none" strategy. summary(means) } ) Now, we can build our targets. 
make(plan, memory_strategy = "none", garbage_collection = TRUE) But there is a snag: we needed to manually load means in the command for summ (notice the call to loadd()). This is annoying, especially because means is quite small. Fortunately, drake lets you define different memory strategies for different targets in the plan. The target-specific memory strategies override the global one (i.e. the memory_strategy argument of make()). plan <- drake_plan( large_data = target( generate_large_data(rep), transform = map(rep = !!seq_len(reps), .id = FALSE), memory_strategy = "none" ), means = target( get_means(large_data), transform = combine(large_data), memory_strategy = "unload" # Be careful with this one. ), summ = summary(means) ) print(plan) In fact, now you can run make() without setting a global memory strategy at all. make(plan, garbage_collection = TRUE) 15.3 Data splitting The split() transformation breaks up a dataset into smaller targets. The ordinary use of split() is to partition an in-memory dataset into slices. drake_plan( data = get_large_data(), x = target( data %>% analyze_data(), transform = split(data, slices = 4) ) ) However, you can also use it to load individual pieces of a large file, thus conserving memory. The trick is to break up an index set instead of the data itself. In the following sketch, get_number_of_rows() and read_selected_rows() are user-defined functions, and %>% is the magrittr pipe. get_number_of_rows <- function(file) { # ... } read_selected_rows <- function(which_rows, file) { # ... } plan <- drake_plan( row_indices = file_in("large_file.csv") %>% get_number_of_rows() %>% seq_len(), subset = target( row_indices %>% read_selected_rows(file = file_in("large_file.csv")), transform = split(row_indices, slices = 4) ) ) plan drake_plan_source(plan) Only supported in drake version 7.5.0 and above.↩︎ Only supported in drake version 7.4.0 and above.↩︎ Only supported in drake version 7.4.0 and above.↩︎ "],
["storage.html", "Chapter 16 Storage 16.1 drake’s cache 16.2 Efficient target storage 16.3 Why is my cache so big? 16.4 Interfaces to the cache", " Chapter 16 Storage 16.1 drake’s cache When you run make(), drake stores your targets in a hidden storage cache. library(drake) load_mtcars_example() # from https://github.com/wlandau/drake-examples/tree/master/mtcars make(my_plan, verbose = 0L) The default cache is a hidden .drake folder. find_cache() ### [1] "/home/you/project/.drake" drake’s loadd() and readd() functions load targets into memory. loadd(large) head(large) head(readd(small)) 16.2 Efficient target storage drake supports custom formats for large and specialized targets. For example, the \"fst\" format uses the fst package to save data frames faster. Simply enclose the command and the format together with the target() function. library(drake) n <- 1e8 # Each target is 1.6 GB in memory. plan <- drake_plan( data_fst = target( data.frame(x = runif(n), y = runif(n)), format = "fst" ), data_old = data.frame(x = runif(n), y = runif(n)) ) make(plan) #> target data_fst #> target data_old build_times(type = "build") #> # A tibble: 2 x 4 #> target elapsed user system #> <chr> <Duration> <Duration> <Duration> #> 1 data_fst 13.93s 37.562s 7.954s #> 2 data_old 184s (~3.07 minutes) 177s (~2.95 minutes) 4.157s For more details and a complete list of formats, see https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets. 16.3 Why is my cache so big? 16.3.1 Old targets By default, drake holds on to all your targets from all your runs of make(). Even if you run clean(), the data stays in the cache in case you need to recover it. clean() make(my_plan, recover = TRUE) If you really want to remove old historical values of targets, run drake_gc() or drake_cache()$gc(). drake_gc() clean() also has a garbage_collection argument for this purpose. Here is a slick way to remove historical targets and targets no longer in your plan. 
clean(list = cached_unplanned(my_plan), garbage_collection = TRUE) 16.3.2 Garbage from interrupted builds If make() crashes or gets interrupted, old files can accumulate in .drake/scratch/ and .drake/drake/tmp/. As long as make() is no longer running, you can safely remove the files in those folders (but keep the folders themselves). 16.4 Interfaces to the cache drake uses the storr package to create and modify caches. library(storr) cache <- storr_rds(".drake") head(cache$list()) head(cache$get("small")) drake has its own interface on top of storr to make it easier to work with the default .drake/ cache. The loadd(), readd(), and cached() functions explore saved targets. head(cached()) head(readd(small)) loadd(large) head(large) rm(large) # Does not remove `large` from the cache. new_cache() creates caches and drake_cache() recovers existing ones. (drake_cache() is only supported in drake version 7.4.0 and above.) cache <- drake_cache() cache$driver$path cache <- drake_cache(path = ".drake") # File path to drake's cache. cache$driver$path You can supply your own cache to make() and friends (including specialized storr caches like storr_dbi()). plan <- drake_plan(x = 1, y = sqrt(x)) make(plan, cache = cache) vis_drake_graph(plan, cache = cache) Destroy caches to remove them from your file system. cache$destroy() file.exists(".drake") "],
["visuals.html", "Chapter 17 Visualization with drake 17.1 Plotting plans 17.2 Underlying graph data: node and edge data frames 17.3 Visualizing target status 17.4 Subgraphs 17.5 Control the vis_drake_graph() legend. 17.6 Clusters 17.7 Output files 17.8 Node Selection", " Chapter 17 Visualization with drake Data analysis projects have complicated networks of dependencies, and drake can help you visualize them with vis_drake_graph(), sankey_drake_graph(), and drake_ggraph() (note the two g’s). 17.1 Plotting plans In drake versions above 7.7.0, you can simply plot() the plan to show the targets and their dependency relationships. library(drake) # from https://github.com/wlandau/drake-examples/tree/master/mtcars load_mtcars_example() my_plan plot(my_plan) 17.1.1 vis_drake_graph() Powered by visNetwork. Colors represent target status, and shapes represent data type. These graphs are interactive, so you can click, drag, zoom, and pan to adjust the size and position. Double-click on nodes to contract neighborhoods into clusters or expand them back out again. If you hover over a node, you will see text in a tooltip showing the first few lines of The command of a target, or The body of an imported function, or The content of an imported text file. vis_drake_graph(my_plan) To save this interactive widget for later, just supply the name of an HTML file. vis_drake_graph(my_plan, file = "graph.html") To save a static image file, supply a file name that ends in \".png\", \".pdf\", \".jpeg\", or \".jpg\". vis_drake_graph(my_plan, file = "graph.png") 17.1.2 sankey_drake_graph() These interactive networkD3 Sankey diagrams have more nuance: the height of each node is proportional to its number of connections. Nodes with many incoming connections tend to fall out of date more often, and nodes with many outgoing connections can invalidate bigger chunks of the downstream pipeline. sankey_drake_graph(my_plan) Saving the graphs is the same as before. 
sankey_drake_graph(my_plan, file = "graph.html") # Interactive HTML widget sankey_drake_graph(my_plan, file = "graph.png") # Static image file Unfortunately, a legend is not yet available for Sankey diagrams, but drake exposes a separate legend for the colors and shapes. library(visNetwork) legend_nodes() visNetwork(nodes = legend_nodes()) 17.1.3 drake_ggraph() drake_ggraph() can handle larger workflows than the other graphing functions. If your project has thousands of targets and vis_drake_graph()/sankey_drake_graph() does not render properly, consider drake_ggraph(). Powered by ggraph, drake_ggraph()s are static ggplot2 objects, and you can save them with ggsave(). drake_ggraph(my_plan) 17.1.4 text_drake_graph() If you are running R in a terminal without X Window support, the usual visualizations will not show up interactively in your session. Here, you can use text_drake_graph() to see a text display in your terminal window. Terminal colors are deactivated in this manual, but you will see color in your console. # Use nchar = 0 or nchar = 1 for better results. # The color display is better in your own terminal. text_drake_graph(my_plan, nchar = 3) 17.2 Underlying graph data: node and edge data frames drake_graph_info() is used behind the scenes in vis_drake_graph(), sankey_drake_graph(), and drake_ggraph() to get the graph information ready for rendering. To save time, you can call drake_graph_info() to get these internals and then call render_drake_graph(), render_sankey_drake_graph(), or render_drake_ggraph(). str(drake_graph_info(my_plan)) 17.3 Visualizing target status drake’s visuals tell you which targets are up to date and which are outdated. make(my_plan, verbose = 0L) outdated(my_plan) sankey_drake_graph(my_plan) When you change a dependency, some targets fall out of date (black nodes). 
reg2 <- function(d){ d$x3 <- d$x ^ 3 lm(y ~ x3, data = d) } sankey_drake_graph(my_plan) 17.4 Subgraphs Graphs can grow enormous for serious projects, so there are multiple ways to focus on a manageable subgraph. The most brute-force way is to just pick a manual subset of nodes. However, with the subset argument, the graphing functions can drop intermediate nodes and edges. vis_drake_graph( my_plan, subset = c("regression2_small", "large") ) The rest of the subgraph functionality preserves connectedness. Use targets_only to ignore the imports. vis_drake_graph(my_plan, targets_only = TRUE) Similarly, you can just show downstream nodes. vis_drake_graph(my_plan, from = c("regression2_small", "regression2_large")) Or upstream ones. vis_drake_graph(my_plan, from = "small", mode = "in") In fact, let us just take a small neighborhood around a target in both directions. For the graph below, given order is 1, but all the custom file_out() output files of the neighborhood’s targets appear as well. This ensures consistent behavior between show_output_files = TRUE and show_output_files = FALSE (more on that later). vis_drake_graph(my_plan, from = "small", mode = "all", order = 1) 17.5 Control the vis_drake_graph() legend. Some arguments to vis_drake_graph() control the legend. vis_drake_graph(my_plan, full_legend = TRUE, ncol_legend = 2) To remove the legend altogether, set the ncol_legend argument to 0. vis_drake_graph(my_plan, ncol_legend = 0) 17.6 Clusters With the group and clusters arguments to the graphing functions, you can condense nodes into clusters. This is handy for workflows with lots of targets. Take the schools scenario from the drake plan guide. Our plan was generated with drake_plan(trace = TRUE), so it has wildcard columns that group nodes into natural clusters already. You can manually add such columns if you wish. # Visit https://books.ropensci.org/drake/static.html # to learn about the syntax with target(transform = ...). 
plan <- drake_plan( school = target( get_school_data(id), transform = map(id = c(1, 2, 3)) ), credits = target( fun(school), transform = cross( school, fun = c(check_credit_hours, check_students, check_graduations) ) ), public_funds_school = target( command = check_public_funding(school), transform = map(school = c(school_1, school_2)) ), trace = TRUE ) plan Ordinarily, the workflow graph gives a separate node to each individual import object or target. vis_drake_graph(plan) For large projects with hundreds of nodes, this can get quite cumbersome. But here, we can choose a wildcard column (or any other column in the plan, even custom columns) to condense nodes into natural clusters. For the group argument to the graphing functions, choose the name of a column in plan or a column you know will be in drake_graph_info(my_plan)$nodes. Then for clusters, choose the values in your group column that correspond to nodes you want to bunch together. The new graph is not as cumbersome. vis_drake_graph(plan, group = "school", clusters = c("school_1", "school_2", "school_3") ) As previously mentioned, you can group on any column in drake_graph_info(my_plan)$nodes. Let’s return to the mtcars project for demonstration. vis_drake_graph(my_plan) Let’s condense all the imports into one node and all the up-to-date targets into another. That way, the outdated targets stand out. vis_drake_graph( my_plan, group = "status", clusters = c("imported", "up to date") ) 17.7 Output files drake can reproducibly track multiple output files per target and show them in the graph. 
plan <- drake_plan( target1 = { file.copy(file_in("in1.txt"), file_out("out1.txt")) file.copy(file_in("in2.txt"), file_out("out2.txt")) }, target2 = { file.copy(file_in("out1.txt"), file_out("out3.txt")) file.copy(file_in("out2.txt"), file_out("out4.txt")) } ) writeLines("in1", "in1.txt") writeLines("in2", "in2.txt") make(plan) writeLines("abcdefg", "out3.txt") vis_drake_graph(plan, targets_only = TRUE) If your graph is too busy, you can hide the output files with show_output_files = FALSE. vis_drake_graph(plan, show_output_files = FALSE, targets_only = TRUE) 17.8 Node Selection (Supported in drake > 7.7.0 only) First, we define our plan, adding a custom column named “link”. mtcars_link <- "https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html" plan <- drake_plan( mtc = target( mtcars, link = !!mtcars_link ), mtc2 = target( mtc, link = !!mtcars_link ), mtc3 = target( modify_mtc2(mtc2, number), transform = map(number = !!c(1:3), .tag_in = cluster_id), link = !!mtcars_link ), trace = TRUE ) unique_stems <- unique(plan$cluster_id) 17.8.1 Perform the default action on select If you call vis_drake_graph(on_select = TRUE, on_select_col = \"my_column\"), drake treats the values in the column named \"my_column\" as hyperlinks. Click on a node in the graph to navigate to the corresponding link in your browser. vis_drake_graph( plan, clusters = unique_stems, group = "cluster_id", on_select_col = "link", on_select = TRUE ) 17.8.2 Perform no action on select No action will be taken if any of the following are given to vis_drake_graph(): on_select = NULL, on_select = FALSE, on_select_col = NULL This is the default behaviour. vis_drake_graph( my_plan, clusters = unique_stems, group = "cluster_id", on_select_col = "link", on_select = NULL ) 17.8.3 Customize the onSelect event behaviour What if we instead wanted the browser to display an alert when a node is clicked? 
alert_behaviour <- function(){ js <- " function(props) { alert('selected node with on_select_col: \\\\r\\\\n' + this.body.data.nodes.get(props.nodes[0]).on_select_col); }" } vis_drake_graph( my_plan, on_select_col = "link", on_select = alert_behaviour() ) "],
["debugging.html", "Chapter 18 Debugging and testing drake projects 18.1 Debugging failed targets 18.2 Why do my targets keep rerunning? 18.3 More help", " Chapter 18 Debugging and testing drake projects This chapter aims to help users detect and diagnose problems with large complex workflows. 18.1 Debugging failed targets 18.1.1 Diagnosing errors When a target fails, drake tries to tell you. large_dataset <- function() { data.frame(x = rnorm(1e6), y = rnorm(1e6)) } expensive_analysis <- function(data) { # More operations go here. tricky_operation(data) } tricky_operation <- function(data) { # Expensive code here. stop("there is a bug somewhere.") } plan <- drake_plan( data = large_dataset(), analysis = expensive_analysis(data) ) make(plan) diagnose() recovers the metadata on targets. For failed targets, this includes an error object. error <- diagnose(analysis)$error error names(error) Using the call stack, you can trace back the location of the error. Once you know roughly where to find the bug, you can troubleshoot interactively. invisible(lapply(tail(error$calls, 3), print)) 18.1.2 Interactive debugging The clues from diagnose() help us go back and inspect the failing code. debug() is an interactive debugging tool which helps you verify exactly what is going wrong. Below, make(plan) pauses execution and turns interactive control over to you inside tricky_operation(). debug(tricky_operation) make(plan) # Pauses at tricky_operation(data). undebug(tricky_operation) # Undoes debug(). drake’s own drake_debug() function is nearly equivalent. drake_debug(analysis, plan) # Pauses at the command expensive_analysis(data). browser() is similar, but it affords you finer control over where to pause execution. tricky_operation <- function(data) { # Expensive code here. browser() # Pauses right here to give you control. 
stop("there is a bug somewhere.") } make(plan) 18.1.3 Efficient trial and error If you are using drake, then chances are your targets are computationally expensive and the long runtimes make debugging difficult. To speed up trial and error, run the plan on a small dataset when you debug and repair things. plan <- drake_plan( data = head(large_dataset()), # Just work with the first few rows. analysis = expensive_analysis(data) # Runs faster now. ) tricky_operation <- ... # Try to fix the function. debug(tricky_operation) # Set up to debug interactively. make(plan) # Try to run the workflow. After a lot of quick trial and error, we finally fix the function and run it on the small data. tricky_operation <- function(data) { # Good code goes here. } make(plan) Now that the code works, it is time to scale back up to the large data. Use make(plan, recover = TRUE) to salvage old targets from before the debugging process. plan <- drake_plan( data = large_dataset(), # Use the large data again. analysis = expensive_analysis(data) # Should be repaired now. ) make(plan, recover = TRUE) 18.2 Why do my targets keep rerunning? Consider the following completed workflow. load_mtcars_example() make(my_plan) At this point, if you change the reg1() function, then make() will automatically detect and rerun downstream targets such as regression1_large. reg1 <- function (d) { lm(y ~ 1 + x, data = d) } make(my_plan) In general, targets are “outdated” or “invalidated” if they are out of sync with their dependencies. If a target is outdated, the next make() automatically detects discrepancies and rebuilds the affected targets. Usually, this automation adds convenience, saves time, and ensures reproducibility in the face of long runtimes. However, it can be frustrating when drake detects outdated targets when you think everything is up to date. If this happens, it is important to understand How your workflow fits together. Which targets are outdated. Why your targets are outdated. 
Strategies to prevent unexpected changes in the future. drake’s utility functions offer clues to guide you. 18.2.1 How your workflow fits together drake automatically analyzes your plan and functions to understand how your targets depend on each other. It assembles this information in a directed acyclic graph (DAG) which you can visualize and explore. vis_drake_graph(my_plan) To get a more localized version of the graph, use deps_target(). Unlike vis_drake_graph(), deps_target() gives you a more granular view of the dependencies of an individual target. deps_target(regression1_large, my_plan) deps_target(report, my_plan) To understand how drake detects dependencies in the first place, use deps_code(). This is what drake first sees when it reads your plan and functions to understand the dependencies. deps_code(quote( suppressWarnings(summary(regression1_large$residuals)) )) deps_code(quote( knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE) )) If drake detects new dependencies you were unaware of, that could be a reason why your targets are out of date. 18.2.2 Which targets are outdated Graphing utilities like vis_drake_graph() label the outdated targets, but sometimes it is helpful to get a more programmatic view. outdated(my_plan) 18.2.3 Why your targets are outdated The deps_profile() function offers clues. deps_profile(regression1_small, my_plan) From the data frame above, regression1_small is outdated because an R object dependency changed since the last make(). drake does not hold on to enough information to tell you precisely which object is the culprit, but functions like vis_drake_graph(), deps_target(), and deps_code() can help narrow down the possibilities. 18.2.4 Strategies to prevent unexpected changes in the future drake is sensitive to changing functions in your global environment, and this sensitivity can invalidate targets unexpectedly. 
Whenever you plan to run make(), it is always best to restart your R session and load your packages and functions into a fresh clean workspace. r_make() does all this cleaning and prep work for you automatically, and it is more robust and dependable (and childproofed) than ordinary make(). To read more, visit https://books.ropensci.org/drake/projects#safer-interactivity. 18.3 More help The GitHub issue tracker is the best place to request help with your specific use case. "],
["triggers.html", "Chapter 19 Triggers: decision rules for building targets 19.1 What are triggers? 19.2 Customization 19.3 Alternative trigger modes 19.4 A more practical example", " Chapter 19 Triggers: decision rules for building targets When you call make(), drake tries to skip as many targets as possible. If it thinks a command will return the same value as last time, it does not bother running it. In other words, drake is lazy, and laziness saves you time. 19.1 What are triggers? To figure out whether it can skip a target, drake goes through an intricate checklist of triggers: The missing trigger: Do we lack a return value from a previous make()? Maybe you are building the target for the first time or you removed it from the cache with clean(). The command trigger: did the command in the drake plan change nontrivially since the last make()? Changes to spacing, formatting, and comments are ignored. The depend trigger: did any non-file dependencies change since the last make()? These could be: Other targets. Imported objects. Imported functions. To track changes to a function, drake removes any code closed in ignore(), deparses the literal code so that whitespace is standardized and comments are removed, and then hashes the resulting string. In some cases, drake makes special adjustments for strange edge cases like Rcpp functions with pointers and functions defined with Vectorize(). However, edge cases like this one are inevitable because of the flexibility of R. Any dependencies of imported functions. Any dependencies of dependencies of imported functions, and so on. The file trigger: did any file inputs or file outputs change since the last make()? These files are the ones explicitly declared in the command with file_in(), knitr_in(), and file_out(). The seed trigger: for statistical reproducibility, drake assigns a unique seed to each target based on the target’s name and the global seed argument to make(). 
If you change the target’s pseudo-random number generator seed either with the seed argument or the custom seed column in the plan, this change will cause a rebuild if the seed trigger is turned on. The format trigger: did you add or change the target’s storage format since last build? Details: https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets. The condition trigger: an optional user-defined piece of code that evaluates to a TRUE/FALSE value. The target builds if the value is TRUE. The change trigger: an optional user-defined piece of code that evaluates to any value (preferably small and quick to compute). The target builds if the value changed since the last make(). If any trigger detects something wrong or different with the target or its dependencies, the next make() will run the command and (re)build the target. 19.2 Customization With the trigger() function, you can create your own customized checklist of triggers. Let’s run a simple workflow with just the missing trigger. We deactivate the command, depend, and file triggers by setting the respective command, depend, and file arguments to FALSE. plan <- drake_plan( psi_1 = (sqrt(5) + 1) / 2, psi_2 = (sqrt(5) - 1) / 2 ) make(plan, trigger = trigger(command = FALSE, depend = FALSE, file = FALSE)) Now, even if you wreck all the commands, nothing rebuilds. plan <- drake_plan( psi_1 = (sqrt(5) + 1) / 2 + 9999999999999, psi_2 = (sqrt(5) - 1) / 2 - 9999999999999 ) make(plan, trigger = trigger(command = FALSE, depend = FALSE, file = FALSE)) You can also give different triggers to different targets. Triggers in the drake plan override the trigger argument to make(). Below, psi_2 always builds, but psi_1 only builds if it has never been built before. 
plan <- drake_plan( psi_1 = (sqrt(5) + 1) / 2 + 9999999999999, psi_2 = target( command = (sqrt(5) - 1) / 2 - 9999999999999, trigger = trigger(condition = psi_1 > 0) ) ) plan make(plan, trigger = trigger(command = FALSE, depend = FALSE, file = FALSE)) make(plan, trigger = trigger(command = FALSE, depend = FALSE, file = FALSE)) Interestingly, psi_2 now depends on psi_1. Since psi_1 is part of the target psi_2 because of the condition trigger, it needs to be up to date before we attempt psi_2. However, since psi_1 is not part of the command, changing it will not trip the other triggers such as depend. vis_drake_graph(plan) In the next toy example below, drake reads from a file to decide whether to build x. Try it out. plan <- drake_plan( x = target( 1 + 1, trigger = trigger(condition = file_in(readRDS("file.rds"))) ) ) saveRDS(TRUE, "file.rds") make(plan) make(plan) make(plan) saveRDS(FALSE, "file.rds") make(plan) make(plan) make(plan) In a real project with remote data sources, you may want to use the condition trigger to limit your builds to times when enough bandwidth is available for a large download. For example, drake_plan( x = target( command = download_large_dataset(), trigger = trigger(condition = is_enough_bandwidth()) ) ) Since the change trigger can return any value, it is often easier to use than the condition trigger. clean(destroy = TRUE) plan <- drake_plan( x = target( command = 1 + 1, trigger = trigger(change = sqrt(y)) ) ) y <- 1 make(plan) make(plan) y <- 2 make(plan) In practice, you may want to use the change trigger to check a large remote before downloading it. drake_plan( x = target( command = download_large_dataset(), trigger = trigger( condition = is_enough_bandwidth(), change = date_last_modified() ) ) ) A word of caution: every non-NULL change trigger is always evaluated, and its value is carried around in memory throughout make(). 
So if you are not careful, heavy use of the change trigger could slow down your workflow and consume extra resources. The change trigger should return small values (and should ideally be quick to evaluate). To reduce memory consumption, you may want to return a fingerprint of your trigger value rather than the value itself. See the digest package for more information on computing hashes/fingerprints. library(digest) drake_plan( x = target( command = download_large_dataset(), trigger = trigger( change = digest(download_medium_dataset()) ) ) ) 19.3 Alternative trigger modes Sometimes, you may want to suppress a target without having to worry about turning off every single trigger. That is why the trigger() function has a mode argument, which controls the role of the condition trigger in the decision to build or skip a target. The available trigger modes are \"whitelist\" (default), \"blacklist\", and \"condition\". trigger(mode = \"whitelist\"): we rebuild the target whenever condition evaluates to TRUE. Otherwise, we defer to the other triggers. This is the default behavior described above in this chapter. trigger(mode = \"blacklist\"): we skip the target whenever condition evaluates to FALSE. Otherwise, we defer to the other triggers. trigger(mode = \"condition\"): here, the condition trigger is the only decider, and we ignore all the other triggers. We rebuild the target whenever condition evaluates to TRUE and skip it whenever condition evaluates to FALSE. 19.4 A more practical example See the “packages” example for a more practical demonstration of triggers and their usefulness. "],
["faq.html", "A Frequently-asked questions", " A Frequently-asked questions This FAQ is a compendium of pedagogically useful issues tagged on GitHub. To contribute, please submit a new issue and ask that it be labeled a frequently asked question. [After expose_imports(myPackage), the package functions cannot find @importFrom objects](https://github.com/ropensci/drake/issues/1286) Import drake caches into other drake caches Feature request: Better progress log output Making drake play nice w/brms for modeling w/Stan Serialize caches instead of storing them in memory? Efficient track of parameters/ artifacts for different models Why does wrapping make() in a function invalidate some targets? new transform function split to chunk a data.frame Plan should be out-of-date but isn’t map using initial parameter only follows through for one stage Dynamically scale clustermq workers knitr file paths List columns don’t work in map(.data) map() back to original variables after after combine() FAQ: Functions as data Avoid re-running targets if supplied args are the same as default args How to create a jagged cross() transform How to combine() while keeping track of the sources of targets target invalidated when referenced from another plan Within-target parallelism fails cannot remove bindings from a locked environment Functions that depend on targets Erroneous circular workflow error when using NSE in function function dependencies are missing: drake_config() in a magrittr pipe Best practices for including a drake workflow in a package Can you have multiple drake plans? evaluate file.path and variables in file_out and friends Working with HPC time limits Reproducibility with random numbers How should I mix non-R code (e.g. Python and shell scripts) in a large drake workflow? Reproducible remote data sources Trouble with caches sent through Dropbox How to add .R files to drake_plan() "],
["design.html", "B Design B.1 Principles B.2 Specific classes", " B Design This chapter explains drake’s internal design and architecture. Goals: Help developers and enthusiastic users contribute to the code base. Invite high-level advice and discussion about potential improvements to the overall design. B.1 Principles B.1.1 Functions first From the user’s point of view, drake is a style of programming in its own right, and that style is zealously and irrevocably function-oriented. It harmonizes with statistics and data science, where most methodology naturally takes the form of data transformations, and it embraces the natively function-oriented design of the R language. Functions are first-class citizens in drake, and they dominate the internal design at the highest levels. B.1.2 Light use of traditional OOP Most of a drake workflow happens inside the make() function. make() accepts a data frame of function calls (the drake plan), caches some targets, and then drops its internal state when it terminates. The state does not need to persist, and the user does not need to interact with it. This is a major reason why traditional object-oriented programming plays such a small, supporting role. In drake, full OOP classes and objects are small, simple, and extremely specialized. For example, the decorated storr, priority queue, and logger reference classes are narrowly defined and fit for purpose. The S3 system appears far more often, often as a mechanism of function overloading to streamline control flow, and also as a means of adding structure and validation to small target-specific objects optimized for performance. In future development, tactical reference classes will arise as needed to encapsulate low-level patterns into natural abstractions. However, drake’s design places greater importance on maximizing runtime efficiency. 
B.1.3 High-performant small objects drake maintains several small list-like objects for each target, such as the local spec, the target data, triggers, and the code analysis results. drake workflows with thousands of targets have thousands of these objects, and as profiling studies have shown, we need these objects to perform as efficiently as possible. Instantiation and field access need to be fast, and the memory footprint needs to be low. For these reasons, we choose simple lists with S3 class attributes, which outclass S4 and reference classes when it comes to instantiation speed. B.1.4 Fast iteration along aggregated data Each of the large data structures aggregates a single type of information across all targets to help drake run fast. Examples include the whole workflow specification (config$spec) and the in-memory target metadata cache (config$meta). These objects are hash-table-powered environments to make field access as fast as possible. B.1.5 Access to information across targets drake aggressively analyzes dependency relationships among targets. Even while make() builds a single target, it needs to stay aware of the other targets, not only to build the dependency graph, but also for other tasks like dynamic branching. This is a major reason why the workflow specification, dependency graph, priority queue, and metadata are all stored in environments that most functions can reach. B.2 Specific classes This section describes drake’s primary internal data structures at a high level. It is not exhaustive, but it does cover most of the architecture. B.2.1 Config make(), outdated(), vis_drake_graph(), and related utilities keep track of a drake_config() object. A drake_config() object is a list of class \"drake_config\". Its purpose is to keep track of the state of a drake workflow and avoid long parameter lists in functions. Future development will focus on refactoring and formalizing drake_config() objects. 
B.2.2 Settings Static runtime parameters such as keep_going and log_build_times live in a list of class drake_settings, which is part of each drake_config object. B.2.3 Plan The drake plan is a simple data frame of class \"drake_plan\", and it is drake’s version of a Makefile. The manual has a whole chapter on plans. B.2.4 Specification A drake plan is an implicit representation of targets and their immediate dependencies. Before make() starts to build targets, drake makes all these local dependency structures explicit and machine-readable in a workflow specification. The overall specification (config$spec) is an R environment with the local specification of each individual target and each imported object/function. Each local specification is a list of class \"drake_spec\", and it contains the names of objects referenced from the command, the files declared with file_in() and friends, the dependencies of the condition and change triggers, etc. B.2.5 Graph Whereas the specification tracks the local dependency structures, the graph (an igraph object) represents the global dependency structure of the whole workflow. It is less granular than the specification, and make() uses it to run the correct targets in the correct order. B.2.6 Priority queue In high-performance computing settings (e.g. parallelism = \"clustermq\" and parallelism = \"future\") drake creates a priority queue to schedule targets. For the sake of convenience, the underlying algorithms differ from those of a classical priority queue, but this does not seem to decrease performance in practice. B.2.7 Metadata config$meta is an environment, and each element is a list of class \"drake_meta\". Whereas the workflow specification identifies the names of dependencies, the \"drake_meta\" contains hashes (and supporting information). drake uses the hashes to decide whether the target is up to date. Metadata lists are stored in the \"meta\" namespace of the decorated storr. 
config$meta_old is similar to config$meta and exists for performance purposes. B.2.8 Cache B.2.8.1 API drake’s cache API is a decorated storr, a reference class that wraps around a storr object. drake relies heavily on storr namespaces (e.g. for metadata and recovery keys). drake’s custom wrapper around the storr class (i.e. the “decorated” part) has extra methods that power history (a txtq) and specialized data formats, as well as hash tables that only the cache needs. The new_cache() and drake_cache() functions create and reload drake caches, respectively, and they are equivalent to storr::storr_rds() plus drake:::decorate_storr(). B.2.8.2 Data Usually, the persistent data values live in a hidden .drake/ folder. Most of the files come from storr_rds() methods. Other files include the history txtq and the values of targets with specialized data formats. The files are structured so they can be used with either storr::storr_rds() or drake::drake_cache(). Other storr backends like storr_environment() and storr_dbi() are also compatible with this approach. In these non-standard cases, .drake/ does not contain the files of the inner storr, but it still has files supporting history and specialized target formats. B.2.9 Code analysis lists drake performs static code analysis on functions and commands in order to resolve the dependency structure of a workflow. Lists of class drake_deps and drake_deps_ht store the results of static code analysis on a single code chunk. Each element of a drake_deps list is a character vector of static dependencies of a certain type (e.g. global variables or file_in() files). The elements of drake_deps_ht lists are hash tables (which increase performance when the static code analysis is running). B.2.10 Environments drake has memory management strategies to make sure a target’s dependencies are loaded when make() runs its command. Internally, memory management works with a layered system of environments. 
This system helps make() protect the user’s calling environment and perform dynamic branching without the need for static code analysis or metaprogramming. config$envir: the calling environment of make(), which contains the user’s functions and other imported objects. make() tries to leave this environment alone (and temporarily locks it when lock_envir is TRUE). config$envir_targets: contains static targets. Its parent is config$envir. config$envir_dynamic: contains entire aggregated dynamic targets when drake needs them. Its parent is config$envir_targets. config$envir_subtargets: contains individual sub-targets. Its parent is config$envir_dynamic. In addition, config$envir_loaded keeps track of which targets are loaded in (2), (3), and (4) above. These environments form a known data clump, and future development will encapsulate them. B.2.11 Hash tables The drake_config() object and decorated storr keep track of multiple hash tables to cache data in memory and boost speed while iterating over large collections of targets. They are simply R environments with hash = TRUE, and drake has internal interface functions for working with them. Examples in drake_config() objects: ht_is_dynamic: keeps track of names of dynamic targets. Makes is_dynamic() faster. ht_is_subtarget: same as above, but for is_subtarget(). ht_dynamic_deps: names of dynamic dependencies of dynamic targets. Powers is_dynamic_dep(). ht_target_exists: tracks targets that already exist at the beginning of make(). ht_subtarget_parents: keeps track of the parent of each sub-target. Examples in the decorated storr: ht_encode_path and ht_decode_path: drake uses Base32 encoding to store references to static file paths. These hash tables avoid redundant encoding/decoding operations and increase performance for large collections of targets. ht_encode_namespaced and ht_decode_namespaced: same for imported namespaced functions. 
ht_hash: powers memo_hash(), which helps us avoid redundant calls to input_file_hash(), output_file_hash(), static_dependency_hash(), and dynamic_dependency_hash(). ht_keys: a small hash table that powers the set_progress method. This progress information is stored in the cache by default, and the user can retrieve it with drake_progress(). B.2.12 Logger The logger (config$logger) is a reference class that controls messages to the console and a custom log file (if applicable). Logging messages help users informally monitor the progress of make(). "]
]