Analyze multiple datasets in multiple ways with a smooth, efficient, parallelized, reproducible R workflow.

Try remakeGenerator

remakeGenerator will be the successor to workflowHelper. It is internally cleaner, more flexible, and more extensible than workflowHelper, and it is better suited to adapt to future updates to remake. remakeGenerator is tested and available for use.

workflowHelper

This package helps to analyze multiple datasets in multiple ways. Your workflow will be

  • Reproducible. Reproduce any analysis with one call to plan_workflow() and another to make.
  • Development-friendly. Thanks to remake, whenever you change your code, your next job will only recompute the affected tasks. This minimizes headache when your workflow is under heavy development and unexpected changes happen frequently.
  • Quick to set up. Just provide the commands to generate datasets, analyze an arbitrary dataset, etc., and workflowHelper will arrange these commands in a workflow and manage your output.
  • Parallelizable. Easily distribute your workflow over multiple parallel processes.

Prerequisites

Before using this package, you should first learn about remake. GNU make is recommended but not strictly required.

Installation

Ensure that R and GNU make are installed, as well as the dependencies in the DESCRIPTION. Open an R session and run

library(devtools)
install_github("wlandau/workflowHelper")

Alternatively, you can build the package from the source and install it by hand. First, ensure that git is installed. Next, open a command line program such as Terminal and enter the following commands.

git clone git@github.com:wlandau/workflowHelper.git
R CMD build workflowHelper
R CMD INSTALL ...

where ... is replaced by the name of the tarball produced by R CMD build.

Windows users need Rtools, because the example and tests sometimes use system("make") and similar commands.

Example

You can run this example from start to finish with the run_example_workflowHelper() function. Alternatively, you can set up earlier stages with write_example_workflowHelper() or setup_example_workflowHelper() and then run the output manually with remake::make() or make. Then, optionally, use the clean_example_workflowHelper() function to remove all the files generated by run_example_workflowHelper(). The details of the example are below.

Suppose I want to

  1. Generate some datasets.
  2. Analyze each dataset with multiple methods of analysis.
  3. Compute summary statistics of each analysis of each dataset (model coefficients and mean squared error) and aggregate the summaries together.
  4. Generate some tables, figures, and reports using those aggregated summaries.

I keep the functions to generate data, analyze data, etc. in code.R, and the script to organize and set up the workflow is workflow.R. There are also knitr reports latex.Rnw and markdown.Rmd. You can generate these files with the write_example_workflowHelper() function. Typically, in your own workflows, you will write these files by hand.

A walk through workflow.R

First, I list the R scripts containing my code and the package dependencies.

library(workflowHelper)
sources = strings(code.R)
packages = strings(MASS)
# packages = strings(MASS, rmarkdown, tools) # Uncomment before building pdf/html

The strings function converts R expressions into character strings, so I could have simply written sources = "code.R".
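To make the quoting behavior concrete, here is a minimal base-R sketch of how a function like strings could capture its arguments without evaluating them. The name strings_sketch is hypothetical, and this is not necessarily how workflowHelper implements strings.

```r
# Hypothetical stand-in for strings(); not the package's actual code.
strings_sketch <- function(...) {
  # Capture the arguments as unevaluated expressions.
  exprs <- as.list(substitute(list(...)))[-1]
  # Convert each captured expression back into text.
  out <- vapply(
    exprs,
    function(e) paste(deparse(e), collapse = " "),
    character(1)
  )
  unname(out)
}

sources <- strings_sketch(code.R)
packages <- strings_sketch(MASS, rmarkdown, tools)
```

Because the arguments are never evaluated, code.R and MASS do not need to exist as R objects.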

Next, I list the commands to generate the datasets.

datasets = commands(
  normal16 = normal_dataset(n = 16),
  poisson32 = poisson_dataset(n = 32),
  poisson64 = poisson_dataset(n = 64)
)

Be sure to give a unique name to each command (for example, poisson_dataset(n = 32) has the unique name poisson32). The commands function checks for names and returns a named character vector, so I could have simply written datasets = c(normal16 = "normal_dataset(n = 16)", poisson32 = "poisson_dataset(n = 32)", poisson64 = "poisson_dataset(n = 64)"). To generate 4 replicates of each kind of dataset, write datasets = reps(datasets, 4).
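The name check and the conversion to a named character vector can be sketched in base R as follows. The function commands_sketch is a hypothetical illustration under the behavior described above, not the package's own implementation.

```r
# Hypothetical stand-in for commands(); not the package's actual code.
commands_sketch <- function(...) {
  # Capture each command as an unevaluated expression.
  exprs <- as.list(substitute(list(...)))[-1]
  nms <- names(exprs)
  # Every command must carry a unique, non-empty name.
  if (length(exprs) > 0 &&
      (is.null(nms) || any(nms == "") || anyDuplicated(nms) > 0)) {
    stop("every command needs a unique name")
  }
  # Deparse the expressions; the argument names become the vector names.
  vapply(
    exprs,
    function(e) paste(deparse(e), collapse = " "),
    character(1)
  )
}

datasets <- commands_sketch(
  normal16 = normal_dataset(n = 16),
  poisson32 = poisson_dataset(n = 32)
)
```

Calling commands_sketch(normal_dataset(n = 16)) without a name stops with an error in this sketch.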

Similarly, I specify the commands to analyze each dataset.

analyses = commands(
  linear = linear_analysis(..dataset..),
  quadratic = quadratic_analysis(..dataset..)
)

The ..dataset.. wildcard stands for the current dataset being analyzed, which in this case is an object returned by normal_dataset or poisson_dataset. Wildcards are case-insensitive, so ..DATASET.. and ..dAtAsEt.. will also work.

For summaries of the analyses, there is an additional ..analysis.. wildcard that stands for the current object returned by linear_analysis or quadratic_analysis. Like ..dataset.., ..analysis.. is case-insensitive, so ..ANALYSIS.. will also work.

summaries = commands(
  mse = mse_summary(..dataset.., ..analysis..),
  coef = coefficients_summary(..analysis..)
)
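As a rough illustration of wildcard expansion, the sketch below substitutes each dataset name into each analysis command, ignoring case. The helper expand_wildcard is hypothetical, and the dataset_analysis naming scheme is inferred from the names that recallable() lists in this README. The package's internals may differ.

```r
# Hypothetical helper: replace ..<wildcard>.. in each command with each target.
expand_wildcard <- function(commands, wildcard, targets) {
  # Escape the literal dots that surround the wildcard name.
  pattern <- paste0("\\.\\.", wildcard, "\\.\\.")
  out <- character(0)
  for (target in targets) {
    # Case-insensitive substitution, so ..DATASET.. also matches.
    expanded <- gsub(pattern, target, commands, ignore.case = TRUE)
    names(expanded) <- paste(target, names(commands), sep = "_")
    out <- c(out, expanded)
  }
  out
}

analyses <- c(
  linear = "linear_analysis(..dataset..)",
  quadratic = "quadratic_analysis(..DATASET..)"
)
expanded <- expand_wildcard(analyses, "dataset", c("normal16", "poisson32"))
```

Here expanded holds one command per dataset and analysis pair, for example normal16_linear = "linear_analysis(normal16)".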

Next, I specify how to produce general output from the summaries, etc. Since coef.csv has a file extension, it will automatically be treated as a file target.

output = commands(
  coef_table = do.call(I("rbind"), coef),
  coef.csv = write.csv(coef_table, target_name),
  mse_vector = unlist(mse)
)
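To see what the two aggregation commands do, here is a small standalone sketch that runs them on toy lists outside of remake. The coefficient values are made up for illustration, and the MSE values are copied from the recall("mse") output in this README.

```r
# Made-up coefficient summaries, keyed by target name (illustration only).
coef <- list(
  normal16_linear = c(intercept = 1.0, x = 0.5),
  poisson32_linear = c(intercept = 2.0, x = -0.25)
)
# MSE summaries, keyed the same way.
mse <- list(normal16_linear = 0.6394384, poisson32_linear = 4.991832)

# Stack the coefficient vectors into a matrix with one row per analysis.
coef_table <- do.call("rbind", coef)
# Flatten the list of MSEs into a named numeric vector.
mse_vector <- unlist(mse)
```

The row names of coef_table and the names of mse_vector are the target names, so each aggregated value stays traceable to its dataset and analysis.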

Now, we're ready to specify plots. (Here, the plot: TRUE line is automatically added to remake.yml.)

plots = commands(
  mse.pdf = hist(mse_vector, col = I("black"))
)

Finally, we can generate some reports.

reports = commands(
  markdown.md = list("poisson32", "coef_table", "coef.csv"), # dependencies
  latex.tex = TRUE # no dependencies here
#  markdown.html = render("markdown.md", quiet = TRUE, clean = FALSE),
#  latex.pdf = texi2pdf("latex.tex", clean = FALSE)
)

Since markdown.md has a .md extension, remake will automatically look for markdown.Rmd and knit it to markdown.md with the knitr package. Similarly, remake will try to build latex.tex from latex.Rnw. In each case, the command is replaced with a character vector or list of characters denoting the dependencies of the report. These could be external files or cached intermediate remake objects such as datasets or analyses. In the latter case, objects are automatically exported for use inside R code chunks as described here.

If you want to render markdown.md to markdown.html, be sure to include rmarkdown in your packages. Similarly, to compile latex.tex to latex.pdf, include the tools package. I commented out the lines to build markdown.html and latex.pdf in order to increase portability, but you may uncomment them if your copy of R is connected to copies of LaTeX and Pandoc.

Optionally, I can prepend some lines to the overarching Makefile for the workflow.

begin = c("# This is my Makefile", "# Variables...")

The stages and elements of my workflow are now planned. To put them all together, I use plan_workflow, which calls parallelRemake::write_makefile().

plan_workflow(sources, packages, datasets, analyses, summaries, output, plots, reports, begin)

Optionally, I can pass additional arguments to remake::make using the remake_args argument to plan_workflow. For example, plan_workflow(..., remake_args = list(verbose = FALSE)) is equivalent to remake::make(..., verbose = FALSE) for each target. I cannot set target_names or remake_file this way. Also, if I want to suppress the writing of the Makefile, I can call plan_workflow(..., makefile = NULL).

Running the workflow

After running the workflow.R script above, I have a remake YAML file, remake.yml, in my current working directory. To run the whole workflow with no parallel computing, simply open an R session and enter the following.

library(remake)
make(remake_file = "remake.yml")

Thanks to remake, if I change functions in code.R and then run make again, only the outdated parts of the workflow will be rebuilt.

Running workflow.R also produces a Makefile in the current working directory. Using this master Makefile and a command line program, I have several options for running the workflow with parallel computing. Here are some examples.

  • make runs the full workflow, only building results that are out of date or missing.
  • make -j <n> is the same as above with the workflow distributed over <n> parallel processes. Similarly, you can append -j <n> to any of the commands below to activate parallelism.
  • make datasets just makes the datasets.
  • make analyses just runs the analyses of all the datasets after ensuring that the datasets are up to date.
  • make summaries computes individual summaries of each analysis of each dataset.
  • make aggregates aggregates the summaries together.
  • make output makes the final output of the workflow after ensuring all the previous results are up to date.
  • make clean removes the files generated by make. If some of your files are produced by side effects, make clean might not remove them. In that case, updates to dependencies may not trigger the desired rebuilds, so you should read the next section.
  • make reset runs make clean and then removes the Makefile and all its constituent YAML files.
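The same Makefile targets can also be launched from an R session. The sketch below wraps system2 in a hypothetical helper, run_make, which is not part of workflowHelper. It assumes make is on the PATH and that the Makefile sits in the current working directory, and it does nothing if either is missing.

```r
# Hypothetical convenience wrapper around the command-line calls listed above.
run_make <- function(target = NULL, jobs = 1) {
  # Bail out quietly if there is nothing to run or no make executable.
  if (!file.exists("Makefile") || !nzchar(Sys.which("make"))) {
    message("Makefile or make not found; nothing to run.")
    return(invisible(FALSE))
  }
  # Build the argument vector: make -j <jobs> [target]
  args <- c("-j", as.character(jobs), target)
  status <- system2("make", args)
  # make exits with status 0 on success.
  invisible(status == 0)
}
```

For example, run_make("datasets", jobs = 2) corresponds to make -j 2 datasets on the command line.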

Manual access to intermediate objects for debugging and testing

Intermediate objects such as datasets, analyses, and summaries are maintained in remake's hidden storr cache. At any point in the workflow, you can reload them using recall and check the available ones using recallable. Let's go back to the example. First, I check to see the names of the objects I can reload.

> recallable()
 [1] "coef"                     "coef_table"              
 [3] "mse"                      "mse_vector"              
 [5] "normal16"                 "normal16_linear"         
 [7] "normal16_linear_coef"     "normal16_linear_mse"     
 [9] "normal16_quadratic"       "normal16_quadratic_coef" 
[11] "normal16_quadratic_mse"   "poisson32"               
[13] "poisson32_linear"         "poisson32_linear_coef"   
[15] "poisson32_linear_mse"     "poisson32_quadratic"     
[17] "poisson32_quadratic_coef" "poisson32_quadratic_mse" 
[19] "poisson64"                "poisson64_linear"        
[21] "poisson64_linear_coef"    "poisson64_linear_mse"    
[23] "poisson64_quadratic"      "poisson64_quadratic_coef"
[25] "poisson64_quadratic_mse" 
> 

Then if I want to load mse, the list of summaries generated by mse_summary in code.R, I simply use recall.

> recall("mse")
$normal16_linear
[1] 0.6394384

$normal16_quadratic
[1] 0.6394384

$poisson32_linear
[1] 4.991832

$poisson32_quadratic
[1] 4.991832

$poisson64_linear
[1] 3.613922

$poisson64_quadratic
[1] 3.613922

> 

Important: for serious jobs, do not manually access the files inside .remake/objects, and treat recall() and recallable() as tools for debugging and testing only. Anything you change this way is not tracked by remake and thus not reproducible.

High-performance computing

If you want to run make -j to distribute tasks over multiple nodes of a Slurm cluster, refer to the Makefile in this post and write

write_makefile(..., 
  begin = c(
    "SHELL=srun",
    ".SHELLFLAGS= <ARGS> bash -c"))

in an R session, where <ARGS> stands for additional arguments to srun. Then, once the Makefile is generated, you can run the workflow with nohup make -j [N] & in the command line, where [N] is the number of simultaneous tasks. For other task managers such as PBS, such an approach may not be possible. Regardless of the system, be sure that all nodes point to the same working directory so that they share the same .remake storr cache.

Use with the downsize package

You may want to use the downsize package within your custom R source code. That way, you can run a quick scaled-down version of your workflow for debugging and testing before you run the full workload. In the example, just include downsize in packages inside workflow.R and replace the top few lines of code.R with the following.

library(downsize)
scale_down()

normal_dataset = function(n = 16){
  ds(data.frame(x = rnorm(n, 1), y = rnorm(n, 5)), nrow = 4)
}

poisson_dataset = function(n = 16){
  ds(data.frame(x = rpois(n, 1), y = rpois(n, 5)), nrow = 4)
}

The call scale_down() sets the downsize option to TRUE, which is a signal to the ds function. The command ds(A, ...) says "Downsize A to a smaller object when getOption("downsize") is TRUE". For the full scaled-up workflow, just delete the first two lines or replace scale_down() with scale_up(). Unfortunately, remake does not rebuild things when options are changed, so you'll have to run make clean whenever you change the downsize option.
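The option-driven switch can be illustrated with base R alone. The function ds_sketch below is a hypothetical stand-in that keeps the first few rows of a data frame when the downsize option is TRUE. The real ds function in the downsize package may behave differently and accepts more arguments.

```r
# Hypothetical stand-in for downsize::ds(); illustrates the option check only.
ds_sketch <- function(x, nrow) {
  # Shrink the object only when the global "downsize" option is TRUE.
  if (isTRUE(getOption("downsize"))) head(x, nrow) else x
}

old <- options(downsize = TRUE)   # analogous to scale_down()
small <- ds_sketch(data.frame(x = rnorm(16, 1), y = rnorm(16, 5)), nrow = 4)

options(downsize = FALSE)         # analogous to scale_up()
full <- ds_sketch(data.frame(x = rnorm(16, 1), y = rnorm(16, 5)), nrow = 4)

options(old)                      # restore the previous setting
```

In this sketch, small has 4 rows and full keeps all 16. Because the switch lives in a global option rather than in the code or its inputs, remake cannot detect the change, which is why a make clean is needed after toggling it.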