Metaprogramming Difficulties (vignette potentially helpful) #451

Closed
cfhammill opened this issue Jul 3, 2018 · 9 comments

@cfhammill

Hi drake devs,

I'm really interested in using drake for some of my work. The lab I work in has a pipelining package of our own (https://github.com/Mouse-Imaging-Centre/pydpiper) and a co-worker and I tried briefly to adapt some of this to drake. The bulk of our pipelining software works by running shell commands against tagged input/output files. The package builds a dependency graph using these files and runs them on a variety of clusters. So we sat down to implement this in drake:

step <- function(cmd # A glue pattern that uses {inputs} and {outputs}
               , inputs = NULL
               , outputs = NULL){
  # The data list and the pattern are separate arguments to glue_data()
  glue_data(list(inputs = file_in(inputs)
               , outputs = file_out(outputs))
          , cmd) %>%
    system()
}

A simple example:

file_proc <-
  drake_plan(
    step("echo hi > {outputs}", outputs = "test1.txt")
  , step("cat {inputs} {inputs} > {outputs}", "test1.txt", "test2.txt")
  )

This produces:

[image]

Putting the commands in manually as

file_proc <-
  drake_plan(
    glue_data(list(outputs = file_out("test1.txt"))
            , "echo hi > {outputs}") %>%
      system()
  , glue_data(list(inputs = file_in("test1.txt"), outputs = file_out("test2.txt"))
             ,"cat {inputs} {inputs} > {outputs}") %>%
      system()
  )

gives the correct result:

[image]

Additionally, having file_in() and file_out() at the top level of drake_plan() would fix this, but then filenames couldn't be generated within the function. This seems to me like a problem that could well be solved by rlang-style metaprogramming, but my few obvious attempts failed, for example having step() produce a quo or expr and then evaluating it.

I think having an example vignette demonstrating how to write abstract pipeline code would be quite helpful.

@wlandau
Member

wlandau commented Jul 3, 2018

Glad to hear you are giving drake a shot.

This behavior is actually expected. drake uses static code analysis to detect dependencies, and it would be extremely difficult to predict the values of arguments you have not yet passed to step(). I believe what you are trying to achieve is a templating mechanism to generate large pipelines. drake does have built-in wildcard templating functionality, and although it may be unfamiliar to new users, I do recommend it. I believe you already saw the second and third chapters of the separate online manual, and I plan to make these points clearer there (ropensci-books/drake#13). In addition, drake does support tidy evaluation in drake_plan() (some examples here, ref: #200, #202).
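
To make the static-analysis point concrete, here is a minimal base R sketch. It is not drake's actual implementation; `find_file_literals()` is a hypothetical helper that walks an unevaluated expression and collects string literals passed directly to file_in() or file_out(). Inside the body of step() those arguments are just the symbols `inputs` and `outputs`, so nothing can be found without running the code.

```r
# Hypothetical helper (not part of drake): collect string literals that
# appear directly as arguments of file_in() / file_out() calls.
find_file_literals <- function(e) {
  found <- character(0)
  if (is.call(e)) {
    fn <- e[[1]]
    if (is.name(fn) && as.character(fn) %in% c("file_in", "file_out")) {
      literals <- unlist(Filter(is.character, as.list(e)[-1]))
      if (!is.null(literals)) found <- unname(literals)
    }
    # Recurse into every part of the call.
    for (part in as.list(e)) {
      found <- c(found, find_file_literals(part))
    }
  }
  found
}

# A command that goes through the wrapper: the file names are ordinary
# arguments to step(), not arguments to file_in() / file_out().
wrapped <- quote(step("cat {inputs} > {outputs}", "test1.txt", "test2.txt"))

# The wrapper body only mentions the symbols `inputs` and `outputs`.
step_body <- quote(
  glue_data(list(inputs = file_in(inputs), outputs = file_out(outputs)), cmd)
)

# The hand-written command has the literals in plain sight.
inlined <- quote(
  system(glue_data(
    list(inputs = file_in("test1.txt"), outputs = file_out("test2.txt")),
    "cat {inputs} > {outputs}"
  ))
)

find_file_literals(wrapped)    # nothing found
find_file_literals(step_body)  # nothing found
find_file_literals(inlined)    # finds both file names
```

The sketch deliberately ignores vectors such as file_in(c("a", "b")); it only illustrates why the literal file names must be visible in the command itself.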

Also, drake was designed to focus on your R session rather than external files. Yes, it does support file inputs and file outputs, but the more you move away from file_in() and file_out() and use non-file targets and rely on the built-in caching system, the more mileage you will get out of drake. I realize that much of your existing work is in Python and shell scripts, so a total focus on R and R objects may not be realistic.

@cfhammill
Author

Thanks @wlandau, turns out the following worked to solve my immediate problem:

step <- function(cmd # A glue pattern that uses {inputs} and {outputs}
               , inputs = NULL
               , outputs = NULL){
  quo({
    glue_data(list(inputs = file_in(!!!inputs)
                 , outputs = file_out(!!!outputs))
            , !!cmd) %>% # unquote cmd so the pattern is inlined in the command
      system()
  })
}

file_proc <-
  drake_plan(
    !!! step("echo hi > {outputs}", outputs = "test1.txt")
  , !!! step("cat {inputs} {inputs} > {outputs}", "test1.txt", "test2.txt")
  )

vis_drake_graph(drake_config(file_proc))

[image: output of vis_drake_graph()]

So the trick for people who want to programmatically write expressions for their plans is:

  1. produce a quo/expr from your function
  2. use unquote-splice (!!!/UQS) to insert it into your drake_plan literally.
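
The two steps can be sketched with rlang alone, without drake. This is an illustration under assumptions: `make_step()` is a hypothetical name, and `exprs()` stands in for whatever drake_plan() does when it captures its arguments. The point is only that the file names end up as literals inside the captured expression.

```r
library(rlang)

# Hypothetical helper: step 1, return an unevaluated command with the
# file names and the glue pattern inlined via `!!`.
make_step <- function(cmd, inputs = NULL, outputs = NULL) {
  expr(
    system(glue_data(
      list(inputs = file_in(!!inputs), outputs = file_out(!!outputs)),
      !!cmd
    ))
  )
}

cmd <- make_step("cat {inputs} {inputs} > {outputs}", "test1.txt", "test2.txt")
expr_text(cmd)  # the file names now appear literally in the code

# Step 2: splice a named list of such expressions into a capturing function.
steps <- list(
  target_1 = make_step("echo hi > {outputs}", outputs = "test1.txt"),
  target_2 = cmd
)
captured <- exprs(!!!steps)
names(captured)
```

Nothing is executed here: file_in(), file_out() and glue_data() are only mentioned inside quoted code, so the sketch runs with rlang as its single dependency.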

@wlandau
Member

wlandau commented Jul 30, 2018

Nice! Also keep in mind that you don't have to use the drake_plan() function. In fact, since you are this serious about sophisticated metaprogramming, I think you are better off creating the plan on your own.

library(drake)
library(glue)
library(rlang)
library(tidyverse)

write_command <- function(cmd, inputs = NULL , outputs = NULL){
  inputs <- enexpr(inputs)
  outputs <- enexpr(outputs)
  expr({
    glue_data(
      list(
        inputs = file_in(!!inputs),
        outputs = file_out(!!outputs)
      ),
      !!cmd
    ) %>%
      system
  }) %>%
    expr_text
}

meta_plan <- tribble(
  ~cmd, ~inputs, ~outputs,
  "echo hi > {outputs}", NULL, "test1.txt",
  "cat {inputs} {inputs} > {outputs}", c("x", "y", "test1.txt"), "test2.txt"
)

plan <- tibble(
  target = paste0("target_", seq_len(nrow(meta_plan))),
  command = pmap_chr(meta_plan, write_command)
) %>%
  print
#> # A tibble: 2 x 2
#>   target   command                                                        
#>   <chr>    <chr>                                                          
#> 1 target_1 "{\n    glue_data(list(inputs = file_in(NULL), outputs = file_…
#> 2 target_2 "{\n    glue_data(list(inputs = file_in(c(\"x\", \"y\", \"test…
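
For comparison, a dependency-free sketch of the same idea in base R. It assumes, as the printed tibble above suggests, that a plan is an ordinary data frame with character columns `target` and `command`; `shell_command()` is a hypothetical helper, and the generated command text calls glue::glue_data() by its qualified name as a choice made for this sketch.

```r
# Hypothetical helper: build the command text with sprintf(), using
# deparse() to turn each value into the R code that recreates it.
shell_command <- function(cmd, input, output) {
  sprintf(
    "system(glue::glue_data(list(inputs = file_in(%s), outputs = file_out(%s)), %s))",
    deparse(input), deparse(output), deparse(cmd)
  )
}

plan <- data.frame(
  target = c("target_1", "target_2"),
  command = c(
    shell_command("echo hi > {outputs}", NULL, "test1.txt"),
    shell_command("cat {inputs} {inputs} > {outputs}", "test1.txt", "test2.txt")
  ),
  stringsAsFactors = FALSE
)

# Sanity check: every generated command is syntactically valid R.
invisible(lapply(plan$command, function(x) parse(text = x)))
plan$command
```

String templating is cruder than building calls with expr() or as.call(), but it shows how little machinery the plan itself requires.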

@cfhammill
Author

That's awesome, I didn't realize I could write the plan frame outside of drake_plan! Thanks @wlandau.

@wlandau
Member

wlandau commented Jul 30, 2018

Thank you for bringing up this useful scenario. I added some new writing to the manual to recap our discussion.

@cfhammill
Author

Looks great!

@wlandau
Member

wlandau commented Aug 3, 2018

Alternatively, you could use as.call() to create calls to file_in(), file_out(), and your own custom functions. I will make another note in the manual.

library(drake)
library(glue)
library(purrr) # pmap_chr() is particularly useful here.
library(rlang)
library(tidyverse)

# A function that will be called in your commands.
command_function <- function(cmd, inputs, outputs){
  glue_data(
    list(
      inputs = inputs,
      outputs = outputs
    ),
    cmd
  ) %>%
    purrr::walk(system)
}

# A function to generate quoted calls to command_function(),
# which in turn contain quoted calls to file_in() and file_out().
write_command <- function(...){
  args <- list(...)
  args$inputs <- as.call(list(quote(file_in), args$inputs))
  args$outputs <- as.call(list(quote(file_out), args$outputs))
  c(quote(command_function), args) %>%
    as.call() %>%
    rlang::expr_text()
}

# Stuff we feed into the commands, with one row per command.
meta_plan <- tribble(
  ~cmd, ~inputs, ~outputs,
  "cat {inputs} > {outputs}", c("in1.txt", "in2.txt"), c("out1.txt", "out2.txt"),
  "cat {inputs} {inputs} > {outputs}", c("out1.txt", "out2.txt"), c("out3.txt", "out4.txt")
)

plan <- tibble(
  target = paste0("target_", seq_len(nrow(meta_plan))),
  command = pmap_chr(meta_plan, write_command)
) %>%
  print
#> # A tibble: 2 x 2
#>   target   command                                                        
#>   <chr>    <chr>                                                          
#> 1 target_1 "command_function(cmd = \"cat {inputs} > {outputs}\", inputs =…
#> 2 target_2 "command_function(cmd = \"cat {inputs} {inputs} > {outputs}\",…

writeLines("in1", "in1.txt")
writeLines("in2", "in2.txt")
make(plan)
#> target target_1
#> target target_2

Created on 2018-08-03 by the reprex package (v0.2.0).

@wlandau
Member

wlandau commented Jan 26, 2019

Update: development drake now has a much friendlier (experimental) API. Details: https://ropenscilabs.github.io/drake-manual/plans.html#create-large-plans-the-easy-way

@wlandau
Member

wlandau commented Feb 22, 2020

Dynamic files (#1178) may allow you to avoid this sort of metaprogramming.
