Wildcard alternative to gather/reduce_plan #376 (Closed)

tmastny opened this issue May 5, 2018 · 27 comments

@tmastny (Contributor) commented May 5, 2018

Introduction

I recently gave a small presentation on drake to some fellow graduate students, and everyone was impressed by how intuitive it is to specify dependencies and a workflow, especially compared to a file-centric Make-like tool such as Snakemake. In many ways, the R source code is the documentation of the workflow.

However, based on my own experience and the feedback from my presentation, I feel that gather_plan is unintuitive and verbose, especially compared to the Snakemake alternatives.

I'd like to show an example of the complexity of gather_plan and how Snakemake handles similar situations.

Example

library(caret)
library(recipes)
library(drake)

accuracy <- function(model, data) {
  value <- list()
  # confusion matrix of true species vs. model predictions
  t <- table(data$Species, predict(model))
  value[[model$method]] <- sum(diag(t)) / sum(t)
  value
}

standardizer <- function(data) {
  # center and scale all predictors with a recipes preprocessing pipeline
  rec <- recipe(Species ~ ., data) %>%
    step_center(all_predictors()) %>%
    step_scale(all_predictors()) %>%
    prep(data)

  bake(rec, data)
}

plan <- drake_plan(
 zscored = standardizer(iris),
 model = train(Species ~ ., data = zscored, method = method__),
 results = accuracy(model_method__, zscored)
)

This starts off nicely, but now we'd like to gather all the results_method__ targets into a convenient list holding the results for each method__.

One straightforward but hacky way to do this is with ls():

hacky_plan <- drake_plan(
 zscored = standardizer(iris),
 model = train(Species ~ ., data = zscored, method = method__),
 results = accuracy(model_method__, zscored),
 all_results = purrr::flatten(mget(ls(pattern = "results_*")))
)

but that doesn't declare dependencies on the results_method__ targets, so drake may build all_results before they exist, or fail to update it when they change.
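
A quick way to see the problem (this assumes a drake version that exports deps_code(); older versions called it deps()): drake's static code analysis finds no targets in that command.

library(drake)

# Static analysis of the hacky command: drake sees the calls to purrr,
# mget(), and ls(), but no results_* targets, so no dependency edges
# are created for all_results.
deps_code(quote(purrr::flatten(mget(ls(pattern = "results_")))))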

The gather_plan() approach works, but it is much more verbose and confusing. First you need to evaluate the plan:

plan <- evaluate_plan(
  plan,
  rules = list(method__ = c("rf", "glmnet"))
)

And then gather, but only on the results_method__ subset.

gathered_subset <- gather_plan(plan[4:5, ])  # gather only the results_* rows

plan[4, ] <- gathered_subset  # overwrite results_rf with the gathered target
plan <- plan[-5, ]            # drop the now-redundant results_glmnet row
plan
#> # A tibble: 4 x 2
#>   target       command                                                    
#>   <chr>        <chr>                                                      
#> 1 zscored      standardizer(iris)                                         
#> 2 model_rf     train(Species ~ ., data = zscored, method = 'rf')          
#> 3 model_glmnet train(Species ~ ., data = zscored, method = 'glmnet')      
#> 4 target       list(results_rf = results_rf, results_glmnet = results_glm…

Alternate Route

I realized I could have created a separate plan tibble for results_method__, but that seems even more convoluted:

modeling_plan <- drake_plan(
 zscored = standardizer(iris),
 model = train(Species ~ ., data = zscored, method = "method__")
)

results_plan <- drake_plan(
  results = accuracy(model_method__, zscored)
)

evaled_model <- evaluate_plan(
  modeling_plan,
  rules = list(method__ = c("rf", "glmnet"))
)

evaled_results <- evaluate_plan(
  results_plan,
  rules = list(method__ = c("rf", "glmnet"))
)

gathered_results <- gather_plan(evaled_results)

plan <- bind_rows(evaled_model, gathered_results)
plan
#> # A tibble: 4 x 2
#>   target       command                                                    
#>   <chr>        <chr>                                                      
#> 1 zscored      standardizer(iris)                                         
#> 2 model_rf     train(Species ~ ., data = zscored, method = 'rf')          
#> 3 model_glmnet train(Species ~ ., data = zscored, method = 'glmnet')      
#> 4 target       list(results_rf = results_rf, results_glmnet = results_glm…

This isn't any better in my opinion. I had to evaluate twice, and I lose the self-documenting readable source code specified in the original drake_plan(...).

Problems with gather_plan

As demonstrated in the previous example, I think these are the main problems:

  1. You have to evaluate_plan before you gather. This is confusing to me, because evaluating sounds like it should be the final step in writing a plan. Alternatively, it would help if you could gather on wildcards somehow.

  2. The gathering data frame must contain only the rows you wish to combine. This means you either have to split and recombine the plan, or start with separate plans and call evaluate_plan() multiple times.

Snakemake

By default, drake's wildcard system behaves like this Snakemake workflow:

files = ['hello', 'world']

rule all:
    input:
        expand('output_{file}.txt', file=files)

rule append:
    input:
        'text/{file}.txt'
    output:
        'output_{file}.txt'
    shell:
        'cat {input} > {output}'

which executes

cat text/hello.txt > output_hello.txt
cat text/world.txt > output_world.txt

But that isn't what we want. We want

cat text/hello.txt text/world.txt > output.txt

Snakemake makes this pretty simple:

files = ['hello', 'world']

rule append:
    input:
        expand('text/{file}.txt', file=files)
    output:
        'output.txt'
    shell:
        'cat {input} > {output}'

Compared to drake, gathering over wildcards in Snakemake is much easier.

Suggestion

I'd like there to be a new function or special wildcard keyword that you could use in the original plan to gather certain wildcards.

For example:

plan <- drake_plan(
 zscored = standardizer(iris),
 model = train(Species ~ ., data = zscored, method = method__),
 results = accuracy(model_method__, zscored),
 output = new_gather(results_method__)
)

or

plan <- drake_plan(
 zscored = standardizer(iris),
 model = train(Species ~ ., data = zscored, method = method__),
 results = accuracy(model_method__, zscored),
 output = list(results_method___gather__)
)
@wlandau (Collaborator) commented May 5, 2018

So happy to hear you're spreading the word!

I agree with the pros and cons you detail about wildcards in drake vs. Snakemake. In many ways, Snakemake is the more sophisticated tool, which should be expected because it also has a huge head start: I can find papers on Snakemake from as early as 2012, whereas drake was first released in 2017. We'll get there.

The problems with wildcards, separate plans, evaluation, and awkward gathering are part of what I think @krlmlr was trying to solve with #233 and #304. Packing, unpacking, and delayed evaluation of placeholders should allow everything to happen in a single drake_plan() without the explosion and fragmentation of targets early on. So I may close this as a duplicate of #233, but I need to think carefully about it first. Right now, I am immersed in wrapping up #369, which I have been trying to solve for 3 months.

Another thing: do you happen to know if Snakemake has reduce_plan()-like functionality? In drake's case, especially with the work so far on #369, I expect the parallelized pairwise reductions afforded by reduce_plan() to cut gathering times from O(n) to potentially as low as O(log(n)). Here, drake's memory efficiency should give it an advantage over gather_plan(). I feel as though I could learn more about efficiency from Snakemake's approach to gathering.
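
For readers unfamiliar with reduce_plan(), here is a small sketch of the pairwise idea (hedged: it assumes the pairwise argument as it exists in newer drake versions):

library(drake)
library(magrittr)

plan <- drake_plan(x = simulate(n__)) %>%
  evaluate_plan(wildcard = "n__", values = 1:4)

# pairwise = TRUE builds a balanced tree of intermediate targets
# (roughly x_1 + x_2 and x_3 + x_4 first, then their sum), so the
# independent reductions at each level can run in parallel.
reduce_plan(plan, target = "total", op = " + ", pairwise = TRUE)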

@wlandau (Collaborator) commented May 5, 2018

It will be a bit tricky, but I do think we could implement a way to gather/reduce wildcards in-place in a single drake_plan(). You would still have to evaluate the plan, but only once. Is the following what you are picturing?

drake_plan(
  other = stuff(),
  target = read_table(file_in("file_input_.txt")),
  final = gather_targets(target_input_)
) %>%
  evaluate_plan(wildcard = "input_", values = 1:2)
#> # A tibble: 4 x 2
#>   target   command                                       
#>   <chr>    <chr>                                         
#> 1 other    stuff()                                       
#> 2 target_1 "read_table(file_in(\"file_1.txt\"))"         
#> 3 target_2 "read_table(file_in(\"file_2.txt\"))"         
#> 4 final    list(target_1 = target_1, target_2 = target_2)

At some point I need to get back to work on the wildcard package, add everything from drake, and include this (ref: #240).

@tmastny (Contributor, Author) commented May 7, 2018

Yes, gather_targets is exactly the type of command I imagined.

As for a Snakemake reduce, I do not think it exists. I'm not a Snakemake expert, but here is how I understand it: Snakemake allows a dynamic number of files, but not a dynamic number of rules. The problem with a pairwise reduce is that you need to generate new rules to accommodate lists of any length.

@tmastny (Contributor, Author) commented May 11, 2018

Would you be open to a pull request on this issue? This change would benefit my package leadr and would help with my modeling work.

If you have a suggestion where to start, I'd be happy to dig into the source code.

@wlandau (Collaborator) commented May 11, 2018

Absolutely, I would love that! Yes please! There are more and deeper issues in drake than I can address myself in a timely manner. The file R/generate.R has all the wildcard code, and tests/testthat/test-generate.R should have all the tests.

@violetcereza (Contributor) commented

As an interim workaround, here's a slightly condensed way to gather plans that I have been using. It also allows for more flexible syntax with the gathering function (additional & named arguments, etc).

library(drake)

plan_data <- drake_plan(
  data = extract_data("file_file__.txt")
) %>%
  evaluate_plan(wildcard = "file__", values = 1:10) %>%
  bind_plans(drake_plan(
    data_gathered = bind_rows(!!!rlang::syms(.$target))
  ))

You can use the triple bang (!!!) to fill a function's ..., or the double bang (!!) to splice in a single argument like list(data_1, data_2, etc.). You will still need to keep plan_data separate from other plans that you don't want to gather, and then bind_plans() everything before making.
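
To see what the splice does outside of drake, here is a standalone rlang illustration (not part of the workaround itself):

library(rlang)

targets <- c("data_1", "data_2", "data_3")

# !!! splices the list of symbols into the call as separate arguments.
expr(bind_rows(!!!syms(targets)))
#> bind_rows(data_1, data_2, data_3)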

@wlandau (Collaborator) commented Jun 19, 2018

I think the crux of this issue is keeping track of which wildcards originally corresponded to which values after the plan has been expanded. We could store attributes in the drake_plan(), which is what dplyr does in group_by().
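
A minimal sketch of that idea (the attribute name and usage here are hypothetical, not drake's API):

library(drake)

plan <- drake_plan(model = train(data, method = method__))
plan <- evaluate_plan(plan, rules = list(method__ = c("rf", "glmnet")))

# Hypothetical: stash the expansion on the plan itself, the way
# dplyr::group_by() stores grouping structure as an attribute.
attr(plan, "wildcards") <- list(method__ = c("rf", "glmnet"))
attr(plan, "wildcards")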

@wlandau (Collaborator) commented Jul 5, 2018

I'm glad I took a step back and let this one simmer for a while. Rather than add more code analysis magic to gather specific subcollections of targets, I think it is much simpler and more flexible to let evaluate_plan() column-bind indicators to show how the wildcards were evaluated. Here is a sketch (not implemented yet).

drake_plan(
  x = method__(n__),
  y = rt(1000, df = 10)
) %>%
  evaluate_plan(
    indicators = TRUE,
    rules = list(
      method__ = c("rnorm", "rexp"),
      n__ = c(8, 16)
    )
  )
#> # A tibble: 5 x 4
#>   target     command           method__   n__
#>   <chr>      <chr>             <chr>    <dbl>
#> 1 x_rnorm_8  rnorm(8)          rnorm        8
#> 2 x_rnorm_16 rnorm(16)         rnorm       16
#> 3 x_rexp_8   rexp(8)           rexp         8
#> 4 x_rexp_16  rexp(16)          rexp        16
#> 5 y          rt(1000, df = 10) NA          NA

Then, it would be easy to do custom filtering before you call gather_plan() or reduce_plan(). This approach is much more consistent with drake's use of the data frame as the first-class citizen of workflow configuration, and it even opens up a better solution to #235. cc @AlexAxthelm, @jw5.
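
For instance, the custom filtering could look like this (hypothetical usage; evaluated stands for the indicator-column plan sketched above):

library(dplyr)
library(drake)

# Gather only the rexp() targets, using the method__ indicator column.
evaluated %>%
  filter(method__ == "rexp") %>%
  gather_plan(target = "exponential_samples")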

@wlandau (Collaborator) commented Jul 6, 2018

Also related: #229. If we put the wildcard attributes in the graph nodes, we may be able to expand/collapse the visNetwork graph according to wildcard.

@wlandau (Collaborator) commented Jul 6, 2018

Just so I don't forget: we should probably define an attribute in the plan to keep track of which columns are wildcard indicators. Otherwise, drake will complain about non-standard columns.

@tmastny (Contributor, Author) commented Aug 28, 2018

Thanks for the PR. I definitely think adding the trace column is a step in the right direction, but as you mentioned, it doesn't exactly solve my problem.

In the following example, the NAs don't help at all, because my wildcards propagate to later targets.

library(drake)
library(dplyr)

plan <- drake_plan(
 zscored = standardizer(iris),
 model = train(Species ~ ., data = zscored, method = method__),
 results = accuracy(model_method__, zscored)
)

evaled <- evaluate_plan(
  plan,
  rules = list(method__ = c("rf", "glmnet")),
  trace = TRUE
)

gathered <- evaled %>%
  filter(stringr::str_detect(target, "results_")) %>%
  gather_plan(target = "results")

evaled %>%
  filter(!stringr::str_detect(target, "results_")) %>%
  bind_plans(gathered)

The real difficulty is partitioning the plan into gathering targets and everything else, and then bringing them back together.

I think the tradeoff of adding more wildcard rules is worth it compared to this complicated workflow. For example, these are the minimum steps to get a working plan:

  • evaluate the wildcards
  • partition the plan into two sections: the targets you want to gather, and everything else
  • gather
  • bind the gathered plan back to the rest of the partition

As I mentioned before, the great thing about drake (and Make and Snakemake) is its self-documenting nature. You can read and share the plan to see what it does. A 4-5 step procedure to properly evaluate wildcards really harms the readability of the plan in a way that Snakemake (or even Make) never has to deal with.

Thanks for all the work. I realize how large and complicated this project is, and this might not be deemed important. Not a problem at all, I just wanted to articulate the UX problem I see with this workflow.

@wlandau (Collaborator) commented Aug 28, 2018

Why drake is the way it is

When I originally designed the UI, I was reacting to Make-like workflows whose dependency structures were too complicated for the available wildcard functionality. I found myself writing code to generate Makefiles, an exercise full of friction and frustration. I came to the belief that a Makefile should really be an ordinary data frame, and we should be able to clean it, munge it, expand it, and manipulate it just like any other dataset in the tidyverse. The name drake is actually an acronym of "data frames in R for Make" (the first parallel backend was make(parallelism = "Makefile")).

The focus on expanded/evaluated data frames opens up possibilities beyond the inevitably restrictive drake_plan() function and wildcard interface. I recently added a section to the manual on custom metaprogramming to explain some of the opportunities. Related: #451.

Revisiting the original issue

To improve gathering, we would need some extra static code analysis so we can detect gather_targets() and extract its arguments. My code analysis skills have improved since I last attempted your version of the solution, so I think I see the right approach more clearly. Let's say we want to write

drake_plan(
 zscored = standardizer(iris),
 model = train(Species ~ ., data = zscored, method = method__),
 results = accuracy(model_method__, zscored),
 all_results = list(gather_targets("results_*"))
)

Here is what it would take.

  1. In code_dependencies() and supporting functions, detect calls to gather_targets() and capture the arguments. We need new functions is_gather_targets_call() and analyze_gather_targets_call(), analogous to the existing is_file_in_call() and analyze_file_in().
  2. Add another function to process the call to gather_targets().
drake_plan(
  zscored = standardizer(iris),
  model = train(Species ~ ., data = zscored, method = method__),
  results = accuracy(model_method__, zscored),
  all_results = list(gather_targets("results_*"))
) %>%
  evaluate_plan(rules = list(method__ = c("rf", "glmnet"))) %>%
  process_gather_targets() # needs a better name, and gather_plan() is taken.
#> # A tibble: 6 x 2
#>   target         command                                            
#>   <chr>          <chr>                                              
#> 1 zscored        standardizer(iris)                                 
#> 2 model_rf       train(Species ~ ., data = zscored, method = rf)    
#> 3 model_glmnet   train(Species ~ ., data = zscored, method = glmnet)
#> 4 results_rf     accuracy(model_rf, zscored)                        
#> 5 results_glmnet accuracy(model_glmnet, zscored)                    
#> 6 all_results    "list(results_rf = results_rf, results_glmnet = results_glmnet)"

Issues:

  1. I am not sure we should be supplying grep patterns to gather_targets(). What about tidyselect? That would allow commands like list(gather_targets(starts_with("results_"))) (see the sketch after this list).
  2. process_gather_targets() needs a better name, and gather_plan() is already taken. So is evaluate_plan(). Reusing the terms "evaluate" and "gather" would create ambiguity.
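
To make the tidyselect idea concrete, here is how a selection could be resolved against target names (a standalone sketch assuming tidyselect's eval_select(); gather_targets() itself is still only a proposal):

library(rlang)
library(tidyselect)

targets <- c("zscored", "model_rf", "model_glmnet", "results_rf", "results_glmnet")

# Resolve a tidyselect specification against the vector of target names.
chosen <- eval_select(
  expr(starts_with("results_")),
  set_names(as.list(targets), targets)
)
names(chosen)
#> [1] "results_rf"     "results_glmnet"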

Specific comments

As I mentioned before, the great thing about drake (and Make and Snakemake) is its self-documenting nature. You can read and share the plan to see what it does. A 4-5 step procedure to properly evaluate wildcards really harms the readability of the plan in a way that Snakemake (or even Make) never has to deal with.

A fair point. But in drake's defense, those 4-5 steps usually comprise far less code than the hundreds or thousands of targets in a complicated plan. And because you actually have the fully expanded plan at the end of all that, you can see exactly what the wildcards are doing. In Make, I usually have trouble remembering how to use those special shell-like arrays and automatic variables ($@, $<, $^, $+, $%, and $*), and I have to actually run make to check if I configured things correctly. (In Snakemake, do we get to see the wildcard expansion before we run things?)

In the following example, the NAs don't help at all, because my wildcards propagate to later targets.

True, but you can add your own columns if you like.

library(drake)
library(tidyverse)
plan <- drake_plan(
  zscored = standardizer(iris),
  model = train(Species ~ ., data = zscored, method = method__),
  results = accuracy(model_method__, zscored)
)
plan$group <- plan$target
evaluate_plan(
  plan,
  rules = list(method__ = c("rf", "glmnet")), 
  trace = TRUE
) %>%
  filter(group == "results") %>%
  gather_plan(target = "results")
#> # A tibble: 1 x 2
#>   target  command                                                       
#>   <chr>   <chr>                                                         
#> 1 results list(results_rf = results_rf, results_glmnet = results_glmnet)

wlandau reopened this Aug 28, 2018

@tmastny (Contributor, Author) commented Aug 29, 2018

Thanks for the response. Drake is amazing, and I really appreciate the effort you put into issues.

"Data frames as Makefiles" gives me an idea. The standard data frame strategy is split-apply-combine. If we view our current process in that framework, the workflow is inefficient because we have to manually recombine the partition we created in order to gather.

What if we had something like the following:

drake_plan(
  zscored = standardizer(iris),
  model = train(Species ~ ., data = zscored, method = method__),
  results = accuracy(model_method__, zscored)
) %>%
  evaluate_plan(rules = list(method__ = c("rf", "glmnet")), target_trace = TRUE) %>%
  nest(-target_trace) %>%
  nested_gather("results", "results_list")
# # A tibble: 6 x 2
#   target         command
#   <chr>          <chr>
# 1 zscored        standardizer(iris)
# 2 model_rf       train(Species ~ ., data = zscored, method = rf)
# 3 model_glmnet   train(Species ~ ., data = zscored, method = glmnet)
# 4 results_rf     accuracy(model_rf, zscored)
# 5 results_glmnet accuracy(model_glmnet, zscored)
# 6 results_list   list(results_rf = results_rf, results_glmnet = results_glmnet)

Then evaluate_plan(..., target_trace = TRUE) would look like this:

# A tibble: 5 x 3
  target         command                                             target_trace
  <chr>          <chr>                                               <chr>
1 zscored        standardizer(iris)                                  NA
2 model_rf       train(Species ~ ., data = zscored, method = rf)     model
3 model_glmnet   train(Species ~ ., data = zscored, method = glmnet) model
4 results_rf     accuracy(model_rf, zscored)                         results
5 results_glmnet accuracy(model_glmnet, zscored)                     results

And then nested_gather might look something like this (work in progress, probably should use tidy eval and pass parameters to gather_plan and many other things):

nested_gather <- function(.data, to_nest, to_gather) {
  # gather the plan rows nested under the `to_nest` trace group
  gathered <- .data[which(.data$target_trace == to_nest), "data"] %>%
    .[[1]] %>%
    .[[1]] %>%
    gather_plan(target = to_gather)

  # append the gathered plan as a new nested row, then flatten back out
  .data[nrow(.data) + 1, ] <- list(to_gather, list(tibble()))
  .data[nrow(.data), "data"][[1]][[1]] <- gathered
  .data %>%
    unnest() %>%
    select(-target_trace)
}

And so we actually have a working example, if we manually add in the target_trace:

plan <- drake_plan(
 zscored = standardizer(iris),
 model = train(Species ~ ., data = zscored, method = method__),
 results = accuracy(model_method__, zscored)
)
evaled <- evaluate_plan(plan, rules = list(method__ = c("rf", "glmnet")))
evaled[["target_trace"]] <- c(NA, "model", "model", "results", "results")
evaled %>%
  nest(-target_trace) %>% 
  nested_gather("results", "results_list")

@wlandau (Collaborator) commented Sep 7, 2018

Sorry it took me so long to return to this thread.

If I understand correctly, these are the pieces:

  1. A new target_trace argument that records the original names of the expanded targets rather than the wildcard values.
  2. A tidyr::nest() step that groups targets based on target_trace.
  3. A final step to gather the targets in any of the nested groups.

I really like where this is going! My current thoughts gravitate toward something very similar.

  1. Complementary trace_origin and trace_wildcard arguments to evaluate_plan() (the latter of which is currently just trace).
  2. For multiple calls to evaluate_plan(trace_origin = TRUE), a naming convention that avoids multiple columns named trace_targets. Maybe the names are as simple as origin1, origin2, etc., or maybe they should be more descriptive.
  3. A mechanism to gather targets based on the trace columns. Maybe it uses nest(), but if so, I think all the nesting could happen in the backend. We have some choices here. What about a new gather_filtered() function that does a filter() %>% group_by() %>% gather_plan() %>% bind_plans() instead? (A rough sketch follows below.)
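
A rough sketch of what such a helper could do (hypothetical; gather_filtered() is only a proposal here, and this version filters on a single trace column rather than using group_by()):

library(drake)

# Hypothetical gather_filtered(): gather only the plan rows whose trace
# column `col` equals `value`, then bind the gathered target back on.
gather_filtered <- function(plan, col, value, target, gather = "list") {
  rows <- !is.na(plan[[col]]) & plan[[col]] == value
  gathered <- gather_plan(plan[rows, ], target = target, gather = gather)
  bind_plans(plan, gathered)
}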

@wlandau (Collaborator) commented Sep 8, 2018

Better yet:

  1. Keep evaluate_plan(trace = TRUE) and make it add both origin and destination columns.

drake_plan(x = rnorm(n__), y = rexp(n__)) %>%
  evaluate_plan(wildcard = "n__", values = 1:2, trace = TRUE)
# A tibble: 4 x 4
  target command  n__   n___from
* <chr>  <chr>    <chr> <chr>   
1 x_1    rnorm(1) 1     x       
2 x_2    rnorm(2) 2     x       
3 y_1    rexp(1)  1     y       
4 y_2    rexp(2)  2     y

  2. Add gather_by() and reduce_by(), which are like the proposed gather_filtered(), but without the filtering (group_by() %>% do(gather_plan(., ...)) %>% bind_plans()).

@wlandau (Collaborator) commented Sep 8, 2018

See #515. I really think we got it right this time.

wlandau closed this as completed in f0e3e32 Sep 8, 2018

@wlandau (Collaborator) commented Sep 8, 2018

By the way: I will push the new features to CRAN as soon as the next version of clustermq goes to CRAN.

@tmastny (Contributor, Author) commented Sep 9, 2018

This is great. I'm glad we were able to find a nice solution that fits in with the drake philosophy. Thanks for all your help! Truly impressed with your dedication to resolving these issues.

And the implementation is really elegant: plan %>% group_by() %>% do(gather_plan(...)) is very natural. My only small wish is that I could choose which targets to gather within the gather_by() grouping column. Using the example in the docs, it would be nice if you could do

plan <- gather_by(plan, mu___from, at = 'bayes_model')

and only gather bayes_model instead of informal_look. Unfortunately I realize implementing this isn't as straightforward or nice.

You can work around this with filter:

plan %>%
  gather_by(mu___from) %>%
  filter(!target %in% c('target_informal_look'))

But !target %in% c('target_informal_look') could get pretty ugly in a longer plan. This seems error-prone and messy just to gather only bayes_model:

plan %>%
  gather_by(mu___from) %>%
  filter(!target %in% c('target_informal_look', 'target_outcome1',  'target_outcome2',  'target_outcome3'))

@wlandau (Collaborator) commented Sep 10, 2018

Yeah, I do realize gather_by() is currently limiting in some ways. I considered the options below, all of which have trade-offs. Still looking for the best of all worlds, if it exists.

Complete flexibility with respect to columns (current approach)

Here, we supply any number of columns to gather_by() and let drake figure out what to do with the values. There is power here. For example, suppose we are planning a sensitivity analysis with respect to our choice of model and hyperparameter, and we want to aggregate our results over multiple replicates for each combination of settings. This sort of plan is now trivially easy to create.

library(drake)
library(magrittr)
plan <- drake_plan(
  data = get_data(),
  informal_look = inspect_data(data, mu = mu__, rep = rep__),
  bayes_model = bayesian_model_fit(data, prior_mu = mu__, rep = rep__)
) %>%
  evaluate_plan(
    rules = list(
      mu__ = c(3, 9),
      rep__ = 1:2
    ),
    trace = TRUE
  ) %>%
  gather_by(mu__, mu___from) %>%
  drake_plan_source() %>%
  print()
#> drake_plan(
#>   data = get_data(),
#>   informal_look_3_1 = inspect_data(data, mu = 3, rep = 1),
#>   informal_look_3_2 = inspect_data(data, mu = 3, rep = 2),
#>   informal_look_9_1 = inspect_data(data, mu = 9, rep = 1),
#>   informal_look_9_2 = inspect_data(data, mu = 9, rep = 2),
#>   bayes_model_3_1 = bayesian_model_fit(data, prior_mu = 3, rep = 1),
#>   bayes_model_3_2 = bayesian_model_fit(data, prior_mu = 3, rep = 2),
#>   bayes_model_9_1 = bayesian_model_fit(data, prior_mu = 9, rep = 1),
#>   bayes_model_9_2 = bayesian_model_fit(data, prior_mu = 9, rep = 2),
#>   target_3_bayes_model = list(bayes_model_3_1 = bayes_model_3_1, bayes_model_3_2 = bayes_model_3_2),
#>   target_3_informal_look = list(informal_look_3_1 = informal_look_3_1, informal_look_3_2 = informal_look_3_2),
#>   target_9_bayes_model = list(bayes_model_9_1 = bayes_model_9_1, bayes_model_9_2 = bayes_model_9_2),
#>   target_9_informal_look = list(informal_look_9_1 = informal_look_9_1, informal_look_9_2 = informal_look_9_2),
#>   strings_in_dots = "literals"
#> )

Select on the values of a single column

We might instead want something like gather_by(column = mu__, filter = mu__ > 5). That would get rid of the superfluous targets you mentioned, but we are restricted to operating on a single column.

Multiple columns, multiple values

I suppose we could take in multiple columns and complex filter statements, or an at argument like you suggest. But I think at that point, we would essentially be recapitulating an inferior version of the dplyr API (or getting halfway there). This line of thinking usually makes me question drake's wildcard API as a whole. Maybe it would have been better to just come up with best practices for using dplyr on plans. Also related is the proposed DSL (#233, #304).

@wlandau (Collaborator) commented Sep 10, 2018

What I would really love is a separate Shiny app in which the user points and clicks on an HTML widget to build a graph, which then gets converted into a plan. (Or the code for a plan; it's easy to go both ways with drake_plan_source().)

@wlandau (Collaborator) commented Oct 16, 2018

Some updates in the development version:

@wlandau (Collaborator) commented Oct 23, 2018

I would like to return to this issue. I am reconsidering a filter argument that selects among targets created in the gathering process. Above, these are target_3_bayes_model, target_3_informal_look, target_9_bayes_model, and target_9_informal_look. Maybe the following, which is **not implemented yet**.

library(drake)
library(magrittr)
plan <- drake_plan(
  data = get_data(),
  informal_look = inspect_data(data, mu = mu__, rep = rep__),
  bayes_model = bayesian_model_fit(data, prior_mu = mu__, rep = rep__)
) %>%
  evaluate_plan(
    rules = list(
      mu__ = c(3, 9),
      rep__ = 1:2
    ),
    trace = TRUE
  ) %>%
  gather_by(mu__, mu___from, filter = mu__ == "bayes_model", append = TRUE) %>%
  drake_plan_source() %>%
  print()
#> drake_plan(
#>   data = get_data(),
#>   informal_look_3_1 = inspect_data(data, mu = 3, rep = 1),
#>   informal_look_3_2 = inspect_data(data, mu = 3, rep = 2),
#>   informal_look_9_1 = inspect_data(data, mu = 9, rep = 1),
#>   informal_look_9_2 = inspect_data(data, mu = 9, rep = 2),
#>   bayes_model_3_1 = bayesian_model_fit(data, prior_mu = 3, rep = 1),
#>   bayes_model_3_2 = bayesian_model_fit(data, prior_mu = 3, rep = 2),
#>   bayes_model_9_1 = bayesian_model_fit(data, prior_mu = 9, rep = 1),
#>   bayes_model_9_2 = bayesian_model_fit(data, prior_mu = 9, rep = 2),
#>   target_3_bayes_model = list(bayes_model_3_1 = bayes_model_3_1, bayes_model_3_2 = bayes_model_3_2),
#>   target_9_bayes_model = list(bayes_model_9_1 = bayes_model_9_1, bayes_model_9_2 = bayes_model_9_2),
#>   strings_in_dots = "literals"
#> )

Notice that append = TRUE also keeps the informal_look_* targets. If we wanted to eliminate those from the start, we could have just called filter(mu___from == "bayes_model") before gather_by().
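
In code, that filter-first route might look like this (a sketch; evaluated stands for the traced plan from evaluate_plan(), and the is.na() guard keeps untraced targets such as data):

library(dplyr)
library(drake)

# Drop the informal_look_* rows before gathering, keeping untraced rows.
evaluated %>%
  filter(is.na(mu___from) | mu___from == "bayes_model") %>%
  gather_by(mu__, mu___from)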

The same thing with append = FALSE should look like this:

library(drake)
library(magrittr)
plan <- drake_plan(
  data = get_data(),
  informal_look = inspect_data(data, mu = mu__, rep = rep__),
  bayes_model = bayesian_model_fit(data, prior_mu = mu__, rep = rep__)
) %>%
  evaluate_plan(
    rules = list(
      mu__ = c(3, 9),
      rep__ = 1:2
    ),
    trace = TRUE
  ) %>%
  gather_by(mu__, mu___from, filter = mu__ == "bayes_model", append = FALSE) %>%
  drake_plan_source() %>%
  print()
#> drake_plan(
#>   target_3_bayes_model = list(bayes_model_3_1 = bayes_model_3_1, bayes_model_3_2 = bayes_model_3_2),
#>   target_9_bayes_model = list(bayes_model_9_1 = bayes_model_9_1, bayes_model_9_2 = bayes_model_9_2),
#>   strings_in_dots = "literals"
#> )

wlandau reopened this Oct 23, 2018

@tmastny (Contributor, Author) commented Oct 23, 2018

I really like the outcome of the filter argument. It accomplishes exactly what I would want. I also agree with keeping the informal_look_* targets unless they are explicitly filtered beforehand.

@wlandau (Collaborator) commented Oct 23, 2018

Great! I will implement it next time I can grab some time. Thanks again for the input.

@wlandau (Collaborator) commented Oct 24, 2018

Okay, I think 55e9197 and 17c2865 have what you are after. I am shooting for another CRAN release at the end of this month, and it will have the filter argument.

library(drake)
library(magrittr)
plan <- drake_plan(
  data = get_data(),
  informal_look = inspect_data(data, mu = mu__),
  bayes_model = bayesian_model_fit(data, prior_mu = mu__)
) %>%
  evaluate_plan(rules = list(mu__ = 1:2), trace = TRUE) %>%
  print()
#> # A tibble: 5 x 4
#>   target          command                               mu__  mu___from   
#>   <chr>           <chr>                                 <chr> <chr>       
#> 1 data            get_data()                            <NA>  <NA>        
#> 2 informal_look_1 inspect_data(data, mu = 1)            1     informal_lo…
#> 3 informal_look_2 inspect_data(data, mu = 2)            2     informal_lo…
#> 4 bayes_model_1   bayesian_model_fit(data, prior_mu = … 1     bayes_model 
#> 5 bayes_model_2   bayesian_model_fit(data, prior_mu = … 2     bayes_model
gather_by(
  plan,
  mu___from,
  append = TRUE,
  filter = mu___from == "bayes_model"
)
#> # A tibble: 6 x 4
#>   target        command                                  mu__  mu___from  
#>   <chr>         <chr>                                    <chr> <chr>      
#> 1 data          get_data()                               <NA>  <NA>       
#> 2 informal_loo… inspect_data(data, mu = 1)               1     informal_l…
#> 3 informal_loo… inspect_data(data, mu = 2)               2     informal_l…
#> 4 bayes_model_1 bayesian_model_fit(data, prior_mu = 1)   1     bayes_model
#> 5 bayes_model_2 bayesian_model_fit(data, prior_mu = 2)   2     bayes_model
#> 6 target_bayes… list(bayes_model_1 = bayes_model_1, bay… <NA>  bayes_model

Created on 2018-10-23 by the reprex package (v0.2.1)

@tmastny (Contributor, Author) commented Oct 24, 2018

This is exactly what I was hoping for.

@wlandau (Collaborator) commented Jan 21, 2019

@tmastny, development drake now has a new experimental domain-specific language that makes all this easier. If you are still using drake, would you have a look at https://ropenscilabs.github.io/drake-manual/plans.html?
