Wildcard alternative to gather/reduce_plan
#376
Comments
So happy to hear you're spreading the word! I agree with the pros and cons you detail about wildcards. The problems with wildcards, separate plans, evaluation, and awkward gathering are part of what I think @krlmlr was trying to solve with #233 and #304. Packing, unpacking, and delayed evaluation of placeholders should allow everything to happen in a single step. Another thing: do you happen to know if Snakemake has an equivalent of reduce?
It will be a bit tricky, but I do think we could implement a way to gather/reduce wildcards in place within a single plan:

```r
drake_plan(
  other = stuff(),
  target = read_table(file_in("file_input_.txt")),
  final = gather_targets(target_input_)
) %>%
  evaluate(wildcard = "input_", values = 1:2)
#> # A tibble: 4 x 2
#>   target   command
#>   <chr>    <chr>
#> 1 other    stuff()
#> 2 target_1 "read_table(file_in(\"file_1.txt\"))"
#> 3 target_2 "read_table(file_in(\"file_2.txt\"))"
#> 4 final    list(target_1 = target_1, target_2 = target_2)
```

At some point I need to get back to work on this.
Yes. As for a Snakemake reduce, I do not think it exists. I'm not a Snakemake expert, but here is how I understand it: Snakemake allows you to have a dynamic range of files, but not rules. The problem with a pairwise reduce is that you need to generate new rules to accommodate lists of any length.
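To make that scaling concrete, here is a small sketch (a hypothetical helper, not part of drake or Snakemake) showing why a pairwise reduce needs a number of generated intermediate targets that depends on the length of the input list:

```r
# Hypothetical illustration: a pairwise reduce over n targets needs
# roughly log2(n) levels of generated intermediate targets, so the set
# of rules cannot be fixed up front.
pairwise_reduce_targets <- function(targets, prefix = "reduced") {
  level <- 0
  all_new <- character(0)
  while (length(targets) > 1) {
    level <- level + 1
    # Pair up the current targets (the last one may be left unpaired).
    pairs <- split(targets, ceiling(seq_along(targets) / 2))
    targets <- sprintf("%s_%d_%d", prefix, level, seq_along(pairs))
    all_new <- c(all_new, targets)
  }
  all_new
}
pairwise_reduce_targets(c("a", "b", "c", "d"))
# → "reduced_1_1" "reduced_1_2" "reduced_2_1"
```

Four inputs need three generated targets; eight inputs would need seven, and so on, which is exactly what a static rule set cannot express.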
Would you be open to a pull request on this issue? This change would benefit my open-source package leadr and would help with my modeling work. If you have a suggestion on where to start, I'd be happy to dig into the source code.
Absolutely, I would love that! Yes please! There are more and deeper issues in drake than I can tackle alone.
As an interim workaround, here's a slightly condensed way to gather plans that I have been using. It also allows for more flexible syntax with the gathering function (additional and named arguments, etc.).

```r
library(drake)
plan_data <- drake_plan(
  data = extract_data("file_file__.txt")
) %>%
  evaluate_plan(wildcard = "file__", values = 1:10) %>%
  bind_plans(drake_plan(
    data_gathered = bind_rows(!!!rlang::syms(.$target))
  ))
```

You can use the triple bang (`!!!`) to splice a list of symbols into the gathering call.
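For anyone following along, the command a gathering target ultimately needs is just a string built from the target names. A minimal sketch (a hypothetical helper, roughly the shape of what `gather_plan()` produces, not drake's actual internals):

```r
# Build a gathering command string from target names.
# (Sketch only; gather_plan() in drake does this with more care.)
gather_command <- function(targets, gather = "list") {
  args <- paste0(targets, " = ", targets, collapse = ", ")
  paste0(gather, "(", args, ")")
}
gather_command(c("results_rf", "results_glmnet"))
# → "list(results_rf = results_rf, results_glmnet = results_glmnet)"
```

Swapping `gather = "bind_rows"` gives the row-binding flavor of the workaround above.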
I think the crux of this issue is keeping track of which wildcards originally corresponded to which values after the plan has been expanded. We could store attributes in the plan to record that.
I'm glad I took a step back and let this one simmer for a while. Rather than add more code analysis magic to gather specific subcollections of targets, I think it is much simpler and more flexible to let the wildcard expansion leave indicator columns behind:

```r
drake_plan(
  x = method__(n__),
  y = rt(1000, df = 10)
) %>%
  evaluate_plan(
    indicators = TRUE,
    rules = list(
      method__ = c("rnorm", "rexp"),
      n__ = c(8, 16)
    )
  )
#> # A tibble: 5 x 4
#>   target     command           method__ n__
#>   <chr>      <chr>             <chr>    <dbl>
#> 1 x_rnorm_8  rnorm(8)          rnorm        8
#> 2 x_rnorm_16 rnorm(16)         rnorm       16
#> 3 x_rexp_8   rexp(8)           rexp         8
#> 4 x_rexp_16  rexp(16)          rexp        16
#> 5 y          rt(1000, df = 10) NA          NA
```

Then, it would be easy to do custom filtering before you call `gather_plan()`.
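To illustrate the kind of custom filtering this enables, here is a small base-R sketch over a plain data frame shaped like the expanded tibble above (a stand-in; no drake calls involved):

```r
# A stand-in for the expanded plan above, with indicator columns.
plan <- data.frame(
  target   = c("x_rnorm_8", "x_rnorm_16", "x_rexp_8", "x_rexp_16", "y"),
  command  = c("rnorm(8)", "rnorm(16)", "rexp(8)", "rexp(16)", "rt(1000, df = 10)"),
  method__ = c("rnorm", "rnorm", "rexp", "rexp", NA),
  n__      = c(8, 16, 8, 16, NA),
  stringsAsFactors = FALSE
)
# Keep only the rexp() targets before gathering them.
subset(plan, method__ %in% "rexp")$target
# → "x_rexp_8" "x_rexp_16"
```

Because the plan is an ordinary data frame, any row-filtering tool (`subset()`, `dplyr::filter()`, etc.) can select the subcollection to gather.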
Also related: #229. If we put the wildcard attributes in the graph nodes, we may be able to expand/collapse the graph.
Just so I don't forget: we should probably define an attribute in the plan to keep track of which columns are wildcard indicators. Otherwise, drake has no way to tell indicator columns apart from ordinary custom columns.
Thanks for the PR. I definitely think adding the trace columns is a good idea. Consider the following example:

```r
plan <- drake_plan(
  zscored = standardizer(iris),
  model = train(Species ~ ., data = zscored, method = method__),
  results = accuracy(model_method__, zscored)
)
evaled <- evaluate_plan(plan, rules = list(method__ = c("rf", "glmnet")),
                        trace = TRUE)
gathered <- evaled %>%
  filter(stringr::str_detect(target, "results_")) %>%
  gather_plan(target = "results")
evaled %>%
  filter(!stringr::str_detect(target, "results_")) %>%
  bind_plans(gathered)
```

The real difficulty is partitioning the plan into gathering targets and everything else, and then bringing them back together. I think a few additional wildcard rules would be a worthwhile tradeoff compared to this complicated workflow: the steps above are already the minimum needed to get a working plan.
As I mentioned before, the great thing about drake is the self-documenting plan. Thanks for all the work. I realize how large and complicated this project is, and this might not be deemed important. Not a problem at all; I just wanted to articulate the UX problem I see with this workflow.
Why drake is the way it is

When I originally designed the UI, I was reacting to Make-like workflows whose dependency structures were too complicated for the available wildcard functionality. I found myself writing code to generate Makefiles. The focus on expanded/evaluated data frames opens up possibilities beyond an inevitably restrictive wildcard language.

Revisiting the original issue

To improve gathering, we would need some extra static code analysis so we can detect a call like `gather_targets()` inside a plan:

```r
drake_plan(
  zscored = standardizer(iris),
  model = train(Species ~ ., data = zscored, method = method__),
  results = accuracy(model_method__, zscored),
  all_results = list(gather_targets("results_*"))
)
```

Here is what it would take.

```r
drake_plan(
  zscored = standardizer(iris),
  model = train(Species ~ ., data = zscored, method = method__),
  results = accuracy(model_method__, zscored),
  all_results = list(gather_targets("results_*"))
) %>%
  evaluate_plan(rules = list(method__ = c("rf", "glmnet"))) %>%
  process_gather_targets() # needs a better name, and gather_plan() is taken
#> # A tibble: 6 x 2
#>   target         command
#> * <chr>          <chr>
#> 1 zscored        standardizer(iris)
#> 2 model_rf       train(Species ~ ., data = zscored, method = rf)
#> 3 model_glmnet   train(Species ~ ., data = zscored, method = glmnet)
#> 4 results_rf     accuracy(model_rf, zscored)
#> 5 results_glmnet accuracy(model_glmnet, zscored)
#> 6 all_results    "list(results_rf = results_rf, results_glmnet = results_glmnet)"
```

This approach has its own issues.
Specific comments
A fair point, but in drake the plan is just a data frame.
True, but you can add your own columns if you like.

```r
library(drake)
library(tidyverse)
plan <- drake_plan(
  zscored = standardizer(iris),
  model = train(Species ~ ., data = zscored, method = method__),
  results = accuracy(model_method__, zscored)
)
plan$group <- plan$target
evaluate_plan(
  plan,
  rules = list(method__ = c("rf", "glmnet")),
  trace = TRUE
) %>%
  filter(group == "results") %>%
  gather_plan(target = "results")
#> # A tibble: 1 x 2
#>   target  command
#>   <chr>   <chr>
#> 1 results list(results_rf = results_rf, results_glmnet = results_glmnet)
```
Thanks for the response. Drake is amazing, and I really appreciate the effort you put into issues. "Dataframe as makefile" gives me an idea. The standard dataframe strategy is split-apply-combine. If we view our current process in this framework, the workflow is inefficient because we have to manually combine the partition we created to gather. What if we had something like the following?

```r
drake_plan(
  zscored = standardizer(iris),
  model = train(Species ~ ., data = zscored, method = method__),
  results = accuracy(model_method__, zscored)
) %>%
  evaluate_plan(rules = list(method__ = c("rf", "glmnet")), target_trace = TRUE) %>%
  nest(-target_trace) %>%
  nested_gather("results", "results_list")
# # A tibble: 6 x 2
#   target         command
#   <chr>          <chr>
# 1 zscored        standardizer(iris)
# 2 model_rf       train(Species ~ ., data = zscored, method = rf)
# 3 model_glmnet   train(Species ~ ., data = zscored, method = glmnet)
# 4 results_rf     accuracy(model_rf, zscored)
# 5 results_glmnet accuracy(model_glmnet, zscored)
# 6 results_list   list(results_rf = results_rf, results_glmnet = results_glmnet)
```
Here, the traced plan before nesting would look like:

```r
# A tibble: 5 x 3
#   target         command                                          target_trace
#   <chr>          <chr>                                            <chr>
# 1 zscored        standardizer(iris)                               NA
# 2 model_rf       train(Species ~ ., data = zscored, method = rf)  model
# 3 model_glmnet   train(Species ~ ., data = zscored, method = glmn… model
# 4 results_rf     accuracy(model_rf, zscored)                      results
# 5 results_glmnet accuracy(model_glmnet, zscored)                  results
```

And then `nested_gather()` could be implemented as:

```r
nested_gather <- function(.data, to_nest, to_gather) {
  gathered <- .data[which(.data$target_trace == to_nest), "data"] %>%
    .[[1]] %>%
    .[[1]] %>%
    gather_plan()
  # Append a new row, then put the gathered sub-plan in its "data" cell.
  .data[nrow(.data) + 1, ] <- list(to_gather, list(tibble()))
  .data[nrow(.data), "data"][[1]][[1]] <- gathered
  .data %>%
    unnest() %>%
    select(-target_trace)
}
```

And so we actually have a working example, if we manually add in the trace column:

```r
plan <- drake_plan(
  zscored = standardizer(iris),
  model = train(Species ~ ., data = zscored, method = method__),
  results = accuracy(model_method__, zscored)
)
evaled <- evaluate_plan(plan, rules = list(method__ = c("rf", "glmnet")))
evaled[["target_trace"]] <- c(NA, "model", "model", "results", "results")
evaled %>%
  nest(-target_trace) %>%
  nested_gather("results", "results_list")
```
Sorry it took me so long to return to this thread. If I understand correctly, there are a few pieces to this.
I really like where this is going! My current thoughts gravitate toward something very similar.
Better yet:

```r
drake_plan(x = rnorm(n__), y = rexp(n__)) %>%
  evaluate_plan(wildcard = "n__", values = 1:2, trace = TRUE)
# A tibble: 4 x 4
#   target command  n__   n___from
# * <chr>  <chr>    <chr> <chr>
# 1 x_1    rnorm(1) 1     x
# 2 x_2    rnorm(2) 2     x
# 3 y_1    rexp(1)  1     y
# 4 y_2    rexp(2)  2     y
```
See #515. I really think we got it right this time.
By the way: I will push the new features to CRAN as soon as the next version of the dependency they need is released.
This is great. I'm glad we were able to find a nice solution that fits in with the rest of the interface, and the implementation is really elegant. One remaining wrinkle: `gather_by()` gathers every group, and I only want to gather the results. You can work around this with filtering, but it takes a couple of extra steps.
Yeah, I do realize that limitation.

Complete flexibility with respect to columns (current approach)

Here, we supply any number of columns to `gather_by()`:

```r
library(drake)
library(magrittr)
plan <- drake_plan(
  data = get_data(),
  informal_look = inspect_data(data, mu = mu__, rep = rep__),
  bayes_model = bayesian_model_fit(data, prior_mu = mu__, rep = rep__)
) %>%
  evaluate_plan(
    rules = list(
      mu__ = c(3, 9),
      rep__ = 1:2
    ),
    trace = TRUE
  ) %>%
  gather_by(mu__, mu___from) %>%
  drake_plan_source() %>%
  print()
#> drake_plan(
#>   data = get_data(),
#>   informal_look_3_1 = inspect_data(data, mu = 3, rep = 1),
#>   informal_look_3_2 = inspect_data(data, mu = 3, rep = 2),
#>   informal_look_9_1 = inspect_data(data, mu = 9, rep = 1),
#>   informal_look_9_2 = inspect_data(data, mu = 9, rep = 2),
#>   bayes_model_3_1 = bayesian_model_fit(data, prior_mu = 3, rep = 1),
#>   bayes_model_3_2 = bayesian_model_fit(data, prior_mu = 3, rep = 2),
#>   bayes_model_9_1 = bayesian_model_fit(data, prior_mu = 9, rep = 1),
#>   bayes_model_9_2 = bayesian_model_fit(data, prior_mu = 9, rep = 2),
#>   target_3_bayes_model = list(bayes_model_3_1 = bayes_model_3_1, bayes_model_3_2 = bayes_model_3_2),
#>   target_3_informal_look = list(informal_look_3_1 = informal_look_3_1, informal_look_3_2 = informal_look_3_2),
#>   target_9_bayes_model = list(bayes_model_9_1 = bayes_model_9_1, bayes_model_9_2 = bayes_model_9_2),
#>   target_9_informal_look = list(informal_look_9_1 = informal_look_9_1, informal_look_9_2 = informal_look_9_2),
#>   strings_in_dots = "literals"
#> )
```

Select on the values of a single column

We might instead want to gather only the targets that match a single value of one column.

Multiple columns, multiple values

I suppose we could take in multiple columns and complex filter statements, or an arbitrary filtering expression.
What I would really love is a separate Shiny app in which the user points and clicks on an HTML widget to build a graph, which then gets converted into a plan. (Or the code for a plan; it's easy to go both ways with `drake_plan_source()`.)
Some updates are in the development version.
I would like to return to this issue. I am reconsidering a `filter` argument for `gather_by()`:

```r
library(drake)
library(magrittr)
plan <- drake_plan(
  data = get_data(),
  informal_look = inspect_data(data, mu = mu__, rep = rep__),
  bayes_model = bayesian_model_fit(data, prior_mu = mu__, rep = rep__)
) %>%
  evaluate_plan(
    rules = list(
      mu__ = c(3, 9),
      rep__ = 1:2
    ),
    trace = TRUE
  ) %>%
  gather_by(mu__, mu___from, filter = mu___from == "bayes_model", append = TRUE) %>%
  drake_plan_source() %>%
  print()
#> drake_plan(
#>   data = get_data(),
#>   informal_look_3_1 = inspect_data(data, mu = 3, rep = 1),
#>   informal_look_3_2 = inspect_data(data, mu = 3, rep = 2),
#>   informal_look_9_1 = inspect_data(data, mu = 9, rep = 1),
#>   informal_look_9_2 = inspect_data(data, mu = 9, rep = 2),
#>   bayes_model_3_1 = bayesian_model_fit(data, prior_mu = 3, rep = 1),
#>   bayes_model_3_2 = bayesian_model_fit(data, prior_mu = 3, rep = 2),
#>   bayes_model_9_1 = bayesian_model_fit(data, prior_mu = 9, rep = 1),
#>   bayes_model_9_2 = bayesian_model_fit(data, prior_mu = 9, rep = 2),
#>   target_3_bayes_model = list(bayes_model_3_1 = bayes_model_3_1, bayes_model_3_2 = bayes_model_3_2),
#>   target_9_bayes_model = list(bayes_model_9_1 = bayes_model_9_1, bayes_model_9_2 = bayes_model_9_2),
#>   strings_in_dots = "literals"
#> )
```

Notice that only the `bayes_model` targets get gathered. The same thing with `append = FALSE`:

```r
library(drake)
library(magrittr)
plan <- drake_plan(
  data = get_data(),
  informal_look = inspect_data(data, mu = mu__, rep = rep__),
  bayes_model = bayesian_model_fit(data, prior_mu = mu__, rep = rep__)
) %>%
  evaluate_plan(
    rules = list(
      mu__ = c(3, 9),
      rep__ = 1:2
    ),
    trace = TRUE
  ) %>%
  gather_by(mu__, mu___from, filter = mu___from == "bayes_model", append = FALSE) %>%
  drake_plan_source() %>%
  print()
#> drake_plan(
#>   target_3_bayes_model = list(bayes_model_3_1 = bayes_model_3_1, bayes_model_3_2 = bayes_model_3_2),
#>   target_9_bayes_model = list(bayes_model_9_1 = bayes_model_9_1, bayes_model_9_2 = bayes_model_9_2),
#>   strings_in_dots = "literals"
#> )
```
I really like the outcome of the filter argument. It accomplishes exactly what I would want. I also agree with keeping the append argument.
Great! I will implement it next time I can grab some time. Thanks again for the input.
Okay, I think 55e9197 and 17c2865 have what you are after. I am shooting for another CRAN release at the end of this month, and it will have the new `filter` argument.

```r
library(drake)
library(magrittr)
plan <- drake_plan(
  data = get_data(),
  informal_look = inspect_data(data, mu = mu__),
  bayes_model = bayesian_model_fit(data, prior_mu = mu__)
) %>%
  evaluate_plan(rules = list(mu__ = 1:2), trace = TRUE) %>%
  print()
#> # A tibble: 5 x 4
#>   target          command                               mu__  mu___from
#>   <chr>           <chr>                                 <chr> <chr>
#> 1 data            get_data()                            <NA>  <NA>
#> 2 informal_look_1 inspect_data(data, mu = 1)            1     informal_lo…
#> 3 informal_look_2 inspect_data(data, mu = 2)            2     informal_lo…
#> 4 bayes_model_1   bayesian_model_fit(data, prior_mu = … 1     bayes_model
#> 5 bayes_model_2   bayesian_model_fit(data, prior_mu = … 2     bayes_model
gather_by(
  plan,
  mu___from,
  append = TRUE,
  filter = mu___from == "bayes_model"
)
#> # A tibble: 6 x 4
#>   target        command                                  mu__  mu___from
#>   <chr>         <chr>                                    <chr> <chr>
#> 1 data          get_data()                               <NA>  <NA>
#> 2 informal_loo… inspect_data(data, mu = 1)               1     informal_l…
#> 3 informal_loo… inspect_data(data, mu = 2)               2     informal_l…
#> 4 bayes_model_1 bayesian_model_fit(data, prior_mu = 1)   1     bayes_model
#> 5 bayes_model_2 bayesian_model_fit(data, prior_mu = 2)   2     bayes_model
#> 6 target_bayes… list(bayes_model_1 = bayes_model_1, bay… <NA>  bayes_model
```

Created on 2018-10-23 by the reprex package (v0.2.1)
This is exactly what I was hoping for. |
@tmastny, this is in the development version now.
Introduction
I recently did a small presentation on drake to some fellow graduate students, and everyone was impressed by how intuitive it is to specify the dependencies and a workflow, especially compared to a file-centric tool like Snakemake. In many ways, the R source code is the documentation of the workflow.
However, based on my personal opinion and on the feedback from my presentation, I feel that `gather_plan` is very non-intuitive and verbose, especially compared to Snakemake alternatives. I'd like to show an example of the complexity of `gather_plan` and how Snakemake handles similar situations.

Example
This starts off nice, but now we'd like to gather all the `results_method__` targets into a convenient list holding all the results for each `method__`.

One straightforward but hacky way to do this is through `ls()`, but that doesn't create dependencies on the `results_method__` objects, so it may return empty, or it might not be updated.

The `gather_plan` method is much more verbose and confusing. First you need to evaluate the plan, and then gather, but only on the `results_method__` subset.

Alternate Route
I realized I could have created a separate plan tibble for `results_method__`, but that seems even more convoluted. This isn't any better in my opinion: I had to evaluate twice, and I lose the self-documenting, readable source code specified in the original `drake_plan(...)`.

Problems with gather_plan

As demonstrated in the previous example, I think these are the main problems:
1. You have to `evaluate_plan` before you gather. This is confusing to me, because evaluating sounds like it should be the final step in writing a plan. Alternatively, it would help if you could gather on wildcards somehow.
2. The gathering dataframe must contain only the rows you wish to combine. This means you either have to split and recombine, or start separately and do multiple evaluates.
Snakemake

By default, drake's wildcard system expands and executes each command separately. But that isn't what we want. We want something like

```
cat hello.txt world.txt > output.txt
```

Snakemake makes this pretty simple. Compared to drake, gathering wildcards in Snakemake is easier.
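For reference, the Snakemake pattern alluded to above usually looks something like the following (a hedged sketch with made-up file names, not code from the original post):

```snakemake
# Gathering in Snakemake: expand() enumerates the inputs,
# so a single rule consumes the whole collection at once.
rule gather:
    input:
        expand("{name}.txt", name=["hello", "world"])
    output:
        "output.txt"
    shell:
        "cat {input} > {output}"
```

The `expand()` helper is what makes the gather step a one-liner there, which is the convenience this issue asks drake to match.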
Suggestion
I'd like there to be a new function or special wildcard keyword that you could use in the original plan to gather certain wildcards: for example, a `gather_targets()` call inside `drake_plan()`, or a dedicated gathering wildcard.