Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ceci n'est pas un pipe. #748

Closed
wants to merge 16 commits into from
Closed

Ceci n'est pas un pipe. #748

wants to merge 16 commits into from

Conversation

wlandau
Copy link
Collaborator

@wlandau wlandau commented Feb 20, 2019

Summary

This PR implements an experimental %dp% pipe operator for generating targets. Taken mostly from split_chain() in magrittr. Still needs profiling. cc @billdenney.

I am still not sure this is a good idea. We should give it some time and think it over.

library(drake)

drake_plan(
  result = data %dp%
    task1() %dp%
    task2(data = my_data, .) %dp%
    task3()
)
#> # A tibble: 4 x 2
#>   target   command                        
#>   <chr>    <expr>                         
#> 1 result.3 data                           
#> 2 result.2 task1(result.3)                
#> 3 result.1 task2(data = my_data, result.2)
#> 4 result   task3(result.1)

drake_plan(
  result = target(
    task1(data, analysis) %dp%
      task2(),
    transform = map(analysis = c("bayes", "freq"))
  )
)
#> # A tibble: 4 x 2
#>   target           command                
#>   <chr>            <expr>                 
#> 1 result_.bayes..1 task1(data, "bayes")   
#> 2 result_.bayes.   task2(result_.bayes..1)
#> 3 result_.freq..1  task1(data, "freq")    
#> 4 result_.freq.    task2(result_.freq..1)

Created on 2019-02-20 by the reprex package (v0.2.1)

Related GitHub issues and pull requests

Checklist

  • I have read drake's code of conduct, and I agree to follow its rules.
  • I have listed any substantial changes in the development news.
  • I have added testthat unit tests to tests/testthat to confirm that any new features or functionality work correctly.
  • I have tested this pull request locally with devtools::check()
  • This pull request is ready for review.
  • I think this pull request is ready to merge.

@codecov-io
Copy link

codecov-io commented Feb 20, 2019

Codecov Report

Merging #748 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff          @@
##           master   #748   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files          73     74    +1     
  Lines        6217   6273   +56     
=====================================
+ Hits         6217   6273   +56
Impacted Files Coverage Δ
R/api-pipe.R 100% <100%> (ø)
R/api-plan.R 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d1fceb8...3ceaa1d. Read the comment docs.

@wlandau
Copy link
Collaborator Author

wlandau commented Feb 20, 2019

Still needs work. Grouping variables from transforms should not apply to upstream targets in a %dp% call.

library(drake)
drake_plan(
  result = target(
    task1(data, analysis) %dp%
      task2(),
    transform = map(analysis = c("bayes", "freq"))
  ),
  trace = TRUE
)
#> # A tibble: 4 x 4
#>   target           command                 analysis    result        
#>   <chr>            <expr>                  <chr>       <chr>         
#> 1 result_.bayes..1 task1(data, "bayes")    "\"bayes\"" result_.bayes.
#> 2 result_.bayes.   task2(result_.bayes..1) "\"bayes\"" result_.bayes.
#> 3 result_.freq..1  task1(data, "freq")     "\"freq\""  result_.freq. 
#> 4 result_.freq.    task2(result_.freq..1)  "\"freq\""  result_.freq.

Created on 2019-02-20 by the reprex package (v0.2.1)

@wlandau
Copy link
Collaborator Author

wlandau commented Feb 20, 2019

Never mind, we should be okay because all transforms happen before any %dp% calls are resolved.

@wlandau
Copy link
Collaborator Author

wlandau commented Feb 20, 2019

Confirmation of #748 (comment):

library(drake)
plan <- drake_plan(
  result = target(
    task1(data, analysis) %dp%
      task2() %dp%
      task3(),
    transform = map(analysis = c("bayes", "freq"))
  ),
  end = target(
    list(result),
    transform = combine(result)
  )
)
config <- drake_config(plan)
vis_drake_graph(config)

Created on 2019-02-20 by the reprex package (v0.2.1)

@wlandau
Copy link
Collaborator Author

wlandau commented Feb 22, 2019

As I thought, scanning for %dp% incurs a speed penalty when we create large plans.

library(drake)
x_vals = as.numeric(1:1000)
microbenchmark::microbenchmark(
  plan = drake_plan(
    x2 = target(x1, transform = map(x1 = !!x_vals)),
    x3 = target(x2, transform = map(x2)),
    x4 = target(x3, transform = map(x3)),
    x5 = target(x4, transform = map(x4)),
    x6 = target(x5, transform = map(x5)),
    x7 = target(x6, transform = map(x6)),
    x8 = target(x7, transform = map(x7)),
    x9 = target(x8, transform = map(x8)),
    x10 = target(x9, transform = map(x9)),
    x11 = target(x10, transform = combine(x10))
  )
)

With 78de526:

Unit: milliseconds
 expr      min      lq     mean   median       uq      max neval
 plan 528.6575 552.171 567.9617 562.8718 578.5794 638.1616   100

With decc787:

Unit: milliseconds
 expr      min       lq     mean   median       uq     max neval
 plan 543.5965 565.4996 580.2999 573.7455 592.0488 654.019   100

We should probably pause and gather more information. How many people will use this? If enough of the community gets behind it, I will merge.

@wlandau
Copy link
Collaborator Author

wlandau commented Feb 22, 2019

Hmmm... oddly enough, this is another case where the pipe could help condense things. The following plan is equivalent to the one from #748 (comment).

drake_plan(
  x10 = target(
    x1 %dp%
      identity %dp%
      identity %dp%
      identity %dp%
      identity %dp%
      identity %dp%
      identity %dp%
      identity %dp%
      identity %dp%
      identity,
    transform = map(x1 = !!x_vals)
  ),
  x11 = target(x10, transform = combine(x10))
)

Of course, we would replace identity() with other functions in a real application.

@wlandau
Copy link
Collaborator Author

wlandau commented Feb 22, 2019

...but the benchmarks are disappointing. Further work could probably speed it up though.

library(drake)
x_vals = as.numeric(1:1000)
microbenchmark::microbenchmark(
  plan = drake_plan(
    x10 = target(
      x1 %dp%
        identity %dp%
        identity %dp%
        identity %dp%
        identity %dp%
        identity %dp%
        identity %dp%
        identity %dp%
        identity %dp%
        identity,
      transform = map(x1 = !!x_vals)
    ),
    x11 = target(x10, transform = combine(x10))
  )
)
#> Unit: seconds
#>  expr      min       lq     mean   median      uq      max neval
#>  plan 2.270687 2.443403 2.600197 2.565977 2.71475 3.250199   100

@wlandau
Copy link
Collaborator Author

wlandau commented Mar 19, 2019

If we can evaluate the pipe before we apply the transforms, then we might not see a performance penalty. I will see if it is possible.

@wlandau
Copy link
Collaborator Author

wlandau commented Mar 19, 2019

Nope, we can't do that. If grouping variables appear in the middle of a pipe chain, we need to make sure the transformation applies to the entire chain. The best way to implement this is to just evaluate the pipe afterwards. Otherwise, we would have to manually track provenance, and things would get ugly fast.

drake::drake_plan(
  result = target(
    data %dp%
      task1() %dp%
      task2(x, .) %dp%
      task3(),
    transform = map(x = c(1, 2))
  )
)
#> # A tibble: 8 x 2
#>   target     command               
#>   <chr>      <S3: expr_list>       
#> 1 result_1_2 "data                "
#> 2 result_1_3 "task1(result_1_2)   "
#> 3 result_1_4 task2(1, result_1_3)  
#> 4 result_1   "task3(result_1_4)   "
#> 5 result_2_2 "data                "
#> 6 result_2_3 "task1(result_2_2)   "
#> 7 result_2_4 task2(2, result_2_3)  
#> 8 result_2   "task3(result_2_4)   "

Created on 2019-03-19 by the reprex package (v0.2.1)

@wlandau
Copy link
Collaborator Author

wlandau commented Mar 19, 2019

Closing re #746 (comment). If we ever want to come back (which I doubt we will) the code is in the diff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants