Lazy version of the pipe #120

Open
hadley opened this Issue Feb 29, 2016 · 19 comments

hadley commented Feb 29, 2016

It would be nice to have a general way to transform expressions of the form:

tryCatch(stop("!"), error = function(e) "An error")

The obvious transformation:

stop("!") %>% 
   tryCatch(error = function(e) "An error")

doesn't work because the steps of a pipeline are evaluated in sequence: stop("!") raises its error before tryCatch() is ever called.

I no longer understand how the internals of magrittr work, but I wonder if it might be possible to use delayedAssign() to overcome this problem.
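
A minimal sketch of the idea, using only base R and not magrittr's internals: delayedAssign() binds a name to a promise, so the failing expression runs only when tryCatch() forces it. The name `value` is made up for illustration.

```r
# Bind `value` to a promise; stop("!") is NOT run at this point.
delayedAssign("value", stop("!"))

# The promise is forced inside tryCatch(), so the handler sees the error.
result <- tryCatch(value, error = function(e) "An error")
print(result)
```

Whether a pipe can set up such promises for every step is a separate question; this only shows that deferring the left-hand side is enough for the handler to fire.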

@hadley hadley changed the title from A lazy evaluation version of the pipe to Lazy version of the pipe Feb 29, 2016

smbache commented Feb 29, 2016

It could be done; the simplified branch is closest to something that could work.
However, internally, each step would need to substitute dots, e.g. .1, .2, ... or one would encounter "promise already under evaluation" errors.
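
The naming problem can be reproduced in a few lines, assuming the steps are bound with delayedAssign(). This is a sketch; `same_name` and `distinct_names` are illustrative names, not code from the branch. If every step rebinds `.`, the promise refers to itself when forced; a fresh name per step avoids that.

```r
# Reusing `.`: the promise for `.` looks up `.` again while being forced.
same_name <- function(.) {
  delayedAssign(".", cos(.))
  .
}
# Capture the error message instead of aborting.
outcome <- tryCatch(same_name(0), error = function(e) conditionMessage(e))

# One name per step: each promise only refers to the previous step.
distinct_names <- function(.) {
  delayedAssign("._1", cos(.))
  delayedAssign("._2", sin(._1))
  sum(._2)
}
ok <- distinct_names(0)
```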

smbache commented Mar 2, 2016

I have now made the pipe in the simplified branch lazy, so e.g.

iris %>%
  subset(Species == "setosa") %>%
  stop("!") %>% 
  tryCatch(error = function(e) "An error")

and

"Some message to be suppressed" %>%
  message %>% 
  suppressMessages

both work. All tests still pass.

It works with a small wrapper around delayedAssign called let for brevity. I tried to use := but it would not "print" the source code of the resulting pipeline function as ._2 := rhs(._1).
As this also hints, to make lazy assignment possible, each step needs its own LHS name internally, but this also allows for each step being inspected when debugging (further improving on the issue of debugging with pipes). The user always just uses ., of course.

The pros:

  • Lazy evaluation opens the door for some more functions available for pipelines.
  • Debugging is easier

Cons:

  • Perhaps the ._1, ._2, ... temp names are a little confusing.
  • I guess it adds some small overhead, although I don't know how much. Maybe you know how the memory footprint compares to reusing the same name and overwriting it.
  • The debug_pipe call is a little weird now... Perhaps @gaborcsardi will have an idea how to start the browser in the pipeline itself, rather than one layer up? :)

Here's an example of how a pipeline function is constructed:

> . %>% cos %>% sin %>% sum
function (.) 
{
    let(._1, cos(.))
    let(._2, sin(._1))
    sum(._2)
}
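
For readers who want to run the printed function: the comment only says that `let` is a small wrapper around delayedAssign(), so the definition below is a hypothetical reconstruction, not the code from the simplified branch. `guarded` is an extra illustrative function showing that a failing step is deferred.

```r
# Hypothetical `let`: lazily bind `name` to `expr` in the caller's frame.
let <- function(name, expr) {
  nm  <- as.character(substitute(name))  # e.g. "._1"
  ex  <- substitute(expr)                # unevaluated step, e.g. cos(.)
  env <- parent.frame()                  # frame of the pipeline function
  eval(bquote(delayedAssign(.(nm), .(ex), eval.env = env, assign.env = env)))
  invisible(NULL)
}

# The pipeline function printed above, written out by hand.
pipeline <- function(.) {
  let(._1, cos(.))
  let(._2, sin(._1))
  sum(._2)
}

# A failing step only runs once tryCatch() forces it.
guarded <- function(.) {
  let(._1, stop("!"))
  tryCatch(._1, error = function(e) "An error")
}

pipeline(0)
guarded(0)
```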

Let me know your opinion...

hadley commented Mar 2, 2016

As a strawman, what if the output looked like this:

> . %>% cos %>% sin %>% sum
function (..0) {
  ..1 %<lazy-% cos(..0)
  ..2 %<lazy-% sin(..1)
  sum(..2)
}

smbache commented Mar 3, 2016

I originally tried to use := as lazy assignment operator, but the output became something like:

> . %>% cos %>% sin %>% sum
function (..0) {
 `:=`(..1, cos(..0))
 `:=`(..2, sin(..1))
  sum(..2)
}

I actually also originally had the ..1 naming, but was a little uneasy with it since these have special meaning in connection with .... It's technically not an issue, but in the minds of people...

  1. Do you know how to make an operator appear as such in a constructed call?
  2. Is it still preferable to use ..1 rather than ._1? It's easier to type...

krlmlr commented Mar 3, 2016

  1. Looks like this is just how := is rendered on output:

    > `:=`<- `+`
    > 5 := 3
    [1] 8
    > ~(5 := 3)
    ~(`:=`(5, 3))
    
  2. _1 would be even more difficult to type, which (I think) is helpful in this case.

Not sure the memory overhead can be neglected, especially for long pipes.

Would you consider a distinct pipe symbol for lazy evaluation? %>% would be eager evaluation, %L>% would be lazy:

iris %>%
  subset(Species == "setosa") %L>%
  stop("!") %>% 
  tryCatch(error = function(e) "An error")

smbache commented Mar 3, 2016

  1. Right, I could use a %<-% type operator instead, seems that would work.
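
A quick check of that point, as a sketch: a call built programmatically with a %...% operator is deparsed in infix form, which is what the printed pipeline function needs. The operator does not have to be defined for deparsing; `step` and `shown` are illustrative names.

```r
# Build the assignment step the way a pipeline constructor might.
step <- call("%<-%", quote(._2), quote(sin(._1)))

# deparse() renders %op% calls as infix operators.
shown <- deparse(step)
print(shown)
```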

Re memory: not even sure it makes a difference. @hadley or @gaborcsardi, do you know?

Re additional pipe: hmm, not too keen on that, but my mind has been changed before. I think there needs to be a sufficient downside to just using lazy evaluation to make up for the hassle of an additional pipe... What about %$%, %T>%, %<>%? Would they get lazy versions too? I think it's a bad road to go down.

But I like the lazy evaluation: I think it is really (really) nice to be able to use tryCatch at the end of the pipeline!

krlmlr commented Mar 3, 2016

Memory: Think of a dplyr pipe using arrange(), group_by(), summarize(), ungroup() on a local data frame. In the "old" version of the simplified branch, only the current object is remembered at any point in time, the others are free to be garbage collected. In the current version, all intermediate results are memoized until the function returns. Some of the operations I mentioned change the data substantially (i.e., R's copy-on-write doesn't help much), so there will be a footprint.
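
The retention can be made visible without measuring memory. The sketch below assumes the steps are bound with delayedAssign() under separate names (the function name `lazy_steps` is illustrative): after the last step has run, every intermediate is still bound in the function's frame, so none of them can be collected yet.

```r
lazy_steps <- function(.) {
  delayedAssign("._1", cos(.))
  delayedAssign("._2", sin(._1))
  total <- sum(._2)
  # List the names still bound in this frame after the final step.
  list(total = total, live = ls(all.names = TRUE))
}

out <- lazy_steps(c(0, 1))
print(out$live)
```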

Eager pipes seem to work better with debug_pipe(), that's what I understood; and you don't need to keep the intermediate results. %T>% and %<>% don't need a lazy version imo; I don't know enough about %$% to judge.

smbache commented Mar 3, 2016

I'm not sure it's that straightforward, since it's lazy. Maybe one can make some realistic benchmarks.

krlmlr commented Mar 4, 2016

I'm pretty sure the current lazy pipeline keeps all intermediate stages; on the other hand this is useful for debugging. I've seen a continuous increase in memory usage with

f <- function(x) {
  gc()
  x <- c(x, NA)
  system(paste0("ps -p ", Sys.getpid(), " --format vsz"))
  x
}

in a pipeline like 1:1e7 %>% f %>% f %>% f; memory didn't increase in the pre-lazy "simplified" branch. (The system() call prints current virtual memory usage; the gc() calls make sure the measurements are valid.)

smbache commented Mar 5, 2016

Sure, forcing garbage collection will make one look better than the other...

Anyways, I made a new version of the lazy implementation that cleans up after itself, i.e. when a value is used in one step, it is rm'ed. This still works lazily. ._1 is available for the expression defining ._2 but not for the one defining ._3.

Also made it use %<-% rather than let (and it is this operation that is responsible for cleaning up).

Using ..1 caused issues, as R treats these names as special in relation to ... (the dots argument).

krlmlr commented Mar 5, 2016

Thanks. GC is not for tweaking the results, but to allow for accurate measurements.

smbache commented Mar 5, 2016

How likely is it that gc is being done during execution of a pipeline?

PS: one issue with the name increments is that functions that make use of argument names (e.g. plot) will no longer see a bare . as the name. It can be fixed, I think, but at the cost of function-call overhead...
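
A small illustration of the naming issue, assuming the steps are bound under incremented names: a function that captures its argument expression, as plot() does for axis labels, sees the temporary name instead of the dot. `label_of`, `bare_dot` and `numbered` are illustrative names.

```r
# Stand-in for a function that uses the argument expression as a label.
label_of <- function(x) deparse(substitute(x))

# Step that receives the bare dot.
bare_dot <- function(.) label_of(.)

# Step that receives an incremented temporary name.
numbered <- function(.) {
  delayedAssign("._1", .)
  label_of(._1)
}

bare_dot(1:3)
numbered(1:3)
```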

smbache commented Mar 5, 2016

Will you make your test with this new approach?

krlmlr commented Mar 5, 2016

Still the same behavior. Perhaps removing in on.exit() is too late.

Your original approach -- a sequence of function calls -- might be not that bad in the end. Could something like this work:

. %>% sin %>% { . - 5 } %>% abs
## function(.) abs(._f2(sin(.)))

with ._f2 <- function(.) . - 5 ?
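
Written out by hand, the proposal would look roughly like this. It is a sketch only: the helper name `._f2` follows the printed function, and its body is taken from the braced step `{ . - 5 }` in the pipeline. Plain functions are called directly; only the braced step gets a generated helper.

```r
# Helper generated for the braced step `{ . - 5 }`.
._f2 <- function(.) . - 5

# Plain functions (sin, abs) are nested directly around the helper call.
pipeline <- function(.) abs(._f2(sin(.)))

value <- pipeline(0)
print(value)
```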

smbache commented Mar 6, 2016

It would be possible to construct the chain of function calls (unary functions of .) and retain the laziness. The price here is a nasty stack trace.

Essentially,

iris %>% 
  subset(Species == "setosa") %>% 
  na.omit %$% Sepal.Length %>% 
  sum

becomes something like:

(function (.) 
  (function (.) sum(.))(
    (function (.) with(., Sepal.Length))(
      (function (.) na.omit(.))(
        (function (.) subset(., Species == "setosa"))(.)
      )
    )
  )
)(iris)

Pros:

  • No temporary variables
  • the . retains its name in all calls
  • no recursion/looping
  • it's lazy

Con:

  • Debugging is not made any easier; the stack trace is still long and nested.
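
The laziness of the nested form follows from R passing arguments as promises. A two-step sketch (the names `lazy_chain` and `numeric_chain` are illustrative): the inner call fails, but it is only run when the outer function forces its `.`, which happens inside tryCatch().

```r
# stop("!") piped into tryCatch(), in the nested-function form.
lazy_chain <- function(.) {
  (function(.) tryCatch(., error = function(e) "An error"))(
    (function(.) stop("!"))(.)
  )
}
caught <- lazy_chain(1:3)

# Same shape for . %>% cos %>% sin %>% sum
numeric_chain <- function(.) {
  (function(.) sum(.))(
    (function(.) sin(.))(
      (function(.) cos(.))(.)
    )
  )
}
total <- numeric_chain(0)
```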

Perhaps a tamper approach could solve that problem? Or a specialized error function, so one could append tryCatch(error = pipe_error) at the end of a pipe...

krlmlr commented Mar 6, 2016

This looks convoluted. What are the drawbacks of my simpler approach from above (#120 (comment))?

smbache commented Mar 8, 2016

The . may be needed more than once in a call, in which case you need to wrap it in a function, or make temporary assignment. Figuring it out, and differentiating, seems worse than treating all RHSs the same...
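
To make the concern concrete, here is a sketch with a call counter; all names (`counter`, `upstream`, `inlined`, `wrapped`) are illustrative. If the previous step is substituted textually into a right-hand side that uses the dot twice, the upstream work runs twice; wrapping the step in a unary function runs it once.

```r
counter <- new.env()

# Stand-in for the previous pipeline step; counts how often it runs.
upstream <- function(x) {
  assign("n", get("n", envir = counter) + 1, envir = counter)
  x
}

# RHS `. * .` with the previous step substituted for each dot.
inlined <- function(.) upstream(.) * upstream(.)

# RHS wrapped in a unary function: the previous step is one promise.
wrapped <- function(.) (function(.) . * .)(upstream(.))

assign("n", 0, envir = counter)
a <- inlined(3)
n_inlined <- get("n", envir = counter)

assign("n", 0, envir = counter)
b <- wrapped(3)
n_wrapped <- get("n", envir = counter)
```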

krlmlr commented Mar 8, 2016

We could always wrap (except if a plain function is passed).

BTW, bcccc02 (which introduces lazy eval) breaks a dtplyr test (a warning is now issued where there was none before), and also a dplyr test. I haven't investigated further.

kevinykuo commented Apr 25, 2017
