
strange issue using pipes and qplot #62

Closed
casallas opened this issue Dec 30, 2014 · 16 comments · Fixed by #63

Comments

@casallas
Contributor

casallas commented Dec 30, 2014

All of the following calls

rnorm(1000) %>% qplot
rnorm(1000) %>% qplot(.)
rnorm(1000) %>% {qplot(.)}

yield the following error:

Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : 
  cannot coerce class "c("gg", "ggplot")" to a data.frame

Strangely, however, the following works

rnorm(1000) %>% {x=.; qplot(x)}

but that breaks the pipe magic... Something else that works is debug(qplot), then calling str(x) while debugging. This makes me think there may be some sort of lazy evaluation happening (perhaps in ggplot2?) that doesn't play nicely with the . placeholder.

I can confirm that this wasn't an issue in magrittr 1.1.0.
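As a guess at the mechanism, here is a toy sketch (the name make_lazy is made up for illustration, and this is not ggplot2's actual code): an argument stays an unforced promise, and a promise is evaluated in its original environment, so rebinding the variable before the promise is forced changes what the late evaluation sees.

```r
# Hypothetical sketch of the suspected lazy-evaluation interaction.
make_lazy <- function(x) function() x   # x is captured but not yet evaluated
x <- 1
f <- make_lazy(x)
x <- "rebound"                          # rebind before the promise is forced
f()                                     # returns "rebound", not 1
```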

@smbache
Member

smbache commented Dec 30, 2014

This is indeed weird! In freduce (an internal magrittr function) there is a line:

value <- withVisible(function_list[[k]](value))

In debug mode, evaluating the rhs of that expression works fine, but including the value <- part messes it up! I'm not sure whether the behaviour comes from qplot, withVisible, or a combination. But changing the above line to

result <- withVisible(function_list[[k]](value))

(and using result rather than value in the next line of the same function) will make everything work.

@hadley do you know why this happens?

@smbache
Member

smbache commented Dec 30, 2014

Here's a minimal example that replicates the behaviour without magrittr:

library(ggplot2)
x <- rnorm(100)
x <- withVisible(qplot(x))

> x
$value
Don't know how to automatically pick scale for object of type list. Defaulting to continuous
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : 
  cannot coerce class "c("gg", "ggplot")" to a data.frame

changing the last line to

y <- withVisible(qplot(x))

makes it work. I guess in the magrittr case it is a pretty easy fix, but I'm not sure whether it is the best place to do it. Perhaps...

@smbache
Member

smbache commented Dec 30, 2014

Actually, you don't even need withVisible... So as you said, it is probably some lazy / recursive behaviour that causes it.

@casallas
Contributor Author

casallas commented Dec 30, 2014

That fix in freduce solved it! But I imagine that replacing

value <- withVisible(function_list[[k]](value))
if (value[["visible"]]) value[["value"]] else invisible(value[["value"]])

with

ans <- withVisible(function_list[[k]](value))
if (ans[["visible"]]) ans[["value"]] else invisible(ans[["value"]])

unnecessarily allocates the memory twice, which may be a problem for big objects, but what about

with(withVisible(function_list[[k]](value)),
    if (visible) value else invisible(value))

? Would that have some memory overhead as well? Or some other shady side-effect of using with?

@hadley
Member

hadley commented Dec 30, 2014

@casallas why do you think there are two allocations happening there? I can only see one (depending on which branch of the if is taken)

@casallas
Contributor Author

casallas commented Dec 31, 2014

I guess a major drawback in the with solution is readability

@casallas
Contributor Author

casallas commented Dec 31, 2014

@hadley I meant that in the proposed

ans <- withVisible(function_list[[k]](value))
if (ans[["visible"]]) ans[["value"]] else invisible(ans[["value"]])

you would have both ans and value in memory at the same time. Am I wrong?

casallas added a commit to casallas/magrittr that referenced this issue Dec 31, 2014
@casallas
Contributor Author

casallas commented Dec 31, 2014

OK, never mind: R is smarter than I thought. The memory doesn't seem to be allocated twice in any of the variants, and they all seem to take roughly the same time, at least based on my rudimentary benchmark. Sorry for the noise.
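For what it's worth, base R's tracemem() shows this directly: plain assignment does not duplicate a vector, it just adds a second name for the same object (a sketch; the object size is arbitrary).

```r
x <- rnorm(1e6)
tracemem(x)    # prints a message whenever this vector is duplicated
ans <- x       # silent: plain assignment shares the object, no copy
ans[1]         # silent: reading doesn't duplicate either
untracemem(x)
```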

@smbache
Member

smbache commented Dec 31, 2014

Maybe a recursive approach for freduce is nicer altogether...

@smbache
Member

smbache commented Dec 31, 2014

A recursive version is a little aesthetically nicer, but marginally slower. Here's an implementation:

freduce <- function (value, function_list) 
{
  k <- length(function_list)
  if (k == 1L) {
    result <- withVisible(function_list[[1L]](value))
    if (result[["visible"]]) 
      result[["value"]]
    else 
      invisible(result[["value"]])
  } else {
    Recall(function_list[[1L]](value), function_list[-1L])
  }
}

@casallas
Contributor Author

casallas commented Dec 31, 2014

That one looks great! I didn't see any performance issues, and it also handles the case where qplot is not at the end of the chain, e.g.

rnorm(1000) %>% qplot %>% print

which wasn't possible with the iterative solution.
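A quick standalone sanity check of the recursive freduce above, using ordinary functions in place of qplot (the function list here is made up):

```r
# Assumes the recursive freduce() from the previous comment is defined.
fns <- list(function(v) v + 1,
            function(v) v * 2)
freduce(1, fns)   # applies the functions left to right: (1 + 1) * 2 = 4
```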

@hadley
Member

hadley commented Dec 31, 2014

(FWIW qplot() was written before I really understood NSE, so I'm not that surprised that it has strange failure modes)

@smbache
Member

smbache commented Dec 31, 2014

I don't believe you ;-)

casallas added a commit to casallas/magrittr that referenced this issue Jan 13, 2015
@gaborcsardi
Member

gaborcsardi commented Jul 3, 2015

[This was closed some time ago, but my question is really related to the recursive version.]

Do you have an estimate or just a guess what this means for memory allocation along the pipe?

To elaborate, without promises, the recursive version allocates all values along the pipe, at the same time.

Now, we of course have promises, so quite often this does not happen; a promise is simply pushed through the pipe, and the value is only evaluated at the last stage.

With the for version, there are no promises, but there is no nesting either, so no memory accumulation along the pipe.

[@casallas is this what you meant by allocating the memory twice?]
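The promise behaviour in question can be sketched in isolation: an argument expression is not evaluated when the function is entered, only when the function first uses it.

```r
# Sketch of lazy evaluation: the promise is forced at first use, not at entry.
chain <- function(x) {
  message("chain entered; argument not yet evaluated")
  x   # first use of x: the promise is forced here
}
chain({ message("argument evaluated now"); 42 })
# prints "chain entered..." before "argument evaluated now", then returns 42
```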

@smbache
Member

smbache commented Jul 4, 2015

No, I'm not sure how costly it is. Recursion is surely less efficient, but the cost is linear in the number of pipe stages (and almost always negligible). However, there were examples where the loop version failed for subtle reasons. I'm sure @kmillar and crew will get the recursion overhead down ;)

Did you experience trouble with the recursive version?

@gaborcsardi
Member

gaborcsardi commented Jul 4, 2015

@smbache No, I haven't seen trouble, so, yes, this is theoretical.

I am not worried about the recursion per se, that's fine with me. I am "worried" about having a value at each level of the recursion.

Think about a typical use case of putting a (largish) data frame into a pipe, and manipulating and transforming it at each step. If you write it without pipes:

result <- manipulate1(DF)
result <- manipulate2(result)
result <- manipulate3(result)
...

then there is only one copy of result. (Well, usually two really, but the point is that it is O(1).)

However if you write it with pipes, then, because of the recursion, there will be a separate value at each level of the recursion. (AFAIK there is no tail-recursion optimization in R.)
If these values are large data frames, then you'll end up with multiple copies of them. I.e. if you have a pipe with 10 stages, you store the "data" 10 times, at the same time.

Of course this is usually not true because of the promises. If there are no side effects, R is just passing value as a promise down the pipe, and everything works effectively the same way as the non-pipey version.

So my question is: how often can we push a promise to the end of the pipe?

In other words, I am fairly sure that by introducing side effects, I can make R store the data as many times as there are stages in the pipe, so I can "break" it. But does it really happen in real use cases? Do people have side effects in the pipe stages?

I guess it might be also a good idea to promote pipey code without side effects. E.g. if the manual, readmes, vignettes, etc. have code with side effects, maybe we can reconsider them.
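The copy question can also be probed with tracemem(): a pure pass-through stage doesn't duplicate the object, while a stage that writes into its argument triggers copy-on-modify (a sketch; the stage functions are hypothetical).

```r
big <- rnorm(1e5)
tracemem(big)                            # report any duplication of big
passthrough <- function(d) d             # pure: the same object flows through
modify <- function(d) { d[1] <- 0; d }   # writing into d forces a duplicate
out1 <- passthrough(passthrough(big))    # silent: nothing copied
out2 <- modify(big)                      # tracemem reports a duplication here
untracemem(big)
```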
