Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cross_df() is extremely slow compared to expand.grid() and crossing() #406

Closed
huftis opened this issue Nov 7, 2017 · 13 comments
Closed
Labels
bug an unexpected problem or unintended behavior

Comments

@huftis
Copy link
Contributor

huftis commented Nov 7, 2017

The cross_df() function seems to be extremely slow compared to the expand.grid() function from the base package and the crossing() function from the tidyr package. Example:

library(purrr)
library(tidyr)

k = 12
x = rep(list(1:3), k) %>%
  setNames(LETTERS[1:k])

system.time({ x %>% cross_df() })
#>    user  system elapsed 
#>   59.17    0.19   59.57
system.time({ x %>% expand.grid() })
#>    user  system elapsed 
#>    0.08    0.00    0.08
system.time({ x %>% invoke(crossing, .) })
#>    user  system elapsed 
#>    0.39    0.07    0.85

As you see, cross_df() takes almost a minute performing basically the same operation that expand.grid() and crossing() do in less than a second. (The output of the three functions are essentially the same, though they differ in returning a tibble or a data frame, and in the order of the rows.)

@hadley
Copy link
Member

hadley commented Feb 4, 2018

cross() and friends now feels out of scope in purrr to me. @lionel- what do you think?

@lionel-
Copy link
Member

lionel- commented Feb 5, 2018

I'm not sure where they belong. Would data frame stuff belong to vctrs?

@AshesITR
Copy link
Contributor

A profvis run suggests major overhead due to garbage collection.
If this is to be fixed, I suggest avoiding the loop over all result rows in cross() and use .filter outside of the construction - we are allocating the full memory for out anyway so there is no benefit to filter early.

@sanjmeh
Copy link

sanjmeh commented Mar 17, 2018

expand.grid is orders of magnitude faster ( 100 times faster) than cross().
What's the advantage of cross() other than we get a list instead of a data frame?

@lionel-
Copy link
Member

lionel- commented Mar 17, 2018

I think the idea was to take up less memory when crossing atomic vectors with a filtering function. We should revisit.

@hadley hadley added feature a feature request or enhancement and removed help-wanted feature a feature request or enhancement labels May 5, 2018
@lionel- lionel- added bug an unexpected problem or unintended behavior and removed performance labels Nov 29, 2018
@hadley
Copy link
Member

hadley commented Dec 17, 2018

Code to profile:

k <- 11
x <- rep(list(1:3), k) %>% setNames(LETTERS[1:k])

profvis::profvis(purrr::cross(x))

@cristianvaldez
Copy link

how tidyr::crossing is implemented, i got the feeling that is really slow with large data sets.

@grayskripko
Copy link

the problem is still here.

suppressPackageStartupMessages(library(tidyverse))
system.time(cross_df(list(a=1:300, b=1:300)))
#>    user  system elapsed 
#>    0.81    0.01    0.83
system.time(expand.grid(a=1:300, b=1:300))
#>    user  system elapsed 
#>       0       0       0

I wanted to create an issue but found this old one.
What is its status?

@grayskripko
Copy link

grayskripko commented Nov 6, 2019

maybe it's a strange suggestion but it would be great to mark somehow such functions which have known efficiency problems. Kind of "questioning" label but about efficiency. I lost a couple of hours just to know that the current implementation of cross_df is much worse than the basic expand.grid and I should stick with the latter.

@lionel-
Copy link
Member

lionel- commented Nov 6, 2019

A turtle label?

We should probably just deprecate cross_df() in favour of tidyr::expand_grid().

@grayskripko
Copy link

@lionel-
"A turtle label?"
Yes, it sounds like a bright image.

"We should probably just deprecate cross_df() in favour of tidyr::expand_grid()"
What about other cross_ functions? Are they effective enough?

@lionel-
Copy link
Member

lionel- commented Nov 7, 2019

Probably not.

@lionel-
Copy link
Member

lionel- commented Jun 5, 2020

We'll likely deprecate cross() and cross_df() in favour of tidyr::expand_grid(). See #768.

@lionel- lionel- closed this as completed Jun 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

7 participants