filtering wide tibble is slow #3335
Thanks. The problem is that we need to set up the data mask/overscope/bindr environment for all the columns in advance. Using object tables would help here (#2922), but we're not there yet.
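A hypothetical sketch (not dplyr's actual internals) of the set-up cost described above: before an expression can be evaluated, one binding per column must exist in the mask environment, so the set-up work grows with the number of columns. `make_mask` and the column names here are made up for illustration.

```r
# Hypothetical sketch of per-column mask set-up (not dplyr's real code):
# one active binding per column must be installed before evaluation,
# so set-up work scales with ncol(df).
make_mask <- function(df) {
  env <- new.env(parent = emptyenv())
  for (nm in names(df)) {
    local({
      col <- nm  # freeze the current column name in this closure
      makeActiveBinding(col, function() df[[col]], env)
    })
  }
  env
}

df <- as.data.frame(matrix(rnorm(10 * 1000), nrow = 10))  # columns V1..V1000
mask <- make_mask(df)
length(ls(mask))         # one binding per column
identical(mask$V1, df$V1)
```

Touching `mask$V1` triggers the active binding, which looks the column up in the original data frame; the expensive part is installing all the bindings up front, even for columns the expression never uses.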
There is some empirical evidence that the run time has a quadratic asymptote as the number of columns grows large. That could be good news, since none of the set-up steps necessarily implies a super-linear run time. Such unexpected quadratic run times are typical of appending to a list that was not pre-allocated, and those sorts of issues are often minor and easy to fix once found (some notes).
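The append-without-pre-allocation pattern mentioned above can be illustrated with a toy example (this is not dplyr's code): growing a list with `c()` copies the existing elements on every append, turning a linear loop quadratic, while allocating the list once keeps it linear.

```r
# Toy illustration of the suspected quadratic pattern (not dplyr internals).
grow_naive <- function(n) {
  out <- list()
  for (i in seq_len(n)) {
    out <- c(out, list(i))  # copies the whole list on every append: O(n^2)
  }
  out
}

grow_prealloc <- function(n) {
  out <- vector("list", n)  # one allocation up front
  for (i in seq_len(n)) {
    out[[i]] <- i           # fill in place: O(n)
  }
  out
}

# Both produce the same result; only the asymptotics differ.
system.time(grow_naive(20000))
system.time(grow_prealloc(20000))
```

Finding and fixing one such loop is usually a small, localized change, which is why this would be the optimistic explanation for the observed curve.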
@JohnMount: Thanks a lot. Does the graph change if you
The dev version of bindr makes things a bit better, but there's only so much we can do. It appears that using substantially more than 10,000 columns will currently cause friction, which can only be improved by using an entirely different approach internally. Can you
@krlmlr I just re-ran it with an explicit
It's strange that the quadratic behavior kicks in so early on your system, but ultimately it doesn't matter. The dev version of bindr helps a little; to do better we need #2922.
Thanks for updating the plot!
I really think it might pay off to consider the possibility of a simple, localized performance bug before re-architecting so much. The example is "extreme", but that makes it easy to check whether each piece is performing as expected.
I agree we can improve a bit more. I just revisited the implementation of active bindings in R core; we might be able to do better if the target environment had a larger hash table. We should also consider creating the bindings from C++ code. I'm open to contributions here, but I'd rather focus on object tables or ALTREP myself, because that will offer unparalleled performance.
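The hash-table idea above can be sketched with base R alone: `new.env()` takes a `size` argument that pre-sizes the environment's hash table, which matters when thousands of bindings will be installed into it. The counts here are illustrative, not taken from dplyr.

```r
# Sketch of pre-sizing an environment's hash table for many bindings.
n_cols <- 10000
env_sized <- new.env(hash = TRUE, size = n_cols)  # hint: expect n_cols entries

for (nm in paste0("col", seq_len(n_cols))) {
  assign(nm, TRUE, envir = env_sized)
}

prof <- env.profile(env_sized)  # base R: inspect the actual hash table
prof$size               # number of hash buckets in use
length(ls(env_sized))   # 10000 bindings installed
```

A table sized for the expected number of columns avoids long collision chains (visible via `env.profile()`) when every column gets its own binding.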
Thank you for looking into this and all the suggestions!
I have tried
Thanks. I wonder just how slow it is to
Here are my times for gather and nest for 1k x 500, 10k x 500, 20k x 500 and 30k x 500.

```r
library(dplyr)
library(tidyr)
library(purrr)

generate_df <- function(n_features, n_samples) {
  bind_cols(
    tibble(sample_id = paste0("sample_", 1:n_samples)),
    1:n_features %>%
      map(rnorm, n = n_samples) %>%
      map(as_tibble) %>%
      bind_cols()
  )
}

n_features <- c(1000, 10000, 20000, 30000)
dfs <- n_features %>%
  map(generate_df, n_samples = 500)
```

First gather:

```r
gather_system.time <- function(df) {
  system.time(gather(df, variable, value, -sample_id))
}
map(dfs, gather_system.time)
```

Then nest:

```r
nest_system.time <- function(df) {
  system.time(nest(df, -sample_id))
}
map(dfs, nest_system.time)
```
Thanks! The jump from 20k to 30k looks suspicious; could you please extend to 40k, 50k and maybe beyond? Maybe use fewer rows to make this a bit faster.
…t` method in a loop. more efficient `LazySubsets` constructor. #3335
Work in progress; install with `install_github("tidyverse/dplyr", ref = "feature-3335-wide-performance")`.
Performance fixes for wide data frames #3335
Getting these timings now.
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
I find that `filter`ing operations can be quite slow with wide tibbles. Here is an example of a 500 x 100,001 table (which is still quite modest), where the first column holds the `sample_id` information. When I use `filter` it is quite slow. Perhaps for this simple purpose, it is best to switch to a simple `which` operation. Is it always advised to work with long tables in dplyr?
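The `which()` workaround mentioned in the report can be sketched in base R (the dimensions and column names below are made up; the dplyr `filter()` side is omitted so the sketch has no package dependency):

```r
# Minimal sketch of the which()-based alternative to filter() on a
# wide table: plain row indexing on one column avoids any per-column
# mask set-up. Toy data, illustrative dimensions.
set.seed(1)
df <- as.data.frame(matrix(rnorm(500 * 100), nrow = 500))  # 100 feature columns
df$sample_id <- paste0("sample_", seq_len(500))

# Roughly equivalent to filter(df, sample_id %in% wanted):
wanted <- c("sample_1", "sample_2")
sub <- df[which(df$sample_id %in% wanted), , drop = FALSE]
nrow(sub)  # 2
```

The indexing path only ever touches the one column used in the condition, which is why it sidesteps the wide-table set-up cost discussed in this thread.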