Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filtering wide tibble is slow #3335

Closed
cnjr2 opened this issue Feb 3, 2018 · 16 comments
Closed

filtering wide tibble is slow #3335

cnjr2 opened this issue Feb 3, 2018 · 16 comments

Comments

@cnjr2
Copy link

@cnjr2 cnjr2 commented Feb 3, 2018

I find that filtering operations can be quite slow with wide tibbles.

Here is an example of a 500 x 100,001 table (which is still quite modest), where the first column has a sample_id information.

library(dplyr)
library(purrr)

n_samples <- 500
n_features <- 100000

df <- bind_cols(
  tibble(sample_id = paste0("sample_", 1:n_samples)),
  1:n_features %>%
    map(rnorm, n = n_samples) %>%
      map(as_tibble) %>%
      bind_cols()
)
df
# A tibble: 500 x 100,001
   sample_id  value value1 value2 value3 value4 value5 value6 value7 value8
   <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 sample_1   1.56   2.33   0.571   5.20   3.15   4.15   6.82   6.17  10.8
 2 sample_2   0.891  1.64   4.83    3.32   4.30   6.82   7.07   9.94   9.02
 3 sample_3  -0.125  1.39   2.89    1.66   4.53   5.15   4.75   8.59   9.98
 4 sample_4   0.239  0.789  1.92    5.52   6.93   4.82   7.22   9.37   8.75
 5 sample_5   0.337  3.61   1.33    2.84   4.55   6.84   5.67   8.19   8.35
 6 sample_6   1.50   2.96   4.08    2.63   6.12   6.41   6.86   7.73   9.21
 7 sample_7   0.887  1.37   4.02    5.53   3.29   7.68   7.63   8.72   8.24
 8 sample_8   1.27   2.06   2.56    4.87   4.32   5.73   9.13   7.75   7.74
 9 sample_9  -0.565  2.19   3.23    3.01   3.45   6.18   7.63   8.35  10.2
10 sample_10  1.67   1.10   1.20    4.94   4.33   7.46   7.08   7.71   8.61
# ... with 490 more rows, and 99,991 more variables: value9 <dbl>,
#   value10 <dbl>, value11 <dbl>, value12 <dbl>, value13 <dbl>, value14 <dbl>,
#   value15 <dbl>, value16 <dbl>, value17 <dbl>, value18 <dbl>, value19 <dbl>,
#   value20 <dbl>, value21 <dbl>, value22 <dbl>, value23 <dbl>, value24 <dbl>,
#   value25 <dbl>, value26 <dbl>, value27 <dbl>, value28 <dbl>, value29 <dbl>,
#   value30 <dbl>, value31 <dbl>, value32 <dbl>, value33 <dbl>, value34 <dbl>,
#   value35 <dbl>, value36 <dbl>, value37 <dbl>, value38 <dbl>, value39 <dbl>,
#   value40 <dbl>, value41 <dbl>, value42 <dbl>, value43 <dbl>, value44 <dbl>,
#   value45 <dbl>, value46 <dbl>, value47 <dbl>, value48 <dbl>, value49 <dbl>,
#   value50 <dbl>, value51 <dbl>, value52 <dbl>, value53 <dbl>, value54 <dbl>,
#   value55 <dbl>, value56 <dbl>, value57 <dbl>, value58 <dbl>, value59 <dbl>,
#   value60 <dbl>, value61 <dbl>, value62 <dbl>, value63 <dbl>, value64 <dbl>,
#   value65 <dbl>, value66 <dbl>, value67 <dbl>, value68 <dbl>, value69 <dbl>,
#   value70 <dbl>, value71 <dbl>, value72 <dbl>, value73 <dbl>, value74 <dbl>,
#   value75 <dbl>, value76 <dbl>, value77 <dbl>, value78 <dbl>, value79 <dbl>,
#   value80 <dbl>, value81 <dbl>, value82 <dbl>, value83 <dbl>, value84 <dbl>,
#   value85 <dbl>, value86 <dbl>, value87 <dbl>, value88 <dbl>, value89 <dbl>,
#   value90 <dbl>, value91 <dbl>, value92 <dbl>, value93 <dbl>, value94 <dbl>,
#   value95 <dbl>, value96 <dbl>, value97 <dbl>, value98 <dbl>, value99 <dbl>,
#   value100 <dbl>, value101 <dbl>, value102 <dbl>, value103 <dbl>,
#   value104 <dbl>, value105 <dbl>, value106 <dbl>, value107 <dbl>,
#   value108 <dbl>, …

When I use filter it is quite slow:

system.time(fetched_sample <- filter(df, sample_id == "sample_1"))
user  system elapsed
165.060   0.110 165.446

Perhaps for this simple purpose, it is best to switch to a simple which operation.

system.time(fetched_sample <- df[which(df$sample_id == "sample_1"),])
> system.time(fetched_sample <- df[which(df$sample_id == "sample_1"),])
   user  system elapsed
  0.050   0.000   0.054

Is it always advised to work with long tables in dplyr?

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 3, 2018

Thanks. The problem is that we need to set up the data mask/overscope/bindr environment for all the columns in advance. Using objectables would help here (#2922), but we're not there yet.

@JohnMount
Copy link

@JohnMount JohnMount commented Feb 4, 2018

There is some empirical evidence that the run-time may have a quadratic asymptote as the number of columns grows large. That could be good news as it seems probable that none of the set-up steps necessarily imply super-linear run-times. Such unexpected quadratic run times are typical of appending to a not pre-allocated list, and those sorts of issues are often minor and easy to fix once found (some notes).

shape-1

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 5, 2018

@JohnMount: Thanks a lot. Does the graph change if you gc() before calling filter()?

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 5, 2018

The dev version of bindr makes things a bit better, but there's only so much we can do. It appears that using substantially more than 10000 columns currently will cause friction, which only can be improved by using an entirely different approach internally. Can you gather() or nest() the data?

@JohnMount
Copy link

@JohnMount JohnMount commented Feb 5, 2018

@krlmlr I just re-ran it with an explicit gc() outside the timing portion. Structurally quite similar.

shape-1

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 5, 2018

It's strange that the quadratic behavior kicks in so early on your system, but ultimately it doesn't matter. The dev version of bindr helps a little, to do better we need #2922.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 5, 2018

Thanks for updating the plot!

@JohnMount
Copy link

@JohnMount JohnMount commented Feb 5, 2018

I really think it might payoff to consider the possibility of the presence of simple, localized, performance bug before re-architecting so much. The example is "extreme", but that makes it easy to see if each piece is performing as expected.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 5, 2018

I agree we can improve a bit more. I just revisited the implementation of active bindings in the R core, we might be able to do better if the target environment had a larger hash table. Also, we should consider creating the bindings from C++ code via R_MakeActiveBinding().

I'm open to contributions here, but I'd rather focus on object tables or ALTREP myself because that will offer unparalleled performance.

@cnjr2
Copy link
Author

@cnjr2 cnjr2 commented Feb 6, 2018

Thank you for looking into this and all the suggestions!

Can you gather() or nest() the data?

I have tried gather()ing, but it was slow and thus I decided to use purrr to work with the wide table. That has been working very well, so I am happy :)

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 6, 2018

Thanks. I wonder just how slow it is to gather() or nest a wide 1k x 10, 10k x 10, 100k x 10 tibble.

@cnjr2
Copy link
Author

@cnjr2 cnjr2 commented Feb 6, 2018

Here are my times for gather and nest for 1k x 500, 10k x 500, 20k x 500 and 30k X 500.

library(dplyr)
library(tidyr)
library(purrr)

generate_df <- function(n_features, n_samples){
  bind_cols(
    tibble(sample_id = paste0("sample_", 1:n_samples)),
    1:n_features %>%
      map(rnorm, n = n_samples) %>%
        map(as_tibble) %>%
        bind_cols()
  )
}

n_features <- c(1000, 10000, 20000, 30000)

dfs <- n_features %>%
  map(generate_df, n_samples = 500)

First gather:

gather_system.time <- function(df){
  system.time(gather(df, variable, value, -sample_id))
}

map(dfs, gather_system.time)
[[1]]
   user  system elapsed
  0.070   0.000   0.071

[[2]]
   user  system elapsed
  1.800   0.030   1.868

[[3]]
   user  system elapsed
  6.620   0.040   6.699

[[4]]
   user  system elapsed
 14.660   0.450  15.211

Then nest:

nest_system.time <- function(df){
  system.time(nest(df, -sample_id))
}

map(dfs, nest_system.time)
[[1]]
   user  system elapsed
  0.520   0.000   0.526

[[2]]
   user  system elapsed
  7.250   0.000   7.286

[[3]]
   user  system elapsed
 14.980   0.140  15.143

[[4]]
   user  system elapsed
 26.390   3.210  30.316

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 6, 2018

Thanks! The jump from 20k to 30k looks suspicious, could you please extend to 40k, 50k and maybe beyond? Maybe use fewer rows to make this a bit faster.

romainfrancois added a commit that referenced this issue Apr 25, 2018
…t` method in a loop.

more efficient `LazySubsets` constructor.

#3335
@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Apr 25, 2018

Work in progress in the feature-3335-wide-performance branch, in case someone wants to have a look:

install_github("tidyverse/dplyr", ref = "feature-3335-wide-performance" )

romainfrancois added a commit that referenced this issue May 3, 2018
…t` method in a loop.

more efficient `LazySubsets` constructor.

#3335
romainfrancois added a commit that referenced this issue May 3, 2018
Performance fixes for wide data frames #3335
@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Sep 13, 2018

Getting these timings now.

> system.time(fetched_sample <- filter(df, sample_id == "sample_1"))
   user  system elapsed 
  0.166   0.006   0.172 
> 
> system.time(fetched_sample <- df[which(df$sample_id == "sample_1"),])
   user  system elapsed 
  0.145   0.011   0.157 

@lock
Copy link

@lock lock bot commented Mar 12, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Mar 12, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants