filtering wide tibble is slow #3335

cnjr2 · 2018-02-03T18:24:15Z

I find that filtering operations can be quite slow with wide tibbles.

Here is an example of a 500 x 100,001 table (which is still quite modest), where the first column has a sample_id information.

library(dplyr)
library(purrr)

n_samples <- 500
n_features <- 100000

df <- bind_cols(
  tibble(sample_id = paste0("sample_", 1:n_samples)),
  1:n_features %>%
    map(rnorm, n = n_samples) %>%
      map(as_tibble) %>%
      bind_cols()
)
df

# A tibble: 500 x 100,001
   sample_id  value value1 value2 value3 value4 value5 value6 value7 value8
   <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 sample_1   1.56   2.33   0.571   5.20   3.15   4.15   6.82   6.17  10.8
 2 sample_2   0.891  1.64   4.83    3.32   4.30   6.82   7.07   9.94   9.02
 3 sample_3  -0.125  1.39   2.89    1.66   4.53   5.15   4.75   8.59   9.98
 4 sample_4   0.239  0.789  1.92    5.52   6.93   4.82   7.22   9.37   8.75
 5 sample_5   0.337  3.61   1.33    2.84   4.55   6.84   5.67   8.19   8.35
 6 sample_6   1.50   2.96   4.08    2.63   6.12   6.41   6.86   7.73   9.21
 7 sample_7   0.887  1.37   4.02    5.53   3.29   7.68   7.63   8.72   8.24
 8 sample_8   1.27   2.06   2.56    4.87   4.32   5.73   9.13   7.75   7.74
 9 sample_9  -0.565  2.19   3.23    3.01   3.45   6.18   7.63   8.35  10.2
10 sample_10  1.67   1.10   1.20    4.94   4.33   7.46   7.08   7.71   8.61
# ... with 490 more rows, and 99,991 more variables: value9 <dbl>,
#   value10 <dbl>, value11 <dbl>, value12 <dbl>, value13 <dbl>, value14 <dbl>,
#   value15 <dbl>, value16 <dbl>, value17 <dbl>, value18 <dbl>, value19 <dbl>,
#   value20 <dbl>, value21 <dbl>, value22 <dbl>, value23 <dbl>, value24 <dbl>,
#   value25 <dbl>, value26 <dbl>, value27 <dbl>, value28 <dbl>, value29 <dbl>,
#   value30 <dbl>, value31 <dbl>, value32 <dbl>, value33 <dbl>, value34 <dbl>,
#   value35 <dbl>, value36 <dbl>, value37 <dbl>, value38 <dbl>, value39 <dbl>,
#   value40 <dbl>, value41 <dbl>, value42 <dbl>, value43 <dbl>, value44 <dbl>,
#   value45 <dbl>, value46 <dbl>, value47 <dbl>, value48 <dbl>, value49 <dbl>,
#   value50 <dbl>, value51 <dbl>, value52 <dbl>, value53 <dbl>, value54 <dbl>,
#   value55 <dbl>, value56 <dbl>, value57 <dbl>, value58 <dbl>, value59 <dbl>,
#   value60 <dbl>, value61 <dbl>, value62 <dbl>, value63 <dbl>, value64 <dbl>,
#   value65 <dbl>, value66 <dbl>, value67 <dbl>, value68 <dbl>, value69 <dbl>,
#   value70 <dbl>, value71 <dbl>, value72 <dbl>, value73 <dbl>, value74 <dbl>,
#   value75 <dbl>, value76 <dbl>, value77 <dbl>, value78 <dbl>, value79 <dbl>,
#   value80 <dbl>, value81 <dbl>, value82 <dbl>, value83 <dbl>, value84 <dbl>,
#   value85 <dbl>, value86 <dbl>, value87 <dbl>, value88 <dbl>, value89 <dbl>,
#   value90 <dbl>, value91 <dbl>, value92 <dbl>, value93 <dbl>, value94 <dbl>,
#   value95 <dbl>, value96 <dbl>, value97 <dbl>, value98 <dbl>, value99 <dbl>,
#   value100 <dbl>, value101 <dbl>, value102 <dbl>, value103 <dbl>,
#   value104 <dbl>, value105 <dbl>, value106 <dbl>, value107 <dbl>,
#   value108 <dbl>, …

When I use filter it is quite slow:

system.time(fetched_sample <- filter(df, sample_id == "sample_1"))

user  system elapsed
165.060   0.110 165.446

Perhaps for this simple purpose, it is best to switch to a simple which operation.

system.time(fetched_sample <- df[which(df$sample_id == "sample_1"),])

> system.time(fetched_sample <- df[which(df$sample_id == "sample_1"),])
   user  system elapsed
  0.050   0.000   0.054

Is it always advised to work with long tables in dplyr?

The text was updated successfully, but these errors were encountered:

krlmlr · 2018-02-03T19:55:40Z

Thanks. The problem is that we need to set up the data mask/overscope/bindr environment for all the columns in advance. Using objectables would help here (#2922), but we're not there yet.

JohnMount · 2018-02-04T18:10:53Z

There is some empirical evidence that the run-time may have a quadratic asymptote as the number of columns grows large. That could be good news as it seems probable that none of the set-up steps necessarily imply super-linear run-times. Such unexpected quadratic run times are typical of appending to a not pre-allocated list, and those sorts of issues are often minor and easy to fix once found (some notes).

krlmlr · 2018-02-05T07:41:18Z

@JohnMount: Thanks a lot. Does the graph change if you gc() before calling filter()?

krlmlr · 2018-02-05T07:58:55Z

The dev version of bindr makes things a bit better, but there's only so much we can do. It appears that using substantially more than 10000 columns currently will cause friction, which only can be improved by using an entirely different approach internally. Can you gather() or nest() the data?

JohnMount · 2018-02-05T13:49:20Z

@krlmlr I just re-ran it with an explicit gc() outside the timing portion. Structurally quite similar.

krlmlr · 2018-02-05T14:42:43Z

It's strange that the quadratic behavior kicks in so early on your system, but ultimately it doesn't matter. The dev version of bindr helps a little, to do better we need #2922.

krlmlr · 2018-02-05T14:42:55Z

Thanks for updating the plot!

JohnMount · 2018-02-05T14:53:35Z

I really think it might payoff to consider the possibility of the presence of simple, localized, performance bug before re-architecting so much. The example is "extreme", but that makes it easy to see if each piece is performing as expected.

krlmlr · 2018-02-05T15:07:32Z

I agree we can improve a bit more. I just revisited the implementation of active bindings in the R core, we might be able to do better if the target environment had a larger hash table. Also, we should consider creating the bindings from C++ code via R_MakeActiveBinding().

I'm open to contributions here, but I'd rather focus on object tables or ALTREP myself because that will offer unparalleled performance.

cnjr2 · 2018-02-06T17:02:21Z

Thank you for looking into this and all the suggestions!

Can you gather() or nest() the data?

I have tried gather()ing, but it was slow and thus I decided to use purrr to work with the wide table. That has been working very well, so I am happy :)

krlmlr · 2018-02-06T17:04:20Z

Thanks. I wonder just how slow it is to gather() or nest a wide 1k x 10, 10k x 10, 100k x 10 tibble.

cnjr2 · 2018-02-06T17:32:02Z

Here are my times for gather and nest for 1k x 500, 10k x 500, 20k x 500 and 30k X 500.

library(dplyr)
library(tidyr)
library(purrr)

generate_df <- function(n_features, n_samples){
  bind_cols(
    tibble(sample_id = paste0("sample_", 1:n_samples)),
    1:n_features %>%
      map(rnorm, n = n_samples) %>%
        map(as_tibble) %>%
        bind_cols()
  )
}

n_features <- c(1000, 10000, 20000, 30000)

dfs <- n_features %>%
  map(generate_df, n_samples = 500)

First gather:

gather_system.time <- function(df){
  system.time(gather(df, variable, value, -sample_id))
}

map(dfs, gather_system.time)

[[1]]
   user  system elapsed
  0.070   0.000   0.071

[[2]]
   user  system elapsed
  1.800   0.030   1.868

[[3]]
   user  system elapsed
  6.620   0.040   6.699

[[4]]
   user  system elapsed
 14.660   0.450  15.211

Then nest:

nest_system.time <- function(df){
  system.time(nest(df, -sample_id))
}

map(dfs, nest_system.time)

[[1]]
   user  system elapsed
  0.520   0.000   0.526

[[2]]
   user  system elapsed
  7.250   0.000   7.286

[[3]]
   user  system elapsed
 14.980   0.140  15.143

[[4]]
   user  system elapsed
 26.390   3.210  30.316

krlmlr · 2018-02-06T18:00:07Z

Thanks! The jump from 20k to 30k looks suspicious, could you please extend to 40k, 50k and maybe beyond? Maybe use fewer rows to make this a bit faster.

…t` method in a loop. more efficient `LazySubsets` constructor. #3335

romainfrancois · 2018-04-25T16:00:38Z

Work in progress in the feature-3335-wide-performance branch, in case someone wants to have a look:

install_github("tidyverse/dplyr", ref = "feature-3335-wide-performance" )

…t` method in a loop. more efficient `LazySubsets` constructor. #3335

Performance fixes for wide data frames #3335

romainfrancois · 2018-09-13T15:01:04Z

Getting these timings now.

> system.time(fetched_sample <- filter(df, sample_id == "sample_1"))
   user  system elapsed 
  0.166   0.006   0.172 
> 
> system.time(fetched_sample <- df[which(df$sample_id == "sample_1"),])
   user  system elapsed 
  0.145   0.011   0.157

lock · 2019-03-12T15:40:34Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

krlmlr added data frame performance 🚀 labels Feb 5, 2018

romainfrancois added a commit that referenced this issue Apr 25, 2018

additionnal SymbolMap constructor to avoid using the expensive `inser…

d21159a

…t` method in a loop. more efficient `LazySubsets` constructor. #3335

romainfrancois added the wip work in progress label Apr 25, 2018

romainfrancois mentioned this issue May 2, 2018

Performance fixes for wide data frames #3543

Merged

romainfrancois added a commit that referenced this issue May 3, 2018

additionnal SymbolMap constructor to avoid using the expensive `inser…

41faa44

…t` method in a loop. more efficient `LazySubsets` constructor. #3335

romainfrancois added a commit that referenced this issue May 3, 2018

Merge pull request #3543 from tidyverse/feature-3335-wide-performance

5da3d0b

Performance fixes for wide data frames #3335

romainfrancois mentioned this issue May 30, 2018

Performance of select_if() over wide dataframes #2932

Closed

romainfrancois closed this as completed Sep 13, 2018

lock bot locked and limited conversation to collaborators Mar 12, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

filtering wide tibble is slow #3335

filtering wide tibble is slow #3335

cnjr2 commented Feb 3, 2018

krlmlr commented Feb 3, 2018

JohnMount commented Feb 4, 2018 •

edited

Loading

krlmlr commented Feb 5, 2018

krlmlr commented Feb 5, 2018

JohnMount commented Feb 5, 2018

krlmlr commented Feb 5, 2018

krlmlr commented Feb 5, 2018

JohnMount commented Feb 5, 2018

krlmlr commented Feb 5, 2018

cnjr2 commented Feb 6, 2018

krlmlr commented Feb 6, 2018

cnjr2 commented Feb 6, 2018

krlmlr commented Feb 6, 2018

romainfrancois commented Apr 25, 2018 •

edited

Loading

romainfrancois commented Sep 13, 2018

lock bot commented Mar 12, 2019

filtering wide tibble is slow #3335

filtering wide tibble is slow #3335

Comments

cnjr2 commented Feb 3, 2018

krlmlr commented Feb 3, 2018

JohnMount commented Feb 4, 2018 • edited Loading

krlmlr commented Feb 5, 2018

krlmlr commented Feb 5, 2018

JohnMount commented Feb 5, 2018

krlmlr commented Feb 5, 2018

krlmlr commented Feb 5, 2018

JohnMount commented Feb 5, 2018

krlmlr commented Feb 5, 2018

cnjr2 commented Feb 6, 2018

krlmlr commented Feb 6, 2018

cnjr2 commented Feb 6, 2018

krlmlr commented Feb 6, 2018

romainfrancois commented Apr 25, 2018 • edited Loading

romainfrancois commented Sep 13, 2018

lock bot commented Mar 12, 2019

JohnMount commented Feb 4, 2018 •

edited

Loading

romainfrancois commented Apr 25, 2018 •

edited

Loading