filtering wide tibble is slow #3335

@cnjr2

Description

I find that filtering operations can be quite slow with wide tibbles.

Here is an example with a 500 x 100,001 table (which is still quite modest), where the first column holds the sample_id.

library(dplyr)
library(purrr)

n_samples <- 500
n_features <- 100000

df <- bind_cols(
  tibble(sample_id = paste0("sample_", 1:n_samples)),
  # one single-column tibble per feature; column i is rnorm(n_samples, mean = i),
  # and bind_cols() repairs the duplicated "value" names on the way out
  1:n_features %>%
    map(rnorm, n = n_samples) %>%
    map(as_tibble) %>%
    bind_cols()
)
df
# A tibble: 500 x 100,001
   sample_id  value value1 value2 value3 value4 value5 value6 value7 value8
   <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 sample_1   1.56   2.33   0.571   5.20   3.15   4.15   6.82   6.17  10.8
 2 sample_2   0.891  1.64   4.83    3.32   4.30   6.82   7.07   9.94   9.02
 3 sample_3  -0.125  1.39   2.89    1.66   4.53   5.15   4.75   8.59   9.98
 4 sample_4   0.239  0.789  1.92    5.52   6.93   4.82   7.22   9.37   8.75
 5 sample_5   0.337  3.61   1.33    2.84   4.55   6.84   5.67   8.19   8.35
 6 sample_6   1.50   2.96   4.08    2.63   6.12   6.41   6.86   7.73   9.21
 7 sample_7   0.887  1.37   4.02    5.53   3.29   7.68   7.63   8.72   8.24
 8 sample_8   1.27   2.06   2.56    4.87   4.32   5.73   9.13   7.75   7.74
 9 sample_9  -0.565  2.19   3.23    3.01   3.45   6.18   7.63   8.35  10.2
10 sample_10  1.67   1.10   1.20    4.94   4.33   7.46   7.08   7.71   8.61
# ... with 490 more rows, and 99,991 more variables: value9 <dbl>,
#   value10 <dbl>, value11 <dbl>, value12 <dbl>, value13 <dbl>, …
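
(Side note: the construction above calls as_tibble() on bare numeric vectors, which newer tibble releases deprecate. A minimal sketch of the same build using as_tibble_col() instead, assuming a tibble version that provides it:)

library(dplyr)
library(purrr)
library(tibble)

n_samples <- 500
n_features <- 100000

# same wide table; as_tibble_col() makes the column name explicit
# (note: newer bind_cols() may repair the duplicated names differently,
# e.g. value...2 instead of value1)
df <- bind_cols(
  tibble(sample_id = paste0("sample_", 1:n_samples)),
  1:n_features %>%
    map(rnorm, n = n_samples) %>%
    map(as_tibble_col, column_name = "value") %>%
    bind_cols()
)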

When I use filter(), it is quite slow:

system.time(fetched_sample <- filter(df, sample_id == "sample_1"))
   user  system elapsed
165.060   0.110 165.446
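
(One hedged way to probe this: if the cost comes from the number of columns rather than the rows, the same filter on a narrow selection should be fast. df_narrow below is just an illustrative name.)

# same filter, but only carrying the first 100 columns along
df_narrow <- select(df, 1:100)
system.time(filter(df_narrow, sample_id == "sample_1"))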

Perhaps for this simple purpose it is best to switch to a plain which() operation:

system.time(fetched_sample <- df[which(df$sample_id == "sample_1"),])
   user  system elapsed
  0.050   0.000   0.054

Is it always advised to work with long tables in dplyr?
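
(For reference, a minimal sketch of the long-table alternative that question points at, assuming a tidyr version with pivot_longer(); on older tidyr, gather() plays the same role. The feature/value column names are illustrative.)

library(tidyr)

# reshape 500 x 100,001 -> 50,000,000 x 3: one row per (sample_id, feature)
df_long <- pivot_longer(df, -sample_id, names_to = "feature", values_to = "value")

# the filter now only ever touches three columns
fetched_sample_long <- filter(df_long, sample_id == "sample_1")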
