I find that filtering operations can be quite slow on wide tibbles.
Here is an example with a 500 x 100,001 tibble (which is still quite modest), where the first column holds the sample ID.
library(dplyr)
library(purrr)

n_samples  <- 500
n_features <- 100000

df <- bind_cols(
  tibble(sample_id = paste0("sample_", 1:n_samples)),
  1:n_features %>%
    map(rnorm, n = n_samples) %>%
    map(as_tibble) %>%
    bind_cols()
)
df
# A tibble: 500 x 100,001
sample_id value value1 value2 value3 value4 value5 value6 value7 value8
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 sample_1 1.56 2.33 0.571 5.20 3.15 4.15 6.82 6.17 10.8
2 sample_2 0.891 1.64 4.83 3.32 4.30 6.82 7.07 9.94 9.02
3 sample_3 -0.125 1.39 2.89 1.66 4.53 5.15 4.75 8.59 9.98
4 sample_4 0.239 0.789 1.92 5.52 6.93 4.82 7.22 9.37 8.75
5 sample_5 0.337 3.61 1.33 2.84 4.55 6.84 5.67 8.19 8.35
6 sample_6 1.50 2.96 4.08 2.63 6.12 6.41 6.86 7.73 9.21
7 sample_7 0.887 1.37 4.02 5.53 3.29 7.68 7.63 8.72 8.24
8 sample_8 1.27 2.06 2.56 4.87 4.32 5.73 9.13 7.75 7.74
9 sample_9 -0.565 2.19 3.23 3.01 3.45 6.18 7.63 8.35 10.2
10 sample_10 1.67 1.10 1.20 4.94 4.33 7.46 7.08 7.71 8.61
# ... with 490 more rows, and 99,991 more variables: value9 <dbl>,
#   value10 <dbl>, value11 <dbl>, value12 <dbl>, ...
When I use filter it is quite slow:

system.time(fetched_sample <- filter(df, sample_id == "sample_1"))
   user  system elapsed
165.060   0.110 165.446
Perhaps for this simple purpose it is better to switch to plain which()-based row subsetting:

system.time(fetched_sample <- df[which(df$sample_id == "sample_1"), ])
   user  system elapsed
  0.050   0.000   0.054
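For what it's worth, a minimal self-contained sketch comparing the two approaches on a somewhat smaller wide tibble (the dimensions here are chosen just for illustration, and exact timings will vary by machine; building the columns from a matrix avoids the slow per-column construction above):

```r
library(dplyr)

# Build a 500 x 10,001 wide tibble from a matrix (illustrative sizes)
n_samples  <- 500
n_features <- 10000
mat <- matrix(rnorm(n_samples * n_features), nrow = n_samples)
colnames(mat) <- paste0("value", 1:n_features)

df <- bind_cols(
  tibble(sample_id = paste0("sample_", 1:n_samples)),
  as_tibble(mat)
)

# dplyr verb: convenient, but pays per-column overhead on very wide data
system.time(filter(df, sample_id == "sample_1"))

# Base-R subsetting: a single row-index operation on the whole data frame
system.time(df[which(df$sample_id == "sample_1"), ])
```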
Is it always advisable to work with long tables in dplyr?
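For reference, one way to reshape such data to long format before filtering would be tidyr::pivot_longer (a sketch, assuming the df defined above; the "feature" and "value" column names are arbitrary choices):

```r
library(tidyr)
library(dplyr)

# Reshape the wide tibble so each row is one (sample, feature, value) triple
long_df <- pivot_longer(
  df,
  cols      = -sample_id,
  names_to  = "feature",
  values_to = "value"
)

# Filtering a long tibble touches only three columns, however many features exist
filter(long_df, sample_id == "sample_1")
```

The trade-off is memory: the long form stores the sample and feature identifiers explicitly for every value.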