
Performance of select_if() over wide dataframes #2932

Closed
GrayAlex49 opened this issue Jun 30, 2017 · 4 comments

@GrayAlex49 commented Jun 30, 2017

I've noticed a set of performance issues with dplyr that I'm having trouble pinning down exactly; this is the first clear, reproducible example I've been able to capture. They all roughly relate to the select functions on a particularly wide dataset.

The goal here is to drop all columns with fewer than five unique values (because of the way I generated this data, nothing is actually dropped, but the code works as expected on real data). I would then go on to drop all columns with an sd() below a certain value, etc. In this example, the tidy, all-pipes approach using select_if() takes more than twice as long as summarising to a new dataframe and subsetting by name.

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

n <- 200
reps <- 2000
n1 <- 1

set.seed(1002)

df <- as.data.frame(cbind(matrix(seq_len(n*n1), ncol=n1),
                          matrix(runif(n*reps, min=-500, max = 1000), ncol=reps)))
system.time({
tmp <- df %>% 
  select(-V1) %>% 
  summarise_all(funs(length(unique(.))))

tmp <- names(tmp)[tmp<=5]
test <- select(df, -one_of(tmp))
rm(tmp)
})
#>    user  system elapsed 
#>   34.14    0.00   34.33

system.time({
test2 <- df %>% 
  select(-V1) %>% 
  select_if(function(col) n_distinct(col) >= 5)
})
#>    user  system elapsed 
#>   80.27    0.00   80.34
@lionel- (Member) commented Jun 30, 2017

Thanks, this is a known problem, see this graph for a summarise_all() vs map() comparison: https://github.com/hadley/dplyr/files/867620/colwise-dplyr-purrr-nogroups.pdf

Performance of the colwise functions is terrible for wide data frames. I think this is due to how the hybrid evaluator parses expressions for each column. We'll work on solving this in the future.

@GrayAlex49 (Author) commented Jun 30, 2017

Okay, thank you. I think that also matches up with some slowness I was trying to pin down in mutate_at(). Switching to purrr's map() was remarkably faster. Thanks for the help.
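For reference, the map()-based workaround mentioned above can be sketched roughly like this (a minimal illustration on a toy data frame, not the original benchmark; the column names and threshold here are made up):

```r
library(dplyr)
library(purrr)

# Toy data: two constant columns and one varying column
df <- data.frame(a = c(1, 1, 1), b = 1:3, c = c(2, 2, 2))

# map_int() iterates over the columns of the data frame (a list of
# vectors) and returns an integer vector of distinct counts per column.
keep <- names(df)[map_int(df, n_distinct) >= 3]

# Plain [ subsetting by name avoids the colwise machinery entirely.
result <- df[keep]
```

Because this loops over the columns once in ordinary R and subsets by name, it sidesteps the per-column expression handling that made select_if() slow on wide data at the time.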

@romainfrancois (Member) commented

I think this is fixed now, probably as part of #3335 and #3543:

library(tidyverse)

n <- 200
reps <- 2000
n1 <- 1

set.seed(1002)

df <- as.data.frame(cbind(matrix(seq_len(n*n1), ncol=n1),
  matrix(runif(n*reps, min=-500, max = 1000), ncol=reps)))

system.time({
  tmp <- df %>% 
    select(-V1) %>% 
    summarise_all(funs(length(unique(.))))
  
  tmp <- names(tmp)[tmp<=5]
  test <- select(df, -one_of(tmp))
  rm(tmp)
})
#>    user  system elapsed 
#>  24.000   0.048  24.069

system.time({
  test2 <- df %>% 
    select(-V1) %>% 
    select_if(function(col) n_distinct(col) >= 5)
})
#>    user  system elapsed 
#>   0.188   0.001   0.190

identical(test[,-1], test2)
#> [1] TRUE

@lock (bot) commented Nov 26, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Nov 26, 2018