
Performance of select_if() over wide dataframes #2932

Closed
GrayAlex49 opened this issue Jun 30, 2017 · 4 comments

@GrayAlex49 commented Jun 30, 2017

I've noticed a set of performance issues with dplyr that I'm having trouble pinning down exactly; this is the first clear, reproducible example I've been able to capture. They all roughly relate to the select functions on a particularly wide dataset.

The goal here is to drop all columns with fewer than five unique values (because of the way I generated this data, nothing is actually dropped, but the code works as expected on real data). I would then go on to drop all columns with an sd() below a certain value, etc. In this example, the tidy, all-pipes approach using select_if() takes more than twice as long as summarising to a new dataframe and subsetting by name.

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

n <- 200
reps <- 2000
n1 <- 1

set.seed(1002)

df <- as.data.frame(cbind(matrix(seq_len(n*n1), ncol=n1),
                          matrix(runif(n*reps, min=-500, max = 1000), ncol=reps)))
system.time({
tmp <- df %>% 
  select(-V1) %>% 
  summarise_all(funs(length(unique(.))))

tmp <- names(tmp)[tmp<=5]
test <- select(df, -one_of(tmp))
rm(tmp)
})
#>    user  system elapsed 
#>   34.14    0.00   34.33

system.time({
test2 <- df %>% 
  select(-V1) %>% 
  select_if(function(col) n_distinct(col) >= 5)
})
#>    user  system elapsed 
#>   80.27    0.00   80.34
@lionel- (Member) commented Jun 30, 2017

Thanks, this is a known problem, see this graph for a summarise_all() vs map() comparison: https://github.com/hadley/dplyr/files/867620/colwise-dplyr-purrr-nogroups.pdf

Performance of the colwise functions is terrible for wide data frames. I think this is due to how the hybrid evaluator parses expressions for each column. We'll work on solving this in the future.

@GrayAlex49 (Author) commented Jun 30, 2017

Okay, thank you. I think that also matches up with some slowness I was trying to pin down in mutate_at(). Switching to purrr's map() was remarkably faster. Thanks for the help.
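For reference, the map()-based workaround mentioned above can be sketched roughly like this (a minimal illustration on a toy data frame, not the original benchmark; the column names and threshold here are made up):

```r
library(dplyr)
library(purrr)

# Toy data: two constant columns and one varying column
df <- data.frame(a = c(1, 1, 1), b = 1:3, c = c(2, 2, 2))

# map_int() iterates over the columns of the data frame (a list of
# vectors) and returns an integer vector of distinct counts per column.
keep <- names(df)[map_int(df, n_distinct) >= 3]

# Plain [ subsetting by name avoids the colwise machinery entirely.
result <- df[keep]
```

Because this loops over the columns once in ordinary R and subsets by name, it sidesteps the per-column expression handling that made select_if() slow on wide data at the time.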

@romainfrancois (Member) commented

I think this is fixed now, probably as part of #3335 and #3543:

library(tidyverse)

n <- 200
reps <- 2000
n1 <- 1

set.seed(1002)

df <- as.data.frame(cbind(matrix(seq_len(n*n1), ncol=n1),
  matrix(runif(n*reps, min=-500, max = 1000), ncol=reps)))

system.time({
  tmp <- df %>% 
    select(-V1) %>% 
    summarise_all(funs(length(unique(.))))
  
  tmp <- names(tmp)[tmp<=5]
  test <- select(df, -one_of(tmp))
  rm(tmp)
})
#>    user  system elapsed 
#>  24.000   0.048  24.069

system.time({
  test2 <- df %>% 
    select(-V1) %>% 
    select_if(function(col) n_distinct(col) >= 5)
})
#>    user  system elapsed 
#>   0.188   0.001   0.190

identical(test[,-1], test2)
#> [1] TRUE

@lock (bot) commented Nov 26, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Nov 26, 2018