Performance of select_if() over wide dataframes #2932
Comments
Thanks, this is a known problem: performance of colwise functions is terrible for wide data frames. I think this is due to how the hybrid evaluator parses expressions for each column. We'll work on solving this in the future.
Okay, thank you. I think that also matches up with some slowness I was trying to pin down in mutate_at(). Switching to map was remarkably faster. Thanks for the help.
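For readers wondering what such a switch might look like, here is a minimal sketch. It is not the poster's actual code: the data frame `wide`, the helper `center()`, and the choice of columns are all hypothetical, and base R's lapply() stands in for purrr's map() so the snippet runs without extra packages (purrr::map() could be dropped into the same spot).

```r
# Hypothetical wide data frame: 50 rows, 200 numeric columns (V1..V200).
set.seed(1)
wide <- as.data.frame(matrix(runif(50 * 200), ncol = 200))

# Hypothetical per-column transformation; it stands in for whatever
# function was being passed to mutate_at().
center <- function(x) x - mean(x)

# Columns to transform (an arbitrary choice for illustration).
cols <- names(wide)[1:100]

# Apply the function column by column and assign the results back.
# lapply() returns a named list with one element per selected column,
# which `[<-` on a data frame accepts directly.
result <- wide
result[cols] <- lapply(wide[cols], center)
```

The idea is simply to treat the data frame as a list of columns and loop over it directly, which sidesteps the per-column expression handling described above.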
I think this is fixed now. Probably as part of #3335 and #3543
library(tidyverse)
n <- 200
reps <- 2000
n1 <- 1
set.seed(1002)
df <- as.data.frame(cbind(matrix(seq_len(n * n1), ncol = n1),
                          matrix(runif(n * reps, min = -500, max = 1000), ncol = reps)))
system.time({
  # Count distinct values per column, then drop columns with fewer than five
  tmp <- df %>%
    select(-V1) %>%
    summarise_all(funs(length(unique(.))))
  tmp <- names(tmp)[tmp < 5]
  test <- select(df, -one_of(tmp))
  rm(tmp)
})
#> user system elapsed
#> 24.000 0.048 24.069
system.time({
  test2 <- df %>%
    select(-V1) %>%
    select_if(function(col) n_distinct(col) >= 5)
})
#> user system elapsed
#> 0.188 0.001 0.190
identical(test[,-1], test2)
#> [1] TRUE
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
I've noticed a performance issue with dplyr that I'm having trouble pinning down exactly. This is the first clear and reproducible case I've been able to capture. The cases I've seen all appear to involve the select function on a particularly wide dataset.
The goal here is to drop all columns with fewer than five unique values (because of the way I generated this data, nothing is actually dropped, but the code works as expected on real data). I would then go on to drop all columns with an sd() below a certain value, and so on. In this example, the tidy, all-pipes approach using select_if() takes more than twice as long as summarising to a new data frame and subsetting that way.
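To make the two filtering steps concrete, here is a small base R sketch of the same idea. It is not the code from this issue: the toy data frame, its column names, and the `min_sd` threshold are hypothetical (the issue does not give a value for the sd() cutoff), and vapply() is used only as a dependency-free way to compute one statistic per column.

```r
# Hypothetical toy data, small enough to check by eye.
set.seed(1002)
df <- data.frame(
  id    = 1:20,
  few   = rep(1:2, 10),              # only 2 distinct values
  flat  = 100 + (1:20) * 1e-6,       # 20 distinct values, tiny spread
  noisy = runif(20, min = -500, max = 1000)
)

min_distinct <- 5  # threshold stated in the issue
min_sd <- 1        # hypothetical threshold, chosen for illustration

# Step 1: keep columns with at least `min_distinct` unique values.
n_unique <- vapply(df, function(col) length(unique(col)), integer(1))
step1 <- df[, n_unique >= min_distinct, drop = FALSE]

# Step 2: of those, keep columns whose standard deviation is at least `min_sd`.
col_sd <- vapply(step1, sd, numeric(1))
step2 <- step1[, col_sd >= min_sd, drop = FALSE]

names(step2)
```

Here `few` falls out in the first step and `flat` in the second, leaving `id` and `noisy`.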