
Performance issue with filter #879

ozagordi opened this Issue Jan 10, 2015 · 6 comments



ozagordi commented Jan 10, 2015

I noticed tremendously slow performance with filter that I can't understand. I want to extract some rows based on the value of column X.

with_filter = function(mydata) {
  mydata %>%
    filter(X != "Bad sample-Do not use")
}

without_filter = function(mydata) {
  mydata[mydata$X != "Bad sample-Do not use", ]
}

I have a data frame with 70 rows and more than 23 thousand columns, and this is the performance:

> dim(gexp)
[1]    70 23640
> system.time(with_filter(gexp))
   user  system elapsed 
 14.267   0.050  14.337 
> system.time(without_filter(gexp))
   user  system elapsed 
  0.491   0.005   0.495 

The non-dplyr method is about 30 times faster. How come?



hadley commented Jan 12, 2015

Hmmm, I bet we never considered the case of that many columns. We'll take a look for dplyr 0.5

@romainfrancois romainfrancois added this to the 0.5 milestone Jan 12, 2015

@romainfrancois romainfrancois self-assigned this Jan 12, 2015



romainfrancois commented Jan 12, 2015

I just had a brief look. What happens is that we evaluate the expression to a logical vector, which we then use to subset each of the columns; the problem is that we translate the logical vector into integer indices for every column, when we only need to do it once.

This was presumably meant to save allocating the integer vector that holds the indices.

I guess this calls for adding this method in DataFrameVisitors:

template <>
DataFrame subset<LogicalVector>( const LogicalVector& index, const CharacterVector& classes ) const ;


ozagordi commented Jan 12, 2015

Thanks for the quick response (and for developing dplyr, of course)!



marciz commented Feb 4, 2015

Does this issue also affect joining of tables? I found a performance decrease in dplyr 0.4 compared to 0.3: the dplyr 0.4 join is slower than the standard merge function.

s1 <- 1e6
s2 <- 1e4

a <- data.frame(x1 = factor(sample(s1, replace = TRUE)),
                y1 = factor(sample(s1, replace = TRUE)),
                z1 = 1)

b <- data.frame(x2 = factor(sample(s2, replace = TRUE)),
                y2 = factor(sample(s2, replace = TRUE)),
                z2 = 1)

c <- a %>% inner_join(b, by = c('x1' = 'x2', 'y1' = 'y2'))

d <- a %>% merge(b, by.x = c('x1', 'y1'), by.y = c('x2', 'y2'))





romainfrancois commented Apr 24, 2015

Besides the issue of computing the index only once, there was a very insidious problem, essentially related to garbage collection, that I had a hard time tracking down.

Anyway, I'm getting this now:

> system.time( res1 <- with_filter(data) )
   user  system elapsed
  0.140   0.003   0.142
> system.time( res2 <- without_filter(data) )
   user  system elapsed
  0.328   0.009   0.337

This will probably improve further when we use some parallelization, as the core of filter is embarrassingly parallel.

@marciz What you report here is unrelated to the initial issue. Can you open another one, please?



marciz commented Apr 24, 2015

@romainfrancois I think #984 is about the same thing I mentioned here.


@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018
