
Performance issue with filter #879

Closed
ozagordi opened this issue Jan 10, 2015 · 6 comments

@ozagordi

I noticed tremendously slow performance with filter that I can't understand. I want to extract some rows based on the value of column X.

# dplyr version
with_filter <- function(mydata) {
  mydata %>%
    filter(X != "Bad sample-Do not use")
}

# base R version
without_filter <- function(mydata) {
  mydata[mydata$X != "Bad sample-Do not use", ]
}

I have a data frame with 70 rows and more than 23,000 columns, and this is the performance:

> dim(gexp)
[1]    70 23640
> system.time(with_filter(gexp))
   user  system elapsed 
 14.267   0.050  14.337 
> system.time(without_filter(gexp))
   user  system elapsed 
  0.491   0.005   0.495 

The non-dplyr method is about 30 times faster. How come?

Thanks.

@hadley
Member

hadley commented Jan 12, 2015

Hmmm, I bet we never considered the case of that many columns. We'll take a look for dplyr 0.5.

@romainfrancois added this to the 0.5 milestone Jan 12, 2015
@romainfrancois self-assigned this Jan 12, 2015
@romainfrancois
Member

I just had a brief look. What happens is that we evaluate the expression to a logical vector, which we then use to subset each of the columns. The problem is that we translate the logical vector to indices each and every time, when we only need to do it once.

This was presumably done to avoid storing the integer vector that holds the indices.

I guess this calls for adding this method in DataFrameVisitors:

template <>
DataFrame subset<LogicalVector>( const LogicalVector& index, const CharacterVector& classes ) const ;
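
To make the idea concrete, here is a rough R sketch (not dplyr's actual internals, which are C++; filter_once and keep are made-up names): turn the logical mask into integer indices once with which(), then reuse those indices for every column instead of re-scanning the mask per column.

# Rough illustration only: precompute indices once, reuse for all columns.
filter_once <- function(mydata, keep) {
  idx <- which(keep)                       # logical -> integer indices, done once
  as.data.frame(lapply(mydata, `[`, idx))  # subset every column with the same indices
}

With the 70 x 23640 data frame above, this does the logical-to-integer conversion once instead of 23,640 times.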

@ozagordi
Author

Thanks for the quick response (and for developing dplyr, of course)!

@marciz

marciz commented Feb 4, 2015

Does this issue also affect joining of tables? I found a performance decrease in dplyr 0.4 compared to 0.3: a dplyr 0.4 join is slower than the standard merge function.

library(dplyr)
s1 <- 1e6
s2 <- 1e4

# two data frames with factor join keys
a <- data.frame(x1 = factor(sample(s1, replace = TRUE)), 
                y1 = factor(sample(s1, replace = TRUE)),
                z1 = 1)

b <- data.frame(x2 = factor(sample(s2, replace = TRUE)), 
                y2 = factor(sample(s2, replace = TRUE)),
                z2 = 1)

# dplyr join
system.time({
  c <- a %>% inner_join(b, by = c('x1' = 'x2', 'y1' = 'y2'))
})

# base R merge
system.time({
  d <- a %>% merge(b, by.x = c('x1', 'y1'), by.y = c('x2', 'y2'))
})

Thanks

@romainfrancois
Member

Besides the issue of making the index only once, there was a very insidious problem that I had a hard time tracking down, essentially related to garbage collection.

Anyway, I'm getting this now:

> system.time( res1 <- with_filter(data) )
   user  system elapsed 
  0.140   0.003   0.142 
> system.time( res2 <- without_filter(data) )
   user  system elapsed 
  0.328   0.009   0.337 

This is probably going to be improved further when we use some parallelization, as the core of filter is embarrassingly parallel.

@marciz what you report here is unrelated to the initial issue. Can you open another one, please?

@marciz

marciz commented Apr 24, 2015

@romainfrancois I think #984 is about the same thing I mentioned here.

Thanks

@lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018