Performance issue with filter #879
I noticed tremendously slow performance with filter that I can't explain. I want to extract some rows based on the value of column X.
I have a data frame with 70 rows and more than 23 thousand columns, and this is the performance:
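Something like the following reproduces the shape of the problem (the row and column counts are from the report above; the data, column name X's values, and the timing method are my own assumptions, not the original benchmark):

```r
library(dplyr)

set.seed(42)
# 70 rows, ~23 thousand columns, as in the report (values are made up)
df <- as.data.frame(matrix(rnorm(70 * 23000), nrow = 70))
df$X <- sample(c("a", "b"), 70, replace = TRUE)

# Time the dplyr filter against plain logical subsetting
system.time(res_dplyr <- filter(df, X == "a"))  # dplyr way
system.time(res_base  <- df[df$X == "a", ])     # non-dplyr way
```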
The non-dplyr method is about 30 times faster. How come?
I just had a brief look. What happens is that we evaluate the expression to a logical vector, which we then use to subset each of the columns. The problem is that we translate the logical vector into integer indices for every column, when we only need to do it once.
This was presumably done to avoid holding on to the integer vector of indices.
I guess this calls for adding this method in
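To illustrate the idea of translating the logical vector only once, here is a plain-R sketch (hypothetical helper names; this is not dplyr's actual internals):

```r
# Per-column subsetting with the logical mask: each `col[mask]` has to
# translate the logical vector into positions again.
subset_per_column <- function(df, mask) {
  as.data.frame(lapply(df, function(col) col[mask]))
}

# Translate once with which(), then reuse the integer indices everywhere.
subset_once <- function(df, mask) {
  idx <- which(mask)  # logical -> integer, done a single time
  as.data.frame(lapply(df, function(col) col[idx]))
}
```

Both produce identical results; the second avoids repeating the logical-to-index work for all 23 thousand columns.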
Does this issue also affect joining of tables? I found a performance decrease in dplyr 0.4 compared to 0.3: a dplyr 0.4 join is slower than the standard merge function.
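For reference, a comparison along these lines can be set up as follows (table sizes and contents are made up for illustration):

```r
library(dplyr)

set.seed(1)
a <- data.frame(id = 1:100000, x = rnorm(100000))
b <- data.frame(id = sample(1:100000), y = rnorm(100000))

# Compare a dplyr join against base R's merge on the same key
system.time(res_dplyr <- inner_join(a, b, by = "id"))
system.time(res_merge <- merge(a, b, by = "id"))
```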
Besides the issue of computing the index only once, there was a very insidious problem I had a hard time tracking down, essentially related to garbage collection.
Anyway, I'm getting this now:
This is probably going to improve further once we use some parallelization at the core of the implementation.
@marciz what you report here is unrelated to the initial issue. Can you open another one, please?