Performance issue with filter #879
Comments
Hmmm, I bet we never considered the case of that many columns. We'll take a look for dplyr 0.5.
I just had a brief look. What happens is that we evaluate the expression to a logical vector, which we then use to subset all of the columns. The problem is that we translate from the logical vector to indices each and every time, where we only need to do it once. This was presumably to save storing the integer vector that holds the indices. I guess this calls for adding this method, as sketched below.
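The idea, as a rough sketch in plain R (illustrative only, not dplyr's actual C++ internals):

```r
# Illustrative sketch, not dplyr internals: convert the logical mask
# to integer indices once, then reuse those indices for every column,
# instead of re-deriving them from the logical vector per column.
subset_once <- function(df, mask) {
  keep <- which(mask)                        # logical -> indices, done once
  out  <- lapply(df, function(col) col[keep])
  as.data.frame(out, stringsAsFactors = FALSE)
}
```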
Thanks for the quick response (and for …)
Does this issue also affect joining of tables? I found a performance decrease in dplyr 0.4 compared to 0.3: a dplyr 0.4 join is slower than the standard merge function.
Thanks
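One minimal way to check a claim like this (hedged sketch; the data, the `id` key, and the use of `microbenchmark` are assumptions, not details from the report):

```r
library(dplyr)
library(microbenchmark)

# Made-up tables sharing a key column "id".
a <- data.frame(id = 1:1e5, x = rnorm(1e5))
b <- data.frame(id = sample(1e5), y = rnorm(1e5))

# Compare a dplyr join against base R's merge().
microbenchmark(
  dplyr = inner_join(a, b, by = "id"),
  base  = merge(a, b, by = "id"),
  times = 10
)
```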
Besides the issue of computing the index only once, there was a very insidious problem I had a hard time tracking down, essentially related to garbage collection. Anyway, I'm getting this now:
This is probably going to be improved further when we use some parallelization at the core.
@marciz, what you report here is unrelated to the initial issue. Can you open another one please?
@romainfrancois I think #984 is about the same thing I mentioned here. Thanks
I noticed tremendously slow performance with filter that I can't understand. I want to extract some rows based on the value of column X.
I have a data frame with 70 rows and more than 23 thousand columns, and this is the performance:
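(The original benchmark output did not survive. A reproduction sketch of the comparison, using made-up data of the stated shape; column names and timing tool are assumptions:)

```r
library(dplyr)
library(microbenchmark)

# Made-up data matching the stated shape: 70 rows, ~23,000 columns.
df <- as.data.frame(matrix(rnorm(70 * 23000), nrow = 70))
names(df)[1] <- "X"

# dplyr::filter() vs. plain logical subsetting on one column.
microbenchmark(
  dplyr = filter(df, X > 0),
  base  = df[df$X > 0, ],
  times = 10
)
```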
The non-dplyr method is about 30 times faster; how come?
Thanks.