DataArray.sel extremely slow #2452
Thanks for the issue @mschrimpf
While there's an overhead, the time is fairly consistent regardless of the number of items it's selecting. For example:
So, as is often the case in the pandas / Python ecosystem, if you can write code in a vectorized way, without Python in the tight loops, it's fast. If you need to run Python in each loop, it's much slower. Does that resonate? While I don't think it's the main point here, there might be some optimizations on
```
1077 function calls (1066 primitive calls) in 0.002 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
```
Thanks @max-sixty. The reason I'm looking into this is actually multi-dimensional grouping (#2438), which is unfortunately not implemented (the above code is essentially a step towards trying to implement that). Is there a way of vectorizing these calls with that in mind, i.e. applying a method for each group?
I can't think of anything immediately, and I doubt there's an easy way given it doesn't exist yet (though that logic can be a trap!). There's some hacky pandas reshaping you may be able to do to solve this as a one-off. Otherwise it likely requires a concerted effort with numbagg. I occasionally hit this issue too, so I'm as keen as you are to find a solution. Thanks for giving it a try.
I posted a manual solution to the multi-dimensional grouping in the stackoverflow thread.
Thanks @mschrimpf. Hopefully we can get multi-dimensional groupbys, too. |
Problem description
.sel is an xarray method I use a lot, and I would have expected it to be fairly efficient. However, even on tiny DataArrays, it takes seconds.
Code Sample, a copy-pastable example if possible
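The original snippet is not preserved in this copy of the thread. A representative reconstruction of the pattern described (repeated single-label `.sel` calls on a small DataArray) might look like this; the sizes and coordinate names are assumptions, not the issue author's code:

```python
# Hedged reconstruction (the original code sample is not preserved here):
# calling .sel once per element of a small 2-D DataArray. Each call pays
# xarray's Python-level indexing overhead, so the loop's total time is
# dominated by that overhead rather than by data access.
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
d = xr.DataArray(
    rng.random((50, 50)),
    dims=("a", "b"),
    coords={"a": np.arange(50), "b": np.arange(50)},
)

# 2500 individual label-based selections.
for i in d["a"].values:
    for j in d["b"].values:
        d.sel(a=i, b=j)
```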
Expected Output
I would have expected the above code to run in milliseconds.
However, it takes over 10 seconds!
Adding an additional
d = d.stack(aa=['a'], bb=['b'])
makes it even slower, about twice as slow. For reference, a naive dict-indexing implementation in Python takes 0.01 seconds:
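The dict-indexing baseline referred to might be sketched like this (a reconstruction, not the issue author's exact code; names and sizes are assumptions):

```python
# Sketch of a naive dict-based label lookup: map each label to its
# integer position once, then index the underlying NumPy array directly,
# bypassing xarray's per-call machinery entirely.
import numpy as np

rng = np.random.default_rng(0)
values = rng.random((50, 50))

# One-time label -> position maps (here labels are just 0..49).
a_index = {label: pos for pos, label in enumerate(range(50))}
b_index = {label: pos for pos, label in enumerate(range(50))}

def naive_sel(a, b):
    """Label-based lookup via plain dicts and positional indexing."""
    return values[a_index[a], b_index[b]]

x = naive_sel(3, 7)
```

Two dict lookups plus a NumPy integer index is a few microseconds per call, which is consistent with the ~0.01 s total quoted above.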
Output of
xr.show_versions()
This is a follow-up to #2438.