
DataArray.sel extremely slow #2452

Closed · mschrimpf opened this issue Oct 1, 2018 · 5 comments

@mschrimpf
Problem description

.sel is an xarray method I use a lot and I would have expected it to be fairly efficient.
However, even on tiny DataArrays, it takes seconds.

Code Sample, a copy-pastable example if possible

import timeit

setup = """
import itertools
import numpy as np
import xarray as xr
import string

a = list(string.printable)
b = list(string.ascii_lowercase)
d = xr.DataArray(np.random.rand(len(a), len(b)), coords={'a': a, 'b': b}, dims=['a', 'b'])
d.load()
"""

run = """
for _a, _b in itertools.product(a, b):
    d.sel(a=_a, b=_b)
"""
running_times = timeit.repeat(run, setup, repeat=3, number=10)
print("xarray", running_times)  # e.g. [14.792144000064582, 15.19372400001157, 15.345327000017278]

Expected Output

I would have expected the above code to run in milliseconds.
However, it takes over 10 seconds!
Adding an additional d = d.stack(aa=['a'], bb=['b']) makes it even slower (about twice as slow).

For reference, a naive dict-indexing implementation in Python takes 0.01 seconds:

setup = """
import itertools
import numpy as np
import string

a = list(string.printable)
b = list(string.ascii_lowercase)

d = np.random.rand(len(a), len(b))
indexers = {'a': {coord: index for (index, coord) in enumerate(a)},
            'b': {coord: index for (index, coord) in enumerate(b)}}
"""

run = """
for _a, _b in itertools.product(a, b):
    index_a, index_b = indexers['a'][_a], indexers['b'][_b]
    item = d[index_a][index_b]
"""
running_times = timeit.repeat(run, setup, repeat=3, number=10)
print("dicts", running_times)  # e.g. [0.015355999930761755, 0.01466800004709512, 0.014295000000856817]

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-17134-Microsoft
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.1
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.2.3
cartopy: None
seaborn: None
setuptools: 40.2.0
pip: 10.0.1
conda: None
pytest: 3.7.4
IPython: 6.5.0
sphinx: None

this is a follow-up from #2438

@max-sixty
Collaborator

Thanks for the issue @mschrimpf

.sel is slow per operation, mainly because it's a python function call (although not the only reason - it's also doing a set of checks / potential alignments / etc). When I say 'slow', I mean about 0.5ms:

In [6]: %timeit d.sel(a='a', b='a')
1000 loops, best of 3: 522 µs per loop

While there's an overhead, the time is fairly consistent regardless of the number of items it's selecting. For example:

In [11]: %timeit d.sel(a=d['a'], b=d['b'])
1000 loops, best of 3: 1 ms per loop

So, as is often the case in the pandas / python ecosystem, if you can write code in a vectorized way, without using python in the tight loops, it's fast. If you need to run python in each loop, it's much slower.
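For illustration, here's a rough sketch (not from the original benchmark) of what the vectorized pattern can look like, including pointwise selection of arbitrary label pairs via xarray's vectorized indexing (available since v0.10):

```python
import numpy as np
import xarray as xr
import string

a = list(string.printable)
b = list(string.ascii_lowercase)
d = xr.DataArray(np.random.rand(len(a), len(b)),
                 coords={'a': a, 'b': b}, dims=['a', 'b'])

# One vectorized call replaces len(a) * len(b) Python-level .sel calls,
# so the per-call overhead is paid once instead of ~2600 times.
everything = d.sel(a=d['a'], b=d['b'])

# Pointwise selection of arbitrary (a, b) label pairs: index with DataArrays
# that share a new common dimension, yielding one value per pair.
points_a = xr.DataArray(['0', '1', 'z'], dims='points')
points_b = xr.DataArray(['x', 'y', 'z'], dims='points')
pairs = d.sel(a=points_a, b=points_b)
```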

Does that resonate?


While I think it's not the main point here, there might be some optimizations to be made on sel. It runs isinstance 144 times! And initializes a collection 13 times? Here's the %prun of the 0.5 ms command:

```
1077 function calls (1066 primitive calls) in 0.002 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     6    0.000    0.000    0.000    0.000 coordinates.py:169()
    13    0.000    0.000    0.000    0.000 collections.py:50(__init__)
    14    0.000    0.000    0.000    0.000 _abcoll.py:548(update)
    33    0.000    0.000    0.000    0.000 _weakrefset.py:70(__contains__)
     2    0.000    0.000    0.001    0.000 dataset.py:881(_construct_dataarray)
   144    0.000    0.000    0.000    0.000 {isinstance}
     1    0.000    0.000    0.001    0.001 dataset.py:1496(isel)
    18    0.000    0.000    0.000    0.000 {numpy.core.multiarray.array}
     3    0.000    0.000    0.000    0.000 dataset.py:92(calculate_dimensions)
    13    0.000    0.000    0.000    0.000 abc.py:128(__instancecheck__)
    36    0.000    0.000    0.000    0.000 common.py:183(__setattr__)
     2    0.000    0.000    0.000    0.000 coordinates.py:167(variables)
     2    0.000    0.000    0.000    0.000 {method 'get_loc' of 'pandas._libs.index.IndexEngine' objects}
    26    0.000    0.000    0.000    0.000 variable.py:271(shape)
    65    0.000    0.000    0.000    0.000 collections.py:90(__iter__)
     5    0.000    0.000    0.000    0.000 variable.py:136(as_compatible_data)
     3    0.000    0.000    0.000    0.000 dataarray.py:165(__init__)
     2    0.000    0.000    0.000    0.000 indexing.py:1255(__getitem__)
     3    0.000    0.000    0.000    0.000 variable.py:880(isel)
    14    0.000    0.000    0.000    0.000 collections.py:71(__setitem__)
     1    0.000    0.000    0.000    0.000 dataset.py:1414(_validate_indexers)
     6    0.000    0.000    0.000    0.000 coordinates.py:38(__iter__)
     3    0.000    0.000    0.000    0.000 variable.py:433(_broadcast_indexes)
     2    0.000    0.000    0.000    0.000 variable.py:1826(to_index)
     3    0.000    0.000    0.000    0.000 dataset.py:636(_construct_direct)
     2    0.000    0.000    0.000    0.000 indexing.py:122(convert_label_indexer)
    15    0.000    0.000    0.000    0.000 utils.py:306(__init__)
     3    0.000    0.000    0.000    0.000 indexing.py:17(expanded_indexer)
    28    0.000    0.000    0.000    0.000 collections.py:138(iteritems)
     1    0.000    0.000    0.001    0.001 indexing.py:226(remap_label_indexers)
    15    0.000    0.000    0.000    0.000 numeric.py:424(asarray)
     1    0.000    0.000    0.001    0.001 indexing.py:193(get_dim_indexers)
 80/70    0.000    0.000    0.000    0.000 {len}
```

@mschrimpf
Author

mschrimpf commented Oct 1, 2018

Thanks @max-sixty,
the checks per call make sense, although I still find 0.5 ms insane for a single-value lookup (plain Python dict indexing takes about a fiftieth of that to index every single item in the array).
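
For reference, a rough, untested sketch of one way to sidestep the per-call overhead in a tight loop: use the underlying pandas indexes for the label-to-position lookup and index the raw NumPy values positionally instead of calling .sel.

```python
import itertools
import numpy as np
import xarray as xr
import string

a = list(string.printable)
b = list(string.ascii_lowercase)
d = xr.DataArray(np.random.rand(len(a), len(b)),
                 coords={'a': a, 'b': b}, dims=['a', 'b'])

# Translate labels to positions via the pandas Index objects and index the
# raw NumPy array positionally, avoiding the .sel machinery on every item.
index_a, index_b = d.indexes['a'], d.indexes['b']
values = d.values

for _a, _b in itertools.product(a, b):
    item = values[index_a.get_loc(_a), index_b.get_loc(_b)]
```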

The reason I'm looking into this is actually multi-dimensional grouping (#2438), which is unfortunately not implemented (the above code is essentially a step towards trying to implement that). Is there a way of vectorizing these calls with that in mind, i.e. applying a method to each group?

@max-sixty
Collaborator

> Is there a way of vectorizing these calls with that in mind, i.e. applying a method to each group?

I can't think of anything immediately, and I doubt there's an easy way given it doesn't exist yet (though that logic can be a trap!). There's some hacky pandas reshaping you may be able to do to solve this as a one-off (rough sketch below). Otherwise it likely requires a concerted effort with numbagg.

I occasionally hit this issue too, so I'm as keen as you are to find a solution. Thanks for giving it a try.
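
A rough sketch of the kind of pandas reshaping meant above (the grouping keys are made-up examples, not from this thread): flatten the DataArray to a MultiIndexed Series and group by keys derived from both index levels at once.

```python
import numpy as np
import xarray as xr
import string

a = list(string.printable)
b = list(string.ascii_lowercase)
d = xr.DataArray(np.random.rand(len(a), len(b)),
                 coords={'a': a, 'b': b}, dims=['a', 'b'])

# Flatten to a pandas Series with a MultiIndex over (a, b), ...
s = d.to_series()

# ... then group by arbitrary keys derived from both index levels at once
# (hypothetical keys for illustration) and reduce each group in a single pass.
key_a = s.index.get_level_values('a').str.isdigit()
key_b = s.index.get_level_values('b').isin(list('aeiou'))
grouped = s.groupby([key_a, key_b]).mean()
```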

@mschrimpf
Author

I posted a manual solution to the multi-dimensional grouping in the Stack Overflow thread.
Hopefully .sel can be made more efficient, though; it's such an everyday method.

@max-sixty
Collaborator

Thanks @mschrimpf. Hopefully we can get multi-dimensional groupbys, too.
