
DataArray.sel extremely slow #2452

Closed · mschrimpf opened this issue Oct 1, 2018 · 5 comments

@mschrimpf
Problem description

.sel is an xarray method I use a lot and I would have expected it to be fairly efficient.
However, even on tiny DataArrays, it takes seconds.

Code Sample, a copy-pastable example if possible

import timeit

setup = """
import itertools
import numpy as np
import xarray as xr
import string

a = list(string.printable)
b = list(string.ascii_lowercase)
d = xr.DataArray(np.random.rand(len(a), len(b)), coords={'a': a, 'b': b}, dims=['a', 'b'])
d.load()
"""

run = """
for _a, _b in itertools.product(a, b):
    d.sel(a=_a, b=_b)
"""
running_times = timeit.repeat(run, setup, repeat=3, number=10)
print("xarray", running_times)  # e.g. [14.792144000064582, 15.19372400001157, 15.345327000017278]

Expected Output

I would have expected the above code to run in milliseconds.
However, it takes over 10 seconds!
Adding an additional d = d.stack(aa=['a'], bb=['b']) makes it even slower (about twice as slow).

For reference, a naive dict-indexing implementation in Python takes 0.01 seconds:

setup = """
import itertools
import numpy as np
import string

a = list(string.printable)
b = list(string.ascii_lowercase)

d = np.random.rand(len(a), len(b))
indexers = {'a': {coord: index for (index, coord) in enumerate(a)},
            'b': {coord: index for (index, coord) in enumerate(b)}}
"""

run = """
for _a, _b in itertools.product(a, b):
    index_a, index_b = indexers['a'][_a], indexers['b'][_b]
    item = d[index_a][index_b]
"""
running_times = timeit.repeat(run, setup, repeat=3, number=10)
print("dicts", running_times)  # e.g. [0.015355999930761755, 0.01466800004709512, 0.014295000000856817]

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-17134-Microsoft
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.1
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.2.3
cartopy: None
seaborn: None
setuptools: 40.2.0
pip: 10.0.1
conda: None
pytest: 3.7.4
IPython: 6.5.0
sphinx: None

this is a follow-up from #2438

@max-sixty
Collaborator

Thanks for the issue @mschrimpf

.sel is slow per operation, mainly because it's a python function call (although not the only reason - it's also doing a set of checks / potential alignments / etc). When I say 'slow', I mean about 0.5ms:

In [6]: %timeit d.sel(a='a', b='a')
1000 loops, best of 3: 522 µs per loop

While there's an overhead, the time is fairly consistent regardless of the number of items it's selecting. For example:

In [11]: %timeit d.sel(a=d['a'], b=d['b'])
1000 loops, best of 3: 1 ms per loop

So, as is often the case in the pandas / python ecosystem, if you can write code in a vectorized way, without using python in the tight loops, it's fast. If you need to run python in each loop, it's much slower.
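For illustration, here's a rough sketch (not from the original benchmark) of what the vectorized pattern can look like, including pointwise selection of arbitrary label pairs via xarray's vectorized indexing (available since v0.10):

```python
import numpy as np
import xarray as xr
import string

a = list(string.printable)
b = list(string.ascii_lowercase)
d = xr.DataArray(np.random.rand(len(a), len(b)),
                 coords={'a': a, 'b': b}, dims=['a', 'b'])

# One vectorized call replaces len(a) * len(b) Python-level .sel calls,
# so the per-call overhead is paid once instead of ~2600 times.
everything = d.sel(a=d['a'], b=d['b'])

# Pointwise selection of arbitrary (a, b) label pairs: index with DataArrays
# that share a new common dimension, yielding one value per pair.
points_a = xr.DataArray(['0', '1', 'z'], dims='points')
points_b = xr.DataArray(['x', 'y', 'z'], dims='points')
pairs = d.sel(a=points_a, b=points_b)
```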

Does that resonate?


While I think it's not the main point here, there might be some optimizations to be made on sel. It runs isinstance 144 times! And initializes a collection 13 times? Here's the %prun of the 0.5 ms command:

```
1077 function calls (1066 primitive calls) in 0.002 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     6    0.000    0.000    0.000    0.000 coordinates.py:169()
    13    0.000    0.000    0.000    0.000 collections.py:50(__init__)
    14    0.000    0.000    0.000    0.000 _abcoll.py:548(update)
    33    0.000    0.000    0.000    0.000 _weakrefset.py:70(__contains__)
     2    0.000    0.000    0.001    0.000 dataset.py:881(_construct_dataarray)
   144    0.000    0.000    0.000    0.000 {isinstance}
     1    0.000    0.000    0.001    0.001 dataset.py:1496(isel)
    18    0.000    0.000    0.000    0.000 {numpy.core.multiarray.array}
     3    0.000    0.000    0.000    0.000 dataset.py:92(calculate_dimensions)
    13    0.000    0.000    0.000    0.000 abc.py:128(__instancecheck__)
    36    0.000    0.000    0.000    0.000 common.py:183(__setattr__)
     2    0.000    0.000    0.000    0.000 coordinates.py:167(variables)
     2    0.000    0.000    0.000    0.000 {method 'get_loc' of 'pandas._libs.index.IndexEngine' objects}
    26    0.000    0.000    0.000    0.000 variable.py:271(shape)
    65    0.000    0.000    0.000    0.000 collections.py:90(__iter__)
     5    0.000    0.000    0.000    0.000 variable.py:136(as_compatible_data)
     3    0.000    0.000    0.000    0.000 dataarray.py:165(__init__)
     2    0.000    0.000    0.000    0.000 indexing.py:1255(__getitem__)
     3    0.000    0.000    0.000    0.000 variable.py:880(isel)
    14    0.000    0.000    0.000    0.000 collections.py:71(__setitem__)
     1    0.000    0.000    0.000    0.000 dataset.py:1414(_validate_indexers)
     6    0.000    0.000    0.000    0.000 coordinates.py:38(__iter__)
     3    0.000    0.000    0.000    0.000 variable.py:433(_broadcast_indexes)
     2    0.000    0.000    0.000    0.000 variable.py:1826(to_index)
     3    0.000    0.000    0.000    0.000 dataset.py:636(_construct_direct)
     2    0.000    0.000    0.000    0.000 indexing.py:122(convert_label_indexer)
    15    0.000    0.000    0.000    0.000 utils.py:306(__init__)
     3    0.000    0.000    0.000    0.000 indexing.py:17(expanded_indexer)
    28    0.000    0.000    0.000    0.000 collections.py:138(iteritems)
     1    0.000    0.000    0.001    0.001 indexing.py:226(remap_label_indexers)
    15    0.000    0.000    0.000    0.000 numeric.py:424(asarray)
     1    0.000    0.000    0.001    0.001 indexing.py:193(get_dim_indexers)
 80/70    0.000    0.000    0.000    0.000 {len}
```

@mschrimpf
Author

mschrimpf commented Oct 1, 2018

Thanks @max-sixty,
the checks per call make sense, although I still find 0.5 ms insane for a single-value lookup (plain Python dict indexing takes about a fiftieth of that to index every single item in the array).
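
For reference, a rough, untested sketch of one way to sidestep the per-call overhead in a tight loop: use the underlying pandas indexes for the label-to-position lookup and index the raw NumPy values positionally instead of calling .sel.

```python
import itertools
import numpy as np
import xarray as xr
import string

a = list(string.printable)
b = list(string.ascii_lowercase)
d = xr.DataArray(np.random.rand(len(a), len(b)),
                 coords={'a': a, 'b': b}, dims=['a', 'b'])

# Translate labels to positions via the pandas Index objects and index the
# raw NumPy array positionally, avoiding the .sel machinery on every item.
index_a, index_b = d.indexes['a'], d.indexes['b']
values = d.values

for _a, _b in itertools.product(a, b):
    item = values[index_a.get_loc(_a), index_b.get_loc(_b)]
```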

The reason I'm looking into this is actually multi-dimensional grouping (#2438), which is unfortunately not implemented (the above code is essentially a step towards trying to implement that). Is there a way of vectorizing these calls with that in mind, i.e. applying a method to each group?

@max-sixty
Collaborator

> Is there a way of vectorizing these calls with that in mind, i.e. applying a method to each group?

I can't think of anything immediately, and I doubt there's an easy way given it doesn't exist yet (though that logic can be a trap!). There's some hacky pandas reshaping you may be able to do to solve this as a one-off (rough sketch below). Otherwise it likely requires a concerted effort with numbagg.

I occasionally hit this issue too, so I'm as keen as you are to find a solution. Thanks for giving it a try.
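
A rough sketch of the kind of pandas reshaping meant above (the grouping keys are made-up examples, not from this thread): flatten the DataArray to a MultiIndexed Series and group by keys derived from both index levels at once.

```python
import numpy as np
import xarray as xr
import string

a = list(string.printable)
b = list(string.ascii_lowercase)
d = xr.DataArray(np.random.rand(len(a), len(b)),
                 coords={'a': a, 'b': b}, dims=['a', 'b'])

# Flatten to a pandas Series with a MultiIndex over (a, b), ...
s = d.to_series()

# ... then group by arbitrary keys derived from both index levels at once
# (hypothetical keys for illustration) and reduce each group in a single pass.
key_a = s.index.get_level_values('a').str.isdigit()
key_b = s.index.get_level_values('b').isin(list('aeiou'))
grouped = s.groupby([key_a, key_b]).mean()
```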

@mschrimpf
Author

I posted a manual solution to the multi-dimensional grouping in the Stack Overflow thread.
Hopefully .sel can be made more efficient, though; it's such an everyday method.

@max-sixty
Collaborator

Thanks @mschrimpf. Hopefully we can get multi-dimensional groupbys, too.
