Array indexing with dask arrays #2511
Comments
For reference, here's the current stacktrace/error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-74fe4ba70f9d> in <module>()
----> 1 da[{'dim_1' : indc}]
/usr/local/lib/python3.6/dist-packages/xarray/core/dataarray.py in __getitem__(self, key)
472 else:
473 # xarray-style array indexing
--> 474 return self.isel(indexers=self._item_key_to_dict(key))
475
476 def __setitem__(self, key, value):
/usr/local/lib/python3.6/dist-packages/xarray/core/dataarray.py in isel(self, indexers, drop, **indexers_kwargs)
817 """
818 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, 'isel')
--> 819 ds = self._to_temp_dataset().isel(drop=drop, indexers=indexers)
820 return self._from_temp_dataset(ds)
821
/usr/local/lib/python3.6/dist-packages/xarray/core/dataset.py in isel(self, indexers, drop, **indexers_kwargs)
1537 for name, var in iteritems(self._variables):
1538 var_indexers = {k: v for k, v in indexers_list if k in var.dims}
-> 1539 new_var = var.isel(indexers=var_indexers)
1540 if not (drop and name in var_indexers):
1541 variables[name] = new_var
/usr/local/lib/python3.6/dist-packages/xarray/core/variable.py in isel(self, indexers, drop, **indexers_kwargs)
905 if dim in indexers:
906 key[i] = indexers[dim]
--> 907 return self[tuple(key)]
908
909 def squeeze(self, dim=None):
/usr/local/lib/python3.6/dist-packages/xarray/core/variable.py in __getitem__(self, key)
614 array `x.values` directly.
615 """
--> 616 dims, indexer, new_order = self._broadcast_indexes(key)
617 data = as_indexable(self._data)[indexer]
618 if new_order:
/usr/local/lib/python3.6/dist-packages/xarray/core/variable.py in _broadcast_indexes(self, key)
487 return self._broadcast_indexes_outer(key)
488
--> 489 return self._broadcast_indexes_vectorized(key)
490
491 def _broadcast_indexes_basic(self, key):
/usr/local/lib/python3.6/dist-packages/xarray/core/variable.py in _broadcast_indexes_vectorized(self, key)
599 new_order = None
600
--> 601 return out_dims, VectorizedIndexer(tuple(out_key)), new_order
602
603 def __getitem__(self, key):
/usr/local/lib/python3.6/dist-packages/xarray/core/indexing.py in __init__(self, key)
423 else:
424 raise TypeError('unexpected indexer type for {}: {!r}'
--> 425 .format(type(self).__name__, k))
426 new_key.append(k)
427
TypeError: unexpected indexer type for VectorizedIndexer: dask.array<xarray-<this-array>, shape=(10,), dtype=int64, chunksize=(2,)>
It looks like we could support this relatively easily since dask.array supports indexing with dask arrays now. This would be a welcome enhancement! |
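The point about dask.array accepting a dask array as an indexer can be checked directly, outside of xarray. What follows is only a minimal sketch, assuming a dask version that has this feature and a one-dimensional integer indexer; the names x, idx and picked are illustrative and do not come from the thread.

```python
import numpy as np
import dask.array as da

# A chunked 1-D dask array: 0, 10, 20, ..., 90
x = da.arange(10, chunks=2) * 10

# The indexer is itself a (chunked) dask array of integers
idx = da.from_array(np.array([1, 3, 5, 7]), chunks=2)

# Indexing stays lazy: the result is still a dask array
picked = x[idx]
print(type(picked))

# Only an explicit compute() materializes the values
print(picked.compute())
```

This is the behaviour that xarray's VectorizedIndexer rejects in the traceback above, because the dask array is not among the indexer types it accepts.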
It seems to work fine with the following change, but it has a lot of duplicated code...
|
As of version 0.12 indexing with dask arrays works out of the box... I think this can be closed now. |
Even though the example from above does work, sadly, the following does not:
import xarray as xr
import dask.array as da
import numpy as np
da = xr.DataArray(np.random.rand(3*4*5).reshape((3,4,5))).chunk(dict(dim_0=1))
idcs = da.argmax('dim_2')
da[dict(dim_2=idcs)]
results in
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-3542cdd6d61c> in <module>
----> 1 da[dict(dim_2=idcs)]
~/src/xarray/xarray/core/dataarray.py in __getitem__(self, key)
604 else:
605 # xarray-style array indexing
--> 606 return self.isel(indexers=self._item_key_to_dict(key))
607
608 def __setitem__(self, key: Any, value: Any) -> None:
~/src/xarray/xarray/core/dataarray.py in isel(self, indexers, drop, **indexers_kwargs)
986 """
987 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "isel")
--> 988 ds = self._to_temp_dataset().isel(drop=drop, indexers=indexers)
989 return self._from_temp_dataset(ds)
990
~/src/xarray/xarray/core/dataset.py in isel(self, indexers, drop, **indexers_kwargs)
1901 indexes[name] = new_index
1902 else:
-> 1903 new_var = var.isel(indexers=var_indexers)
1904
1905 variables[name] = new_var
~/src/xarray/xarray/core/variable.py in isel(self, indexers, drop, **indexers_kwargs)
984 if dim in indexers:
985 key[i] = indexers[dim]
--> 986 return self[tuple(key)]
987
988 def squeeze(self, dim=None):
~/src/xarray/xarray/core/variable.py in __getitem__(self, key)
675 array `x.values` directly.
676 """
--> 677 dims, indexer, new_order = self._broadcast_indexes(key)
678 data = as_indexable(self._data)[indexer]
679 if new_order:
~/src/xarray/xarray/core/variable.py in _broadcast_indexes(self, key)
532 if isinstance(k, Variable):
533 if len(k.dims) > 1:
--> 534 return self._broadcast_indexes_vectorized(key)
535 dims.append(k.dims[0])
536 elif not isinstance(k, integer_types):
~/src/xarray/xarray/core/variable.py in _broadcast_indexes_vectorized(self, key)
660 new_order = None
661
--> 662 return out_dims, VectorizedIndexer(tuple(out_key)), new_order
663
664 def __getitem__(self, key):
~/src/xarray/xarray/core/indexing.py in __init__(self, key)
460 raise TypeError(
461 "unexpected indexer type for {}: {!r}".format(
--> 462 type(self).__name__, k
463 )
464 )
TypeError: unexpected indexer type for VectorizedIndexer: dask.array<arg_agg-aggregate, shape=(3, 4), dtype=int64, chunksize=(1, 4)> |
Yes, something seems to be going wrong here... |
I think the problem is somewhere here: Lines 85 to 103 in aaeea62
I don't think |
I'm having similar issue, here is an example:
|
I'm just curious if there's been any progress on this issue. I'm also getting the same error: |
I don't think anyone is working on it. We would appreciate it if you could try to fix it. |
I wrote a very naive fix. It works but seems to perform really slowly, so I would appreciate some feedback (I'm a beginner with Dask). The patch:
|
@bzah I've been testing your solution and it doesn't seem to be as slow as you mention. Do you have a specific test we could run to make a more robust comparison? |
What I noticed, in my use case, is that it provokes a computation. Is that the reason for what you consider slow? Could it be related to #3237? |
@bzah I tested your patch with the following code:
In my case it seems to take the same time with or without it, but I would like to know if it is the same for you. L. |
@pl-marasco Thanks for the example! However, I could construct an example giving very different results. It is quite close to my original code:
(Basically I want for each month the first event occurring in it). Without the patch and uncommenting |
Hello! First off, thank you for all the hard work on xarray! I use it every day and love it :) I am also having issues indexing with dask arrays and get the following error.
In order to get it to work, I first need to manually call compute to load the data into a NumPy array before using argmax with isel. I'm not sure what info I can provide to help solve the issue; please let me know and I'll send whatever I can. |
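The workaround described in this comment can be sketched as follows. This is an illustration only, assuming the indexer comes from argmax on a chunked DataArray, as in the earlier example in this thread; the names arr, idcs and idcs_loaded are made up for the sketch.

```python
import numpy as np
import xarray as xr

# A small dask-backed DataArray, chunked along the first dimension
arr = xr.DataArray(
    np.random.rand(3, 4, 5), dims=("dim_0", "dim_1", "dim_2")
).chunk({"dim_0": 1})

# argmax on a chunked array is lazy: idcs wraps a dask array
idcs = arr.argmax("dim_2")

# Workaround: load the indexer into memory (NumPy) before indexing
idcs_loaded = idcs.compute()

# Indexing a dask-backed array with a NumPy-backed indexer is supported
result = arr.isel(dim_2=idcs_loaded)
print(result.dims, result.shape)
```

The cost of this workaround is that the indexer is computed eagerly, so the pipeline is no longer fully lazy, which is exactly what the issue asks to avoid.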
@bzah I've been testing your code and I can confirm the increase in timing once .compute() isn't used. Assuming that we have only one sample object after the resample, the expected result should be one compute, and that's what we obtain if we call the computation before the .argmax(). I still don't know the reason, or whether this is correct, but it seems odd to me; it could explain the increase in time, though. @dcherian @shyer do you know if all this makes any sense? Should .isel() automatically trigger the computation, or should it give back a lazy array? Here is the code I've been using (it works only with the modification proposed by @bzah):
|
I'll drop a PR, it might be easier to try and play with this than a piece of code lost in an issue. |
IIUC this cannot work lazily in most cases if you have dimension coordinate variables. When xarray constructs the output after indexing, it will try to index those coordinate variables so that it can associate the right timestamp (e.g.) with the output. The example from @ulijh should work though (it has no dimension coordinate or indexed variables):
import xarray as xr
import dask.array as da
import numpy as np
da = xr.DataArray(np.random.rand(3*4*5).reshape((3,4,5))).chunk(dict(dim_0=1))
idcs = da.argmax('dim_2')
da[dict(dim_2=idcs)]
The example by @rafa-guedes (thanks for that one!) could be made to work, I think.
import numpy as np
import dask.array as da
import xarray as xr
darr = xr.DataArray(data=[0.2, 0.4, 0.6], coords={"z": range(3)}, dims=("z",))
good_indexer = xr.DataArray(
data=np.random.randint(0, 3, 8).reshape(4, 2).astype(int),
coords={"y": range(4), "x": range(2)},
dims=("y", "x")
)
bad_indexer = xr.DataArray(
data=da.random.randint(0, 3, 8).reshape(4, 2).astype(int),
coords={"y": range(4), "x": range(2)},
dims=("y", "x")
)
In [5]: darr
Out[5]:
<xarray.DataArray (z: 3)>
array([0.2, 0.4, 0.6])
Coordinates:
* z (z) int64 0 1 2
In [6]: good_indexer
Out[6]:
<xarray.DataArray (y: 4, x: 2)>
array([[0, 1],
[2, 2],
[1, 2],
[1, 0]])
Coordinates:
* y (y) int64 0 1 2 3
* x (x) int64 0 1
In [7]: bad_indexer
Out[7]:
<xarray.DataArray 'reshape-417766b2035dcb1227ddde8505297039' (y: 4, x: 2)>
dask.array<reshape, shape=(4, 2), dtype=int64, chunksize=(4, 2), chunktype=numpy.ndarray>
Coordinates:
* y (y) int64 0 1 2 3
* x (x) int64 0 1
In [8]: darr[good_indexer]
Out[8]:
<xarray.DataArray (y: 4, x: 2)>
array([[0.2, 0.4],
[0.6, 0.6],
[0.4, 0.6],
[0.4, 0.2]])
Coordinates:
z (y, x) int64 0 1 2 2 1 2 1 0
* y (y) int64 0 1 2 3
* x (x) int64 0 1
We can copy the dimension coordinates of the output (x, y) directly from the indexer. And the dimension coordinate on the input (z) should be a dask array in the output (since z is not a dimension coordinate in the output, this should be fine). |
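The coordinate handling described above can be observed with the NumPy-backed indexer, which already works. The sketch below reuses the values printed for good_indexer and only checks where the output coordinates come from; it is not the proposed fix. The name indexer is illustrative.

```python
import numpy as np
import xarray as xr

darr = xr.DataArray(data=[0.2, 0.4, 0.6], coords={"z": range(3)}, dims=("z",))

# Same values as the good_indexer printed above, fixed for reproducibility
indexer = xr.DataArray(
    data=np.array([[0, 1], [2, 2], [1, 2], [1, 0]]),
    coords={"y": range(4), "x": range(2)},
    dims=("y", "x"),
)

result = darr.isel(z=indexer)

# The output dimensions and their coordinates are taken from the indexer
print(result.dims)
print(result.coords["y"].values, result.coords["x"].values)

# z survives only as a 2-D, non-dimension coordinate on (y, x)
print(result.coords["z"].dims)
```

Because z is no longer a dimension coordinate in the result, nothing in the output seems to require its values eagerly, which is presumably why it could remain a dask array when the indexer is lazy.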
Code example
Problem description
Indexing with chunked arrays fails, whereas it's fine with "normal" arrays. In case the indices are the result of a lazy calculation, I would like to continue lazily.
Expected Output
I would expect an output just like in the "un-chunked" case:
Output of xr.show_versions()
xarray: 0.10.9
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: None
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: None
cftime: None
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.19.4
distributed: None
matplotlib: 2.2.3
cartopy: 0.16.0
seaborn: None
setuptools: 40.4.3
pip: 18.0
conda: None
pytest: 3.8.2
IPython: 6.5.0
sphinx: 1.8.0