Allow grouping by dask variables #2852
It is very hard to make this sort of groupby lazy, because you are grouping over the variable. In this specific example, it sounds like what you want is to compute the histogram of labels. That could be accomplished without groupby; for example, you could use apply_ufunc. So my recommendation is to think of a way to accomplish what you want that does not involve groupby.
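The histogram-of-labels idea can be sketched in plain numpy (illustrative data; this is only the kind of per-block kernel that apply_ufunc could wrap, not xarray's own implementation):

```python
import numpy as np

# Hypothetical integer label array; in the issue this would be a large
# dask-backed variable processed block by block.
labels = np.array([3, 1, 3, 0, 1, 3, 2])

# Count occurrences of each label without any groupby machinery.
# minlength pins the output size, so every block produces the same shape,
# which is what makes this reduction friendly to a dask tree-reduce.
counts = np.bincount(labels, minlength=4)
print(counts)  # [1 2 1 3]
```

Per-block results combine by simple elementwise addition, which is why this avoids the memory blow-up a label-discovering groupby would incur.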
The current design of groupby makes operations that group over large keys stored in dask inefficient. This could be done more efficiently.
Many thanks for your answers @shoyer and @rabernat. I am relatively new to xarray. I will give apply_ufunc a try. I also had the following idea. Given that:
I do not actually need the discovery of unique labels that groupby performs. Maybe there is already something like that in xarray, or maybe it is something I can derive from the implementation of groupby.
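When the set of labels is known up front, the per-group mean can be sketched without any unique-label discovery at all (plain numpy with made-up data; n_labels stands for the user-supplied label count):

```python
import numpy as np

# Hypothetical data: values and their integer group labels, with the labels
# known in advance to be 0 .. n_labels-1, so no discovery pass is needed.
n_labels = 3
labels = np.array([0, 2, 0, 1, 2, 2])
values = np.array([1.0, 4.0, 3.0, 5.0, 2.0, 6.0])

# Per-group sums and counts in one pass each; both are fixed-size
# reductions that dask could combine across blocks by simple addition.
sums = np.bincount(labels, weights=values, minlength=n_labels)
counts = np.bincount(labels, minlength=n_labels)
means = sums / counts
print(means)  # [2. 5. 4.]
```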
It sounds like there is an apply_ufunc solution to your problem, but I don't know how to write it! ;)
Roughly how many unique labels do you have?
That's a tough question ;) In the current dataset I have 950 unique labels, but in my use cases it can be a lot more (e.g. agricultural crops) or a lot less (administrative boundaries or regions).
I'm going to share a code snippet that might be useful to people reading this issue. I wanted to group my data by month and year, and take the mean for each group. Here is the code:
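The snippet itself did not survive extraction; a sketch of one common way to do this (synthetic daily data; the coordinate name year_month is illustrative, not from the original comment):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical example data: one value per day over two years
time = pd.date_range("2000-01-01", "2001-12-31", freq="D")
da = xr.DataArray(
    np.arange(len(time), dtype=float), coords={"time": time}, dims="time"
)

# Label each timestamp with its "YYYY-MM" string, then group on that label,
# so each (year, month) pair becomes its own group.
da = da.assign_coords(year_month=da["time"].dt.strftime("%Y-%m"))
monthly_means = da.groupby("year_month").mean()  # 24 groups for two years
```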
👀 cc @chiaral |
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here or remove the stale label.
You can do this with flox now. Eventually we can update xarray to support grouping by a dask variable. The limitation will be that the user will have to provide "expected groups" so that we can construct the output coordinate. |
Code Sample, a copy-pastable example if possible
I am using xarray in combination with dask.distributed on a cluster, so a minimal code sample demonstrating my problem is not easy to come up with.

Problem description

Here is what I observe:
In my environment, dask.distributed is correctly set up with auto-scaling. I can verify this by loading data into xarray and using aggregation functions like mean(). This triggers auto-scaling, and the dask dashboard shows that the processing is spread across worker nodes.

I have the following xarray dataset called geoms_ds, which I load with the following code sample:
This array holds a finite number of integer values denoting groups (or classes if you like). I would like to perform statistics on groups (with additional variables), such as the mean value of a given variable for each group.

I can do this perfectly for a single group using .where(label=xxx).mean('variable'); this behaves as expected, triggering auto-scaling and a dask task graph.

The problem is that I have a lot of groups (or classes), and looping through all of them applying where() is not very efficient. From my reading of the xarray documentation, groupby is what I need to perform stats on all groups at once.

When I try to use geoms_ds.groupby('label').size() for instance, here is what I observe:

Which I assume comes from the fact that the process is killed by PBS for excessive memory usage.
Expected Output

I would expect the following:

- groupby evaluated lazily,
- processing distributed by dask.distributed.
Output of xr.show_versions()
xarray: 0.11.3
pandas: 0.24.1
numpy: 1.16.1
scipy: 1.2.0
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
PseudonetCDF: None
rasterio: 1.0.15
cfgrib: None
iris: None
bottleneck: None
cyordereddict: None
dask: 1.1.1
distributed: 1.25.3
matplotlib: 3.0.2
cartopy: 0.17.0
seaborn: 0.9.0
setuptools: 40.7.1
pip: 19.0.1
conda: None
pytest: None
IPython: 7.1.1
sphinx: None