
numpy_groupies #4540

Status: Closed. max-sixty wants to merge 4 commits from the npg branch.

Conversation

max-sixty (Collaborator):

Very early effort; I found this harder than I expected. I was trying to use the existing groupby infra, but I think I should maybe start afresh. The result of the numpy_groupies operation is a fully formed array, whereas we're used to handling an iterable of results which need to be concatenated.
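
A minimal sketch of that point, using numpy_groupies directly (toy data assumed):

import numpy as np
import numpy_groupies as npg

values = np.array([1.0, 2.0, 3.0, 4.0])
group_idx = np.array([0, 0, 1, 1])  # group labels, one per element

# A single call returns the fully formed result array, one entry per
# group, with nothing left to concatenate:
npg.aggregate(group_idx, values, func="sum")  # -> array([3., 7.])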

I also added some type signatures / notes while going through the existing code, mostly for my own understanding.

If anyone has any thoughts, feel free to comment; otherwise I'll resume this soon.

@@ -404,6 +418,8 @@ def __init__(
        self._groups = None
        self._dims = None

    # TODO: is this correct? Should we be returning the dims of the result? This
    # will use the original dim where we're grouping by a coord.
max-sixty (Collaborator, Author):

Is the existing code for this property correct? Currently x.groupby(foo).dims != x.groupby(foo).sum(...).dims when we're grouping by a non-indexed coord.
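
A minimal sketch of the discrepancy (toy data assumed; the .dims behavior is as described in the reply below):

import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6.0),
    dims="x",
    coords={"labels": ("x", ["a", "a", "b", "b", "c", "c"])},  # non-indexed coord
)

da.groupby("labels").dims           # dims of the first group: ("x",)
da.groupby("labels").sum(...).dims  # dims of the result: ("labels",)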

dcherian (Contributor):

It returns the dims for the first group, so you can decide what dim values can be passed to .mean. But I agree it is confusing; maybe we should deprecate and remove it. This use case is also served by implementing GroupBy.__getitem__.

# Conflicts:
#	asv_bench/benchmarks/unstacking.py
#	xarray/core/options.py
dcherian (Contributor) left a comment:

@max-sixty let me know how I can help move this forward.


# The remainder is mostly copied from `_combine`

# FIXME: this part seems broken at the moment — the `_infer_concat_args`
dcherian (Contributor):

Since npg returns a full array, the concat bit isn't needed any more, so combined = applied. I think you could just delete all the concat code.

return grouped


def npg_aggregate(
dcherian (Contributor):

Could be moved down to Variable.


def mean(self, dim=None):
    grouped = self._npg_groupby(func="mean")
    if dim:
dcherian (Contributor):

Not right for mean. Would it work if we broadcast indices along all reduction dims before passing to npg?
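
A minimal sketch of that broadcasting idea (toy data assumed, not the PR's code):

import numpy as np
import numpy_groupies as npg

arr = np.arange(12.0).reshape(3, 4)  # dims ("y", "x")
group_idx = np.array([0, 0, 1, 1])   # groups defined along "x"

# Broadcast the 1-D indices to the array's shape and ravel both, so the
# mean is taken over all reduction dims at once (a mean of per-axis
# means would be wrong for unequal group sizes):
idx = np.broadcast_to(group_idx, arr.shape)
npg.aggregate(idx.ravel(), arr.ravel(), func="mean")  # one mean per group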


applied = applied_example = type(self._obj)(
    data=array,
    dims=tuple(self.dims_()),
dcherian (Contributor), Mar 7, 2021:

Need to assign coordinate variables from self._obj here, or maybe the apply_ufunc version will solve that.

from numpy_groupies.aggregate_numba import aggregate

axis = da.get_axis_num(dim)
return aggregate(group_idx=group_idx, a=da, func=func, axis=axis)
dcherian (Contributor):

Suggested change:

- return aggregate(group_idx=group_idx, a=da, func=func, axis=axis)
+ return aggregate(group_idx=group_idx, a=da.data, func=func, axis=axis)

Could make this the following (from @shoyer's notebook) or do that later...

import numpy as np
import numpy_groupies
import pandas as pd
import xarray


def _binned_agg(
    array: np.ndarray,
    indices: np.ndarray,
    num_bins: int,
    *,
    func,
    fill_value,
    dtype,
) -> np.ndarray:
    """NumPy helper function for aggregating over bins."""
    # Drop positions whose bin index is NaN (values that fell outside the bins).
    mask = np.logical_not(np.isnan(indices))
    int_indices = indices[mask].astype(int)
    # Aggregate over the (flattened) core dims in a single call.
    result = numpy_groupies.aggregate_numpy.aggregate(
        int_indices,
        array[..., mask],
        func=func,
        size=num_bins,
        fill_value=fill_value,
        dtype=dtype,
        axis=-1,
    )
    return result


def groupby_bins_agg(
    array: xarray.DataArray,
    group: xarray.DataArray,
    bins,
    func="sum",
    fill_value=0,
    dtype=None,
    **cut_kwargs,
) -> xarray.DataArray:
    """Faster equivalent of Xarray's groupby_bins(...).sum()."""
    # TODO: implement this upstream in xarray:
    # https://github.com/pydata/xarray/issues/4473
    binned = pd.cut(np.ravel(group), bins, **cut_kwargs)
    new_dim_name = group.name + "_bins"
    indices = group.copy(data=binned.codes.reshape(group.shape))

    result = xarray.apply_ufunc(
        _binned_agg,
        array,
        indices,
        input_core_dims=[indices.dims, indices.dims],
        output_core_dims=[[new_dim_name]],
        output_dtypes=[array.dtype],
        dask_gufunc_kwargs=dict(
            output_sizes={new_dim_name: binned.categories.size},
        ),
        kwargs={
            "num_bins": binned.categories.size,
            "func": func,
            "fill_value": fill_value,
            "dtype": dtype,
        },
        dask="parallelized",
    )
    result.coords[new_dim_name] = binned.categories
    return result
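
A hypothetical usage of the helper above (made-up data; assumes the imports and definitions from the snippet):

lon = xarray.DataArray(np.linspace(5.0, 355.0, 36), dims="x", name="lon")
da = xarray.DataArray(np.random.rand(4, 36), dims=("y", "x"))

result = groupby_bins_agg(da, lon, bins=np.arange(0, 361, 60), func="sum")
# intended to match: da.groupby_bins(lon, bins=np.arange(0, 361, 60)).sum()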

max-sixty (Collaborator, Author):

Cheers @dcherian. I've been a bit absent from xarray features for the past couple of months, as you know.

If you want to take this on, please feel free. It's still at the top of my list, so I would get to it, but I really don't want to slow the progress.

max-sixty (Collaborator, Author):

@dcherian I know we discussed this a couple of weeks ago, and I said I was hoping to take another pass at this using https://github.com/dcherian/dask_groupby

For transparency, I haven't managed to spend any time on this yet. I certainly don't want to block you if you're interested in taking this forward.

FWIW my selfish motivation is more around speeding up in-memory groupbys than dask groupbys; the elegance of numpy_groupies & dask_groupby is that they can potentially be unified.

dcherian (Contributor):

No worries @max-sixty! It looks like @andersy005 will be trying to get this completed soon.

tlogan2000:

Hello all, my fellow xclim (https://github.com/Ouranosinc/xclim) devs and I would be interested to know whether this PR is still moving forward. xr.resample is pretty fundamental to much of our package, so we are definitely interested in the possibility of improved performance.

cheers

dcherian (Contributor):

Hi @tlogan2000, this is moving forward, but slowly.

The good news is that I got resampling working this week (at least for some minimal test cases). It works by rechunking so that every chunk boundary lines up with a group boundary, and then applies numpy_groupies blockwise.

https://github.com/dcherian/dask_groupby/blob/8a30c4b20f23acacfc3b53dbeab2a7b268ecd3fc/dask_groupby/xarray.py#L268-L272
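
A minimal sketch of that rechunk-then-blockwise idea with toy data (the concept only, not the actual dask_groupby code):

import dask.array
import numpy as np

data = dask.array.ones(12, chunks=5)  # original chunks: (5, 5, 2)

# Three resample periods of four elements each: rechunk so every chunk
# boundary lines up with a group boundary.
data = data.rechunk(4)                # chunks: (4, 4, 4)

# Each block now holds exactly one group, so a blockwise reduction
# (numpy_groupies in the real implementation) yields one value per group.
means = data.map_blocks(
    lambda block: np.atleast_1d(block.mean()),
    chunks=(1,),
    dtype=data.dtype,
).compute()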

You can test it out with something like this:

dask_groupby.xarray.resample_reduce(ds.resample(time="M"), func="mean")

For general groupby use:

dask_groupby.xarray.xarray_reduce(...)

This still fails a few xarray tests, but a lot of them pass! Let me know how it goes, and please file bug reports over at the dask_groupby repo if you find any.

tlogan2000:

FYI @aulemahal @Zeitsperre @huard ... re: the xclim discussion yesterday. If we have spare moments in the following weeks we could try a few tests on our end to benchmark and provide bug reports.

@max-sixty max-sixty closed this Oct 24, 2021
@max-sixty max-sixty deleted the npg branch February 5, 2022 22:24