Is your feature request related to a problem?
Currently, grouping by multiple coordinates (e.g., time.year and time.season) requires creating a new set of coordinates before grouping due to the xarray limitations described below.
Xarray's GroupBy operations are currently limited:
One can only group by a single variable.
When grouping by a dask array, that array will be computed to discover the unique group labels and their locations.
Current temporal averaging logic (workaround for multi-variable grouping):
1. Preprocess time coordinates (e.g., drop leap days, subset based on reference climatology)
2. Transform time coordinates from an xarray.DataArray to a pandas.DataFrame
   a. Keep only the DataFrame columns needed for grouping (e.g., "year" and "season" for seasonal group averages), essentially "labeling" coordinates with their groups
   b. Process the DataFrame, including:
      - Mapping of months to custom seasons for custom seasonal grouping
      - Correction of "DJF" seasons by shifting Decembers over to the next year
      - Mapping of seasons to their mid months to create cftime coordinates (season strings aren't supported in cftime/datetime objects)
3. Convert DataFrame to cftime objects to represent new time coordinates
4. Replace existing time coordinates in the DataArray with new time coordinates
5. Group DataArray with new time coordinates for the mean
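The labeling and regrouping steps above can be sketched in plain Python. This is only an illustration of the idea, not xcdat's implementation: it uses `datetime.date` in place of cftime objects, assumes the standard DJF/MAM/JJA/SON seasons, places each season on the first day of its mid month, and all function names are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Standard meteorological seasons keyed by month number (assumed for this sketch).
MONTH_TO_SEASON = {
    12: "DJF", 1: "DJF", 2: "DJF",
    3: "MAM", 4: "MAM", 5: "MAM",
    6: "JJA", 7: "JJA", 8: "JJA",
    9: "SON", 10: "SON", 11: "SON",
}
# Mid month of each season, used to build a representative date.
SEASON_MID_MONTH = {"DJF": 1, "MAM": 4, "JJA": 7, "SON": 10}


def season_label(t: date) -> tuple[int, str]:
    """Label a date with its (year, season) group.

    Decembers are shifted to the next year so that a DJF season
    is not split across two calendar years.
    """
    season = MONTH_TO_SEASON[t.month]
    year = t.year + 1 if t.month == 12 else t.year
    return year, season


def label_to_time(label: tuple[int, str]) -> date:
    """Replace a (year, season) label with a date in the season's mid month."""
    year, season = label
    return date(year, SEASON_MID_MONTH[season], 1)


def seasonal_means(times, values):
    """Group values by their new time coordinate and average each group."""
    groups = defaultdict(list)
    for t, v in zip(times, values):
        groups[label_to_time(season_label(t))].append(v)
    return {t: sum(vs) / len(vs) for t, vs in sorted(groups.items())}


times = [date(2000, 12, 1), date(2001, 1, 1), date(2001, 2, 1), date(2001, 3, 1)]
values = [1.0, 2.0, 3.0, 10.0]
result = seasonal_means(times, values)
```

Here December 2000 ends up in the same group as January and February 2001, which is the effect the "DJF" correction is meant to achieve.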
Describe the solution you'd like
It would be simpler, cleaner, and probably more performant to call something like .groupby(["time.year", "time.season"]) instead (waiting on xarray to support this with flox). This solution would remove much of the internal complexity in the temporal averaging API.
We might be able to achieve this using flox directly:
These limitations can be avoided by using flox.xarray.xarray_reduce, which allows grouping by multiple variables, lazy grouping by dask variables, as well as an arbitrary combination of categorical grouping and binning.
Additionally, we would need to figure out how to perform the time coordinate processing steps described in 2b directly on xarray objects if we move away from using pandas.DataFrame.
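One possible direction for the 2b steps without a DataFrame is to work on integer year and month arrays, such as those exposed by xarray's `time.dt.year` and `time.dt.month` accessors. The sketch below assumes those values have already been extracted as NumPy arrays; the lookup tables and the `label_seasons` helper are hypothetical, and custom seasons would amount to swapping in different tables.

```python
import numpy as np

# Season names indexed by season code (assumed standard seasons).
SEASON_NAMES = np.array(["DJF", "MAM", "JJA", "SON"])
# MONTH_TO_CODE[m] is the season code for month m; index 0 is an unused placeholder.
MONTH_TO_CODE = np.array([-1, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 0])
# Mid month of each season, indexed by season code.
CODE_TO_MID_MONTH = np.array([1, 4, 7, 10])


def label_seasons(years, months):
    """Return season years, season names, and mid months for each time step."""
    years = np.asarray(years)
    months = np.asarray(months)
    codes = MONTH_TO_CODE[months]
    # Shift Decembers to the next year so each DJF stays in one group.
    season_years = years + (months == 12)
    return season_years, SEASON_NAMES[codes], CODE_TO_MID_MONTH[codes]


years = np.array([2000, 2001, 2001, 2001])
months = np.array([12, 1, 4, 11])
season_years, seasons, mid_months = label_seasons(years, months)
```

Because every step is an array lookup or an elementwise operation, the same logic should carry over to arrays held inside xarray objects, though that has not been verified here.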
Describe alternatives you've considered
Multi-variable grouping was originally done using pd.MultiIndex, but we shifted away from this approach because this object cannot be written out to netCDF4. Also, pd.MultiIndex is not the standard object type for representing time coordinates in xarray; the standard object types are np.datetime64 and cftime.
tomvothecoder changed the title from "[FEATURE]: Improve temporal averaging grouping without the use of pandas MultiIndex" to "[FEATURE]: Improve temporal averaging grouping logic" on Apr 6, 2022
Related code in xcdat for temporal grouping: xcdat/xcdat/temporal.py, lines 1266 to 1322 in c9bcbcd
Additional context
Future solution through xarray + flox:
- xarray version in "Update GroupBy constructor for grouping by multiple variables, dask arrays" (pydata/xarray#6610), we should be able to do this.
- "flox in GroupBy and resample" (pydata/xarray#5734) is now merged, which improves .groupby() performance significantly.