Description
Is your feature request related to a problem?
I've tried to confirm that this has not been raised elsewhere; apologies if I have missed it. At first I thought it might be related to #7545, but I do not believe it is after all. I cannot identify any way to change the behavior of `DataArray.resample()` at the boundaries of the array, like what exists for `DataArray.coarsen()`. I imagine the current behavior is desirable in many cases. However, I had not realized that, if the number of times in an array does not map evenly onto the desired output frequency, the final element of the output will aggregate whatever remainder is left, even if it contains fewer points than the other windows (when down-sampling). For example, if you resample a 31-day month of daily data to every seven days using `DataArray.resample().mean()`, the first four output periods will each contain an average of seven days of data, while the fifth period will average only the remaining three days. I imagine this should be the default behavior, but in some cases I think it would be helpful to have the option to simply discard the final period if it does not contain enough points. Below is an extreme example illustrating how this can lead to some potentially unexpected (maybe naively so) results.
xarray version: 2025.9.0
```python
>>> times = xr.date_range("2015-01-01", "2015-02-01", freq="D", name="time")
>>> data = xr.DataArray(np.arange(times.size).astype("float"), coords={"time": times})
>>> data
<xarray.DataArray (time: 32)> Size: 256B
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
       13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25.,
       26., 27., 28., 29., 30., 31.])
Coordinates:
  * time     (time) datetime64[ns] 256B 2015-01-01 2015-01-02 ... 2015-02-01
>>> data.resample(time="ME").mean()
<xarray.DataArray (time: 2)> Size: 16B
array([15., 31.])
Coordinates:
  * time     (time) datetime64[ns] 16B 2015-01-31 2015-02-28
>>> data.resample(time="ME").count()
<xarray.DataArray (time: 2)> Size: 16B
array([31, 1])
Coordinates:
  * time     (time) datetime64[ns] 16B 2015-01-31 2015-02-28
```
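The 31-day/7-day case described above can be checked directly with `count()`, which confirms that the last window silently receives fewer samples (a minimal sketch using the same setup as the example above):

```python
import numpy as np
import xarray as xr

# 31 days of daily data resampled to 7-day windows: four full windows plus
# a 3-day remainder, which resample aggregates without any warning.
times = xr.date_range("2015-01-01", "2015-01-31", freq="D", name="time")
data = xr.DataArray(np.arange(times.size, dtype=float), coords={"time": times})

counts = data.resample(time="7D").count()
print(counts.values)  # [7 7 7 7 3]
```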
I think I understand why `resample` currently operates in this way (imagine you have unevenly sampled data), but if you are downsampling evenly sampled data, which I think is a reasonably common use case, the last period potentially having a very different number of samples is a problem, and, so far as I could tell, there is no documentation or warning that alerts the user that their output data at the boundaries is based on a different number of samples. `coarsen`, however, does currently complain by default when your data does not fit the window size evenly, and allows you to trim the data before coarsening. I think consistency between `coarsen` and `resample` would be ideal.
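To make the contrast concrete, here is a small sketch (using the same toy data as above) of how `coarsen` already exposes boundary control via its `boundary` parameter, while `resample` offers no equivalent:

```python
import numpy as np
import xarray as xr

times = xr.date_range("2015-01-01", "2015-01-31", freq="D", name="time")
data = xr.DataArray(np.arange(times.size, dtype=float), coords={"time": times})

# coarsen lets the caller choose how the ragged end is handled:
trimmed = data.coarsen(time=7, boundary="trim").mean()  # drops the 3-day remainder
print(trimmed.sizes["time"])  # 4

# resample has no such knob: the short final window is silently included.
weekly = data.resample(time="7D").mean()
print(weekly.sizes["time"])  # 5
```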
Describe the solution you'd like
Either:
- warn the user that the number of samples in each resampling window is not identical (so users can decide whether or not this is what they expect)
- update the documentation for resample to make it clearer how data boundaries are handled (the current documentation says very little in this regard)
or:
- add support for `trim` and potentially `boundary` parameters, like those in `coarsen`, which give the user full control over if and how resampling is done at the data boundaries
Describe alternatives you've considered
The only workaround I can see at present requires the user to determine, prior to resampling, whether the data will map cleanly onto the desired output frequency, and if not (and if they want windows only of a certain size), to discard the undesired output. In practice, I use `DataArray.resample().count()` to get the number of samples in each period and then check whether all elements are equal. It seems like something like this could easily be implemented under the hood to provide the warning I suggested earlier.
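The workaround described above can be sketched as follows; `resample_mean_full_windows` is a hypothetical helper name, not an existing xarray API:

```python
import numpy as np
import xarray as xr

def resample_mean_full_windows(da, freq):
    """Resample to `freq` means, dropping windows with fewer samples than the rest.

    Hypothetical user-side helper: compare per-window counts and keep only
    windows that contain the maximum (i.e. full) number of samples.
    """
    counts = da.resample(time=freq).count()
    means = da.resample(time=freq).mean()
    full = counts == counts.max()
    return means.where(full, drop=True)

times = xr.date_range("2015-01-01", "2015-01-31", freq="D", name="time")
data = xr.DataArray(np.arange(times.size, dtype=float), coords={"time": times})

weekly = resample_mean_full_windows(data, "7D")
print(weekly.sizes["time"])  # 4 full 7-day windows; the 3-day remainder is dropped
```

This is essentially the `count()`-and-compare check mentioned above, packaged into one call; the same comparison could power an internal warning.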
Additional context
No response