Description
What is your issue?
This is not a bug report, but rather a pitfall that might be worth documenting.
I noticed that the order of aggregations matters if NaNs are present, skipna=True (the default), and the aggregation is done in separate calls. This is only a problem for aggregations that scale with N, e.g., mean, but not sum.
Example:
import numpy as np
import xarray as xr

da = xr.DataArray(np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, np.nan, 9]
]), dims=["height", "lat"])
da.mean(["lat", "height"]) -> 4.625 (correct)
da.mean(["height", "lat"]) -> 4.625 (correct)
da.mean("lat").mean("height") -> 5.0
da.mean("height").mean("lat") -> 4.5
The same happens when taking nanmeans with numpy, so this is not an xarray-only issue. The reason is that the second operation gives all intermediate values equal weight, even though they do not represent the same number of data points from the first operation (some rows/columns have 2 data points, others 3).
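For reference, here is a minimal numpy-only sketch of the same effect. It also shows that weighting the intermediate means by the number of valid points behind each one recovers the single-call result. The variable names are just for illustration.

```python
import numpy as np

a = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, np.nan, 9],
])

# Single call: mean over all 8 valid values.
overall = np.nanmean(a)

# Two-step means: the second step weights every row/column mean equally.
rows_then_cols = np.nanmean(np.nanmean(a, axis=1))
cols_then_rows = np.nanmean(np.nanmean(a, axis=0))

# Number of valid (non-NaN) points behind each row mean.
row_counts = np.sum(~np.isnan(a), axis=1)

# Weighting the row means by those counts recovers the overall mean.
weighted = np.average(np.nanmean(a, axis=1), weights=row_counts)

print(overall, rows_then_cols, cols_then_rows, weighted)
```

The row means are 2, 5, and 8, but the last one is based on only 2 values, so the count-weighted average is (3*2 + 3*5 + 2*8) / 8 = 4.625, matching the single-call mean.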
Xarray seems to be behaving correctly, and there may be no way around this without carrying weights across operations. However, I was still surprised by this behavior, so it might be worth documenting a warning, since it is not uncommon for users to perform aggregations in multiple steps, and skipna is True by default. The differences are largest when averaging over dimensions along which the number of NaNs varies a lot.
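As a rough sketch of what "carrying weights across operations" could look like on the user side, one option is to keep the per-step count of valid values and use it as weights in the next step. This assumes DataArray.count and DataArray.weighted are suitable here; it is not something xarray does automatically, and the names step1, n_valid, and step2 are only illustrative.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.array([
        [1, 2, 3],
        [4, 5, 6],
        [7, np.nan, 9],
    ]),
    dims=["height", "lat"],
)

# First step: mean over "lat" (skipna=True by default for float data).
step1 = da.mean("lat")

# Keep the number of non-NaN values that went into each of those means.
n_valid = da.count("lat")

# Second step: weight each intermediate mean by its number of valid points.
step2 = step1.weighted(n_valid).mean("height")

print(float(step2), float(da.mean(["height", "lat"])))
```

With the example data, the weighted two-step result agrees with the single-call mean (4.625), whereas the unweighted da.mean("lat").mean("height") gives 5.0.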