# Experimenting with Masked Data and Weighting in xarray

This notebook examines how weights interact with masked (missing) data in reduction operations such as simple averages and grouped averages. It focuses on weighted temporal averaging.

In [12]:
# flake8: noqa f401
import numpy as np
import xarray as xr

import xcdat  # importing xcdat registers the `.temporal` accessor used below

## Create a sample monthly time series dataset

In [13]:
ds = xr.Dataset(
    coords={
        "lat": [-90],
        "lon": [0],
        "time": xr.DataArray(
            data=np.array(
                [
                    "2000-01-01T00:00:00.000000000",
                    "2000-02-01T00:00:00.000000000",
                    "2000-03-01T00:00:00.000000000",
                    "2000-04-01T00:00:00.000000000",
                    "2001-01-01T00:00:00.000000000",
                ],
                dtype="datetime64[ns]",
            ),
            dims=["time"],
            attrs={
                "axis": "T",
                "long_name": "time",
                "standard_name": "time",
                "bounds": "time_bnds",
            },
        ),
    }
)
ds["time_bnds"] = xr.DataArray(
    name="time_bnds",
    data=np.array(
        [
            ["2000-01-01T00:00:00.000000000", "2000-02-01T00:00:00.000000000"],
            ["2000-02-01T00:00:00.000000000", "2000-03-01T00:00:00.000000000"],
            ["2000-03-01T00:00:00.000000000", "2000-04-01T00:00:00.000000000"],
            ["2000-04-01T00:00:00.000000000", "2000-05-01T00:00:00.000000000"],
            ["2000-12-01T00:00:00.000000000", "2001-01-01T00:00:00.000000000"],
        ],
        dtype="datetime64[ns]",
    ),
    coords={"time": ds.time},
    dims=["time", "bnds"],
    attrs={"is_generated": "True"},
)

ds["ts"] = xr.DataArray(
    data=np.array([[[2]], [[np.nan]], [[1]], [[1]], [[2]]]),
    coords={"lat": ds.lat, "lon": ds.lon, "time": ds.time},
    dims=["time", "lat", "lon"],
)

In [14]:
ds

## 1. Set the private `TemporalAccessor` attributes so the internal methods can be called directly

In [15]:
ds.temporal._time_bounds = ds.time_bnds
ds.temporal._mode = "average"
ds.temporal._freq = ds.temporal._infer_freq()

## 2. Calculate the weights for monthly data (uses time bounds)

In [16]:
weights = ds.temporal._get_weights()
weights
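
As a rough illustration of how weights can be derived from time bounds, the sketch below uses plain NumPy. It assumes each weight is the length of the time interval divided by the total length of its month group. That is an assumption about what `_get_weights()` does, not a copy of the xcdat implementation, and the names `lengths`, `group_totals`, and `sketch_weights` are hypothetical.

```python
import numpy as np

# Lower and upper bounds for three monthly intervals.
time_bnds = np.array(
    [
        ["2000-01-01", "2000-02-01"],
        ["2000-02-01", "2000-03-01"],
        ["2001-01-01", "2001-02-01"],
    ],
    dtype="datetime64[ns]",
)

# Month number (1-12) of each lower bound, used here as the group label.
months = time_bnds[:, 0].astype("datetime64[M]").astype(int) % 12 + 1

# Length of each interval in days.
lengths = (time_bnds[:, 1] - time_bnds[:, 0]) / np.timedelta64(1, "D")

# Divide each length by the total length of its month group (assumed scheme).
group_totals = np.array([lengths[months == m].sum() for m in months])
sketch_weights = lengths / group_totals
```

Under this scheme the two January intervals share their group equally, while the single February interval gets the full weight for its group.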

## 3. Means with weights (no masking)

In [17]:
# Grouped weighted average (annual cycle)
dv_gb_avg1 = (ds.ts * weights).groupby("time.month").sum()
dv_gb_avg1

In [18]:
# Simple average with monthly weights
dv_avg1 = ds.ts.weighted(weights).mean(dim="time")
dv_avg1

## 4. Means with weights (masked with `np.nan` for missing values)

In [19]:
masked_weights = weights.copy()
# Mask the weight that lines up with the missing `ts` value (index 1, Feb 2000)
masked_weights.data[1] = np.nan
masked_weights

In [20]:
dv_gb_avg2 = (ds.ts * masked_weights).groupby("time.month").sum()
dv_gb_avg2

In [21]:
# Raises ValueError because `masked_weights` contains np.nan
dv_avg2 = ds.ts.weighted(masked_weights).mean(dim="time")

ValueError: `weights` cannot contain missing values. Missing values can be replaced by `weights.fillna(0)`.
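
The error can be reproduced without xcdat. The following is a minimal sketch using small hypothetical arrays (`values`, `nan_weights`) rather than the dataset above. It shows that `.weighted()` rejects weights containing `np.nan` and accepts the same weights after `fillna(0)`.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-ins for `ds.ts` and `masked_weights`.
values = xr.DataArray([2.0, np.nan, 1.0], dims="time")
nan_weights = xr.DataArray([0.5, np.nan, 1.0], dims="time")

try:
    values.weighted(nan_weights).mean(dim="time")
    raised = False
except ValueError:
    # xarray refuses weights that contain missing values.
    raised = True

# Replacing the missing weight with 0 makes the weights valid.
safe_mean = values.weighted(nan_weights.fillna(0)).mean(dim="time")
```

The missing value is excluded from both the weighted sum and the sum of weights, so `safe_mean` is (2 × 0.5 + 1 × 1) / (0.5 + 1).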

## 5. Means with weights (masked with 0 for missing values)

In [22]:
filled_weights = masked_weights.fillna(0)  # fillna() already returns a new object
filled_weights

In [23]:
dv_gb_avg3 = (ds.ts * filled_weights).groupby("time.month").sum()
dv_gb_avg3

In [24]:
dv_avg3 = ds.ts.weighted(filled_weights).mean(dim="time")
dv_avg3

## Compare results

In [25]:
# Generated weights vs. generated weights with np.nan for missing values
dv_gb_avg1.identical(dv_gb_avg2)

True

In [26]:
# dv_avg1.identical(dv_avg2)  # Not run: dv_avg2 is undefined because weights containing np.nan raise a ValueError

In [27]:
# Generated weights vs. generated weights with 0 for missing values
dv_gb_avg1.identical(dv_gb_avg3)

True

In [28]:
dv_avg1.identical(dv_avg3)

True

## Key Takeaways

- `weights.fillna(0)` is required when the weights contain `np.nan` and the `.weighted().mean()` API is used to calculate a simple weighted average (e.g., `ds.spatial.average()`). Otherwise xarray raises:
  - `ValueError: 'weights' cannot contain missing values. Missing values can be replaced by 'weights.fillna(0)'.`
  - **Question: Do the weights generated by our spatial averaging methods include `np.nan`?**
- The `weights` generated from time coordinates do not contain `np.nan`, so `weights.fillna(0)` is not required for temporal averaging.

**In any case, multiplying any weight value (`np.nan`, 0, 1, etc.) by a missing value (represented in xarray as `np.nan`) results in `np.nan`.**

**This has no effect on the final reduction, because the reductions used here skip `np.nan` values by default. (See the cells below.)**


In [29]:
ds.ts * weights

In [30]:
ds.ts * filled_weights

In [31]:
ds.ts * masked_weights
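
The same behavior can be checked on a small standalone array. This sketch uses hypothetical values rather than the dataset above. It multiplies a series with one missing value by three weight variants and sums each product with xarray's default `skipna` handling.

```python
import numpy as np
import xarray as xr

# Hypothetical series with one missing value at index 1.
ts = xr.DataArray([2.0, np.nan, 1.0], dims="time")

# The weight at the missing position is 1, 0, or np.nan.
weight_variants = {
    "unmasked": [0.5, 1.0, 1.0],
    "zero_filled": [0.5, 0.0, 1.0],
    "nan_masked": [0.5, np.nan, 1.0],
}

products = {
    name: ts * xr.DataArray(w, dims="time") for name, w in weight_variants.items()
}

# sum() skips NaN by default for float data, so all three totals agree.
totals = {name: float(p.sum()) for name, p in products.items()}
```

Every product is `np.nan` at the missing position regardless of the weight there, so the three totals are equal.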