# Temporal Averaging API Validation Notebook

This notebook compares the outputs of CDAT's and XCDAT's temporal averaging and departures APIs to determine their floating point differences. The goal is to ensure that the differences do not exceed our specified absolute and relative tolerance levels.

## How to Use This Notebook

1. Create and activate the conda development environment
   - `conda env create -f conda-env/test_dev.yml`
   - `conda activate xcdat_test_dev`
2. Clone the `xcdat` repo
   - `git clone https://github.com/XCDAT/xcdat.git`
3. Install the `feature/47-climatology` branch build of `xcdat`
   - `cd xcdat`
   - `git checkout feature/47-climatology`
   - `pip install .`
4. Attach the `xcdat_test_dev` env kernel to this notebook
5. Run cells


## Time Frequencies Table
This table compares the time frequencies that can be used for grouping.

| Output Type 	| Averaging Type 	| XCDAT Frequency 	| CDAT Frequency 	| Groups 	|
|---	|---	|---	|---	|---	|
| time series 	| Yearly means 	| “year” 	| YEAR() 	| (year,) 	|
|  	| Monthly means 	| “month” 	| JAN(), FEB(), MAR(), …, DEC()  	| (year, month) 	|
|  	| Seasonal means 	| “season” 	| DJF(), MAM(), JJA(), SON() 	| (year, season) 	|
|  	| Custom seasonal means 	| custom season, e.g. “JFM,AMJ,JAS,OND” 	| cdutil.times.Seasons() 	| (year, custom_season) 	|
|  	| Daily means 	| “day” 	| N/A (unsupported) 	| (year, month, day) 	|
|  	| Hourly means 	| “hour” 	| N/A (unsupported) 	| (year, month, day, hour) 	|
|  	| N hourly means 	|  Nhour<br> (e.g. 6hour, 3hour, …) 	| N/A (unsupported) 	| (year, month, day, Nhour) 	|
| Climatology 	| Annual cycle climatology 	| “month” 	| ANNUALCYCLE.climatology() 	| (month,) 	|
|  	| Daily cycle climatology 	| “day” 	| N/A (unsupported) 	| (month, day) 	|
|  	| Seasonal cycle climatology 	| “season” 	| SEASONALCYCLE.climatology() 	| (season,) 	|
|  	| Custom seasonal cycle climatology 	| custom season 	| cdutil.times.Seasons() 	| (season,) 	|
|  	| Diurnal cycle climatology 	| *-diurnalNNN 	| N/A (unsupported) 	| Append TOD where TOD is diurnal time index corresponding to NNN 	|
| Departures 	| Annual cycle departures 	| “month” 	| ANNUALCYCLE.departures() 	| (month,) 	|
|  	| Daily cycle departures 	| “day” 	| N/A (unsupported) 	| (month, day) 	|
|  	| Seasonal cycle departures 	| “season” 	| SEASONALCYCLE.departures() 	| (season,) 	|
|  	| Custom seasonal cycle departures 	| custom season 	| cdutil.times.Seasons() 	| (season,) 	|
|  	| Diurnal cycle departures 	| *-diurnalNNN 	| N/A (unsupported) 	| Append TOD where TOD is diurnal time index corresponding to NNN 	|
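
The difference between the time series groups and the climatology groups in the table above can be sketched on toy data. This is a minimal, unweighted illustration using plain pandas on hypothetical values; it is not the XCDAT or CDAT implementation, which also apply time weighting.

```python
import numpy as np
import pandas as pd

# Two years of monthly values on a toy time axis (hypothetical data).
time = pd.date_range("2000-01-01", periods=24, freq="MS")
series = pd.Series(np.arange(24, dtype=float), index=time)

# Time series grouping: one mean per (year,) group.
yearly_means = series.groupby(series.index.year).mean()

# Climatology grouping: one mean per (month,) group, pooled across years.
monthly_climatology = series.groupby(series.index.month).mean()

print(yearly_means)
print(monthly_climatology)
```

The time series output keeps one value per year, while the climatology collapses all years into twelve monthly values.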

## Comparison Methodology
This notebook loops through a list of netCDF dataset files and calls the CDAT and XCDAT temporal averaging/departures APIs on a data variable within the dataset using the aforementioned time frequencies.

Afterwards, `np.testing.assert_allclose()` is used to check whether the outputs meet the relative and absolute tolerances specified below.

  - For time series averaging and climatologies: `rtol=1e-7` and `atol=0`
  - For departures: `rtol=0` and `atol=1e-5`

### What do relative and absolute tolerance mean?

- Relative tolerance (`rtol`):
  - Within tolerance if `absolute(a - b) <= rtol * absolute(b)`
  - `rtol` is generally the right default: floating point precision is finite, so the absolute rounding error of a number grows roughly in proportion to its magnitude.
  - `rtol` compares significant figures.

- Absolute tolerance (`atol`):
  - Within tolerance if `absolute(a - b) <= atol`
  - Use `atol` for numbers that are so close to zero that rounding errors are liable to be larger than the number itself (e.g., departures).
  - `atol` compares fixed decimal places.

- Sources:
  - https://stackoverflow.com/questions/57063555/numpy-allclose-compare-arrays-with-floating-points
  - https://stackoverflow.com/questions/61839984/relative-difference-in-numpy-testing-assert-allclose
  - https://stackoverflow.com/a/4029397
  - https://stackoverflow.com/a/65909907
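
The two tolerance checks described above can be tried on a pair of hypothetical values. This is a minimal sketch: the numbers are illustrative, and `atol=1e-4` is chosen only to make the contrast visible (it is not the tolerance used in this notebook).

```python
import numpy as np

# Large-magnitude values (e.g., a surface temperature in K).
reference = np.array([299.79449463])
actual = reference + 1e-5
# |actual - reference| / |reference| is about 3.3e-8, so rtol=1e-7 passes.
large_passes_rtol = np.allclose(actual, reference, rtol=1e-7, atol=0)

# Near-zero values (e.g., departures): a similar absolute difference is now
# hundreds of times larger than the reference value itself.
reference_0 = np.array([2e-7])
actual_0 = np.array([-9.155e-5])
near_zero_passes_rtol = np.allclose(actual_0, reference_0, rtol=1e-7, atol=0)
near_zero_passes_atol = np.allclose(actual_0, reference_0, rtol=0, atol=1e-4)

print(large_passes_rtol, near_zero_passes_rtol, near_zero_passes_atol)
```

`np.allclose` applies the same `atol + rtol * absolute(b)` bound as `np.testing.assert_allclose`, but returns a boolean instead of raising.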

## Setup Code

In [4]:
import re
from typing import Dict, Tuple, Union

import cdms2
import cdutil
import numpy as np
import pandas as pd
import xarray as xr
from cdms2.tvariable import TransientVariable

import xcdat  # noqa: F401

MONTH_STR_TO_INT = {
    "JAN": 1,
    "FEB": 2,
    "MAR": 3,
    "APR": 4,
    "MAY": 5,
    "JUN": 6,
    "JUL": 7,
    "AUG": 8,
    "SEP": 9,
    "OCT": 10,
    "NOV": 11,
    "DEC": 12,
}
MONTH_INT_TO_STR = dict(zip(MONTH_STR_TO_INT.values(), MONTH_STR_TO_INT.keys()))

SEASONS = ["DJF", "MAM", "JJA", "SON"]


In [5]:
def cdutil_outputs(
    t_var: TransientVariable,
) -> Tuple[Dict[str, TransientVariable], ...]:
    """Temporal averaging and departures outputs using ``cdutil``.

    Parameters
    ----------
    t_var : TransientVariable
        The variable to average.

    Returns
    -------
    Tuple[Dict[str, TransientVariable], ...]
        The climatology, departures, and time series averaging outputs, in
        that order.
    """
    avgs = {
        "year": cdutil.YEAR(t_var),
        "season": cdutil.SEASONALCYCLE(t_var),
        "month": cdutil.ANNUALCYCLE(t_var),
    }
    climos = {
        "month": cdutil.ANNUALCYCLE.climatology(t_var),
        "season": cdutil.SEASONALCYCLE.climatology(t_var),
    }
    departures = {
        "month": cdutil.ANNUALCYCLE.departures(t_var),
        "season": cdutil.SEASONALCYCLE.departures(t_var),
    }

    # Split outputs by month
    for mon_str in MONTH_STR_TO_INT.keys():
        cdutil_attr = getattr(cdutil, mon_str)
        avgs[mon_str] = cdutil_attr(t_var)
        climos[mon_str] = cdutil_attr.climatology(t_var)
        departures[mon_str] = cdutil_attr.departures(t_var)

    # Split outputs by season
    for season in SEASONS:
        cdutil_attr = getattr(cdutil, season)
        avgs[season] = cdutil_attr(t_var)
        climos[season] = cdutil_attr.climatology(t_var)
        departures[season] = cdutil_attr.departures(t_var)

    return climos, departures, avgs


def xcdat_outputs(
    dataset: xr.Dataset, data_var: str
) -> Tuple[Dict[str, xr.Dataset], ...]:
    """Temporal averaging and departures outputs using xcdat.

    To get a specific season or month, use `.sel()`.

    Parameters
    ----------
    dataset : xr.Dataset
        A Dataset.
    data_var : str
        The key of the data variable to average.

    Returns
    -------
    Tuple[Dict[str, xr.Dataset], ...]
        The climatology, departures, and time series averaging outputs, in
        that order.
    """
    # Default Parameters for CDAT, used in XCDAT
    WEIGHTED = True
    CENTER_TIME = True
    SEASON_CONFIG = {"dec_mode": "DJF", "drop_incomplete_djf": False}

    avgs: Dict[str, xr.Dataset] = {
        "year": dataset.temporal.temporal_avg(
            data_var,
            "timeseries",
            "year",
            weighted=WEIGHTED,
            center_times=CENTER_TIME,
        ),
        "season": dataset.temporal.temporal_avg(
            data_var,
            "timeseries",
            "season",
            weighted=WEIGHTED,
            center_times=CENTER_TIME,
            season_config=SEASON_CONFIG,
        ),
        "month": dataset.temporal.temporal_avg(
            data_var,
            "timeseries",
            "month",
            weighted=WEIGHTED,
            center_times=CENTER_TIME,
        ),
        # Temporarily comment out since CDAT doesn't support these freqs
        # "day": dataset.temporal.temporal_avg(
        #     data_var, "timeseries", "day", weighted=WEIGHTED
        # ),
        # "hour": dataset.temporal.temporal_avg(
        #     data_var, "timeseries", "hour", weighted=WEIGHTED
        # ),
    }
    climos: Dict[str, xr.Dataset] = {
        "month": dataset.temporal.temporal_avg(
            data_var,
            "climatology",
            "month",
            weighted=WEIGHTED,
            center_times=CENTER_TIME,
        ),
        "season": dataset.temporal.temporal_avg(
            data_var,
            "climatology",
            "season",
            weighted=WEIGHTED,
            center_times=CENTER_TIME,
            season_config=SEASON_CONFIG,
        ),
        # Temporarily comment out since CDAT doesn't support these freqs
        # "day": dataset.temporal.temporal_avg(
        #     data_var, "climatology", "day", weighted=WEIGHTED
        # ),
    }
    departures: Dict[str, xr.Dataset] = {
        "month": climos["month"].temporal.departures(data_var),
        "season": climos["season"].temporal.departures(data_var),
        # Temporarily comment out since CDAT doesn't support these freqs
        # "day": climos["day"].temporal.departures(data_var),
    }

    # Split outputs by month
    for mon_str, mon_int in MONTH_STR_TO_INT.items():
        avgs[mon_str] = avgs["month"].sel(year_month_level_1=mon_int)
        climos[mon_str] = climos["month"].sel(month=mon_int)
        departures[mon_str] = departures["month"].isel(
            time=(departures["month"].time.dt.month) == mon_int
        )

    # Split outputs by season
    for season in SEASONS:
        avgs[season] = avgs["season"].sel(year_season_level_1=season)
        climos[season] = climos["season"].sel(season=season)
        departures[season] = departures["season"].isel(
            time=(departures["season"].time.dt.season) == season
        )

    return climos, departures, avgs


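def compare_outputs(
    df: pd.DataFrame,
    file: str,
    var: str,
    avg_type: str,
    lib_a: Tuple[str, Dict[str, Union[np.ndarray, TransientVariable]]],
    lib_b: Tuple[str, Dict[str, xr.Dataset]],
    rtol: float = 1e-07,
    atol: float = 0.0,
) -> pd.DataFrame:
    """Compares the outputs of two libraries and records the results.

    ``lib_a`` holds the reference outputs and ``lib_b`` the actual outputs,
    each as a ``(name, outputs_by_freq)`` tuple. One row per frequency is
    added to a new DataFrame built from ``df``.
    """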
    df_new = df.copy()
    lib_a_name, lib_a_outputs = lib_a
    lib_b_name, lib_b_outputs = lib_b

    rows = []
    for freq, output_a in lib_a_outputs.items():
        if output_a is not None:
            output_b = lib_b_outputs[freq]
            if isinstance(output_b, xr.Dataset):
                output_b = output_b[var]

            abs_sum_a = np.sum(abs(output_a.data))
            abs_sum_b = np.sum(abs(output_b.values))
            row = {
                "lib_a": lib_a_name,
                "lib_b": lib_b_name,
                "file": file,
                "var": var,
                "avg_type": avg_type,
                "freq": freq,
                "rtol": rtol,
                "atol": atol,
                "shape_a": output_a.shape,
                "shape_b": output_b.shape,
                "abs_sum_a": abs_sum_a,
                "abs_sum_b": abs_sum_b,
                "abs_sum_diff": abs_sum_b - abs_sum_a,
                "mismatching_shapes": output_a.shape != output_b.shape,
            }
            try:
                np.testing.assert_allclose(
                    output_b.data,
                    output_a.data,
                    rtol=rtol,
                    atol=atol,
                    equal_nan=True,
                )
                row.update({"equal_to_tolerance": True})
            except AssertionError as e:
                msg = str(e)
                row.update(
                    {
                        "equal_to_tolerance": False,
                    }
                )

                # Parse the mismatch statistics out of the assertion message.
                mismatching_text = re.search(
                    r"(?<=Mismatched elements: )(.*)(?=\nMax absolute)", msg
                )
                if mismatching_text:
                    mismatching_elements = mismatching_text.group(0)
                    mismatching_pct = float(
                        re.search(r"(?<=\()(.*)(?=%\))", msg).group(0)
                    )
                    max_abs_diff = float(
                        re.search(
                            r"(?<=Max absolute difference: )(.*)(?=\nMax relative)",
                            msg,
                        ).group(0)
                    )
                    max_rel_diff = float(
                        re.search(
                            r"(?<=Max relative difference: )(.*)(?=\n x)", msg
                        ).group(0)
                    )
                    row.update(
                        {
                            "mismatching_elements": mismatching_elements,
                            "mismatching_percent": mismatching_pct,
                            "max_abs_diff": max_abs_diff,
                            "max_rel_diff": max_rel_diff,
                            "max_rel_diff_pct": max_rel_diff * 100,
                        }
                    )
            rows.append(row)
    # Build the results once, after all frequencies have been compared.
    df_rows = pd.DataFrame(rows)
    df_new = pd.concat([df, df_rows])
    return df_new


## Compare Temporal Avg Outputs

In [6]:
# Dictionary for storing outputs
outputs = {}
# DataFrame for comparing output closeness
df = pd.DataFrame(
    columns=[
        "lib_a",
        "lib_b",
        "file",
        "var",
        "avg_type",
        "freq",
        "rtol",
        "atol",
        "shape_a",
        "shape_b",
        "abs_sum_a",
        "abs_sum_b",
        "abs_sum_diff",
        "equal_to_tolerance",
        "mismatching_elements",
        "mismatching_percent",
        "max_abs_diff",
        "max_rel_diff",
        "max_rel_diff_pct",
    ]
)

vars_to_files = {
    "psl": "./input/demo_data/CMIP5_demo_data/psl_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc",
    "ts": "./input/demo_data/CMIP5_demo_data/ts_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc",
    "TS": "/p/user_pub/work/E3SM/1_0/historical/1deg_atm_60-30km_ocean/atmos/180x360/time-series/mon/ens1/v5/TS_185001_201412.nc"
}

for var, file in vars_to_files.items():
    print(f"Comparing outputs for {file}, {var}")
    x_ds = xcdat.open_dataset(file)
    c_ds = cdms2.open(file)
    t_var = c_ds(var)

    # Calculate the temporal averaging outputs using each library.
    c_climos, c_departs, c_avgs = cdutil_outputs(t_var)
    x_climos, x_departs, x_avgs = xcdat_outputs(x_ds, var)

    # Compare the results of the outputs and add results to the DataFrame.
    df = compare_outputs(
        df,
        file,
        var,
        "timeseries",
        ("cdutil", c_avgs),
        ("xcdat", x_avgs),
        rtol=1e-7,
        atol=0,
    )
    df = compare_outputs(
        df,
        file,
        var,
        "climatology",
        ("cdutil", c_climos),
        ("xcdat", x_climos),
        rtol=1e-7,
        atol=0,
    )
    df = compare_outputs(
        df,
        file,
        var,
        "departures",
        ("cdutil", c_departs),
        ("xcdat", x_departs),
        rtol=0,
        atol=1e-5,
    )

    # Add results to the outputs dictionary for more granular analysis.
    outputs[var] = {
        "cdat": {"climos": c_climos, "departs": c_departs, "avgs": c_avgs},
        "xcdat": {"climos": x_climos, "departs": x_departs, "avgs": x_avgs},
    }


Comparing outputs for ./input/demo_data/CMIP5_demo_data/psl_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc, psl
Comparing outputs for ./input/demo_data/CMIP5_demo_data/ts_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc, ts
Comparing outputs for /p/user_pub/work/E3SM/1_0/historical/1deg_atm_60-30km_ocean/atmos/180x360/time-series/mon/ens1/v5/TS_185001_201412.nc, TS


#### Process Results DataFrame

In [7]:
pd.set_option("display.float_format", lambda x: "%.8f" % x)
df2 = df.copy()
cdat_freq = {
    "season": "SEASONALCYCLE",
    "year": "YEAR",
    "month": "ANNUALCYCLE",
}

df2["cdat_freq"] = df2["freq"].apply(lambda x: cdat_freq.get(x, x))
df2 = df2.set_index(["lib_a", "lib_b", "var", "avg_type", "freq", "cdat_freq"])

df2 = df2.sort_values(["lib_a", "lib_b", "var"], ascending=True)
df2["mismatching_shapes"] = df2["mismatching_shapes"].astype(bool)
df2["max_rel_diff_pct"] = df2["max_rel_diff"] * 100
df2 = df2[
    [
        "shape_a",
        "shape_b",
        "abs_sum_a",
        "abs_sum_b",
        "abs_sum_diff",
        "equal_to_tolerance",
        "mismatching_elements",
        "mismatching_shapes",
        "mismatching_percent",
        "max_abs_diff",
        "max_rel_diff",
        "max_rel_diff_pct"
    ]
]

final_df = df2.sort_values(by=["max_rel_diff"], ascending=False)

### Validation 1 - Check for mismatching shapes

In [8]:
mismatching_shapes = df2.loc[df2.mismatching_shapes == True]
mismatching_shapes = mismatching_shapes[["shape_a", "shape_b"]]
mismatching_shapes

| lib_a | lib_b | var | avg_type | freq | cdat_freq | shape_a | shape_b |
|---|---|---|---|---|---|---|---|
| cdutil | xcdat | TS | departures | season | SEASONALCYCLE | (661, 180, 360) | (1980, 180, 360) |
| cdutil | xcdat | TS | departures | DJF | DJF | (166, 180, 360) | (495, 180, 360) |
| cdutil | xcdat | TS | departures | MAM | MAM | (165, 180, 360) | (495, 180, 360) |
| cdutil | xcdat | TS | departures | JJA | JJA | (165, 180, 360) | (495, 180, 360) |
| cdutil | xcdat | TS | departures | SON | SON | (165, 180, 360) | (495, 180, 360) |
| cdutil | xcdat | psl | departures | season | SEASONALCYCLE | (625, 145, 192) | (1872, 145, 192) |
| cdutil | xcdat | psl | departures | DJF | DJF | (157, 145, 192) | (468, 145, 192) |
| cdutil | xcdat | psl | departures | MAM | MAM | (156, 145, 192) | (468, 145, 192) |
| cdutil | xcdat | psl | departures | JJA | JJA | (156, 145, 192) | (468, 145, 192) |
| cdutil | xcdat | psl | departures | SON | SON | (156, 145, 192) | (468, 145, 192) |


CDAT groups data a bit differently for `SEASONALCYCLE` departures.

- CDAT removes the climatology from the observation data based on the group, then groups by year and season (resulting in 625 time coordinates instead of 1872 for `psl`).
- XCDAT uses xarray's groupby arithmetic to subtract the climatology from the grouped observation data. This restores the original shape of the data (1872 time coordinates instead of 625).
  - An additional averaging operation is needed to get the year and season grouping (FUTURE WORK).
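
The shape difference described above can be reproduced on toy data. This is a minimal sketch using pandas rather than xarray or cdutil, with hypothetical values, no weighting, and a simplified calendar-year grouping for DJF (CDAT groups December with the following January and February, which is why it reports one extra season).

```python
import numpy as np
import pandas as pd

# Three years of monthly values (hypothetical data).
time = pd.date_range("2000-01-01", periods=36, freq="MS")
values = pd.Series(np.arange(36, dtype=float), index=time)

month_to_season = {
    12: "DJF", 1: "DJF", 2: "DJF",
    3: "MAM", 4: "MAM", 5: "MAM",
    6: "JJA", 7: "JJA", 8: "JJA",
    9: "SON", 10: "SON", 11: "SON",
}
seasons = np.array([month_to_season[m] for m in time.month])
years = time.year.to_numpy()

# Groupby arithmetic: subtract each season's climatology from every time step.
# The result keeps the original monthly shape.
seasonal_climatology = values.groupby(seasons).transform("mean")
departures = values - seasonal_climatology

# The additional averaging step: regroup the departures by (year, season).
regrouped = departures.groupby([years, seasons]).mean()

print(len(departures), len(regrouped))
```

The first length matches the monthly input, and the second matches the number of (year, season) groups.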

### Validation 2 - Compare closeness of floating point outputs

Rows with mismatching shapes are dropped from the DataFrame before analysis since we are focusing on shapes that align. `np.testing.assert_allclose()` does not work with mismatching shapes.

#### Split Results DataFrame by Operation Type

In [9]:
# Only keep results whose shapes match
mismatching_elements = df.copy().loc[(df.mismatching_shapes == False)]
mismatching_elements["cdat_freq"] = mismatching_elements["freq"].apply(
    lambda x: cdat_freq.get(x, x)
)

index = ["lib_a", "lib_b", "var", "avg_type", "freq", "cdat_freq"]
mismatching_elements = mismatching_elements[
    [
        *index,
        "mismatching_elements",
        "mismatching_shapes",
        "equal_to_tolerance",
        "max_abs_diff",
        "max_rel_diff",
        "max_rel_diff_pct",
        "abs_sum_a",
        "abs_sum_b",
        "abs_sum_diff",
    ]
]

m_elems_departs = mismatching_elements.loc[
    mismatching_elements["avg_type"] == "departures"
]
m_elems_climos = mismatching_elements.loc[
    mismatching_elements["avg_type"] == "climatology"
]
m_elems_avgs = mismatching_elements.loc[
    mismatching_elements["avg_type"] == "timeseries"
]

# Set the index and sort the values by variable and max relative diff
m_elems_departs = m_elems_departs.set_index(index)
m_elems_climos = m_elems_climos.set_index(index)
m_elems_avgs = m_elems_avgs.set_index(index)

m_elems_departs = m_elems_departs.sort_values(
    by=["var", "max_rel_diff"], ascending=False
)
m_elems_climos = m_elems_climos.sort_values(by=["var", "max_rel_diff"], ascending=False)
m_elems_avgs = m_elems_avgs.sort_values(by=["var", "max_rel_diff"], ascending=False)

#### Compare Timeseries Avg Outputs

Legend for the index of the DataFrame
1. `lib_a`: Library A outputs (the reference values)
2. `lib_b`: Library B outputs (the actual values)
3. `var`:  Name of the data variable from a netCDF file
4. `avg_type`: time series averaging, climatology, or departures
5. `freq`: operation frequency
6. `cdat_freq`: equivalent operation frequency for CDAT

Legend for the columns of the DataFrame

1. `mismatching_elements`: The number of elements that don't meet the specified relative and absolute tolerance levels
2. `mismatching_shapes`: True if the shapes of the outputs for an operation don't align
3. `equal_to_tolerance`: True if the floating point comparison meets the set absolute and relative tolerances
4. `max_abs_diff`: The maximum absolute difference
   - absolute_diff = abs(actual - reference)
5. `max_rel_diff`: The maximum relative difference, expressed as a fraction
   - relative_diff = abs(actual - reference) / abs(reference)
6. `max_rel_diff_pct`: The maximum relative difference, expressed as a percentage
   - relative_diff * 100
7. `abs_sum_a`: Absolute sum of all output values for library A
8. `abs_sum_b`: Absolute sum of all output values for library B
9. `abs_sum_diff`: abs_sum_b - abs_sum_a
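
The column definitions above can also be computed directly with NumPy. The helper below, `closeness_stats`, is a hypothetical name used only for illustration; it is a minimal sketch that assumes the inputs contain no NaNs, and it is not the code this notebook uses to fill the DataFrame.

```python
import numpy as np


def closeness_stats(actual, reference, rtol=1e-7, atol=0.0):
    """Compute the legend's statistics for two arrays of the same shape."""
    actual = np.asarray(actual, dtype=float)
    reference = np.asarray(reference, dtype=float)

    abs_diff = np.abs(actual - reference)
    # Same bound that np.testing.assert_allclose applies element-wise.
    mismatched = abs_diff > atol + rtol * np.abs(reference)

    # Division by a zero reference yields inf or nan; exclude those entries.
    with np.errstate(divide="ignore", invalid="ignore"):
        rel_diff = abs_diff / np.abs(reference)
    finite_rel = rel_diff[np.isfinite(rel_diff)]

    return {
        "mismatching_elements": int(mismatched.sum()),
        "max_abs_diff": float(abs_diff.max()),
        "max_rel_diff": float(finite_rel.max()) if finite_rel.size else 0.0,
        "abs_sum_diff": float(np.abs(actual).sum() - np.abs(reference).sum()),
    }


stats = closeness_stats(actual=[1.0, 2.0, 4.0], reference=[1.0, 2.0, 5.0])
print(stats)
```

Computing the statistics this way avoids depending on the exact wording of the assertion message.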


In [37]:
m_elems_avgs = m_elems_avgs.fillna(0)
m_elems_avgs_final = m_elems_avgs.drop(columns=["mismatching_shapes", 'abs_sum_a', 'abs_sum_b', 'abs_sum_diff'])
m_elems_avgs_final

| lib_a | lib_b | var | avg_type | freq | cdat_freq | mismatching_elements | equal_to_tolerance | max_abs_diff | max_rel_diff | max_rel_diff_pct |
|---|---|---|---|---|---|---|---|---|---|---|
| cdutil | xcdat | ts | timeseries | year | YEAR | 297586 / 4343040 (6.85%) | False | 7.299e-05 | 2.4e-07 | 2.435e-05 |
| cdutil | xcdat | ts | timeseries | season | SEASONALCYCLE | 72759 / 17400000 (0.418%) | False | 3.391e-05 | 1.3e-07 | 1.313e-05 |
| cdutil | xcdat | ts | timeseries | MAM | MAM | 17531 / 4343040 (0.404%) | False | 3.383e-05 | 1.3e-07 | 1.313e-05 |
| cdutil | xcdat | ts | timeseries | JJA | JJA | 18188 / 4343040 (0.419%) | False | 3.383e-05 | 1.3e-07 | 1.313e-05 |
| cdutil | xcdat | ts | timeseries | DJF | DJF | 17812 / 4370880 (0.408%) | False | 3.391e-05 | 1.3e-07 | 1.311e-05 |
| cdutil | xcdat | ts | timeseries | SON | SON | 19228 / 4343040 (0.443%) | False | 3.387e-05 | 1.3e-07 | 1.297e-05 |
| cdutil | xcdat | ts | timeseries | month | ANNUALCYCLE | 0 | True | 0.0 | 0.0 | 0.0 |
| cdutil | xcdat | ts | timeseries | JAN | JAN | 0 | True | 0.0 | 0.0 | 0.0 |
| cdutil | xcdat | ts | timeseries | FEB | FEB | 0 | True | 0.0 | 0.0 | 0.0 |
| cdutil | xcdat | ts | timeseries | MAR | MAR | 0 | True | 0.0 | 0.0 | 0.0 |


Conclusion
- The largest `max_rel_diff_pct` shown for time series averaging is about 2.4e-05% (yearly means), which is very small even though it exceeds `rtol=1e-7`.
- We have high confidence that the algorithm for calculating time series averages is working as intended.

#### Compare Climatology Outputs

In [36]:
m_elems_climos_final = m_elems_climos.drop(columns=["mismatching_shapes", "equal_to_tolerance", 'abs_sum_a', 'abs_sum_b', 'abs_sum_diff'])
m_elems_climos_final

| lib_a | lib_b | var | avg_type | freq | cdat_freq | mismatching_elements | max_abs_diff | max_rel_diff | max_rel_diff_pct |
|---|---|---|---|---|---|---|---|---|---|
| cdutil | xcdat | ts | climatology | season | SEASONALCYCLE | 90382 / 111360 (81.2%) | 0.09688165 | 0.00038946 | 0.038946 |
| cdutil | xcdat | ts | climatology | DJF | DJF | 27778 / 27840 (99.8%) | 0.09688165 | 0.00038946 | 0.038946 |
| cdutil | xcdat | ts | climatology | month | ANNUALCYCLE | 205064 / 334080 (61.4%) | 0.01593771 | 6.587e-05 | 0.00658737 |
| cdutil | xcdat | ts | climatology | FEB | FEB | 27053 / 27840 (97.2%) | 0.01593771 | 6.587e-05 | 0.00658737 |
| cdutil | xcdat | ts | climatology | MAM | MAM | 21032 / 27840 (75.5%) | 0.00043393 | 1.48e-06 | 0.00014795 |
| cdutil | xcdat | ts | climatology | JJA | JJA | 20727 / 27840 (74.5%) | 0.00043048 | 1.43e-06 | 0.00014327 |
| cdutil | xcdat | ts | climatology | SON | SON | 20845 / 27840 (74.9%) | 0.00039709 | 1.36e-06 | 0.00013629 |
| cdutil | xcdat | ts | climatology | JUL | JUL | 16461 / 27840 (59.1%) | 0.00026918 | 9e-07 | 8.967e-05 |
| cdutil | xcdat | ts | climatology | DEC | DEC | 15958 / 27840 (57.3%) | 0.00024747 | 8.3e-07 | 8.289e-05 |
| cdutil | xcdat | ts | climatology | MAY | MAY | 16356 / 27840 (58.8%) | 0.00024277 | 8e-07 | 8.02e-05 |


Conclusion
- DJF and Feb show the highest max relative differences (`max_rel_diff_pct`), but they are still small (less than 0.04% across the variables).
  - This result could be influenced by the CDAT bug Jiwoo found, which relates to Feb not being weighted properly for leap years.
- Everything else is below that threshold, which is good news.
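
To make the leap-year point concrete, the sketch below shows how day-weighting changes a February mean. The values are hypothetical and the example only illustrates what weighting by days in the month means; it does not reproduce the CDAT behavior mentioned above.

```python
import calendar

import numpy as np

# Hypothetical February means for four consecutive years (2000 is a leap year).
years = [2000, 2001, 2002, 2003]
feb_values = np.array([10.0, 12.0, 12.0, 12.0])

# Number of days in each February: 29 for the leap year, 28 otherwise.
days_in_feb = np.array([calendar.monthrange(year, 2)[1] for year in years])

weighted_mean = np.average(feb_values, weights=days_in_feb)
unweighted_mean = feb_values.mean()

print(days_in_feb, weighted_mean, unweighted_mean)
```

The leap-year February carries slightly more weight, so the two means differ by a small amount, and a library that weights February differently would show a similarly small offset.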

#### Compare Departures Outputs

In [39]:
m_elems_departs_final = m_elems_departs.drop(columns=["mismatching_shapes", "equal_to_tolerance", 'abs_sum_a', 'abs_sum_b', 'abs_sum_diff'])
m_elems_departs_final

| lib_a | lib_b | var | avg_type | freq | cdat_freq | mismatching_elements | max_abs_diff | max_rel_diff | max_rel_diff_pct |
|---|---|---|---|---|---|---|---|---|---|
| cdutil | xcdat | ts | departures | month | ANNUALCYCLE | 43825392 / 52116480 (84.1%) | 0.00026918 | 469.00002789 | 46900.002789 |
| cdutil | xcdat | ts | departures | JAN | JAN | 3668184 / 4343040 (84.5%) | 0.0002101 | 469.00002789 | 46900.002789 |
| cdutil | xcdat | ts | departures | AUG | AUG | 3671616 / 4343040 (84.5%) | 0.00023729 | 467.00002789 | 46700.002789 |
| cdutil | xcdat | ts | departures | FEB | FEB | 3644004 / 4343040 (83.9%) | 0.00021515 | 401.54531428 | 40154.531428 |
| cdutil | xcdat | ts | departures | MAY | MAY | 3661164 / 4343040 (84.3%) | 0.00024277 | 235.00001395 | 23500.001395 |
| cdutil | xcdat | ts | departures | JUN | JUN | 3618264 / 4343040 (83.3%) | 0.00020873 | 206.99999225 | 20699.999225 |
| cdutil | xcdat | ts | departures | JUL | JUL | 3678012 / 4343040 (84.7%) | 0.00026918 | 195.99999746 | 19599.999746 |
| cdutil | xcdat | ts | departures | MAR | MAR | 3629496 / 4343040 (83.6%) | 0.00022595 | 157.0000093 | 15700.00093 |
| cdutil | xcdat | ts | departures | DEC | DEC | 3621072 / 4343040 (83.4%) | 0.00024747 | 157.0000093 | 15700.00093 |
| cdutil | xcdat | ts | departures | SEP | SEP | 3656328 / 4343040 (84.2%) | 0.00020365 | 156.99999419 | 15699.999419 |


#### The max relative differences are large, so we need to investigate why.

##### Setup Code

In [22]:
def df_comparison(var_name: str) -> pd.DataFrame:
    """Calculates a variable's absolute and relative differences for departures.

    This function flattens the departure arrays and calculates the absolute
    and relative differences. These values are added to a dictionary, which
    is converted to a DataFrame that is concatenated to a final results
    DataFrame.

    Parameters
    ----------
    var_name : str
        The key of the data variable in ``outputs``.

    Returns
    -------
    pd.DataFrame
    """
    df = pd.DataFrame()
    for freq, cdat_depart in outputs[var_name]["cdat"]["departs"].items():
        # Only compare departures for individual months in annual cycle
        if freq in MONTH_STR_TO_INT.keys():
            # Get the equivalent XCDAT departures
            xcdat_depart = outputs[var_name]["xcdat"]["departs"][freq]

            # Flatten both departures outputs
            cdat_depart_flat = cdat_depart.data.flatten()
            xcdat_depart_flat = xcdat_depart[var_name].values.flatten()

            row = {
                "var": var_name,
                "month": freq,
                "original_value": xcdat_depart[f"{var_name}_original"].values.flatten(),
                "xcdat_depart": xcdat_depart_flat,
                "cdat_depart": cdat_depart_flat,
                "abs_diff": abs(xcdat_depart_flat - cdat_depart_flat),
                "rel_diff": None,
            }

            df_row = pd.DataFrame(row)
            df = pd.concat([df, df_row])

    df["rel_diff"] = df["abs_diff"] / abs(df["cdat_depart"])
    # Replace infinities (caused by division by zero) with 0
    df["rel_diff"] = df["rel_diff"].replace([np.inf, -np.inf], 0)
    return df


In [23]:
def departs_max_rel_diff_by_month(df_departs: pd.DataFrame) -> pd.DataFrame:
    """Gets the exact coordinate points for the max relative diffs by month.

    Parameters
    ----------
    df_departs : pd.DataFrame
        The departures comparison DataFrame.

    Returns
    -------
    pd.DataFrame
    """
    df_max_values = df_departs.loc[
        df_departs.groupby(["month"])["rel_diff"].idxmax()
    ].reset_index(drop=True)

    # Map the month string to its integer to sort by month, then drop it
    df_max_values["month_int"] = df_max_values["month"].apply(
        lambda x: MONTH_STR_TO_INT.get(x, x)
    )
    df_max_vals_by_month = df_max_values.sort_values(
        "rel_diff", ascending=False
    ).drop_duplicates(["month"])
    df_max_vals_by_month = df_max_vals_by_month.sort_values(["month_int"]).drop(
        columns=["month_int"]
    )

    return df_max_vals_by_month
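

The row selection used in `departs_max_rel_diff_by_month` can be sketched on a toy DataFrame. This is a minimal illustration with hypothetical values; it assumes a unique index, whereas the DataFrame built above may contain repeated index labels (which is why the function also drops duplicates).

```python
import pandas as pd

# Toy comparison table with two candidate rows per month (hypothetical values).
toy = pd.DataFrame(
    {
        "month": ["JAN", "JAN", "FEB", "FEB"],
        "rel_diff": [0.1, 469.0, 401.5, 2.0],
    }
)

# idxmax returns, for each month, the index label of the row with the largest
# rel_diff; .loc then pulls those complete rows out of the table.
max_rows = toy.loc[toy.groupby("month")["rel_diff"].idxmax()]

print(max_rows)
```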


##### Get the anomaly values at those exact coordinates for each variable and compare their values.

  - We also need to determine whether relative differences or absolute differences should be used in the case of comparing departures.

##### 1. `ts` departures - max relative diffs by month

In [25]:
df_ts_departs = df_comparison("ts")

In [26]:
df_ts_departs_by_month = departs_max_rel_diff_by_month(df_ts_departs)

In [27]:
df_ts_departs_by_month

|  | var | month | original_value | xcdat_depart | cdat_depart | abs_diff | rel_diff |
|---|---|---|---|---|---|---|---|
| 48 | ts | JAN | 299.79449463 | -9.155e-05 | 2e-07 | 9.175e-05 | 469.00002789 |
| 37 | ts | FEB | 284.89569092 | -3.052e-05 | 8e-08 | 3.059e-05 | 401.54531428 |
| 86 | ts | MAR | 301.62802124 | -6.104e-05 | 3.9e-07 | 6.143e-05 | 157.0000093 |
| 3 | ts | APR | 299.84307861 | 6.104e-05 | 5.9e-07 | 6.045e-05 | 102.99999613 |
| 100 | ts | MAY | 281.6703186 | 9.155e-05 | -3.9e-07 | 9.194e-05 | 235.00001395 |
| 77 | ts | JUN | 229.60081482 | 6.104e-05 | 2.9e-07 | 6.074e-05 | 206.99999225 |
| 66 | ts | JUL | 300.45132446 | 0.00015259 | -7.8e-07 | 0.00015337 | 195.99999746 |
| 19 | ts | AUG | 305.1078186 | 9.155e-05 | 2e-07 | 9.136e-05 | 467.00002789 |
| 140 | ts | SEP | 281.80938721 | -9.155e-05 | 5.9e-07 | 9.214e-05 | 156.99999419 |
| 129 | ts | OCT | 295.52883911 | 3.052e-05 | 2e-07 | 3.032e-05 | 155.0000093 |


Findings for `ts` departures:
- `xcdat_depart` and `cdat_depart` are both close to 0, so relative diffs can be extremely large. Refer to the absolute diffs instead.
- At the points shown, the absolute diff is about 1.5e-4 for July and between about 3.0e-5 and 9.2e-5 for the other months.

##### 2. `psl` departures - max relative differences by month

In [28]:
df_psl_departs = df_comparison("psl")

In [29]:
df_psl_departs_by_month = departs_max_rel_diff_by_month(df_psl_departs)

In [30]:
df_psl_departs_by_month

|  | var | month | original_value | xcdat_depart | cdat_depart | abs_diff | rel_diff |
|---|---|---|---|---|---|---|---|
| 48 | psl | JAN | 101224.5625 | 0.0234375 | -5.008e-05 | 0.02348758 | 469.00002789 |
| 37 | psl | FEB | 100942.390625 | -0.0078125 | -1.241e-05 | 0.00780009 | 628.42835102 |
| 86 | psl | MAR | 101636.90625 | 0.0546875 | 0.00030048 | 0.05438702 | 181.00000203 |
| 3 | psl | APR | 98369.4609375 | 0.0390625 | 0.00015024 | 0.03891226 | 258.99999031 |
| 100 | psl | MAY | 101698.5625 | 0.046875 | -5.008e-05 | 0.04692508 | 937.00005579 |
| 77 | psl | JUN | 102397.7109375 | -0.0390625 | -0.00015024 | 0.03891226 | 258.99999031 |
| 66 | psl | JUL | 100645.6171875 | 0.0234375 | 5.008e-05 | 0.02338742 | 467.00002789 |
| 19 | psl | AUG | 101291.7890625 | 0.015625 | -5.008e-05 | 0.01567508 | 313.0000186 |
| 140 | psl | SEP | 101325.78125 | -0.0234375 | 5.008e-05 | 0.02348758 | 469.00002789 |
| 129 | psl | OCT | 101850.09375 | -0.0234375 | 0.00015024 | 0.02358774 | 156.99999419 |


Findings for `psl` departures:
- `xcdat_depart` and `cdat_depart` are both close to 0, so relative diffs can be extremely large. Refer to the absolute diffs instead.
- At the points shown, the absolute diffs range from about 7.8e-3 to 5.4e-2.

##### 3. `TS` departures - max relative diffs by month

In [31]:
df_TS_departs = df_comparison("TS")

In [32]:
df_TS_departs_by_month = departs_max_rel_diff_by_month(df_TS_departs)

In [50]:
df_TS_departs_by_month

|  | var | month | original_value | xcdat_depart | cdat_depart | abs_diff | rel_diff |
|---|---|---|---|---|---|---|---|
| 48 | TS | JAN | 302.563842773 | 6.1035e-05 | -3.7e-07 | 6.1405e-05 | 166.000005225 |
| 37 | TS | FEB | 301.345550537 | 9.1553e-05 | -1.85e-07 | 9.1738e-05 | 496.000015674 |
| 86 | TS | MAR | 296.959686279 | -0.000213623 | -1.85e-07 | 0.000213438 | 1154.000036573 |
| 3 | TS | APR | 301.120300293 | -0.00012207 | 5.55e-07 | 0.000122625 | 221.000006966 |
| 100 | TS | MAY | 296.640045166 | 9.1553e-05 | 5.55e-07 | 9.0998e-05 | 164.000005225 |
| 77 | TS | JUN | 300.943725586 | -9.1553e-05 | -1.85e-07 | 9.1368e-05 | 494.000015674 |
| 66 | TS | JUL | 299.070648193 | 6.1035e-05 | 1.85e-07 | 6.085e-05 | 329.000010449 |
| 19 | TS | AUG | 294.628540039 | 6.1035e-05 | -1.85e-07 | 6.122e-05 | 331.000010449 |
| 140 | TS | SEP | 301.139129639 | -6.1035e-05 | -1.85e-07 | 6.085e-05 | 329.000010449 |
| 129 | TS | OCT | 287.184967041 | 0.00012207 | 3.7e-07 | 0.0001217 | 329.000010449 |


Findings for `TS` departures:
- `xcdat_depart` and `cdat_depart` are both close to 0, so relative diffs can be extremely large. Refer to the absolute diffs instead.
- At the points shown, the absolute diffs range from about 6.1e-5 to 2.1e-4.

#### Conclusion

1. Absolute diffs should be used as the comparison benchmark for departures because rounding errors can dominate when numbers are close to zero.
    - Calculating departures involves a floating point subtraction operation using the climatology.
    - Any floating point diffs in climatology values can have a significant influence on the floating point closeness of departures.
2. After analyzing the departure values for each data variable, we find that the absolute diffs at the points of max relative difference range from about 3e-5 (`ts`) to about 5.4e-2 (`psl`). Ideally, we'd want the threshold to be 1e-5 or lower for increased accuracy.
3. However, these values are the largest diffs per month, not typical ones.
   - For example, for `TS`, the absolute diff shown for `MAR` is about 2.1e-4. This does not mean that every mismatching value for `MAR` differs by that much. The diffs for mismatching elements can range from the specified absolute tolerance of `1e-5` up to the maximum.
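
One possible contributor to these diffs, which this notebook does not confirm, is single-precision arithmetic: the `xcdat_depart` values in the `ts` and `TS` tables (3.052e-05, 6.104e-05, 9.155e-05, ...) are consistent with multiples of the float32 spacing near 300. The sketch below only demonstrates that spacing on hypothetical values; treating it as the cause is an assumption.

```python
import numpy as np

# A temperature-like value stored in single precision (hypothetical).
value = np.float32(299.79449463)

# Gap between this value and the next representable float32.
ulp = np.spacing(value)

# Subtracting a nearby float32 "climatology" can only produce a whole number
# of those gaps, so tiny departures are quantized in steps of about 3.05e-05.
climatology = np.float32(299.7946)
departure = value - climatology
steps = float(departure) / float(ulp)

print(ulp, departure, steps)
```

If the inputs were held in double precision instead, the spacing near 300 would be many orders of magnitude smaller, and departures would not show this step pattern.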

Finally, let's take a step back and look at the absolute sum of departure values.
  - The differences between the sums are negligible (refer to the `abs_sum_diff` column in the table below). For example, the `ts` annual cycle sums differ by 23.888 out of roughly 6.0e7, or about 4e-7 in relative terms.
  - That means the cumulative effect of the floating point differences is negligible.

In [51]:
# Limit display to 3 decimal places
pd.options.display.float_format = "{:,.3f}".format

m_elems_departs_sum = m_elems_departs[["abs_sum_a", "abs_sum_b", "abs_sum_diff"]]
m_elems_departs_sum

| lib_a | lib_b | var | avg_type | freq | cdat_freq | abs_sum_a | abs_sum_b | abs_sum_diff |
|---|---|---|---|---|---|---|---|---|
| cdutil | xcdat | ts | departures | month | ANNUALCYCLE | 60036155.888 | 60036132.0 | -23.888 |
| cdutil | xcdat | ts | departures | JAN | JAN | 5237970.6 | 5237969.0 | -1.6 |
| cdutil | xcdat | ts | departures | AUG | AUG | 4434669.064 | 4434669.0 | -0.064 |
| cdutil | xcdat | ts | departures | FEB | FEB | 5779039.084 | 5779040.5 | 1.416 |
| cdutil | xcdat | ts | departures | MAY | MAY | 4832960.653 | 4832960.0 | -0.653 |
| cdutil | xcdat | ts | departures | JUN | JUN | 4359488.843 | 4359491.5 | 2.657 |
| cdutil | xcdat | ts | departures | JUL | JUL | 4378089.995 | 4378090.5 | 0.505 |
| cdutil | xcdat | ts | departures | MAR | MAR | 5796252.346 | 5796252.0 | -0.346 |
| cdutil | xcdat | ts | departures | DEC | DEC | 5010442.029 | 5010441.0 | -1.029 |
| cdutil | xcdat | ts | departures | SEP | SEP | 4988027.363 | 4988027.0 | -0.363 |


## Footer Notes
- Feature Specification Document: https://docs.google.com/document/d/1klHh5LLYcmNSopvptSVYmcgtUVLFnLMueRhfCF6tbaU/edit#