# Temporal Averaging Runtime Comparison (CDAT vs. XCDAT)

This notebook compares the sequential runtimes of temporal averaging between CDAT and XCDAT.

Performance of XCDAT with parallelization is not measured because CDAT does not support parallelization, and parallel runtimes depend on many factors (e.g., chunk size, data size, computational resources).


### How to Use This Notebook

1. Create and activate the conda development environment
   - `conda env create -f conda-env/test_dev.yml`
   - `conda activate xcdat_test_dev`
2. Clone the `xcdat` repo
   - `git clone https://github.com/XCDAT/xcdat.git`
3. Install the `feature/47-climatology` branch build of `xcdat`
   - `cd xcdat`
   - `git checkout feature/47-climatology`
   - `pip install .`
4. Attach the `xcdat_test_dev` env kernel to this notebook
5. Run cells
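
Before running the cells, it can help to confirm that the attached kernel actually sees the packages the timing code imports. The following is a minimal sketch, not part of the original notebook; the helper name `find_missing_modules` is hypothetical, and the package list is taken from the import statements used in the cells below.

```python
import importlib.util


def find_missing_modules(names):
    """Return the subset of top-level module names that cannot be found in this kernel."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages imported by the setup and timing code in this notebook.
required = ["xcdat", "xarray", "cdms2", "cdutil", "numpy", "pandas"]
missing = find_missing_modules(required)
print("Missing packages:", missing or "none")
```

If anything is reported as missing, the kernel is probably not the `xcdat_test_dev` environment, or the `pip install .` step was run in a different environment.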


### Time Frequencies Table
This table compares the time frequencies that can be used for grouping.

| Output Type 	| Averaging Type 	| XCDAT Frequency 	| CDAT Frequency 	| Groups 	|
|---	|---	|---	|---	|---	|
| time series 	| Yearly means 	| “year” 	| YEAR() 	| (year,) 	|
|  	| Monthly means 	| “month” 	| JAN(), FEB(), MAR(), …, DEC()  	| (year, month) 	|
|  	| Seasonal means 	| “season” 	| DJF(), MAM(), JJA(), SON() 	| (year, season) 	|
|  	| Custom seasonal means 	| custom season, e.g. “JFM,AMJ,JAS,OND” 	| cdutil.times.Seasons() 	| (year, custom_season) 	|
|  	| Daily means 	| “day” 	| N/A (unsupported) 	| (year, month, day) 	|
|  	| Hourly means 	| “hour” 	| N/A (unsupported) 	| (year, month, day, hour) 	|
|  	| N hourly means 	|  Nhour<br> (e.g. 6hour, 3hour, …) 	| N/A (unsupported) 	| (year, month, day, Nhour) 	|
| Climatology 	| Annual cycle climatology 	| “month” 	| ANNUALCYCLE.climatology() 	| (month,) 	|
|  	| Daily cycle climatology 	| “day” 	| N/A (unsupported) 	| (month, day) 	|
|  	| Seasonal cycle climatology 	| “season” 	| SEASONALCYCLE.climatology() 	| (season,) 	|
|  	| Custom seasonal cycle climatology 	| custom season 	| cdutil.times.Seasons() 	| (season,) 	|
|  	| Diurnal cycle climatology 	| *-diurnalNNN 	| N/A (unsupported) 	| Append TOD where TOD is diurnal time index corresponding to NNN 	|
| Departures 	| Annual cycle departures 	| “month” 	| ANNUALCYCLE.departures() 	| (month,) 	|
|  	| Daily cycle departures 	| “day” 	| N/A (unsupported) 	| (month, day) 	|
|  	| Seasonal cycle departures 	| “season” 	| SEASONALCYCLE.departures() 	| (season,) 	|
|  	| Custom seasonal cycle departures 	| custom season 	| cdutil.times.Seasons() 	| (season,) 	|
|  	| Diurnal cycle departures 	| *-diurnalNNN 	| N/A (unsupported) 	| Append TOD where TOD is diurnal time index corresponding to NNN 	|
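
The "Groups" column is what separates a time series average from a climatology at the same frequency: a monthly time series groups on `(year, month)`, while an annual cycle climatology groups on `(month,)` alone. The sketch below illustrates this with the standard library only. The sample values and the helper `group_mean` are hypothetical, and the mean is unweighted, so it shows the grouping idea rather than what either library computes internally.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical monthly samples: (timestamp, value) pairs spanning two years.
samples = [
    (datetime(2000, 1, 16), 1.0),
    (datetime(2000, 2, 15), 2.0),
    (datetime(2001, 1, 16), 3.0),
    (datetime(2001, 2, 15), 4.0),
]


def group_mean(samples, key):
    """Average the values that share the same group key (unweighted)."""
    groups = defaultdict(list)
    for time, value in samples:
        groups[key(time)].append(value)
    return {k: mean(v) for k, v in sorted(groups.items())}


# Time series of monthly means: groups are (year, month), so years stay separate.
monthly_series = group_mean(samples, lambda t: (t.year, t.month))

# Annual cycle climatology: groups are (month,), so all Januaries are pooled.
annual_cycle = group_mean(samples, lambda t: (t.month,))

print(monthly_series)
print(annual_cycle)
```

Here the time series keeps four groups (one per year and month), while the climatology collapses them to two (one per calendar month).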

### Methodology
This notebook loops through a list of netCDF files. For each file's data variable, `timeit.repeat` times the CDAT and XCDAT temporal averaging APIs (time series averages and climatologies) at a subset of the time frequencies listed above. Each measurement collects 5 samples, and each sample times a single function call.

The minimum and maximum runtimes of the 5 samples are recorded in a DataFrame.
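
The sampling scheme can be sketched on its own as follows. The `setup` and `stmt` strings here are stand-ins chosen for illustration; the notebook passes CDAT and XCDAT calls instead.

```python
import timeit

# Stand-in workload; the notebook times CDAT/XCDAT calls instead.
setup = "data = list(range(10_000))"
stmt = "sum(data) / len(data)"

# repeat=5 gives five samples; number=1 runs the statement once per sample.
runtimes = timeit.repeat(setup=setup, stmt=stmt, repeat=5, number=1)

min_time = round(min(runtimes), 6)
max_time = round(max(runtimes), 6)
print(f"min={min_time}s max={max_time}s")
```

Note that `timeit` executes the `setup` string once before each sample and excludes it from the measured time, so only the work done inside `stmt` is counted.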

### Sample Data
Files from the `/CMIP5_demo_data` directory of PCMDI's PMP repo are used.

- https://github.com/XCDAT/xcdat_test#demo-input-preparation
- https://github.com/PCMDI/pcmdi_metrics/blob/main/doc/jupyter/Demo/Demo_0_download_data.ipynb

## Setup Code

In [1]:
from typing import List, Tuple, Dict
import timeit

import numpy as np
import pandas as pd


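def compare_runtimes(var_to_file: Dict[str, str]) -> pd.DataFrame:
    """Time the XCDAT and CDAT APIs for each variable/file pair.

    Relies on the global `files_dir`, which is defined in the
    "Dataset Metadata Information" cell and must be set before calling.
    """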
    df = pd.DataFrame(
        columns=[
            "lib",
            "file",
            "var",
            "avg_type",
            "freq",
            "min_time",
            "max_time",
        ]
    )
    for var, file_name in var_to_file.items():
        file_path = f"{files_dir}/{file_name}"
        df = measure_xcdat(df, var, file_path, file_name)
        df = measure_cdutil(df, var, file_path, file_name)

    return df




def measure_cdutil(
    df: pd.DataFrame, var: str, file_path: str, file: str
) -> pd.DataFrame:
    setup = (
        "import cdms2\n"
        "import cdutil\n"
        f"file_path = '{file_path}'\n"
        "cdat_ds = cdms2.open(file_path)\n"
        f"t_var = cdat_ds('{var}')\n"
        # Uncomment this for a time slice and unit adjustment
        # f"t_var = cdat_ds('{var}', time=slice(0,48)) - 273.15\n"
    )
    runs = {
        "climatology": [
            {"freq": "season", "stmt": "cdutil.SEASONALCYCLE.climatology(t_var)"},
            {"freq": "month", "stmt": "cdutil.ANNUALCYCLE.climatology(t_var)"},
            {"freq": "day", "stmt": None},
        ],
        "timeseries_avg": [
            {"freq": "year", "stmt": "cdutil.YEAR(t_var)"},
            {
                "freq": "season",
                "stmt": "cdutil.SEASONALCYCLE(t_var)",
            },
            {
                "freq": "month",
                "stmt": "cdutil.ANNUALCYCLE(t_var)",
            },
            {"freq": "jan", "stmt": "cdutil.JAN(t_var)"},
            {"freq": "day", "stmt": None},
        ],
    }

    df = get_runtimes(df, "cdutil", file, var, runs, setup)
    return df


def measure_xcdat(
    df: pd.DataFrame, var: str, file_path: str, file: str
) -> pd.DataFrame:
    setup = (
        "import xarray as xr\n"
        "import xcdat\n"
        f"file_path = '{file_path}'\n"
        f"xcdat_ds = xcdat.open_dataset('{file_path}')\n"
        # Uncomment these lines for a time slice and unit adjustment
        # "xcdat_ds = xcdat_ds.isel(time=slice(0, 48))\n"
        # f"xcdat_ds['{var}'] = xcdat_ds['{var}'] - 273.15\n"
    )
    runs = {
        "climatology": [
            {
                "freq": "season",
                "stmt": (
                    f"xcdat_ds.temporal.temporal_avg('{var}', 'climatology', 'season', "
                    "center_times=True, "
                    "season_config={'dec_mode': 'DJF', 'drop_incomplete_djf': False})"
                    f"['{var}']"
                ),
            },
            {
                "freq": "month",
                "stmt": (
                    f"xcdat_ds.temporal.temporal_avg('{var}', 'climatology', 'month', "
                    "center_times=True)"
                    f"['{var}']"
                ),
            },
            {
                "freq": "day",
                "stmt": (
                    f"xcdat_ds.temporal.temporal_avg('{var}', 'climatology', 'day', "
                    "center_times=True)"
                    f"['{var}']"
                ),
            },
        ],
        "timeseries_avg": [
            {
                "freq": "year",
                "stmt": (
                    f"xcdat_ds.temporal.temporal_avg('{var}', 'timeseries', 'year', "
                    "center_times=True)"
                    f"['{var}']"
                ),
            },
            {
                "freq": "season",
                "stmt": (
                    f"xcdat_ds.temporal.temporal_avg('{var}', 'timeseries', 'season', "
                    "center_times=True, "
                    "season_config={'dec_mode': 'DJF', 'drop_incomplete_djf': False})"
                    f"['{var}']"
                ),
            },
            {
                "freq": "month",
                "stmt": (
                    f"xcdat_ds.temporal.temporal_avg('{var}', 'timeseries', 'month', "
                    "center_times=True)"
                    f"['{var}']"
                ),
            },
            {
                "freq": "day",
                "stmt": (
                    f"xcdat_ds.temporal.temporal_avg('{var}', 'timeseries', 'day', "
                    "center_times=True)"
                    f"['{var}']"
                ),
            },
            {
                "freq": "hour",
                "stmt": (
                    f"xcdat_ds.temporal.temporal_avg('{var}', 'timeseries', 'hour', "
                    "center_times=True)"
                    f"['{var}']"
                ),
            },
        ],
    }
    df = get_runtimes(df, "xcdat", file, var, runs, setup)

    return df

def get_runtimes(
    df: pd.DataFrame,
    lib: str,
    file: str,
    var: str,
    runs: Dict[str, List[Dict[str, str]]],
    setup: str,
) -> pd.DataFrame:

    rows = []
    for avg_type, freqs in runs.items():
        for freq in freqs:
            stmt = freq["stmt"]
            row = {
                "lib": lib,
                "file": file,
                "var": var,
                "avg_type": avg_type,
                "freq": freq["freq"],
                "min_time": None,
                "max_time": None,
            }
            if stmt is not None:
                row["min_time"], row["max_time"] = get_runtime(setup, stmt)

            rows.append(row)
    df_rows = pd.DataFrame(rows)
    df = pd.concat([df, df_rows])

    return df


def get_runtime(
    setup: str, stmt: str, repeat: int = 5, number: int = 1
) -> Tuple[float, float]:
    runtimes: List[float] = timeit.repeat(
        setup=setup,
        stmt=stmt,
        repeat=repeat,
        number=number,
    )
    # Use distinct names so the built-in `min` and `max` are not shadowed.
    min_time = np.around(np.min(runtimes), decimals=6)
    max_time = np.around(np.max(runtimes), decimals=6)
    return min_time, max_time


## Dataset Metadata Information

In [26]:
import xcdat

files_dir = "./input/demo_data/CMIP5_demo_data"
vars_to_files = {
    "psl": "psl_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc",
    "ts": "ts_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc",
}

rows = []
for var, file_name in vars_to_files.items():
    ds = xcdat.open_dataset(f"{files_dir}/{file_name}")
    row = {'file': file_name, 'var': var, 'shape': ds[var].shape, 'size': ds[var].size}

    rows.append(row)
df_metadata = pd.DataFrame(rows)

In [27]:
df_metadata

|  | file | var | shape | size |
|---|---|---|---|---|
| 0 | psl_Amon_ACCESS1-0_historical_r1i1p1_185001-20... | psl | (1872, 145, 192) | 52116480 |
| 1 | ts_Amon_ACCESS1-0_historical_r1i1p1_185001-200... | ts | (1872, 145, 192) | 52116480 |


## Compare Runtimes

### Raw Results DataFrame

In [2]:
df = compare_runtimes(vars_to_files)

In [3]:
df

|  | lib | file | var | avg_type | freq | min_time | max_time |
|---|---|---|---|---|---|---|---|
| 0 | xcdat | psl_Amon_ACCESS1-0_historical_r1i1p1_185001-20... | psl | climatology | season | 2.532664 | 3.416886 |
| 1 | xcdat | psl_Amon_ACCESS1-0_historical_r1i1p1_185001-20... | psl | climatology | month | 9.518738 | 11.28766 |
| 2 | xcdat | psl_Amon_ACCESS1-0_historical_r1i1p1_185001-20... | psl | climatology | day | 8.907434 | 10.317531 |
| 3 | xcdat | psl_Amon_ACCESS1-0_historical_r1i1p1_185001-20... | psl | timeseries_avg | year | 2.50493 | 3.31126 |
| 4 | xcdat | psl_Amon_ACCESS1-0_historical_r1i1p1_185001-20... | psl | timeseries_avg | season | 4.630906 | 6.920708 |
| 5 | xcdat | psl_Amon_ACCESS1-0_historical_r1i1p1_185001-20... | psl | timeseries_avg | month | 10.683318 | 11.37106 |
| 6 | xcdat | psl_Amon_ACCESS1-0_historical_r1i1p1_185001-20... | psl | timeseries_avg | day | 11.177002 | 11.401181 |
| 7 | xcdat | psl_Amon_ACCESS1-0_historical_r1i1p1_185001-20... | psl | timeseries_avg | hour | 11.337089 | 11.461201 |
| 0 | cdutil | psl_Amon_ACCESS1-0_historical_r1i1p1_185001-20... | psl | climatology | season | 5.262722 | 5.294317 |
| 1 | cdutil | psl_Amon_ACCESS1-0_historical_r1i1p1_185001-20... | psl | climatology | month | 8.824945 | 8.873011 |


### Final Processed DataFrame

In [13]:
df2 = df.copy()

# Map CDAT frequencies
cdat_freq = {
    "season": "SEASONALCYCLE",
    "year": "YEAR",
    "month": "ANNUALCYCLE",
}
df2["cdat_freq"] = df2["freq"].apply(lambda x: cdat_freq.get(x, x))

# Set index and drop columns
df2 = df2.set_index(["var", "avg_type", "freq", "cdat_freq"]).drop(columns=["file"])

# Round the floating points
df2 = df2.round({"min_time": 4, "max_time": 4})

# Turn into a pivot table for easier analysis
df_final = pd.pivot_table(
    df2,
    values=["min_time", "max_time", "lib"],
    columns=["lib"],
    index=["avg_type", "var", "freq", "cdat_freq"],
)
df_final

| avg_type | var | freq | cdat_freq | max_time (cdutil) | max_time (xcdat) | min_time (cdutil) | min_time (xcdat) |
|---|---|---|---|---|---|---|---|
| climatology | psl | day | day |  | 10.3175 |  | 8.9074 |
| climatology | psl | month | ANNUALCYCLE | 8.873 | 11.2877 | 8.8249 | 9.5187 |
| climatology | psl | season | SEASONALCYCLE | 5.2943 | 3.4169 | 5.2627 | 2.5327 |
| climatology | ts | day | day |  | 9.1053 |  | 8.9215 |
| climatology | ts | month | ANNUALCYCLE | 8.7716 | 9.4722 | 8.6621 | 8.8031 |
| climatology | ts | season | SEASONALCYCLE | 5.2865 | 2.6014 | 5.1639 | 2.5726 |
| timeseries_avg | psl | day | day |  | 11.4012 |  | 11.177 |
| timeseries_avg | psl | hour | hour |  | 11.4612 |  | 11.3371 |
| timeseries_avg | psl | jan | jan | 0.8279 |  | 0.7971 |  |
| timeseries_avg | psl | month | ANNUALCYCLE | 10.0525 | 11.3711 | 9.907 | 10.6833 |


## Conclusion

Time series average

  - CDAT is faster for `month`/`ANNUALCYCLE` by about 1-3 seconds
  - XCDAT is faster for `season`/`SEASONALCYCLE` and `year`/`YEAR` by about 2-3 seconds

Climatology

  - CDAT is faster for `month`/`ANNUALCYCLE` by about 0.1-2.4 seconds
  - XCDAT is faster for `season`/`SEASONALCYCLE` by about 1.9-2.7 seconds
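
The climatology gaps can be recomputed directly from the min/max runtimes in the final table. The sketch below copies the `month` and `season` climatology rows by hand into a plain dictionary; the helper `runtime_gap` is hypothetical and only subtracts the CDAT runtime from the XCDAT runtime.

```python
# (var, freq) -> {lib: (min_time, max_time)}, copied from the climatology rows above.
climatology = {
    ("psl", "month"): {"cdutil": (8.8249, 8.8730), "xcdat": (9.5187, 11.2877)},
    ("psl", "season"): {"cdutil": (5.2627, 5.2943), "xcdat": (2.5327, 3.4169)},
    ("ts", "month"): {"cdutil": (8.6621, 8.7716), "xcdat": (8.8031, 9.4722)},
    ("ts", "season"): {"cdutil": (5.1639, 5.2865), "xcdat": (2.5726, 2.6014)},
}


def runtime_gap(times):
    """Return XCDAT minus CDAT for (min_time, max_time); positive means CDAT is faster."""
    (cd_min, cd_max), (xc_min, xc_max) = times["cdutil"], times["xcdat"]
    return round(xc_min - cd_min, 4), round(xc_max - cd_max, 4)


gaps = {key: runtime_gap(times) for key, times in climatology.items()}
for (var, freq), (min_gap, max_gap) in gaps.items():
    print(f"{var:>3} {freq:<6} min gap: {min_gap:+.4f}s  max gap: {max_gap:+.4f}s")
```

A positive gap means CDAT finished sooner and a negative gap means XCDAT did. With only 5 samples per measurement, these gaps are best read as rough indications rather than precise benchmarks.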

