# Quality control on indices

This notebook performs some quality control on the data products produced in the CORDEX climate indices data pipeline.

Run the cell below to set up the environment:

In [1]:
import numpy as np
import pandas as pd
import xarray as xr
# project
from config import *

## Annual indices

This section will spot check the dataset of indices derived at an annual time scale, which are saved to the `annual_indices.nc` file.

We will look at some test locations and spot check calculations for all indices that are currently computed.

Define some locations to check:

In [2]:
check_coords = [(61.1309, -146.3499), (71.2906, -156.7886)] # (lat, lon)

Open connection to annual indices dataset:

In [3]:
ds = xr.open_dataset(indices_fp)

Set a couple of model-scenario-year combinations to check:

In [4]:
# arbitrary
check_args = [
    # (model, scenario, year)
    {"model": "NCC-NorESM1-M_SMHI-RCA4", "scenario": "hist", "year": 1995},
    {"model": "MPI-M-MPI-ESM-LR_SMHI-RCA4-SN", "scenario": "rcp85", "year": 2095},
]

And define a function to extract the CORDEX data from `cordex_dir`:

In [5]:
def extract_cordex(args, varname, coords):
    """Extract one year of daily CORDEX values for the grid cell nearest coords (lat, lon)."""
    model, scenario, year = args["model"], args["scenario"], args["year"]
    base_fp = cordex_dir.joinpath(scenario, varname, temp_fn.format(scenario, varname, model))
    with xr.open_dataset(base_fp) as cdx_ds:
        # read the values while the file is still open
        arr = (
            cdx_ds[varname]
            .sel(time=slice(f"{year}-01-01", f"{year}-12-31"))
            .sel(lat=coords[0], lon=coords[1], method="nearest")
            .values
        )

    return arr
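
As a rough illustration of what `method="nearest"` does in the function above, the sketch below picks the closest grid point on a 1-D axis using plain NumPy. The grid spacing and extent are made up for illustration and are not the actual CORDEX grid, and `nearest_index` is a hypothetical helper, not part of xarray or this project.

```python
import numpy as np


def nearest_index(grid, value):
    """Return the index of the grid point closest to value (hypothetical helper)."""
    return int(np.abs(np.asarray(grid) - value).argmin())


# made-up regular 0.5-degree grid, NOT the actual CORDEX grid
lats = np.arange(55.0, 72.5, 0.5)
lons = np.arange(-170.0, -129.5, 0.5)

# look up the first test location as (lat, lon)
nearest_lat = float(lats[nearest_index(lats, 61.1309)])
nearest_lon = float(lons[nearest_index(lons, -146.3499)])
print(nearest_lat, nearest_lon)

# a longitude passed where a latitude is expected still "matches": it snaps to the grid edge
swapped_index = nearest_index(lats, -146.3499)
print(swapped_index)
```

Note that an out-of-range value snaps to the edge of the grid instead of raising an error, so a swapped latitude / longitude pair can pass silently. As far as we know, xarray's nearest selection behaves the same way unless a `tolerance` is given.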

We are now ready to begin checking the values.

Each of the code blocks below includes assertions for each coordinate / `args` combination. If a block runs without raising an exception, we consider that a passing test.
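
Some of the checks compare values with `np.isclose` rather than `==`. The sketch below shows why that can be necessary, assuming a value was stored at 32-bit precision and is compared against the same quantity computed at 64-bit precision. The number is made up, and whether the indices file actually stores 32-bit floats is an assumption.

```python
import numpy as np

# hypothetical value, as if written to a file as a 32-bit float and read back
stored = np.float64(np.float32(12.3))
# the same quantity computed directly at 64-bit precision
calc = 12.3

# exact equality fails because the two roundings of 12.3 differ slightly
exact_match = bool(stored == calc)
# a tolerance-based comparison treats them as equal
close_match = bool(np.isclose(stored, calc))
print(exact_match, close_match)
```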

#### `rx1day` - max 1-day precip

In [6]:
varname = "pr"
index = "rx1day"
for args in check_args:
    for coords in check_coords:
        arr = extract_cordex(args, varname, coords)
        test = ds[index].sel(args).sel(lat=coords[0], lon=coords[1], method="nearest")
        calc = round(np.max(arr) * 86400, 1)
        assert calc == test
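
The factor of 86400 in the check above converts a precipitation flux in kg m-2 s-1 to mm per day, since 1 kg of water spread over 1 m2 is 1 mm deep and there are 86400 seconds in a day. A minimal worked example with made-up flux values (not taken from the CORDEX data):

```python
import numpy as np

SECONDS_PER_DAY = 86400

# hypothetical daily precipitation flux values in kg m-2 s-1 (made up)
pr_flux = np.array([0.0, 1.5e-4, 3.2e-4, 2.0e-5])

# flux * seconds per day gives the daily total in mm
pr_mm_day = pr_flux * SECONDS_PER_DAY

# max 1-day precip, rounded to one decimal place as in the check above
rx1day = round(float(pr_mm_day.max()), 1)
print(rx1day)
```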

#### `hsd` - heavy snow days

In [7]:
varname = "prsn"
index = "hsd"
for args in check_args:
    for coords in check_coords:
        arr = extract_cordex(args, varname, coords)
        test = ds[index].sel(args).sel(lat=coords[0], lon=coords[1], method="nearest")
        calc = ((arr * 8640) > 10).sum()
        assert calc == test

#### `hd` - hot day

In [8]:
varname = "tasmax"
index = "hd"
for args in check_args:
    for coords in check_coords:
        arr = extract_cordex(args, varname, coords)
        test = ds[index].sel(args).sel(lat=coords[0], lon=coords[1], method="nearest")
        calc = np.round(np.sort(arr)[-6] - 273.15, 1)
        # dang floats being weird..
        assert np.isclose(calc, test)

#### `cd` - cold day

In [9]:
varname = "tasmin"
index = "cd"
for args in check_args:
    for coords in check_coords:
        arr = extract_cordex(args, varname, coords)
        test = ds[index].sel(args).sel(lat=coords[0], lon=coords[1], method="nearest")
        calc = np.round(np.sort(arr)[5] - 273.15, 1)
        # dang floats being weird..
        assert np.isclose(calc, test)
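
Both this check and the `hd` check index into a sorted array: `[-6]` picks the sixth-highest value and `[5]` picks the sixth-lowest. A small sketch with made-up values makes the off-by-one behavior of the two indices easy to confirm:

```python
import numpy as np

# made-up "daily" values 1.0 through 10.0 in shuffled order
rng = np.random.default_rng(0)
values = rng.permutation(np.arange(1.0, 11.0))

sorted_values = np.sort(values)  # ascending order: 1.0, 2.0, ..., 10.0

# negative indices count from the end, so [-6] is the sixth-highest value
sixth_highest = float(sorted_values[-6])
# positive indices are zero-based, so [5] is the sixth-lowest value
sixth_lowest = float(sorted_values[5])
print(sixth_highest, sixth_lowest)
```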

## Location extractions

This section will validate the decadal and era-based extractions from the annual indices dataset, made at select locations and saved in Excel and CSV formats.

For simplicity, we will rely on the same model-scenario combinations as above, and check the aggregated values (min, mean, and max) for all available summary periods and for both types of temporal summary. Run this monster loop:

In [10]:
# list of the index variable names available from config
index_list = [name for names in idx_varname_lu.values() for name in names]
for loc in ["Kaktovik", "Ketchikan"]:
    lat, lon = locations[loc]
    # each location has its own sheet in the summary workbooks
    era_df = pd.read_excel(idx_era_summary_fp, sheet_name=loc)
    decade_df = pd.read_excel(idx_decade_summary_fp, sheet_name=loc)
    for args in check_args:
        model, scenario = args["model"], args["scenario"]
        for index in index_list:
            query_str = f"model == '{model}' & scenario == '{scenario}' & idx_var == '{index}'"
            # check both types of temporal summary: eras, then decades
            for period_col, summary_df in [("era", era_df), ("decade", decade_df)]:
                temp_df = summary_df.query(query_str)
                for period in np.unique(temp_df[period_col]):
                    # the year coordinate is numeric, so slice with integers
                    start_year, end_year = (int(year) for year in period.split("-"))
                    arr = ds[index].sel(lat=lat, lon=lon, method="nearest").sel(
                        model=model, scenario=scenario, year=slice(start_year, end_year)
                    )
                    # the period label is a string, so it must be quoted in the query
                    row = temp_df.query(f"{period_col} == '{period}'")
                    # compare scalars with a tolerance rather than exact float equality
                    assert np.isclose(arr.mean().item(), row["mean"].item())
                    assert np.isclose(arr.min().item(), row["min"].item())
                    assert np.isclose(arr.max().item(), row["max"].item())

And if there are no errors, we've verified all summaries for two model-scenario combinations, for two locations.
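
One detail of this kind of lookup is easy to get wrong: a period label such as `1981-2010` is a string, so it needs quotes inside a `DataFrame.query` expression (unquoted, it reads as the subtraction 1981 - 2010). The matching row then has to be reduced to a scalar before it can be used in an `assert`. A minimal sketch with a made-up summary table (the column names mirror the ones used above, the numbers are invented):

```python
import pandas as pd

# hypothetical summary table with invented numbers
df = pd.DataFrame(
    {
        "era": ["1981-2010", "2011-2040"],
        "mean": [12.5, 14.0],
    }
)

era = "1981-2010"

# quote the label so it is compared as a string
row = df.query(f"era == '{era}'")

# reduce the single matching row to a plain Python float
mean_val = row["mean"].item()
print(len(row), mean_val)
```

`Series.item()` raises if the selection does not contain exactly one value, which doubles as a check that each period appears only once per model, scenario, and index.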

That concludes the validation of data products. Close the connection to the annual indices dataset:

In [11]:
ds.close()