### Quality Control
This notebook examines the output netCDF files and runs basic quality control checks to confirm that values in the netCDFs match their sources in the CSV files.

In [1]:
import xarray as xr
import pandas as pd
import numpy as np
import os
import random
from pathlib import Path
from luts import *

# source file directory
stats_dir = Path('/beegfs/CMIP6/jdpaul3/hydroviz_data/stats')
# output directory
nc_dir = Path('/beegfs/CMIP6/jdpaul3/hydroviz_data/nc')

# get stats list
stats = list(stat_vars_dict.keys())


In [2]:
# first read stats CSVs and do some filtering of results ...
# this is similar to the code in filter_files() in functions.py, but runs independently here
files = stats_dir.glob('*.csv')
seg_files = []
hru_files = []
for f in files:
    # skip files that are excluded from the netCDFs
    if any(s in f.name for s in ("Maurer", "diff", "1952_2005")):
        continue
    if "hru" in f.name:
        hru_files.append(f)
    else:
        seg_files.append(f)

In [3]:
# load netCDFs
seg = xr.open_dataset(os.path.join(nc_dir, "seg.nc"))
hru = xr.open_dataset(os.path.join(nc_dir, "hru.nc"))

In [4]:
# check out the datasets - structure should be identical except for length of geom_ids
seg

In [5]:
hru

In [6]:
# create a testing function that parses the file name and uses the coords to check the contents of the netCDF
# contents for every statistic are compared to data in the CSV

def test_nc(ds, test_csvs):
    for csv in test_csvs:
        df = pd.read_csv(csv, usecols=stats)
        # the CSVs mark missing values with -99999; convert to NaN for comparison
        df = df.replace(-99999, np.nan)

        # parse the coordinate values from the file name
        parts = csv.name.split('_')
        try:
            lc, model, scenario = parts[0], parts[1], parts[2]
            era = "_".join([parts[5], parts[6].split(".")[0]])
        except IndexError:
            print(f"Error parsing file: {csv.name}")
            continue

        sel_dict = {"lc": lc, "model": model, "scenario": scenario, "era": era_lookup[era]}
        # compare every statistic (no early exit after the first one)
        for stat in stats:
            values = ds[stat].sel(sel_dict).load().values
            assert np.allclose(values, df[stat].values, equal_nan=True), f"Error in dataset: values for {stat} do not match values in {csv.name}"
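
The comparison in `test_nc` depends on two details: the `-99999` marker is converted to NaN, and `np.allclose` is called with `equal_nan=True`. The sketch below illustrates both on small made-up arrays. The values are hypothetical stand-ins for one CSV column and the matching netCDF slice, not real data.

```python
import numpy as np

SENTINEL = -99999  # missing-value marker replaced in test_nc

# hypothetical values standing in for one CSV column and the matching netCDF slice
csv_values = np.array([1003.7, SENTINEL, 5853.0])
nc_values = np.array([1003.7, np.nan, 5853.0])

# swap the sentinel for NaN so both arrays mark missing data the same way
csv_clean = np.where(csv_values == SENTINEL, np.nan, csv_values)

# NaN never compares equal to NaN by default, so aligned gaps count as
# mismatches unless equal_nan=True is passed
matches_default = np.allclose(nc_values, csv_clean)
matches_nan_aware = np.allclose(nc_values, csv_clean, equal_nan=True)

print(matches_default, matches_nan_aware)
```

Without `equal_nan=True`, any file containing a missing value would fail the assertion even when the netCDF and CSV agree.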


In [7]:
# for each geometry, pick 20 random files and run the tests
# a successful test will produce no output

seg_test_files = random.sample(seg_files, 20)
test_nc(seg, seg_test_files)

hru_test_files = random.sample(hru_files, 20)
test_nc(hru, hru_test_files)
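
Because `random.sample` draws a different set of files on every run, a failure can be hard to reproduce. It also raises `ValueError` when asked for more items than the list holds. A minimal sketch of a seeded, size-safe variant follows. The helper name `sample_files` and the demo file names are hypothetical and not part of this notebook.

```python
import random

def sample_files(files, k, seed=None):
    """Return up to k files chosen without replacement (hypothetical helper)."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    files = sorted(files)      # fixed order so a given seed always picks the same files
    return rng.sample(files, min(k, len(files)))

# made-up file names, for illustration only
demo_files = [f"file_{i}.csv" for i in range(50)]

first = sample_files(demo_files, 20, seed=0)
second = sample_files(demo_files, 20, seed=0)
few = sample_files(demo_files[:5], 20, seed=0)  # fewer files than requested

print(len(first), first == second, len(few))
```

Recording the seed alongside a failed assertion would make it possible to rerun the same subset of files.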

### Important notes about eras / scenarios
These netCDFs have a structure that is partially empty (i.e., includes NaNs).

This is because each model has its own modeled historical data rather than a shared historical baseline, and the projected scenarios contain no modeled historical data. As a result, certain dimension combinations are empty. For instance, you could query for the `historical` era of the `rcp60` scenario, but that would return only NaN.

Also, two of the models, `ACCESS1-0` and `BNU-ESM`, do not have data for scenarios `rcp26` or `rcp60`, so querying those scenarios for either model returns only NaN values.

See some examples below:

In [8]:
# a query that returns actual values: an emissions scenario and future era
seg["dh3"].sel(
    {"lc": "dynamic", "model": "GFDL-ESM2M", "scenario": "rcp60", "era": "mid_century"}
    ).load().values

array([1003.7, 1897.2, 5853. , ...,    nan,    nan,    nan])

In [9]:
# a query that returns NaN: a historical scenario and future era (combo doesn't really make sense!)
seg["dh3"].sel(
    {"lc": "dynamic", "model": "GFDL-ESM2M", "scenario": "historical", "era": "mid_century"}
    ).load().values

array([nan, nan, nan, ..., nan, nan, nan])

In [10]:
# a query that returns NaN: an emissions scenario and historical era (combo doesn't really make sense!)
seg["dh3"].sel(
    {"lc": "dynamic", "model": "GFDL-ESM2M", "scenario": "rcp60", "era": "historical"}
    ).load().values

array([nan, nan, nan, ..., nan, nan, nan])

In [11]:
# a query that returns actual values: a historical scenario and historical era
seg["dh3"].sel(
    {"lc": "dynamic", "model": "GFDL-ESM2M", "scenario": "historical", "era": "historical"}
    ).load().values

array([1120.6, 2158.1, 6431.8, ...,    nan,    nan,    nan])
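
The pattern in the four queries above can also be found programmatically: a scenario/era combination is empty when every geometry value is NaN. The sketch below shows the idea on a small plain NumPy array. The array shape, dimension order, and numbers are made up for illustration and do not come from `seg.nc`.

```python
import numpy as np

# toy stand-in for one statistic with dimensions (scenario, era, geometry);
# the labels mirror the queries above, the numbers are invented
scenarios = ["historical", "rcp60"]
eras = ["historical", "mid_century"]
data = np.full((len(scenarios), len(eras), 4), np.nan)
data[0, 0] = [1.0, 2.0, 3.0, np.nan]  # historical scenario, historical era
data[1, 1] = [1.5, 2.5, 3.5, np.nan]  # rcp60 scenario, mid_century era

# a combination is empty when all values along the geometry axis are NaN
empty = np.isnan(data).all(axis=-1)

empty_combos = [
    (scenarios[i], eras[j])
    for i in range(len(scenarios))
    for j in range(len(eras))
    if empty[i, j]
]
print(empty_combos)
# [('historical', 'mid_century'), ('rcp60', 'historical')]
```

On the actual datasets, a comparable check could likely be written with xarray's `isnull()` followed by `all(dim=...)` over the geometry dimension, whose exact name is not shown in this notebook.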