# Validate new restacked data with production data

This notebook ensures that the newest restacked WRF data, created as an upgraded replacement for the existing dataset, match the existing production dataset where expected (some production data contain errors and should not match the new data).

This notebook is NOT expected to remain functional, as the existing production dataset will be overwritten once the new data are confirmed to match it where expected. Rather, it will serve as a historical record for the Great Restacking of 2022.

## Strategy

This notebook will check consistency by comparing some number of random slices in time between new outputs and existing production data for every combination of model, scenario, year, and variable.
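
The "some number of random slices" idea could be sketched as below. This is only an illustration: `pick_random_times` is a hypothetical helper that does not exist elsewhere in this notebook, and the hourly time axis is a stand-in for the `time` coordinate that would normally be read from a restacked file. Seeding the generator is an assumption made here so that a failed check can be repeated on the same slices.

```python
import numpy as np


def pick_random_times(times, n_slices, seed=None):
    """Pick n_slices distinct time steps at random (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # never ask for more slices than there are time steps
    n = min(n_slices, len(times))
    idx = rng.choice(len(times), size=n, replace=False)
    # return the selected time steps in chronological order
    return times[np.sort(idx)]


# stand-in for one year of hourly time steps (assumed, not read from a file)
times = np.arange("2005-01-01T00", "2006-01-01T00", dtype="datetime64[h]")
check_times = pick_random_times(times, n_slices=5, seed=42)
print(check_times)
```

Each selected time step could then be passed to `.sel(time=...)` on both the new and the production datasets.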

Given space constraints, it is not feasible to run a comparison for all model/scenario combinations at once. It makes the most sense to update this notebook after restacking each of the five WRF groups: once a WRF group has been restacked, run the validation in the corresponding section below before overwriting the production data.

**Make sure to commit changes for the correct WRF group.**

## Validation

### Setup

Execute the cells in this section to set up the environment before attempting to validate any of the WRF groups.

Set up directories:

In [4]:
from pathlib import Path
import luts


# these paths should be constant for any SNAPer running this pipeline
# assumes all folders are created in restack_20km.ipynb
base_dir = Path("/import/SNAP/wrf_data/project_data/wrf_data")
# final output directory for data
restack_prod_dir = base_dir.joinpath("hourly_fix")
anc_dir = base_dir.joinpath("ancillary")

scratch_dir = Path("/center1/DYNDOWN/kmredilla/wrf_data")
# where initially restacked data are stored on scratch_space
restack_scratch_dir = scratch_dir.joinpath("restacked")

Define a function that will compare a randomly selected slice in time:

In [None]:
import numpy as np
import xarray as xr


def validate_slice(restack_scratch_fp, restack_prod_dir):
    """Compares the values of restacked data with those in
    an existing production data file for one random time slice

    Args:
        restack_scratch_fp (pathlib.PosixPath): path to file containing
            restacked data to check that is on scratch space
        restack_prod_dir (pathlib.PosixPath): path to the directory containing
            the full production dataset

    Returns:
        dict with keys varname, timestamp, and match
    """
    # the variable name is taken from the name of the file's parent directory
    varname = restack_scratch_fp.parent.name
    with xr.open_dataset(restack_scratch_fp) as ds:
        # choose one random time step to compare
        idx = np.random.randint(ds.time.values.shape[0])
        check_time = ds.time.values[idx]
        check_arr = ds[varname].sel(time=check_time).values

    # production file has the same name, under the production directory
    prod_fp = restack_prod_dir.joinpath(varname, restack_scratch_fp.name)
    with xr.open_dataset(prod_fp) as ds:
        prod_arr = ds[varname].sel(time=check_time).values

    check = np.all(prod_arr == check_arr)

    # format the timestamp as YYYY-MM-DD_HH
    wrf_time_str = str(check_time.astype("datetime64[h]")).replace("T", "_")
    result = {
        "varname": varname,
        "timestamp": wrf_time_str,
        "match": check
    }

    return result
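
One caveat on the elementwise `==` comparison used above: NaN never compares equal to itself, so two arrays that are identical but contain missing values would be reported as a mismatch. Whether the WRF outputs contain NaN is not stated in this notebook, so treat that as an assumption to check. A minimal sketch with made-up arrays:

```python
import numpy as np

# two identical arrays that both contain a missing value (made-up data)
a = np.array([[1.0, np.nan], [3.0, 4.0]])
b = a.copy()

# elementwise equality treats NaN != NaN, so this reports a mismatch
strict_match = bool(np.all(a == b))

# array_equal with equal_nan=True treats NaNs in the same position as equal
nan_aware_match = bool(np.array_equal(a, b, equal_nan=True))

print(strict_match, nan_aware_match)
```

If missing values do occur, `np.array_equal(..., equal_nan=True)` (NumPy 1.19 or later) may be the safer comparison.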

In [2]:
import numpy as np
import xarray as xr


def validate(restack_scratch_fp, restack_prod_dir):
    """Compares all values of restacked data with those in
    an existing production data file

    Args:
        restack_scratch_fp (pathlib.PosixPath): path to file containing
            restacked data to check that is on scratch space
        restack_prod_dir (pathlib.PosixPath): path to the directory containing
            the full production dataset

    Returns:
        dict with keys varname and match
    """
    # the variable name is taken from the name of the file's parent directory
    varname = restack_scratch_fp.parent.name
    prod_fp = restack_prod_dir.joinpath(varname, restack_scratch_fp.name)

    # open both files and compare the full arrays
    with xr.open_dataset(restack_scratch_fp) as new_ds, xr.open_dataset(prod_fp) as prod_ds:
        check_arr = new_ds[varname].values
        prod_arr = prod_ds[varname].values
        check = np.all(prod_arr == check_arr)

    result = {
        "varname": varname,
        "match": check
    }

    return result

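Define a function to iterate over all files of a WRF group and run the comparisons:

In [None]:
import time


def validate_group(model, scenario):
    """Runs validate_slice on every variable / year file
    for the given model and scenario
    """
    tic = time.perf_counter()

    results = []
    group = luts.group_fn_lu[f"{model}_{scenario}"]
    years = luts.groups[group]["years"]
    for varname in luts.varnames:
        for year in years:
            fn = f"{varname}_hourly_wrf_{model}_{scenario}_{year}.nc"
            # validate_slice reads the variable name from the parent directory
            restack_scratch_fp = restack_scratch_dir.joinpath(varname, fn)
            result_di = validate_slice(restack_scratch_fp, restack_prod_dir)
            result_di.update({"model": model, "scenario": scenario})
            results.append(result_di)

    print(f"Validation completed in {round(time.perf_counter() - tic)}s")

    return results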
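
Because some production files are known to contain errors, the mismatches are the interesting part of the output. A minimal sketch of how a list of result dicts could be summarized follows; `summarize` is a hypothetical helper, and the variable names, model, scenario, and timestamps in the sample data are invented for illustration only.

```python
def summarize(results):
    """Return the number of checks and the list of mismatches (hypothetical helper)."""
    mismatches = [r for r in results if not r["match"]]
    return len(results), mismatches


# invented results with the same keys the validation functions produce
results = [
    {"varname": "t2", "timestamp": "2005-03-14_06", "match": True,
     "model": "ccsm", "scenario": "hist"},
    {"varname": "pcpt", "timestamp": "2005-07-02_18", "match": False,
     "model": "ccsm", "scenario": "hist"},
]

n_checked, mismatches = summarize(results)
print(f"{len(mismatches)} of {n_checked} checks did not match")
for r in mismatches:
    print(f"  {r['model']} {r['scenario']} {r['varname']} at {r['timestamp']}")
```

Each mismatch can then be compared against the list of known production errors before the production data are overwritten.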

### ERA-Interim

In [None]:
wrf_group = "erain_hist"

### NCAR-CCSM Historical

In [8]:
group = "ccsm_hist"
years = luts.groups[group]["years"]

In [None]:
 = 
fn = f"{varname}_hourly_wrf_{luts.groups[group]["fn_s}_{year}.nc"
restack_scratch_fp = restack_scratch_dir.joinpath(fn)