# Compare new restacked data with production data

This notebook is for comparing the newest restacked WRF data, created as an upgrade replacement for the existing dataset, with the existing production dataset.
It creates a table of comparison results that can be summarized and explored, to help avoid erroneously replacing good production data with bad data.

This notebook is NOT expected to maintain functionality, as the existing production dataset could be permanently deleted/archived after ensuring the new data match it where expected. Rather, tables generated using it for each WRF group will serve as historical records for the Great Restacking of 2022, in case they may be useful at a later date.

## Strategy

This notebook will check consistency by comparing a random slice in time between new outputs and existing production data for every new restacked and resampled file created for a particular WRF group.

Given space constraints, it will not be feasible to run a comparison for all WRF groups at once. So this notebook should be executed after the completion of restacking, resampling, and quality checking each of the five WRF groups. Simply run this notebook as a final step before replacing the old production data with new data.

Import packages and set up directories:

In [1]:
from multiprocessing.pool import Pool
from pathlib import Path
import tqdm
import numpy as np
import pandas as pd
import xarray as xr
import luts
from config import *
# for a type of warning that can occur when comparing times between files
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)


# these paths should be constant for any SNAPer running this pipeline
# assumes all folders are created in restack_20km.ipynb
base_dir = Path("/import/SNAP/wrf_data/project_data/wrf_data")
# final output directory for hourly data
restack_prod_dir = base_dir.joinpath("hourly_fix")
# final output directory for daily data
resample_prod_dir = base_dir.joinpath("daily")

Define a function that compares a randomly selected time slice between a scratch file and its production counterpart:

In [2]:
def compare_scratch(args):
    """Run a comparison between a scratch file and a production file.
    Test the data values of a single, randomly chosen time slice for equivalence
    between a restacked file on scratch space and the corresponding "production" file.

    Args:
        args (tuple): two path_like items, (restack_scratch_fp, restack_prod_fp):
            restack_scratch_fp: path to the new restacked file on scratch space
            restack_prod_fp: path to the production file to compare with

    Returns:
        dict with keys varname, scratch_filename, prod_exists, timestamp,
        arr_result, and time_result
    """
    restack_scratch_fp, restack_prod_fp = args
    varname = restack_scratch_fp.parent.name
    with xr.open_dataset(restack_scratch_fp) as check_ds:
        idx = np.random.randint(check_ds["time"].values.shape[0])
        # save time from each because those should match
        check_time = check_ds["time"].values[idx]
        check_arr = check_ds[varname].sel(time=check_time).values
    del check_ds
    
    try:
        with xr.open_dataset(restack_prod_fp) as prod_ds:
            # using this dataset's time values in case times don't match (has happened at least once)
            prod_time = prod_ds["time"].values[idx]
            prod_arr = prod_ds[varname].sel(time=prod_time).values

            # checks to see whether all data values in single time slice match between new file and production
            arr_result = np.all(prod_arr == check_arr)
            prod_exists = True
        del prod_ds
            
    except FileNotFoundError:
        # if "production" version does not exist, make a note of it
        arr_result = False
        prod_exists = False
        prod_time = np.datetime64("2022-01-01T00:00:00")
    
    # check to see whether time values match (they should)
    time_result = prod_time == check_time
    
    # wrf_time_str = str(check_time.astype("datetime64[h]")).replace("T", "_")
    model, scenario = restack_scratch_fp.name.split("_")[-3:-1]
    result = {
        "varname": varname,
        "scratch_filename": restack_scratch_fp,
        "prod_exists": prod_exists,
        "timestamp": check_time,
        "arr_result": arr_result,
        "time_result": time_result,
    }
    
    return result
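
One caveat with `np.all(prod_arr == check_arr)`: NaN compares unequal to itself, so a slice with missing values in identical positions in both files would be reported as a mismatch. Whether the WRF variables here contain NaNs is not stated in this notebook, so the following is only a sketch of a NaN-aware alternative. The helper name `slices_match` and the toy arrays are hypothetical, and `np.array_equal(..., equal_nan=True)` assumes NumPy 1.19 or later.

```python
import numpy as np


def slices_match(new_arr, prod_arr, equal_nan=True):
    """Return True if two time slices have the same shape and the same values.

    Hypothetical helper, not part of the original pipeline. With equal_nan=True,
    NaNs in matching positions are treated as equal.
    """
    return bool(np.array_equal(new_arr, prod_arr, equal_nan=equal_nan))


# toy 2x2 "time slices" with a NaN in the same position in both
new_slice = np.array([[1.0, np.nan], [3.0, 4.0]])
prod_slice = new_slice.copy()

# elementwise equality, as used in compare_scratch: NaN != NaN
strict = bool(np.all(prod_slice == new_slice))
# NaN-aware comparison
nan_aware = slices_match(new_slice, prod_slice)

print(strict, nan_aware)
```

`np.array_equal` also returns `False` when the two shapes differ, rather than relying on broadcasting behaviour.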

### Hourly restacked data

Iterate over all new files of the WRF group and generate args for `Pool`-ing:

In [3]:
args = []
for varname in [v.lower() for v in luts.varnames]:
    fn_str = luts.groups[group]["fn_str"]
    years = luts.groups[group]["years"]
    for year in years:
        fn = f"{varname}_hourly_wrf_{fn_str}_{year}.nc"
        restack_scratch_fp = hourly_dir.joinpath(varname, fn)
        restack_prod_fp = restack_prod_dir.joinpath(varname, fn)
        args.append((restack_scratch_fp, restack_prod_fp))

Run the comparison in parallel:

In [None]:
np.random.seed(99709)
with Pool(20) as pool:
    new_rows = [
        result for result in tqdm.tqdm(
            pool.imap_unordered(compare_scratch, args), total=len(args))
    ]

 69%|███████████████████████████████████████████████████████████████████████▎                                | 1135/1656 [55:49<17:04,  1.97s/it]
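
A note on reproducibility: calling `np.random.seed` once in the parent process may not make the per-file time indices reproducible. Worker processes typically start from a copy of the parent's global random state (or a fresh one, depending on the start method), and `imap_unordered` does not fix which worker handles which file. If reproducible indices matter, one option is to derive an independent seed per file and draw the index from a local generator. The sketch below assumes that approach. The names `make_task_seeds` and `pick_time_index` are hypothetical, and the task count and time length are placeholder values.

```python
import numpy as np


def make_task_seeds(base_seed, n_tasks):
    """Spawn one independent child SeedSequence per task from a single base seed."""
    return np.random.SeedSequence(base_seed).spawn(n_tasks)


def pick_time_index(seed_seq, n_times):
    """Draw a time index in [0, n_times) from a generator local to this task."""
    rng = np.random.default_rng(seed_seq)
    return int(rng.integers(n_times))


# placeholder values: 4 files, 8760 hourly time steps each
seeds = make_task_seeds(99709, 4)
indices = [pick_time_index(seed, 8760) for seed in seeds]
print(indices)
```

To use something like this with the function above, each `args` tuple would need to carry its seed as a third item and `compare_scratch` would need to unpack it, which is a change to the original design.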

Store results in a table:

In [None]:
hourly_results_df = pd.DataFrame(new_rows)
hourly_results_fp = anc_dir.joinpath(
    "production_data_comparisons",
    f"prod_comparison_{luts.groups[group]['fn_str']}_hourly.csv"
)
hourly_results_df.to_csv(hourly_results_fp, index=False)

### Daily resampled data

Again, iterate over all possible variable names/year combinations to get the new and production daily files:

In [7]:
args = []
for varname in luts.resample_varnames:
    fn_str = luts.groups[group]["fn_str"]
    years = luts.groups[group]["years"]
    for year in years:
        fn = f"{varname}_daily_wrf_{fn_str}_{year}.nc"
        resample_scratch_fp = daily_dir.joinpath(varname, fn)
        resample_prod_fp = resample_prod_dir.joinpath(varname, fn)
        args.append((resample_scratch_fp, resample_prod_fp))

And run the comparison in parallel:

In [8]:
np.random.seed(99709)
with Pool(20) as pool:
    new_rows = [
        result for result in tqdm.tqdm(
            pool.imap_unordered(compare_scratch, args), total=len(args))
    ]

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 216/216 [01:15<00:00,  2.86it/s]

Store the daily results in a table:


In [9]:
daily_results_df = pd.DataFrame(new_rows)
daily_results_fp = anc_dir.joinpath(
    "production_data_comparisons",
    f"prod_comparison_{luts.groups[group]['fn_str']}_daily.csv"
)
daily_results_df.to_csv(daily_results_fp, index=False)

The hourly and daily results may now be investigated to 
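
One possible way to summarize a results table is sketched below. It assumes the column names written by `compare_scratch` (`varname`, `scratch_filename`, `prod_exists`, `arr_result`, `time_result`). The function name `summarize_comparison` is hypothetical, and the rows are made-up stand-ins for a real CSV, not actual comparison results.

```python
import pandas as pd


def summarize_comparison(df):
    """Summarize a comparison table (hypothetical helper).

    Returns a dict of counts plus the fraction of matching slices per variable,
    computed only over files that have a production counterpart.
    """
    flags = df[["prod_exists", "arr_result", "time_result"]].astype(bool)
    compared = df[flags["prod_exists"]]
    arr_ok = flags.loc[compared.index, "arr_result"]
    time_ok = flags.loc[compared.index, "time_result"]
    return {
        "n_files": len(df),
        "n_missing_prod": int((~flags["prod_exists"]).sum()),
        "n_arr_mismatch": int((~arr_ok).sum()),
        "n_time_mismatch": int((~time_ok).sum()),
        "frac_matching_by_var": arr_ok.groupby(compared["varname"]).mean().to_dict(),
    }


# made-up rows for illustration only; a real table would come from pd.read_csv
example_df = pd.DataFrame(
    {
        "varname": ["t2", "t2", "pcpt", "pcpt"],
        "scratch_filename": ["a.nc", "b.nc", "c.nc", "d.nc"],
        "prod_exists": [True, True, False, True],
        "arr_result": [True, False, False, True],
        "time_result": [True, True, False, False],
    }
)
summary = summarize_comparison(example_df)
print(summary)
```

Rows where `prod_exists` is `False` are counted separately, because their `arr_result` and `time_result` values are placeholders rather than real comparisons.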