# Production comparison: GFDL CM3 Projected

This notebook evaluates the results of comparing newly restacked files against the existing production files.

It is meant to serve as a historical record and is not expected to keep working once the files are moved.

Set up the environment:

In [1]:
import importlib.util
import os
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import xarray as xr
import tqdm


# load the config and luts modules by absolute path
# (raises KeyError if the PROJECT_DIR environment variable is not set)
project_dir = Path(os.environ["PROJECT_DIR"])

def load_module(path):
    """Import a source file by path and return the module object.

    See https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly
    """
    module = path.stem
    spec = importlib.util.spec_from_file_location(module, path)
    module_obj = importlib.util.module_from_spec(spec)
    # register the module so later imports by name resolve to it
    sys.modules[module] = module_obj
    spec.loader.exec_module(module_obj)

    return module_obj

luts = load_module(project_dir.joinpath("restack_20km/luts.py"))
config = load_module(project_dir.joinpath("restack_20km/config.py"))
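
As a self-contained illustration of the by-path import used above, the following sketch writes a throwaway module to a temporary directory and loads it the same way. The file name `demo_luts.py` and its `groups` contents are made up for the example; they are not the project's real lookup tables.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_module(path):
    """Import a Python source file by path and register it in sys.modules."""
    name = path.stem
    spec = importlib.util.spec_from_file_location(name, path)
    module_obj = importlib.util.module_from_spec(spec)
    sys.modules[name] = module_obj
    spec.loader.exec_module(module_obj)
    return module_obj


with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical stand-in for restack_20km/luts.py
    demo_fp = Path(tmp_dir).joinpath("demo_luts.py")
    demo_fp.write_text("groups = {'demo_group': {'fn_str': 'demo'}}\n")
    demo_luts = load_module(demo_fp)

# attributes defined in the source file are available on the module object
demo_fn_str = demo_luts.groups["demo_group"]["fn_str"]
print(demo_fn_str)
```

Registering the module in `sys.modules` before executing it follows the recipe in the Python documentation linked in the docstring above.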

## Hourly data

Load the results from comparing hourly data:

In [2]:
hourly_fp = config.anc_dir.joinpath(
    "production_data_comparisons",
    f"prod_comparison_{luts.groups[config.group]['fn_str']}_hourly.csv"
)
hourly_df = pd.read_csv(hourly_fp)

### Timestamp mismatches

Look at instances where the timestamp comparison failed.

First, check that every mismatch involves the newly rotated wind data created in 2021. The timestamps in those wind files were labeled incorrectly (they should not include leap days), so if only wind variables are affected, we can ignore these time comparison failures.

In [3]:
wind_varnames = ["u", "u10", "ubot", "v", "v10", "vbot"]
time_mismatch_vars = np.unique(hourly_df.query("time_result == False")["varname"])
assert np.all([varname in wind_varnames for varname in time_mismatch_vars])
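
The leap-day labeling issue can be illustrated with a small sketch. It assumes the model output follows a 365-day ("noleap") calendar, which is what the note above implies; the date range is arbitrary and no project files are read.

```python
import pandas as pd

# hourly timestamps spanning a leap day in the standard calendar
standard_times = pd.date_range(
    "2052-02-28 00:00", "2052-03-01 23:00", freq=pd.Timedelta(hours=1)
)
# flag timestamps that fall on 29 February
is_leap_day = (standard_times.month == 2) & (standard_times.day == 29)
# a noleap calendar has no 29 February, so drop those timestamps
noleap_times = standard_times[~is_leap_day]

print(len(standard_times), len(noleap_times))
```

A time axis that was labeled with the standard calendar ends up offset from a noleap axis after the first leap day, which would be enough to fail a timestamp comparison.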

Two comparisons raised errors while accessing production data. Both involve the wind variable `u`, and the production files may be corrupt:

In [4]:
hourly_df[~pd.isnull(hourly_df["error"])]

| Unnamed: 0 | varname | scratch_filename | prod_filename | prod_exists | timestamp | arr_result | time_result | error |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3845 | u | /import/SNAP/wrf_data/project_data/wrf_data/re... | /import/SNAP/wrf_data/project_data/wrf_data/ho... | True | 2052-07-13 12:00:00 | False | | RuntimeError |
| 3865 | u | /import/SNAP/wrf_data/project_data/wrf_data/re... | /import/SNAP/wrf_data/project_data/wrf_data/ho... | True | 2072-08-28 10:00:00 | False | | RuntimeError |
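
For context on the `error` column: one way a comparison routine could end up with `RuntimeError` there is by catching the exception raised while reading a file and storing its class name. The sketch below is only an assumption about that pattern, not the project's actual comparison code; `read_prod_array` and `compare_one` are hypothetical names.

```python
import numpy as np


def read_prod_array(fail):
    """Hypothetical reader that stands in for opening a production file."""
    if fail:
        raise RuntimeError("simulated read failure")
    return np.arange(4.0)


def compare_one(scratch_arr, fail=False):
    """Return a result row; record the exception class name on failure."""
    row = {"arr_result": False, "time_result": None, "error": None}
    try:
        prod_arr = read_prod_array(fail)
        row["arr_result"] = bool(np.array_equal(scratch_arr, prod_arr))
        row["time_result"] = True
    except Exception as exc:
        # keep going with the remaining test cases, but note what went wrong
        row["error"] = type(exc).__name__
    return row


ok_row = compare_one(np.arange(4.0))
bad_row = compare_one(np.arange(4.0), fail=True)
print(ok_row)
print(bad_row)
```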


Assert that all data comparisons passed, as recorded in the `arr_result` column, for the cases where the time check passed:

In [5]:
assert np.all(hourly_df.query("time_result == True")["arr_result"])

AssertionError: 

So a handful of test cases do not match exactly. This many:

In [6]:
len(hourly_df.query("time_result == True & arr_result == False"))

15
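
To see which variables account for such mismatches, the same query can be followed by a count per `varname`. The sketch below uses a tiny made-up frame with the same column names as the comparison CSV; the variable names and values are illustrative only and are not the actual results.

```python
import pandas as pd

# made-up rows mimicking the comparison table's columns
demo_df = pd.DataFrame(
    {
        "varname": ["t2", "t2", "pcpt", "u10", "pcpt"],
        "time_result": [True, True, True, False, True],
        "arr_result": [False, True, False, False, False],
    }
)

# rows where timestamps matched but arrays were not exactly equal
mismatch_df = demo_df.query("time_result == True & arr_result == False")
# number of mismatched test cases per variable
mismatch_counts = mismatch_df["varname"].value_counts()
print(mismatch_counts)
```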

These look like floating point discrepancies, since all of these cases pass the `numpy.isclose` check:

In [7]:
%%time
arr_mismatch_df = hourly_df.query("time_result == True & arr_result == False")
isclose_results = []
for _, row in arr_mismatch_df.iterrows():
    tstamp = row["timestamp"]
    with xr.open_dataset(row["scratch_filename"]) as scratch_ds:
        with xr.open_dataset(row["prod_filename"]) as prod_ds:
            result = np.isclose(
                scratch_ds[row["varname"]].sel(time=tstamp),
                prod_ds[row["varname"]].sel(time=tstamp),
            )
            # reduce to one boolean per test case so results from
            # variables with different shapes can be combined
            isclose_results.append(np.all(result))

assert np.all(isclose_results)

CPU times: user 33.2 s, sys: 1.47 s, total: 34.7 s
Wall time: 47 s


This means the data agree to within the default tolerances of `numpy.isclose` (`rtol=1e-05`, `atol=1e-08`), although it is worth knowing that the exact equality test failed for these cases in the comparison. So the hourly data for the GFDL CM3 projected group pass the comparison with production data, and they are safe to copy to `base_dir` and remove from scratch space.
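
A minimal illustration of how exact equality can fail while `numpy.isclose` passes: the two arrays below differ only by floating point rounding. The values are arbitrary and unrelated to the WRF data, and `numpy.isclose` is called with its defaults.

```python
import numpy as np

# 0.1 + 0.2 is not exactly representable as 0.3 in binary floating point
a = np.array([0.1 + 0.2, 1.0])
b = np.array([0.3, 1.0])

# strict element-wise equality fails on the first element
exact_match = np.array_equal(a, b)
# tolerance-based comparison passes
close_match = bool(np.all(np.isclose(a, b)))
# size of the largest discrepancy
max_abs_diff = float(np.max(np.abs(a - b)))
print(exact_match, close_match, max_abs_diff)
```

Reporting the largest absolute difference alongside the boolean result makes it easier to judge whether a tolerance-based pass is acceptable for a given variable.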

## Daily data

Load the results from comparing daily data:

In [8]:
daily_fp = config.anc_dir.joinpath(
    "production_data_comparisons",
    f"prod_comparison_{luts.groups[config.group]['fn_str']}_daily.csv"
)
daily_df = pd.read_csv(daily_fp)

No time mismatches in the daily data:

In [9]:
daily_df.query("time_result == False")

| Unnamed: 0 | varname | scratch_filename | prod_filename | prod_exists | timestamp | arr_result | time_result | error |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |


And no errors:

In [10]:
daily_df[~pd.isnull(daily_df.error)]

| Unnamed: 0 | varname | scratch_filename | prod_filename | prod_exists | timestamp | arr_result | time_result | error |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |


The only remaining failure mode is a mismatch between production and scratch data, which occurred in this many cases:

In [11]:
len(daily_df.query("arr_result == False"))

104

As with the hourly data, the mismatches appear to be due to rounding/floating point differences. Confirm this using `numpy.isclose`:

In [12]:
%%time
df = daily_df.query("arr_result == False")
for i, row in tqdm.tqdm(df.iterrows(), total=df.shape[0]):
    tstamp = row["timestamp"]
    with xr.open_dataset(row["scratch_filename"]) as scratch_ds:
        with xr.open_dataset(row["prod_filename"]) as prod_ds:
            varname = list(prod_ds.data_vars)[0]
            assert np.all(
                np.isclose(
                    scratch_ds[varname].sel(time=tstamp),
                    prod_ds[varname].sel(time=tstamp)
                )
            )

100%|██████████| 104/104 [01:11<00:00,  1.45it/s]

CPU times: user 36.7 s, sys: 2.66 s, total: 39.3 s
Wall time: 1min 11s
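
One caveat when reusing this check: by default `numpy.isclose` treats `NaN` as unequal to `NaN`, so arrays containing missing values only pass when `equal_nan=True` is set. The sketch below uses small synthetic arrays to show the difference; whether the files compared here contain `NaN` cells is not something this notebook establishes.

```python
import numpy as np

# synthetic arrays standing in for one time slice of scratch and production data
scratch_arr = np.array([[1.0, np.nan], [2.5, 3.0]])
# same values with a perturbation well inside the default tolerances
prod_arr = scratch_arr + 1e-9

# default behavior: the NaN cell compares as not close
default_close = bool(np.all(np.isclose(scratch_arr, prod_arr)))
# treat NaN cells at matching positions as equal
nan_close = bool(np.all(np.isclose(scratch_arr, prod_arr, equal_nan=True)))
# largest difference over the non-NaN cells
max_abs_diff = float(np.nanmax(np.abs(scratch_arr - prod_arr)))
print(default_close, nan_close, max_abs_diff)
```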





And that's it. Apart from the known wind timestamp issue and the two read errors, the daily and hourly GFDL CM3 projected data match the production data to within the `numpy.isclose` tolerances.