# Quality control on restacked 20km outputs

Here we will run some quality checks on the newly restacked hourly and daily WRF outputs.

Set up the environment:

In [8]:
import time
from multiprocessing import Pool
import numpy as np
import pandas as pd
import tqdm
# project imports
from config import *
import luts
import restack_20km as main

### 1 - Ensure that the restacked files have expected data

Here we will use the `restack_20km.validate_restacked_file` function to ensure that ALL files:

* have the correct data, by checking a random time slice against the original WRF output file (done inside the function)
* have the correct metadata
* open correctly, which is confirmed implicitly when the first two checks complete

...for all relevant variables.

**Note** - accumulation and wind variables are not expected to match the raw data.
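The random-slice check described above can be illustrated with a small, pure-Python sketch. It is not the code inside `validate_restacked_file`: the function name `check_random_slice`, its arguments, the toy data layout (lists of floats keyed by timestamp strings), and the use of exact equality are all assumptions made for illustration. Seeding a local generator from the file name is likewise just one way to keep the chosen slice reproducible per file.

```python
import random
import zlib


def check_random_slice(restacked, raw_by_time, times, seed_key):
    """Compare one randomly chosen time slice of restacked data with raw data.

    restacked   : list of slices (each a list of floats), one per time step
    raw_by_time : dict mapping a timestamp string to the raw slice
    times       : list of timestamp strings aligned with `restacked`
    seed_key    : string (e.g. a file name) used to seed a local generator
    """
    # Hypothetical choice: a local, deterministically seeded generator, so the
    # selected slice does not depend on any global random state.
    rng = random.Random(zlib.crc32(seed_key.encode()))
    idx = rng.randrange(len(times))
    timestamp = times[idx]
    # Exact equality is an assumption; a tolerance may be more appropriate.
    match = restacked[idx] == raw_by_time[timestamp]
    return {"timestamp": timestamp, "match": match}


# Toy data standing in for one restacked file and its raw counterparts
times = ["t0", "t1", "t2"]
restacked = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
raw_by_time = {t: list(s) for t, s in zip(times, restacked)}

result = check_random_slice(restacked, raw_by_time, times, "example_file.nc")
```

Variables that are transformed during restacking, such as the accumulation and wind variables mentioned in the note, would need to be excluded from a direct comparison like this one.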

Get all of the new file paths and set up for `Pool`-ing:

In [4]:
group_fn_str = luts.groups[group]["fn_str"]
all_wrf_fps = list(restack_scratch_dir.glob(f"*/*{group_fn_str}*.nc"))
args = [(fp, raw_scratch_dir) for fp in all_wrf_fps]
# set random seed
np.random.seed(907)

and run `validate_restacked_file` on them:

In [6]:
tic = time.perf_counter()

with Pool(20) as pool:
    new_rows = [
        result for result in tqdm.tqdm(
            pool.imap_unordered(main.validate_restacked_file, args), total=len(args))
    ]

print(f"Time elapsed: {round((time.perf_counter() - tic) / 60)}m")

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1656/1656 [06:51<00:00,  4.02it/s]

Time elapsed: 7m
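
The mapping pattern used in this cell can be sketched in isolation as follows. The worker `validate`, the file names, and the directory are hypothetical stand-ins, and `ThreadPool` is used here only because it exposes the same `imap_unordered` interface as `Pool` while running without any process start-up concerns. The sketch shows two things: each item of `args` reaches the worker as a single tuple, and results come back in completion order rather than input order.

```python
import time
from multiprocessing.pool import ThreadPool


def validate(args):
    """Toy stand-in for a validation worker; receives ONE tuple argument."""
    name, base_dir = args  # each element of the iterable arrives as a single object
    return {"file": name, "base_dir": base_dir, "match": True}


# Hypothetical inputs, shaped like the (file path, raw directory) tuples above
args = [(f"file_{i}.nc", "/hypothetical/raw_dir") for i in range(8)]

tic = time.perf_counter()
with ThreadPool(4) as pool:
    rows = list(pool.imap_unordered(validate, args))
elapsed = time.perf_counter() - tic

# Results arrive in completion order, so sort before relying on row order.
rows.sort(key=lambda row: row["file"])
```

Because of the unordered return, the row order of any table built from such results need not follow the order of the input file list.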

Put results into a dataframe:

In [9]:
results_df = pd.DataFrame(new_rows)
results_df

| | model | scenario | variable | timestamp | match | meta |
|---|---|---|---|---|---|---|
| 0 | NCAR-CCSM4 | historical | snow | 1999-12-03_16 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 1 | NCAR-CCSM4 | historical | snow | 1978-12-03_16 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 2 | NCAR-CCSM4 | historical | snow | 1989-12-03_16 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 3 | NCAR-CCSM4 | historical | snow | 2002-12-03_16 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 4 | NCAR-CCSM4 | historical | snow | 1979-12-03_16 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| ... | ... | ... | ... | ... | ... | ... |
| 1651 | NCAR-CCSM4 | historical | slp | 1973-07-17_19 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 1652 | NCAR-CCSM4 | historical | slp | 1985-09-17_01 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 1653 | NCAR-CCSM4 | historical | slp | 1984-07-05_08 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 1654 | NCAR-CCSM4 | historical | slp | 1993-11-06_11 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |


Were there any mismatches between slices from the restacked data and the expected raw values? Assert that there were none:

In [15]:
# .all() is False as soon as any row failed to match
assert results_df["match"].all()

No mismatches were detected.
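
A note on the form of this kind of check, as a small self-contained sketch with made-up rows. Python's built-in `any` returns a plain `bool`, and applying bitwise `~` to a `bool` yields an integer (`~True` is `-2`, `~False` is `-1`). Both are truthy, so an assertion written as `assert ~any(...)` cannot fail. Collecting the failing rows explicitly avoids this and also shows which files need attention. The helper name `find_mismatches` and the row contents are hypothetical.

```python
def find_mismatches(rows):
    """Return the rows whose "match" flag is falsy."""
    return [row for row in rows if not row["match"]]


# Made-up rows for illustration only (not results from the run above)
rows = [
    {"variable": "var_a", "timestamp": "t0", "match": True},
    {"variable": "var_b", "timestamp": "t1", "match": False},
]

mismatches = find_mismatches(rows)

# Truthiness of the list is an unambiguous pass/fail signal, unlike ~any(...),
# where both possible results (-1 and -2) are truthy.
passed = not mismatches
```

With a dataframe, the equivalent would be something like `results_df[~results_df["match"]]`, assuming the `match` column has boolean dtype. There `~` is applied to a boolean Series and negates element-wise, which is safe.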

### 2 - Ensure that restacked files have correct metadata

Have a look at the metadata from the first file:

In [16]:
results_df.iloc[0]["meta"]

{'time': {'time zone': 'UTC'},
 'lat': {'standard_name': 'latitude',
  'title': 'Latitude',
  'units': 'degrees_north',
  'valid_max': 90.0,
  'valid_min': -90.0},
 'lon': {'standard_name': 'longitude',
  'title': 'Longitude',
  'units': 'degrees_east',
  'valid_max': 180.0,
  'valid_min': -180.0},
 'xc': {'standard_name': 'projection_x_coordinate', 'units': 'm'},
 'xc_shape': (262,),
 'yc': {'standard_name': 'projection_y_coordinate', 'units': 'm'},
 'yc_shape': (262,),
 'crs': {'crs_wkt': 'PROJCRS["unknown",BASEGEOGCRS["unknown",DATUM["unknown",ELLIPSOID["unknown",6370000,0,LENGTHUNIT["metre",1,ID["EPSG",9001]]]],PRIMEM["Greenwich",0,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8901]]],CONVERSION["unknown",METHOD["Polar Stereographic (variant B)",ID["EPSG",9829]],PARAMETER["Latitude of standard parallel",64,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8832]],PARAMETER["Longitude of origin",-152,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8833]],PARAMETER["False easting",

Verify that the metadata is identical across all files by asserting that every row's metadata equals that of the first file:

In [17]:
first_meta = results_df.iloc[0]["meta"]
assert all(meta == first_meta for meta in results_df["meta"])

The absence of an `AssertionError` here indicates that the metadata is identical across all files.
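
If the metadata assertion ever did fail, it would not say which files differ or how. The following is a minimal sketch of a more informative comparison, assuming each metadata entry is a plain dictionary like the one printed above. The helper names `differing_keys` and `metadata_report` and the sample dictionaries are hypothetical.

```python
_MISSING = object()  # sentinel so a missing key differs from a key set to None


def differing_keys(reference, other):
    """Return the sorted top-level keys whose values differ between two dicts."""
    keys = set(reference) | set(other)
    return sorted(
        k for k in keys if reference.get(k, _MISSING) != other.get(k, _MISSING)
    )


def metadata_report(metas):
    """Map row position -> differing keys, for rows that differ from the first."""
    reference = metas[0]
    report = {}
    for i, meta in enumerate(metas[1:], start=1):
        diff = differing_keys(reference, meta)
        if diff:
            report[i] = diff
    return report


# Made-up metadata for illustration only
metas = [
    {"time": {"time zone": "UTC"}, "xc_shape": (262,)},
    {"time": {"time zone": "UTC"}, "xc_shape": (262,)},
    {"time": {"time zone": "UTC"}, "xc_shape": (261,)},  # deliberately different
]

report = metadata_report(metas)
```

Applied to the results above, this could be called as `metadata_report(list(results_df["meta"]))`, and an empty dictionary would correspond to the assertion passing.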