# Quality control on restacked 20km outputs

Here we will run some quality checks on the newly restacked hourly and daily WRF outputs.

Set up the environment:

In [1]:
import time
from multiprocessing import Pool
import numpy as np
import pandas as pd
import tqdm
# project imports
from config import *
import luts
import restack_20km as main

### 1 - Ensure that the restacked files have expected data

Here we will use the `restack_20km.validate_restacked_file` function to ensure that **all** files:

* have the correct data, by checking a random time slice against the original WRF output file (done inside the function)
* have the correct metadata
* open correctly, which follows from completing the first two checks

...for all relevant variables.

**Note** - accumulation and wind variables are not expected to match the raw data.
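
To make the random-slice check concrete, here is a minimal sketch of the idea using plain NumPy arrays. It is not the implementation of `validate_restacked_file`, which lives in `restack_20km` and is not shown here. The helper name `random_slice_matches`, the array shapes, and the in-memory stand-in for the raw files are all assumptions made for illustration.

```python
import numpy as np


def random_slice_matches(restacked, raw_by_time, rng):
    """Compare one randomly chosen time slice of a restacked array with its raw slice.

    restacked: array of shape (time, y, x)
    raw_by_time: sequence of (y, x) arrays, one per time step (stand-in for raw files)
    """
    t = int(rng.integers(restacked.shape[0]))
    # equal_nan=True so that missing values in the same cells still count as a match
    match = bool(np.array_equal(restacked[t], raw_by_time[t], equal_nan=True))
    return t, match


rng = np.random.default_rng(907)
raw = [rng.random((4, 5)) for _ in range(24)]  # made-up "raw" hourly slices
restacked = np.stack(raw)                      # made-up "restacked" array

t, ok = random_slice_matches(restacked, raw, rng)

# a single altered cell in the sampled slice is enough to break the match
corrupted = restacked.copy()
corrupted[t, 0, 0] += 1.0
still_ok = bool(np.array_equal(corrupted[t], raw[t], equal_nan=True))
```

Because only one slice per file is sampled, a check like this can miss errors confined to other time steps. It is a spot check, not a full comparison.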

#### 1.1 - Hourly data

This section will perform this validation for the hourly data.

Get all of the new file paths and set up for `Pool`-ing:

In [7]:
group_fn_str = luts.groups[group]["fn_str"]
all_wrf_fps = list(hourly_dir.glob(f"*/*{group_fn_str}*.nc"))
args = [(fp, raw_scratch_dir) for fp in all_wrf_fps]
# set random seed
np.random.seed(907)

and run `validate_restacked_file` on them:

In [8]:
with Pool(20) as pool:
    new_rows = [
        result for result in tqdm.tqdm(
            pool.imap_unordered(main.validate_restacked_file, args), total=len(args))
    ]

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1656/1656 [20:26<00:00,  1.35it/s]

Time elapsed: 20m
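
The cell above relies on `imap_unordered` passing each `(fp, raw_scratch_dir)` tuple to the worker as a single argument. The sketch below shows that calling pattern with a hypothetical worker, `validate_one`, and made-up file names. It uses `multiprocessing.pool.ThreadPool`, which exposes the same `imap_unordered` interface as `Pool`, only so that the example runs without process start-up concerns.

```python
from multiprocessing.pool import ThreadPool


def validate_one(args):
    # hypothetical worker: each tuple arrives as ONE argument and is unpacked here
    fp, ref_dir = args
    return {"file": fp, "ref_dir": ref_dir, "match": True}


# made-up inputs with the same (path, directory) shape as `args` above
demo_args = [(f"file_{i}.nc", "raw_dir") for i in range(8)]

with ThreadPool(4) as pool:
    # consume every result before the pool is torn down at the end of the block
    rows = list(pool.imap_unordered(validate_one, demo_args))

# results arrive in completion order, so sort them if order matters
rows_sorted = sorted(rows, key=lambda r: r["file"])
```

Since the results come back in completion order, each returned row needs to carry enough identifying information (here, the file name) to be traced back to its input.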





Put results into a dataframe:

In [10]:
results_df = pd.DataFrame(new_rows)
results_df

| | model | scenario | variable | timestamp | match | meta |
|---|---|---|---|---|---|---|
| 0 | NCAR-CCSM4 | historical | qvapor | 1990-12-03_16 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 1 | NCAR-CCSM4 | historical | qvapor | 2001-12-03_16 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 2 | NCAR-CCSM4 | historical | qvapor | 1984-12-03_16 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 3 | NCAR-CCSM4 | historical | qvapor | 1970-12-04_16 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 4 | NCAR-CCSM4 | historical | qvapor | 1989-12-03_16 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| ... | ... | ... | ... | ... | ... | ... |
| 1651 | NCAR-CCSM4 | historical | swupbc | 1995-08-29_10 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 1652 | NCAR-CCSM4 | historical | swupbc | 1994-07-29_13 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 1653 | NCAR-CCSM4 | historical | swupbc | 2002-07-18_23 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 1654 | NCAR-CCSM4 | historical | swupbc | 2001-07-17_19 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |


Were there any mismatches between slices from restacked data and expected raw values? Assert that there were none:

In [11]:
# every row must report a match
assert results_df["match"].all()
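
If this assertion ever fails, the next question is which files are responsible. The sketch below filters a small, made-up frame (`demo_df`, with a subset of the columns in `results_df`) down to its failing rows. The values are invented for illustration and do not reflect any actual validation result.

```python
import pandas as pd

# made-up results with one deliberate mismatch
demo_df = pd.DataFrame(
    {
        "variable": ["qvapor", "qvapor", "swupbc"],
        "timestamp": ["1990-12-03_16", "2001-12-03_16", "1995-08-29_10"],
        "match": [True, False, True],
    }
)

# on a boolean Series, ~ is an element-wise NOT, so this keeps only failing rows
mismatches = demo_df[~demo_df["match"]]
n_bad = len(mismatches)
```

The same filter applied to `results_df` would give the model, scenario, variable, and timestamp of every slice that needs a closer look.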

#### 1.2 - Daily data

Now perform the same validation with the daily data. Each daily file that was produced alongside the hourly outputs is checked by re-summarizing (resampling) the restacked hourly outputs and comparing the two.

In [27]:
group_fn_str = luts.groups[group]["fn_str"]
daily_wrf_fps = list(daily_dir.glob(f"*/*{group_fn_str}*.nc"))
args = [(fp, hourly_dir) for fp in daily_wrf_fps]
# set random seed
np.random.seed(907)

In [33]:
with Pool(20) as pool:
    new_rows = [
        result for result in tqdm.tqdm(
            pool.imap_unordered(main.validate_resampled_file, args), total=len(args))
    ]

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 216/216 [01:49<00:00,  1.97it/s]

Time elapsed: 2m
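
The idea behind the resampling check can be sketched with NumPy alone: group the hourly time axis into days, aggregate, and compare one randomly chosen day against a value recomputed directly from the hourly data. This is an illustration under assumptions, not the implementation of `validate_resampled_file`. In particular, which aggregation (mean, sum, min, max) applies to which variable is not stated here, and the array shapes are made up.

```python
import numpy as np

rng = np.random.default_rng(907)
n_days = 3
hourly = rng.random((n_days * 24, 4, 5))  # (time, y, x), made-up hourly values

# split the hourly axis into (day, hour) and aggregate over the hour axis
by_day = hourly.reshape(n_days, 24, 4, 5)
daily_mean = by_day.mean(axis=1)  # e.g. for an instantaneous variable (assumed)
daily_sum = by_day.sum(axis=1)    # e.g. for an accumulated variable (assumed)

# spot-check one random day against a value recomputed from the hourly slices
day = int(rng.integers(n_days))
recomputed = hourly[day * 24:(day + 1) * 24].mean(axis=0)
match = bool(np.allclose(daily_mean[day], recomputed))
```

`np.allclose` is used rather than exact equality because the order of floating-point operations may differ between the two computations.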





Put results into a dataframe:

In [34]:
daily_results_df = pd.DataFrame(new_rows)
daily_results_df

| | model | scenario | variable | timestamp | match | meta |
|---|---|---|---|---|---|---|
| 0 | NCAR-CCSM4 | historical | pcpc | 2004-04-20 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 1 | NCAR-CCSM4 | historical | pcpc | 1976-04-20 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 2 | NCAR-CCSM4 | historical | pcpc | 1995-04-21 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 3 | NCAR-CCSM4 | historical | pcpc | 1990-04-21 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 4 | NCAR-CCSM4 | historical | pcpc | 1983-04-21 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| ... | ... | ... | ... | ... | ... | ... |
| 211 | NCAR-CCSM4 | historical | t2 | 1982-08-14 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 212 | NCAR-CCSM4 | historical | t2 | 1999-09-24 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 213 | NCAR-CCSM4 | historical | t2 | 1989-09-24 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |
| 214 | NCAR-CCSM4 | historical | t2 | 1998-09-24 | True | {'time': {'time zone': 'UTC'}, 'lat': {'standa... |


Were there any mismatches between slices from daily resampled data and expected restacked values? Assert that there were none:

In [35]:
# every row must report a match
assert daily_results_df["match"].all()

### 2 - Ensure that restacked files have correct metadata

#### 2.1 - Hourly data

Have a look at the metadata from the first file:

In [12]:
results_df.iloc[0]["meta"]

{'time': {'time zone': 'UTC'},
 'lat': {'standard_name': 'latitude',
  'title': 'Latitude',
  'units': 'degrees_north',
  'valid_max': 90.0,
  'valid_min': -90.0},
 'lon': {'standard_name': 'longitude',
  'title': 'Longitude',
  'units': 'degrees_east',
  'valid_max': 180.0,
  'valid_min': -180.0},
 'xc': {'standard_name': 'projection_x_coordinate', 'units': 'm'},
 'xc_shape': (262,),
 'yc': {'standard_name': 'projection_y_coordinate', 'units': 'm'},
 'yc_shape': (262,),
 'crs': {'crs_wkt': 'PROJCRS["unknown",BASEGEOGCRS["unknown",DATUM["unknown",ELLIPSOID["unknown",6370000,0,LENGTHUNIT["metre",1,ID["EPSG",9001]]]],PRIMEM["Greenwich",0,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8901]]],CONVERSION["unknown",METHOD["Polar Stereographic (variant B)",ID["EPSG",9829]],PARAMETER["Latitude of standard parallel",64,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8832]],PARAMETER["Longitude of origin",-152,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8833]],PARAMETER["False easting",

Verify that the metadata is identical across all files by asserting that each file's metadata equals that of the first:

In [13]:
assert all(meta == results_df.iloc[0]["meta"] for meta in results_df["meta"])

No `AssertionError` here indicates that all file metadata matches where expected.
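
Should a metadata assertion fail, a plain equality test does not say where two dictionaries differ. The sketch below is one way to locate the differences. `diff_meta` is a hypothetical helper, not part of `restack_20km`, and the two dictionaries are made-up fragments shaped like the metadata shown above.

```python
def diff_meta(expected, actual, path=""):
    """Return the key paths at which two nested metadata dicts differ."""
    diffs = []
    for key in sorted(set(expected) | set(actual)):
        here = f"{path}/{key}"
        if key not in expected or key not in actual:
            # key present on one side only
            diffs.append(here)
        elif isinstance(expected[key], dict) and isinstance(actual[key], dict):
            # recurse into nested attribute dicts
            diffs.extend(diff_meta(expected[key], actual[key], here))
        elif expected[key] != actual[key]:
            diffs.append(here)
    return diffs


# made-up fragments for illustration only
reference = {
    "time": {"time zone": "UTC"},
    "xc_shape": (262,),
    "lat": {"units": "degrees_north"},
}
other = {
    "time": {"time zone": "UTC"},
    "xc_shape": (261,),
    "lat": {"units": "degrees"},
}

diffs = diff_meta(reference, other)
```

Running it against the first file's metadata and that of any failing row would narrow the problem down to specific attributes.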

#### 2.2 - Daily data

The global daily metadata should match that of the hourly restacked data exactly. Verify that expected metadata from all files is exactly the same by asserting each matches the metadata from the first hourly file checked above:

In [38]:
assert all(meta == results_df.iloc[0]["meta"] for meta in daily_results_df["meta"])

No `AssertionError` here indicates that all file metadata matches where expected.