# Quality control for regridding efforts

Use this notebook to check the quality of the regridded data.

In [1]:
import re
from multiprocessing import Pool
import tqdm
import numpy as np
import xarray as xr
from config import *
from regrid import generate_regrid_filepath, open_and_crop_dataset
from crop_non_regrid import get_source_filepaths_from_batch_files

### Check for completeness of regridded files

Get a list of filepaths for all regridded files. This now includes files that were cropped and not regridded.

In [2]:
regrid_fps = list(regrid_dir.glob("**/*.nc"))

Check that all files selected for regridding (which are those listed in the batch regrid files) are found in the regrid directory on scratch space.

First, get all of the source filepaths from the batch files:

In [3]:
src_fps = get_source_filepaths_from_batch_files(regrid_batch_dir)

The regridded files were saved with the grid type component of the original filename replaced by "regrid", so apply the same renaming to the source filenames before comparing them with the regridded filenames:

In [4]:
src_fns = {generate_regrid_filepath(fp, regrid_dir).name for fp in src_fps}
regrid_fns = {fp.name for fp in regrid_fps}

Now, the source files which are not found in the regridding output directory can be isolated. There should be no such files because they should have all been regridded:

In [5]:
missing_fns = list(src_fns - regrid_fns)
assert len(missing_fns) == 0
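
If the assertion above ever fails, it helps to see which files are missing rather than only that some are. The following is a minimal sketch of the same set-difference check on made-up filenames; `find_missing` and the example names are hypothetical and not part of the pipeline code.

```python
def find_missing(expected_names, found_names):
    """Return the sorted names that were expected but not found."""
    return sorted(set(expected_names) - set(found_names))


# Hypothetical filenames, for illustration only
expected = [
    "tas_Amon_ModelA_ssp585_regrid_2015.nc",
    "pr_Amon_ModelA_ssp585_regrid_2015.nc",
]
found = ["tas_Amon_ModelA_ssp585_regrid_2015.nc"]

missing = find_missing(expected, found)
for name in missing:
    print(f"missing: {name}")
```

Sorting the result makes the report stable between runs, which is convenient when comparing the output of repeated checks.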

Sometimes the processing code created files and failed before writing them completely. Ensure there are no files smaller than 0.5 MB:

In [6]:
%%time

def is_smol_file(fp):
    """Check whether a file is small for a regridded CMIP6 file."""
    if fp.stat().st_size / (1e3 ** 2) < 0.5:
        return fp
    else:
        return
    
with Pool(20) as pool:
    smol_fps = pool.map(is_smol_file, regrid_fps)
    
smol_fps = [fp for fp in smol_fps if fp is not None]

assert len(smol_fps) == 0

CPU times: user 91.6 ms, sys: 114 ms, total: 205 ms
Wall time: 6.95 s
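
The size threshold can be tried on throwaway files before running it against the full output directory. This is a sketch only: `find_small_files` is a hypothetical helper, the temporary files stand in for real NetCDF output, and the 0.5 MB default mirrors the threshold used in the cell above.

```python
import tempfile
from pathlib import Path


def find_small_files(fps, min_mb=0.5):
    """Return the filepaths whose size is below min_mb (1 MB = 1e6 bytes)."""
    return [fp for fp in fps if fp.stat().st_size / 1e6 < min_mb]


with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for a truncated file and a complete file
    small = Path(tmp) / "truncated.nc"
    large = Path(tmp) / "complete.nc"
    small.write_bytes(b"\0" * 1_000)
    large.write_bytes(b"\0" * 600_000)

    flagged = [fp.name for fp in find_small_files([small, large])]

print(flagged)
```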


### Validate regridding

Verify that all regridded files (including those that were simply cropped) have the target grid by checking that the latitude and longitude variables match those of the file used for the target grid.

Define a function to check that the lat and lon arrays of the target grid match those of a given regridded filepath:

In [7]:
def validate_latlon(args):
    """Return regrid_fp if its lat/lon arrays differ from the target grid, else None."""
    regrid_fp, target_lat_arr, target_lon_arr = args

    # Close the file handle once the coordinate arrays are read
    with xr.open_dataset(regrid_fp) as regrid_ds:
        lat_arr = regrid_ds["lat"].values
        lon_arr = regrid_ds["lon"].values

    # np.array_equal checks both shape and element-wise equality
    if np.array_equal(lat_arr, target_lat_arr) and np.array_equal(lon_arr, target_lon_arr):
        return None

    return regrid_fp
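
The grid comparison can be illustrated on synthetic coordinate arrays without opening any files. The sketch below assumes 1-D lat and lon arrays. `grids_match` is a hypothetical helper, and its optional `atol` tolerance is an addition for illustration; the check in this notebook uses exact equality.

```python
import numpy as np


def grids_match(lat, lon, target_lat, target_lon, atol=None):
    """Compare lat/lon arrays with a target grid, exactly or within atol."""
    pairs = ((lat, target_lat), (lon, target_lon))
    if atol is None:
        # Exact comparison: shapes and every element must be identical
        return all(np.array_equal(a, b) for a, b in pairs)
    # Tolerant comparison: same shapes, values within an absolute tolerance
    return all(
        a.shape == b.shape and np.allclose(a, b, rtol=0, atol=atol)
        for a, b in pairs
    )


# Synthetic target grid for illustration
target_lat = np.linspace(50, 90, 5)
target_lon = np.linspace(0, 360, 8, endpoint=False)

# A grid with tiny floating-point drift in latitude
drifted_lat = target_lat + 1e-9

print(grids_match(target_lat, target_lon, target_lat, target_lon))
print(grids_match(drifted_lat, target_lon, target_lat, target_lon))
print(grids_match(drifted_lat, target_lon, target_lat, target_lon, atol=1e-6))
```

Exact equality is the stricter choice: it flags any file whose coordinates differ from the target at all, including differences caused only by floating-point rounding.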

Run the check for all regridded and cropped files:

In [8]:
dst_ds = open_and_crop_dataset(target_grid_fp, lat_slice=slice(50, 90))
target_lat_arr = dst_ds["lat"].values
target_lon_arr = dst_ds["lon"].values

args = [(fp, target_lat_arr, target_lon_arr) for fp in regrid_fps]

results = []
with Pool(20) as pool:
    # imap_unordered yields results as they complete, so tqdm can show progress
    for result in tqdm.tqdm(pool.imap_unordered(validate_latlon, args), total=len(args)):
        results.append(result)


100%|██████████| 23476/23476 [01:58<00:00, 197.76it/s]


If the function above returned no filepaths, then all of the files in the regrid output directory have the same latitude and longitude grids as the target grid:

In [10]:
assert np.all([x is None for x in results])
