# Quality control for regridding efforts

Use this notebook to check the quality of the regridded data.

In [13]:
import re
from multiprocessing import Pool
import tqdm
import numpy as np
import xarray as xr
from config import *
from regrid import rename_file

### Check for completeness of regridded files

Get a list of filepaths for all regridded files:

In [4]:
regrid_fps = list(regrid_dir.glob("*/*/*/*/*.nc"))

Check that every file slated for regridding (that is, every file listed in the batch regrid files) is present in the regrid directory on scratch space.

First, collect all of the source filenames from the batch files:

In [5]:
src_fns = []
for fp in regrid_batch_dir.glob("*.txt"):
    with open(fp) as f:
        src_fns.extend([line.split("/")[-1].replace("\n", "") for line in f.readlines()])

Since we renamed the files by replacing the grid type component of the original filename with "regrid", we must standardize the names of both sets of files before comparing them. Do this by dropping "regrid" from the regridded filenames and dropping the grid type component from the raw filenames:

In [6]:
rep = {"_gr_": "_", "_gr1_": "_", "_gn_": "_"}

src_fns = set([rename_file(fn, rep) for fn in src_fns])
regrid_fns = set([fp.name.replace("_regrid_", "_") for fp in regrid_fps])
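As a rough illustration of the standardization step, the sketch below strips the grid-type or "regrid" component with a regular expression. It is not the `rename_file` function from the `regrid` module, whose implementation is not shown here; `standardize_name` and the example filenames are hypothetical, and the grid labels are the ones listed in `rep` above.

```python
import re

# Grid labels taken from the `rep` mapping in this notebook, plus "regrid".
# The pattern requires underscores on both sides, so "_gr_" and "_gr1_" are
# matched as whole filename components.
GRID_COMPONENT = re.compile(r"_(gr1|gr|gn|regrid)_")


def standardize_name(fn):
    """Hypothetical helper: drop the grid-type (or "regrid") component."""
    return GRID_COMPONENT.sub("_", fn, count=1)


# Made-up filenames following the CMIP6 naming pattern
raw_fn = "tas_Amon_CESM2_ssp585_r1i1p1f1_gn_201501-206412.nc"
regrid_fn = "tas_Amon_CESM2_ssp585_r1i1p1f1_regrid_201501-206412.nc"

# Both names collapse to the same standardized form and can be compared as sets
assert standardize_name(raw_fn) == standardize_name(regrid_fn)
```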

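The source files missing from the regridding output directory can now be isolated with a set difference. Provided every regridded file corresponds to a source file, the number of missing files equals the difference between the number of source files and the number of regridded files, and both should be zero: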

In [7]:
missing_fns = list(src_fns - regrid_fns)
len(missing_fns) == (len(src_fns) - len(regrid_fns)) == 0

True
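The length comparison above is only informative when every regridded name also appears among the source names. A minimal sketch with made-up filenames shows the difference; checking the reverse set difference as well is a suggestion here, not something the notebook does.

```python
src = {"a.nc", "b.nc", "c.nc"}

# Case 1: the completed files are a subset of the source files
done = {"a.nc", "b.nc"}
missing = src - done
# The count of missing files matches the difference in set sizes
subset_case_holds = len(missing) == len(src) - len(done)

# Case 2: an unexpected file is present in the output directory
done_with_extra = {"a.nc", "b.nc", "x.nc"}
missing_2 = src - done_with_extra      # still one missing file ("c.nc")
unexpected = done_with_extra - src     # the extra file ("x.nc")
# Here the sizes are equal even though a file is missing
extra_case_holds = len(missing_2) == len(src) - len(done_with_extra)
```

In the second case the size difference is zero while one source file is still missing, so inspecting `src - done` and `done - src` directly is the safer check.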

Sometimes the processing code would create files and fail before writing them completely. Ensure there are no files smaller than 1 MB:

In [8]:
%%time
def is_smol_file(fp):
    """Return fp if the file is smaller than 1 MB (1e6 bytes), otherwise None."""
    if fp.stat().st_size < 1e6:
        return fp
    return None


with Pool(20) as pool:
    smol_fps = pool.map(is_smol_file, regrid_fps)

# keep only the filepaths flagged as too small
smol_fps = [fp for fp in smol_fps if fp is not None]

assert len(smol_fps) == 0

CPU times: user 83.4 ms, sys: 114 ms, total: 198 ms
Wall time: 5.61 s
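The size check can be tried on throwaway files before running it on the real output. The sketch below is an assumption-laden stand-in: `is_too_small`, `MIN_BYTES`, and the temporary filenames are hypothetical, and the files contain zero bytes rather than real NetCDF data.

```python
import tempfile
from pathlib import Path

MIN_BYTES = 1_000_000  # 1 MB in decimal units, matching the threshold above


def is_too_small(fp, min_bytes=MIN_BYTES):
    """Hypothetical helper: True if the file on disk is below the threshold."""
    return fp.stat().st_size < min_bytes


with tempfile.TemporaryDirectory() as tmp_dir:
    # Simulate a file that was created but not written completely
    truncated = Path(tmp_dir) / "truncated.nc"
    truncated.write_bytes(b"\0" * 10)

    # Simulate a file that reached the minimum expected size
    complete = Path(tmp_dir) / "complete.nc"
    complete.write_bytes(b"\0" * MIN_BYTES)

    flagged = [fp.name for fp in (truncated, complete) if is_too_small(fp)]
```

Only the truncated file ends up in `flagged`; a file of exactly `MIN_BYTES` bytes passes because the comparison is strict.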


### Validate regridding

Verify that regridded files all have the target grid by checking that the latitude and longitude variables match those of the file used for the target grid.

Load the target grid file:

In [10]:
dst_ds = xr.open_dataset(target_grid_fp)

Define a function to check that the lat and lon arrays of the target grid match those of a given regridded filepath:

In [18]:
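def validate_latlon(args):
    """Return regrid_fp if its lat/lon arrays differ from the target grid, else None."""
    regrid_fp, target_lat_arr, target_lon_arr = args
    # the context manager closes the file handle once the coordinates are read
    with xr.open_dataset(regrid_fp) as regrid_ds:
        # array_equal also returns False (rather than failing) on a shape mismatch
        lat_ok = np.array_equal(regrid_ds["lat"].values, target_lat_arr)
        lon_ok = np.array_equal(regrid_ds["lon"].values, target_lon_arr)

    return None if (lat_ok and lon_ok) else regrid_fp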
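The function above requires the coordinate arrays to match exactly. The sketch below, using made-up coordinate values, contrasts an exact comparison with a tolerance-based one. The tolerance variant is shown only as an option for grids that differ by floating-point noise; it is not what this notebook uses, and the `atol` value is an arbitrary assumption.

```python
import numpy as np

target = np.linspace(-90, 90, 5)   # stand-in for a target latitude array
identical = target.copy()
noisy = target + 1e-9              # same grid with tiny floating-point offsets
shorter = target[:-1]              # a grid with a different number of points

# Exact comparison: passes only for identical values and shapes
exact_identical = np.array_equal(identical, target)
exact_noisy = np.array_equal(noisy, target)
# A shape mismatch yields False instead of relying on elementwise broadcasting
exact_shorter = np.array_equal(shorter, target)

# Tolerance-based comparison: accepts the noisy grid
close_noisy = np.allclose(noisy, target, rtol=0, atol=1e-6)
```

Exact equality is the stricter choice and fits the case where every file is expected to carry coordinates copied from the same target grid file.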

Run the check for all regridded files:

In [19]:
target_lat_arr = dst_ds["lat"].values
target_lon_arr = dst_ds["lon"].values

args = [(fp, target_lat_arr, target_lon_arr) for fp in regrid_fps]

results = []
with Pool(20) as pool:
    for result in tqdm.tqdm(
        pool.imap_unordered(validate_latlon, args), total=len(args)
    ):
        results.append(result)

100%|██████████| 20368/20368 [02:28<00:00, 137.00it/s]


If the function returned no filepaths, then all of the regridded files have the same latitude and longitude grids as the target:

In [24]:
assert np.all([x is None for x in results])

We also want to run the same check on the files that were not regridded, since these are assumed to already be on the target grid. Find these files by taking the set difference of the standardized filenames between all CMIP6 files and all regridded files.

Get all of the current CMIP6 filepaths:

In [25]:
cmip6_fps = list(cmip6_dir.glob("*/*/*/*/*/*/*/*/*/*.nc"))

Get sets of the standardized names of all CMIP6 and regridded files:

In [26]:
rep = {"_gr_": "_", "_gr1_": "_", "_gn_": "_"}
cmip6_fns = set([rename_file(fp.name, rep) for fp in cmip6_fps])
regrid_fns = set([fp.name.replace("_regrid_", "_") for fp in regrid_fps])

Get the list of filenames which were not regridded:

In [27]:
non_regrid_fns = cmip6_fns - regrid_fns

Find the full filepaths of these files that were not regridded by globbing:

In [28]:
%%time
non_regrid_fps = []
for fn in non_regrid_fns:
    var_id, freq, model, scenario, variant, timespan = fn.split(".nc")[0].split("_")
    fp = list(
        cmip6_dir.glob(
            f"*/*/{model}/{scenario}/{variant}/{freq}/{var_id}/*/*/{var_id}_{freq}_{model}_{scenario}_{variant}_*_{timespan}.nc"
        )
    )[0]
    non_regrid_fps.append(fp)

CPU times: user 2.53 s, sys: 1.53 s, total: 4.06 s
Wall time: 12 s
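The lookup above relies on each standardized filename splitting into exactly six underscore-separated components. The sketch below makes that assumption explicit; `parse_standardized_name` and the example filename are hypothetical, and the glob pattern simply mirrors the one used in the cell above.

```python
KEYS = ("var_id", "freq", "model", "scenario", "variant", "timespan")


def parse_standardized_name(fn):
    """Hypothetical helper: split a standardized filename into its components."""
    parts = fn.split(".nc")[0].split("_")
    if len(parts) != len(KEYS):
        # e.g. a name that still contains its grid-type component
        raise ValueError(f"expected {len(KEYS)} components, got {len(parts)}: {fn}")
    return dict(zip(KEYS, parts))


# Made-up standardized filename (grid-type component already removed)
parsed = parse_standardized_name("tas_Amon_CESM2_ssp585_r1i1p1f1_201501-206412.nc")

# Rebuild the glob pattern, with a wildcard where the grid type was dropped
pattern = (
    "*/*/{model}/{scenario}/{variant}/{freq}/{var_id}/*/*/"
    "{var_id}_{freq}_{model}_{scenario}_{variant}_*_{timespan}.nc"
).format(**parsed)
```

Raising on an unexpected component count gives a clearer error than a failed tuple unpacking, and makes it obvious which filename broke the pattern.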


Verify the grids of these files, which we assume were already on the target grid:

In [29]:
args = [(fp, target_lat_arr, target_lon_arr) for fp in non_regrid_fps]

results = []
with Pool(20) as pool:
    for result in tqdm.tqdm(
        pool.imap_unordered(validate_latlon, args), total=len(args)
    ):
        results.append(result)

100%|██████████| 3119/3119 [00:35<00:00, 87.85it/s]


Again, if the function returned no filepaths, then all of the files that were not regridded do indeed have the same latitude and longitude grids as the target:

In [30]:
assert np.all([x is None for x in results])
