# Quality control for regridding efforts

Use this notebook to check the quality of the regridded data.

In [1]:
import re
from multiprocessing import Pool
import tqdm
import numpy as np
import xarray as xr
from config import *
from regrid import rename_file


### Check for completeness of regridded files

Get a list of filepaths for all regridded files. This now includes files that were cropped and not regridded.

In [29]:
regrid_fps = list(regrid_dir.glob("**/*.nc"))


Check that all files selected for regridding (those listed in the batch regrid files) are present in the regrid directory on scratch space.

First, collect all of the source filenames from the batch files:

In [30]:
src_fns = []
for fp in regrid_batch_dir.glob("*.txt"):
    with open(fp) as f:
        src_fns.extend(
            [line.split("/")[-1].replace("\n", "") for line in f.readlines()]
        )

Because the regridded files were saved with the grid type component of the original filename replaced by "regrid", both sets of filenames must be standardized before they can be compared. Do this by dropping "regrid" from the regridded filenames and dropping the grid type component from the source filenames:

In [31]:
rep = {"_gr_": "_", "_gr1_": "_", "_gn_": "_"}

src_fns = set([rename_file(fn, rep) for fn in src_fns])
regrid_fns = set([fp.name.replace("_regrid_", "_") for fp in regrid_fps])
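
The `rename_file` helper is imported from `regrid` and its implementation is not shown in this notebook. Purely as an illustration of the standardization idea, the sketch below uses a hypothetical `standardize_name` function and made-up filenames to show how a source name and its regridded counterpart collapse to the same key:

```python
import re

# Hypothetical stand-in for the standardization step; this is NOT the
# rename_file helper from regrid.py, whose implementation is not shown.
GRID_OR_REGRID = re.compile(r"_(gr1|gr|gn|regrid)_")


def standardize_name(fn):
    """Drop the grid-type (or "regrid") component from a filename."""
    return GRID_OR_REGRID.sub("_", fn, count=1)


# Made-up filenames following the general CMIP6 naming pattern.
src_name = "tas_Amon_ModelX_ssp585_r1i1p1f1_gn_201501-210012.nc"
regrid_name = "tas_Amon_ModelX_ssp585_r1i1p1f1_regrid_201501-210012.nc"

print(standardize_name(src_name))
print(standardize_name(regrid_name))
```

Both calls print the same standardized name, which is what makes the set comparison in the next step meaningful.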


Now isolate any source files that are not found in the regridding output directory. There should be none, because every source file should have been regridded:

In [32]:
missing_fns = list(src_fns - regrid_fns)
assert len(missing_fns) == 0


Sometimes the processing code would create files and fail before writing them completely. Ensure there are no files smaller than 1 MB:

In [7]:
%%time
def is_smol_file(fp):
    """Return fp if the file is smaller than 1 MB (10**6 bytes), else None."""
    if fp.stat().st_size < 1e6:
        return fp
    return None

with Pool(20) as pool:
    smol_fps = pool.map(is_smol_file, regrid_fps)
    
smol_fps = [fp for fp in smol_fps if fp is not None]

assert len(smol_fps) == 0

CPU times: user 81.2 ms, sys: 93.3 ms, total: 174 ms
Wall time: 4.58 s
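
To make the size threshold concrete, here is a minimal, self-contained sketch that writes two throwaway files to a temporary directory and flags the one under 1 MB (taken here as 10**6 bytes). The `is_small` function and the file names are illustrative assumptions, not part of the processing code:

```python
import tempfile
from pathlib import Path

ONE_MB = 1_000_000  # bytes; 1 MB taken as 10**6 bytes


def is_small(fp, min_bytes=ONE_MB):
    """Return True if the file at fp is smaller than min_bytes."""
    return fp.stat().st_size < min_bytes


with tempfile.TemporaryDirectory() as tmp:
    truncated = Path(tmp, "truncated.nc")
    complete = Path(tmp, "complete.nc")
    truncated.write_bytes(b"\0" * 1_000)      # 1 kB: should be flagged
    complete.write_bytes(b"\0" * 2_000_000)   # 2 MB: should pass
    flagged = [fp.name for fp in (truncated, complete) if is_small(fp)]

print(flagged)
```

Only `truncated.nc` ends up in `flagged`. A size check like this catches files that were cut off early, but it says nothing about whether a larger file is internally valid.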


### Validate regridding

Verify that regridded files all have the target grid by checking that the latitude and longitude variables match those of the file used for the target grid.

Load the target grid file:

In [8]:
dst_ds = xr.open_dataset(target_grid_fp)


Define a function to check that the lat and lon arrays of the target grid match those of a given regridded filepath:

In [9]:
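def validate_latlon(args):
    """Return regrid_fp if its lat/lon arrays differ from the target grid, else None."""
    regrid_fp, target_lat_arr, target_lon_arr = args
    # open within a context manager so the file handle is closed in each worker
    with xr.open_dataset(regrid_fp) as regrid_ds:
        # array_equal also returns False on a shape mismatch
        lat_ok = np.array_equal(regrid_ds["lat"].values, target_lat_arr)
        lon_ok = np.array_equal(regrid_ds["lon"].values, target_lon_arr)

    return None if (lat_ok and lon_ok) else regrid_fp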
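
The check above compares coordinate values exactly. If grids written by different tools ever differed only by floating-point round-off, an exact comparison would flag them. The sketch below uses only the standard library and invented coordinate values to illustrate a tolerance-based alternative. The `coords_match` function and its tolerance are assumptions, not part of this notebook. For NumPy arrays, `np.allclose` performs a comparable element-wise test.

```python
import math


def coords_match(a, b, abs_tol=1e-9):
    """Return True if a and b have equal length and element-wise close values."""
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=0.0, abs_tol=abs_tol) for x, y in zip(a, b)
    )


# Invented latitude values for illustration only.
target_lat = [50.0, 50.5, 51.0]
same_within_roundoff = [50.0, 50.5 + 1e-12, 51.0]
shifted = [50.0, 50.6, 51.0]
shorter = [50.0, 50.5]

print(coords_match(target_lat, same_within_roundoff))
print(coords_match(target_lat, shifted))
print(coords_match(target_lat, shorter))
```

Whether a tolerance is appropriate depends on how the target grid was produced. Exact equality is the stricter test and is what this notebook relies on.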


Run the check for all regridded files:

In [10]:
target_lat_arr = dst_ds["lat"].values
target_lon_arr = dst_ds["lon"].values

args = [(fp, target_lat_arr, target_lon_arr) for fp in regrid_fps]

results = []
with Pool(20) as pool:
    for result in tqdm.tqdm(pool.imap_unordered(validate_latlon, args), total=len(args)):
        results.append(result)

100%|██████████| 20368/20368 [02:07<00:00, 159.39it/s]


If no filepaths were returned from the above function, then all of the regridded files have the same latitude and longitude grids as the target grid:

In [11]:
assert np.all([x is None for x in results])


##### Ready for copy - The regridded files are ready for copying at this point. See the [`README.md`](./README.md) for instructions.

### Validate non-regridded files

We also want to do the same grid check for the files which were not regridded. Find these files by taking the set difference of the standardized filenames between all CMIP6 files and all regridded files.

Get all of the current CMIP6 filepaths:

In [8]:
cmip6_fps = list(cmip6_dir.glob("**/*.nc"))


Get sets of the standardized names of all of the original CMIP6 files and the regridded files:

In [9]:
rep = {"_gr_": "_", "_gr1_": "_", "_gn_": "_"}
cmip6_fns = set([rename_file(fp.name, rep) for fp in cmip6_fps])


Get the list of filenames which were not regridded by taking the set difference on the standardized filenames:

In [10]:
non_regrid_fns = cmip6_fns - src_fns


Build the full filepaths of the files that were not regridded. These files should have been cropped only, but still saved to `regrid_dir` in the same manner as the regridded files:

In [22]:
%%time
non_regrid_fps = []
for fn in non_regrid_fns:
    var_id, freq, model, scenario, variant, timespan = fn.split(".nc")[0].split("_")
    fp = regrid_dir.joinpath(
        f"{model}/{scenario}/{freq}/{var_id}/{var_id}_{freq}_{model}_{scenario}_{variant}_regrid_{timespan}.nc"
    )
    non_regrid_fps.append(fp)

CPU times: user 26.9 ms, sys: 92 µs, total: 27 ms
Wall time: 25.8 ms
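
As a worked illustration of the path construction above, the sketch below applies the same split-and-join logic to a single made-up filename. The root directory and the filename are placeholders (the real `regrid_dir` comes from `config.py`):

```python
from pathlib import PurePosixPath

# Hypothetical root; the real regrid_dir comes from config.py.
regrid_root = PurePosixPath("/path/to/regrid")

# Made-up standardized filename (grid-type component already dropped).
fn = "tas_Amon_ModelX_ssp585_r1i1p1f1_201501-210012.nc"

# Six underscore-separated components are expected after dropping ".nc".
var_id, freq, model, scenario, variant, timespan = fn.split(".nc")[0].split("_")
fp = regrid_root.joinpath(
    model,
    scenario,
    freq,
    var_id,
    f"{var_id}_{freq}_{model}_{scenario}_{variant}_regrid_{timespan}.nc",
)
print(fp)
```

The tuple unpacking raises a `ValueError` if a filename does not split into exactly six components, so an unexpected name would surface immediately rather than yield a wrong path.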


Verify that these files, which we assume were already on the target grid, do in fact match it:

In [16]:
args = [(fp, target_lat_arr, target_lon_arr) for fp in non_regrid_fps]

results = []
with Pool(20) as pool:
    for result in tqdm.tqdm(pool.imap_unordered(validate_latlon, args), total=len(args)):
        results.append(result)

100%|██████████| 3119/3119 [05:44<00:00,  9.04it/s]


Again, if no filenames were returned from the above function, then all of the files which were not regridded do indeed have the same latitude and longitude grids:

In [17]:
assert np.all([x is None for x in results])

Check that all of the mirrored original files are accounted for among the files and symlinks in the final location:

In [19]:
final_regrid_fps = list(final_regrid_dir.glob("**/*.nc"))

assert len(final_regrid_fps) == (len(non_regrid_fps) + len(regrid_fps))
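
Matching counts do not by themselves show that the same files are present. As a complementary sketch, using invented filenames and hypothetical variable names, comparing sets of names in both directions would pinpoint what is missing or unexpected:

```python
# Invented filenames standing in for the expected and final listings.
expected_names = {
    "tas_Amon_ModelX_ssp585_r1i1p1f1_regrid_201501-210012.nc",
    "pr_Amon_ModelX_ssp585_r1i1p1f1_regrid_201501-210012.nc",
}
final_names = {
    "tas_Amon_ModelX_ssp585_r1i1p1f1_regrid_201501-210012.nc",
    "pr_Amon_ModelY_ssp585_r1i1p1f1_regrid_201501-210012.nc",
}

# Names expected but absent, and names present but not expected.
missing = sorted(expected_names - final_names)
unexpected = sorted(final_names - expected_names)

print(len(expected_names) == len(final_names))
print(missing)
print(unexpected)
```

In this toy case the counts agree even though one file is missing and another is unexpected, which is the situation a count-only assertion cannot detect.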


End