## Regridding QC

This notebook performs quality control (QC) on regridded CMIP6 data. It is designed to be called from a prefect flow for a given set of variables, frequencies, models, and scenarios being regridded. It runs some basic QC checks, such as comparing minimums and maximums before and after regridding, and randomly selects a percentage of the regridded files to plot against their original source files for a rapid visual assessment of the regridding.

#### How to use with `prefect` via `papermill`:
This notebook should be run as the final step of the prefect regridding flow. The output will be saved as a new notebook in the QC directory created during the flow. To accomplish this, create a task in the prefect flow that will execute this notebook from the command line using `papermill`, e.g.:

```papermill path/to/repo/regridding/visual_qc.ipynb path/to/qc/output/output.ipynb -r output_directory "/path/to/output/dir" -r cmip6_directory "/path/to/cmip6/dir"```

The first argument is this notebook's location, which can be constructed using the `{output_directory}` parameter of the flow run (i.e., the notebook's location within the downloaded repo directory). The second argument is the desired output location for the executed notebook, which can also be constructed using the `{output_directory}` parameter of the flow run. The remaining arguments are raw strings (denoted by `-r`) giving the working directory (`output_directory`) and input directory (`cmip6_directory`) used in the flow run.
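
As a rough sketch of how a flow task might assemble that command, the snippet below builds the argument list with the standard library only. The function name `build_papermill_command` and its `repo_dir` / `qc_dir` arguments are hypothetical and not part of this repo; the notebook path and the two `-r` parameters are taken from the example command above.

```python
from pathlib import Path


def build_papermill_command(repo_dir, qc_dir, output_directory, cmip6_directory):
    """Build the papermill CLI call as a list of arguments (hypothetical helper)."""
    notebook_in = Path(repo_dir) / "regridding" / "visual_qc.ipynb"
    notebook_out = Path(qc_dir) / "output.ipynb"
    return [
        "papermill",
        str(notebook_in),
        str(notebook_out),
        # each -r takes a parameter name followed by its raw string value
        "-r", "output_directory", str(output_directory),
        "-r", "cmip6_directory", str(cmip6_directory),
    ]


cmd = build_papermill_command(
    "path/to/repo", "path/to/qc/output", "/path/to/output/dir", "/path/to/cmip6/dir"
)
print(" ".join(cmd))
```

The list could then be handed to something like `subprocess.run(cmd, check=True)` inside the task. The other notebook parameters (`vars`, `freqs`, `models`, `scenarios`) could presumably be passed with further `-r` pairs in the same way.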

Papermill parameter cell:

In [1]:
# this cell is tagged "parameters" and contains default parameter values for this notebook
# any parameters injected by papermill during the prefect flow will be written into a new cell directly beneath this one
# and will override the values in this cell
output_directory = "/beegfs/CMIP6/snapdata/cmip6_regridding"
cmip6_directory = "/beegfs/CMIP6/arctic-cmip6/CMIP6"
vars = "tas"
freqs = "mon"
models = "GFDL-ESM4"
scenarios = "ssp370"

#### Setup

In [3]:
from pathlib import Path
from qc import (
    get_source_fps_from_batch_files,
    summarize_slurm_out_files,
    compare_expected_to_existing_and_check_values,
    generate_regrid_fps_from_params,
    plot_comparison,
    extract_params_from_src_filepath,
    subsample_files,
)

Define data sources and parameters for QC. This notebook is expected to only QC the data that was processed in the flow run, i.e. only those files derived from source files which are listed in the existing batch files. We will want to verify that the supplied parameters correspond to these regridded files. 

Determine which regridded files to check:

In [None]:
# set up input and output directories
cmip6_dir = Path(cmip6_directory)
output_dir = Path(output_directory)
regrid_dir = output_dir.joinpath("regrid")

regrid_batch_dir = output_dir.joinpath("regrid_batch")
slurm_dir = output_dir.joinpath("slurm")
slurm_regrid_dir = slurm_dir.joinpath("regrid")

# determine which source files were used
src_fps = get_source_fps_from_batch_files(regrid_batch_dir)

Make sure the expected source files match the parameters supplied to the notebook. If they do not, the notebook was not run with the expected parameters!

In [95]:
src_params = [extract_params_from_src_filepath(fp) for fp in src_fps]
for p_name, p_str in zip(
    ["model", "scenario", "frequency", "variable_id"], [models, scenarios, freqs, vars]
):
    # unique values of this parameter found in the source file paths
    found = sorted({params[p_name] for params in src_params})
    # note: `in` on a string is a substring test
    assert all(value in p_str for value in found), (
        f"Source files submitted for regridding contain values for the {p_name} "
        f"parameter ({', '.join(found)}) which were not supplied for QC in this "
        f"notebook ({p_str})."
    )
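
One caveat with the check above: because the parameters are plain strings, `in` performs a substring test, so a source file for `ta` would pass when `vars = "tas"`. A stricter variant is sketched below. It assumes that multiple values are passed as one whitespace-separated string, which this notebook does not confirm, and the helper name `find_unexpected_values` is hypothetical.

```python
def find_unexpected_values(src_params, supplied):
    """Return {parameter name: sorted values found in the source files but not supplied}."""
    unexpected = {}
    for p_name, p_str in supplied.items():
        # assumption: several values arrive as one whitespace-separated string
        allowed = set(p_str.split())
        found = {params[p_name] for params in src_params}
        extra = sorted(found - allowed)
        if extra:
            unexpected[p_name] = extra
    return unexpected


# made-up parameter dicts, shaped like the ones used in the check above
example_src_params = [
    {"model": "GFDL-ESM4", "scenario": "ssp370", "frequency": "mon", "variable_id": "tas"},
    {"model": "GFDL-ESM4", "scenario": "ssp370", "frequency": "mon", "variable_id": "ta"},
]
example_supplied = {
    "model": "GFDL-ESM4",
    "scenario": "ssp370",
    "frequency": "mon",
    "variable_id": "tas",
}
unexpected = find_unexpected_values(example_src_params, example_supplied)
print(unexpected)  # {'variable_id': ['ta']}
```

Here the substring test would accept `ta` (since `"ta" in "tas"` is true), while the set comparison reports it.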

Ignore certain files based on results in slurm output files:

In [96]:
# check slurm files
fps_to_ignore = summarize_slurm_out_files(slurm_dir)
for fp in fps_to_ignore:
    if fp in src_fps:
        src_fps.remove(fp)

#### Check regridded values

Now compare the expected files to the existing files and make sure the regridded values are OK. This will open and check files in parallel and could take a while.

In [None]:
ds_errors, value_errors, src_min_max, regrid_min_max = (
    compare_expected_to_existing_and_check_values(
        regrid_dir,
        regrid_batch_dir,
        slurm_dir,
        vars,
        freqs,
        models,
        scenarios,
        fps_to_ignore,
    )
)

Oriented bbox: (183.31155833650703, 54.788769187247624, 231.59112423820852, 72.30843416684736)


Here is a summary of the errors:

In [123]:
# print summary messages
error_count = len(ds_errors) + len(value_errors)
print(f"QC process complete: {error_count} errors found.")
if len(ds_errors) > 0:
    print(
        f"Errors in opening some datasets. {len(ds_errors)} files could not be opened."
    )
if len(value_errors) > 0:
    print(
        f"Errors in dataset values. {len(value_errors)} files have regridded values outside of source file range."
    )

QC process complete: 0 errors found.
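
For intuition, the value check reported above ("regridded values outside of source file range") can be thought of as a simple range comparison. The snippet below is a minimal sketch of that idea on plain Python sequences, not the implementation in `qc.py`; the function name and the small tolerance `rel_tol` are assumptions added here, and missing values (NaN) are not handled.

```python
def values_within_source_range(src_values, regrid_values, rel_tol=1e-6):
    """Return True if the regridded min/max lie inside the source min/max."""
    src_min, src_max = min(src_values), max(src_values)
    regrid_min, regrid_max = min(regrid_values), max(regrid_values)
    # assumption: allow a small padding for floating point round-off
    pad = rel_tol * max(abs(src_min), abs(src_max), 1.0)
    return (src_min - pad) <= regrid_min and regrid_max <= (src_max + pad)


# made-up temperature values in kelvin, for illustration only
ok = values_within_source_range([250.0, 275.5, 300.0], [255.0, 280.0, 295.0])
bad = values_within_source_range([250.0, 275.5, 300.0], [240.0, 280.0, 295.0])
print(ok, bad)  # True False
```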


#### Visual assessment

Using only the output regridded file names, we will locate the original CMIP6 source data and plot the source data alongside regridded data to compare visually.

Randomly subsample the regridded files and plot a comparison for each selected file:

In [None]:
regrid_fps = generate_regrid_fps_from_params(models, scenarios, vars, freqs, regrid_dir)
qc_files = subsample_files(regrid_fps)

for fp in qc_files:
    plot_comparison(fp, cmip6_dir)
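
The random selection step can be pictured roughly as follows. This is a sketch only and not the code behind `subsample_files`: the function name `subsample_fraction`, the `fraction` and `seed` arguments, and the rule of always keeping at least one file are all assumptions.

```python
import math
import random


def subsample_fraction(fps, fraction=0.1, seed=None):
    """Randomly pick roughly `fraction` of the file paths (at least one if any exist)."""
    fps = list(fps)
    if not fps:
        return []
    rng = random.Random(seed)  # a seed makes the selection reproducible
    n = max(1, math.ceil(len(fps) * fraction))
    return rng.sample(fps, n)


# hypothetical file names, for illustration only
example_fps = [f"regrid_file_{i:02d}.nc" for i in range(20)]
picked = subsample_fraction(example_fps, fraction=0.25, seed=42)
print(len(picked))  # 5
```

Fixing the seed would make the same files appear in the plots on every rerun, which may help when comparing QC notebooks across flow runs.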