This QC notebook is modeled after the USGS GeoDataPortal [`Zarr` Data Review Tutorial](https://code.usgs.gov/wma/nhgf/geo-data-portal/gdp_data_processing/-/blob/main/workflows/zarr-data-review/tutorial.ipynb?ref_type=heads), but does not include every element of that QC process.

This notebook checks local `zarr` outputs against basic data requirements before they are submitted for further processing and upload to the USGS GeoDataPortal.

### Setup

In [8]:
import xarray as xr
from pathlib import Path

In [9]:
# directory with reformatted zarr files
# right now, this is only a subset of the provisional outputs (GFDL-ESM4 and TaiESM1 models only)

# named zarr_dir to avoid shadowing the built-in dir()
zarr_dir = Path('/beegfs/CMIP6/jdpaul3/cmip6_4km_downscaling/adjusted_cs_reformatted')
zarr_stores = sorted(zarr_dir.glob('*.zarr'))
zarr_stores

[PosixPath('/beegfs/CMIP6/jdpaul3/cmip6_4km_downscaling/adjusted_cs_reformatted/dtr_GFDL-ESM4_historical_adjusted.zarr'),
 PosixPath('/beegfs/CMIP6/jdpaul3/cmip6_4km_downscaling/adjusted_cs_reformatted/dtr_GFDL-ESM4_ssp126_adjusted.zarr'),
 PosixPath('/beegfs/CMIP6/jdpaul3/cmip6_4km_downscaling/adjusted_cs_reformatted/dtr_GFDL-ESM4_ssp245_adjusted.zarr'),
 PosixPath('/beegfs/CMIP6/jdpaul3/cmip6_4km_downscaling/adjusted_cs_reformatted/dtr_GFDL-ESM4_ssp370_adjusted.zarr'),
 PosixPath('/beegfs/CMIP6/jdpaul3/cmip6_4km_downscaling/adjusted_cs_reformatted/dtr_GFDL-ESM4_ssp585_adjusted.zarr'),
 PosixPath('/beegfs/CMIP6/jdpaul3/cmip6_4km_downscaling/adjusted_cs_reformatted/dtr_TaiESM1_historical_adjusted.zarr'),
 PosixPath('/beegfs/CMIP6/jdpaul3/cmip6_4km_downscaling/adjusted_cs_reformatted/dtr_TaiESM1_ssp126_adjusted.zarr'),
 PosixPath('/beegfs/CMIP6/jdpaul3/cmip6_4km_downscaling/adjusted_cs_reformatted/dtr_TaiESM1_ssp245_adjusted.zarr'),
 PosixPath('/beegfs/CMIP6/jdpaul3/cmip6_4km_downscalin

### Review Directory

Each dataset's directory should contain the following files:
- .zattrs
- .zgroup
- .zmetadata
- a directory for each data variable / dimension (ex: 'crs', 'lat', 'time', 'wind_speed')

The following code checks that each of these is present and builds a list of successfully opened datasets for further review.

In [10]:
datasets = []

expected_metadata = {'.zattrs', '.zgroup', '.zmetadata'}

for zarr_store in zarr_stores:

    print(f"Checking {zarr_store.name}... ", end='')

    try:
        ds = xr.open_dataset(
            zarr_store,
            engine='zarr',
            backend_kwargs={'consolidated':True},
            chunks={},
            decode_cf=True,
            decode_times=True
        )
        datasets.append(ds)

    except Exception as e:
        print(f"FAILED to open dataset! Error: {e}")
        continue

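    # opened datasets stay in `datasets` even if some expected entries are missing
    missing = []

    # list every file and directory at the top level of the zarr store
    entries = {p.name for p in zarr_store.iterdir()}

    # store-level metadata files
    for metadata in sorted(expected_metadata):
        if metadata not in entries:
            missing.append(metadata)

    # one directory per data variable / coordinate
    for variable in ds.variables:
        if variable not in entries:
            missing.append(variable)

    if not missing:
        print("PASSED")
    else:
        print("MISSING:", ", ".join(missing))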


print(f"\n\nTotal datasets checked: {len(zarr_stores)}")
print(f"Total datasets successfully opened: {len(datasets)}")

Checking dtr_GFDL-ESM4_historical_adjusted.zarr... PASSED
Checking dtr_GFDL-ESM4_ssp126_adjusted.zarr... PASSED
Checking dtr_GFDL-ESM4_ssp245_adjusted.zarr... PASSED
Checking dtr_GFDL-ESM4_ssp370_adjusted.zarr... PASSED
Checking dtr_GFDL-ESM4_ssp585_adjusted.zarr... PASSED
Checking dtr_TaiESM1_historical_adjusted.zarr... PASSED
Checking dtr_TaiESM1_ssp126_adjusted.zarr... PASSED
Checking dtr_TaiESM1_ssp245_adjusted.zarr... PASSED
Checking dtr_TaiESM1_ssp370_adjusted.zarr... PASSED
Checking dtr_TaiESM1_ssp585_adjusted.zarr... PASSED
Checking pr_GFDL-ESM4_historical_adjusted.zarr... PASSED
Checking pr_GFDL-ESM4_ssp126_adjusted.zarr... PASSED
Checking pr_GFDL-ESM4_ssp245_adjusted.zarr... PASSED
Checking pr_GFDL-ESM4_ssp370_adjusted.zarr... PASSED
Checking pr_GFDL-ESM4_ssp585_adjusted.zarr... PASSED
Checking pr_TaiESM1_historical_adjusted.zarr... PASSED
Checking pr_TaiESM1_ssp126_adjusted.zarr... PASSED
Checking pr_TaiESM1_ssp245_adjusted.zarr... PASSED
Checking pr_TaiESM1_ssp370_adjusted.
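
As a complementary check, the consolidated `.zmetadata` file can be read directly to see which arrays it lists, since stale consolidated metadata could omit a variable directory that exists on disk. The sketch below assumes the Zarr v2 consolidated layout, a JSON file whose `metadata` object has keys such as `time/.zarray`. The helper names `arrays_in_consolidated_metadata` and `unlisted_directories` are hypothetical and are not part of the USGS tutorial.

```python
import json
from pathlib import Path


def arrays_in_consolidated_metadata(store_path):
    """Return the set of array names listed in a store's .zmetadata file.

    Assumes the Zarr v2 consolidated-metadata layout, where the JSON
    document has a "metadata" object keyed by paths like "time/.zarray".
    """
    with open(Path(store_path) / ".zmetadata") as f:
        consolidated = json.load(f)

    names = set()
    for key in consolidated.get("metadata", {}):
        # array entries have keys of the form "<name>/.zarray"
        if key.endswith("/.zarray"):
            names.add(key[: -len("/.zarray")])
    return names


def unlisted_directories(store_path):
    """Return top-level subdirectories that .zmetadata does not list as arrays."""
    store_path = Path(store_path)
    on_disk = {p.name for p in store_path.iterdir() if p.is_dir()}
    return sorted(on_disk - arrays_in_consolidated_metadata(store_path))
```

Calling `unlisted_directories(zarr_store)` for each entry in `zarr_stores` would be expected to return an empty list when the consolidated metadata is up to date.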

### Time decoding

Check the first 5 values in each dataset's `time` dimension to make sure we are seeing actual date objects and not integers. Seeing integers would suggest that `xarray` was unable to decode the CF-compliant `time` values.

In [11]:
for ds in datasets:
    print(f"\nDataset: {ds.encoding.get('source')}")
    print("First 5 time values:", ds['time'].values[:5])


Dataset: /beegfs/CMIP6/jdpaul3/cmip6_4km_downscaling/adjusted_cs_reformatted/dtr_GFDL-ESM4_historical_adjusted.zarr
First 5 time values: [cftime.DatetimeNoLeap(1965, 1, 1, 12, 0, 0, 0, has_year_zero=True)
 cftime.DatetimeNoLeap(1965, 1, 2, 12, 0, 0, 0, has_year_zero=True)
 cftime.DatetimeNoLeap(1965, 1, 3, 12, 0, 0, 0, has_year_zero=True)
 cftime.DatetimeNoLeap(1965, 1, 4, 12, 0, 0, 0, has_year_zero=True)
 cftime.DatetimeNoLeap(1965, 1, 5, 12, 0, 0, 0, has_year_zero=True)]

Dataset: /beegfs/CMIP6/jdpaul3/cmip6_4km_downscaling/adjusted_cs_reformatted/dtr_GFDL-ESM4_ssp126_adjusted.zarr
First 5 time values: [cftime.DatetimeNoLeap(2015, 1, 1, 12, 0, 0, 0, has_year_zero=True)
 cftime.DatetimeNoLeap(2015, 1, 2, 12, 0, 0, 0, has_year_zero=True)
 cftime.DatetimeNoLeap(2015, 1, 3, 12, 0, 0, 0, has_year_zero=True)
 cftime.DatetimeNoLeap(2015, 1, 4, 12, 0, 0, 0, has_year_zero=True)
 cftime.DatetimeNoLeap(2015, 1, 5, 12, 0, 0, 0, has_year_zero=True)]

Dataset: /beegfs/CMIP6/jdpaul3/cmip6_4km_down
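
Reading the printed values by eye works for a handful of stores; for a larger batch, the same check could be automated. The sketch below is one possible approach, not part of the original QC process: it treats any plain integer or float as a sign that the values were not decoded. The function name `time_values_look_decoded` is hypothetical.

```python
import numbers


def time_values_look_decoded(values):
    """Return True if no value in `values` is a plain number.

    Undecoded CF time coordinates show up as integers or floats (offsets
    from a reference date); decoded ones are date-like objects.
    """
    values = list(values)
    if not values:
        return False  # nothing to inspect, so treat as not verified
    return not any(isinstance(v, numbers.Number) for v in values)
```

In this notebook it could be called as `time_values_look_decoded(ds['time'].values[:5])` for each `ds` in `datasets`.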

### File combination

We can test if the files can be successfully combined using `xarray.open_mfdataset()`. 
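
Before running the combine, it can help to confirm that the set of stores is complete, i.e. that every variable and model pair has the same scenarios. The sketch below assumes the `<variable>_<model>_<scenario>_adjusted.zarr` naming pattern seen in the directory listing above; `parse_store_name` and `missing_scenarios` are hypothetical helpers.

```python
from collections import defaultdict


def parse_store_name(name):
    """Split '<variable>_<model>_<scenario>_adjusted.zarr' into its parts."""
    stem = name[: -len(".zarr")] if name.endswith(".zarr") else name
    # split from the right so variable names may contain underscores
    parts = stem.rsplit("_", 3)
    if len(parts) != 4 or parts[3] != "adjusted":
        raise ValueError(f"unexpected store name: {name}")
    variable, model, scenario, _ = parts
    return variable, model, scenario


def missing_scenarios(names):
    """Map (variable, model) to the scenarios it lacks relative to the full set."""
    found = defaultdict(set)
    for name in names:
        variable, model, scenario = parse_store_name(name)
        found[(variable, model)].add(scenario)

    # the "full set" is every scenario seen for any variable/model pair
    all_scenarios = set().union(*found.values()) if found else set()
    return {
        key: sorted(all_scenarios - scenarios)
        for key, scenarios in found.items()
        if all_scenarios - scenarios
    }
```

With the stores listed in this notebook, `missing_scenarios(p.name for p in zarr_stores)` returns an empty dict when no variable and model pair is missing a scenario.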

In [12]:
combined_ds = xr.open_mfdataset(zarr_stores, engine='zarr', combine='by_coords', parallel=True)
combined_ds

| Array | Shape | Chunk shape | Array bytes | Chunk bytes | Dask graph | Data type |
|---|---|---|---|---|---|---|
| 1 | (460, 443) | (50, 50) | 1.55 MiB | 19.53 kiB | 90 chunks in 113 graph layers | float64 numpy.ndarray |
| 2 | (460, 443) | (50, 50) | 1.55 MiB | 19.53 kiB | 90 chunks in 113 graph layers | float64 numpy.ndarray |
| 3 | (49640, 460, 443, 10) | (49640, 50, 50, 1) | 376.84 GiB | 473.40 MiB | 900 chunks in 59 graph layers | float32 numpy.ndarray |
| 4 | (49640, 460, 443, 10) | (31390, 50, 50, 2) | 376.84 GiB | 598.72 MiB | 1620 chunks in 49 graph layers | float32 numpy.ndarray |
| 5 | (49640, 460, 443, 10) | (31390, 50, 50, 2) | 376.84 GiB | 598.72 MiB | 1620 chunks in 49 graph layers | float32 numpy.ndarray |
| 6 | (49640, 460, 443, 10) | (31390, 50, 50, 2) | 376.84 GiB | 598.72 MiB | 1620 chunks in 49 graph layers | float32 numpy.ndarray |
