# 2024-01-16 preprocess NODD precipitation for kerchunk

This notebook does not need to be run. It is included as a reference for how the `combined-kerchunk.json` file was generated. Here we take advantage of `kerchunk`'s ability to read metadata from netCDF files in the cloud and aggregate it into a single reference file, so that the data in those netCDF files can be loaded lazily. It takes roughly five minutes to preprocess the files in this way.
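As a rough illustration of what "aggregating metadata" means, the sketch below builds a tiny hand-written reference set in kerchunk's version 1 layout and looks up where one chunk lives. The bucket URL, the `pr/0.0.0` key, and the byte offsets are all hypothetical values chosen for illustration, and `locate_chunk` is a helper invented here, not part of `kerchunk`.

```python
import json

# Hypothetical reference set in the version 1 layout: a "refs" mapping from
# Zarr keys to either inline strings or [url, offset, length] entries.
# Every URL, key, and number below is made up for illustration.
references = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "pr/0.0.0": ["gs://example-bucket/example.nc", 4096, 1024],
    },
}


def locate_chunk(references, key):
    """Return (url, start, stop) for a chunk, or None for inline entries."""
    entry = references["refs"][key]
    if isinstance(entry, str):
        # Inline metadata (or small inline data): nothing to fetch remotely.
        return None
    if len(entry) == 1:
        # A bare [url] entry refers to the whole remote file.
        return entry[0], None, None
    url, offset, length = entry
    return url, offset, offset + length


print(locate_chunk(references, "pr/0.0.0"))
```

A reader that understands this layout only needs to fetch the listed byte range when a chunk is actually requested, which is what makes the loading lazy.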

The resulting dataset uses the internal chunk structure of the netCDF files as its dask chunks; in this case that happens to be 1 x 720 x 1440 for a dataset with total shape 3640 x 5760 x 11520, where the dimension labels are `"time"`, `"grid_yt"`, `"grid_xt"`.
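The chunk arithmetic implied by these shapes can be checked with a short standard-library sketch. The 4-byte item size assumes the `float32` data type reported in the dataset output further down.

```python
import math

shape = (3640, 5760, 11520)  # (time, grid_yt, grid_xt)
chunk = (1, 720, 1440)
itemsize = 4  # bytes per value, assuming float32

# Number of chunks along each dimension, then in total.
chunks_per_dim = tuple(math.ceil(n / c) for n, c in zip(shape, chunk))
n_chunks = math.prod(chunks_per_dim)

# Uncompressed sizes of one chunk and of the full array.
chunk_mib = math.prod(chunk) * itemsize / 2**20
total_gib = math.prod(shape) * itemsize / 2**30

print(chunks_per_dim, n_chunks)
print(f"{chunk_mib:.2f} MiB per chunk, {total_gib:.2f} GiB in total")
```

Each time step is therefore split into an 8 x 8 grid of spatial chunks.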

In [1]:
import logging

import dask.bag
import dask.diagnostics
import xarray as xr
import ujson

from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr

In [2]:
logging.basicConfig(level=logging.INFO)

In [3]:
COMBINED_TARGET = "combined-kerchunk.json"
PATTERN = "gs://gfdl-xshield-pire-2022/X-SHiELD-2021/PIRE/{timestamp}/pr_C3072_11520x5760.fre.nc"
SECONDS_PER_DAY = 86400
TIMESTAMPS = xr.cftime_range(
    "2019-10-20",
    "2021-01-12",
    freq="5D",
    calendar="julian",
    inclusive="both"
).strftime("%Y%m%d%H")


def generate_single_json(timestamp):
    """Write the kerchunk references for one netCDF file to a local JSON file."""
    path = PATTERN.format(timestamp=timestamp)
    target = f"{timestamp}-kerchunk.json"

    logging.info(f"Writing kerchunk json for {path} to {target}.")
    # Scan the remote file's HDF5 metadata to build the reference set.
    chunks = SingleHdf5ToZarr(path)
    with open(target, "wb") as file:
        output = ujson.dumps(chunks.translate()).encode()
        file.write(output)
    return target


def generate_combined_json(single_targets, combined_target):
    """Combine the per-file reference sets along the time dimension."""
    mzz = MultiZarrToZarr(
        single_targets,
        # "cf:time" decodes the time coordinate using CF conventions.
        coo_map={"time": "cf:time"},
        concat_dims=["time"]
    )
    chunks = mzz.translate()
    output = ujson.dumps(chunks).encode()
    with open(combined_target, "wb") as file:
        file.write(output)


def shift_dataset(ds):
    """Relabel the second half of grid_xt with the negated, reversed first half, then sort."""
    half_nx = ds.sizes["grid_xt"] // 2
    half_grid_xt = ds.grid_xt.isel(grid_xt=slice(None, half_nx))
    shifted_grid_xt = xr.concat(
        [half_grid_xt, -half_grid_xt.isel(grid_xt=slice(None, None, -1))],
        dim="grid_xt"
    )
    ds = ds.assign_coords(grid_xt=shifted_grid_xt)
    return ds.sortby("grid_xt")
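`shift_dataset` is defined above but not called in this notebook. To make its coordinate arithmetic easier to follow, here is a plain-Python sketch of the same relabel-and-sort step applied to a list rather than an `xarray` dataset. It assumes `grid_xt` holds cell-center longitudes evenly spaced over 0 to 360 degrees, which is an interpretation and not something the notebook states; `shift_longitudes` is a name invented for this sketch.

```python
def shift_longitudes(grid_xt):
    """Mimic shift_dataset's coordinate handling on a plain list.

    Returns the sorted, relabeled coordinates and the index order that
    would be applied to the data along grid_xt.
    """
    half_nx = len(grid_xt) // 2
    first_half = grid_xt[:half_nx]
    # The second half is relabeled with the negated, reversed first half.
    relabeled = first_half + [-x for x in reversed(first_half)]
    # Sorting by the new labels corresponds to ds.sortby("grid_xt").
    order = sorted(range(len(relabeled)), key=relabeled.__getitem__)
    return [relabeled[i] for i in order], order


# Eight hypothetical cell centers covering 0 to 360 degrees.
grid = [22.5, 67.5, 112.5, 157.5, 202.5, 247.5, 292.5, 337.5]
shifted, order = shift_longitudes(grid)
print(shifted)
print(order)
```

Under that assumption, the relabeling is equivalent to subtracting 360 from every longitude above 180, so the sorted result runs from -180 to 180 with the original second half moved to the front.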

In [4]:
bag = dask.bag.from_sequence(TIMESTAMPS)
with dask.diagnostics.ProgressBar():
    single_targets = bag.map(generate_single_json).compute()

[########################################] | 100% Completed | 219.50 s
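The bag above holds one element per entry in `TIMESTAMPS`, so one JSON file is written per five-day netCDF file. As a sanity check, the number of files can be reproduced with the standard library alone. This sketch uses `datetime.date` (proleptic Gregorian) instead of the notebook's Julian `cftime` calendar; over this date range both calendars treat 2020 as a leap year, so the day count and the formatted labels should agree.

```python
from datetime import date, timedelta

start, end = date(2019, 10, 20), date(2021, 1, 12)
step = timedelta(days=5)

# Both endpoints are included, matching inclusive="both".
n_files = (end - start) // step + 1
stamps = [(start + i * step).strftime("%Y%m%d%H") for i in range(n_files)]

# 3640 is the length of the time dimension quoted earlier in the notebook.
steps_per_file, remainder = divmod(3640, n_files)

print(n_files, stamps[0], stamps[-1])
print(steps_per_file, remainder)
```

The time dimension divides evenly into 40 steps per file, which would correspond to eight steps per day; that cadence is an inference from the numbers, not something stated in the notebook.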


In [5]:
generate_combined_json(single_targets, COMBINED_TARGET)

In [6]:
ds = xr.open_dataset(
    "reference://", engine="zarr",
    backend_kwargs={
        "storage_options": {
            "fo": COMBINED_TARGET,
            "remote_protocol": "gs",
            "remote_options": {"anon": True}
        },
        "consolidated": False
    },
    chunks={}
)

INFO:fsspec.reference:Read reference from URL combined-kerchunk.json


In [7]:
ds

Dask array summaries from the dataset repr:

| Array shape | Chunk shape | Array bytes | Chunk bytes | Dask graph | Data type |
| --- | --- | --- | --- | --- | --- |
| (3640, 11520, 2) | (1, 11520, 2) | 319.92 MiB | 90.00 kiB | 3640 chunks in 2 graph layers | float32 numpy.ndarray |
| (3640, 5760, 2) | (1, 5760, 2) | 159.96 MiB | 45.00 kiB | 3640 chunks in 2 graph layers | float32 numpy.ndarray |
| (3640, 5760, 11520) | (1, 720, 1440) | 899.78 GiB | 3.96 MiB | 232960 chunks in 2 graph layers | float32 numpy.ndarray |
