
Unable to append to a very large zarr store #1732

Open
timothyas opened this issue Mar 28, 2024 · 0 comments
Labels
bug Potential issues with the zarr-python library

Comments

timothyas commented Mar 28, 2024

Zarr version

v2.16.1

Numcodecs version

v0.12.0

Python Version

3.11.6

Operating System

Linux

Installation

using conda

Description

I have a very large (~1 PB) zarr store on Google Cloud Storage here, containing output from a reanalysis (weather-model fields such as air temperature, wind velocity, etc.). I am using xarray and generally following this guidance to append to the zarr store along the time dimension. The existing dataset has ~30 years of data at 3-hour frequency and 1/4-degree resolution, and I just want to append about two more months, so I am appending a relatively small amount of data to a huge existing store.

I tried this locally with a small example and it worked fine (given below), but I can't seem to do it with this large dataset. I'm executing the initial append step (marked as the problematic step in the code below) from a c2-standard-60 node (240 GB RAM) on Google Cloud, but it never completes, and the node eventually becomes unresponsive. Any tips on how to do something like this would be very helpful, and please let me know if I should post this somewhere else. Thanks in advance!

Steps to reproduce

import numpy as np
import xarray as xr

# this is the existing dataset
ds = xr.open_zarr(
    "gcs://noaa-ufs-gefsv13replay/ufs-hr1/0.25-degree/03h-freq/zarr/fv3.zarr",
    storage_options={"token": "anon"},
)

# grab just 2 time stamps of the data, store locally
# this is an example to mimic the existing dataset on GCS
ds[["tmp"]].isel(time=slice(2)).to_zarr("test.zarr")

# now get the next two time stamps and append
# this is the step that never completes for the real thing
xds = ds[["tmp"]].isel(time=slice(2, 4)).load()
(np.nan * xds).to_zarr("test.zarr", append_dim="time", compute=False)  # <- this is the operation that never completes

# this is what I'll eventually do to actually fill the appended container with values
for i in range(2, 4):
    region = {
        "time": slice(i, i + 1),
        "pfull": slice(None, None),
        "grid_yt": slice(None, None),
        "grid_xt": slice(None, None),
    }
    xds.isel(time=[i - 2]).to_zarr("test.zarr", region=region)

Additional output

No response

@timothyas timothyas added the bug Potential issues with the zarr-python library label Mar 28, 2024