
Unable to append to a very large zarr store #1732

Open
timothyas opened this issue Mar 28, 2024 · 0 comments
Labels
bug Potential issues with the zarr-python library

Comments

timothyas commented Mar 28, 2024

Zarr version

v2.16.1

Numcodecs version

v0.12.0

Python Version

3.11.6

Operating System

Linux

Installation

using conda

Description

I have a very large (~1 PB) zarr store on Google Cloud Storage here, containing output from a reanalysis (weather-model fields such as air temperature, wind velocity, etc.). I am using xarray and generally following this guidance to append to the zarr store along the time dimension. The existing dataset has ~30 years of data at 3-hour frequency and 1/4-degree resolution, and I just want to append about two more months, so I am appending a relatively small amount of data to a huge existing store.

I tried this locally with a small example and it worked fine (given below), but I can't seem to do it with this large dataset. I'm executing the initial append step (marked as the problematic step in the code below) from a c2-standard-60 node (240 GB RAM) on Google Cloud, but it never completes, and the node eventually becomes unresponsive. Any tips on how to do something like this would be very helpful, and please let me know if I should post this somewhere else. Thanks in advance!

Steps to reproduce

import numpy as np
import xarray as xr

# this is the existing dataset
ds = xr.open_zarr(
    "gcs://noaa-ufs-gefsv13replay/ufs-hr1/0.25-degree/03h-freq/zarr/fv3.zarr",
    storage_options={"token": "anon"},
)

# grab just 2 time stamps of the data, store locally
# this is an example to mimic the existing dataset on GCS
ds[["tmp"]].isel(time=slice(2)).to_zarr("test.zarr")

# now get the next two time stamps and append
# this is the step that never completes for the real thing
xds = ds[["tmp"]].isel(time=slice(2, 4)).load()
(np.nan * xds).to_zarr("test.zarr", append_dim="time", compute=False)  # <- this is the operation that never completes

# this is what I'll eventually do to actually fill the appended container with values
for i in range(2, 4):
    region = {
        "time": slice(i, i + 1),
        "pfull": slice(None, None),
        "grid_yt": slice(None, None),
        "grid_xt": slice(None, None),
    }
    xds.isel(time=[i - 2]).to_zarr("test.zarr", region=region)

Additional output

No response

@timothyas timothyas added the bug Potential issues with the zarr-python library label Mar 28, 2024