
to_zarr() is extremely slow writing to high latency store #277

Open
slevang opened this issue Nov 15, 2023 · 3 comments
Labels: IO (Representation of particular file formats as trees)

Comments

@slevang
Contributor

slevang commented Nov 15, 2023

Unbearably so, I would say. Here is an example with a tree containing 13 nodes and negligible data, trying to write to S3/GCS with fsspec:

import numpy as np
import xarray as xr
from datatree import DataTree

ds = xr.Dataset(
    data_vars={
        "a": xr.DataArray(np.ones((2, 2)), coords={"x": [1, 2], "y": [1, 2]}),
        "b": xr.DataArray(np.ones((2, 2)), coords={"x": [1, 2], "y": [1, 2]}),
        "c": xr.DataArray(np.ones((2, 2)), coords={"x": [1, 2], "y": [1, 2]}),
    }
)

dt = DataTree()
for first_level in [1, 2, 3]:
    dt[f"{first_level}"] = DataTree(ds)
    for second_level in [1, 2, 3]:
        dt[f"{first_level}/{second_level}"] = DataTree(ds)

%time dt.to_zarr("test.zarr", mode="w")

bucket = "s3|gs://your-bucket/path" 
%time dt.to_zarr(f"{bucket}/test.zarr", mode="w")

Gives:

Local:
CPU times: user 53.8 ms, sys: 3.95 ms, total: 57.8 ms
Wall time: 58 ms

Remote:
CPU times: user 6.33 s, sys: 211 ms, total: 6.54 s
Wall time: 3min 20s

I suspect one of the culprits may be that we're having to reopen the store without consolidated metadata on writing each node:

datatree/datatree/io.py, lines 205 to 223 at 433f78d:

for node in dt.subtree:
    ds = node.ds
    group_path = node.path
    if ds is None:
        _create_empty_zarr_group(store, group_path, mode)
    else:
        ds.to_zarr(
            store,
            group=group_path,
            mode=mode,
            encoding=encoding.get(node.path),
            consolidated=False,
            **kwargs,
        )
    if "w" in mode:
        mode = "a"
if consolidated:
    consolidate_metadata(store)
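
As a rough check (my own sketch, not from the thread, reusing the dt and bucket objects from the example above), timing each per-node write separately should show whether those per-group round trips account for the wall time:

import time

# Hypothetical diagnostic: write each node as its own group, mirroring the
# loop in io.py above, and print how long each write takes. If every group
# costs a few seconds of pure latency, a 13-node tree adds up to minutes.
mode = "w"
for node in dt.subtree:
    start = time.perf_counter()
    node.to_dataset().to_zarr(
        f"{bucket}/timing-test.zarr",
        group=node.path,
        mode=mode,
        consolidated=False,
    )
    print(f"{node.path}: {time.perf_counter() - start:.1f}s")
    mode = "a"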

Any ideas for easy improvements here?

@jhamman

jhamman commented Nov 15, 2023

Many many ideas for improvements. The Zarr backend we wrote was really meant to be an MVP; it absolutely needs some work. Here's my diagnosis:

  1. As mentioned, opening / listing each group independently is inefficient. This could be addressed here in Datatree (a rough sketch follows below).
  2. Xarray sequentially initializes each group and array, then updates the user attributes. Any batching here would help. This should probably be addressed upstream in Xarray and Zarr-Python.

My approach to (2) is to rethink the Zarr-Python API for creating hierarchies. You may be interested in the discussion here: zarr-developers/zarr-python#1569
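
For what it's worth, here is a rough sketch of (1). It assumes (without having tested it against datatree's backend) that opening the remote store once with fsspec.get_mapper and passing the same mapping to every per-group write avoids re-creating and re-listing the filesystem for each node; the sequential round trips from (2) remain. Names like bucket and dt come from the example above.

import fsspec
import zarr

# Sketch only: reuse one mapping for all groups instead of letting each
# ds.to_zarr() call reconstruct the store from the URL.
store = fsspec.get_mapper(f"{bucket}/test.zarr")

mode = "w"
for node in dt.subtree:
    node.to_dataset().to_zarr(
        store,
        group=node.path,
        mode=mode,
        consolidated=False,
    )
    mode = "a"

zarr.consolidate_metadata(store)  # consolidate once at the end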

@slevang
Contributor Author

slevang commented Nov 15, 2023

Awesome, thanks for the info! I imagine (1) would require reimplementing a good chunk of xarray's ZarrStore and other backend objects here in a way that avoids as many of these serial ops as possible?

@TomNicholas added the IO (Representation of particular file formats as trees) label on Nov 15, 2023
@slevang
Contributor Author

slevang commented Nov 15, 2023

In the meantime, this is plenty fast for the small data case:

from tempfile import TemporaryDirectory

def to_zarr(dt, path):
    # fs is an fsspec filesystem matching the target protocol, e.g. fsspec.filesystem("gs")
    with TemporaryDirectory() as tmp_path:
        dt.to_zarr(tmp_path)
        fs.put(tmp_path, path, recursive=True)

Takes 1s on my example above instead of 3m.
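
For completeness, example usage under my assumptions about how fs is constructed (the bucket path is still a placeholder):

import fsspec

fs = fsspec.filesystem("gs")  # or "s3", matching the bucket's protocol
to_zarr(dt, "gs://your-bucket/path/test.zarr")

The win here is that the whole store is written locally first and fsspec can then transfer the many small objects in one bulk put, instead of paying a metadata round trip per group.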
