NetCDF files with many empty variables require large drop_variables declaration #194

Open
ghidalgo3 opened this issue Jul 18, 2024 · 2 comments

Comments

@ghidalgo3
Contributor

I'm not sure if this is a real issue or working as intended, but I wanted to get clarification on it. The GOES dataset's NetCDF files contain so many non-array variables that the call to virtualizarr.open_virtual_dataset needs a long drop_variables list to work. All of the dropped variables have a chunk_shape of () in common, because they're not arrays.

If validate_chunk_shape skipped validating empty dicts, it would be possible to open this dataset without having to pass such a drop_variables list, but more importantly we could preserve variable metadata when we write a Zarr store for this file and others. Is there a reason VirtualiZarr shouldn't read empty variables like this?

Repro

import planetary_computer
import pystac_client
import virtualizarr as vz

catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1/")
search = catalog.search(
    collections=["goes-cmi"],
    datetime=["2019-08-01T12:00:00Z", "2019-08-01T13:00:00Z"],
    query={"goes:image-type": {"eq": "FULL DISK"}},
)
items = sorted(search.item_collection(), key=lambda x: x.datetime)
item = items[0]
vds = vz.open_virtual_dataset(
    planetary_computer.sign(item.assets["CMIP_C15-nc"].href),
    # drop_variables=[
    #     "processing_parm_version_container",
    #     "earth_sun_distance_anomaly_in_AU",
    #     "goes_imager_projection",
    #     "esun",
    #     "algorithm_product_version_container",
    #     "std_dev_brightness_temperature",
    #     "maximum_focal_plane_temperature",
    #     "focal_plane_temperature_threshold_increasing",
    #     "focal_plane_temperature_threshold_exceeded_count",
    #     "y_image",
    #     "nominal_satellite_height",
    #     "outlier_pixel_count",
    #     "focal_plane_temperature_threshold_decreasing",
    #     "min_brightness_temperature",
    #     "planck_bc2",
    #     "nominal_satellite_subpoint_lon",
    #     "nominal_satellite_subpoint_lat",
    #     "valid_pixel_count",
    #     "planck_bc1",
    #     "mean_brightness_temperature",
    #     "planck_fk2",
    #     "percent_uncorrectable_GRB_errors",
    #     "geospatial_lat_lon_extent",
    #     "kappa0",
    #     "percent_uncorrectable_L0_errors",
    #     "t",
    #     "max_brightness_temperature",
    #     "x_image",
    #     "planck_fk1",
    #     "algorithm_dynamic_input_data_container",
    #     "total_number_of_points"],
    reader_options={},
)
@TomNicholas
Collaborator

By "non-array" you mean scalar, right?

This is definitely an edge case I haven't thought through in great detail.

I think the main question is whether one can have valid Zarr arrays with a chunk shape of (). If not, then we might want to load these variables automatically instead, i.e. add them to loadable_variables.

@ghidalgo3
Contributor Author

ghidalgo3 commented Jul 19, 2024

> By "non-array" you mean scalar right?

I think so? I'm not sure how to translate between the NetCDF concept and a Zarr array; I think ultimately Zarr only deals in arrays or groups. Scalar is a good word for it, though.

> I think the main question is whether one can have valid Zarr arrays with a chunk shape of ().

You can in Zarr v2. This is allowed:

import zarr
store = zarr.open("example.zarr", mode="w", shape=(), chunks=(), path="a")
store.attrs["important attribute"] = 42

Which results in this store:

❯ tree example.zarr -a
example.zarr
├── .zgroup
└── a
    ├── .zarray
    └── .zattrs

.zgroup:

{
    "zarr_format": 2
}

.zarray of a:

{
    "chunks": [],
    "compressor": null,
    "dtype": "<f8",
    "fill_value": 0.0,
    "filters": null,
    "order": "C",
    "shape": [],
    "zarr_format": 2
}

.zattrs of a:

{
    "important attribute": 42
}

But this is not allowed in Zarr v3:

import zarr
zarr.open(store="example.zarr", mode="w", shape=(), chunk_shape=(), path="a")
File ~/code/test/.venv/lib/python3.10/site-packages/zarr/api/synchronous.py:45, in open(store, mode, zarr_version, zarr_format, path, **kwargs)
     36 def open(
     37     *,
     38     store: StoreLike | None = None,
   (...)
     43     **kwargs: Any,  # TODO: type kwargs as valid args to async_api.open
     44 ) -> Array | Group:
---> 45     obj = sync(
     46         async_api.open(
     47             store=store,
     48             mode=mode,
     49             zarr_version=zarr_version,
     50             zarr_format=zarr_format,
     51             path=path,
     52             **kwargs,
     53         )
     54     )
     55     if isinstance(obj, AsyncArray):
     56         return Array(obj)

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/sync.py:92, in sync(coro, loop, timeout)
     89 return_result = next(iter(finished)).result()
     91 if isinstance(return_result, BaseException):
---> 92     raise return_result
     93 else:
     94     return return_result

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/sync.py:51, in _runner(coro)
     46 """
     47 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
     48 exception, the exception will be returned.
     49 """
     50 try:
---> 51     return await coro
     52 except Exception as ex:
     53     return ex

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/api/asynchronous.py:198, in open(store, mode, zarr_version, zarr_format, path, **kwargs)
    195     store_path = store_path / path
    197 try:
--> 198     return await open_array(store=store_path, zarr_format=zarr_format, **kwargs)
    199 except KeyError:
    200     return await open_group(store=store_path, zarr_format=zarr_format, **kwargs)

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/api/asynchronous.py:868, in open_array(store, zarr_version, zarr_format, path, **kwargs)
    865         raise e
    867 # if array was not found, create it
--> 868 return await create(store=store, path=path, zarr_format=zarr_format, **kwargs)

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/api/asynchronous.py:686, in create(shape, chunks, dtype, compressor, fill_value, order, store, synchronizer, overwrite, path, chunk_store, filters, cache_metadata, cache_attrs, read_only, object_codec, dimension_separator, write_empty_chunks, zarr_version, zarr_format, meta_array, attributes, chunk_shape, chunk_key_encoding, codecs, dimension_names, **kwargs)
    683 if path is not None:
    684     store_path = store_path / path
--> 686 return await AsyncArray.create(
    687     store_path,
    688     shape=shape,
    689     chunks=chunks,
    690     dtype=dtype,
    691     compressor=compressor,
    692     fill_value=fill_value,
    693     exists_ok=overwrite,  # TODO: name change
    694     filters=filters,
    695     dimension_separator=dimension_separator,
    696     zarr_format=zarr_format,
    697     chunk_shape=chunk_shape,
    698     chunk_key_encoding=chunk_key_encoding,
    699     codecs=codecs,
    700     dimension_names=dimension_names,
    701     attributes=attributes,
    702     **kwargs,
    703 )

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/array.py:153, in AsyncArray.create(cls, store, shape, dtype, zarr_format, fill_value, attributes, chunk_shape, chunk_key_encoding, codecs, dimension_names, chunks, dimension_separator, order, filters, compressor, exists_ok)
    149     if compressor is not None:
    150         raise ValueError(
    151             "compressor cannot be used for arrays with version 3. Use bytes-to-bytes codecs instead."
    152         )
--> 153     return await cls._create_v3(
    154         store_path,
    155         shape=shape,
    156         dtype=dtype,
    157         chunk_shape=chunk_shape,
    158         fill_value=fill_value,
    159         chunk_key_encoding=chunk_key_encoding,
    160         codecs=codecs,
    161         dimension_names=dimension_names,
    162         attributes=attributes,
    163         exists_ok=exists_ok,
    164     )
    165 elif zarr_format == 2:
    166     if codecs is not None:

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/array.py:237, in AsyncArray._create_v3(cls, store_path, shape, dtype, chunk_shape, fill_value, chunk_key_encoding, codecs, dimension_names, attributes, exists_ok)
    227 if isinstance(chunk_key_encoding, tuple):
    228     chunk_key_encoding = (
    229         V2ChunkKeyEncoding(separator=chunk_key_encoding[1])
    230         if chunk_key_encoding[0] == "v2"
    231         else DefaultChunkKeyEncoding(separator=chunk_key_encoding[1])
    232     )
    234 metadata = ArrayV3Metadata(
    235     shape=shape,
    236     data_type=dtype,
--> 237     chunk_grid=RegularChunkGrid(chunk_shape=chunk_shape),
    238     chunk_key_encoding=chunk_key_encoding,
    239     fill_value=fill_value,
    240     codecs=codecs,
    241     dimension_names=tuple(dimension_names) if dimension_names else None,
    242     attributes=attributes or {},
    243 )
    245 array = cls(metadata=metadata, store_path=store_path)
    247 await array._save_metadata(metadata)

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/chunk_grids.py:51, in RegularChunkGrid.__init__(self, chunk_shape)
     50 def __init__(self, *, chunk_shape: ChunkCoordsLike) -> None:
---> 51     chunk_shape_parsed = parse_shapelike(chunk_shape)
     53     object.__setattr__(self, "chunk_shape", chunk_shape_parsed)

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/common.py:145, in parse_shapelike(data)
    143 data_tuple = tuple(data)
    144 if len(data_tuple) == 0:
--> 145     raise ValueError("Expected at least one element. Got 0.")
    146 if not all(isinstance(v, int) for v in data_tuple):
    147     msg = f"Expected an iterable of integers. Got {type(data)} instead."

Maybe it will be fixed? zarr-developers/zarr-python#1977
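
For illustration only: the check that raises at the bottom of the traceback rejects every empty shape. A relaxed variant that accepts () as a valid 0-dimensional shape might look like the sketch below. The function name is hypothetical, the negative-value check is an added assumption, and this is not necessarily what the linked zarr-python item proposes.

```python
from collections.abc import Iterable


def parse_shapelike_allow_scalar(data: Iterable[int]) -> tuple[int, ...]:
    """Hypothetical relaxed variant of the shape check seen in the traceback."""
    data_tuple = tuple(data)
    # Unlike the check in the traceback, an empty tuple is accepted here:
    # it is treated as the shape of a 0-dimensional (scalar) array.
    if not all(isinstance(v, int) for v in data_tuple):
        raise TypeError(f"Expected an iterable of integers. Got {type(data)} instead.")
    # Assumption for this sketch: negative extents are never valid.
    if any(v < 0 for v in data_tuple):
        raise ValueError(f"Expected non-negative integers. Got {data_tuple}.")
    return data_tuple


scalar_shape = parse_shapelike_allow_scalar(())
regular_shape = parse_shapelike_allow_scalar([2, 3])
```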


Setting these as loadable_variables did work; I think we can live with that for now.

Should VirtualiZarr automatically load scalar variables? I think it's a nice feature for users and it would make it unnecessary to add a long list of variables to the open_virtual_dataset call.
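
As a rough sketch of what "automatically load scalar variables" could mean, the helper below picks out every variable whose shape is (). The helper name and the example shapes are made up for illustration; in practice the shapes would have to come from the file itself (for example from something like `{name: var.shape for name, var in xr.open_dataset(path).variables.items()}`, assuming xarray can open the file).

```python
from collections.abc import Mapping


def scalar_variable_names(shapes: Mapping[str, tuple[int, ...]]) -> list[str]:
    """Return the sorted names of variables whose shape is (), i.e. scalars."""
    return sorted(name for name, shape in shapes.items() if len(shape) == 0)


# Hypothetical shapes, not read from a real GOES file.
example_shapes = {
    "CMI": (4, 4),
    "x": (4,),
    "goes_imager_projection": (),
    "kappa0": (),
}

loadable = scalar_variable_names(example_shapes)
# `loadable` could then be passed to the call from the repro, e.g.
# vz.open_virtual_dataset(url, loadable_variables=loadable)
```

With a list like this there would be no need to maintain the long drop_variables list by hand, and the scalar variables would keep their metadata.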
