An experimental rasterio-based Zarr storage class #2623

Closed
wants to merge 3 commits into from
Changes from 2 commits
72 changes: 72 additions & 0 deletions rasterio/zarr.py
@@ -0,0 +1,72 @@
"""Zarr storage"""

from collections.abc import MutableMapping
import json
import logging
from pathlib import Path

import numpy
from rasterio.windows import Window

log = logging.getLogger(__name__)


class RasterioStore(MutableMapping):
    def __init__(self, dataset):
        self.dataset = dataset
        chunk_height, chunk_width = self.dataset.block_shapes[0]
        self._data = {
Member Author

self._data is the "static" part of this store's data mapping.

            ".zgroup": json.dumps({"zarr_format": 2}).encode("utf-8"),
            Path(self.dataset.name).name
            + "/.zarray": json.dumps(
                {
                    "zarr_format": 2,
                    "shape": (
                        self.dataset.count,
                        self.dataset.height,
                        self.dataset.width,
                    ),
                    "chunks": (
                        1,
Member Author

For this first revision, I chose to put each band in its own chunk. Not ideal for all situations, I'm sure. I've got a lot to learn about zarr in production.

For uncompressed flat binary data, you can break the bytes into chunks however you want. Perhaps this could be a user-configurable parameter.
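A minimal sketch of what a user-configurable chunk shape might look like; it is not part of this PR. The helper name `zarray_metadata`, its `chunks` and `block_shape` parameters, and the 256 x 256 fallback are all assumptions made for illustration, and `dtype_str` stands in for the value of `numpy.dtype(...).str`.

```python
import json


def zarray_metadata(count, height, width, dtype_str, chunks=None, block_shape=(256, 256)):
    """Build Zarr v2 ``.zarray`` metadata as UTF-8 encoded JSON bytes.

    ``chunks`` is a hypothetical user-supplied (bands, rows, cols) tuple.
    When it is None, fall back to one band per chunk at ``block_shape``,
    which mirrors the default chosen in this PR.
    """
    if chunks is None:
        chunks = (1, block_shape[0], block_shape[1])
    if len(chunks) != 3 or any(int(c) < 1 for c in chunks):
        raise ValueError("chunks must be three positive integers")
    return json.dumps(
        {
            "zarr_format": 2,
            "shape": [count, height, width],
            "chunks": [int(c) for c in chunks],
            "dtype": dtype_str,  # e.g. "|u1" for unsigned 8-bit data
            "compressor": None,
            "fill_value": None,
            "order": "C",
            "filters": None,
        }
    ).encode("utf-8")
```

If a chunk spanned more than one band, `__getitem__` would also have to read several bands per key, so the key-to-window translation would need to change along with the metadata.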

                        chunk_height,
                        chunk_width,
                    ),
                    "dtype": numpy.dtype(self.dataset.dtypes[0]).str,
                    "compressor": None,
                    "fill_value": None,
                    "order": "C",
                    "filters": None,
                }
            ).encode("utf-8"),
            Path(self.dataset.name).name + "/.zattrs": json.dumps({}).encode("utf-8"),

What about metadata? Is there any metadata that would be useful to expose as Zarr attrs?
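One possible answer, sketched under assumptions: georeferencing and tags could be serialized into `.zattrs`. The function name `build_zattrs` and the attribute names (`crs_wkt`, `transform`, `nodata`, `tags`) are hypothetical and do not follow any established convention.

```python
import json


def build_zattrs(crs_wkt=None, transform=None, nodata=None, tags=None):
    """Serialize dataset metadata as Zarr ``.zattrs`` JSON bytes.

    All attribute names used here are illustrative assumptions.
    """
    attrs = {}
    if crs_wkt is not None:
        attrs["crs_wkt"] = crs_wkt
    if transform is not None:
        # Keep the six affine coefficients (a, b, c, d, e, f).
        attrs["transform"] = [float(v) for v in transform][:6]
    if nodata is not None:
        attrs["nodata"] = nodata
    if tags:
        attrs["tags"] = dict(tags)
    return json.dumps(attrs).encode("utf-8")
```

With an open rasterio dataset that has a CRS, the call might look like `build_zattrs(dataset.crs.to_wkt(), dataset.transform, dataset.nodata, dataset.tags())`.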

        }

    def __getitem__(self, key):
        if key in self._data:
            return self._data[key]
        elif key.startswith(Path(self.dataset.name).name):
Member Author

@sgillies sgillies Oct 27, 2022

Here's where we intercept raster chunk lookups and translate them into rasterio windowed reads. A bit like kerchunk's byte offsets.
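The translation can be isolated as a pure function, which makes it easy to check by hand. This is a sketch only: `chunk_key_to_window` is a hypothetical helper, and it returns plain integers rather than a `rasterio.windows.Window` so that it runs without rasterio installed.

```python
def chunk_key_to_window(key, chunk_height, chunk_width):
    """Translate a Zarr v2 chunk key such as "RGB.byte.tif/0.1.2".

    Returns (band_index, col_off, row_off, width, height). The band
    index is 1-based, as rasterio expects, and the last four values
    follow the argument order of rasterio.windows.Window.
    """
    # Only the part after the final "/" holds the chunk indices, so
    # dots in the array name do not interfere.
    chunking = key.split("/")[-1]
    band_chunk, row_chunk, col_chunk = (int(x) for x in chunking.split("."))
    return (
        band_chunk + 1,
        col_chunk * chunk_width,
        row_chunk * chunk_height,
        chunk_width,
        chunk_height,
    )
```

For example, with 256-row by 512-column chunks, the key `"RGB.byte.tif/2.1.0"` maps to band 3 and a window starting at column 0, row 256.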

            chunk_height, chunk_width = self.dataset.block_shapes[0]
            chunking = key.split("/")[-1]
            bc, rc, cc = [int(x) for x in chunking.split(".")]
            chunk = self.dataset.read(
                bc + 1,
                window=Window(
                    cc * chunk_width, rc * chunk_height, chunk_width, chunk_height
                ),
                boundless=True,
            )
            return chunk
        else:
Member Author

This else clause is so important! Without it, you get a sorta baffling error about path '' contains an array 😅
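A plausible explanation, offered as an assumption rather than a confirmed diagnosis: the mixin methods of `collections.abc.Mapping` implement `in` by calling `__getitem__` and catching `KeyError`, so a lookup that silently returns `None` makes every key look present, including an array metadata key at the root. The two classes below are hypothetical and only demonstrate that standard-library behaviour.

```python
from collections.abc import Mapping


class SilentStore(Mapping):
    """Hypothetical store whose failed lookups fall through to None."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        if key in self._data:
            return self._data[key]
        # No else clause: a missing key implicitly returns None.

    def __len__(self):
        return len(self._data)

    def __iter__(self):
        return iter(self._data)


class StrictStore(SilentStore):
    """The same store, except that missing keys raise KeyError."""

    def __getitem__(self, key):
        if key in self._data:
            return self._data[key]
        raise KeyError(key)


silent = SilentStore({".zgroup": b"{}"})
strict = StrictStore({".zgroup": b"{}"})
print(".zarray" in silent)  # membership test succeeds for any key
print(".zarray" in strict)  # membership test fails, as it should
```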

            raise KeyError(key)

    def __setitem__(self, key, val):
        # Read-only store: writes are silently ignored.
        pass

    def __delitem__(self, key):
        # Read-only store: deletions are silently ignored.
        pass

    def __len__(self):
        return len(self._data)

This is probably incorrect, since it counts only the metadata keys and leaves out the chunk keys.


    def __iter__(self):
        return iter(self._data)

Same. A store should expose the metadata keys AND the chunk keys.

Member Author

Yes, chunk keys could be generated as needed. But it looked like array access was going to work without these, so I skipped that for now.
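Generating the chunk keys on demand could look roughly like this. It is a sketch: `iter_chunk_keys` is a hypothetical helper, and it assumes Zarr v2 keys of the form `<name>/<band>.<row>.<col>` with partial edge chunks counted as whole chunks.

```python
from itertools import product
from math import ceil


def iter_chunk_keys(name, shape, chunks):
    """Yield every Zarr v2 chunk key for an array called ``name``."""
    # Number of chunks along each axis, rounding up for partial edges.
    counts = [ceil(size / chunk) for size, chunk in zip(shape, chunks)]
    for index in product(*(range(n) for n in counts)):
        yield name + "/" + ".".join(str(i) for i in index)
```

`__iter__` could then chain the static metadata keys with these generated keys, and `__len__` could add the product of the per-axis chunk counts to `len(self._data)`.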

1 change: 1 addition & 0 deletions requirements-dev.txt
@@ -15,3 +15,4 @@ sphinx
sphinx-click
sphinx-rtd-theme
wheel
zarr
1 change: 1 addition & 0 deletions setup.py
@@ -291,6 +291,7 @@ def copy_data_tree(datadir, destdir):
"pytest-cov>=2.2.0",
"pytest>=2.8.2",
"shapely",
"zarr",
],
}

13 changes: 13 additions & 0 deletions tests/test_zarr_store.py
@@ -0,0 +1,13 @@
"""Test of rasterio Zarr store"""

import rasterio
from rasterio.zarr import RasterioStore
import zarr


def test_zarr_store(path_rgb_byte_tif):
    """Open sesame"""
    with rasterio.open(path_rgb_byte_tif) as dataset:
        store = RasterioStore(dataset)
        z = zarr.group(store)
        assert (z["RGB.byte.tif"][:] == dataset.read()).all()
Member Author

Perfect fidelity is always reassuring 😄