An experimental rasterio-based Zarr storage class #2623
Changes from 2 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
"""Zarr storage""" | ||
|
||
from collections.abc import MutableMapping | ||
import json | ||
import logging | ||
from pathlib import Path | ||
|
||
import numpy | ||
from rasterio.windows import Window | ||
|
||
log = logging.getLogger(__name__) | ||
|
||
|
||
class RasterioStore(MutableMapping): | ||
def __init__(self, dataset): | ||
self.dataset = dataset | ||
chunk_height, chunk_width = self.dataset.block_shapes[0] | ||
self._data = { | ||
".zgroup": json.dumps({"zarr_format": 2}).encode("utf-8"), | ||
Path(self.dataset.name).name | ||
+ "/.zarray": json.dumps( | ||
{ | ||
"zarr_format": 2, | ||
"shape": ( | ||
self.dataset.count, | ||
self.dataset.height, | ||
self.dataset.width, | ||
), | ||
"chunks": ( | ||
1, | ||
For this first revision, I chose to put each band in its own chunk. Not ideal for all situations, I'm sure. I've got a lot to learn about zarr in production. |
For uncompressed flat binary data, you can break the bytes into chunks however you want. Perhaps this could be a user-configurable parameter. |
||
chunk_height, | ||
chunk_width, | ||
), | ||
"dtype": numpy.dtype(self.dataset.dtypes[0]).str, | ||
"compressor": None, | ||
"fill_value": None, | ||
"order": "C", | ||
"filters": None, | ||
} | ||
).encode("utf-8"), | ||
Path(self.dataset.name).name + "/.zattrs": json.dumps({}).encode("utf-8"), | ||
What about metadata? Is there any metadata that would be useful to expose as Zarr attrs? |
||
} | ||
|
||
def __getitem__(self, key): | ||
if key in self._data: | ||
return self._data[key] | ||
elif key.startswith(Path(self.dataset.name).name): | ||
Here's where we intercept raster chunk lookups and translate them into rasterio windowed reads. A bit like kerchunk's byte offsets. |
||
chunk_height, chunk_width = self.dataset.block_shapes[0] | ||
chunking = key.split("/")[-1] | ||
bc, rc, cc = [int(x) for x in chunking.split(".")] | ||
chunk = self.dataset.read( | ||
bc + 1, | ||
window=Window( | ||
cc * chunk_width, rc * chunk_height, chunk_width, chunk_height | ||
), | ||
boundless=True, | ||
) | ||
return chunk | ||
else: | ||
This else clause is so important! Without it, you get a sorta baffling error about |
||
raise KeyError("Key not found") | ||
|
||
def __setitem__(self, key, val): | ||
pass | ||
|
||
def __delitem__(self, key): | ||
pass | ||
|
||
def __len__(self): | ||
return len(self._data) | ||
This is probably incorrect, since it doesn't include the chunk keys. |
||
|
||
def __iter__(self): | ||
return iter(self._data) | ||
Same. A store should expose the metadata keys AND the chunk keys. |
Yes, chunk keys could be generated as needed. But it looked like array access was going to work without these, so I skipped that for now. |
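
The review thread above notes that `__len__` and `__iter__` only cover the static metadata keys, and that chunk keys could be generated as needed. A minimal sketch of that idea follows. It assumes Zarr v2 style keys of the form `name/b.r.c`, as parsed in `__getitem__`. The helper names `iter_chunk_keys` and `count_store_keys` are hypothetical and not part of this PR.

```python
import itertools
import math


def chunk_grid(shape, chunks):
    """Number of chunks along each axis; partial edge chunks still count."""
    return [math.ceil(size / chunk) for size, chunk in zip(shape, chunks)]


def iter_chunk_keys(array_name, shape, chunks):
    """Yield keys like "name/0.1.2" for every chunk of the array (hypothetical helper)."""
    grid = chunk_grid(shape, chunks)
    for index in itertools.product(*(range(n) for n in grid)):
        yield array_name + "/" + ".".join(str(i) for i in index)


def count_store_keys(static_keys, shape, chunks):
    """Static metadata keys plus one key per chunk (hypothetical helper)."""
    return len(static_keys) + math.prod(chunk_grid(shape, chunks))
```

A store could then chain `iter(self._data)` with `iter_chunk_keys(...)` in `__iter__` and return the combined count from `__len__`, without materializing any chunk data.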
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,3 +15,4 @@ sphinx | |
sphinx-click | ||
sphinx-rtd-theme | ||
wheel | ||
zarr |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
"""Test of rasterio Zarr store""" | ||
|
||
import rasterio | ||
from rasterio.zarr import RasterioStore | ||
import zarr | ||
|
||
|
||
def test_zarr_store(path_rgb_byte_tif): | ||
"""Open sesame""" | ||
with rasterio.open(path_rgb_byte_tif) as dataset: | ||
store = RasterioStore(dataset) | ||
z = zarr.group(store) | ||
assert (z["RGB.byte.tif"][:] == dataset.read()).all() | ||
Perfect fidelity is always reassuring 😄 |
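
One reviewer suggested making the chunk shape a user-configurable parameter. The sketch below shows how the `.zarray` document might be built from a caller-supplied `chunks` tuple. The function name `build_zarray_metadata` and its signature are assumptions for illustration only. It covers the metadata alone: the `__getitem__` in this PR reads one band per chunk at the dataset's block size, so it would need matching changes before other chunk shapes could work.

```python
import json


def build_zarray_metadata(count, height, width, dtype_str, chunks):
    """Return encoded Zarr v2 array metadata for a (band, row, col) array.

    Hypothetical helper: `chunks` is a caller-supplied 3-tuple rather than
    the fixed (1, block_height, block_width) used in the PR.
    """
    shape = (count, height, width)
    if len(chunks) != 3 or any(int(c) < 1 for c in chunks):
        raise ValueError("chunks must be three positive integers")
    return json.dumps(
        {
            "zarr_format": 2,
            "shape": shape,
            "chunks": tuple(int(c) for c in chunks),
            "dtype": dtype_str,
            "compressor": None,
            "fill_value": None,
            "order": "C",
            "filters": None,
        }
    ).encode("utf-8")
```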
self._data is the "static" part of this store's data mapping.
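
The "dynamic" counterpart is the translation from a chunk key to a windowed read in `__getitem__`. As a rough illustration, that arithmetic can be isolated in a pure function. The name `chunk_key_to_window` is hypothetical, and the returned tuple only mirrors the argument order of `Window(col_off, row_off, width, height)` used in the diff. It does not call rasterio.

```python
def chunk_key_to_window(key, chunk_height, chunk_width):
    """Translate "name/b.r.c" into (band, col_off, row_off, width, height).

    Hypothetical helper mirroring the parsing in RasterioStore.__getitem__.
    """
    chunking = key.split("/")[-1]
    bc, rc, cc = (int(x) for x in chunking.split("."))
    # Zarr chunk indexes are 0-based, while rasterio band indexes start at 1.
    band = bc + 1
    return (band, cc * chunk_width, rc * chunk_height, chunk_width, chunk_height)
```

Because the window is always a full chunk, a chunk on the bottom or right edge can extend past the raster, which is presumably why the read in the diff passes `boundless=True`.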