Connecting with numpytiles #37
Specifically see https://twitter.com/rabernat/status/1319723780076376064 and its descendants for a small bit of twitter discussion.
Big 👍 for rescuing these conversations from twitter.
I regret coming on too strong on twitter arguing that numpytiles should be "deprecated". Twitter is not the best platform for subtle discussion. 😞 Really what I meant was, I crave interoperability between all the cool tools that are part of this numpytiles ecosystem and zarr cloud-based data. 😁 Let me try to explain what I mean. IIUC, numpytiles is a spec that replaces, say, .png or .jpg as a data format within a tiled map. If you limit your view to a single image / tile, and you consider the format to be completely self-contained to that file, then indeed there is no connection to zarr. However, the existence of a tile server implies that there will be many such files, organized in a hierarchy like this:
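One concrete possibility is the standard XYZ "slippy map" layout, with `z` the zoom level and `x` / `y` the tile column and row (the exact layout and the `.npy` extension here are just illustrative):

```
{z}/{x}/{y}.npy
```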
For a particular zoom level, those tiles form a regular 2D grid indexed by x and y.
At this point, it is looking like a 2D zarr array. A 2D Zarr array served over HTTP has chunks with keys like:
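With the default Zarr v2 convention of joining the per-dimension chunk indices with a `.`, a chunk key looks like (again, illustrative):

```
{y}.{x}
```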
(I have transposed x and y following standard python C ordering of dimensions, but this is trivial.)
We could imagine making a separate array for each zoom level and naming these arrays with numbers, such that each zoom level corresponds to its own Zarr array. Now what about the files themselves? Numpy tiles are essentially npy files. The big difference between an npy file (and, by extension, a numpytile) and a zarr chunk is that, with npy, the metadata for decoding the array contents live inside the file, while with zarr, they live in an external .json object, the `.zarray` document.
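A minimal sketch of that difference using plain NumPy (nothing here is specific to either spec; the array is just an illustration):

```python
import io
import numpy as np

arr = np.arange(12, dtype="<i4").reshape(3, 4)

# npy: the shape, dtype, and ordering are embedded in the file's own header
buf = io.BytesIO()
np.save(buf, arr)
print(buf.getvalue()[:10])  # starts with the magic string b"\x93NUMPY", followed by a header dict

# uncompressed zarr chunk: just the raw C-ordered bytes; the shape, dtype, and
# ordering must come from the external .zarray metadata
chunk_bytes = arr.tobytes()
decoded = np.frombuffer(chunk_bytes, dtype="<i4").reshape(3, 4)
assert np.array_equal(decoded, arr)
```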
An example `.zarray` document looks like this:
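(The values below are illustrative rather than taken from any particular dataset: an uncompressed 8-bit array split into 256 × 256 chunks.)

```json
{
    "zarr_format": 2,
    "shape": [65536, 65536],
    "chunks": [256, 256],
    "dtype": "|u1",
    "order": "C",
    "compressor": null,
    "filters": null,
    "fill_value": 0
}
```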
The only real difference on the client side is that the client application only has to fetch this metadata once for the entire array. When it fetches individual chunks (e.g. `0.0`), it already knows how to decode them.

I disagree with the claim I heard on twitter that NPY files are somehow more standard or widely used than zarr chunks. The reason is that, in this context (no compressors or filters), the zarr chunk is just a flat binary file, the simplest and most universal way of transmitting scientific data. To decode it you need to know its shape, its dtype, and its ordering (in this case, fixed as C order).

Why does this matter? Because we would like Zarr to become a leading standard for cloud-based scientific data, with a wide ecosystem of cloud-native tools. There are already some efforts at building js-based visualizers for zarr data, mostly in the biomedical realm. For example, the amazing VIV project by @manzt (github, demo).
Try the demo--is this not essentially the same thing a tiled map app is trying to do?!? The coordinates and context differ, but the fundamentals are the same. Imagine the sort of amazing stuff we could build if we were able to harness the aggregate creativity and technical chops of the geospatial and bioimaging worlds! (@freeman-lab is perhaps uniquely poised to comment on this.) So let's work towards aligning! It won't happen tomorrow, but within one year, could the tiled map ecosystem develop the capability to work natively with appropriately formatted Zarr data? That would be an excellent goal.
I'm very interested in zarr supporting tiles, but I'd also like not to duplicate data. In zarr-developers/zarr-specs#80 I'm imagining an API where:

```python
arr[z, y, x]  # for a 2D array
```

would fetch the data for a zoom level `z` at tile coordinates `(y, x)`.
@davidbrochart IIUC it sounds like you are talking about a way to generate pyramidal tiles on the fly. (Pyramids have come up elsewhere, e.g. zarr-developers/zarr-specs#50, zarr-developers/zarr-python#520). While this sounds very useful, I'd like to suggest that we pursue that in a separate issue. This issue is about possible alignment between zarr and the numpytiles specification. From the point of view of the spec, it does not matter how the data are generated--on the fly, materialized to disk, etc. I'd prefer not to confuse these points.
Sure, sorry for the noise 😄
👋 So just to be clear about how the NumpyTile format is used in practice: we use it to create web map tiles dynamically from COG (or whatever cloud-friendly format GDAL can read), and then, instead of applying rescaling/colormap to the data to create a visual image (PNG), we return the data as NumpyTiles (float, complex...) so the web client can do the rendering itself (right now @kylebarron is working on a deckGL plugin).
👌
In our case, we create the NumpyTiles dynamically, on the fly; we don't store any npy files on disk.
Understood. You do not store any npy data on disk. As I hope I made clear in my comment, Zarr over HTTP likewise does not have to be materialized on disk. The Zarr data can be served via an API, with the chunks generated on demand. Xpublish is an example of this. I see the question of on-disk vs dynamically generated as orthogonal to the questions about formats and specifications. Am I missing something?
Yes, instead of converting each region to a NumpyTile, the same kind of dynamic server can return zarr chunks (and the `.zarray` metadata) directly:

```python
# overly-simplified pseudo-code
def handle_request(key):
    if ".zarray" in key:
        return array_json  # metadata that describes entire array
    x, y = key.split('.')  # here just a simple large 2d array
    arr = read_region(x, y)  # custom function to read region from image as np array
    if compressor:
        return compressor.encode(arr.tobytes())  # can dynamically compress data as well!
    return arr.tobytes()  # just return uncompressed buffer directly!
```

EDIT: Similar to xpublish, I've experimented with a library called simple-zarr-server:

```python
store = COGStore(file, tilesize=512, compressor=Blosc())  # can configure custom tilesize / compressor for tiles

# in python
z = zarr.open(store)
z[:]  # use numpy syntax to return chunks

# as a webserver
from simple_zarr_server import serve
serve(z)  # HTTP endpoint for zarr on localhost:8000
```

Colab example using a custom store (for an openslide-tiff image) to serve the image over HTTP as zarr.
First, I see the reason for "Numpy Tile" existing in the first place as a solution for a simple N-D array format that's very easy to parse in JS. Image formats like jpg/png don't support arbitrary data types, and sending GeoTIFFs to the browser directly requires bringing in a new, complex dependency (geotiff.js), which is 135kb minified.

Most tiled datasets require some sort of external metadata file. Vector and raster datasets in the Mapbox ecosystem often use TileJSON files, which at a minimum describe the valid zoom levels for the dataset and the URL paths for each tile. The Numpy Tile spec doesn't touch this broader metadata at all. It seems like it would be possible to use Zarr for many of these tiled cases where you need to fetch some sort of metadata file anyway.

Here are some discussion points in my view for the use case of tiled map data:
Thanks for pushing us towards interoperability @rabernat! I'm definitely all for aligning standards as much as is reasonably possible. I think I'm still wrapping my head around what interoperability here would look like, to understand if it truly makes sense and what that path would actually be. And if there are real wins from having them be the same format.

My mental model had been that zarr + COG are cloud-native formats, and though they can be accessed directly from javascript, there are cases where it makes sense to have a tile server running close by (lambda- or k8s-based for scalability) to help get data to the browser. And indeed, on learning that most zarr files had large chunks (as Kyle mentioned), that reinforced my view more - store in zarr, optimized for direct scientific use cases on the cloud, but then run a tile server to break it up into the chunks our browser-based mapping libraries already understand.

So the thing I'm getting my head around is the same thing Kyle mentioned - this idea of a 'virtual' zarr dataset generating chunks on demand. Is that a common thing today? And then also trying to understand using that for the established 'web map tiles' use case. I guess it could be cool if zarr parsers could talk directly to any geospatial tile server? But does it make sense? If we wanted to go down this path I think it'd involve trying to add an extension to the new ogc api - tiles that would generate a .zarray file at the base of every tileset.

It still feels a bit 'weird' to me, like we're maybe trying too hard to combine two things that have pretty similar goals and made pretty similar choices. And I'm just not sure what the exact 'win' is. Cloud-native formats to me are all about putting the compute close to the data. Supporting cross-cloud interoperability when your compute is not next to the data is a nice side-effect of interoperability. But sending data as efficiently as possible to the browser on a desktop or mobile device feels like a 'different' problem.

The lower-hanging-fruit interoperability that I'm surprised doesn't exist (or maybe I just didn't find it) is a GDAL driver for Zarr. That would make much more sense to me as a place to start, so that the general GIS toolchain can read zarr directly. And then we don't need to go through tile services for that interoperability. It feels like there's a baseline level of geo tooling for zarr data that we should tackle first.

That said, I'm all in support of trying to evolve these things to come together, and to have a tiles format that is based on zarr chunks. From the Planet perspective, I think we'll complete this 'release' of the spec and open source library. But we don't mean for it to be a 'big' thing, just a small piece that is compatible with XYZ / WMTS tiles, as well as the coming ogc api - tiles. We see COGs as the 'big thing' that we support (and we remain interested in Zarr); numpy tiles are really a minor thing to work with our tile services. I could see this evolving to a zarr-tiles standard, a sort of 'numpy tiles 2.0', once we sort out all the details of how it actually fits into tile services.
Thanks so much to @manzt, @vincentsarago, @kylebarron, and @cholmes for weighing in on this. I have so much admiration and respect for the work you all are doing...it's a real honor to have this discussion! The most important takeaway from what I have read above seems to be the argument that "numpytiles is a relatively minor thing" and therefore probably not worth a lot of effort for aligning. We remain very interested in browser-based visualization, so my concern is that a big ecosystem would emerge around numpytiles that would not interoperate with zarr. Given the limited scope and ambitions for numpytiles, I see now that this is probably not a big concern. That said, let me respond to some specific points.
That is true for the current zarr spec (v2.5). Going forward, the v3 spec will likely have an extension that supports image pyramids (see zarr-developers/zarr-specs#50). This is a major need for the bioimaging community as well.
No. Zarr is a generic container for chunked, compressed arrays with metadata (similar to HDF5). What we need is a community standard or convention, on top of the base Zarr spec, for encoding geospatial data. In Pangeo, we encode our data following NetCDF / CF conventions, which have a convention for CRS. From Zarr's perspective, this is all considered "user metadata". OGC is currently reviewing Zarr as a community standard. By the same token, is there a specification for geospatial referencing in NPY files? No, of course not. It's a low level container for array data. So that seems like a slightly inconsistent comparison.
Yes, this sounds like a significant challenge. The Zarr spec supports the concept of missing chunks, which are defined as being filled with a constant `fill_value`.
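A minimal sketch with zarr-python (the local path is just for illustration) of how chunks that were never written are simply absent from the store and read back as the fill value:

```python
import numpy as np
import zarr

# Chunks that are never written are not stored at all; reading them
# returns the constant fill_value declared in the .zarray metadata.
z = zarr.open("example.zarr", mode="w", shape=(1024, 1024),
              chunks=(256, 256), dtype="f4", fill_value=np.nan)
z[:256, :256] = 1.0            # only this one chunk gets materialized

print(z[0, 0])                 # 1.0
print(np.isnan(z[512, 512]))   # True: missing chunk decoded as fill_value
```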
As we discussed on Twitter, Zarr has absolutely no inherent preference about chunks. They can be as small or as large as the application demands. Those particular chunks were chosen with backend bulk processing in mind, not interactive visualization. On-the-fly, lambda-based rechunking of a zarr dataset would be a really cool application to develop!
Compression is optional. It improves speed and storage / bandwidth utilization at the cost of complexity. Most existing zarr datasets use compression, but it can be turned off easily.
This is not particularly common, although it definitely does work! We saw two example libraries that do this (simple-zarr-server and xpublish) in the comment above. The client doesn't need to know or care about whether the chunks are on disk or generated dynamically. I think there is a lot of potential here for building APIs. But of course none of this is vetted or approved by any standardization body.
This is a convincing point. I guess I tried to explain that in my earlier post. The "win" would be that tools for interactive browser-based visualization of multiscale chunked array data could share a common file format and set of javascript libraries. However, that idea neglects the fact that there is already an extensive ecosystem of tileset viewers that have no real incentive to refactor their operation around this goal.
Yes, this is a great idea. I would personally have no idea where to start though. I guess we would need a motivated person who wants to use Zarr data from GDAL land.
My argument is that Numpy Tiles' metadata describe a single tile whereas Zarr metadata describes an entire dataset. Numpy Tiles don't need geospatial referencing because that's assumed to be external (just as it would be for PNG/JPG/MVT), but since Zarr describes the whole dataset, it needs to include geospatial referencing.
I don't think it's necessarily a challenge as long as you know the geospatial referencing of the dataset. For example, the TileMatrixSet spec is very similar in that its internal indexing starts from the top-left of the grid rather than from geospatial coordinates.

Regarding chunksize and compression: I think my point here is the difference between Zarr as a storage format and Zarr as an API. Zarr as a storage format needs to make these decisions for backend processing; the API server could rechunk to the client's preferred dimensions.
Hi all! Thanks for the super informative discussion here. (My apologies for getting the party started before going dark for a few days) I have a few questions after reading through all of this.
Again, thanks all for the thoughtful and productive discussion!
Agreed, very rarely/never. And the Numpy format includes only enough metadata to read the data back into the original array, but no metadata to describe what the data is. Aside from shape and geospatial referencing, you have no way to know which bands correspond to which wavelength, and if one of the bands is a data mask.
I was interested in direct reading of large datasets straight from the browser. Due to chunksize issues, that wasn't possible with the one dataset I looked at, but zarr.js looks to be perfectly adequate. A demo dataset with small chunks (256px) might be nice; I don't think it's important for it to be uncompressed... numcodecs.js looks to support the normal gzip/blosc.
Agreed. Zarr.js aims to be a feature-complete implementation of Zarr (including indexing and slicing of the full nd-array), but for web-based visualization applications, you really just want to load a tile/chunk by key (e.g. `0.0.0`), the equivalent of fetching a single tile. In that case, Zarr.js is certainly adequate, but it's a large, bulky dependency whose internals can't be totally tree-shaken. This means that there is a high "cost" to bringing on Zarr.js as a dependency for a project, which web-developers are especially opposed to.

The conversation here (and experience with our own applications) led me to think about what is the minimal amount of JavaScript required to load a Zarr array "chunk" by key. Right now I'm experimenting with zarr-lite:

```js
import { openArray } from 'zarr-lite';
import HTTPStore from 'zarr-lite/httpStore';
// import { openArray } from 'zarr';
// import { HTTPStore } from 'zarr'; // could use any store from Zarr.js!

(async () => {
  const store = new HTTPStore('http://localhost:8080/data.zarr');
  const z = await openArray({ store });
  console.log(z.dtype, z.shape, z.chunks);
  // "<i4", [10, 1000, 1000], [5, 500, 500]

  const chunk = await z.getRawChunk('0.0.0'); // get chunk by key; can also use [0, 0, 0]
  console.log(chunk);
  // {
  //   data: Int32Array(1250000),
  //   shape: [5, 500, 500],
  //   strides: [250000, 500, 1],
  // }
})();
```

Please see this interactive example for more info: https://observablehq.com/@manzt/using-zarr-lite
So I guess zarr-lite implements what we call the store API in Zarr v3, and especially the readable store API.
Perhaps in part? Reading what you sent, I don't see where the store is responsible for 1.) decoding a binary blob (if compression is used) and 2.) creating a typed-array view of the decoded buffer.
You're right, the store is not responsible for compression/decompression, it only deals with raw bytes. And no view either. So yes, zarr-lite is more than the store API.
A while ago I started experimenting with cutting tiles on the fly on top of a redis-backed zarr store. The resulting efforts ended up in this project (https://github.com/lbrindze/angle_grinder), which is half baked to say the least. Given what is being discussed here, I think it would be fairly trivial to repurpose the drawing of pngs on the fly to just dynamically cutting a zarr chunk + custom .zattrs that would make up the tile; the mercantile lib already does all the tile math for you, which is awesome! I would be more than happy to adapt this into a tile endpoint feature for Xpublish if there is interest in that.

At least in the consumer weather application space, a lot of folks are moving away from png tiles and just sending off numerical data and letting the client draw it. This most often happens independent of data storage format. The 2 most common ways I've seen of doing this, based on a small survey I did for an internal company presentation a couple months ago, were to shove the tile into protobuf or json frames and send the data that way. Each one I found was a completely different implementation with no standard interoperability of data tiles between these products.
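A rough sketch of that kind of on-the-fly tile cutting, using mercantile for the tile math (the array/coordinate handling is hypothetical, and a real implementation would also resample to a fixed tile size):

```python
import mercantile
import numpy as np

def cut_tile(arr, lon, lat, z, x, y):
    """Slice a lat/lon-gridded 2D array down to the window covered by one XYZ tile.

    `arr` is indexed as [lat, lon]; `lon` and `lat` are ascending 1D coordinate
    vectors for its columns and rows (all hypothetical inputs for illustration).
    """
    b = mercantile.bounds(x, y, z)  # west, south, east, north in degrees
    cols = np.nonzero((lon >= b.west) & (lon <= b.east))[0]
    rows = np.nonzero((lat >= b.south) & (lat <= b.north))[0]
    window = arr[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    # A real endpoint would reproject/resample `window` to, say, 256x256 and
    # attach custom .zattrs describing the tile's georeferencing.
    return window
```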
Thanks so much for sharing @lbrindze!
👆 THIS is why we need to be having this discussion. There is a broader opportunity here to develop standards for browser-based viz of numerical array data.
To respond to @kylebarron's last comments first...
Perhaps this is where zarr has something to offer. The metadata associated with a collection of chunks is shared in the `.zarray` / `.zattrs` documents. I've put a smallish sample zarr dataset in an open-access Azure cloud bucket.
I've also uploaded the notebook I used to make this array, which includes a path to a COG version of the same dataset: https://gist.github.com/jhamman/2a95102567025b3b69e2873a89aaed22

Now to @lbrindze's comments. I'd be super excited to see the development of a tile endpoint for Xpublish. @benbovy has recently refactored the plugin architecture for new API endpoints and we'd be happy to give you any pointers needed. Check out his twitter thread from just today.
Hmm, I'm getting errors with the public access datasets. Are they available via HTTP?

```
$ curl -L https://carbonplan.blob.core.windows.net/carbonplan-share/zarr-demo/nftd/.zarray
# <?xml version="1.0" encoding="utf-8"?><Error><Code>ResourceNotFound</Code><Message>The specified resource does not exist.
# RequestId:964c9a75-c01e-0010-162d-b95a85000000
# Time:2020-11-12T19:56:39.1862516Z</Message></Error>%
```
@manzt - my apologies, the bucket was missing a part of its public setting. Should work now. |
Ah, thank you! BTW, I notice that the default `fill_value` in the `.zarray` metadata is serialized as the string "NaN" rather than a number, which needs special handling on the JavaScript side.
@manzt - very cool. On the subject of the NaN strings, let's follow up here: zarr-developers/zarr-python#429
Great discussion here! It took some time to catch up, but I enjoyed reading it. I guess that choosing between zarr and numpytiles really matters if we need to directly deal with those stores and/or files on the client side?

On a general note, I agree that standards will greatly help towards better interoperability between web applications dealing with scientific, array data. But I also think that it would be better to have a handful of standards, each optimal for given situations, rather than trying to come up with a unique standard that more-or-less fits all use cases. Nowadays, with tools like FastAPI, it's really easy to develop lightweight backends that are adapted to different situations. Supporting multiple specs would not add much maintenance burden IMO. The server and client sides have different constraints and we can easily decouple that. For example, load a zarr store of large, compressed data chunks in the backend, and then serve it via the API as dynamically generated, uncompressed, small tiles with another standard format for data and/or metadata (see the rough sketch at the end of this comment).

@lbrindze, I had a brief look at your angle_grinder project; it would be great to see something like that adapted as an Xpublish tile endpoint.
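A rough sketch of that kind of decoupling, assuming a hypothetical pre-existing 2D zarr store and a fixed 256 × 256 tile size (names and paths are illustrative only):

```python
import numpy as np
import zarr
from fastapi import FastAPI, Response

app = FastAPI()
arr = zarr.open("big_dataset.zarr", mode="r")  # large, compressed chunks on the backend
TILE = 256

@app.get("/tiles/{y}/{x}")
def tile(y: int, x: int):
    # Re-slice the backend chunks into small, uncompressed tiles for the client.
    window = arr[y * TILE:(y + 1) * TILE, x * TILE:(x + 1) * TILE]
    return Response(content=np.ascontiguousarray(window).tobytes(),
                    media_type="application/octet-stream")
```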
Happy to take a look at submitting a PR over the weekend here. I will follow up with something in Xpublish's project directly :)
This use case seems to have some overlap with one or both of h2n5, a web server which encodes 2D slices of N5 volumes into a regular image, and n5-wasm, which is used to decode and slice fetched N5 chunks for use as 2D images in CATMAID (see here).
Maybe xtensor-zarr could be used to implement a GDAL driver for Zarr. It is written in C++, though I'm not sure that makes it a better candidate than zarr-python.
Just a quick note here to share some recent progress using Zarr for multiscale array visualization: https://carbonplan.org/blog/maps-library-release |
The folks @planetlabs (@cholmes, @theduckylittle, and @qscripter) have just released a new project called numpytiles: https://github.com/planetlabs/numpytiles-spec. It's a cool little project designed to make multidimensional arrays usable in the browser. It also has some obvious overlap with Zarr that may be worth exploring.
Pinging @rabernat, @kylebarron, and @freeman-lab who have all engaged in a bit of prior conversation about this.