
Big data support #924

Closed
christophenoel opened this issue Dec 22, 2021 · 4 comments

Comments

@christophenoel

Hello,

We plan to generate a very large data array (lat, lon, time) from the 56,000 Sentinel-2 tiles (per year). The longitude variable would hold about 3 million coordinates, and the array would consist of about 2 billion chunks.

  • What would be the metadata size for that? Is there an explicit mapping to each chunk, or is it computed based on the index?
  • Would the coordinates size be around 8 MB?
  • How fast would a query be resolved? Is Zarr-python efficient with such a large index?

Thank you very much for your support.

@rabernat
Contributor

Thanks for sharing. This sounds like a perfect application for Zarr.

  • What would be the metadata size for that? Is there an explicit mapping to each chunk, or is it computed based on the index?

The metadata size is independent of the array size, as you can see from the spec. Arbitrarily large arrays can be stored in Zarr. This is a fundamental goal of the project.
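
For illustration, here's a minimal sketch (assuming the zarr-python v2 API and hypothetical dimensions) showing that creating even a huge array only writes one small JSON metadata document; there is no per-chunk bookkeeping stored up front:

```python
import zarr

# An in-memory store; chunks that hold no data are never written.
store = zarr.MemoryStore()
z = zarr.open(store, mode="w",
              shape=(1_500_000, 3_000_000, 365),  # hypothetical (lat, lon, time)
              chunks=(1_000, 1_000, 10),
              dtype="f4")

# All array metadata lives in a single small ".zarray" JSON document,
# regardless of how many chunks the array has.
print(len(store[".zarray"]), "bytes of metadata")
```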

  • Would the coordinates size be around 8 MB?

Zarr has no concept of coordinates. Just groups and arrays. Perhaps you're thinking of Xarray?

  • How fast would a query be resolved? Is Zarr-python efficient with such a large index?

Can you clarify what you mean by "query"? Zarr-python supports accessing arrays via numpy-style indexing as described in the docs. The speed at which data are returned will likely depend entirely on your storage system.
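
For example, a small sketch (v2 API, made-up shapes):

```python
import zarr

z = zarr.zeros((10_000, 10_000, 365), chunks=(1_000, 1_000, 10), dtype="f4")

value = z[10, 20, 30]           # touches exactly one chunk
tile = z[0:2_000, 0:2_000, 0]   # touches the four chunks covering this slice
```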

about 2 billion chunks.

This has me worried. Very few storage media are happy with so many files / objects. How do you plan to store your data?

More details would be helpful. What are the explicit lat, lon, time dimensions and chunk sizes you have in mind?

@christophenoel
Author

christophenoel commented Dec 23, 2021

Thanks for your answer. I probably was not clear in my questions, as I mixed concepts from Zarr and from Xarray.

  1. We store geospatial data in a chunked multidimensional array (e.g. data) composed of three dimensions. Can you confirm that Zarr maps from array index to chunk index purely as a function of the chunk shape? That is, it does not use an internal mapping index?

  2. I don't think Amazon S3 is limited in the number of objects, as it already hosts many missions (and in any case, we may increase the chunk size to reduce the number of chunks; a rough sketch of that arithmetic follows this list). So I suppose my concern is how xarray reacts when reading a 3-million-record array that is used to map coordinates to indices of another array.
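
For instance, with made-up dimensions, the object count is just the product of ceil(dim / chunk) over the dimensions:

```python
import math

shape = (1_500_000, 3_000_000, 365)  # assumed lat, lon, time sizes
for chunk in [(1_000, 1_000, 10), (4_000, 4_000, 73)]:
    n = math.prod(math.ceil(s / c) for s, c in zip(shape, chunk))
    print(chunk, "->", f"{n:,} chunks")
```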

@rabernat
Contributor

  1. Can you confirm that Zarr maps from array index to chunk index purely as a function of the chunk shape?

Yes. If you have a 3D Zarr Array and use numpy indexing to retrieve a value by position, e.g. data[10, 20, 30], Zarr will figure out which chunks need to be read. I don't know what you mean by "internal mapping index".
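
The mapping is pure arithmetic, along these lines (an illustrative sketch, not Zarr's actual code):

```python
# Which chunk holds element (10, 20, 30) of an array chunked as (1000, 1000, 10)?
index = (10, 20, 30)
chunks = (1_000, 1_000, 10)

chunk_idx = tuple(i // c for i, c in zip(index, chunks))  # -> (0, 0, 3)
offset = tuple(i % c for i, c in zip(index, chunks))      # -> (10, 20, 0)

# In the v2 format that chunk is stored under a key like "0.0.3";
# no lookup table is consulted.
print(chunk_idx, offset)
```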

  2. I suppose my concern is how xarray reacts when reading a 3-million-record array that is used to map coordinates to indices of another array.

I don't understand what you mean by "map coordinates". Can you clarify? Why do you think xarray will have to read 3 million records? Can you say more about the access pattern you have in mind?

@clbarnes
Contributor

If your data are on an irregular grid (or, for any other reason, you need to look up values by something other than the index in the array), you'll need to use xarray, which can read from a zarr array with particular metadata. Although IIRC it stores its coordinate indices in the zarr metadata (i.e. JSON), so depending on how often it has to deserialise 24 MB of coordinates, there might be some issues there. If your data aren't on a grid at all, I don't think zarr or xarray can help you.
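
For what it's worth, a hedged sketch of what that label-based lookup looks like (hypothetical store path and variable name):

```python
import xarray as xr

ds = xr.open_zarr("s2.zarr")  # hypothetical store

# xarray loads the lat/lon coordinate arrays to build an index, then
# translates the labels into positional indices on the underlying zarr chunks.
point = ds["data"].sel(lat=50.85, lon=4.35, method="nearest")
```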

@zarr-developers zarr-developers locked and limited conversation to collaborators Dec 2, 2022
@joshmoore joshmoore converted this issue into discussion #1281 Dec 2, 2022

