Update design_doc.md #24

danlooo · 2023-11-14T10:32:18Z

This PR addresses #22 to clarify the design document.

Scope

Clarify: xdggs defines the in-memory representation of DGGS data in Python environments.
Other specifications, e.g. the DGGS data specification draft, aim to define a storage (file) format for DGGS data that can also be used from other programming languages. The data specification and xdggs should work together.

Basic operations should include neighbor-search and BBox queries

Getting the cells within a lat/lon bounding box is the fundamental operation used for conversion between geodetic and discrete grids, for plotting, as well as for convolutions. We should define the API for these functions, which are then implemented in the individual DGGSs. I think this is generic enough for all DGGSs.
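
A minimal sketch of what such a shared API could look like, assuming a hypothetical `BBoxQueryable` protocol and a toy `ToyLatLonGrid` standing in for a real DGGS (neither name comes from xdggs). The toy grid cuts the globe into equal-angle cells with row-major ids, which is not a real DGGS but is enough to show the call signature:

```python
from typing import Protocol


class BBoxQueryable(Protocol):
    """Hypothetical interface that each DGGS backend would implement."""

    def cells_in_bbox(
        self, lon_min: float, lat_min: float, lon_max: float, lat_max: float
    ) -> list[int]:
        ...


class ToyLatLonGrid:
    """Equal-angle grid with row-major cell ids (a stand-in, not a real DGGS)."""

    def __init__(self, step_deg: float) -> None:
        self.step = step_deg
        self.n_cols = round(360 / step_deg)
        self.n_rows = round(180 / step_deg)

    def _col(self, lon: float) -> int:
        # Column of the cell containing this longitude, clipped at +180.
        return min(int((lon + 180) // self.step), self.n_cols - 1)

    def _row(self, lat: float) -> int:
        # Row of the cell containing this latitude, clipped at +90.
        return min(int((lat + 90) // self.step), self.n_rows - 1)

    def cells_in_bbox(self, lon_min, lat_min, lon_max, lat_max) -> list[int]:
        # Ids of all cells touched by the box (no antimeridian handling).
        rows = range(self._row(lat_min), self._row(lat_max) + 1)
        cols = range(self._col(lon_min), self._col(lon_max) + 1)
        return [r * self.n_cols + c for r in rows for c in cols]


grid: BBoxQueryable = ToyLatLonGrid(step_deg=10.0)
print(grid.cells_in_bbox(0, 0, 15, 5))  # → [342, 343]
```

Callers only depend on `cells_in_bbox`, so a backend for H3 or DGGRID could swap in its own implementation without changing the calling code.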

Scaling should be a must

When possible, xdggs operations should scale to fine DGGS resolutions (millions of cells).

In my opinion, scaling to at least billions of cells is a MUST, because we are talking about global grids. For small, sub-country level projects we would just use an optimized projection instead of a DGGS.
10^6 cells is still a very coarse resolution that corresponds to ~510 km^2 per cell. DGGRID ISEA4H supports up to 4.3 * 10^10 cells and Uber H3 supports up to 5.7 * 10^14 cells.
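
These figures can be checked with a few lines of arithmetic. The sketch assumes an Earth surface area of about 5.101 * 10^8 km^2, the H3 cell count formula 2 + 120 * 7^r, and the count 10 * 4^r + 2 for an aperture-4 hexagonal grid on the icosahedron. Resolution 16 for ISEA4H is picked only because it reproduces the number quoted above; it is not taken from the DGGRID documentation.

```python
EARTH_SURFACE_KM2 = 5.101e8  # approximate total surface area of the Earth


def mean_cell_area_km2(n_cells: int) -> float:
    """Average cell area if n_cells equal-area cells cover the globe."""
    return EARTH_SURFACE_KM2 / n_cells


def h3_cell_count(resolution: int) -> int:
    """Number of H3 cells at a given resolution (the 12 pentagons included)."""
    return 2 + 120 * 7**resolution


def isea4h_cell_count(resolution: int) -> int:
    """Cell count of an aperture-4 hexagonal grid on the icosahedron (assumed formula)."""
    return 10 * 4**resolution + 2


print(f"{mean_cell_area_km2(10**6):.0f} km^2 per cell at 10^6 cells")
print(f"{mean_cell_area_km2(10**9):.2f} km^2 per cell at 10^9 cells")
print(f"H3 resolution 15: {h3_cell_count(15):.1e} cells")
print(f"ISEA4H resolution 16: {isea4h_cell_count(16):.1e} cells")
```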

Existing standards

The OGC abstract specification topic 21 defines properties of a DGGS including the reference systems of its grids.
However, there is no consensus yet on an actual specification for how to work with DGGS data.

Dimensionality of DGGSIndex

There are multi-dimensional cell id systems, e.g. DGGRID PROJTRI and Uber H3 IJK. Fortunately, an xarray coordinate may consist of multiple dimensions. How do we want to deal with this in DGGSIndex?

Spatiotemporal DGGS

Some DGGS also have multiple time resolutions (e.g. daily, weekly, ...). Like different spatial resolutions, they are stored in different zarr groups. Therefore, we still have just one spatial and one temporal resolution in a dataset.

Raster vs lat/lon grid

What is the difference between, e.g., the functions ds.dggs.from_latlon_grid and ds.dggs.from_raster?

Staggering

In climate models, different variables, e.g. air pressure and wind speed, are stored at different locations within the cell, e.g. at center points, edges or vertices. We should at least put this info into the metadata.

benbovy · 2023-11-14T13:42:22Z

Thanks for chiming in @danlooo!

Thanks for the clarification on DGGS standards.

> In my opinion, scaling to at least billions of cells is a MUST, because we are talking about global grids.

Yes, I agree we should target that. However, I wouldn't say it is a MUST, since horizontal scaling of some aspects of DGGS (especially indexing) can be very challenging and will certainly require a lot of effort. Also, Xarray is not yet 100% ready for supporting lazy indexes (e.g., based on a dask array for the DGGS cell coordinate), although we're slowly getting closer (pydata/xarray#8124).

The primary goal of xdggs is to democratize DGGS, so I wouldn't mind if it doesn't yet scale to the finest DGGS resolutions in its first released versions. Lots of examples that I've seen in GIS use DGGS (H3 or S2) on spatial domains of fairly limited extent (e.g., as a way to aggregate point data), so xdggs with 10^4 - 10^6 cells would already be useful there. Also, I think that we could already reach acceptable resolutions (100M-500M cells) on a global grid using recent hardware (even laptops and desktops). I haven't run experiments, though.

Out of curiosity, do you have examples of existing real-world datasets on a DGGS that are bigger than 10^9 cells?

> There are multi-dimensional cell id systems, e.g. DGGRID PROJTRI and Uber H3 IJK. Fortunately, an xarray coordinate may consist of multiple dimensions. How do we want to deal with this in DGGSIndex?

Although that is certainly possible, there would be a number of challenges in supporting >1-d coordinates with DGGSIndex (at least in its current implementation backed by a PandasIndex).

Alternatively, I guess xdggs could provide index classes distinct from DGGSIndex that are designed for these specific cases.

> What is the difference between, e.g., the functions ds.dggs.from_latlon_grid and ds.dggs.from_raster?

While both are rectilinear, the 1st one is a global grid with geodetic coordinates while the 2nd one is a regional grid with projected coordinates (at least this is what I had in mind when writing those API examples).

> In climate models, different variables, e.g. air pressure and wind speed, are stored at different locations within the cell, e.g. at center points, edges or vertices. We should at least put this info into the metadata.

Good point. Maybe we can also add a reference to xgcm here.

benbovy · 2023-11-14T13:50:27Z

design_doc.md


 Examples of common DGGS features that `xdggs` should provide or facilitate:

 - convert a DGGS from/to another grid (e.g., a DGGS, a latitude/longitude rectilinear grid, a raster grid, an unstructured mesh)
 - convert a DGGS from/to vector data (points, lines, polygons, envelopes)
+- nearest neighbor search and bounding box queries around a given cell


Aren't these special cases of selection by geometries (already mentioned below)? For example, in xvec you can achieve that on vector data cubes using ds.xvec.query() with a polygon created by shapely.box() or an array of shapely.Point objects.

Yes. However, bounding boxes are convex polygons so we can use much faster algorithms for this special case.

danlooo · 2023-11-14T14:35:23Z

The index has 2 major properties:

It affects the cell ordering, also on disk, and thus chunking and I/O performance.
It affects the algorithms used for nearest neighbor search and bounding box queries: in DGGRID Q2DI, we only need to convert the 4 corner points to get all cell ids within that box. However, for DGGRID SEQNUM, we need to query all points on a lat/lon grid individually to get the list of all seqnum cell ids.
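
A toy sketch of that difference, assuming a flat 2-D index `(i, j)` on a single planar quad and a row-major sequence number as stand-ins for Q2DI and SEQNUM. The real DGGRID encodings and the spherical geometry are more involved, and all names here are hypothetical. With the 2-D index the box corners give the index ranges directly, while with the 1-D index every sample point has to be converted on its own:

```python
import math

N = 8       # toy quad with N x N cells
CELL = 1.0  # cell edge length in toy coordinate units


def point_to_ij(x: float, y: float) -> tuple[int, int]:
    """2-D cell id (i, j) of a point on the toy quad."""
    return min(int(x // CELL), N - 1), min(int(y // CELL), N - 1)


def point_to_seqnum(x: float, y: float) -> int:
    """1-D row-major cell id of a point on the toy quad."""
    i, j = point_to_ij(x, y)
    return j * N + i


def bbox_cells_2d(x0, y0, x1, y1) -> set[tuple[int, int]]:
    # Only corners are converted (two opposite ones suffice on this
    # axis-aligned toy quad); the cells in between follow from index ranges.
    (i0, j0), (i1, j1) = point_to_ij(x0, y0), point_to_ij(x1, y1)
    return {(i, j) for i in range(i0, i1 + 1) for j in range(j0, j1 + 1)}


def _samples(lo: float, hi: float, step: float) -> list[float]:
    # Evenly spaced points from lo to hi inclusive, spacing at most `step`.
    n = max(1, math.ceil((hi - lo) / step))
    return [lo + (hi - lo) * k / n for k in range(n + 1)]


def bbox_cells_seqnum(x0, y0, x1, y1, step=CELL / 2) -> set[int]:
    # No usable index structure: sample the box densely, convert each point.
    return {
        point_to_seqnum(x, y)
        for y in _samples(y0, y1, step)
        for x in _samples(x0, x1, step)
    }


box = (1.2, 2.3, 3.7, 4.1)
same = bbox_cells_seqnum(*box) == {j * N + i for i, j in bbox_cells_2d(*box)}
print(same)
```

Both functions find the same cells here, but the second one performs one conversion per sample point instead of one per corner.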

I suggest treating them as two separate DGGS. Let DGGS be a subclass of xarray.Dataset. Then we can create a class DGGRIDQ2DI as a subclass of DGGS and implement special index-aware functions for e.g. .dggs.query().

I haven't seen any native DGGS datasets yet that cover the entire earth. I think it's because of the lack of tooling and file formats that we are currently developing ;)

benbovy · 2023-11-14T15:01:13Z

I'm not familiar enough with DGGRIDQ2DI (and actually DGGS in general, at least not as much as you :)) but yes I imagine that the specific properties of DGGS (and/or their implementation) may help to index / query the grid cells in a more optimal way than considering them as arbitrary vector geometries.

Supporting spatial indexing via converting a DGGS data cube to a vector data cube has the advantage that, although suboptimal, it works for the general case, and also that we can already use Xvec, so it is low-hanging fruit.

More optimal, DGGS-specific indexing is certainly welcome and it is quite complementary (as I doubt we could easily reach the same flexibility, e.g., working with a variety of predicates). The challenging part is figuring out how to support that in a consistent way across all kinds of DGGS and via a common API. This would require some more thinking.

> I suggest treating them as two separate DGGS. Let DGGS be a subclass of xarray.Dataset. Then we can create a class DGGRIDQ2DI as a subclass of DGGS and implement special index-aware functions for e.g. .dggs.query().

Usually we recommend not subclassing xarray.Dataset. However, we could allow special index logic to be implemented in DGGSIndex subclasses (e.g., DGGRIDQ2DIIndex) and executed via .dggs.query().
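
A rough sketch of that dispatch pattern in plain Python, without xarray. The class names follow the ones used above, but the method `query_bbox`, the toy cell layout, and the `DGGSAccessor` wrapper are assumptions made for illustration and are not the xdggs API:

```python
import math


class DGGSIndex:
    """Generic index: knows each cell's centre, nothing about grid structure."""

    def __init__(self, centres: dict[int, tuple[float, float]]) -> None:
        self.centres = centres  # cell id -> (lon, lat) of the cell centre

    def query_bbox(self, lon_min, lat_min, lon_max, lat_max) -> list[int]:
        # Fallback that works for any DGGS: test every cell centre.
        return sorted(
            cid
            for cid, (lon, lat) in self.centres.items()
            if lon_min <= lon <= lon_max and lat_min <= lat <= lat_max
        )


class DGGRIDQ2DIIndex(DGGSIndex):
    """Index-aware subclass: toy 1-degree cells with id = row * n_cols + col."""

    def __init__(self, n_cols: int, n_rows: int) -> None:
        self.n_cols, self.n_rows = n_cols, n_rows
        super().__init__(
            {
                r * n_cols + c: (c + 0.5, r + 0.5)
                for r in range(n_rows)
                for c in range(n_cols)
            }
        )

    def query_bbox(self, lon_min, lat_min, lon_max, lat_max) -> list[int]:
        # Fast path: derive row/column ranges from the box edges instead of
        # scanning all cells (same "centre inside box" rule as the fallback).
        c0 = max(0, math.ceil(lon_min - 0.5))
        c1 = min(self.n_cols - 1, math.floor(lon_max - 0.5))
        r0 = max(0, math.ceil(lat_min - 0.5))
        r1 = min(self.n_rows - 1, math.floor(lat_max - 0.5))
        return [
            r * self.n_cols + c
            for r in range(r0, r1 + 1)
            for c in range(c0, c1 + 1)
        ]


class DGGSAccessor:
    """Stand-in for the `.dggs` accessor: delegates to the index it holds."""

    def __init__(self, index: DGGSIndex) -> None:
        self._index = index

    def query(self, lon_min, lat_min, lon_max, lat_max) -> list[int]:
        return self._index.query_bbox(lon_min, lat_min, lon_max, lat_max)


fast = DGGRIDQ2DIIndex(n_cols=10, n_rows=10)
generic = DGGSIndex(fast.centres)
box = (1.2, 2.0, 4.6, 3.9)
print(DGGSAccessor(fast).query(*box) == DGGSAccessor(generic).query(*box))
```

The accessor never inspects the index type, so the Dataset itself stays a plain container and only the index class changes between DGGS flavours.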

benbovy · 2023-11-20T12:30:47Z

@danlooo those are all great additions and everything looks good to me except perhaps (very nit picking!) the "must" on scalability that I find too imperative (what do you think about "shall"?). Is there any other update you want to suggest in this PR?

danlooo · 2023-11-21T08:54:44Z

@benbovy All right. I'm fine with making scalability a recommendation rather than a requirement, to promote prototyping. This PR can be merged from my side.

benbovy · 2023-11-21T09:17:49Z

Great, thanks @danlooo!

Update design_doc.md

4df9616

benbovy reviewed Nov 14, 2023

danlooo added 2 commits November 14, 2023 15:12

Add multi dimensional index

b5aa85a

BBox queries are special cases of polygon queries

1e5ee11

danlooo marked this pull request as ready for review November 14, 2023 14:20

Update scaleability to recommended

b8dae54

benbovy merged commit 6dc6a30 into xarray-contrib:main Nov 21, 2023
