Skip to content

Letter to CF Response 20161130

bekozi edited this page Dec 20, 2016 · 10 revisions

Jonathan and Chris,

Thank you kindly for the replies and apologies for the delay on the response. Please see responses below. The comments were very useful in introducing some conceptual hangups and will help us moving forward. As usual, looking forward to continued discussion.

A quick note. We've revised our thinking a bit and recorded our current design in an AGU poster: http://goo.gl/0NI4Sd. Based on the feedback in this thread and what we've learned in the process of preparing the poster and software, we need to discuss a bit more and will prepare a proposal for the community to review soon.

I was asking whether this means that for each collection (of points, lines or polygons) there is a single timeseries. For instance, in your example of a single geometry composed of several polygons, there is a single number for each time. But that is not the case for weather stations; for each weather station there is a timeseries, and at each time there is a different number (value of temperature, precipitation or whatever) for each weather station. You also write, "The US National Weather Service’s National Water Model (NWM) ... forecasts streamflow rates in about 2.7 million stream segments averaging 2km." The stream network is a MultiLineString geometry, but I don't think there is just one value of streamflow applying to the entire network at any given time; I guess there is a different timeseries for each stream segment. But in my example above, the Atlantic Ocean is a single polygon with a single timeseries for its average temperature, not a different timeseries for each node. Thus I am unclear about the dimensions of the data. In terms of your original example, does the data have dimensions (time,geometry, where geometry=1) or (time,node)?

Before diving in, it's critical to define some terminology. A geometry is meant to refer to a potentially multipart geometric entity that might otherwise be called a feature. A geometry is made up of one or more points, lines, or polygons.

That said, we are thinking the dimensions of time-varying data would be (time, geometry) where time and geometry may have arbitrary lengths. Hence, multiple time-varying variables could be associated with each geometry. Chris addressed this in his response.

How geometry data is "exploded" is up to the client-software. The 2.7 million stream segments would likely not be a single MultiLineString geometry. The geometry counts could be 2.7 million Linestrings and 2.7 million Polygons. One could collapse all this geometry data into single multi-geometries, but this would prove unwieldy. Some of the LineStrings could be discontinuous multilinestring geometries (only requiring one index on the geometry dimension but consisting of two physical LineStrings).

This seems to me to be a crucial difference. In the former case the simple geometry can be regarded as a more complex alternative to cells bounds - the cell has a complicated geometry of nodes and lines, but it's still a single cell. In the latter case you're providing many timeseries in an unstructured geometry, which is what ugrid describes. Which do you have in mind?

We intend for this proposal to fit in the Discrete Sampling Geometry timeSeries featureType. So this proposal does not contain any new mechanism to link a time-varying data variable with a network composed of polygons, points, and lines (a whole hydrologic system for example). UGRID provides some mechanisms for this similar to other CF conventions (data is associated with a grid center point and its bounds for example - or a "face" has a center point in UGRID).

It's possible your question is still not being addressed. "Nodes" are used in all geometries. We would never associate time-varying data with nodes or the edges between them. Data would always be associated with the geometry (a feature comprised of nodes).

You propose the index variable in order for the convention to be like ugrid. However this still seems to me to be an unnecessary complexity and use of space if you aren't going to have many shared nodes. I think the case for having another convention, distinct from ugrid, is stronger if it is unlike ugrid in this respect, and therefore simpler as well.

Sharing nodes should be possible within the spec in our opinion. There may be overhead for some dataset encodings, but if one is willing to sacrifice computational time when writing complex geometric datasets with shared-node-topology, considerable disk space and memory may be saved.

Reusing coordinate indexing indirection does not seem like a duplication of UGRID. In fact, it makes sense to align with UGRID as much as possible to facilitate data exchange.

I agree that repeating the inside/outside flag many times is wasteful. That, coupled with your clarification that you may have several geometries, each consisting of several elements (points, lines, polygons), means that you need, in effect, a ragged array of ragged arrays (geometry,element,node). This is more complicated than DSGs, but it seems to me it would be reasonably easy to understand if your multi-geometry example https://github.com/bekozi/netCDF-CF-simple-geometry/wiki/VLEN-Arrays-in-NetCDF-3#multipolygon-example was stored something like this:

geom=3;
part=11;
node=36;
int number_of_parts(geom);
  number_of_parts:parts="number_of_nodes";
int number_of_nodes(part);
  number_of_nodes:inout="inout";
char inout(part);
float x(node);
float y(node);
number_of_parts=6, 3, 2;
number_of_nodes=4, 3, 3, 3, 3, 3, 3, 5, 3, 3, 3;
inout="OIIIOOOOIOO";
x=0, 20, 20, 0, 1, 10, 19, 5, 7, 9, 11, 13, 15, 5, 9, 7, 11, 15, 13, -40,
-20, -45, -20, -10, -10, -30, -45, -30, -20, -20, 30, 45, 10, 25, 50, 30;
y = 0, 0, 20, 20, 1, 5, 1, 15, 19, 15, 15, 19, 15, 25, 25, 29, 25, 25, 29,
-40, -45, -30, -35, -30, -10, -5, -20, -20, -15, -25, 20, 40, 40, 5, 10, 15;

where I assume that all polygons are closed. What do you think?

This is not a bad approach and, with some modifications, it could be used. A few thoughts:

  • In regards to "ragged array of ragged arrays", this is solved in our proposal by using multiple coordinate index / geometry variables indicating a "geom/instance" dimension to identify associated data variables. These variables may use their own coordinate vectors or index into shared coordinate vectors. Some more discussion on this when responding to Chris's comments.
  • This approach is more complex, in our opinion, than the indexing approach. In the case of multiple geometries in a single file, it is not clear how these variables are all linked together. The coordinate index variable hosts the attributes necessary for self-description.
  • We had considered adopting something along these lines to avoid the use of "break values". We wanted to avoid variables where possible. The contiguous ragged approach requires only stop indexing (one additional variable). For NetCDF4 ragged arrays, no ragged indexing is required keeping the schema clean. Geometries lend themselves naturally to NetCDF4 ragged arrays. We agree break values are confusing. Break values present a reasonable solution without adding considerable instrumentation for multi-part geometries which will generally be an exception.
  • The nice thing about the coordinate index approach is that one approach basically works for all geometry types. In your example, number_of_parts and inout are only present for multi-geometries (multipolygons).
  • It is more difficult to extract a single geometry using this approach. Not that big of deal necessarily, but the contiguous ragged and NetCDF4 variable length make the process relatively simple. With geometries, we view single element / random access as an important schema characteristic.

For comparison, your CDL would look something like the following when transformed into our proposal. Note, the ragged index and nodes have been broken onto one line per geometry:

geom=3;
index=44;
node=36;
int start_of_geom(geom);
  start_of_geom:contiguous_ragged_dimension = "index"
int coordinate_index(index);
  coordinate_index:geom_type="multipolygon";
  coordinate_index:geom_coordinates="x y";
  coordinate_index:geom_dimension="geom";
  coordinate_index:stop_encoding="cra";
  coordinate_index:multipart_break_value=-1;
  coordinate_index:hole_break_value=-2;
  coordinate_index:outer_ring_order="anticlockwise";
  coordinate_index:closure_convention="last_node_equals_first";
float x(node);
float y(node);

start_of_geom=0, 24, 37 ;
coordinate_index=0,1,2,3,-2,4,5,6,-2,7,8,9,-2,10,11,12,-1,13,14,15,-1,16,17,18,
19,20,21,-1,22,23,24,25,26,-2,27,28,29,
30,31,32,-1,33,34,35 ;
x=0, 20, 20, 0, 1, 10, 19, 5, 7, 9, 11, 13, 15, 5, 9, 7, 11, 15, 13, 
-40, -20, -45, -20, -10, -10, -30, -45, -30, -20, -20, 
30, 45, 10, 25, 50, 30;
y = 0, 0, 20, 20, 1, 5, 1, 15, 19, 15, 15, 19, 15, 25, 25, 29, 25, 25, 29,
-40, -45, -30, -35, -30, -10, -5, -20, -20, -15, -25, 
20, 40, 40, 5, 10, 15;

On to Chris's comments.

I think it may be helpful to borrow terminology (and the data model) from the GIS world here. In this case, I am referencing the geoJSON spec, as I happen to be working with that at the moment, but the basic data model is pretty consistent.

Agreed. We want to, at the very least, maintain consistency with geometry types (not case sensitive). The "break values" are a poor man's parentheses corresponding to their uses in arrays used by GeoJSON and WKT.

Note that they have "geometries" which can be things like points, polygons, polyllines. IIUC (and I'm no osgeo mavin) geometries represent a "single" entity. Then there are "Features": a Feature is essentially data associated with a particular geometry. But note: there are "Collections" -- both Geometry and Feature Collections -- that is what you use to "bundle" various data together.

I think we may be well served by thinking in terms of mapping the GIS data model to CF/netcdf -- for instance it would be great to be able to write a netcdf<->geoJSON converter that was lossless, AND would be fairly "native" in both cases.

Agree in principle. In practice, this proves difficult of course. :-) A FeatureCollection may contain features with different geometry types. We would need to add an additional dimension variable describing the geometry type. A GeometryCollection itself may nest inside a FeatureCollection. We think of simple geometry variables as FeatureCollections with a static geometry type. NetCDF groups may help provide a crosswalk, but I am not sure we are ready to go there.

GeoJSON also tends to break down with time series - repeating geometries coordinates ad nauseum for each time coordinate. TopoJSON offers some solutions in this regard. TopoJSON also incorporates node sharing. Chris did link to this JSON spec already.

Also, how do NetCDF attributes, dimensions, and data types fit into GeoJSON? No good answer for this one.

What's important is that we establish a way to encode simple/basic geometry types. Collections can be created using indexing or hierarchies. The basic static geometry-typed FeatureCollection should be sufficient for most applications.

(though I'm still confused, maybe you can have an "array" of data associated with a GeometryCollection?)

as for MultiLineString -- you could associate an array of data with the Multilinestring -- so one value per segment. But I think that violates the intent of the data model -- you should have a GeometryCollection of linestrings instead. and then each segment has its own geometry and you can associate an array of data with that. (or it should be a FeatureCollection? I'm getting confused now!

Yup, makes your head hurt. No good answer for this. This leads back to the "ragged array of ragged arrays" comment by Jonathan earlier. It is likely beyond the scope of our proposal. The solution may lie in turning geometry variable attributes and names into arrays themselves adding another layer of indexing in the process. We hope that the simple geometry encoding methods could be reused in an effort like this.

Of course, CF doesn't need to follow this data model, but it's a good idea to be informed by it.

Yes, absolutely. Your descriptions of geometry types and collections is correct. GeometryCollections can also be used for the same types of geometries - a minor point.

In the GIS data model, nodes are not shared between geometries, and you are quite right that keeping nodes separate with geometries indexing into it is an added complication and would not be space-efficient.

However, there is another reason to do it -- it makes it definitive that two (or more) geometries share the exact same node, rather than them being distinct points that happened to be at the same location (Or worse, with FP error and all, two points that are very close)

Yes, this is why we are for the coordinate indexing approach used in UGRID.

This is actually a major limitation in the standard GIS model.

Also yes!