Letter to CF Response 20160926

bekozi edited this page Sep 27, 2016 · 6 revisions

Jonathan and CF-Metadata List,

Thanks for the suggestions and discussion. We’ve attempted to respond to the major questions and concerns using Jonathan's mail as a template. Apologies in advance if we missed anything outstanding or did not appropriately acknowledge contributions in this thread.

You explain that the need is to specify spatial coordinates with a simple geometry for a timeSeries variable. For example, this could be for the discharge as a function of time across some line in a river (your example), or I suppose it could be an average temperature as a function of time for the Atlantic Ocean, where you wanted to supply the polygon which drew the outline of the basin. Have I got the idea?

Yes, you have this mostly right. It’s common to have a collection of points (weather stations), lines (stream reaches), or polygons (hydrologic catchments) with an associated time series.

Timeseries like this can be stored in CF, but their geographical extent is usually described only in words e.g. a region name of atlantic_ocean, and this is fine for applications like CMIP where you want to compare data from different data sources in which the Atlantic Ocean may have different exact shapes (different AOGCMs, in particular). An array of region names is also possible, so I don't think we need a new convention to contain your dwarf planet example.

The dwarf planet example is intended to illustrate our generalized approach to contiguous ragged arrays, which may be used for arbitrarily sized data arrays. For some (including me), using a string rather than a numeric example helps illustrate the concept. It is an idiosyncratic example in many ways. Sorry for the confusion.

Sect 9.1 on discrete sampling geometries says it cannot yet be used for cases "where geo-positioning cannot be described as a discrete point location. Problematic examples include time series that refer to a geographical region (e.g. the northern hemisphere) ...". Actually I think that's not quite right. The existing convention can describe regions which are contiguous, and rectangular or polygonal, using its usual bounds convention (Sect 7.1). I think we should consider changing this text, because it seems unnecessarily restrictive.

Your explanation makes sense, and this should be captured in the DSG convention text.

If the regions were irregular polygons in latitude and longitude, nv would be the number of vertices and the lat and lon bounds would trace the outline of the polygon e.g. nv=3, lat=0,90,0 and lon=0,0,90 describes the eighth of the sphere which is bounded by the meridians at 0E and 90E and the Equator. I think, therefore, we do not need an additional convention for points or polygonal regions.

Many earth science datasets representable as polygons and lines (excluding triangular, hexagonal, and similar meshes) have node counts that differ from one geometry to the next. A fixed "nv" dimension could not efficiently capture watershed A with 5 nodes and watershed B with 100, because every geometry would have to be padded to the largest node count. Additionally, the cell bounds concept does not include the structure and semantics needed to support MultiLines, MultiPolygons, or polygons with holes/interiors.
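To make the storage argument concrete, here is a minimal Python sketch comparing a bounds array padded to a fixed "nv" with a flat ragged layout. The node counts come from the watershed A/B example above; the function names are hypothetical and only illustrate the arithmetic.

```python
# Node counts per geometry, taken from the watershed A/B example in the text.
node_counts = {"watershed_A": 5, "watershed_B": 100}


def padded_slots(counts):
    """Slots needed when every geometry is padded to the largest node count (fixed nv)."""
    return len(counts) * max(counts)


def ragged_slots(counts):
    """Slots needed when all nodes are stored back to back in one flat array."""
    return sum(counts)


counts = list(node_counts.values())
padded = padded_slots(counts)
ragged = ragged_slots(counts)
# Fraction of the padded array that would hold only fill values.
wasted_fraction = (padded - ragged) / padded
print(padded, ragged, wasted_fraction)
```

With only two geometries, nearly half of the padded array is fill. The gap grows whenever a few geometries have far more nodes than the rest.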

However, we would need new conventions for a timeseries where each value applies to a set of discontiguous regions or regions with holes in them, a set of points, a line or a set of lines. I guess that these are included in the geometry types you list (LineString, Multipoint, MultiLineString, and MultiPolygon).

Yes.

Do you have definite use-cases for all of these? (I ask this because we don't add new functionality to CF until there is a definite and common need for it in practice.)

David Arctur described the primary motivation for developing the simple geometries approach: "Among other applications, NetCDF-CF is now being used as an intermediate & output data format in the US National Weather Service’s National Water Model (NWM). This forecasts streamflow rates in about 2.7 million stream segments averaging 2km, throughout the continental US, at multiple time horizons (3 hr, 18 hr, 10 days) every hour, and an ensemble for 30-day forecast less frequently." These data also contain multi-geometries, primarily in the form of MultiLineStrings and MultiPolygons.

To this we would add that working with GIS datasets of this magnitude is difficult with current NetCDF metadata conventions, often yielding an unwieldy hybrid of NetCDF data and other software such as ESRI ArcGIS and PostGIS. ESRI ArcGIS and PostGIS are not usable on many HPC platforms where models like the NWM reside.

I suspect that geometries of this kind can be described by the ugrid convention http://ugrid-conventions.github.io/ugrid-conventions, which is compliant with CF. Their purpose is to describe a set of connected points, edges or faces at which values are given, whereas in your case you'd give a single value for the whole set, but the description of the geometry itself might be similar. Have you had a look at whether ugrid could meet your needs? If it almost does so, perhaps a better thing to do would be to propose additions to ugrid. We would like to avoid having more than one way to describe such geometries.

Bert Jagers and Chris Barker have already commented on this. It is important to note that UGRID is the primary inspiration behind this proposed approach. That should have been mentioned in the original mail. The genesis of this work was with full knowledge of UGRID.

This proposed CF addition is meant to align more closely with the community standards behind the GIS feature types used by the OGC community. To accommodate the feature types described by this proposal, UGRID would need to incorporate:

  1. Ragged arrays for coordinate index vectors.
  2. Encoding method for multi-geometries.
  3. Support for point geometries.

The simple features proposal does not expect node sharing amongst adjacent/contiguous elements and, in all fairness, this is not a requirement of UGRID but rather a recommendation. The simple features approach does inherit from UGRID, as Bert indicated, in that it is possible to implement node sharing via coordinate index indirection.

We agree with David Arctur that the simple features approach is easier to implement than UGRID. No offense intended to UGRID, which is a powerful convention indeed.

It really is up to the community whether it would rather see simple features represented in an amended CF-compatible UGRID or in an addition to CF. We are of the opinion that a simple features specification would be very useful.

So far CF does not say anything about the use of netCDF-4 features (i.e. not the classic model). We have often discussed allowing them but the general argument is also made that there has to be a compelling case for providing a new way to do something which can already be done. (Steve Hankin often made this argument, but since he's mostly retired I'll make it now in his name :-) If there are two ways to do something, software has to support both of them. We already have ways to encode ragged arrays, so is there a compelling case for needing the netCDF-4 vlen array as well? We already have a way to encode strings too, as character arrays. I think this is probably a discussion we should have again in a different thread, so I'll just talk about your classic encoding. The same points apply to both encodings.

Yes, let’s leave that conversation for another time. We mostly want to be forward compatible, understanding that vlen provides a simpler and, some would say, more elegant way of handling ragged array data.

Your approach uses a coordinate_index variable to identify indices of geometry coordinates where the -1 and -2 indices indicates where exterior and interior polygons begin, and the first polygon has an implied -1 at the start. Is that right? Given this example, I wonder why you need the index array, because none of the coordinates indices (values >=0) is repeated, so no space is saved in the x and y arrays. I guess this would be the usual case. If polygons did touch or lines crossed, a few points would be in common, but not so many that seems to need the complication of the index array. A simpler way to do it would be ... which needs only one dimension, or you could use the CF ragged array convention (Sect 9.3.3)...

Our example may not be complete enough to fully demonstrate the use case we are trying to describe. The example given, inspired by the DSG Contiguous Ragged Array encoding, uses a 'stop' variable rather than a ‘count’ variable. It may not be apparent that each ‘simple feature’ may actually consist of multiple polygons (with or without hole polygons) or lines. Regarding the ‘outside_inside’ example you provided, we should show an example where the geometry count (dimension) is more than 1 and a geometry has multiple polygons prior to the 'stop' coordinate. The word encoding example was meant to convey this, but may not have been sufficient. Here is an example with three geometries: https://github.com/bekozi/netCDF-CF-simple-geometry/wiki/VLEN-Arrays-in-NetCDF-3#multipolygon-example.
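As a rough illustration of the case described above (more than one geometry, with several polygons before a 'stop'), here is a minimal Python sketch of how a reader might decode such an encoding. It rests on assumptions that are ours for this sketch and not fixed by the proposal: `stop` holds an exclusive end offset into the index array for each geometry, the break values are -1 (new polygon) and -2 (new hole), and the function name and sample coordinates are hypothetical.

```python
MULTIPART_BREAK = -1  # assumed value of multipart_break_value
HOLE_BREAK = -2       # assumed value of hole_break_value


def decode_geometries(coordinate_index, stop, x, y):
    """Split a flat index array into geometries -> polygons -> rings of (x, y)."""
    geometries = []
    start = 0
    for end in stop:  # assumption: exclusive end offset of each geometry
        ring = []
        # The first polygon of every geometry has an implied multipart break.
        polygons = [{"exterior": ring, "holes": []}]
        for idx in coordinate_index[start:end]:
            if idx == MULTIPART_BREAK:
                ring = []
                polygons.append({"exterior": ring, "holes": []})
            elif idx == HOLE_BREAK:
                ring = []
                polygons[-1]["holes"].append(ring)
            else:
                ring.append((x[idx], y[idx]))
        geometries.append(polygons)
        start = end
    return geometries


# Hypothetical data: geometry 0 is a square with one triangular hole,
# geometry 1 is a multipolygon made of two triangles.
x = [0, 4, 4, 0, 1, 2, 1, 10, 11, 10, 20, 21, 20]
y = [0, 0, 4, 4, 1, 1, 2, 0, 0, 1, 0, 0, 1]
coordinate_index = [0, 1, 2, 3, -2, 4, 5, 6, 7, 8, 9, -1, 10, 11, 12]
stop = [8, 15]

decoded = decode_geometries(coordinate_index, stop, x, y)
for number, polygons in enumerate(decoded):
    print(number, len(polygons), [len(p["holes"]) for p in polygons])
```

Only two break values appear across all fifteen index entries, so the breaks add little to the array size.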

We are hesitant to add an additional integer variable to indicate 'inside_outside', as it would introduce (in our minds) extraneous variables full of duplicated values. Why repeat -1 5,000 times when introducing a -1 at multi-geometry breaks accomplishes the same task? One could also argue for additional variables containing multi-geometry breaks, but again, this is extraneous. As an example, using break values and ragged arrays similar to what we describe, the 2.7 million catchment dataset mentioned by David Arctur (which contains MultiPolygons) results in a ~10 GB uncompressed netCDF-4 file. Adding 'inside_outside' variables to describe the breaks and/or holes would make this file larger. We could reduce the file size by removing repeated nodes via the coordinate indexing method.
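The node-removal idea can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the proposal's reference implementation: the function name is hypothetical, and nodes are treated as shared only when their coordinates compare exactly equal.

```python
def index_shared_nodes(rings):
    """Return a unique node table plus, for each ring, indices into that table."""
    nodes = []    # unique (x, y) pairs in first-seen order
    lookup = {}   # (x, y) -> position in `nodes`
    index_rings = []
    for ring in rings:
        indices = []
        for point in ring:
            if point not in lookup:
                lookup[point] = len(nodes)
                nodes.append(point)
            indices.append(lookup[point])
        index_rings.append(indices)
    return nodes, index_rings


# Two hypothetical unit squares that share the edge x = 1.
left = [(0, 0), (1, 0), (1, 1), (0, 1)]
right = [(1, 0), (2, 0), (2, 1), (1, 1)]
nodes, index_rings = index_shared_nodes([left, right])
print(len(left) + len(right), len(nodes), index_rings)
```

Here eight stored coordinate pairs shrink to six. The saving in a real dataset depends on how many boundaries are shared and on whether shared nodes are stored with identical coordinate values.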

You provide the attributes multipart_break_value and hole_break_value to specify the values (-1 and -2 above) for the outside vs inside distinction. Do you need the generality of being able to choose these values? It would seem simpler to use a character array and specify in the convention which letters should be used e.g. ... That makes it more readable, perhaps.

Those values could be fixed. We would recommend they always be attached to the variable as attributes, however. We also tend to think of them as fill values, which are customizable in CF. As for the character array, it again seems like a lot of repetition.

Similarly, you propose attributes for clockwise/anticlockwise node order and for the polygon closure convention. Do these need to be freely choosable? You could specify clockwise, like the existing CF bounds convention, and that the polygons are closed. In the latter case, you could omit the last vertex of each polygon since it must be the same as the first, and that would save a bit of space. If you specify these choices, the attributes aren't needed.

These attributes would be considered optional. If node ordering is controlled, it may be declared on the polygon variables. Many GIS software packages don't care about ordering and repeated nodes, but modelling and regridding codes tend to be more picky.
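For codes that do care, node order and closure are cheap to check on read. The Python sketch below uses the shoelace formula on planar x/y coordinates. That is an assumption of this sketch: on a sphere the planar signed area is only an approximation. The function names are hypothetical.

```python
def signed_area(ring):
    """Shoelace formula: positive for anticlockwise rings, negative for clockwise."""
    total = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]  # wraps around, so open rings work too
        total += x0 * y1 - x1 * y0
    return total / 2.0


def is_closed(ring):
    """True when the last vertex repeats the first."""
    return len(ring) > 1 and ring[0] == ring[-1]


square_ccw = [(0, 0), (1, 0), (1, 1), (0, 1)]   # open, anticlockwise
square_cw = list(reversed(square_ccw))           # open, clockwise
square_closed = square_ccw + [square_ccw[0]]     # first vertex repeated

print(signed_area(square_ccw), signed_area(square_cw), is_closed(square_closed))
```

The repeated closing vertex contributes nothing to the sum, so the same check works whether or not polygons are stored closed.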

If this convention is going to be used for discrete sampling geometries, an additional dimension is needed, because in a single data variable you might have data for several of these geometries. That is, you need an array of ragged arrays. Again, I wonder whether this suggests trying to use ugrid. It might be you could name each one as a mesh, and specify the geometry of for the set of timeSeries as an array of mesh names. That would be a very easy change to the existing Sect 9.

The multiple geometry example may help: https://github.com/bekozi/netCDF-CF-simple-geometry/wiki/VLEN-Arrays-in-NetCDF-3#multipolygon-example.