Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ET-W2AT asking for new message format. #78

Closed
petersilva opened this issue Apr 5, 2022 · 19 comments
Closed

ET-W2AT asking for new message format. #78

petersilva opened this issue Apr 5, 2022 · 19 comments

Comments

@petersilva
Copy link
Contributor

petersilva commented Apr 5, 2022

Well, as per committee consensus, I asked for geographic and temporal extent advice from ET-AT, and they replied by asking us to use GeoJSON format outright with a heavy re-work of elements of the payload that changes semantics significantly.

https://wmo-teams.atlassian.net/wiki/spaces/WIS2/pages/322568193/2022-04-04+ET-W2AT+Meeting

The ET-AT suggestion reject and/or discards all work on message format by TT-protocols since it's inception, and makes conceptual and semantic changes whose effect are unknown, necessitating a return to design phase, and re-validation of all existing work. There are many issues raised:

All of which can be done, but requires investigation and validation, and will take some time to work through.
A minimum impact alternative #81 which just adds fields to the current payload is far more straightforward, and could be done immediately.

We should put a link in to Jeremy's slides, I'm just going from memory from what was presented for now:

changes:

good:

  • well it's already JSON, and there are geoJSON parsers out there
  • adds geographic and temporal extent. It does what we were looking for.

observations:

  • relPath is bad, but relPath - filename == topic... topic is good... and if the id is implemented as per the spec, then it becomes a relative identifer... so... identical the the combination of topic+id ... likely could accommodate what is essentially a name change... but given that it isn't the entire topic, it should likely be relative topic, RelTopic, because otherwise it will be extremely confusing for people to know the difference between the real topic, and the one in the message.

  • having a filename is bad, but having an id is good... it's the same thing. how is one bad and the other good? if the hierarchy is bad, then why is topic introduced?

  • Looking at RFC 7946, GEOJSON allows for "Foreign Members" (https://www.rfc-editor.org/rfc/rfc7946.html#page-15) , which can be placed at the top level. So the existing TT-Protocols format, once, "type" and "geometry" are added, is already GeoJSON. There is no need to change anything else for GEOJSON compliance. "id" and "version" in the proposal are examples of foreign members, and "foreign members" such as the "relPath", "integrity" etc... are entirely permitted by the standard.

  Python 3.10.4 (main, Apr  2 2022, 09:04:19) [GCC 11.2.0] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import geojson
  >>> my_point = geojson.Point((43.24, -1.532))
  >>> my_point['relPath'] = 'the/path/of/interest/tome.txt'
  >>> dump = geojson.dumps( my_point )
  >>> dump 
'{"type": "Point", "coordinates": [43.24, -1.532], "relPath": "the/path/of/interest/tome.txt"}'
  >>> 

not clear why the rest of the reformatting is necessary if software that does not understand the messages, will still not understand it regardless of it's placement in "properties", and software that parses geojson will still understand the existing payload in so far as it will already be geoJSON, to the same degree that STAC is.
I suppose someone will argue that such GEOJSON is "unusual" ... I wonder if, in this case, being unusual is an advantage, it that it is unlikely to be mistaken for ordinary GEOJSON.

  • One would have to explain that the format is GEOJSON, but one cannot supply any GEOJSON, it must be "profiled", restricion on the types of geometries permitted (5 tuples?) no feature collections... etc...
  • "links" in the proposal is not part of GEOJSON either. it is STAC, which was explicitly rejected by TT-Protocols.
  • we give up a hierarchical unique id (relPath) replacing it with one with no hierarchy (id) ? What benefit?

Worries/Dangers:

  • making it look like ordinary GEOJSON increases the risk of future uses not applying any restriction profiles and "abusing" (for example, by pushing data in geojson as messages.)
  • breaking file replication is a deal breaker for me (@petersilva ) it is the basis of all past work on the topic and robs us the most trivial implementation, and access to many tools for validating replication, and changes this application from a horizontal one to a vertical for a specific business.
    • loss of baseURL breaks replication.
    • loss of filename (unclear if id can stand-in) breaks replication.
      The robust use case of testing is all with the file replication case, all validation benefits of using a format that is heavily used (for announcements) is lost.
  • This is a new format with different semantics, different... but unclear about motives expect vague promises of "compatibility" when the existing encoding is already compatible, while explicitly breaking existing implementations for no discernable reason.
  • Use of full "geometry" is worrisome because it is an unbounded size, one can have an infinite number of tuples in a polygon specification. One can always double them by halving their length.
  • what is the size limit of a GEOJSON file? It is a full file format, there is nothing stopping anyone from embedding arbitrary objects in such files, as it is a normal usage for the format.
  • The GeoJSON isn't actually just GEOJSON, it's actually STAC, thinned out a bit. from memory, there were fields in there, such as "links", which are not in rfc7946. TT-Protocols discussed stack at the last meeting, and explicitly rejected it ( Use STAC for payload: no #75 )
  • one can have "Feature Collections" that contain an arbitrary number of "Features"... making the message much more complex again.
  • use of an existing format promising not to use all it's features, seems fragile.
@petersilva
Copy link
Contributor Author

#73 opened for the addition of version to the payload.

@petersilva
Copy link
Contributor Author

#80 opened for adding "topic" to payload.

@petersilva
Copy link
Contributor Author

#81 opened for minimal transformation to achieve ET-AT's ask.

@antje-s
Copy link
Contributor

antje-s commented Apr 8, 2022

An extension with additional geo information and type should not be a problem from my point of view. The message size does not increase significantly (in default case) and if we comply with a general standard we have a clear advantage. However, we should possibly set a maximum message size so that we do not have problems in the long run and a basis to reject message with large sizes.
An extension to include geo information may not be widely used at the moment, but offers potential for further development. A "wrapping" of our content values with "properties" to be more familiar to the geoJSON format is also not much change

{
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [102.0, 0.5]
},
"properties": {
"prop0": "value0"
}
}

Also here I think important is a limit of the total size, because after geoJSON there seems no limit is set for further properties.

A download link or a service url is core content, but I think that probably in the WMO world also the repository systems and their repository structures will differ and are not necessarily oriented to the topic structure, therefore a total value in one field or other names of fields that has to be concatenated seems to be ok. The original idea that only the baseUrl has to be adjusted to own values seems to be not practicable in the diverse environment of WMO Nodes. Since a split into baseUrl and relPath is currently used in the pilot, the question would be what reasons speak against keeping the two fields to be merged for link?

However, an id or a unique filename value is very helpful for possible error searches and quality checks. We should be able to clearly identify content among each other. If someone asks if content x has arrived, it must be possible to quickly check if it is there. As the storage structures could differ, there must be some kind of id that makes this possible.

@petersilva
Copy link
Contributor Author

The unique filename (as represented by the relPath.. potentially renamed) is either:

  • the part of the URL that is constant among all servers (for those servers that implement WAF) or
  • the product id, giving a name to a specific record under a topic hierarchy.

The two are equivalent.

  • ... yes the storage can differ, but a hierarchical unique filename or (or product id, the name can change.) is already agreed to be a good thing, and in the conceptually simplest implementation, the baseUrl and relative path can be concatenated together as one possible implementation.

If only a complete Url is provided, the baseUrl cannot be extracted for it, so any use case that needs baseURL is broken.

@petersilva
Copy link
Contributor Author

#84 for more comprehensive means of addressing comments about Integrity.

@petersilva
Copy link
Contributor Author

petersilva commented Apr 10, 2022

In summary: The ET-AT suggestion reject and/or discards all work on message format by TT-protocols since it's inception, and makes conceptual and semantic changes whose effect are unknown, necessitating a return to design phase, and re-validation of all existing work. There are many issues raised:

All of which can be done, but requires investigation and validation, and will take some time to work through.
A minimum impact alternative #81 which just adds fields to the current payload is far more straightforward, and could be done immediately.

@petersilva petersilva changed the title ET-AT asking for GeoJSON as a message format. ET-AT asking for new message format. Apr 11, 2022
@josusky
Copy link
Contributor

josusky commented Apr 13, 2022

Yay, I almost thought that we will get bored.

@petersilva
Copy link
Contributor Author

@antje-s agree just moving the fields around is not a problem... it's actually minor and already implemented (the feed from hpfx has v03 (tt-protocols original flavour messages) and v04 (geojsonized ones) available now.. just pick v04 as the topicPrefix and geoJSON goodness at your fingertips. geoJSON isn't the issue, it's the wholesale replacement of fields by other fields with different semantics, the banning of file names (which basically forces WAF to use the message id as the file name) that they don't want paths on WAF to be representative... So I guess we should be generating random trees with id's in them, as the base case.

petersilva added a commit to MetPX/sarracenia that referenced this issue Apr 13, 2022
@josusky
Copy link
Contributor

josusky commented Apr 13, 2022

I do not have access to https://wmo-teams.atlassian.net/wiki/spaces/WIS2/pages/322568193/2022-04-04+ET-W2AT+Meeting
@petersilva This is a groundbreaking change and I do not understand the reasons. And I already see significant drawbacks. Who proposed this and why? Could we have a teleconference about it?

@petersilva
Copy link
Contributor Author

There was a meeting in Geneva (I wasn't there) and the group decided this was a good idea. the notes are from there... I think the content of this issue accurately reflects their request. In subsequent ET-AT meetings I tried to explain the implications, but people seem fixated on "WAF is ugly", so so everything else I say seems to be noise to them. They claim breakage.. it was N vs. 1... so ... whatever...

I'm implementing it for folks to see, and they can judge based on the result of their recommendations. I don't support these changes at all... (note: geojson is fine... it's the rest) but I'd rather implement than argue.... It's a test feed... no harm done.

People will gradually re-discover that we originally had an optimal, minimal solution... or we'll end up with some baroque rube Goldberg thing... where they keep adding things because it doesn't work because they don't understand the semantics that were implicitly taken care of by the original... whatever... out of my control. I think people think I'm difficult now, so would prefer if others spoke.

As you can see. I'm piling up the issues, and we will certainly start talking about it at the next teleconference. with this new development the committee has a lot more work to do.

@josusky
Copy link
Contributor

josusky commented Apr 13, 2022

@petersilva The notes that I cannot access (but I have sent a request so maybe I will get access) ;-)
Now, in parallel, I was chatting with Tom Kralidis. He pointed out that https://geojson.org/schema/Feature.json does allows "null" geometry. So

{
  "pubTime" : "20190120T045018.314854383",
  "baseUrl" : "https://localhost/data/20190120",
  "integrity": {"method": "sha512", "value": "A2KNxvks...S8qfSCw=="},
  "relPath" : "WIS/CA/CMC/UpperAir/04/UANT01_CWAO_200445___15103.txt",
  "size": 457,
  "content" : { "encoding": "utf-8", "value": "encoded bytes from the file" },
  "retPath" : "4Pubsub/92c557ef-d28e-4713-91af-2e2e7be6f8ab",
  "type"    : "Feature",
  "geometry": null,
  "properties": null
}

is a valid GeoJSON. This said, almost any JSON can be trivially converted into a valid GeoJSON, but that does not make things interoperable. I am going to do some analogies - It is almost like telling: "your software produces XML my software consumes XML we are interoperable". For some data, a CSV is much better than XML because CSV is much more limited and it hints that the data is tabular. Similarly, using our primitive "WMO mesh message-schema" makes things clearer than the use of a much more generic format.
We can refer to https://geojson.org/schema/Geometry.json as I did in the issue081 (I will probably do one more alternative there) but we should not market it as GeoJSON because that will just confuse everybody.

@petersilva
Copy link
Contributor Author

petersilva commented Apr 13, 2022

To be clear, I think using geojson as the format for mqp messages is fundamentally wrong, as geojson is a data format, and most people will reasonably interpret such information as meaning that one should post geojson data as mqp messages. One will then always have to have a second conversation about sizes, and feature limitations or "profiling" ... and at that point why bother calling it geojson at all?

@antje-s
Copy link
Contributor

antje-s commented Apr 14, 2022

@antje-s agree just moving the fields around is not a problem... it's actually minor and already implemented (the feed from hpfx has v03 (tt-protocols original flavour messages) and v04 (geojsonized ones) available now.. just pick v04 as the topicPrefix and geoJSON goodness at your fingertips. geoJSON isn't the issue, it's the wholesale replacement of fields by other fields with different semantics, the banning of file names (which basically forces WAF to use the message id as the file name) that they don't want paths on WAF to be representative... So I guess we should be generating random trees with id's in them, as the base case.

@petersilva switched the client to v04 as a test and it works

@petersilva
Copy link
Contributor Author

I was mistaken...btw... I believe the ET-AT proposal is against id being the file name as well, so some other random name must be generated. The Canadian feed implements random file names as well (different names from message-id's)

@josusky josusky changed the title ET-AT asking for new message format. ET-W2AT asking for new message format. Apr 19, 2022
@josusky
Copy link
Contributor

josusky commented Apr 19, 2022

I have commented on some of the referenced issues where I had the impression that my view might be helpful.
Here I would like to comment on the high-level topics:

  1. Do we want to make the data notifications in WIS2 to be valid GeoJSON documents?
    I do not think so.
    The format is very generic and has only 3 required properties and at least two of them can be "null" thus two GeoJSON documents can differ in all but one property. At the same time, in certain domains the GeoJSON is used for actual data and consequently specialists from those domains may get confused and assume that notification messages in WIS2 will contain specific things - either values or formats of the values, that they are used to from their specific GeoJSON sub-specifications (e.g., compare "time" and "display_time" in http://agora.ex.nii.ac.jp/digital-typhoon/geojson/wnp/201601.en.json with our format).

  2. Does it make sense to use "GeoJSON Geometry" in the data notifications?
    Yes, but.

  • If there is a reason (possibility) to assign a geometry to a notification, then it definitely should be something well known, that is https://geojson.org/schema/Geometry.json.
  • On the other hand, we shall make it clear that it is optional. There might be some rare data that does not have a clear association with a geographical location. But more importantly, in the transition period, we will republish a lot of data from legacy systems that do not provide geographical information explicitly - and its deduction from actual data could be time-consuming, unreliable, or both.
  • It seems to me that for some applications a bounding box (https://datatracker.ietf.org/doc/html/rfc7946#page-12) might be more natural - but I admit that it can be trivially transformed into a polygon so that is not so relevant.
  • Finally, wherever possible, the filtering should be done server-side, that is, if the data source/service provides data for several areas/locations, it is preferable that it publishes notifications about them under separate (sub-)topics, so that data consumers can subscribe to just the relevant (sub-)topic, as opposed to having just one topic for all and then filtering the notifications after reception at the client.

@josusky
Copy link
Contributor

josusky commented Apr 27, 2022

I have added some examples to https://github.com/wmo-im/GTStoWIS2/tree/issue078/message_format

  • message-schema.json and message-schema.yaml is another iteration of GeoJSON based schema.
  • message-schema-full.json is the same thing but bundled with the GeoJSON stuff, so you can use it standalone.
  • message-example.json is example of a message/notification that corresponds to situation when relPath != retPath.
  • message-example2.json is example of a message/notification that corresponds to situation when relPath == retPath, thus it is closed to the typical WAF usecase.

I had troubles with expressing the nullability of the geometry in a way that would be accepted by https://www.jsonschemavalidator.net. I will check this later on. Moreover, I had issues with "date-time". That is a known problem that this type in JSON means RFC 3339 format that is a somewhat lame subset of ISO 8601 (not only it is longer, but it does not allow higher precision than milliseconds). In my opinion ISO standards are more, er, standard then RFC, but for now I have used the RFC format.

@petersilva
Copy link
Contributor Author

petersilva commented Apr 27, 2022

in example 2, you use a templated url value, hierarchy, from properties:

  • I was under the impression that one is supposed to use "variables" as part of each link construct, and that variables in the properties part of the data structure are sort of out of scope. but I could easily be mistaken. Is there a standard way to tell a templated uri to use an "external" dictionary to look for things?
  • in templating, '/' character gets url-encoded when using the python-uritemplate. I don't know if that is an implementation foible of the library, or part of the standard, but from what I can tell having a bare / in a variable value will not be substituted correctly using standard uri templating libraries.

@petersilva
Copy link
Contributor Author

petersilva commented May 10, 2022

subsequent discussions with ET-AT, reflected here:
https://wmoomm.sharepoint.com/:b:/s/wmocpdb/Ech78Pb1k_9Jvde0uB-Ir0MB4giLGIaMBjelE0sETNhYRw?e=0ehXK5
brief summary of updates:

  • data_id is now a combination of a hierarchical category and and instance id (very similar to relPath) derived from data somehow.
  • the topic hierarchy is revised to begin with: <channel>/v04/wis2/
    • message version or encoding is back in the tropic hierarchy
    • the "wis2" provides a hierarchy versioning/typing mechanism (and implicitly permits foreign data by placing them outside the wis2 hierarchy.)
    • a proposal for determining sub-topic hierarchy below that level is proposed.

As this changes the proposal substantially. likely further discussion with multiple different proposals at different points in the issue will be hard to follow. We will close this issue. in favour of #90 to discuss only the revised proposal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants