Assignment of topics for multidisciplinary datasets #38

Closed
amilan17 opened this issue Mar 14, 2023 · 32 comments · Fixed by #39
@amilan17
Member

amilan17 commented Mar 14, 2023

Posting a question from @masato-f29 from the NWPMetadata team.

"Sea surface temperature can be included in weather, climate, and oceans. Should sea surface temperature data be tagged with three controlled vocabularies (CV): weather, climate, and ocean? Or should we make it exclusive and propose the additional CV at level 8?"

DECISION

The topic_hierarchy will only be used to identify a channel for pub/sub. When a dataset is applicable under multiple domains, one should choose one domain (that is a best fit) and use the WCMP2 metadata record for further descriptions.

@tomkralidis
Collaborator

TT-WISMD 2023-04-12:

  • add a multidisciplinary token/value?
    • can be put forth in WCMP2 as themes/concepts
  • should have a single topic for subscription
  • needs more discussion

@sebvi

sebvi commented Apr 17, 2023

Can a blob of data be registered under several hierarchies, i.e. multiple times?

@yhe-wmo

yhe-wmo commented Apr 19, 2023

The TT-NWPMD meeting (17.04.2023) suggested that it would be more helpful to allow datasets to be made available, subscribed to, and notified under several (multiple) hierarchy topics.

  • add a multidisciplinary token/value?

TT-NWPMD would also like further clarification on how the idea of a multidisciplinary token/value would work.
I had a quick chat with @amilan17; if I understood correctly, the idea was to add a new controlled vocabulary term, "multidisciplinary", to Level 8 (currently the 7 Earth system domains).

@tomkralidis
Collaborator

Having a multidisciplinary definition may lead to providers dumping any data in question to this topic?

The lesser evil here could be publishing multiple messages with the same properties.data_id.

So if a data granule is published that applies to 3 topics, then:

  • 3 messages sent
  • each message has the identical properties.data_id

In this manner, we are able to ensure deduplication. This would, however, require WCMP2 properties.wmo:topicHierarchy to support multiple topics (which would need an update, which is fine).

@kaiwirt
Contributor

kaiwirt commented May 3, 2023

The current design is that the data_id uniquely identifies the data granule. If a cache receives three messages in three topics with the same data_id, it downloads the data once, republishes the corresponding message, and drops the other two messages as duplicates.
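
The deduplication behaviour described above can be sketched in a few lines of Python. This is a toy model, not actual Global Cache code: the class name, the topic strings, and the sample `data_id` are all hypothetical, and the download is reduced to a list append.

```python
class DedupCacheSketch:
    """Toy model of a cache that deduplicates on properties.data_id."""

    def __init__(self):
        self.seen_data_ids = set()
        self.downloaded = []    # data_ids fetched (stand-in for real downloads)
        self.republished = []   # (topic, data_id) pairs passed on to subscribers

    def on_message(self, topic, message):
        """Return True if the message was processed, False if it was dropped."""
        data_id = message["properties"]["data_id"]
        if data_id in self.seen_data_ids:
            return False  # same data_id already handled: drop as duplicate
        self.seen_data_ids.add(data_id)
        self.downloaded.append(data_id)
        self.republished.append((topic, data_id))
        return True


# Hypothetical example: one granule announced on three made-up topics.
cache = DedupCacheSketch()
message = {"properties": {"data_id": "example-centre/sst/2023-05-03T00"}}
results = [
    cache.on_message(topic, message)
    for topic in ("example/weather", "example/climate", "example/ocean")
]
```

In this model only the first topic ever carries the republished notification, so subscribers to the other two topics would never hear about the data.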

@golfvert
Collaborator

golfvert commented May 3, 2023

If we go for this option (multiple messages in the currently existing domains, with the same data_id), then the Global Cache should republish the message in each topic hierarchy but download only once. Is that doable?

How can we get this behaviour while keeping the current anti-duplication of downloads?

We also have to remember the purpose of the topic hierarchy. It is not a way to describe the data (that is the job of the metadata) but to allow filtering by subscribers. It is not meant either to show that this data is useful for X and not Y. Again, this is the purpose of the metadata record.
It (sort of) reminds me of the other discussion (wmo-im/wis2-topic-hierarchy#30); it looks to me like we are overloading the meaning and purpose of the topic hierarchy.

I therefore wonder if both "requirements" are consistent with the currently agreed purpose of the topic hierarchy.
We don't want the topic hierarchy to become the new TTAAii of WIS2 ;)

@kaiwirt
Contributor

kaiwirt commented May 5, 2023

We can implement this change if it is agreed on. However, I am not sure this is a good solution. Having different messages in different topics for the same data only increases the number of messages, with no additional benefit in my opinion.

If data "fits" into several topics, then I would prefer a decision on which is the correct topic for that data instead of just sending out multiple messages.

@golfvert
Collaborator

golfvert commented May 5, 2023

I suggest waiting before implementing anything! This is still under discussion. As explained in my comment above, I have the feeling (I might be wrong though) that we are taking the topic hierarchy discussion down the wrong path. It may eventually look like a second-class metadata record. We should rely on the official metadata for this kind of information.

@amilan17
Member Author

amilan17 commented May 5, 2023

We also have to remember the purpose of the topic hierarchy. It is not a way to describe the data (that is the job of the metadata) but to allow filtering by subscribers. It is not meant either to show that this data is useful for X and not Y. Again, this is the purpose of the metadata record.
It (sort of) reminds me of the other discussion (#30); it looks to me like we are overloading the meaning and purpose of the topic hierarchy.

@golfvert the readme of this repository states that:
"The WIS2 topic hierarchy provides a central classification and categorization scheme used by data providers and WIS2 Global Services in support of core WIS2 workflows: publish, discover, subscribe and download."

@golfvert
Collaborator

golfvert commented May 5, 2023

I don't think that what I wrote contradicts that statement. The data is classified and categorized. We don't have a single central topic containing everything. The question, then, is how far the topic hierarchy should go towards identifying the data.
I have the feeling that we might be pushing this too far...

@amilan17
Member Author

amilan17 commented May 10, 2023

Proposed decision for the @wmo-im/tt-wismd:

  • Only one Topic Hierarchy (w/ one discipline) is allowed in the metadata record.
  • One should not publish the same message under multiple disciplines.
    Note that the primary purpose of the TH is to identify the channel for publishing notifications and for subscribing to notifications from that channel. For broader discovery, other relevant disciplines should be listed as themes in the metadata record.
  • Also note that the TT-WISMD decided to remove the topic hierarchy from the data_id in the notification message.

@amilan17
Member Author

@sebvi wants a real-life use case to understand

@amilan17
Member Author

related to: wmo-im/wcmp2#94

@steingod

Following the discussion above I struggle to see what I could use the topicHierarchy for. I thought it was for filtering, but if multiple hierarchies are not permitted it won't serve the purpose for many datasets. Furthermore I don't understand the comment above:

We also have to remember the purpose of the topic hierarchy. It is not a way to describe the data (that is the job of the metadata) but to allow filtering by subscribers. It is not meant either to show that this data is useful for X and not Y. Again, this is the purpose of the metadata record.

I thought it was part of the metadata (https://wmo-im.github.io/wcmp2/standard/wcmp2-DRAFT.html#_topic_hierarchy) and that its elements are what we use for filtering relevant information. For cryosphere, many of the datasets could also be published under weather or climate etc., and then you would have to resort to Properties/Themes to really sort out what is relevant and just ignore TopicHierarchy. So given the ambiguity for many datasets, I am struggling to see the use case for TopicHierarchy.

@josusky

josusky commented May 17, 2023

Hi @steingod,
this might be a terminology issue. The word "topic" has a specific meaning in publish-subscribe protocols. It is an identifier needed to create a subscription. Having multiple topics for one dataset would be confusing - should I subscribe to all of them, or the first one, the last one?
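
To illustrate how a topic acts as the subscription key, here is a small Python sketch of MQTT-style topic filter matching, where `+` matches exactly one level and a trailing `#` matches any remaining levels. The topic strings are made up for the example and are not real WIS2 topics, and special handling of `$`-prefixed topics is left out.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT-style topic filter matches a topic name."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # '#' is only valid as the last level; it matches everything
            # that remains, including zero further levels.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False  # the filter is longer than the topic
        if level != "+" and level != topic_levels[i]:
            return False  # literal level differs
    # No '#' seen: every level must have been consumed on both sides.
    return len(filter_levels) == len(topic_levels)


# Hypothetical topic for a dataset published under one domain only.
published_topic = "example-centre/data/ocean/sea-surface-temperature"
ocean_match = topic_matches("+/data/ocean/#", published_topic)
weather_match = topic_matches("+/data/weather/#", published_topic)
```

A notification published under a single topic reaches only the subscribers whose filter covers that topic, which is why the choice of topic has to be predictable for both publishers and consumers.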

@amilan17
Member Author

I thought it was part of the metadata (https://wmo-im.github.io/wcmp2/standard/wcmp2-DRAFT.html#_topic_hierarchy) and that its elements are what we use for filtering relevant information.

@steingod the TT-WISMD recently decided to scale down the multiple uses of topic_hierarchy, and thus we will remove it as a property in WCMP2 and as a requirement for the data_id in the notification message. See: wmo-im/wcmp2#95

So now, the topic_hierarchy will only be used to identify a channel for pub/sub.

@steingod

Thanks for the update, makes sense to me. Concerning pub/sub, I do understand it has a specific meaning, but for this to be useful at the practical level, the implementation requires that it is possible to connect datasets to only one channel; otherwise you would have to subscribe to everything and filter afterwards anyway. Removing it as a requirement from WCMP2 makes sense, but how is the relation between datasets and channels addressed to make it consistent across the community(ies)?

@yhe-wmo

yhe-wmo commented Jun 14, 2023

The TT-NWPMD meeting (2023.06.13) noted the decision on scaling down the multiple uses of the topic hierarchy. TT-NWPMD asks for further clarification on how to solve the original issue: for a dataset that is multidisciplinary in nature, which topic should it be associated with? Clear and well-documented guidance would be needed to ensure consistency.

@amilan17
Member Author

@wmo-im/tt-nwpmd This will probably be the guidance:

If a dataset is multidisciplinary in nature, then choose the best fit for the TH. Think of the TH as a key or identifier for notifications on the cache with some basic meaning, but not a full description. More descriptions about other relevant disciplines will go into the WCMP2 metadata record for that dataset. Currently, the TT-WISMD is considering the best approach for this. Please see this comment in issue # 101 wmo-im/wcmp2#101 (comment).
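
As a rough illustration of this guidance, the Python sketch below picks one best-fit domain for the topic and records every relevant discipline as a theme. The function, the topic prefix, and the dictionary layout are hypothetical and only loosely inspired by the idea of themes/concepts in WCMP2; they are not the WCMP2 schema.

```python
def describe_dataset(dataset_id, best_fit_domain, other_domains):
    """Build a single-topic description that keeps all disciplines as themes."""
    all_domains = [best_fit_domain, *other_domains]
    return {
        "id": dataset_id,
        # Exactly one topic: the pub/sub channel (hypothetical prefix).
        "topic": f"example-centre/data/{best_fit_domain}",
        # Hypothetical layout: every relevant discipline, kept for discovery.
        "themes": [{"concepts": [{"id": domain} for domain in all_domains]}],
    }


# Sea surface temperature, with "ocean" chosen as the best fit.
record = describe_dataset("sst-analysis", "ocean", ["weather", "climate"])
```

The topic stays a plain channel identifier, while the list of disciplines can grow in the metadata without affecting subscriptions.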

@6a6d74
Collaborator

6a6d74 commented Jun 20, 2023

Adding my thoughts ...

We need to treat each domain separately, so "similar data" from, say, 2 earth system domains would need to be published in places on the topic hierarchy. We shouldn't try to conflate. This solution might not be super elegant, but at least it's predictable for data publishers and data consumers.

@antje-s

antje-s commented Jun 21, 2023

Note: currently, notifications for the same data (same data-id) would be considered duplicates by the Global Cache, even if they were published in different topics. Of course code is patient and it could be extended, but we would increase complexity...
The code would have to ensure that the download is executed only once (so that the data volume does not grow as well) and re-publish the further notifications, with the download link adjusted to the value from the first data download (this value has to be saved).
The disadvantage also remains that we would increase the overall message volume significantly, e.g. most observation data are relevant for several domains and would trigger many notifications.
And automatic forwarding at the receiver's end would be difficult, because the first received notification will trigger the data download and the next ones will not (recognized as "already downloaded" via the data-id check), so the data will be missing in the other client targets. The client code would also have to implement more differentiation.

At least my first feeling would be that multiple publish for the same data (with automatic download of the linked data) is not a good idea. But maybe I am just overlooking a simple solution...
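
A minimal Python sketch of the extension described above: download once, save the cached link, and re-publish every further notification with that link. Everything here is an assumption made for illustration (the class, the `links`/`href` layout of the message, the cache URL, and the topic strings); it is not Global Cache code.

```python
class MultiTopicCacheSketch:
    """Toy cache that downloads once per data_id but republishes per topic."""

    def __init__(self, cache_base_url):
        self.cache_base_url = cache_base_url
        self.cached_href = {}     # data_id -> link saved at first download
        self.download_count = 0   # stand-in for real downloads
        self.republished = []     # (topic, outgoing message) pairs

    def on_message(self, topic, message):
        data_id = message["properties"]["data_id"]
        if data_id not in self.cached_href:
            # First notification for this data_id: "download" and save the link.
            self.download_count += 1
            self.cached_href[data_id] = f"{self.cache_base_url}/{data_id}"
        # Every notification is republished with the saved cache link.
        outgoing = {
            "properties": {"data_id": data_id},
            "links": [{"href": self.cached_href[data_id]}],
        }
        self.republished.append((topic, outgoing))
        return outgoing


cache = MultiTopicCacheSketch("https://cache.example.org")
incoming = {
    "properties": {"data_id": "obs-0001"},
    "links": [{"href": "https://origin.example.org/obs-0001"}],
}
topics = ["example/weather", "example/hydrology"]
for topic in topics:
    cache.on_message(topic, incoming)
```

Even in this toy form the cache has to keep extra state per data_id, and the number of republished messages grows with the number of topics, which are the costs raised in this comment.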

@efucile
Member

efucile commented Jun 21, 2023

I am afraid that this can start very complex discussions. Example: precipitation is hydrology and weather. Does this mean that we publish precipitation observations on two topics? We should not build too much around topic and make use of the discovery metadata to inform different communities. I think that this needs to be addressed at the WCMP2 level, not in the topic.
Publishing the same data on different topics can increase the complexity in an unsustainable way.

@golfvert
Collaborator

golfvert commented Jun 21, 2023

I VERY strongly support Enrico's comment... The topic hierarchy and messages are about knowing that a new dataset is available while providing some filtering capabilities. They are not meant to describe the data or to limit its usage.

@kaiwirt
Contributor

kaiwirt commented Jun 21, 2023

I also strongly oppose that we publish messages for the same data in multiple topics. This only adds complexity without providing much advantage.

@amilan17
Member Author

DECISION

The topic_hierarchy will only be used to identify a channel for pub/sub. When a dataset is applicable under multiple domains, one should choose one domain (that is a best fit) and use the WCMP2 metadata record for further descriptions.

I updated the decision in the first issue comment to reflect the general consensus of this group.

@tomkralidis
Collaborator

TT-WISMD 2023-06-22:

  • TT agrees / endorses decision

@yhonda21

Data should be published under one discipline only. If the data is relevant to other disciplines, this information should be described in the metadata of the data. (See the TT-NWPMD meeting on 05.07.2023.)

@antje-s

antje-s commented Aug 29, 2023

Should we close the issue as decided?

@tomkralidis
Collaborator

We need the decision reflected in documentation in the resulting specification (once wmo-im/wis2-topic-hierarchy#47 is reviewed/merged).

@tomkralidis
Collaborator

TT-WISMD 2023-09-12:

  • add to specification
  • if a dataset can be made available under > 1 topic, the centre SHALL choose one topic for publication purposes

@tomkralidis tomkralidis self-assigned this Sep 25, 2023
@tomkralidis
Collaborator

TT-WISMD 2023-09-25

  • WCMP2 (distribution/MQTT) and WTH (?): add text to see the WIS2 Guide for further guidance on "choosing a topic for your dataset"
  • update WIS2 Guide with work clarification

@tomkralidis
Collaborator

PR in #39.

Projects
Status: Done
Status: Decision added