Topic Hierarchy Structure: The extreme complexity of the Topic hierarchy could potentially lead to a limited adoption of the service or very large performance issues #50

Open
gaubert opened this issue Sep 8, 2023 · 4 comments

@gaubert

gaubert commented Sep 8, 2023

I hope that I have misunderstood something and that this can be resolved easily by updating my understanding of the WIS architecture, but I have a couple of points to raise on the topic hierarchy.

I have been looking at the WIS2 topic hierarchy structure, which is meant to help users find datasets and filter the data topics by subject. Thinking about it and how it could be implemented, it looks to me that its complexity will be a very large barrier to entry, or that it could lead users to ignore it completely.
Another point is that the topic hierarchy could require a very complex system for the main broker to reflect the entire hierarchy, and maintaining good performance could be extremely challenging.
Below are the points that I have been trying to develop:

Extensive discovery/domain information in the topic hierarchy will be counter-productive in helping users understand what data is available and how to find the data relevant to them

Here is a quick calculation, taking the first 8 levels and assuming around 195 countries and 20 centres per country on average (which is probably below the real number).
I end up with 2x1x1x195x20x4x2x8 = 499,200 branches for the first 8 levels and, for the total tree, adding 3 levels of 5 sub-disciplines each: 2x1x1x195x20x4x2x8x5x5x5 = 62,400,000 topics. The assumptions might be too generous, but reducing the problem by a factor of 100 will lead to the same conclusion.
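The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the fan-out figures per level are the assumptions stated in this comment, not values taken from the WIS2 specification, and the variable names are illustrative.

```python
from math import prod

# Assumed fan-out at each of the first 8 levels (figures from the estimate above;
# 195 = countries, 20 = centres per country on average).
TOP_LEVEL_FANOUT = [2, 1, 1, 195, 20, 4, 2, 8]

# Assumed 3 further levels of 5 sub-disciplines each.
SUB_DISCIPLINE_FANOUT = [5, 5, 5]

# The number of branches is the product of the fan-out at every level.
top_level_branches = prod(TOP_LEVEL_FANOUT)
total_topics = top_level_branches * prod(SUB_DISCIPLINE_FANOUT)

print(f"Branches in the first 8 levels: {top_level_branches:,}")
print(f"Topics in the full tree:        {total_topics:,}")
print(f"Same tree reduced by 100:       {total_topics // 100:,}")
```

Changing any single entry in the lists shows how sensitive the total is to the per-level assumptions.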

From the discovery/usability point of view, this is a large obstacle for users if the intention is to have them understand the topic hierarchy and use it to find the data they are interested in.
Users will most probably not find their way and might simply use + or # wildcards at many levels to receive some data.
They could then be overwhelmed by the number of messages received, and the main brokers could be overloaded by such subscriptions and by the number of clients subscribing to many topics.

This is why I am questioning the purpose of providing so much semantic and discovery information in the topic hierarchy and making it so deep.

Additionally, if the intention is to help users understand what data is available, why do we have 8 levels of technical (version, WIS2) and political information before the domain information?

At the very least the topic hierarchy should be reversed but, in my opinion, it should mostly be simplified.

If the answer to the questions above is that the catalogue will provide the discovery services to find the data, then there is no need to create such a complex topic hierarchy, which will make the implementation very complex and challenging for the users.

Potential performance issues and challenges for implementation

Another point is the performance of a system that will have to replicate and manage, for distribution, 62 million topics, some of which have a very high distribution frequency. This certainly leads to a large-scale system, and tests at that scale should be performed to assess whether the products on the market (HiveMQ, RabbitMQ, Mosquitto, Amazon MQTT service) can cope easily with it.
It should also be noted that this complex hierarchy forces users to use wildcards (+, #), which will make the system even more demanding in terms of resources (tables in memory, on disk, or in databases to resolve the wildcards and maintain the multiple subscriptions of thousands of users).

Proposal for a way forward

I would propose to re-think the topic hierarchy and go back to the initial requirements:

  • Remy said that the intention was to use it to help users avoid subscribing to too many topics and being overwhelmed by the number of messages received.

How should the topic hierarchy be organised to focus on this requirement?

Here are some leads that could help solve the issue without leading to a difficult full-scale implementation:

  • The discovery services of the catalogue shall be used to provide the different topics to which users will want to subscribe. Then we do not need to provide semantics in the topic hierarchy (it is not a discovery service). Use arbitrary names to avoid any misinterpretation, and minimize their number.
  • Limit the number of levels in the topic hierarchy.
  • The originator of the notification is in the messages, so the political structure might not be needed in the hierarchy. The domain structure might not be needed either, to minimize the complexity, as it will be available from the discovery catalogue.
  • It might be that only a limited set of data/messages needs such a deep topic hierarchy, and it should only be built for that limited purpose.
  • A practical organisation might be to limit the number of levels and let centres define a simple technical/logical structure while limiting the number of topics.
  • Rules on how many topics can exist at each level should be created and enforced. Exceptions should be reviewed and approved by a WMO body.
  • What about multi-purpose datasets? How are centres going to classify this type of data and respond to the users' queries? Currently one topic domain category will be chosen for a dataset, and a user using this data for another purpose will have difficulty finding it. Then again, what is the purpose of providing a semantic structure that is wrong for that user? On the other hand, datasets can be qualified under multiple domain categories in a discovery catalogue.

Another proposal would be to implement a large-scale prototype simulating the load and the number of topics to be created and reflected on the main brokers.

What do you think? Comments?

@kaiwirt

kaiwirt commented Sep 8, 2023

From my point of view, the topic hierarchy is not meant to help users find their dataset. For this purpose there will be the WIS2 metadata. If I am not mistaken, we had the discussion on the meaning of topics in wmo-im/wis2-guide#38.

So if you search for data, use the Global Discovery Catalogue and the metadata will point you to the correct topic to subscribe to.

Additionally it is perfectly fine to make much use of "+" and "#". The topic is for filtering. If you don't want to filter on a certain level of the topic hierarchy then use a wildcard. As Global Cache we subscribe to anything below origin/a/wis2/data/core/# which is what a cache is supposed to do.
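To make the filtering semantics of "+" and "#" concrete, here is a minimal sketch of MQTT topic filter matching in plain Python. The function name `topic_matches` and the concrete topic strings below the `origin/a/wis2/data/core` prefix are hypothetical, chosen only for illustration; the sketch also ignores the special rule for topics starting with `$`.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic filter matches a concrete topic name."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")

    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: matches the parent level and everything
            # below it, but is only valid as the last level of the filter.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            # The filter is deeper than the topic.
            return False
        if level != "+" and level != topic_levels[i]:
            # "+" matches exactly one level; anything else must be equal.
            return False

    # Without a trailing "#", both must have the same number of levels.
    return len(filter_levels) == len(topic_levels)


# The Global Cache filter from the comment above, against hypothetical topics.
cache_filter = "origin/a/wis2/data/core/#"
print(topic_matches(cache_filter, "origin/a/wis2/data/core/weather/surface"))
print(topic_matches(cache_filter, "origin/a/wis2/data/recommended/weather"))
```

A wildcard at a level simply means "do not filter here", so one filter can stand in for any number of concrete topics.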

What I cannot judge at the moment is the performance implications. MQTT brokers are made to handle lots of small messages distributed between lots of clients (IoT stuff), so millions of messages per timeframe should not be an issue. I cannot yet say anything about the impact of topics and filtering on performance. In my opinion this is something that tests and real life need to show. To me it is clear that there will be lots of tuning involved in the early days of WIS2.

The decision on multi-purpose data sets in short is: Pick one topic that fits.

One last remark: I think the sheer number of possible topics is not something that matters. The broker just needs to know which client wants which messages. And using wildcards in that perspective is preferred over creating thousands of individual subscriptions.
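The point that the broker only needs to know which client wants which messages can be illustrated with a toy subscription tree. This is a sketch under strong assumptions, not how any particular broker is implemented: the class names, client ids and topic strings are hypothetical, and real brokers also keep retained messages and session state, which are ignored here.

```python
class Node:
    """One level of a stored topic filter."""

    def __init__(self):
        self.children = {}    # level name (or "+" / "#") -> Node
        self.clients = set()  # clients whose filter ends at this node


class SubscriptionTree:
    """Stores topic filters only; published topics never add nodes."""

    def __init__(self):
        self.root = Node()
        self.node_count = 1

    def subscribe(self, client_id, topic_filter):
        node = self.root
        for level in topic_filter.split("/"):
            if level not in node.children:
                node.children[level] = Node()
                self.node_count += 1
            node = node.children[level]
        node.clients.add(client_id)

    def match(self, topic):
        """Return the set of clients whose filters match the topic."""
        found = set()
        self._walk(self.root, topic.split("/"), 0, found)
        return found

    def _walk(self, node, levels, i, found):
        # A "#" child matches the rest of the topic, including zero levels.
        hash_child = node.children.get("#")
        if hash_child is not None:
            found |= hash_child.clients
        if i == len(levels):
            found |= node.clients
            return
        # Follow the exact level name and the single-level wildcard.
        for key in (levels[i], "+"):
            child = node.children.get(key)
            if child is not None:
                self._walk(child, levels, i + 1, found)


tree = SubscriptionTree()
tree.subscribe("global-cache", "origin/a/wis2/data/core/#")
tree.subscribe("one-centre", "origin/a/wis2/data/core/+/centre-x")
nodes_before = tree.node_count

# Route messages for 1000 distinct (hypothetical) topics.
for n in range(1000):
    tree.match(f"origin/a/wis2/data/core/area-{n}/centre-x")

print(nodes_before, tree.node_count)
```

In this model the stored state grows with the number of distinct subscriptions, not with the number of possible or published topics. Whether the matching cost stays acceptable at scale is exactly what a test would need to show.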

@golfvert
Contributor

golfvert commented Sep 11, 2023

I fully agree with Kai's comment above.
Some more comments.

The WIS2 Pilot Phase is exactly meant for those kinds of tests. I also have the feeling there might be some misunderstanding on how the MQTT protocol works. We will not "create" the 62M topics. I don't think it will be even 1/100th of that number of topics. Topics are "created" when someone publishes or subscribes to them.

Then, MQTT brokers are built, by design, to handle large numbers of subscribers and publishers. It is also interesting to note that in one sizing tool for MQTT clusters (e.g. https://www.emqx.com/en/server-estimate), what matters is the number of connected clients and the number of messages, which is independent of the number of topics.

The MQTT topic hierarchy is also not meant to be used as a poor man's discovery metadata.

I have also seen, in other ETs/TTs, some temptation to create many sublevels with many more topics.

At the moment, we have 8 "global" levels. I don't think it is, by design, too much.

We have to agree on the "right" balance for the level of filtering provided by the topics, between a very coarse grain (only one topic) and a very fine grain (a lot of levels of topics). "Right" will obviously have a different meaning for all the experts :)

I have discussed this kind of thing (and our agreed topic hierarchy) with the main developer of one of the large MQTT brokers, and it was not considered an issue.

The Sparkplug standard (https://sparkplug.eclipse.org/specification/version/3.0/documents/sparkplug-specification-3.0.0.pdf), used in a typical IoT world, defines one topic per IoT device. So they are really talking about millions here.

@gaubert changed the title from "Topic Hierarchy Structure: The extreme complexity of the Topic hierarchy could potentially lead to the a limited adoption of the service or very large performance issues" to "Topic Hierarchy Structure: The extreme complexity of the Topic hierarchy could potentially lead to a limited adoption of the service or very large performance issues" on Sep 13, 2023
@gaubert
Author

gaubert commented Sep 17, 2023

@golfvert @kaiwirt Thanks for the comments.

There is no misunderstanding on how MQTT works, and it is clear that the topics are not "pre"-created. Still, if you look at the extracts from AWS IoT Core, EMQX or HiveMQ, they really try to limit the number of topics created and the depth of a topic hierarchy. I have attached my presentation with links and extracts from the different documentations.
For instance, HiveMQ, as part of its MQTT 5 support, provides topic aliases (index numbers) for the topics in messages, as otherwise the messages get too heavy on a large-scale infrastructure. These are all signs that a large number of topics is going to create potential difficulties for the infrastructure when used heavily.
In the end it is pattern matching to route the message, yes, but large-scale pattern matching takes a toll on the infrastructure and needs to be managed.
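As a rough illustration of why topic aliases exist, the sketch below compares the bytes a single MQTT 5 PUBLISH packet spends on the topic with and without an alias. It assumes a packet with no other properties, and the topic string and function name are hypothetical. Note that this concerns message size rather than the number of topics, and that the first PUBLISH on a connection still has to carry the full topic name together with the alias to establish the mapping.

```python
def topic_overhead_bytes(topic: str, use_alias: bool) -> int:
    """Bytes used by the topic name and properties in one MQTT 5 PUBLISH.

    Assumes the packet carries no properties other than the Topic Alias.
    """
    if use_alias:
        # Zero-length topic name (2-byte length prefix)
        # + property length (1 byte)
        # + Topic Alias property (1-byte identifier 0x23 + 2-byte alias value).
        return 2 + 1 + 3
    # Length-prefixed UTF-8 topic name + property length byte (no properties).
    return 2 + len(topic.encode("utf-8")) + 1


# Hypothetical deep topic, for illustration only.
topic = "origin/a/wis2/data/core/weather/surface-based-observations/synop"

full = topic_overhead_bytes(topic, use_alias=False)
aliased = topic_overhead_bytes(topic, use_alias=True)
print(f"without alias: {full} bytes, with alias: {aliased} bytes")
```

For small notification payloads, a long topic name can be a noticeable share of each packet, which is the cost that aliases are designed to remove.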

To make sure that the current topic setup is fine, it would be very valuable to perform a representative test with a representative load, for instance using the NWP discipline topic hierarchy duplicated 4 or 5 times (one per discipline), with a representative number of publishers (1000 or more) and consumers (10000 or more) using it, to see how the brokers, publishers and consumers react and how each should be scaled or adapted.

@golfvert you indicate that there are 8 levels at the moment, which is perfectly fine, but many more are going to be created below by the different disciplines (weather, hydrology, ...). Right now, a lot of semantics is being added to the topic hierarchy by each discipline, not necessarily for filtering purposes. There seems to be a misunderstanding of the purpose of the topic hierarchy; it should be communicated differently, and the topic hierarchy should not be extended as much as it is being done now.

Another consequence is that it creates a complex hierarchy. This is the second potential issue with the topic hierarchy definition: usability, and making it easy for users to understand and use. If it is too complex (and this is my feeling with 8 top levels and then the disciplines), then users will ignore it and use wildcards to subscribe to almost everything. Then all the work done to define the full topic hierarchy, which will additionally create some maintenance and infrastructure issues, will have been done for little value and will be difficult to change.

I hope that this issue will help us converge toward a final topic hierarchy with one purpose: offering some filtering to avoid having consumers receive too many messages, while having a scalable, manageable infrastructure.

20230911-Topic-Hierarchy-Potential-Issues.pptx

@golfvert
Contributor

Thanks for this message.
Firstly, I would disregard what Amazon is claiming as good practices. In doing so they are hiding the limitations of their implementation of MQTT. I suspect they are using some proprietary tool under the hood that has those limitations. When you are large enough, the temptation is strong to mask your weaknesses like that :)
I agree that we have to define the "right" number of levels of topics. I like that sentence from EMQX: Try not to use more topic levels “just because I can”. Last week during the meeting there was a discussion on merging country and centre_id. That would remove one level.
I fully agree with:
"Hope that this issue will help converging toward a final topic hierarchy with one purpose: offering some filtering to avoid having consumers receiving too many messages while having a scalable manageable infrastructure."
It means that when we create a new level below 8 (maybe soon 7 - see above) with its various topics, then we have to consider if users will want to filter the messages or not between those levels and topics.
Levels and topics are not meant to replace the metadata.
They are not meant to fully describe the data.
Creating levels and topics to end up with wildcard subscriptions would be a bad outcome.

When we delegated the creation of the topic hierarchy to the various disciplines (typically NWP) we may have lacked some guidance...

I also agree with the idea of a stress test.
We can plan and organize this in 2024.

Labels
None yet
Projects
Status: For performance testing hackathon (2024Q2-Q3)
Development

No branches or pull requests

3 participants