Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ordered processing of entity registration messages before any other messages #2466

Conversation

albinsuresh
Copy link
Contributor

@albinsuresh albinsuresh commented Nov 17, 2023

Proposed changes

Spec

  • Problems statements
  • Solution proposals
  • Finalized solution

Impl

  • Caching of early telemetry messages
  • Caching of early twin data messages
  • Handling early child entity registration messages
  • Persist entity store: Deferred for later

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue


Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

Copy link
Contributor

github-actions bot commented Nov 17, 2023

Robot Results

✅ Passed ❌ Failed ⏭️ Skipped Total Pass % ⏱️ Duration
362 0 3 362 100 49m46.662s

Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These thoughts make me wonder if using MQTT is the right tool to register which entities are attached to each topic. And even, if this can be dynamic.

I would consider a declarative approach. To opt-out the default topic scheme, a user will have to provide a description of they custom topic scheme. It can be a set of regex rules or a user program returning a registration message for a entity id.

By a set of regex rules, I mean something along these lines:

  • ground/rasp/// -> main device
  • floor[0-9]+/rasp// -> child device of main device
  • (?<parent> floor[0-9]+)/plc[0-9]+// -> child device of ${parent}/rasp//

docs/src/references/mappers/c8y-mapper.md Outdated Show resolved Hide resolved
@reubenmiller
Copy link
Contributor

These thoughts make be wonder if using MQTT is the right tool to register which entities are attached to each topic. And even, if this can be dynamic.

I would consider a declarative approach. To opt-out the default topic scheme, a user will have to provide a description of they custom topic scheme. It can be a set of regex rules or a user program returning a registration message for a entity id.

By a set of regex rules, I mean something along these lines:

  • ground/rasp/// -> main device
  • floor[0-9]+/rasp// -> child device of main device
  • (?<parent> floor[0-9]+)/plc[0-9]+// -> child device of ${parent}/rasp//

Yes, very valid points. Some kind of blocking call (e.g. to a HTTP endpoint) might be more appropriate to register devices (or patterns, nice idea).

docs/src/references/mappers/c8y-mapper.md Outdated Show resolved Hide resolved
docs/src/references/mappers/c8y-mapper.md Outdated Show resolved Hide resolved
docs/src/references/mappers/c8y-mapper.md Outdated Show resolved Hide resolved
@albinsuresh
Copy link
Contributor Author

albinsuresh commented Nov 21, 2023

These thoughts make me wonder if using MQTT is the right tool to register which entities are attached to each topic. And even, if this can be dynamic.

I can't agree more and most "capable devices" probably would prefer such an HTTP API over the clunky MQTT one. But, my understanding was that there would be many limited devices that we'll have to support, that can only handle MQTT and we'll have to provide an MQTT-only solution for them anyway. So, ideally we should have both APIs, but we might be forced to start with MQTT first, as that works for everyone.

Irrespective of the protocol, another key thing that we need to define is what a "successful registration" means? Does it mean that any one mapper has processed that registration message or all the connected mappers have processed it? How can we communicate either of those in a cloud-agnostic manner? I've expressed my concerns on this matter in this comment in detail.

I would consider a declarative approach. To opt-out the default topic scheme, a user will have to provide a description of they custom topic scheme. It can be a set of regex rules or a user program returning a registration message for a entity id.

By a set of regex rules, I mean something along these lines:

  • ground/rasp/// -> main device
  • floor[0-9]+/rasp// -> child device of main device
  • (?<parent> floor[0-9]+)/plc[0-9]+// -> child device of ${parent}/rasp//

Once we have the registration message feedback mechanism, on which the device-logic will have to wait, do we still need this template mechanism? Such a mechanism would still be useful to the customer, to avoid the explicit registration messages altogether. But this is not necessary in the context of this problem, right?

One other key point to note is that these issues are not just limited to custom topic schemes and the id-generation associated with it. That's definitely one key aspect, but not the only one. Even the default topic scheme suffers from these limitations when explicit registration is mandatory, like in the case of nested child devices. Getting the id wrong is not the only problem in this case, but getting the @parent wrong as well. I've added another dedicated sequence diagram for nested-child devices case, explaining the error path.

@didier-wenzek
Copy link
Contributor

Once we have the registration message feedback mechanism, on which the device-logic will have to wait, do we still need this template mechanism? Such a mechanism would still be useful to the customer, to avoid the explicit registration messages altogether. But this is not necessary in the context of this problem, right?

We need to choose, sure.

  • either a protocol over MQTT or HTTP to dynamically register entities
  • or a static topic scheme used to deduce registration messages from entity identifiers.

I do think the latter would simplify things a lot (the topic<->entity relation being computed)
while being flexible enough (a user don't have to list in advance all and every entities - but to detail the logic behind his topic scheme).

One other key point to note is that these issues are not just limited to custom topic schemes and the id-generation associated with it. That's definitely one key aspect, but not the only one. Even the default topic scheme suffers from these limitations when explicit registration is mandatory, like in the case of nested child devices. Getting the id wrong is not the only problem in this case, but getting the @parent wrong as well. I've added another dedicated sequence diagram for nested-child devices case, explaining the error path.

Why registration should be mandatory and necessarily done dynamically using registration messages?

A rule such as this one can be used to derive the external id as well as the parent of a grand child device,
assuming a scheme where a bunch of PLC are attached to a Rasp on each floor:

(?<floor> floor[0-9]+)/(?<plc> plc[0-9]+)// -> { "@id" -> "${floor}:${plc]", "@parent": "${floor}/rasp//" }

@albinsuresh
Copy link
Contributor Author

A rule such as this one can be used to derive the external id as well as the parent of a grand child device,
assuming a scheme where a bunch of PLC are attached to a Rasp on each floor:

For the external ID, I completely agree that some kind of correlation would definitely exist between the entity topic ID and its external ID, making that mapping definition fairly straight-forward. But, deriving the parent, I'm not fully sure. Looking at a few examples, like devices deployed in factory floors in a hierarchical way, it seems plausible, but I'm just not sure if we can generalise that assumption.

tests/RobotFramework/tasks/debug.robot Outdated Show resolved Hide resolved
docs/src/references/mappers/c8y-mapper.md Outdated Show resolved Hide resolved
docs/src/references/mappers/c8y-mapper.md Outdated Show resolved Hide resolved
Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay for "Phase 1 - live update" and have a suggestion for an alternative solution to "Phase 1 - on mapper restart".

I see the "Phase 2" proposal as not mature enough. Is this even required?

I also have a question on where to put this content. These are more internal design issues and discussions rather than specifications as implied by the name of the updated file.

docs/src/references/mappers/c8y-mapper.md Outdated Show resolved Hide resolved
docs/src/references/mappers/c8y-mapper.md Outdated Show resolved Hide resolved
docs/src/references/mappers/c8y-mapper.md Outdated Show resolved Hide resolved
docs/src/references/mappers/c8y-mapper.md Outdated Show resolved Hide resolved
@albinsuresh albinsuresh marked this pull request as ready for review November 28, 2023 11:29
Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current text clarifies well the problem statement/alternative solutions/pros & cons;
but it's not so clear what's the retained solution among all the alternative ideas and combinations.

docs/src/references/mappers/c8y-mapper.md Outdated Show resolved Hide resolved
docs/src/references/mappers/c8y-mapper.md Outdated Show resolved Hide resolved
docs/src/references/mappers/c8y-mapper.md Outdated Show resolved Hide resolved
Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with the proposed design.

This document can be improved with an executive summary of the retained solution.
3 bullet points would be enough:

  • The c8y mapper persists its entity store on disk.
  • Messages received from entities not registered yet are cached till the registration is received (with an eviction policy to deal with rogue clients).
  • Auto-registration has to be disabled if the user opt-in for a custom MQTT scheme.

design/decisions/0003-c8y-mapper.md Outdated Show resolved Hide resolved
Comment on lines +116 to +123
To minimize the message races on a restart, especially between all the entities that were previously registered
before the mapper went down and any data messages for that entity that may have come while the mapper was down,
a persistent copy of the entity store is maintained on the disk, and kept up-to-date, while the mapper is live.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to consider to use the new agent.state.path directory for that purpose as the plan is to have this directory persisted across device restart during firmware updates.

See #2488

@albinsuresh albinsuresh force-pushed the feat/2428/registration-message-processing-priority branch from 2ddf72b to 13564ef Compare December 6, 2023 08:26
Copy link

codecov bot commented Dec 6, 2023

Codecov Report

Merging #2466 (2cac9f0) into main (1ef77c9) will decrease coverage by 0.2%.
Report is 33 commits behind head on main.
The diff coverage is 92.9%.

Additional details and impacted files
Files Coverage Δ
crates/core/tedge_api/src/lib.rs 100.0% <ø> (ø)
crates/core/tedge_api/src/mqtt_topics.rs 86.7% <0.0%> (+1.0%) ⬆️
crates/core/tedge_api/src/ring_buffer.rs 97.5% <97.5%> (ø)
crates/core/tedge_api/src/entity_store.rs 94.1% <85.9%> (-0.6%) ⬇️
crates/core/tedge_api/src/pending_entity_store.rs 96.3% <96.3%> (ø)
crates/extensions/c8y_mapper_ext/src/converter.rs 82.4% <90.0%> (+1.0%) ⬆️

... and 30 files with indirect coverage changes

crates/core/tedge_api/src/message_log.rs Outdated Show resolved Hide resolved
crates/core/tedge_api/src/entity_store.rs Outdated Show resolved Hide resolved
crates/core/tedge_api/src/entity_store.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't try to understand why most of the system tests are failing.

The mappers would benefit from an EntityStore taking in charge all the caching issues. Can this be done by two independent implementations: one with auto registration another one with a cache of pending registration.

crates/common/mqtt_channel/src/messages.rs Outdated Show resolved Hide resolved
crates/extensions/c8y_mapper_ext/src/converter.rs Outdated Show resolved Hide resolved
crates/core/tedge_api/src/partial_entity_store.rs Outdated Show resolved Hide resolved
crates/core/tedge_api/src/partial_entity_store.rs Outdated Show resolved Hide resolved
Ok(entity_store)
}

pub fn load_from_message_log(&mut self) -> Result<(), InitError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you consider to persist not a log of registration messages but the entity store itself? Or more precisely its main component aka entities: HashMap<EntityTopicId, EntityMetadata>. I think this would be simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial plan was to persist the EntityMetadata struct instances in the log. But since the command capability messages were not part of this struct, I was forced to drop that plan (Eventually, we'll have to include them also in the entity store, but doing that now would have been a bigger change and risky).

Then I almost introduced a new PersistentMessage enum wrapper for all persistable messages (reg messages, twin data messages, command metadata etc), but then finally decided to persist the raw MqttMessages itself to cover other metadata messages as well, like MeasurementMetadata, EventMetadata etc which are not part of the entity store at the moment. It felt like a better choice as we could cover any other message types as well in the future (e.g: persist even the telemetry messages on disk instead of the in-memory ring buffer to reduce memory pressure, if needed).

One additional risk that I felt in logging the EntityMetadata struct itself was the possibility of future changes to this struct making the persistent logs between different tedge versions incompatible. Since the raw messages are less likely to change, I felt it's a safer option. But, this is a point worth discussing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed and agreed offline, we're sticking with the persistence of raw MqttMessages for now. But, this persistence impl won't be included in the current PR anymore. It will be added in a follow-up PR.

crates/core/tedge_api/src/entity_store.rs Outdated Show resolved Hide resolved
crates/core/tedge_api/src/message_log.rs Outdated Show resolved Hide resolved
@albinsuresh albinsuresh force-pushed the feat/2428/registration-message-processing-priority branch from dc64075 to 7abfae0 Compare December 13, 2023 11:43
Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. The code is clear and far simpler with pending message caching responsibility moved behind the entity store.

As discussed and agreed offline, we're sticking with the persistence of raw MqttMessages for now. But, this persistence impl won't be included in the current PR anymore. It will be added in a follow-up PR.

Will have then to be addressed the pending comment related to storage.
Notably, we need to find an agreement on /data/tedge

Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirm my approval

@albinsuresh albinsuresh force-pushed the feat/2428/registration-message-processing-priority branch from bf4a31d to 2cac9f0 Compare December 14, 2023 08:33
@albinsuresh albinsuresh merged commit b4974b9 into thin-edge:main Dec 14, 2023
18 checks passed
@albinsuresh
Copy link
Contributor Author

albinsuresh commented Dec 14, 2023

Test Plan

This PR fixes the message race issue where an out-of-order delivery of entity data messages delivered before the entity registrations themselves results in those data messages getting dropped. The fix results in the following behaviour:

  1. All data messages: telemetry and other metadata messages like twin data, received before their respective registration messages, are cached in-memory and not converted.
    • Telemetry messages for all entities are cached in a cache with a capacity of only 100 entries. Hence, when the cache is full, older entries are replaced with newer entries.
    • Metadata messages are cached in unbounded buffers as we can't afford dropping such critical data
  2. All child device registration messages received before their parents are also cached
  3. When the registration message is received, itself and all its cached child devices are registered, and all their cached data are also processed.
  4. Auto-registration must be turned off if explicit registration is used for any entity. Keeping it turned on while using explicit registration can sometimes result in undesired behaviours like nested child devices getting registered as immediate child devices.

More details on the issue and the solution proposals can be found in the design decisions doc included in this PR. Some basic test coverage was also added in device_registration.robot. Persisting the entity store on disk and restoring it on startup, to minimize message races on startup, was deferred for later.

@gligorisaev gligorisaev self-assigned this Dec 14, 2023
@albinsuresh albinsuresh deleted the feat/2428/registration-message-processing-priority branch December 14, 2023 10:30
@gligorisaev
Copy link
Contributor

QA has thoroughly checked the feature and here are the results:

  • Test for ticket exists in the test suite.
  • QA has tested the function and it's functioning according description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants