Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve operation workflow concurrency model #2527

Conversation

didier-wenzek
Copy link
Contributor

@didier-wenzek didier-wenzek commented Dec 15, 2023

Proposed changes

Change the way MQTT is used to trigger state updates along an operation workflow

  • MQTT is no more used inside the agent to move forward a command request
  • The new source of truth is a CommandBoard used by the operation supervisor
  • MQTT is used to init new command request
  • MQTT is used to notify the progress of a command request
  • The CommandBoard is persisted and restored when the agent restart
  • The await-agent-restart can be used to validate a state transition on agent restart

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

Copy link

codecov bot commented Dec 15, 2023

Codecov Report

Attention: 169 lines in your changes are missing coverage. Please review.

Comparison is base (141c1d0) 76.0% compared to head (bd4ff60) 75.9%.

Additional details and impacted files
Files Coverage Δ
...dge_agent/src/tedge_operation_converter/builder.rs 90.2% <100.0%> (+0.1%) ⬆️
crates/core/tedge_api/src/workflow/error.rs 0.0% <ø> (ø)
...tedge_agent/src/tedge_operation_converter/tests.rs 91.8% <92.8%> (+0.2%) ⬆️
crates/core/tedge_agent/src/agent.rs 0.0% <0.0%> (ø)
crates/core/tedge_api/src/workflow/state.rs 80.4% <57.1%> (+2.7%) ⬆️
crates/core/tedge_api/src/messages.rs 83.7% <81.2%> (-0.2%) ⬇️
crates/core/tedge_api/src/workflow/mod.rs 58.6% <50.0%> (-2.5%) ⬇️
...edge_agent/src/tedge_operation_converter/config.rs 0.0% <0.0%> (ø)
crates/core/tedge_api/src/workflow/toml_config.rs 68.6% <0.0%> (-1.9%) ⬇️
crates/core/tedge_api/src/workflow/on_disk.rs 37.5% <37.5%> (ø)
... and 2 more

... and 6 files with indirect coverage changes

Copy link
Contributor

github-actions bot commented Dec 15, 2023

Robot Results

✅ Passed ❌ Failed ⏭️ Skipped Total Pass % ⏱️ Duration
378 0 3 378 100 59m18.85s

Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Not approving yet as I'm aware of your plans to move the awaiting logic into an action.

crates/core/tedge_api/src/workflow/error.rs Outdated Show resolved Hide resolved
///
/// TODO: use the timestamp to mark faulty any request making no progress
#[serde(flatten)]
commands: HashMap<TopicName, (Timestamp, GenericCommandState)>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for now, but we can eventually consider moving that timestamp into the GenericCommandState payload itself as external processes acting on those state transitions might also benefit from it.


_ => {
// TODO: Use the timestamp to filter out action pending since too long
Some(command.clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this safe? Shouldn't we be returning the on_failed target of these commands, as we can't expect to resume commands just like that from their current states, if they were interrupted in the middle of their execution last time, right? I mean, for it to work, we should be sure that the script actions provided by the users are also resumable even after a first incomplete attempt, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After making that comment, I realised that it has been like this since we started supporting state transitions using MQTT retained messages, which would have behaved exactly the same way, when the same retained messages are re-delivered to the agent post-restart. I'm just not sure if our users would be aware of the impact: "their scripts should essentially be resumable"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point. I opt for this implementation precisely for the reason you give: "this been like this since we started supporting state transitions using MQTT retained messages".

And yes the scripts should be resumable for that to work. We can add an option to "fail on retry".

Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just leaving some doc rewording suggestions. Happy to approve once the test passes.

docs/src/references/agent/operation-workflow.md Outdated Show resolved Hide resolved
docs/src/references/agent/operation-workflow.md Outdated Show resolved Hide resolved
docs/src/references/agent/operation-workflow.md Outdated Show resolved Hide resolved
docs/src/references/agent/operation-workflow.md Outdated Show resolved Hide resolved
docs/src/references/agent/operation-workflow.md Outdated Show resolved Hide resolved
docs/src/references/agent/operation-workflow.md Outdated Show resolved Hide resolved
docs/src/references/agent/operation-workflow.md Outdated Show resolved Hide resolved
@didier-wenzek
Copy link
Contributor Author

didier-wenzek commented Dec 19, 2023

Happy to approve once the test passes.

There is indeed a race condition making the native reboot test flaky when the system successfully reboot. The issue is that currently the agent ignores SIGTERM signals when executing an operation action. In this specific case the timeout awaiting the agent to restart is delaying the restart. As a temporary workaround I increased the timeout used by the test and have the agent stopped by a SIGKILL. The proper fix is to have the agent stop on SIGTERM. I will address that.

Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

As a first step, the command board is not connected to the main actor
of the agent. This actor will have to be updated to use the command
board as the source of truth instead of MQTT retained messages.
The latter will still be used but on very specific point in time:
- to init a new command
- to cleanup a command that has been fully processed
- to await the response of a peer
- to notify progress made by the agent on each command.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
MQTT role has been reduced to:
- on init message: add a new command to the board
- on cleanup message: remove the command from the board
- observability

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
@didier-wenzek didier-wenzek force-pushed the improve/operation-workflow-concurrency-model branch from aa10e84 to bd4ff60 Compare December 20, 2023 08:54
@didier-wenzek didier-wenzek merged commit b24460d into thin-edge:main Dec 20, 2023
18 checks passed
@didier-wenzek didier-wenzek deleted the improve/operation-workflow-concurrency-model branch December 20, 2023 09:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants