Improve operation workflow concurrency model #2527

didier-wenzek · 2023-12-15T09:56:15Z

Proposed changes

Change the way MQTT is used to trigger state updates along an operation workflow

MQTT is no more used inside the agent to move forward a command request
The new source of truth is a CommandBoard used by the operation supervisor
MQTT is used to init new command request
MQTT is used to notify the progress of a command request
The CommandBoard is persisted and restored when the agent restart
The await-agent-restart can be used to validate a state transition on agent restart

Types of changes

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
Documentation Update (if none of the other choices apply)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

wrong state executed after workflow restart during a partition swap #2495

Checklist

I have read the CONTRIBUTING doc
I have signed the CLA (in all commits with git commit -s)
I ran cargo fmt as mentioned in CODING_GUIDELINES
I used cargo clippy as mentioned in CODING_GUIDELINES
I have added tests that prove my fix is effective or that my feature works
I have added necessary documentation (if appropriate)

Further comments

codecov · 2023-12-15T10:06:11Z

Codecov Report

Attention: 169 lines in your changes are missing coverage. Please review.

Comparison is base (141c1d0) 76.0% compared to head (bd4ff60) 75.9%.

Additional details and impacted files

Files	Coverage Δ
...dge_agent/src/tedge_operation_converter/builder.rs	`90.2% <100.0%> (+0.1%)`	⬆️
crates/core/tedge_api/src/workflow/error.rs	`0.0% <ø> (ø)`
...tedge_agent/src/tedge_operation_converter/tests.rs	`91.8% <92.8%> (+0.2%)`	⬆️
crates/core/tedge_agent/src/agent.rs	`0.0% <0.0%> (ø)`
crates/core/tedge_api/src/workflow/state.rs	`80.4% <57.1%> (+2.7%)`	⬆️
crates/core/tedge_api/src/messages.rs	`83.7% <81.2%> (-0.2%)`	⬇️
crates/core/tedge_api/src/workflow/mod.rs	`58.6% <50.0%> (-2.5%)`	⬇️
...edge_agent/src/tedge_operation_converter/config.rs	`0.0% <0.0%> (ø)`
crates/core/tedge_api/src/workflow/toml_config.rs	`68.6% <0.0%> (-1.9%)`	⬇️
crates/core/tedge_api/src/workflow/on_disk.rs	`37.5% <37.5%> (ø)`
... and 2 more

... and 6 files with indirect coverage changes

github-actions · 2023-12-15T10:19:03Z

Robot Results

✅ Passed	❌ Failed	⏭️ Skipped	Total	Pass %	⏱️ Duration
378	0	3	378	100	59m18.85s

albinsuresh

LGTM

Not approving yet as I'm aware of your plans to move the awaiting logic into an action.

tests/RobotFramework/tests/tedge_agent/workflows/native-reboot.toml

crates/core/tedge_api/src/workflow/error.rs

albinsuresh · 2023-12-18T11:37:48Z

crates/core/tedge_api/src/workflow/supervisor.rs

+    ///
+    /// TODO: use the timestamp to mark faulty any request making no progress
+    #[serde(flatten)]
+    commands: HashMap<TopicName, (Timestamp, GenericCommandState)>,


Not for now, but we can eventually consider moving that timestamp into the GenericCommandState payload itself as external processes acting on those state transitions might also benefit from it.

crates/core/tedge_agent/src/tedge_operation_converter/actor.rs

albinsuresh · 2023-12-18T12:04:54Z

crates/core/tedge_api/src/workflow/supervisor.rs

+
+            _ => {
+                // TODO: Use the timestamp to filter out action pending since too long
+                Some(command.clone())


Is this safe? Shouldn't we be returning the on_failed target of these commands, as we can't expect to resume commands just like that from their current states, if they were interrupted in the middle of their execution last time, right? I mean, for it to work, we should be sure that the script actions provided by the users are also resumable even after a first incomplete attempt, right?

After making that comment, I realised that it has been like this since we started supporting state transitions using MQTT retained messages, which would have behaved exactly the same way, when the same retained messages are re-delivered to the agent post-restart. I'm just not sure if our users would be aware of the impact: "their scripts should essentially be resumable"

Very good point. I opt for this implementation precisely for the reason you give: "this been like this since we started supporting state transitions using MQTT retained messages".

And yes the scripts should be resumable for that to work. We can add an option to "fail on retry".

albinsuresh

Just leaving some doc rewording suggestions. Happy to approve once the test passes.

docs/src/references/agent/operation-workflow.md

didier-wenzek · 2023-12-19T08:37:21Z

Happy to approve once the test passes.

There is indeed a race condition making the native reboot test flaky when the system successfully reboot. The issue is that currently the agent ignores SIGTERM signals when executing an operation action. In this specific case the timeout awaiting the agent to restart is delaying the restart. As a temporary workaround I increased the timeout used by the test and have the agent stopped by a SIGKILL. The proper fix is to have the agent stop on SIGTERM. I will address that.

albinsuresh

LGTM

As a first step, the command board is not connected to the main actor of the agent. This actor will have to be updated to use the command board as the source of truth instead of MQTT retained messages. The latter will still be used but on very specific point in time: - to init a new command - to cleanup a command that has been fully processed - to await the response of a peer - to notify progress made by the agent on each command. Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

MQTT role has been reduced to: - on init message: add a new command to the board - on cleanup message: remove the command from the board - observability Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

didier-wenzek had a problem deploying to Test Pull Request December 15, 2023 10:03 — with GitHub Actions Failure

didier-wenzek temporarily deployed to Test Pull Request December 15, 2023 15:45 — with GitHub Actions Inactive

didier-wenzek marked this pull request as ready for review December 16, 2023 08:33

didier-wenzek requested review from albinsuresh and reubenmiller December 16, 2023 08:33

didier-wenzek had a problem deploying to Test Pull Request December 16, 2023 08:43 — with GitHub Actions Failure

didier-wenzek had a problem deploying to Test Pull Request December 16, 2023 18:03 — with GitHub Actions Failure

didier-wenzek temporarily deployed to Test Pull Request December 16, 2023 18:31 — with GitHub Actions Inactive

didier-wenzek had a problem deploying to Test Pull Request December 17, 2023 18:01 — with GitHub Actions Failure

didier-wenzek temporarily deployed to Test Pull Request December 17, 2023 18:52 — with GitHub Actions Inactive

albinsuresh reviewed Dec 18, 2023

View reviewed changes

didier-wenzek had a problem deploying to Test Pull Request December 18, 2023 16:42 — with GitHub Actions Failure

didier-wenzek had a problem deploying to Test Pull Request December 18, 2023 21:09 — with GitHub Actions Failure

albinsuresh reviewed Dec 19, 2023

View reviewed changes

didier-wenzek temporarily deployed to Test Pull Request December 19, 2023 08:32 — with GitHub Actions Inactive

didier-wenzek had a problem deploying to Test Pull Request December 19, 2023 09:13 — with GitHub Actions Failure

didier-wenzek had a problem deploying to Test Pull Request December 19, 2023 14:31 — with GitHub Actions Failure

didier-wenzek temporarily deployed to Test Pull Request December 19, 2023 15:25 — with GitHub Actions Inactive

albinsuresh approved these changes Dec 19, 2023

View reviewed changes

didier-wenzek added 3 commits December 19, 2023 19:27

Let the agent use the command board

e783d39

MQTT role has been reduced to: - on init message: add a new command to the board - on cleanup message: remove the command from the board - observability Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Persist pending operation state

c1783d3

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

didier-wenzek force-pushed the improve/operation-workflow-concurrency-model branch from 39386b4 to aa10e84 Compare December 19, 2023 18:27

didier-wenzek had a problem deploying to Test Pull Request December 19, 2023 18:34 — with GitHub Actions Failure

didier-wenzek had a problem deploying to Test Pull Request December 20, 2023 05:32 — with GitHub Actions Failure

didier-wenzek added 2 commits December 20, 2023 09:51

Impl await agent-restart

adb096c

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Add version to on-disk workflow file

bd4ff60

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

didier-wenzek force-pushed the improve/operation-workflow-concurrency-model branch from aa10e84 to bd4ff60 Compare December 20, 2023 08:54

didier-wenzek temporarily deployed to Test Pull Request December 20, 2023 09:01 — with GitHub Actions Inactive

didier-wenzek merged commit b24460d into thin-edge:main Dec 20, 2023
18 checks passed

didier-wenzek deleted the improve/operation-workflow-concurrency-model branch December 20, 2023 09:53

This was referenced Dec 20, 2023

wrong state executed after workflow restart during a partition swap #2495

Closed

Finalize Operation Workflow #2478

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve operation workflow concurrency model #2527

Improve operation workflow concurrency model #2527

didier-wenzek commented Dec 15, 2023 •

edited

codecov bot commented Dec 15, 2023 •

edited

github-actions bot commented Dec 15, 2023 •

edited

albinsuresh left a comment

albinsuresh Dec 18, 2023

albinsuresh Dec 18, 2023

albinsuresh Dec 18, 2023

didier-wenzek Dec 18, 2023

albinsuresh left a comment

didier-wenzek commented Dec 19, 2023 •

edited

albinsuresh left a comment

Improve operation workflow concurrency model #2527

Improve operation workflow concurrency model #2527

Conversation

didier-wenzek commented Dec 15, 2023 • edited

Proposed changes

Types of changes

Paste Link to the issue

Checklist

Further comments

codecov bot commented Dec 15, 2023 • edited

Codecov Report

github-actions bot commented Dec 15, 2023 • edited

Robot Results

albinsuresh left a comment

Choose a reason for hiding this comment

albinsuresh Dec 18, 2023

Choose a reason for hiding this comment

albinsuresh Dec 18, 2023

Choose a reason for hiding this comment

albinsuresh Dec 18, 2023

Choose a reason for hiding this comment

didier-wenzek Dec 18, 2023

Choose a reason for hiding this comment

albinsuresh left a comment

Choose a reason for hiding this comment

didier-wenzek commented Dec 19, 2023 • edited

albinsuresh left a comment

Choose a reason for hiding this comment

didier-wenzek commented Dec 15, 2023 •

edited

codecov bot commented Dec 15, 2023 •

edited

github-actions bot commented Dec 15, 2023 •

edited

didier-wenzek commented Dec 19, 2023 •

edited