Improve operation workflow concurrency model #2527
LGTM
Not approving yet as I'm aware of your plans to move the awaiting logic into an action.
tests/RobotFramework/tests/tedge_agent/workflows/native-reboot.toml (outdated, resolved)
///
/// TODO: use the timestamp to mark faulty any request making no progress
#[serde(flatten)]
commands: HashMap<TopicName, (Timestamp, GenericCommandState)>,
Not for now, but we can eventually consider moving that timestamp into the GenericCommandState payload itself, as external processes acting on those state transitions might also benefit from it.
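The TODO above (using the timestamp to flag requests that make no progress) could be sketched roughly as follows. This is only an illustration under stated assumptions: `CommandState`, the simplified `CommandBoard`, the `update` and `stalled` methods, and the use of `SystemTime` as the timestamp are all hypothetical stand-ins, not the types or API introduced by this PR.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

/// Hypothetical stand-in for `GenericCommandState`; only the status matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct CommandState {
    pub status: String,
}

/// Simplified board: command topic -> (time of last update, current state).
#[derive(Default)]
pub struct CommandBoard {
    commands: HashMap<String, (SystemTime, CommandState)>,
}

impl CommandBoard {
    /// Record a new state for a command, stamping it with `now`.
    pub fn update(&mut self, topic: &str, state: CommandState, now: SystemTime) {
        self.commands.insert(topic.to_string(), (now, state));
    }

    /// Topics whose last update is older than `max_age` at time `now`,
    /// sorted so the result is deterministic.
    pub fn stalled(&self, now: SystemTime, max_age: Duration) -> Vec<String> {
        let mut topics: Vec<String> = self
            .commands
            .iter()
            .filter(|(_, (updated_at, _))| {
                // A timestamp in the future is treated as "not stalled".
                now.duration_since(*updated_at)
                    .map(|age| age > max_age)
                    .unwrap_or(false)
            })
            .map(|(topic, _)| topic.clone())
            .collect();
        topics.sort();
        topics
    }
}
```

A caller could then decide what to do with each stalled topic, for instance moving the command to a failed state. Passing `now` explicitly rather than reading the clock inside the board keeps the check easy to test.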
_ => {
    // TODO: Use the timestamp to filter out action pending since too long
    Some(command.clone())
Is this safe? Shouldn't we be returning the on_failed target of these commands? We can't expect to resume commands just like that from their current states if they were interrupted in the middle of their execution last time, right? I mean, for it to work, we should be sure that the script actions provided by the users are also resumable even after a first incomplete attempt, right?
After making that comment, I realised that it has been like this since we started supporting state transitions using MQTT retained messages, which would have behaved exactly the same way, when the same retained messages are re-delivered to the agent post-restart. I'm just not sure if our users would be aware of the impact: "their scripts should essentially be resumable"
Very good point. I opted for this implementation precisely for the reason you give: it "has been like this since we started supporting state transitions using MQTT retained messages".
And yes, the scripts should be resumable for that to work. We can add an option to "fail on retry".
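To make the two behaviours discussed in this thread concrete, here is a minimal sketch of what a "fail on retry" option might look like. Everything in it is hypothetical: `RestartPolicy`, `PendingCommand`, and `state_after_restart` are illustrative names and do not exist in the PR, which currently always resumes from the persisted state.

```rust
/// How to treat a command found mid-workflow when the agent restarts.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RestartPolicy {
    /// Re-enter the persisted state; user scripts must be resumable.
    Resume,
    /// Jump to the state's `on_failed` target instead of re-running it.
    FailOnRetry,
}

/// Hypothetical view of a persisted command: its current state and the
/// failure target configured for that state in the workflow.
#[derive(Debug, Clone, PartialEq)]
pub struct PendingCommand {
    pub status: String,
    pub on_failed: String,
}

/// State the command should be in once the agent has restarted.
pub fn state_after_restart(cmd: &PendingCommand, policy: RestartPolicy) -> String {
    match policy {
        RestartPolicy::Resume => cmd.status.clone(),
        RestartPolicy::FailOnRetry => cmd.on_failed.clone(),
    }
}
```

With `Resume`, an interrupted script is executed again from the same state, which is the behaviour retained messages already gave; with `FailOnRetry`, the interruption is surfaced as a failure and the script is not re-run.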
Just leaving some doc rewording suggestions. Happy to approve once the test passes.
There is indeed a race condition making the native reboot test flaky when the system successfully reboots. The issue is that the agent currently ignores SIGTERM signals while executing an operation action. In this specific case, the timeout awaiting the agent to restart delays the restart. As a temporary workaround, I increased the timeout used by the test and have the agent stopped by a SIGKILL. The proper fix is to have the agent stop on SIGTERM. I will address that.
LGTM
As a first step, the command board is not connected to the main actor of the agent. This actor will have to be updated to use the command board as the source of truth instead of MQTT retained messages. The latter will still be used, but only at very specific points in time:
- to init a new command
- to clean up a command that has been fully processed
- to await the response of a peer
- to notify progress made by the agent on each command.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
MQTT role has been reduced to:
- on init message: add a new command to the board
- on cleanup message: remove the command from the board
- observability

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
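The reduced MQTT role described in this commit message can be sketched as a small message handler. This is an illustration, not the PR's code: `Board`, `Outcome`, and `on_message` are hypothetical names, the payload is reduced to a bare status string (a real payload would be structured), and treating an empty payload as the cleanup message is an assumption made for the example.

```rust
use std::collections::HashMap;

/// What the board did with an incoming MQTT message.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Added,
    Removed,
    /// Message kept for observability only; the board is unchanged.
    Ignored,
}

/// Simplified board: command topic -> current status.
#[derive(Default)]
pub struct Board {
    commands: HashMap<String, String>,
}

impl Board {
    pub fn on_message(&mut self, topic: &str, payload: &str) -> Outcome {
        // Assumption: an empty payload is the cleanup message.
        if payload.is_empty() {
            return match self.commands.remove(topic) {
                Some(_) => Outcome::Removed,
                None => Outcome::Ignored,
            };
        }
        // Only an `init` message for an unknown command creates an entry.
        if payload == "init" && !self.commands.contains_key(topic) {
            self.commands.insert(topic.to_string(), payload.to_string());
            return Outcome::Added;
        }
        // Any other status update does not drive the workflow: the board,
        // not the retained message, is the source of truth.
        Outcome::Ignored
    }

    pub fn status(&self, topic: &str) -> Option<&str> {
        self.commands.get(topic).map(String::as_str)
    }
}
```

The point of the design is that a re-delivered retained message (for example an `executing` status seen again after a restart) no longer triggers any action by itself.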
39386b4 to aa10e84
aa10e84 to bd4ff60
Proposed changes
- Change the way MQTT is used to trigger state updates along an operation workflow
- CommandBoard used by the operation supervisor
- CommandBoard is persisted and restored when the agent restarts
- await-agent-restart can be used to validate a state transition on agent restart

Types of changes

Paste Link to the issue

Checklist
- cargo fmt as mentioned in CODING_GUIDELINES
- cargo clippy as mentioned in CODING_GUIDELINES

Further comments