
Add support for operation workflow triggering device restart #2479

Merged
merged 13 commits into thin-edge:main from feat/restart-with-context on Dec 1, 2023

Conversation

didier-wenzek (Contributor) commented Nov 23, 2023

Proposed changes

The following triggers a device restart, leading to `successful_restart` or `failed_restart` depending on the actual outcome of the reboot:

```toml
[scheduled]
script = "restart"
next = ["restarting", "successful_restart", "failed_restart"]
```

The `next` list of states configures the states to which the workflow moves on restart:

  • The first state (`restarting` in the example) is the executing state of the command: until the device restarts, the workflow stays in that state.
  • The second state (`successful_restart` in the example) is the state to which the workflow moves after a successful restart.
  • The third state (`failed_restart` in the example) is the state to which the workflow moves after a failed restart.
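For illustration only, a complete workflow using this three-entry `next` contract might look like the sketch below. The `controlled_restart` operation name and the surrounding states are assumptions for the example, not part of this PR:

```toml
# Illustrative sketch of a workflow using the proposed contract.
# The first `next` entry of the restart step is the executing state;
# the second and third are the success and failure targets.
operation = "controlled_restart"   # hypothetical operation name

[init]
next = ["scheduled"]

[scheduled]
script = "restart"
next = ["restarting", "successful_restart", "failed_restart"]

[successful_restart]
next = ["successful"]

[failed_restart]
next = ["failed"]
```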

Note: this file format has yet to be discussed:

  • An alternative could be to have a specific action property instead of overloading the script property
  • For example: action = "restart --executing restarting --on-success successful_restart --on-error failed_restart"

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

#2478

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

codecov bot commented Nov 23, 2023

Codecov Report

Merging #2479 (b34e5ae) into main (ea254bf) will decrease coverage by 0.3%.
The diff coverage is 54.9%.

Additional details and impacted files
| Files | Coverage Δ |
|---|---|
| crates/core/plugin_sm/src/operation_logs.rs | 78.6% <100.0%> (-0.9%) ⬇️ |
| crates/core/tedge_agent/src/agent.rs | 0.0% <ø> (ø) |
| ...ates/core/tedge_agent/src/restart_manager/tests.rs | 94.1% <100.0%> (+0.2%) ⬆️ |
| ...tes/core/tedge_agent/src/software_manager/tests.rs | 93.7% <100.0%> (+0.9%) ⬆️ |
| ...tes/core/tedge_agent/src/state_repository/error.rs | 0.0% <ø> (ø) |
| ...dge_agent/src/tedge_operation_converter/builder.rs | 90.1% <100.0%> (+0.3%) ⬆️ |
| ...tedge_agent/src/tedge_operation_converter/tests.rs | 91.5% <100.0%> (-0.2%) ⬇️ |
| crates/extensions/c8y_mapper_ext/src/converter.rs | 81.3% <ø> (ø) |
| crates/extensions/c8y_mapper_ext/src/tests.rs | 91.6% <100.0%> (-0.1%) ⬇️ |
| ...rates/extensions/tedge_config_manager/src/actor.rs | 66.2% <ø> (ø) |

... and 11 more

... and 2 files with indirect coverage changes

github-actions bot (Contributor) commented Nov 23, 2023

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % | ⏱️ Duration |
|---|---|---|---|---|---|
| 388 | 0 | 3 | 388 | 100 | 54m23.009999999s |

albinsuresh (Contributor) left a comment:

Now that restart is an action that can be triggered generically from any workflow, multiple workflows (for different operations) may request a restart simultaneously. We'll have to extend the support for that in the future, so that all the requesting operations succeed even though only one of those requests actually restarted the device.

```toml
next = ["scheduled"]

[scheduled]
script = "restart"
```
Contributor commented:

As you highlighted already, I'd also be in favour of introducing a different action key instead of overloading the script key.

didier-wenzek (author) replied:

The proposal is to go with something along these lines:

```toml
trigger = {
    action = "restart",
    on_exec = "restarting",
    on_success = "successful_restart",
    on_error = "failed_restart"
}
```
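A side note on syntax: if these workflow files are TOML (as the other snippets suggest), an inline table spanning several lines is not valid TOML 1.0, so the same proposal would more likely be spelled as a sub-table of the state (key names taken from the proposal above, everything else hypothetical):

```toml
# Same proposal, spelled as a TOML sub-table of the state.
[scheduled.trigger]
action = "restart"
on_exec = "restarting"
on_success = "successful_restart"
on_error = "failed_restart"
```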

Contributor commented:

Why not continue using the next field itself for those successful and failed target states, instead of the new on_success and on_error directives, so that it stays consistent with the rest of the workflow actions?

That on_exec state is still a bit confusing to me. A user reading the workflow definition to make sense of the end-to-end control flow will be a bit lost seeing that restarting state, as there is no other reference to it in the workflow file. I guess I'm still missing something about this contract.

Contributor commented:

As discussed offline, the restarting state being implicit, while the successful_restart and failed_restart states are explicit, was the confusing aspect for me. One way to make things clearer would be to have that restarting state also explicitly defined in the workflow, as follows:

```toml
operation = "controlled_restart"

# ...

[scheduled]
action = "restart"
next = ["restarting"]

[restarting]
action = "check-restart"
next = ["successful_restart", "failed_restart"]

[successful_restart]
script = "/etc/tedge/operations/log-restart.sh ${.topic.cmd_id} ${.payload.status}"
next = ["successful"]

[failed_restart]
script = "/etc/tedge/operations/log-restart.sh ${.topic.cmd_id} ${.payload.status} ${.payload.restartError}"
next = ["failed"]

# ...
```

The issue with this approach is that we're forcing the user to explicitly define additional states as per the internal contract of the restart action. The need to have an explicit restarting state may not be that intuitive to a user.

Another option is to completely hide the restarting state by allowing the agent to auto-generate its own internal restarting states, without exposing them to the user in the workflow files. For example, if the restart action is called from the scheduled state, then the status will be updated from scheduled to scheduled_restarting. If another state xyz calls this action, then it will be moved to the xyz_restarting state. That way, we can avoid state value conflicts as well. The only issue with this approach is that the user will notice these "internal states" that they did not define in their workflow file. Being explicit makes the complete control flow clearer from the workflow definition itself and simplifies the implementation as well.
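The auto-generated internal state idea above can be sketched as follows (hypothetical helper, not tedge-agent code): the agent derives a per-caller waiting state from the state that invoked the restart action, so several restart-triggering states cannot clash.

```python
# Hypothetical sketch of auto-generated internal "restarting" states.

def internal_restarting_state(calling_state: str, user_states: set) -> str:
    """Derive the internal waiting state for a restart triggered from `calling_state`."""
    name = f"{calling_state}_restarting"
    if name in user_states:
        # Defensive: the user happened to define this exact name themselves.
        raise ValueError(f"state name conflict: {name}")
    return name

states = {"init", "scheduled", "successful", "failed"}
print(internal_restarting_state("scheduled", states))  # scheduled_restarting
print(internal_restarting_state("xyz", states))        # xyz_restarting
```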


```toml
[scheduled]
script = "restart"
next = ["restarting", "successful_restart", "failed_restart"]
```
Contributor commented:

I still don't fully get the idea behind the need for that first restarting state in this array. Why not just model it as an independent restarting state transitioning from the scheduled state? That restarting state can then have subsequent successful_restart and failed_restart states?

didier-wenzek (author) replied:

The shortcut I took is indeed confusing. I should not have hijacked the next field, and I will fix that.

What you describe is the correct workflow: moving from scheduled to restarting and then to successful_restart or failed_restart.

The actual purpose is different: these 3 state names are used as aliases for the state names of the restart internal workflow. The states of the latter cannot be used unchanged (as the main workflow already has executing, successful and failed states) nor hard-coded (as the main workflow might invoke many restarts).

Contributor commented:

Why is it wrong to model the same workflow as follows (note the updated definition of the scheduled state):

```toml
operation = "controlled_restart"

[init]
script = "/etc/tedge/operations/log-restart.sh ${.topic.cmd_id} ${.payload.status}"
next = ["scheduled"]

[scheduled]
inbuilt-action = "restart"
next = ["successful_restart", "failed_restart"]

[successful_restart]
script = "/etc/tedge/operations/log-restart.sh ${.topic.cmd_id} ${.payload.status}"
next = ["successful"]

[failed_restart]
script = "/etc/tedge/operations/log-restart.sh ${.topic.cmd_id} ${.payload.status} ${.payload.restartError}"
next = ["failed"]
```

When the tedge-agent processes the inbuilt-action = restart, it knows that it was called in the context of the controlled_restart command, from the scheduled state. It can persist all this information before the restart and once the restart completes, it can just retrieve the state from which restart was invoked, which is scheduled in this case, and then just transition to either of the states defined in its next field for that state.

This is the simplistic view that I have about handling restarts in the context of other operations, by simply adding more states. I'm sure I'm missing something here; I'm just not able to see what it is yet.
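The persist-and-resume flow described above can be sketched as follows (illustrative Python, not the tedge-agent implementation; the file location and context structure are hypothetical): before triggering the reboot, the agent stores the calling workflow context; after the reboot, it reloads the context and transitions to one of the `next` states of the state that triggered the restart.

```python
# Illustrative sketch of persisting the workflow context across a reboot.
import json
import tempfile
from pathlib import Path

STATE_FILE = Path(tempfile.gettempdir()) / "restart-context.json"  # hypothetical location

def persist_context(operation: str, cmd_id: str, state: str) -> None:
    """Persist the calling workflow context before triggering the reboot."""
    ctx = {"operation": operation, "cmd_id": cmd_id, "state": state}
    STATE_FILE.write_text(json.dumps(ctx))

def resume_next_states(workflow: dict) -> list:
    """After the reboot, look up the `next` states of the state that restarted."""
    ctx = json.loads(STATE_FILE.read_text())
    return workflow[ctx["state"]]["next"]

workflow = {"scheduled": {"next": ["successful_restart", "failed_restart"]}}
persist_context("controlled_restart", "c8y-123", "scheduled")
print(resume_next_states(workflow))  # ['successful_restart', 'failed_restart']
```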

didier-wenzek (author) replied:

> When the tedge-agent processes the inbuilt-action = restart, it knows that it was called in the context of the controlled_restart command, from the scheduled state. It can persist all this information before the restart and once the restart completes, it can just retrieve the state from which restart was invoked, which is scheduled in this case, and then just transition to either of the states defined in its next field for that state.

With one exception described below, what you describe is what is actually implemented: the context of the restart (i.e. the whole MQTT message describing the state of the main workflow) is stored by the restart operation and resumed after the reboot.

> This is simplistic view that I have about handling restarts in the context of other operations by simply adding more states. I'm sure I'm missing something here. I'm just not able to see what it is yet.

Moving from a state triggering a restart to a state waiting for the end of the restart is required to avoid a race condition.

With your example, where there is no restarting state, there is a risk of an infinite loop. If the retained message for the scheduled state is read before the restart completion is detected (these are two independent threads), then the workflow will erroneously trigger a new restart.

Contributor commented:

> With your example, where there is no restarting state, there is a risk of an infinite loop. If the retained message for the scheduled state is read before the restart completion is detected (these are two independent threads), then the workflow will erroneously trigger a new restart.

Yeah, now the need for that restarting state is clear. I've documented some of my thoughts on how to make this flow clearer in this comment.

Resolved review threads: crates/core/tedge_agent/src/software_manager/actor.rs, crates/core/tedge_api/src/messages.rs
albinsuresh (Contributor) left a comment:

LGTM.

One confusing thing I found is the inconsistent, synonymous use of the words state and step in the docs and code. The docs primarily use state (although there are a few references to step as well), while the code primarily uses step. Using state consistently would make things easier both for users and in the code; I prefer state over step, as that is what each step in the workflow represents at the end of the day.

Resolved review threads: crates/core/tedge_api/src/messages.rs, crates/core/tedge_api/src/workflow.rs
```
@@ -240,7 +318,24 @@ impl From<&OperationState> for OperationAction {
    // TODO this must be called when an operation is registered, not when invoked.
    fn from(state: &OperationState) -> Self {
        match &state.script {
            Some(script) => OperationAction::Script(script.to_owned()),
            Some(script) if script.command == "restart" => {
```
Contributor commented:

In that follow-up PR where we revisit the workflow file format, we can consider adding a dedicated action or builtin-action key for restart instead of overloading script.

Resolved review thread: docs/src/references/agent/operation-workflow.md
- If the script returns a JSON payload with a `status` field, this status is used as the new status for the command.
- If the script returns successfully, its standard output is used to update the command state payload.
- From this output, only the excerpt between a `:::begin-tedge:::` header and a `:::end-tedge:::` trailer is decoded.
  This is done to ease script authoring: a script can emit arbitrary output on its stdout, and only the tagged excerpt is interpreted.
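The excerpt extraction can be sketched as follows (illustrative Python, not the actual tedge implementation; only the marker strings come from the docs above):

```python
# Extract and decode the tagged excerpt of a script's stdout:
# only the text between :::begin-tedge::: and :::end-tedge:::
# is interpreted as JSON; everything else is ignored.
import json

def extract_tedge_excerpt(stdout: str):
    """Return the decoded JSON excerpt, or None if no tagged block is found."""
    begin, end = ":::begin-tedge:::", ":::end-tedge:::"
    start = stdout.find(begin)
    if start < 0:
        return None
    stop = stdout.find(end, start + len(begin))
    if stop < 0:
        return None
    return json.loads(stdout[start + len(begin):stop])

out = "apt noise...\n:::begin-tedge:::\n{\"status\":\"success\"}\n:::end-tedge:::\n"
print(extract_tedge_excerpt(out))  # {'status': 'success'}
```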
Contributor commented:

Initially I was wondering why we couldn't just parse the last { and } chars instead of :::begin-tedge::: and :::end-tedge:::. But I understand that this simplifies our parsing logic if there are multiple JSON objects printed by earlier commands. Still, picking up the last JSON object should be okay, as I can't imagine a script performing more steps after printing this status message, which is supposed to be the last thing it does.

This simplification can be done later as well, as this contract is also acceptable. But the lighter the contract, the better.

This is a critical struct for operation persistence
that must be as simple as possible.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Was implemented only once.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Previously the operation under execution was stored using an ad hoc
format. Now the command state is stored in the same format as for the
cmd/+/+ topics. This simplifies operation handling (for instance, there is
no longer a specific *Restarting* status for the restart command: this state
is stored as *executing*).

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
The current state of the command triggering the restart
is stored in the restart command. So, after reboot, the former command
can resume in the appropriate following state.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
The log cleanup logic (in /var/tedge/log/agent) was removing
workflow log files (due to false assumptions on file names).

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
The workflow-relevant part of a script's output has to be surrounded
by a header and a footer, as in:

```
:::begin-tedge:::
{"status":"success"}
:::end-tedge:::
```

The goal is to facilitate the creation of operation scripts.
Previously any garbage printed on stdout by one of the script
dependencies was corrupting the thin-edge related output.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
As an operation workflow can be customized,
notably with user-specific states, the mapper should accept these
unknown states, doing nothing specific but without emitting errors.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
@didier-wenzek didier-wenzek merged commit 1c9e284 into thin-edge:main Dec 1, 2023
18 checks passed
@didier-wenzek didier-wenzek deleted the feat/restart-with-context branch December 1, 2023 08:46
didier-wenzek mentioned this pull request on Dec 1, 2023
gligorisaev (Contributor) commented:

QA has thoroughly checked the feature; here are the results:

  • A test for the ticket exists in the test suite: tests/RobotFramework/tests/tedge_agent/workflows/custom_operation.robot
  • QA has tested the feature and it functions according to the description.
