Improve operation workflow definition #2496

didier-wenzek · 2023-12-01T19:09:01Z

Proposed changes

Proposal for improvements on how to specify a custom workflow:

Next step determined by script exit status
Next step determined by script output
Using a script to trigger a restart
New syntax for restart action
Setting script execution timeout
Deprecating the owner and next fields

Types of changes

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
Documentation Update (if none of the other choices apply)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

#2478

Checklist

I have read the CONTRIBUTING doc
I have signed the CLA (in all commits with git commit -s)
I ran cargo fmt as mentioned in CODING_GUIDELINES
I used cargo clippy as mentioned in CODING_GUIDELINES
I have added tests that prove my fix is effective or that my feature works
I have added necessary documentation (if appropriate)

Further comments

codecov · 2023-12-01T19:19:05Z

Codecov Report

Merging #2496 (dc06274) into main (2935e42) will increase coverage by 0.0%.
The diff coverage is 73.4%.

Additional details and impacted files

| Files | Coverage Δ | |
|---|---|---|
| crates/core/tedge_api/src/messages.rs | `83.4% <0.0%> (ø)` | |
| crates/core/tedge_api/src/workflow/error.rs | `0.0% <0.0%> (ø)` | |
| crates/core/tedge_agent/src/agent.rs | `0.0% <0.0%> (ø)` | |
| crates/core/tedge_api/src/workflow/supervisor.rs | `65.9% <65.9%> (ø)` | |
| ...tedge_agent/src/tedge_operation_converter/actor.rs | `46.8% <0.0%> (-3.6%)` | ⬇️ |
| crates/core/tedge_api/src/workflow/mod.rs | `61.1% <61.1%> (ø)` | |
| crates/core/tedge_api/src/workflow/state.rs | `77.7% <77.7%> (ø)` | |
| crates/core/tedge_api/src/workflow/script.rs | `81.9% <81.9%> (ø)` | |
| crates/core/tedge_api/src/workflow/toml_config.rs | `70.4% <70.4%> (ø)` | |

... and 2 files with indirect coverage changes

albinsuresh

The proposed enhancements look good. Here are a few more enhancements that I can think of (a few taken from older PRs, just for the sake of having everything documented in one place):

- Including the previous state in the payload when a state transition occurs. This helps in defining generic states, like a timeout or failure state, that simply report the state from which the failure happened. It would also help in defining cleanup/rollback logic in those states based on the previous state.
- The ability to execute custom scripts as part of the terminal states, successful and failed. This would enable users to attach post-workflow cleanup logic to these states, such as clearing temporary resources or clearing the command status. Combined with the previous-state info, these scripts can be even smarter.

albinsuresh · 2023-12-05T07:43:28Z

docs/src/references/agent/operation-workflow.md

+
+["waiting-for-restart"]
+builtin_action = "waiting-for-restart"


One additional benefit that I see with having this state explicitly is the freedom that the user gets to override the inbuilt waiting-for-restart validation logic as well.

Yes, this would be nicer. I will give this representation a try to see its impact on the code.

albinsuresh · 2023-12-05T07:46:58Z

docs/src/references/agent/operation-workflow.md

+
+Cons:
+- less specific than states with specific purpose as on `on_success` or `on_exit.1`


Despite the redundancy, I really liked the explicitness of keywords like on_success and on_failure, especially since we have added more variants like on_kill and on_timeout. The moment you have more than two values in that next array, it becomes unclear which value is used in which case.

So let's go in that direction. Removing the next field from the configuration file.

albinsuresh · 2023-12-05T08:08:15Z

docs/src/references/agent/operation-workflow.md

Suggested change

-timeout_second = 300
-on_timeout = { status = "failed", reason = "timeout" }
+on_timeout = { duration_seconds = 300, status = "failed", reason = "timeout" }

How about merging them so that everything related to timeouts is in one place?

github-actions · 2023-12-05T09:06:56Z

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % | ⏱️ Duration |
|---|---|---|---|---|---|
| 369 | 0 | 3 | 369 | 100 | 50m10.534s |

didier-wenzek · 2023-12-05T09:31:37Z

> Including the previous state also in the payload when a state transition occurs. This helps in defining generic states like a timeout or failure state that can be used to just report the state from which the failure happened. It would also help in defining any cleanup/rollback logic in those states based on the previous state.

Why not. Adding an updated-at timestamp might also be useful to control timeouts.

This has to be considered in a larger scope to fix that issue: #2495.

> Ability to execute custom scripts as part of terminal states: successful and failed: This would enable users to attach any post-workflow cleanup logic to these states, like clearing any temp resources, clearing the command status etc. When combined with the previous-state info, they can be even smarter.

Here, I'm less convinced. Sure, we need to fix #2484, but attaching an action to these states leads to a race condition: both the agent and the mapper will react to them, and the mapper clears the state once done.

My proposal is to augment the builtin workflows with intermediate states, e.g. successful-software-update and failed-software-update, with default transitions to successful and failed. These default transitions can then be overridden to include clean-up or rollback logic, as you suggest. The difference is a clear ownership: successful-software-update and failed-software-update are handled by the agent, and successful and failed by the mapper.

crates/core/tedge_api/src/workflow/toml_config.rs

albinsuresh

The code looks good. A few missing points that I noticed:

- Documentation of the on_stdout handler.
- Even though the default exit handler feature is mentioned in the timeout handling section, a dedicated section would be nice, highlighting that default handlers can be defined for any exit case.

albinsuresh · 2023-12-07T13:12:13Z

crates/core/tedge_api/src/workflow/toml_config.rs

+/// - `on_success` and `on_exit.0` are synonyms and cannot be both provided
+/// - `on_error` and `on_exit._` are synonyms and cannot be both provided
+/// - `on_success` and `on_stdout` are incompatible, as the next state is either determined from the script stdout or its exit codes
+/// - `on_exec` is only meaningful in the context of a background script or a builtin action


Not sure if that restriction is really needed. Might come in handy before a long running script is triggered.

Sure. But this would be more along the lines of what is proposed here: https://github.com/thin-edge/thin-edge.io/pull/2496/files#diff-0bc2ab86d1dc663436d9764c1ee3b48333013f3d52ce410aac0af84705f0bf7cR364, i.e. to have two states:

```toml
["agent-restart"]
background_script = "sudo systemctl restart tedge-agent"
on_exec = "waiting-for-restart"

["waiting-for-restart"]
script = "/some/script.sh checking restart"
on_success = "successful_restart"
on_error = "failed_restart"
on_timeout = "waiting-for-restart"
```
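The compatibility rules quoted in the doc comment of this thread can be sketched as a small check. This is only an illustration under assumptions: `Handlers`, `validate`, the field names, and the error messages are hypothetical, handlers are reduced to plain state names, and this is not the actual `toml_config.rs` code.

```rust
/// Hypothetical, simplified view of the handlers attached to one state.
/// Each handler is reduced to the name of the next state.
#[derive(Default)]
struct Handlers {
    on_success: Option<String>,
    on_exit_0: Option<String>,        // stands for `on_exit.0`
    on_error: Option<String>,
    on_exit_wildcard: Option<String>, // stands for `on_exit._`
    on_stdout: Vec<String>,
    on_exec: Option<String>,
    is_background: bool, // true for a background script or a builtin action
}

/// Apply the four rules of the doc comment and return the first violation.
fn validate(h: &Handlers) -> Result<(), String> {
    if h.on_success.is_some() && h.on_exit_0.is_some() {
        return Err("on_success and on_exit.0 are synonyms".to_string());
    }
    if h.on_error.is_some() && h.on_exit_wildcard.is_some() {
        return Err("on_error and on_exit._ are synonyms".to_string());
    }
    if h.on_success.is_some() && !h.on_stdout.is_empty() {
        return Err("on_success and on_stdout are incompatible".to_string());
    }
    if h.on_exec.is_some() && !h.is_background {
        return Err("on_exec requires a background script or a builtin action".to_string());
    }
    Ok(())
}
```

With this sketch, a state declaring both `on_success` and `on_exit.0` is rejected, while the two-state restart example above passes for each of its states.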

albinsuresh · 2023-12-07T13:20:32Z

crates/core/tedge_api/src/workflow/toml_config.rs

+            input.handlers.on_timeout,
+            Some(TomlStateUpdate::Simple("timeout".to_string()))
+        );
+    }


I'm guessing you forgot to assert the rest of the states.

albinsuresh · 2023-12-07T13:33:00Z

crates/core/tedge_api/src/workflow/script.rs

+    on_error: Option<GenericStateUpdate>,
+    on_kill: Option<GenericStateUpdate>,
+    on_exit: Vec<(u8, u8, GenericStateUpdate)>,
+    on_stdout: Vec<String>,


Are we making the user list the expected state values here just so that it is easier to figure out the entire control flow of a workflow without having to audit the scripts? Or are there other benefits as well?

I see that they are unused now. Are we planning to validate that the output of the script is definitely one of the values provided in this list?

> Are we making the user list out the expected state values here, just so that it is easier to figure out the entire control flow of a workflow, without having to audit the scripts? Or there are some other benefits as well?

Being able to compute the list of possible following states when in a given state will be key to solving #2495.

> I see that they are unused now. Are we planning to validate that the output of the script is definitely one of the values provided in this list?

Indeed, not used for now. My former plan was to fail a command when moving to some unexpected state. But to solve the former issue we might have to be more subtle.
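To make the discussion concrete, here is a minimal sketch of how the `on_exit` and `on_stdout` fields could be consulted. It rests on assumptions that the diff does not confirm: each `on_exit` entry is read as an inclusive `(from, to)` range of exit codes, `StateUpdate` is a hypothetical stand-in for `GenericStateUpdate`, and both function names are invented for illustration.

```rust
/// Hypothetical stand-in for `GenericStateUpdate`: just a status name.
#[derive(Debug, PartialEq)]
struct StateUpdate {
    status: String,
}

/// Find the handler for an exit code, assuming each entry covers the
/// inclusive range `from..=to` (an assumption of this sketch).
fn handler_for_exit_code(on_exit: &[(u8, u8, StateUpdate)], code: u8) -> Option<&StateUpdate> {
    on_exit
        .iter()
        .find(|(from, to, _)| (*from..=*to).contains(&code))
        .map(|(_, _, update)| update)
}

/// Check that a status printed by the script is one of the states
/// declared in `on_stdout`.
fn is_expected_state(on_stdout: &[String], status: &str) -> bool {
    on_stdout.iter().any(|expected| expected == status)
}
```

A check like `is_expected_state` is what failing a command on an unexpected state would rely on; as noted above, the list is not used for validation yet.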

albinsuresh · 2023-12-07T13:47:38Z

docs/src/references/agent/operation-workflow.md

Why restrict that overriding rule to non-successful status codes alone? It can be applied to all exit cases, right? If an explicit exit handler is provided in the workflow for any exit code, including the successful ones, it trumps the status returned in the output.

Yes, definitely.

Currently, it is even stronger: if the script is not successful, its output is ignored.

But maybe this is not what you have in mind. Do you think consuming the output can also be useful in case of exit code with a specific handler? Somehow these are expected errors.

After discussion, it appears that things are a bit more subtle:

- the status given in the workflow definition trumps any status value published on stdout (the idea being that the sequence of states is statically defined);
- the reason published on stdout trumps the reason provided in the workflow definition (the rationale being that an error message is dynamically provided by the running process);
- these rules apply only to the exit codes which are somehow expected, i.e. those with a defined on_exit handler.
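These precedence rules can be sketched as a merge function. This is a hedged illustration, not the merged implementation: `StateUpdate` and `next_state` are hypothetical names, and the fallback used when no handler matches the exit code is an assumption, since the rules above only cover expected exit codes.

```rust
/// Hypothetical stand-in for a state update: a status and an optional reason.
#[derive(Debug, Clone, PartialEq)]
struct StateUpdate {
    status: String,
    reason: Option<String>,
}

/// Combine the handler defined in the workflow with what the script printed.
/// - `handler`: the `on_exit` handler matching the exit code, if any.
/// - `stdout`: the update parsed from the script output, if any.
fn next_state(handler: Option<&StateUpdate>, stdout: Option<&StateUpdate>) -> Option<StateUpdate> {
    match (handler, stdout) {
        // Expected exit code: the status is static, the reason is dynamic.
        (Some(h), Some(out)) => Some(StateUpdate {
            status: h.status.clone(),
            reason: out.reason.clone().or_else(|| h.reason.clone()),
        }),
        // Expected exit code, nothing usable on stdout: use the handler as is.
        (Some(h), None) => Some(h.clone()),
        // No handler for this exit code: the rules do not apply.
        // Falling back to the stdout update is an assumption of this sketch.
        (None, out) => out.cloned(),
    }
}
```

For example, a handler `{ status = "failed", reason = "timeout" }` combined with a script output carrying another status and the reason `"disk full"` yields the status `failed` with the reason `disk full`.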

albinsuresh · 2023-12-13T05:47:01Z

crates/core/tedge_agent/src/agent.rs

@@ -341,8 +340,7 @@ async fn read_operation_workflow(path: &Path) -> Result<OperationWorkflow, anyho
    let context = || format!("Reading operation workflow from {path:?}");
    let bytes = tokio::fs::read(path).await.with_context(context)?;
    let input = std::str::from_utf8(&bytes).with_context(context)?;
-    let toml = toml::from_str::<TomlOperationWorkflow>(input)?; //.with_context(context)?;
-    let workflow = TryInto::<OperationWorkflow>::try_into(toml).with_context(context)?;
+    let workflow = toml::from_str::<OperationWorkflow>(input)?; //.with_context(context)?;


Was commenting out that with_context intentional?

I will fix that. Adding a context here is in fact worse: not only is there no context, but the root error is also lost and not logged.

As highlighted by @jarhodes314 the main issue is elsewhere: https://github.com/thin-edge/thin-edge.io/pull/2496/files#diff-03553c8b7afdbd134b09403b3f4a3931fee4c368f4a4f8a0c38bfdacbacc5018R330. One needs {err:?} as anyhow only shows the full error message in Debug, not Display.
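The Display versus Debug point can be illustrated without the anyhow crate, using only the standard `Error::source` chain. This is an analogy under assumptions: `WithContext` and `full_chain` are hypothetical names, and the `caused by:` layout is invented here rather than copied from anyhow's output.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error wrapping a root cause with a context message.
#[derive(Debug)]
struct WithContext {
    context: String,
    source: Box<dyn Error + 'static>,
}

impl fmt::Display for WithContext {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Display prints the context only, not the root cause.
        write!(f, "{}", self.context)
    }
}

impl Error for WithContext {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

/// Render an error followed by all of its causes, one per line.
fn full_chain(err: &dyn Error) -> String {
    let mut lines = vec![err.to_string()];
    let mut current = err.source();
    while let Some(cause) = current {
        lines.push(format!("caused by: {cause}"));
        current = cause.source();
    }
    lines.join("\n")
}
```

Logging `err.to_string()` alone drops the root cause, which is the symptom discussed in this thread; walking the chain, as anyhow does when formatting with `{err:?}`, keeps it.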

albinsuresh · 2023-12-13T05:58:58Z

docs/src/references/agent/operation-workflow.md

 script = "/some/script.sh checking restart"
-next = ["waiting-for-restart", "successful_restart", "failed_restart"]
+on_stdout = ["waiting-for-restart", "successful_restart", "failed_restart"]


Suggested change

-on_stdout = ["waiting-for-restart", "successful_restart", "failed_restart"]
+on_stdout = ["successful_restart", "failed_restart"]

Expecting the same state that the workflow is currently in is highly unlikely, right?

Removed.

However, it is not so unlikely, even if it's surely better to have a blocking wait with a timeout. This is the main issue with restart: one needs a way to detect a restart, and this check must be patient (as the restart might still be pending) and robust (as the restart might fail).

albinsuresh · 2023-12-13T06:13:53Z

docs/src/references/agent/operation-workflow.md

+
+```toml
+["device-restart"]
+builtin_action = "restart"


Suggested change

-builtin_action = "restart"
+background_script = "sudo systemctl restart tedge-agent"

Since builtin_action is only introduced in the next section, this might be better for continuity.

The issue is that the former works but not the latter. More precisely, using a background script raises the question of how to detect success and failure.

albinsuresh · 2023-12-13T06:23:42Z

docs/src/references/agent/operation-workflow.md

+```toml
+["device-restart"]
+action = "restart"
+on_exec = "waiting-for-restart"
+on_success = "successful_restart"
+on_error = "failed_restart"
+```
+
+Alternative proposal:
+
+```toml
+["device-restart"]
+action = "restart"
+on_exec = "waiting-for-restart"
+
+["waiting-for-restart"]
+action = "waiting restart"
+on_success = "successful_restart"
+on_error = "failed_restart"
+```


Suggested change

-```toml
-["device-restart"]
-action = "restart"
-on_exec = "waiting-for-restart"
-on_success = "successful_restart"
-on_error = "failed_restart"
-```
-
-Alternative proposal:
-
-```toml
-["device-restart"]
-action = "restart"
-on_exec = "waiting-for-restart"
-
-["waiting-for-restart"]
-action = "waiting restart"
-on_success = "successful_restart"
-on_error = "failed_restart"
-```
+```toml
+["reboot_required"]
+action = "restart"
+on_exec = "restarting"
+
+["restarting"]
+action = "waiting restart"
+on_success = "successful_restart"
+on_error = "failed_restart"
+```

Removed the first proposal as we settled on the second one. Also adjusted it to reflect the state names mentioned in the description above.

Unfortunately, the second is not implemented yet, the main reason being that the "waiting restart" action needs more thinking. The failing system test ("Trigger native-reboot within workflow (on_error) - missing sudoers entry for reboot") highlights one of these points: the behavior in case of an error is not clear.

albinsuresh · 2023-12-13T06:26:52Z

docs/src/references/agent/operation-workflow.md

+
+```toml
+["<state>"]
+action = "waiting <delegate>"


That delegate needs some documentation. In the restart example it was the restart keyword. Would these be a set of predefined keywords like that, or can it be any text that represents an external process?

The simplest is to remove that from the doc for now.

Something is wrong / missing / unclear when we combine this with error detection. See 4db304c

albinsuresh · 2023-12-13T06:33:27Z

docs/src/references/agent/operation-workflow.md

+The action for the `"init"` state is a `"proceed"` action, meaning nothing specific is done by the __tedge-agent__
+and that a user can provide its own implementation.


Suggested change

-The action for the `"init"` state is a `"proceed"` action, meaning nothing specific is done by the __tedge-agent__
-and that a user can provide its own implementation.
+The action for the `"init"` state is a `"proceed"` action, meaning nothing specific is done by the __tedge-agent__.

The proceed action directly moves the workflow to the on_success state, right? So, I don't see how the user can provide his own implementation while using proceed.

This can be changed by providing a new workflow definition as done here: https://github.com/thin-edge/thin-edge.io/pull/2496/files#diff-0bc2ab86d1dc663436d9764c1ee3b48333013f3d52ce410aac0af84705f0bf7cR491

albinsuresh · 2023-12-13T06:45:57Z

crates/core/tedge_api/src/workflow/script.rs

+        self.timeout
+    }
+
+    pub fn forceful_timeout(&self) -> Option<Duration> {


Suggested change

-pub fn forceful_timeout(&self) -> Option<Duration> {
+pub fn forceful_timeout_extension(&self) -> Option<Duration> {

Changed.

But honestly, I don't think a name has to convey all the details of a contract.

didier-wenzek · 2023-12-13T12:44:02Z

The following commits will be removed:

Indeed, adding an on_error handler alongside an on_exec handler needs clarification: it creates two threads of execution when an error is detected while executing the script. This breaks the main assumption of a single thread of execution per command instance.

albinsuresh

LGTM

For now, this is done only for scripts. For actions, as a device reboot, where the agent might restart, one needs to persist on-disk the current state of each command including a last-updated-at timestamp. Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

didier-wenzek · 2023-12-13T14:33:00Z

This PR improves the way an operation workflow can be specified by a user.

gligorisaev · 2023-12-14T10:04:27Z

QA has thoroughly checked the feature and here are the results:

- A test for the ticket exists in the test suite.
- QA has tested the function and it works according to the description.

albinsuresh reviewed Dec 5, 2023

reubenmiller reviewed Dec 7, 2023

albinsuresh reviewed Dec 7, 2023

didier-wenzek requested review from reubenmiller, albinsuresh and jarhodes314 December 11, 2023 17:58

albinsuresh reviewed Dec 13, 2023

albinsuresh approved these changes Dec 13, 2023

didier-wenzek added 12 commits December 13, 2023 14:15

Improve error message on syntax error in workflow.toml

69ecf69

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Define workflow.toml content

87bcd3b

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Promote workflow module to directory

6667631

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Add shell script exit handlers

6a0a5de

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Move TOML configuration related definitions in a sub-module

ba01f3e

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Implement new TOML representation of operation workflow

83fdf23

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

New operation workflow definition format

8ac70c9

- [x] script with exit handlers
- [x] builtin actions
- [ ] timeouts
- [ ] finalize naming

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Polish workflow API docs and tests

587a9ef

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Impl background script for operation workflow

f690a55

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Polish Operation Workflow documentation

ef6bcb5

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Add default timeout

dc06274

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

didier-wenzek force-pushed the improve/operation-workflow-definition branch from 69320f7 to dc06274 Compare December 13, 2023 13:20

didier-wenzek merged commit 1dadf29 into thin-edge:main Dec 13, 2023
18 checks passed

didier-wenzek deleted the improve/operation-workflow-definition branch December 13, 2023 14:13

didier-wenzek mentioned this pull request Dec 13, 2023

Finalize Operation Workflow #2478

Closed

15 tasks

didier-wenzek assigned gligorisaev Dec 13, 2023

reubenmiller mentioned this pull request Dec 15, 2023

Move config_update file download from tedge-mapper-c8y to tedge-agent #2511

Merged

12 tasks

