Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve operation workflow definition #2496

Merged

Conversation

didier-wenzek
Copy link
Contributor

@didier-wenzek didier-wenzek commented Dec 1, 2023

Proposed changes

Proposal for improvements on how to specify a custom workflow:

  • Next step determined by script exit status
  • Next step determined by script output
  • Using a script to trigger a restart
  • New syntax for restart action
  • Setting script execution timeout
  • Deprecating the owner and next fields

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

#2478

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

Copy link

codecov bot commented Dec 1, 2023

Codecov Report

Merging #2496 (dc06274) into main (2935e42) will increase coverage by 0.0%.
The diff coverage is 73.4%.

Additional details and impacted files
Files Coverage Δ
crates/core/tedge_api/src/messages.rs 83.4% <0.0%> (ø)
crates/core/tedge_api/src/workflow/error.rs 0.0% <0.0%> (ø)
crates/core/tedge_agent/src/agent.rs 0.0% <0.0%> (ø)
crates/core/tedge_api/src/workflow/supervisor.rs 65.9% <65.9%> (ø)
...tedge_agent/src/tedge_operation_converter/actor.rs 46.8% <0.0%> (-3.6%) ⬇️
crates/core/tedge_api/src/workflow/mod.rs 61.1% <61.1%> (ø)
crates/core/tedge_api/src/workflow/state.rs 77.7% <77.7%> (ø)
crates/core/tedge_api/src/workflow/script.rs 81.9% <81.9%> (ø)
crates/core/tedge_api/src/workflow/toml_config.rs 70.4% <70.4%> (ø)

... and 2 files with indirect coverage changes

Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed enhancements look good. Here are a few more enhancements that I can think of (a few taken from older PRs, just for the sake of having everything documented at one place):

  1. Including the previous state also in the payload when a state transition occurs. This helps in defining generic states like a timeout or failure state that can be used to just report the state from which the failure happened. It would also help in defining any cleanup/rollback logics in those states based on the previous state.
  2. Ability to execute custom scripts as part of terminal states: successful and failed: This would enable users to attach any post-workflow cleanup logics to these states, like clearing any temp resources, clearing the command status etc. When combined with the previous-state info, they can be even smarter.

docs/src/references/agent/operation-workflow.md Outdated Show resolved Hide resolved

["waiting-for-restart"]
builtin_action = "waiting-for-restart"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One additional benefit that I see with having this state explicitly is the freedom that the user gets to override the inbuilt waiting-for-restart validation logic as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this would be nicer. I will give a try to this representation to see the impact on the code.


Cons:
- less specific than states with specific purpose as on `on_success` or `on_exit.1`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite the redundancy, I really liked the explicitness of the keywords like on_success, on_failure etc, esp since we have added more variants like on_kill, on_timeout etc. The moment you have more than 2 values in that next array, it starts to become unclear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So let's go in that direction. Removing the next field from the configuration file.

docs/src/references/agent/operation-workflow.md Outdated Show resolved Hide resolved
Comment on lines 419 to 420
on_timeout = { status = "failed", reason = "timeout" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
timeout_second = 300
on_timeout = { status = "failed", reason = "timeout" }
on_timeout = { duration_seconds = 300, status = "failed", reason = "timeout" }

How about merging them so that everything related to timeouts are at one place?

Copy link
Contributor

github-actions bot commented Dec 5, 2023

Robot Results

✅ Passed ❌ Failed ⏭️ Skipped Total Pass % ⏱️ Duration
369 0 3 369 100 50m10.534s

@didier-wenzek
Copy link
Contributor Author

  1. Including the previous state also in the payload when a state transition occurs. This helps in defining generic states like a timeout or failure state that can be used to just report the state from which the failure happened. It would also help in defining any cleanup/rollback logics in those states based on the previous state.

Why not. Adding an updated-at timestamp might also be useful to control timeouts.

This has to be considered in a larger scope to fix that issue: #2495.

  1. Ability to execute custom scripts as part of terminal states: successful and failed: This would enable users to attach any post-workflow cleanup logics to these states, like clearing any temp resources, clearing the command status etc. When combined with the previous-state info, they can be even smarter.

Here, I'm less convinced. Sure we need to fix #2484,
but attaching an action to these states leads to a race condition, as both the agent and the mapper will react on those and the mapper clearing the state once done.

My proposal, is to augment the builtin workflows with intermediate states, e.g. successful-software-update and failed-software-update with default transitions to successful and failed. These default transitions can then be overwritten to include clean-up or rollback logic, as you suggest. The difference is a clear ownership: successful-software-update and failed-software-update are handled by the agent, and successful and failed by the mapper.

Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good. A few missing points that I noticed:

  1. Documentation of the on_stdout handler
  2. Even though the default exit handler feature is mentioned in the timeout handling section, a dedicated section would be nice, highlighting that default handlers can be defined for any exit case.

/// - `on_success` and `on_exit.0` are are synonyms and cannot be both provided
/// - `on_error` and `on_exit._` are are synonyms and cannot be both provided
/// - `on_success` and `on_stdout` are incompatible, as the next state is either determined from the script stdout or its exit codes
/// - `on_exec` is only meaningful in the context of a background script or a builtin action
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if that restriction is really needed. Might come in handy before a long running script is triggered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. But this would be more something along what is proposed here: https://github.com/thin-edge/thin-edge.io/pull/2496/files#diff-0bc2ab86d1dc663436d9764c1ee3b48333013f3d52ce410aac0af84705f0bf7cR364

.i.e. to have two states.

["agent-restart"]
background_script = "sudo systemctl restart tedge-agent"
on_exec = "waiting-for-restart"

["waiting-for-restart"]
script = "/some/script.sh checking restart"
on_success = "successful_restart"
on_error =  "failed_restart"
on_timeout = "waiting-for-restart"

input.handlers.on_timeout,
Some(TomlStateUpdate::Simple("timeout".to_string()))
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing you forgot to assert the rest of the states.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops!

on_error: Option<GenericStateUpdate>,
on_kill: Option<GenericStateUpdate>,
on_exit: Vec<(u8, u8, GenericStateUpdate)>,
on_stdout: Vec<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we making the user list out the expected state values here, just so that it is easier to figure out the entire control flow of a workflow, without having to audit the scripts? Or there are some other benefits as well?

I see that they are unused now. Are we planning to validate that the output of the script is definitely one of the values provided in this list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we making the user list out the expected state values here, just so that it is easier to figure out the entire control flow of a workflow, without having to audit the scripts? Or there are some other benefits as well?

Being able to compute the list of possible following states when in a given state will be a key to solve #2495.

I see that they are unused now. Are we planning to validate that the output of the script is definitely one of the values provided in this list?

Indeed, not used for now. My former plan was to fail a command when moving to some unexpected state. But to solve the former issue we might have to be more subtle.

Comment on lines 351 to 352
the `on_error` definition trumps over any `status` and `reason` fields provided over the script stdout.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why restrict that overriding rule to non-successful status codes alone? It can be applied for all exit cases, right? If an explicit exist handler is provided in the workflow, for any exit code, including the successful ones, it trumps over the status returned in the output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, definitely.

  • Currently, even stronger. If the script is non successful its output is ignored.
  • But maybe this is not what you have in mind. Do you think consuming the output can also be useful in case of exit code with a specific handler? Somehow these are expected errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion, it appears that things are a bit more subtle:

  • the status given in the workflow definition trumps any status value published on stdout (the idea is that sequence of states is statically defined)
  • the reason published on stdout trumps the reason provided in the workflow definition (the rational being that an error message is dynamically provided by the running process).
  • these rules applies only for the exit codes which are somehow expected .i.e. with a defined on_exit handler.

docs/src/references/agent/operation-workflow.md Outdated Show resolved Hide resolved
docs/src/references/agent/operation-workflow.md Outdated Show resolved Hide resolved
@@ -341,8 +340,7 @@ async fn read_operation_workflow(path: &Path) -> Result<OperationWorkflow, anyho
let context = || format!("Reading operation workflow from {path:?}");
let bytes = tokio::fs::read(path).await.with_context(context)?;
let input = std::str::from_utf8(&bytes).with_context(context)?;
let toml = toml::from_str::<TomlOperationWorkflow>(input)?; //.with_context(context)?;
let workflow = TryInto::<OperationWorkflow>::try_into(toml).with_context(context)?;
let workflow = toml::from_str::<OperationWorkflow>(input)?; //.with_context(context)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting out that with_context was intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix that. Adding a context here is in fact worse: not only there is no context but also the root error is lost and not logged.

As highlighted by @jarhodes314 the main issue is elsewhere: https://github.com/thin-edge/thin-edge.io/pull/2496/files#diff-03553c8b7afdbd134b09403b3f4a3931fee4c368f4a4f8a0c38bfdacbacc5018R330. One needs {err:?} as anyhow only shows the full error message in Debug, not Display.

script = "/some/script.sh checking restart"
next = ["waiting-for-restart", "successful_restart", "failed_restart"]
on_stdout = ["waiting-for-restart", "successful_restart", "failed_restart"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
on_stdout = ["waiting-for-restart", "successful_restart", "failed_restart"]
on_stdout = ["successful_restart", "failed_restart"]

Expecting the same state that the workflow is currently in, is highly unlikely, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

However, not so unlikely even it's surely better to have a blocking wait with a timeout. This is the main issue with restart : one needs a way to detect a restart and this check must be patient (as the restart might be still pending) and robust (as the restart might fail).


```toml
["device-restart"]
builtin_action = "restart"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
builtin_action = "restart"
background_script = "sudo systemctl restart tedge-agent"

Since builtin_action is only introduced in the next section, this might be better for continuity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that the former works but not the latter. More precisely, using a background script raises the question on how to detect success and failure.

Comment on lines 409 to 428
```toml
["device-restart"]
action = "restart"
on_exec = "waiting-for-restart"
on_success = "successful_restart"
on_error = "failed_restart"
```

Alternative proposal:

```toml
["device-restart"]
action = "restart"
on_exec = "waiting-for-restart"

["waiting-for-restart"]
action = "waiting restart"
on_success = "successful_restart"
on_error = "failed_restart"
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
```toml
["device-restart"]
action = "restart"
on_exec = "waiting-for-restart"
on_success = "successful_restart"
on_error = "failed_restart"
```
Alternative proposal:
```toml
["device-restart"]
action = "restart"
on_exec = "waiting-for-restart"
["waiting-for-restart"]
action = "waiting restart"
on_success = "successful_restart"
on_error = "failed_restart"
```
```toml
["reboot_required"]
action = "restart"
on_exec = "restarting"
["restarting"]
action = "waiting restart"
on_success = "successful_restart"
on_error = "failed_restart"

Removed the first proposal as we settled with the second one. Also adjusted it to reflect the state names mentioned in the description above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, the second is not implemented yet. The main reason being that the "waiting restart" needs more thinking. The failing system test (aka "Trigger native-reboot within workflow (on_error) - missing sudoers entry for reboot"), highlights one of these points: the behavior in case of an error is not clear.


```toml
["<state>"]
action = "waiting <delegate>"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That delegate needs some doc. In the restart example it was the restart keyword. These would be a set of pre-defined keywords like that or it can be any text that represents an eternal process?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The simplest is to remove that from the doc for now.

Something is wrong / missing / unclear when we combine this with error detection. See 4db304c

docs/src/references/agent/operation-workflow.md Outdated Show resolved Hide resolved
Comment on lines +479 to +459
The action for the `"init"` state is a `"proceed"` action, meaning nothing specific is done by the __tedge-agent__
and that a user can provide its own implementation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The action for the `"init"` state is a `"proceed"` action, meaning nothing specific is done by the __tedge-agent__
and that a user can provide its own implementation.
The action for the `"init"` state is a `"proceed"` action, meaning nothing specific is done by the __tedge-agent__.

The proceed action directly moves the workflow to the on_success state, right? So, I don't see how the user can provide his own implementation while using proceed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.timeout
}

pub fn forceful_timeout(&self) -> Option<Duration> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn forceful_timeout(&self) -> Option<Duration> {
pub fn forceful_timeout_extension(&self) -> Option<Duration> {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

But honestly, I don't think a name has to convey all the details of a contract.

@didier-wenzek
Copy link
Contributor Author

The following commits will be removed:

Indeed, adding an on_error handler along an on_exec handler needs clarification. Indeed, this creates two threads of execution when an error is detected executing the script. This breaks the main assumption of a single thread of execution per command instance.

Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
- [x] script with exit handlers
- [x] builtin actions
- [ ] timeouts
- [ ] finalize naming

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
For now, this is done only for scripts.
For actions, as a device reboot, where the agent might restart,
one needs to persist on-disk the current state of each command
including a last-updated-at timestamp.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
@didier-wenzek didier-wenzek force-pushed the improve/operation-workflow-definition branch from 69320f7 to dc06274 Compare December 13, 2023 13:20
@didier-wenzek didier-wenzek merged commit 1dadf29 into thin-edge:main Dec 13, 2023
18 checks passed
@didier-wenzek didier-wenzek deleted the improve/operation-workflow-definition branch December 13, 2023 14:13
@didier-wenzek didier-wenzek mentioned this pull request Dec 13, 2023
15 tasks
@didier-wenzek
Copy link
Contributor Author

@gligorisaev
Copy link
Contributor

QA has thoroughly checked the feature and here are the results:

  • Test for ticket exists in the test suite.
  • QA has tested the function and it's functioning according description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants