Improve operation workflow definition #2496
The proposed enhancements look good. Here are a few more enhancements that I can think of (a few taken from older PRs, just for the sake of having everything documented in one place):
- Including the previous state in the payload when a state transition occurs. This helps in defining generic states, like a timeout or failure state, that can simply report the state from which the failure happened. It would also help in defining any cleanup/rollback logic in those states based on the previous state.
- Ability to execute custom scripts as part of the terminal states `successful` and `failed`. This would enable users to attach post-workflow cleanup logic to these states, like clearing temporary resources or clearing the command status. When combined with the `previous-state` info, they can be even smarter.
["waiting-for-restart"]
builtin_action = "waiting-for-restart"
One additional benefit that I see with having this state explicitly is the freedom that the user gets to override the inbuilt `waiting-for-restart` validation logic as well.
Yes, this would be nicer. I will try this representation to see its impact on the code.
Cons:
- less specific than states with specific purpose as on `on_success` or `on_exit.1`
Despite the redundancy, I really liked the explicitness of keywords like `on_success`, `on_failure`, etc., especially since we have added more variants like `on_kill`, `on_timeout`, etc. The moment you have more than two values in that `next` array, it starts to become unclear.
So let's go in that direction: removing the `next` field from the configuration file.
on_timeout = { status = "failed", reason = "timeout" }
Suggested change:
timeout_second = 300
on_timeout = { status = "failed", reason = "timeout" }
becomes
on_timeout = { duration_seconds = 300, status = "failed", reason = "timeout" }
How about merging them so that everything related to timeouts is in one place?
Why not. Adding an This has to be considered in a larger scope to fix that issue: #2495.
Here, I'm less convinced. Sure, we need to fix #2484. My proposal is to augment the builtin workflows with intermediate states, e.g.
The code looks good. A few missing points that I noticed:
- Documentation of the `on_stdout` handler
- Even though the default exit handler feature is mentioned in the timeout handling section, a dedicated section would be nice, highlighting that default handlers can be defined for any exit case.
/// - `on_success` and `on_exit.0` are synonyms and cannot be both provided
/// - `on_error` and `on_exit._` are synonyms and cannot be both provided
/// - `on_success` and `on_stdout` are incompatible, as the next state is either determined from the script stdout or its exit codes
/// - `on_exec` is only meaningful in the context of a background script or a builtin action
Not sure if that restriction is really needed. Might come in handy before a long running script is triggered.
Sure. But this would be closer to what is proposed here: https://github.com/thin-edge/thin-edge.io/pull/2496/files#diff-0bc2ab86d1dc663436d9764c1ee3b48333013f3d52ce410aac0af84705f0bf7cR364
i.e. to have two states:
["agent-restart"]
background_script = "sudo systemctl restart tedge-agent"
on_exec = "waiting-for-restart"
["waiting-for-restart"]
script = "/some/script.sh checking restart"
on_success = "successful_restart"
on_error = "failed_restart"
on_timeout = "waiting-for-restart"
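The compatibility rules quoted at the start of this thread could be checked mechanically when a workflow is loaded. Here is a minimal sketch, assuming a hypothetical `Handlers` struct that only records which handlers were set on a state. The names `Handlers`, `HandlerError` and `check_handlers` are illustrative and are not the PR's actual types. The `on_exec` restriction is left out since it is still being discussed.

```rust
/// Hypothetical summary of which handlers are set on one workflow state.
#[derive(Default)]
struct Handlers {
    on_success: bool,
    on_error: bool,
    on_exit_0: bool,       // stands for `on_exit.0`
    on_exit_default: bool, // stands for `on_exit._`
    on_stdout: bool,
}

/// Hypothetical error type naming the two conflicting handlers.
#[derive(Debug, PartialEq)]
enum HandlerError {
    Synonyms(&'static str, &'static str),
    Incompatible(&'static str, &'static str),
}

/// Reject the handler combinations listed in the doc comment of the diff.
fn check_handlers(h: &Handlers) -> Result<(), HandlerError> {
    if h.on_success && h.on_exit_0 {
        return Err(HandlerError::Synonyms("on_success", "on_exit.0"));
    }
    if h.on_error && h.on_exit_default {
        return Err(HandlerError::Synonyms("on_error", "on_exit._"));
    }
    if h.on_success && h.on_stdout {
        return Err(HandlerError::Incompatible("on_success", "on_stdout"));
    }
    Ok(())
}
```

Returning the pair of conflicting names keeps the error message specific enough for a user editing the TOML file.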
        input.handlers.on_timeout,
        Some(TomlStateUpdate::Simple("timeout".to_string()))
    );
}
I'm guessing you forgot to assert the rest of the states.
Oops!
on_error: Option<GenericStateUpdate>,
on_kill: Option<GenericStateUpdate>,
on_exit: Vec<(u8, u8, GenericStateUpdate)>,
on_stdout: Vec<String>,
Are we making the user list out the expected state values here, just so that it is easier to figure out the entire control flow of a workflow, without having to audit the scripts? Or are there other benefits as well?
I see that they are unused for now. Are we planning to validate that the output of the script is definitely one of the values provided in this list?
> Are we making the user list out the expected state values here, just so that it is easier to figure out the entire control flow of a workflow, without having to audit the scripts? Or there are some other benefits as well?

Being able to compute the list of possible following states when in a given state will be a key to solve #2495.

> I see that they are unused now. Are we planning to validate that the output of the script is definitely one of the values provided in this list?

Indeed, not used for now. My former plan was to fail a command when moving to some unexpected state. But to solve the former issue we might have to be more subtle.
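Computing the states reachable from a given state can be sketched from the handler fields shown in the diff. This is only an illustration. `GenericStateUpdate` is reduced to a hypothetical struct carrying just the target `status`, an `on_success` field is assumed to exist next to the ones shown, and `ExitHandlers` and `possible_next_states` are invented names.

```rust
use std::collections::BTreeSet;

/// Simplified stand-in for the PR's `GenericStateUpdate`
/// (assumption: only the target status matters here).
struct GenericStateUpdate {
    status: String,
}

/// Simplified handlers of one state, mirroring the field names of the diff.
struct ExitHandlers {
    on_success: Option<GenericStateUpdate>, // assumed, not shown in the diff excerpt
    on_error: Option<GenericStateUpdate>,
    on_kill: Option<GenericStateUpdate>,
    on_exit: Vec<(u8, u8, GenericStateUpdate)>, // (from, to, update) exit code ranges
    on_stdout: Vec<String>,
}

/// Collect every state that some handler of this state can lead to.
fn possible_next_states(handlers: &ExitHandlers) -> BTreeSet<String> {
    let mut states = BTreeSet::new();
    let single = [
        handlers.on_success.as_ref(),
        handlers.on_error.as_ref(),
        handlers.on_kill.as_ref(),
    ];
    // Handlers that are not set contribute nothing.
    for update in single.iter().flatten() {
        states.insert(update.status.clone());
    }
    for (_, _, update) in &handlers.on_exit {
        states.insert(update.status.clone());
    }
    // States a script may announce on its stdout.
    states.extend(handlers.on_stdout.iter().cloned());
    states
}
```

With such a set available per state, validating a script's stdout against `on_stdout` or walking the whole workflow graph becomes a matter of set membership.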
the `on_error` definition trumps over any `status` and `reason` fields provided over the script stdout.
Why restrict that overriding rule to non-successful status codes alone? It can be applied to all exit cases, right? If an explicit exit handler is provided in the workflow for any exit code, including the successful ones, it trumps the `status` returned in the output.
Yes, definitely.
- Currently, it is even stronger: if the script is not successful, its output is ignored.
- But maybe this is not what you have in mind. Do you think consuming the output can also be useful for an exit code with a specific handler? Somehow these are expected errors.
After discussion, it appears that things are a bit more subtle:
- the `status` given in the workflow definition trumps any `status` value published on stdout (the idea is that the sequence of states is statically defined)
- the `reason` published on stdout trumps the `reason` provided in the workflow definition (the rationale being that an error message is dynamically provided by the running process)
- these rules apply only to the exit codes which are somehow expected, i.e. those with a defined `on_exit` handler.
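The precedence rules above can be sketched as a small merge function. `StateUpdate` and `resolve` are hypothetical names, not the PR's types. Falling back on the other side when a field is missing is an assumption made for this sketch, as the discussion does not cover that case.

```rust
/// Hypothetical pair of fields that a handler definition or a script's stdout may carry.
#[derive(Debug, Clone, PartialEq)]
struct StateUpdate {
    status: Option<String>,
    reason: Option<String>,
}

/// Combine the update declared by a defined `on_exit` handler with the one
/// the script published on stdout.
fn resolve(defined: &StateUpdate, stdout: &StateUpdate) -> StateUpdate {
    StateUpdate {
        // The statically defined status wins (assumed fallback: the stdout status).
        status: defined.status.clone().or_else(|| stdout.status.clone()),
        // The dynamically produced reason wins (assumed fallback: the defined reason).
        reason: stdout.reason.clone().or_else(|| defined.reason.clone()),
    }
}
```

For an exit code without any defined handler these rules do not apply, so a caller would have to decide separately what to do with the script output.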
crates/core/tedge_agent/src/agent.rs
Outdated
@@ -341,8 +340,7 @@ async fn read_operation_workflow(path: &Path) -> Result<OperationWorkflow, anyho
let context = || format!("Reading operation workflow from {path:?}");
let bytes = tokio::fs::read(path).await.with_context(context)?;
let input = std::str::from_utf8(&bytes).with_context(context)?;
let toml = toml::from_str::<TomlOperationWorkflow>(input)?; //.with_context(context)?;
let workflow = TryInto::<OperationWorkflow>::try_into(toml).with_context(context)?;
let workflow = toml::from_str::<OperationWorkflow>(input)?; //.with_context(context)?;
Was commenting out that `with_context` intentional?
I will fix that. Adding a context here is in fact worse: not only is there no context, but the root error is also lost and not logged.
As highlighted by @jarhodes314, the main issue is elsewhere: https://github.com/thin-edge/thin-edge.io/pull/2496/files#diff-03553c8b7afdbd134b09403b3f4a3931fee4c368f4a4f8a0c38bfdacbacc5018R330. One needs `{err:?}` as anyhow only shows the full error message in `Debug`, not `Display`.
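The effect can be reproduced with the standard library alone: the `Display` output of a wrapping error prints only the outer message, and the root cause appears only when the `source()` chain is walked, which is what anyhow's `{:?}` output does. The `Context` type and the `full_chain` helper below are hypothetical and only stand in for what anyhow provides.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical wrapper adding a context message on top of a root error.
#[derive(Debug)]
struct Context {
    message: String,
    source: Box<dyn Error + 'static>,
}

impl fmt::Display for Context {
    // Only the outer message is printed: the root cause is not visible here.
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{}", self.message)
    }
}

impl Error for Context {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

/// Join the messages of an error and of all its causes.
fn full_chain(err: &(dyn Error + 'static)) -> String {
    let mut text = err.to_string();
    let mut cause = err.source();
    while let Some(inner) = cause {
        text.push_str(": ");
        text.push_str(&inner.to_string());
        cause = inner.source();
    }
    text
}
```

Logging with the plain `Display` form of such an error would print the context message and silently drop the cause.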
script = "/some/script.sh checking restart"
next = ["waiting-for-restart", "successful_restart", "failed_restart"]
on_stdout = ["waiting-for-restart", "successful_restart", "failed_restart"]
Suggested change:
on_stdout = ["waiting-for-restart", "successful_restart", "failed_restart"]
becomes
on_stdout = ["successful_restart", "failed_restart"]
Expecting the same state that the workflow is currently in is highly unlikely, right?
Removed.
However, it is not so unlikely, even if it is surely better to have a blocking wait with a timeout. This is the main issue with restart: one needs a way to detect a restart, and this check must be patient (as the restart might still be pending) and robust (as the restart might fail).
```toml
["device-restart"]
builtin_action = "restart"
```
Suggested change:
builtin_action = "restart"
becomes
background_script = "sudo systemctl restart tedge-agent"
Since `builtin_action` is only introduced in the next section, this might be better for continuity.
The issue is that the former works but not the latter. More precisely, using a background script raises the question of how to detect success and failure.
```toml
["device-restart"]
action = "restart"
on_exec = "waiting-for-restart"
on_success = "successful_restart"
on_error = "failed_restart"
```

Alternative proposal:

```toml
["device-restart"]
action = "restart"
on_exec = "waiting-for-restart"

["waiting-for-restart"]
action = "waiting restart"
on_success = "successful_restart"
on_error = "failed_restart"
```
Suggested change, replacing both proposals quoted above with:
```toml
["reboot_required"]
action = "restart"
on_exec = "restarting"

["restarting"]
action = "waiting restart"
on_success = "successful_restart"
on_error = "failed_restart"
```
Removed the first proposal as we settled on the second one. Also adjusted it to reflect the state names mentioned in the description above.
Unfortunately, the second is not implemented yet. The main reason is that the `"waiting restart"` action needs more thinking. The failing system test (aka "Trigger native-reboot within workflow (on_error) - missing sudoers entry for reboot") highlights one of these points: the behavior in case of an error is not clear.
```toml
["<state>"]
action = "waiting <delegate>"
```
That `delegate` needs some doc. In the restart example it was the `restart` keyword. Would these be a set of predefined keywords like that, or can it be any text that represents an external process?
The simplest is to remove that from the doc for now.
Something is wrong / missing / unclear when we combine this with error detection. See 4db304c
The action for the `"init"` state is a `"proceed"` action, meaning nothing specific is done by the __tedge-agent__
and that a user can provide its own implementation.
Suggested change:
The action for the `"init"` state is a `"proceed"` action, meaning nothing specific is done by the __tedge-agent__
and that a user can provide its own implementation.
becomes
The action for the `"init"` state is a `"proceed"` action, meaning nothing specific is done by the __tedge-agent__.
The `proceed` action directly moves the workflow to the `on_success` state, right? So I don't see how the user can provide their own implementation while using `proceed`.
This can be changed by providing a new workflow definition as done here: https://github.com/thin-edge/thin-edge.io/pull/2496/files#diff-0bc2ab86d1dc663436d9764c1ee3b48333013f3d52ce410aac0af84705f0bf7cR491
        self.timeout
    }

    pub fn forceful_timeout(&self) -> Option<Duration> {
Suggested change:
pub fn forceful_timeout(&self) -> Option<Duration> {
becomes
pub fn forceful_timeout_extension(&self) -> Option<Duration> {
Changed.
But honestly, I don't think a name has to convey all the details of a contract.
The following commits will be removed: Indeed, adding an |
LGTM
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
- [x] script with exit handlers
- [x] builtin actions
- [ ] timeouts
- [ ] finalize naming
For now, this is done only for scripts. For actions such as a device reboot, where the agent might restart, one needs to persist on disk the current state of each command, including a last-updated-at timestamp.
69320f7 to dc06274
This PR improves the way an operation workflow can be specified by a user.
QA has thoroughly checked the feature and here are the results:
Proposed changes
Proposal for improvements on how to specify a custom workflow:
- `owner` and `next` fields
Types of changes
Paste Link to the issue
#2478
Checklist
- `cargo fmt` as mentioned in CODING_GUIDELINES
- `cargo clippy` as mentioned in CODING_GUIDELINES
Further comments