Add support for operation workflow triggering device restart #2479
Conversation
Now that `restart` is an action that can be triggered generically from any workflow, there is the possibility of multiple workflows (for different operations) requesting a restart simultaneously as well. We'll have to extend the support for that in the future so that all the requesting operations succeed, even though only one of those requests actually restarted the device.
```toml
next = ["scheduled"]

[scheduled]
script = "restart"
```
As you highlighted already, I'd also be in favour of introducing a different `action` key instead of overloading the `script` key.
The proposal is to go with something along these lines:

```toml
trigger = {
    action = "restart",
    on_exec = "restarting",
    on_success = "successful_restart",
    on_error = "failed_restart"
}
```
Why not continue using the `next` field itself for those successful and failed target states instead of these new `on_success` and `on_error` directives, so that it is consistent with the rest of the workflow actions?

That `on_exec` state is still a bit confusing to me. A user looking at the workflow definition to make sense of the end-to-end control flow will be a bit lost seeing that `restarting` state, as there is no other reference to that state in the workflow file. I guess I'm still missing something about this contract.
As discussed offline, the `restarting` state being implicit, while the `successful_restart` and `failed_restart` states are explicit, was the confusing aspect for me. One way to make things clearer would be to have that `restarting` state also explicitly defined in the workflow as follows:
```toml
operation = "controlled_restart"
# ...

[scheduled]
action = "restart"
next = ["restarting"]

[restarting]
action = "check-restart"
next = ["successful_restart", "failed_restart"]

[successful_restart]
script = "/etc/tedge/operations/log-restart.sh ${.topic.cmd_id} ${.payload.status}"
next = ["successful"]

[failed_restart]
script = "/etc/tedge/operations/log-restart.sh ${.topic.cmd_id} ${.payload.status} ${.payload.restartError}"
next = ["failed"]

# ...
```
The issue with this approach is that we're forcing the user to explicitly define additional states as per the internal contract of the `restart` action. The need for an explicit `restarting` state may not be that intuitive to a user.

Another option is to completely hide the `restarting` state by allowing the agent to auto-generate its own internal restarting states without exposing them to the user in the workflow files. For example, if the `restart` action is called from the `scheduled` state, then the `status` will be updated from `scheduled` to `scheduled_restarting`. If another state `xyz` calls this action, then it will be moved to an `xyz_restarting` state. That way, we can avoid state value conflicts as well. The only issue with this approach is that the user will notice these "internal states" that they did not define in their workflow file. Being explicit makes the complete control flow clearer from the workflow definition itself and simplifies the implementation as well.
```toml
[scheduled]
script = "restart"
next = ["restarting", "successful_restart", "failed_restart"]
```
I still don't fully get the idea behind the need for that first `restarting` state in this array. Why not just model it as an independent `restarting` state transitioning from the `scheduled` state? That `restarting` state can then have subsequent `successful_restart` and `failed_restart` states.
The shortcut I took is indeed confusing. I should not have hijacked the `next` field and I will fix that.

What you describe is the correct workflow: moving from `scheduled` to `restarting` and then to `successful_restart` or `failed_restart`.

The actual purpose is different: these 3 state names are used as aliases for the state names of the restart internal workflow. The states of the latter cannot be used unchanged (as the main workflow already has `executing`, `successful` and `failed` states) nor hard-coded (as the main workflow might invoke many restarts).
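The aliasing mechanism described here could be sketched as follows. This is a hypothetical illustration, not the actual tedge-agent implementation; the function name and signature are invented for this example:

```rust
// Hypothetical sketch: the restart internal workflow uses generic state
// names ("executing", "successful", "failed"). Per invocation, these are
// renamed to the aliases listed by the invoking workflow, so they cannot
// clash with the main workflow's own states and several restarts can be
// invoked from one workflow.
fn alias_restart_state(
    internal: &str,
    on_exec: &str,
    on_success: &str,
    on_error: &str,
) -> Option<String> {
    match internal {
        "executing" => Some(on_exec.to_string()),
        "successful" => Some(on_success.to_string()),
        "failed" => Some(on_error.to_string()),
        _ => None, // not a restart-internal state
    }
}

fn main() {
    // With next = ["restarting", "successful_restart", "failed_restart"]:
    let aliased = alias_restart_state(
        "successful",
        "restarting",
        "successful_restart",
        "failed_restart",
    );
    assert_eq!(aliased, Some("successful_restart".to_string()));
    println!("ok");
}
```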
Why is it wrong to model the same workflow as follows (note the updated definition of the `scheduled` state):
```toml
operation = "controlled_restart"

[init]
script = "/etc/tedge/operations/log-restart.sh ${.topic.cmd_id} ${.payload.status}"
next = ["scheduled"]

[scheduled]
inbuilt-action = "restart"
next = ["successful_restart", "failed_restart"]

[successful_restart]
script = "/etc/tedge/operations/log-restart.sh ${.topic.cmd_id} ${.payload.status}"
next = ["successful"]

[failed_restart]
script = "/etc/tedge/operations/log-restart.sh ${.topic.cmd_id} ${.payload.status} ${.payload.restartError}"
next = ["failed"]
```
When the `tedge-agent` processes the `inbuilt-action = "restart"`, it knows that it was called in the context of the `controlled_restart` command, from the `scheduled` state. It can persist all this information before the restart and, once the restart completes, it can just retrieve the state from which `restart` was invoked, which is `scheduled` in this case, and then transition to either of the states defined in the `next` field for that state.
This is the simplistic view that I have about handling restarts in the context of other operations by simply adding more states. I'm sure I'm missing something here; I'm just not able to see what it is yet.
> When the `tedge-agent` processes the `inbuilt-action = "restart"`, it knows that it was called in the context of the `controlled_restart` command, from the `scheduled` state. It can persist all this information before the restart and once the restart completes, it can just retrieve the state from which `restart` was invoked, which is `scheduled` in this case, and then just transition to either of the states defined in its `next` field for that state.
With one exception described below, what you describe is what is actually implemented. The context of the restart (i.e. the whole MQTT message describing the state of the main workflow) is stored by the restart operation and resumed after the reboot.
> This is the simplistic view that I have about handling restarts in the context of other operations by simply adding more states. I'm sure I'm missing something here; I'm just not able to see what it is yet.
Moving from a state triggering a restart to a state waiting for the end of the restart is required to avoid a race condition.

With your example, where there is no `restarting` state, there is a risk of an infinite loop. If the retained message for the `scheduled` state is read before the restart completion is detected (these are two independent threads), then the workflow will erroneously trigger a new restart.
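To illustrate why the dedicated `restarting` state closes the loop, here is a minimal sketch (hypothetical types and function names, not the actual tedge-agent code) of what the agent decides when it replays the retained state message after a reboot:

```rust
// Hypothetical sketch of resuming a workflow from its retained MQTT state
// message after a reboot. The action chosen for the replayed state must not
// re-trigger the restart, otherwise the device loops forever.
#[derive(Debug, PartialEq)]
enum Action {
    TriggerRestart,      // reboot the device
    AwaitRestartOutcome, // wait for the restart-detection thread
}

fn resume(retained_state: &str) -> Action {
    match retained_state {
        // Without a "restarting" state, the retained message still says
        // "scheduled" after the reboot: replaying it triggers a new restart.
        "scheduled" => Action::TriggerRestart,
        // With an explicit "restarting" state persisted before the reboot,
        // the replayed state only waits for the restart outcome.
        "restarting" => Action::AwaitRestartOutcome,
        _ => Action::AwaitRestartOutcome,
    }
}

fn main() {
    assert_eq!(resume("scheduled"), Action::TriggerRestart);
    assert_eq!(resume("restarting"), Action::AwaitRestartOutcome);
    println!("ok");
}
```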
> With your example, where there is no restarting state, there is a risk of an infinite loop. If the retained message for the scheduled state is read before the restart completion is detected (these are two independent threads), then the workflow will erroneously trigger a new restart.
Yeah, now the need for that `restarting` state is clear. I've documented some of my thoughts on how to make this flow clearer in this comment.
tests/RobotFramework/tests/tedge_agent/workflows/custom_restart.toml
LGTM.
One confusing thing that I found is the inconsistent usage of the words `state` and `step` in docs and code synonymously. The docs primarily use `state` (although there are a few references to `step` as well), and the code primarily uses `step`. I feel that using `state` consistently would make it easier for the users and in the code. I prefer `state` over `step` as that's what each step in the workflow represents at the end of the day.
```diff
@@ -240,7 +318,24 @@ impl From<&OperationState> for OperationAction {
     // TODO this must be called when an operation is registered, not when invoked.
     fn from(state: &OperationState) -> Self {
         match &state.script {
-            Some(script) => OperationAction::Script(script.to_owned()),
+            Some(script) if script.command == "restart" => {
```
In that follow-up PR where we revisit the workflow file format, we can consider adding a dedicated `action` or `builtin-action` key for `restart` instead of overloading `script`.
- If the script returns a JSON payload with a `status` field, this status is used as the new status for the command.
- If the script successfully returns, its standard output is used to update the command state payload.
- From this output, only the excerpt between a `:::begin-tedge:::` header and a `:::end-tedge:::` trailer is decoded. This is done to ease script authoring. A script can emit arbitrary output on its stdout,
Initially I was wondering why we couldn't just parse the last `{` and `}` chars instead of `:::begin-tedge:::` and `:::end-tedge:::`. But I understand that this simplifies our parsing logic if there are multiple JSON objects printed by earlier commands. Still, picking up the last JSON object should be okay, as I can't imagine any script performing more steps after printing this status message, which is supposed to be the last thing that it does.

But this simplification can be done later as well, as this contract is also acceptable. The lighter the contract, the better.
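For reference, the marker-based extraction under discussion could be sketched like this. This is a hypothetical illustration (invented function name), not the actual tedge-agent parsing code:

```rust
// Hypothetical sketch: keep only the excerpt of a script's stdout that lies
// between the ":::begin-tedge:::" header and the ":::end-tedge:::" trailer,
// ignoring any other output the script or its dependencies may have printed.
fn extract_tedge_excerpt(stdout: &str) -> Option<&str> {
    const BEGIN: &str = ":::begin-tedge:::";
    const END: &str = ":::end-tedge:::";
    let start = stdout.find(BEGIN)? + BEGIN.len();
    let end = stdout[start..].find(END)? + start;
    Some(stdout[start..end].trim())
}

fn main() {
    let output = "installing deps...\n:::begin-tedge:::\n{\"status\":\"success\"}\n:::end-tedge:::\n";
    assert_eq!(extract_tedge_excerpt(output), Some("{\"status\":\"success\"}"));
    // No markers: nothing is decoded.
    assert_eq!(extract_tedge_excerpt("random noise"), None);
    println!("ok");
}
```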
This is a critical struct for operation persistence that must be as simple as possible. Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Was implemented only once. Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Previously the operation under execution was stored using an ad hoc format. Now the command state is stored in the same format as for the cmd/+/+ topics. This simplifies operation handling (for instance, there is no longer a specific *Restarting* status for the restart command: this state is stored as *executing*). Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
The current state of the command triggering the restart is stored in the restart command. So, after reboot, the former command can resume in the appropriate following state. Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
The log cleanup logic (in /var/tedge/log/agent) was deleting workflow log files (due to false assumptions on file names). Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
The workflow-relevant part of a script output has to be surrounded by a header and a footer as in: ``` :::begin-tedge::: {"status":"success"} :::end-tedge::: ``` The goal is to facilitate the creation of operation scripts. Previously, any garbage printed on stdout by one of the script's dependencies was corrupting the thin-edge related output. Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
As an operation workflow can be customized, notably with user-specific states, the mapper should accept these unknown states, doing nothing specific, but without emitting errors. Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
QA has thoroughly checked the feature and here are the results:
Proposed changes
The following triggers a device restart, leading to "successful_restart" or "failed_restart" depending on the actual success of the reboot.

The `next` list of states is used to configure the states to which the workflow moves on restart.

Note: this file format has to be discussed, e.g. introducing an `action` property instead of overloading the `script` property:

`action = "restart --executing restarting --on-success successful_restart --on-error failed_restart"`
Types of changes
Paste Link to the issue
#2478
Checklist
- `cargo fmt` as mentioned in CODING_GUIDELINES
- `cargo clippy` as mentioned in CODING_GUIDELINES

Further comments