
Add support for operation workflow triggering device restart #2479

Merged
merged 13 commits into thin-edge:main from feat/restart-with-context on Dec 1, 2023

Conversation

didier-wenzek (Contributor) commented Nov 23, 2023

Proposed changes

The following triggers a device restart, leading to `successful_restart` or `failed_restart` depending on the actual outcome of the reboot:

```toml
[scheduled]
script = "restart"
next = ["restarting", "successful_restart", "failed_restart"]
```

The `next` list of states configures the states to which the workflow moves on restart:

  • The first state (`restarting` in the example) is the executing state of the command: until the device restarts, the workflow stays in that state.
  • The second state (`successful_restart` in the example) is the state to which the workflow moves after a successful restart.
  • The third state (`failed_restart` in the example) is the state to which the workflow moves after a failed restart.
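For illustration only, a complete workflow using this three-entry `next` contract might look like the sketch below. The `controlled_restart` operation name and the surrounding states are assumptions for the example, not part of this PR:

```toml
# Illustrative sketch of a workflow using the proposed contract.
# The first `next` entry of the restart step is the executing state;
# the second and third are the success and failure targets.
operation = "controlled_restart"   # hypothetical operation name

[init]
next = ["scheduled"]

[scheduled]
script = "restart"
next = ["restarting", "successful_restart", "failed_restart"]

[successful_restart]
next = ["successful"]

[failed_restart]
next = ["failed"]
```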

Note: this file format has yet to be discussed:

  • An alternative could be to have a specific action property instead of overloading the script property
  • For example: action = "restart --executing restarting --on-success successful_restart --on-error failed_restart"

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

#2478

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

codecov bot commented Nov 23, 2023

Codecov Report

Merging #2479 (b34e5ae) into main (ea254bf) will decrease coverage by 0.3%.
The diff coverage is 54.9%.

Additional details and impacted files
| Files | Coverage Δ |
|---|---|
| crates/core/plugin_sm/src/operation_logs.rs | 78.6% <100.0%> (-0.9%) ⬇️ |
| crates/core/tedge_agent/src/agent.rs | 0.0% <ø> (ø) |
| ...ates/core/tedge_agent/src/restart_manager/tests.rs | 94.1% <100.0%> (+0.2%) ⬆️ |
| ...tes/core/tedge_agent/src/software_manager/tests.rs | 93.7% <100.0%> (+0.9%) ⬆️ |
| ...tes/core/tedge_agent/src/state_repository/error.rs | 0.0% <ø> (ø) |
| ...dge_agent/src/tedge_operation_converter/builder.rs | 90.1% <100.0%> (+0.3%) ⬆️ |
| ...tedge_agent/src/tedge_operation_converter/tests.rs | 91.5% <100.0%> (-0.2%) ⬇️ |
| crates/extensions/c8y_mapper_ext/src/converter.rs | 81.3% <ø> (ø) |
| crates/extensions/c8y_mapper_ext/src/tests.rs | 91.6% <100.0%> (-0.1%) ⬇️ |
| ...rates/extensions/tedge_config_manager/src/actor.rs | 66.2% <ø> (ø) |

... and 11 more

... and 2 files with indirect coverage changes

github-actions bot (Contributor) commented Nov 23, 2023

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % | ⏱️ Duration |
|---|---|---|---|---|---|
| 388 | 0 | 3 | 388 | 100 | 54m23.009999999s |

albinsuresh (Contributor) left a comment:

Now that restart is an action that can be triggered generically from any workflow, multiple workflows (for different operations) may request a restart simultaneously. We'll have to extend the support for that in the future, so that all the requesting operations succeed even though only one of those requests actually restarted the device.

```toml
next = ["scheduled"]

[scheduled]
script = "restart"
```
Contributor commented:

As you highlighted already, I'd also be in favour of introducing a different action key instead of overloading the script key.

didier-wenzek (author) replied:

The proposal is to go with something along these lines:

```toml
trigger = {
    action = "restart",
    on_exec = "restarting",
    on_success = "successful_restart",
    on_error = "failed_restart"
}
```
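A side note on syntax: if these workflow files are TOML (as the other snippets suggest), an inline table spanning several lines is not valid TOML 1.0, so the same proposal would more likely be spelled as a sub-table of the state (key names taken from the proposal above, everything else hypothetical):

```toml
# Same proposal, spelled as a TOML sub-table of the state.
[scheduled.trigger]
action = "restart"
on_exec = "restarting"
on_success = "successful_restart"
on_error = "failed_restart"
```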

Contributor commented:

Why not continue using the next field itself for those successful and failed target states, instead of the new on_success and on_error directives, so that it stays consistent with the rest of the workflow actions?

That on_exec state is still a bit confusing to me. A user reading the workflow definition to make sense of the end-to-end control flow will be a bit lost seeing that restarting state, as there is no other reference to it in the workflow file. I guess I'm still missing something about this contract.

Contributor commented:

As discussed offline, the restarting state being implicit, while the successful_restart and failed_restart states are explicit, was the confusing aspect for me. One way to make things clearer would be to have that restarting state also explicitly defined in the workflow, as follows:

```toml
operation = "controlled_restart"

# ...

[scheduled]
action = "restart"
next = ["restarting"]

[restarting]
action = "check-restart"
next = ["successful_restart", "failed_restart"]

[successful_restart]
script = "/etc/tedge/operations/log-restart.sh ${.topic.cmd_id} ${.payload.status}"
next = ["successful"]

[failed_restart]
script = "/etc/tedge/operations/log-restart.sh ${.topic.cmd_id} ${.payload.status} ${.payload.restartError}"
next = ["failed"]

# ...
```

The issue with this approach is that we're forcing the user to explicitly define additional states as per the internal contract of the restart action. The need to have an explicit restarting state may not be that intuitive to a user.

Another option is to completely hide the restarting state by allowing the agent to auto-generate its own internal restarting states, without exposing them to the user in the workflow files. For example, if the restart action is called from the scheduled state, then the status will be updated from scheduled to scheduled_restarting. If another state xyz calls this action, then it will be moved to the xyz_restarting state. That way, we can avoid state value conflicts as well. The only issue with this approach is that the user will notice these "internal states" that they did not define in their workflow file. Being explicit makes the complete control flow clearer from the workflow definition itself and simplifies the implementation as well.
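The auto-generated internal state idea above can be sketched as follows (hypothetical helper, not tedge-agent code): the agent derives a per-caller waiting state from the state that invoked the restart action, so several restart-triggering states cannot clash.

```python
# Hypothetical sketch of auto-generated internal "restarting" states.

def internal_restarting_state(calling_state: str, user_states: set) -> str:
    """Derive the internal waiting state for a restart triggered from `calling_state`."""
    name = f"{calling_state}_restarting"
    if name in user_states:
        # Defensive: the user happened to define this exact name themselves.
        raise ValueError(f"state name conflict: {name}")
    return name

states = {"init", "scheduled", "successful", "failed"}
print(internal_restarting_state("scheduled", states))  # scheduled_restarting
print(internal_restarting_state("xyz", states))        # xyz_restarting
```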


```toml
[scheduled]
script = "restart"
next = ["restarting", "successful_restart", "failed_restart"]
```
Contributor commented:

I still don't fully get the idea behind the need for that first restarting state in this array. Why not just model it as an independent restarting state transitioning from the scheduled state? That restarting state can then have subsequent successful_restart and failed_restart states?

didier-wenzek (author) replied:

The shortcut I took is indeed confusing. I should not have hijacked the next field, and I will fix that.

What you describe is the correct workflow: moving from scheduled to restarting and then to successful_restart or failed_restart.

The actual purpose is different: these 3 state names are used as aliases for the state names of the restart internal workflow. The states of the latter cannot be used unchanged (as the main workflow already has executing, successful and failed states) nor hard-coded (as the main workflow might invoke many restarts).

Contributor commented:

Why is it wrong to model the same workflow as follows (note the updated definition of the scheduled state):

```toml
operation = "controlled_restart"

[init]
script = "/etc/tedge/operations/log-restart.sh ${.topic.cmd_id} ${.payload.status}"
next = ["scheduled"]

[scheduled]
inbuilt-action = "restart"
next = ["successful_restart", "failed_restart"]

[successful_restart]
script = "/etc/tedge/operations/log-restart.sh ${.topic.cmd_id} ${.payload.status}"
next = ["successful"]

[failed_restart]
script = "/etc/tedge/operations/log-restart.sh ${.topic.cmd_id} ${.payload.status} ${.payload.restartError}"
next = ["failed"]
```

When the tedge-agent processes the inbuilt-action = restart, it knows that it was called in the context of the controlled_restart command, from the scheduled state. It can persist all this information before the restart and once the restart completes, it can just retrieve the state from which restart was invoked, which is scheduled in this case, and then just transition to either of the states defined in its next field for that state.

This is the simplistic view that I have about handling restarts in the context of other operations, by simply adding more states. I'm sure I'm missing something here; I'm just not able to see what it is yet.
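The persist-and-resume flow described above can be sketched as follows (illustrative Python, not the tedge-agent implementation; the file location and context structure are hypothetical): before triggering the reboot, the agent stores the calling workflow context; after the reboot, it reloads the context and transitions to one of the `next` states of the state that triggered the restart.

```python
# Illustrative sketch of persisting the workflow context across a reboot.
import json
import tempfile
from pathlib import Path

STATE_FILE = Path(tempfile.gettempdir()) / "restart-context.json"  # hypothetical location

def persist_context(operation: str, cmd_id: str, state: str) -> None:
    """Persist the calling workflow context before triggering the reboot."""
    ctx = {"operation": operation, "cmd_id": cmd_id, "state": state}
    STATE_FILE.write_text(json.dumps(ctx))

def resume_next_states(workflow: dict) -> list:
    """After the reboot, look up the `next` states of the state that restarted."""
    ctx = json.loads(STATE_FILE.read_text())
    return workflow[ctx["state"]]["next"]

workflow = {"scheduled": {"next": ["successful_restart", "failed_restart"]}}
persist_context("controlled_restart", "c8y-123", "scheduled")
print(resume_next_states(workflow))  # ['successful_restart', 'failed_restart']
```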

didier-wenzek (author) replied:

> When the tedge-agent processes the inbuilt-action = restart, it knows that it was called in the context of the controlled_restart command, from the scheduled state. It can persist all this information before the restart and once the restart completes, it can just retrieve the state from which restart was invoked, which is scheduled in this case, and then just transition to either of the states defined in its next field for that state.

With one exception described below, what you describe is what is actually implemented: the context of the restart (i.e. the whole MQTT message describing the state of the main workflow) is stored by the restart operation and resumed after the reboot.

> This is simplistic view that I have about handling restarts in the context of other operations by simply adding more states. I'm sure I'm missing something here. I'm just not able to see what it is yet.

Moving from a state triggering a restart to a state waiting for the end of the restart is required to avoid a race condition.

With your example, where there is no restarting state, there is a risk of an infinite loop. If the retained message for the scheduled state is read before the restart completion is detected (these are two independent threads), then the workflow will erroneously trigger a new restart.

Contributor commented:

> With your example, where there is no restarting state, there is a risk of an infinite loop. If the retained message for the scheduled state is read before the restart completion is detected (these are two independent threads), then the workflow will erroneously trigger a new restart.

Yeah, now the need for that restarting state is clear. I've documented some of my thoughts on how to make this flow clearer in this comment.

Resolved review threads: crates/core/tedge_agent/src/software_manager/actor.rs, crates/core/tedge_api/src/messages.rs
albinsuresh (Contributor) left a comment:

LGTM.

One confusing thing I found is the inconsistent, synonymous use of the words state and step in the docs and code. The docs primarily use state (although there are a few references to step as well), while the code primarily uses step. Using state consistently would make things easier both for users and in the code; I prefer state over step, as that is what each step in the workflow represents at the end of the day.

Resolved review threads: crates/core/tedge_api/src/messages.rs, crates/core/tedge_api/src/workflow.rs
```
@@ -240,7 +318,24 @@ impl From<&OperationState> for OperationAction {
    // TODO this must be called when an operation is registered, not when invoked.
    fn from(state: &OperationState) -> Self {
        match &state.script {
            Some(script) => OperationAction::Script(script.to_owned()),
            Some(script) if script.command == "restart" => {
```
Contributor commented:

In that follow-up PR where we revisit the workflow file format, we can consider adding a dedicated action or builtin-action key for restart instead of overloading script.

Resolved review thread: docs/src/references/agent/operation-workflow.md
- If the script returns a JSON payload with a `status` field, this status is used as the new status for the command.
- If the script returns successfully, its standard output is used to update the command state payload.
- From this output, only the excerpt between a `:::begin-tedge:::` header and a `:::end-tedge:::` trailer is decoded.
  This is done to ease script authoring: a script can emit arbitrary output on its stdout, and only the tagged excerpt is interpreted.
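The excerpt extraction can be sketched as follows (illustrative Python, not the actual tedge implementation; only the marker strings come from the docs above):

```python
# Extract and decode the tagged excerpt of a script's stdout:
# only the text between :::begin-tedge::: and :::end-tedge:::
# is interpreted as JSON; everything else is ignored.
import json

def extract_tedge_excerpt(stdout: str):
    """Return the decoded JSON excerpt, or None if no tagged block is found."""
    begin, end = ":::begin-tedge:::", ":::end-tedge:::"
    start = stdout.find(begin)
    if start < 0:
        return None
    stop = stdout.find(end, start + len(begin))
    if stop < 0:
        return None
    return json.loads(stdout[start + len(begin):stop])

out = "apt noise...\n:::begin-tedge:::\n{\"status\":\"success\"}\n:::end-tedge:::\n"
print(extract_tedge_excerpt(out))  # {'status': 'success'}
```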
Contributor commented:

Initially I was wondering why we couldn't just parse the last { and } chars instead of :::begin-tedge::: and :::end-tedge:::. But I understand that this simplifies our parsing logic if there are multiple JSON objects printed by earlier commands. Still, picking up the last JSON object should be okay, as I can't imagine a script performing more steps after printing this status message, which is supposed to be the last thing it does.

This simplification can be done later as well, as this contract is also acceptable. But the lighter the contract, the better.

This is a critical struct for operation persistence
that must be as simple as possible.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Was implemented only once.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Previously the operation under execution was stored using an ad hoc
format. Now the command state is stored in the same format as for the
cmd/+/+ topics. This simplifies operation handling (for instance, there is
no longer a specific *Restarting* status for the restart command: this state
is stored as *executing*).

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
The current state of the command triggering the restart
is stored in the restart command. So, after reboot, the former command
can resume in the appropriate following state.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
The log cleanup logic (in /var/tedge/log/agent) was removing
workflow log files (due to false assumptions on file names).

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
The workflow-relevant part of a script's output has to be surrounded
by a header and a footer, as in:

```
:::begin-tedge:::
{"status":"success"}
:::end-tedge:::
```

The goal is to facilitate the creation of operation scripts.
Previously any garbage printed on stdout by one of the script
dependencies was corrupting the thin-edge related output.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
As an operation workflow can be customized,
notably with user-specific states, the mapper should accept these
unknown states, doing nothing specific but without emitting errors.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
@didier-wenzek didier-wenzek merged commit 1c9e284 into thin-edge:main Dec 1, 2023
18 checks passed
@didier-wenzek didier-wenzek deleted the feat/restart-with-context branch December 1, 2023 08:46
didier-wenzek mentioned this pull request on Dec 1, 2023
gligorisaev (Contributor) commented:

QA has thoroughly checked the feature; here are the results:

  • A test for the ticket exists in the test suite: tests/RobotFramework/tests/tedge_agent/workflows/custom_operation.robot
  • QA has tested the feature and it functions according to the description.
