fix(execution): Resume parent pipeline when a failed stage in a child pipeline restarts #3317

AbdulRahmanAlHamali · 2019-11-25T20:30:13Z

If we have a pipeline where one of the stages starts another pipeline, and a stage then fails in the inner pipeline, we can restart that inner stage, but the parent pipeline will not be notified of the restart and will therefore stay stuck. The only way to actually move forward is to restart the whole failed inner pipeline, which is expensive and dangerous.

This PR provides a method to solve this problem. The main concept of the fix is to find and restart the stage in the parent pipeline, but to inform it that the child has already been restarted, and that it just needs to monitor it.

spinnakerbot · 2019-11-25T20:35:18Z

The following commits need their title changed:

26bd8ee: restart parent stage when child restarts

Please format your commit title into the form:

<type>(<scope>): <subject>, e.g. fix(kubernetes): address NPE in status check

This allows us to easily generate changelogs & determine semantic version numbers when cutting releases. You can read more about commit conventions here.

marchello2000 · 2019-11-25T22:35:16Z

orca-queue/src/main/kotlin/com/netflix/spinnaker/orca/q/handler/RestartStageHandler.kt

          repository.updateStatus(topStage.execution.type, topStage.execution.id, RUNNING)
          queue.push(StartStage(startMessage))
        }
      }
    }
  }

+  private fun restartParentPipelineIfNeeded(message: RestartStage, topStage: Stage) {
+    val trigger = topStage.execution.trigger
+    if (!topStage.execution.trigger.type.equals("pipeline")) {


you are probably better off checking the type of the trigger in a type-safe way

Suggested change

-    if (!topStage.execution.trigger.type.equals("pipeline")) {
+    if (topStage.execution.trigger !is PipelineTrigger) {

marchello2000 · 2019-11-25T22:35:56Z

orca-queue/src/main/kotlin/com/netflix/spinnaker/orca/q/handler/RestartStageHandler.kt

+      return
+    }
+
+    val parentExecution = trigger.other["parentExecution"] as Execution


then you can do (trigger as PipelineTrigger).parentExecution
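
A minimal Java sketch of the type-safe check suggested above, as opposed to comparing the trigger's type string. The Trigger, ManualTrigger and PipelineTrigger classes below are simplified stand-ins written for this illustration, not orca's actual classes, and parentExecutionId is an assumed field used in place of the parentExecution object mentioned in the comment.

```java
// Hypothetical stand-ins for the trigger types; names and fields are illustrative only.
interface Trigger {
  String getType();
}

final class ManualTrigger implements Trigger {
  public String getType() {
    return "manual";
  }
}

final class PipelineTrigger implements Trigger {
  private final String parentExecutionId; // assumed field, for illustration

  PipelineTrigger(String parentExecutionId) {
    this.parentExecutionId = parentExecutionId;
  }

  public String getType() {
    return "pipeline";
  }

  String getParentExecutionId() {
    return parentExecutionId;
  }
}

final class TriggerCheckSketch {
  // Returns the parent execution id, or null when the execution
  // was not started by another pipeline.
  static String parentExecutionIdOrNull(Trigger trigger) {
    if (!(trigger instanceof PipelineTrigger)) {
      return null; // nothing to resume
    }
    // The cast is safe here because the type was checked above.
    return ((PipelineTrigger) trigger).getParentExecutionId();
  }

  public static void main(String[] args) {
    System.out.println(parentExecutionIdOrNull(new PipelineTrigger("parent-1"))); // parent-1
    System.out.println(parentExecutionIdOrNull(new ManualTrigger()));             // null
  }
}
```

Checking the type rather than the string means the compiler verifies the later cast, and a renamed or misspelled type string cannot silently skip the branch.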

marchello2000 · 2019-11-25T23:24:26Z

orca-queue/src/main/kotlin/com/netflix/spinnaker/orca/q/handler/RestartStageHandler.kt

+   * Inform the parent stage when it restarts that the child is already running
+   */
+  private fun Stage.addAlreadyRunning(message: RestartStage) {
+    message.withExecution { execution ->


this seems weird: why add these props on the execution? They should be on the stage... but also... let me think about a better way of doing this

ok, so i think it would be better to do the following:

1. Add something to the stage context like context["_skipPipelineRestart"] = true.
2. No need to add executionId or executionName since those are already there.
3. You will need to store the execution to persist this change to the stage in the executionRepository.
4. PipelineStage.prepareStageForRestart should check the _skipPipelineRestart field and, if it is true, not delete executionId/executionName from the context. It should also clear the _skipPipelineRestart flag so that if the user wants to restart it explicitly afterwards, it works as expected (e.g. restarts the child).
5. PipelineStage.taskGraph should only create the start pipeline task if there is no executionId present in the current stage context.
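
The steps above could be sketched roughly as follows in Java. This is an illustration under stated assumptions, not the code in this PR: the stage context is modeled as a plain Map, the task names "startPipeline" and "monitorPipeline" are hypothetical labels, and persisting the change through the execution repository is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SkipRestartSketch {
  static final String SKIP_FLAG = "_skipPipelineRestart";

  // Child side: mark the parent's pipeline stage so that its restart
  // keeps the child execution that is already running.
  static void markChildAlreadyRestarted(Map<String, Object> parentStageContext) {
    parentStageContext.put(SKIP_FLAG, true);
  }

  // Parent side: follows the idea described for prepareStageForRestart.
  static void prepareStageForRestart(Map<String, Object> context) {
    context.remove("status");
    // Read and clear the flag in one step so a later explicit restart behaves normally.
    boolean skipRestart = Boolean.TRUE.equals(context.remove(SKIP_FLAG));
    if (!skipRestart) {
      context.remove("executionName");
      context.remove("executionId");
    }
  }

  // Parent side: only schedule the start task when no child execution is recorded.
  static List<String> taskGraph(Map<String, Object> context) {
    List<String> tasks = new ArrayList<>();
    if (!context.containsKey("executionId")) {
      tasks.add("startPipeline"); // hypothetical task label
    }
    tasks.add("monitorPipeline"); // hypothetical task label
    return tasks;
  }

  public static void main(String[] args) {
    Map<String, Object> context = new HashMap<>();
    context.put("status", "TERMINAL");
    context.put("executionId", "child-1");
    context.put("executionName", "dev");

    // Restart triggered from the child: keep the child, only monitor it.
    markChildAlreadyRestarted(context);
    prepareStageForRestart(context);
    System.out.println(taskGraph(context)); // [monitorPipeline]

    // Later explicit restart of the parent stage: the child is started again.
    prepareStageForRestart(context);
    System.out.println(taskGraph(context)); // [startPipeline, monitorPipeline]
  }
}
```

Clearing the flag inside prepareStageForRestart is what keeps the second, explicit restart working as a normal restart.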

AbdulRahmanAlHamali · 2019-11-26

Alright, I have addressed your points, thanks for the feedback!

I did not really understand point 3 though. I think this is already being done elsewhere in the code, but I might be wrong.

Also, regarding point 5, the stage context passed to taskGraph is the current one, right?

Let me know what you think, and thanks again!

marchello2000

@AbdulRahmanAlHamali Thanks for the bias for action on this - really appreciate it.
In general, I think it's a good idea (see my comments on how i recommend proceeding with implementation).

However, I am quite concerned as to side-effects of a change like this. In particular there are a bunch of edge cases, such as:

- what happens if the user specified they don't want to wait for the child pipeline to complete?
- what happens if the parent pipeline is still running from the previous run?

I will mull this over a bit, but let me know what you think

AbdulRahmanAlHamali · 2019-11-26T15:29:53Z

Hi @marchello2000 I really appreciate the feedback. I will address your points and submit a fix.

Regarding your concerns:

> what happens if the user specified they don't want to wait for the child pipeline to complete?

I think we can pass down in the trigger a flag that says: I'm not waiting for you, and based on that we simply ignore the whole process.

> what happens if the parent pipeline is still running from the previous run?

If the parent pipeline is waiting for the child pipeline, and it failed because of the child pipeline's failure, then this scenario wouldn't happen. Unless there is some specific case where it could happen?

marchello2000 · 2019-11-27T00:49:53Z

> what happens if the parent pipeline is still running from the previous run?

> If the parent pipeline is waiting for the child pipeline, and it failed because of the child pipeline's failure, then this scenario wouldn't happen. Unless there is some specific case where it could happen?

Oh, this can absolutely happen.

- Imagine a pipeline where the user has specified "continue on failure" and the "continuation" part is LONG (like hours, which is pretty common). This is similar to my first concern.
- Imagine a pipeline that has two branches: one is where this child pipeline lives and the other is a parallel branch. Now the pipeline stage can say "failPipeline=true" and "continuePipeline=true", which means it would wait for the parallel branch to complete before completing the parent pipeline.

I am not sure what the best approach is here.
I will take a look at your commits later - gotta run

AbdulRahmanAlHamali · 2019-12-02T17:39:41Z

Hey @marchello2000, sorry for the late reply.

OK, so we can say we have three principal cases:

1. The parent pipeline is halted, either because of the failure of the child pipeline, or at a later stage. In that case, a restart of the child pipeline should simply restart the pipeline stage in the parent pipeline, as I have already implemented.
2. The parent pipeline is running a stage that is dependent directly or indirectly on the pipeline stage: I think we can detect that easily in the code, and in that case, we do not restart the pipeline stage.
3. The parent pipeline is running a stage that is independent from the pipeline stage: In that case the restart should work normally. I think we already allow restarting stages while others are still running, right?

So, as a solution, we can simply check before restarting the parent pipeline that it is either halted, or running a stage that is not dependent on that one (so the branch is halted). Thoughts?

marchello2000 · 2019-12-05T04:54:29Z

orca-queue/src/main/kotlin/com/netflix/spinnaker/orca/q/handler/RestartStageHandler.kt

+   * Inform the parent stage when it restarts that the child is already running
+   */
+  private fun Stage.addSkipRestart() {
+    context["_skipPipelineRestart"] = "true"


why not

Suggested change

-    context["_skipPipelineRestart"] = "true"
+    context["_skipPipelineRestart"] = true

marchello2000 · 2019-12-05T04:56:56Z

orca-front50/src/main/groovy/com/netflix/spinnaker/orca/front50/pipeline/PipelineStage.java

@@ -65,8 +68,18 @@ public void taskGraph(Stage stage, TaskNode.Builder builder) {
  @Override
  public void prepareStageForRestart(Stage stage) {
    stage.getContext().remove("status");
-    stage.getContext().remove("executionName");
-    stage.getContext().remove("executionId");
+    if (!stage


nit: i, personally, find this really hard to read. I would maybe rewrite as:

StageContext context = (StageContext) stage.getContext();
boolean restartPipeline = (boolean) context.getCurrentOnly("_skipPipelineRestart", false);
if (restartPipeline) { .. } else { .. }

Note the use of getCurrentOnly - that way you are only inspecting the context of the current stage not any of the outputs of its parents (a bug that we've run into a bunch)
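
To illustrate the difference the note above points at, here is a small Java sketch of a context whose get falls back to ancestor outputs while getCurrentOnly does not. LayeredContextSketch is a hypothetical, simplified class written for this example; it is not orca's StageContext, and only the getCurrentOnly(key, defaultValue) shape is taken from the diff in this PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified context used only for illustration.
final class LayeredContextSketch {
  private final Map<String, Object> current = new HashMap<>();
  private final Map<String, Object> ancestorOutputs;

  LayeredContextSketch(Map<String, Object> ancestorOutputs) {
    this.ancestorOutputs = ancestorOutputs;
  }

  void put(String key, Object value) {
    current.put(key, value);
  }

  // Falls back to ancestor outputs when this stage has no value of its own.
  Object get(String key) {
    return current.containsKey(key) ? current.get(key) : ancestorOutputs.get(key);
  }

  // Looks only at this stage's own entries and never at ancestor outputs.
  Object getCurrentOnly(String key, Object defaultValue) {
    return current.containsKey(key) ? current.get(key) : defaultValue;
  }

  public static void main(String[] args) {
    Map<String, Object> upstream = new HashMap<>();
    upstream.put("_skipPipelineRestart", true); // value leaked by some earlier stage

    LayeredContextSketch context = new LayeredContextSketch(upstream);
    System.out.println(context.get("_skipPipelineRestart"));                   // true
    System.out.println(context.getCurrentOnly("_skipPipelineRestart", false)); // false
  }
}
```

With the plain get, a flag set by an upstream stage would be picked up by this stage as well, which is the kind of bug the comment warns about.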

marchello2000 · 2019-12-05T05:04:37Z

@AbdulRahmanAlHamali : thanks for pushing on this.
At a high level, you are correct. I think there are subtleties though (e.g. a direct or indirect descendant can have a conditional on the success of the stage in question and hence will fall into your case #2, but a restart will probably be a better bet).

I am leaning towards "only restart parent pipeline stage if the parent pipeline is halted" - it's simpler, more predictable, and is much easier to reason about (both in code and from user's perspective). I also think it will cover the large majority of the use cases. How do you feel about that?

Also, would love input from @robfletcher or @ajordens - as they have a ton more operational experience in this area

AbdulRahmanAlHamali · 2019-12-05T13:03:26Z

Hey Mark, thanks for the review. I will submit changes to address your comment in a bit.

Regarding restarting only in the halted pipeline case: I don't mind that. We might have some twisted use cases for our users that are not covered by this, but for now I think this covers almost everything.

AbdulRahmanAlHamali · 2019-12-05T15:00:27Z

orca-core/src/main/java/com/netflix/spinnaker/orca/pipeline/model/StageContext.java

@@ -71,7 +71,7 @@ public Object get(@Nullable Object key) {
   * @param defaultValue default value to return if key is not present
   * @return value or null if not present
   */
-  Object getCurrentOnly(@Nullable Object key, Object defaultValue) {
+  public Object getCurrentOnly(@Nullable Object key, Object defaultValue) {


made this public to be able to access it from PipelineStage

AbdulRahmanAlHamali · 2019-12-05T15:01:29Z

orca-queue/src/main/kotlin/com/netflix/spinnaker/orca/q/handler/RestartStageHandler.kt

+
+    val trigger = topStage.execution.trigger as PipelineTrigger
+
+    if (trigger.parentExecution.status != TERMINAL) {


I used TERMINAL here, but I could have used something else like STOPPED. Not sure which is best

marchello2000 · 2019-12-05

Why only restart in the TERMINAL case? I think it's ok to "restart" so long as the parent is not RUNNING, no?
In that case, you can use:

Suggested change

-    if (trigger.parentExecution.status != TERMINAL) {
+    if (trigger.parentExecution.status.isComplete()) {

marchello2000

Thanks again for being diligent on this!
Looks good minus the one comment re TERMINAL

It would probably be prudent to make a unittest for this

AbdulRahmanAlHamali · 2019-12-05T18:13:33Z

From what I see, isComplete is set to true for success, cancellation and other cases. Do we want to restart even in those cases?

AbdulRahmanAlHamali · 2019-12-05T18:15:48Z

Regarding the testing, I agree; unit tests don't add much here, since the change is mostly about the interaction with the different components. But I will run some end-to-end tests on that before we merge it.

marchello2000 · 2019-12-05T20:30:01Z

> From what I see, isComplete is set to true for success, cancellation and other cases. Do we want to restart even in those cases?

I think so... I guess your team is running into these use cases - curious what your perspective is. For me, I just wanted to make sure that we don't "restart" a stage while the pipeline is running because that could lead to some weird situations that I am having a hard time visualizing...

I was thinking that perhaps you have a child pipeline stage that is marked as "ignore failure", so the parent pipeline will succeed, but you might still want to rerun the child if it failed

AbdulRahmanAlHamali · 2019-12-06T18:47:08Z

Hey mark.

My exact use case is this: I have multiple environments: dev, qa, prd, etc. each represented by its own pipeline.

Then, I have a parent pipeline, which triggers those pipelines one after the other, and does some other extra logic.

If the 10th stage in dev fails, for example, the parent pipeline stops. And the only way to make the parent pipeline resume is by restarting all of the dev environment, which is not optimal, since 10 steps have already been executed in dev. So what I really want is for the user to simply restart the failed stage in dev, which will awaken the parent pipeline, so that it goes back to waiting for dev, and when dev is finished, it continues deploying in the other environments.

In case the parent pipeline ignores the failure of the child pipeline, then it really is not a problem for me. The only problem is when the parent pipeline fails because of the failure of dev, which could either be immediate (failPipeline=true), or not (completeOtherBranchesThenFail=true).

AbdulRahmanAlHamali · 2019-12-06T18:47:59Z

I think the above are the only two cases where a restart is necessary, and maybe we can add a more specific check just for them.

What do you think?

marchello2000 · 2019-12-06T23:20:08Z

Got it. I think we can keep the check as is for now (e.g. if the pipeline is not running restart/awaken it). I just think the more complex the check the harder it is to get it right and, more importantly, reason about it for the end users
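
The difference between the two checks discussed in this thread (resume only when the parent is TERMINAL versus resume whenever the parent is no longer running) can be sketched in Java as below. StatusSketch is a hypothetical, trimmed-down enum made up for this example; the real status type has more values, and the flags here only reflect what the thread states, namely that isComplete is true for success, cancellation and failure.

```java
// Hypothetical, trimmed-down status enum; the real set of values and flags may differ.
enum StatusSketch {
  RUNNING(false),
  SUCCEEDED(true),
  TERMINAL(true),
  CANCELED(true);

  private final boolean complete;

  StatusSketch(boolean complete) {
    this.complete = complete;
  }

  boolean isComplete() {
    return complete;
  }
}

final class ParentResumeSketch {
  // Stricter variant: resume the parent stage only after a hard failure.
  static boolean resumeOnlyIfTerminal(StatusSketch parentStatus) {
    return parentStatus == StatusSketch.TERMINAL;
  }

  // Simpler variant: resume whenever the parent is no longer running.
  static boolean resumeIfComplete(StatusSketch parentStatus) {
    return parentStatus.isComplete();
  }

  public static void main(String[] args) {
    for (StatusSketch status : StatusSketch.values()) {
      System.out.println(
          status
              + " onlyIfTerminal=" + resumeOnlyIfTerminal(status)
              + " ifComplete=" + resumeIfComplete(status));
    }
  }
}
```

Both variants refuse to touch a RUNNING parent; they differ only for parents that ended in some state other than TERMINAL, such as a parent that succeeded because the child stage was set to ignore failure.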

AbdulRahmanAlHamali · 2019-12-09T16:18:54Z

Hey mark, did you mean to keep it as TERMINAL?

marchello2000 · 2019-12-10T21:12:58Z

Just to capture our offline convo: I think the changes here look good, thank you for pushing them through. I will help test it (today/tomorrow) and then mark it approved/merge

AbdulRahmanAlHamali · 2019-12-11T20:12:02Z

@spinnakerbot cherry-pick 1.17

spinnakerbot · 2019-12-11T20:15:29Z

Cherry pick failed: Command failed (cherry pick commit f1a0f89) with exit code 128:

fatal: bad object f1a0f89845283dff8382af6f937bbfd7f5fcc3c2

marchello2000

looks good!

AbdulRahmanAlHamali · 2019-12-11T21:17:51Z

@spinnakerbot cherry-pick 1.17

… pipeline restarts (#3317)
* restart parent stage when child restarts
* fix based on code review
* remove unused function
* better method name
* improvements based on feedback
* fix static analysis error
* replace terminal by iscomplete
* fetch the live version of the parent execution, and improve variable names

spinnakerbot · 2019-12-11T21:20:30Z

Cherry pick successful: #3340

… pipeline restarts (#3317) (#3340)
* restart parent stage when child restarts
* fix based on code review
* remove unused function
* better method name
* improvements based on feedback
* fix static analysis error
* replace terminal by iscomplete
* fetch the live version of the parent execution, and improve variable names

restart parent stage when child restarts

26bd8ee

dreynaud requested a review from marchello2000 November 25, 2019 21:40

marchello2000 reviewed Nov 25, 2019

marchello2000 requested changes Nov 25, 2019

AbdulRahman AlHamali added 3 commits November 26, 2019 10:52

fix based on code review

04ed397

remove unused function

e2a5b4f

better method name

3e9a179

marchello2000 reviewed Dec 5, 2019

marchello2000 reviewed Dec 5, 2019

AbdulRahman AlHamali added 2 commits December 5, 2019 09:56

improvements based on feedback

48d1790

fix static analysis error

904f1eb

AbdulRahmanAlHamali commented Dec 5, 2019

marchello2000 reviewed Dec 5, 2019

replace terminal by iscomplete

d6167c6

AbdulRahmanAlHamali added 3 commits December 11, 2019 10:58

Merge branch 'master' into restart-parent-pipeline

9624ea3

fetch the live version of the parent execution, and improve variable names

b70c701

Merge branch 'restart-parent-pipeline' of github.com:coveord/orca into restart-parent-pipeline

168ad12

marchello2000 approved these changes Dec 11, 2019

marchello2000 added the ready to merge label Dec 11, 2019

mergify bot merged commit eef7087 into spinnaker:master Dec 11, 2019

mergify bot added the auto merged label Dec 11, 2019

spinnakerbot added the target-release/1.18 label Dec 11, 2019
