Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(execution): Resume parent pipeline when a failed stage in a child pipeline restarts #3317

Merged
merged 10 commits into from
Dec 11, 2019

Conversation

AbdulRahmanAlHamali
Copy link
Contributor

@AbdulRahmanAlHamali AbdulRahmanAlHamali commented Nov 25, 2019

If we have a pipeline, where one of the stages starts another pipeline, and then a stage fails in the inner pipeline. We can restart that inner stage, but the parent pipeline will not be notified of the restart, and thus will stay stuck. The only way to be able to actually move forward is to restart the whole failed inner pipeline, which is expensive and dangerous.

This PR provides a method to solve this problem. The main concept of the fix is to find and restart the stage in the parent pipeline, but to inform it that the child has already been restarted, and that it just needs to monitor it

@spinnakerbot
Copy link
Contributor

The following commits need their title changed:

  • 26bd8ee: restart parent stage when child restarts

Please format your commit title into the form:

<type>(<scope>): <subject>, e.g. fix(kubernetes): address NPE in status check

This allows us to easily generate changelogs & determine semantic version numbers when cutting releases. You can read more about commit conventions here.

repository.updateStatus(topStage.execution.type, topStage.execution.id, RUNNING)
queue.push(StartStage(startMessage))
}
}
}
}

private fun restartParentPipelineIfNeeded(message: RestartStage, topStage: Stage) {
val trigger = topStage.execution.trigger
if (!topStage.execution.trigger.type.equals("pipeline")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are probably better of checking the type of the trigger in a type-safe way

Suggested change
if (!topStage.execution.trigger.type.equals("pipeline")) {
if (!topStage.execution.trigger instanceof PipelineTrigger) {

return
}

val parentExecution = trigger.other["parentExecution"] as Execution
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then you can do (trigger as PipelineTrigger).parentExecution

* Inform the parent stage when it restarts that the child is already running
*/
private fun Stage.addAlreadyRunning(message: RestartStage) {
message.withExecution { execution ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems weird, why add these props on the execution it should be on the stage... but also... let me think about a better way of doing this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so i think it would be better to do the following:

  1. add something to the stage context like context["_skipPipelineRestart"] = true
  2. No need to add executionId or executionName since those are already there
  3. You will need to store the execution to persist this change to the stage in the executionRepository
  4. PipelineStage.prepareStageForRestart should check the _skipPipelineRestart field, and if true not delete the executionId/executionName, from the context. It should also clear the _skipPipelineRestart flag so that if the user wanted to actually restart it explicitly afterwards it would work as expected (e.g. restart the child)
  5. the PipelineStage.taskGraph should only create the start pipelinetask if there is no executionId present in the current stage context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright I have addressed your points, thanks for the feedback!

I did not really understand point 3 though. I think this is already being done elsewhere in the code, but I might be wrong.

Also, regarding point 5, the stage context passed to taskGraph is the current one, right?

Let me know what you think, and thanks again!

Copy link
Contributor

@marchello2000 marchello2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AbdulRahmanAlHamali Thanks for the bias for action on this - really appreciate it.
In general, I think it's a good idea (see my comments on how i recommend proceeding with implementation).

However, I am quite concerned as to side-effects of a change like this. In particular there are a bunch of edge cases, such as:

  1. what happens if the user specified they don't want to wait for the child pipeline to complete?
  2. what happens if the parent pipeline is still running from the previous run?

I will mull this over a bit, but let me know what you think

@AbdulRahmanAlHamali
Copy link
Contributor Author

AbdulRahmanAlHamali commented Nov 26, 2019

Hi @marchello2000 I really appreciate the feedback. I will address your points and submit a fix.

Regarding your concerns:

what happens if the user specified they don't want to wait for the child pipeline to complete?

I think we can pass down in the trigger a flag that says: I'm not waiting for you, and based on that we simply ignore the whole process.

what happens if the parent pipeline is still running from the previous run?

If the parent pipeline is waiting for the child pipeline, and it failed because of the child pipeline's failure, then this scenario wouldn't happen. Unless there is some specific case where it could happen?

@marchello2000
Copy link
Contributor

what happens if the parent pipeline is still running from the previous run?

If the parent pipeline is waiting for the child pipeline, and it failed because of the child pipeline's failure, then this scenario wouldn't happen. Unless there is some specific case where it could happen?

Oh, this can absolutely happen.

  1. Imagine a pipeline where the user has specified "continue on failure" and the "continuation" part is LONG (like hours - which is pretty common) - this is similar to my first concern
  2. Imagine a pipeline that has two branches: one is the where this child pipeline lives and another, parallel branch. Now the pipeline stage can say "failPipeline=true" and "continuePipeline=true" which means it would wait for the parallel branch to complete before completing the parent pipeline.

I am not sure what the best approach here.
I will take a look at your commits later - gotta run

@AbdulRahmanAlHamali
Copy link
Contributor Author

AbdulRahmanAlHamali commented Dec 2, 2019

Hey @marchello2000, sorry for the late reply.

OK, so we can say we have three principal cases:

  1. The parent pipeline is halted, either because of the failure of the child pipeline, or at a later stage. In that case, a restart of the child pipeline should simply restart the pipeline stage in the parent pipeline, as I have already implemented.
  2. The parent pipeline is running a stage that is dependent directly or indirectly on the pipeline stage: I think we can detect that easily in the code, and in that case, we do not restart the pipeline stage.
  3. The parent pipeline is running a stage that is independent from the pipeline stage: In that case the restart should work normally. I think we already allow restarting stages while others are still running, right?

So, as a solution, we can simply check before restarting the parent pipeline that it is either halted, or running a stage that is not dependent on that one (so the branch is halted). Thoughts?

* Inform the parent stage when it restarts that the child is already running
*/
private fun Stage.addSkipRestart() {
context["_skipPipelineRestart"] = "true"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not

Suggested change
context["_skipPipelineRestart"] = "true"
context["_skipPipelineRestart"] = true

@@ -65,8 +68,18 @@ public void taskGraph(Stage stage, TaskNode.Builder builder) {
@Override
public void prepareStageForRestart(Stage stage) {
stage.getContext().remove("status");
stage.getContext().remove("executionName");
stage.getContext().remove("executionId");
if (!stage
Copy link
Contributor

@marchello2000 marchello2000 Dec 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: i, personally, find this really hard to read. I would maybe rewrite as:

StageContext context = (StageContext)stage.getContext();
boolean restartPipeline = (boolean)(context.getCurrentOnly("_skipPipelineRestart", false));

if (restartPipeline) {
  ..
 } else {
  ..
}

Note the use of getCurrentOnly - that way you are only inspecting the context of the current stage not any of the outputs of its parents (a bug that we've run into a bunch)

@marchello2000
Copy link
Contributor

@AbdulRahmanAlHamali : thanks for pushing on this.
At a high level, you are correct. I think there are subtleties though (e.g. a direct or indirect descendant can have a conditional on the success of the stage in question and hence will fall into your case #2 but a restart will probably be a better bet.

I am leaning towards "only restart parent pipeline stage if the parent pipeline is halted" - it's simpler, more predictable, and is much easier to reason about (both in code and from user's perspective). I also think it will cover the large majority of the use cases. How do you feel about that?

Also, would love input from @robfletcher or @ajordens - as they have a ton more operational experience in this area

@AbdulRahmanAlHamali
Copy link
Contributor Author

Hey Mark, thanks for the review. I will submit changes to address your comment in a bit.

Regarding restarting only in the halted pipeline case. I don't mind that. We might have some twisted use cases for our users that are not covered by this. But for now I think this covers almost everything

@@ -71,7 +71,7 @@ public Object get(@Nullable Object key) {
* @param defaultValue default value to return if key is not present
* @return value or null if not present
*/
Object getCurrentOnly(@Nullable Object key, Object defaultValue) {
public Object getCurrentOnly(@Nullable Object key, Object defaultValue) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made this public to be able to access it from PipelineStage


val trigger = topStage.execution.trigger as PipelineTrigger

if (trigger.parentExecution.status != TERMINAL) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used TERMINAL here, but I could have used something else like STOPPED. Not sure which is best

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only restart in the TERMINAL case? I think it's ok to "restart" so long as the parent is not RUNNING, no?
In that case, you can

Suggested change
if (trigger.parentExecution.status != TERMINAL) {
if (trigger.parentExecution.status.isComplete()) {

Copy link
Contributor

@marchello2000 marchello2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for being diligent on this!
Looks good minus the one comment re TERMINAL

It would probably be prudent to make a unittest for this

@AbdulRahmanAlHamali
Copy link
Contributor Author

From what I see, isComplete is set to true for success, cancellation and other cases. Do we want to restart even in those cases?

@AbdulRahmanAlHamali
Copy link
Contributor Author

Regarding the testing, I agree; unit tests don't add much here, since the change is mostly about the interaction with the different components. But I will run some end-to-end tests on that before we merge it.

@marchello2000
Copy link
Contributor

From what I see, isComplete is set to true for success, cancellation and other cases. Do we want to restart even in those cases?

I think so... I guess your team is running into these use cases - curious what your perspective is. For me, I just wanted to make sure that we don't "restart" a stage while the pipeline is running because that could lead to some weird situations that I am having a hard time visualizing...

I was thinking that perhaps you have a child pipeline stage that is marked as "ignore failure"so the parent pipeline will succeed but you might still want to rerun the child if it failed

@AbdulRahmanAlHamali
Copy link
Contributor Author

Hey mark.

My exact use case is this: I have multiple environments: dev, qa, prd, etc. each represented by its own pipeline.

Then, I have a parent pipeline, which triggers those pipelines one after the other, and does some other extra logic.

If the 10th stage in dev fails for example, the parent pipeline stops. And the only reason to make the parent pipeline resume is by restarting all of dev environment, which is not optimal, since 10 steps have already been executed in dev. So what I really want is for the user to simply restart the failed stage in dev, which will awaken the parent pipeline, so that it goes back to waiting for dev, and when dev is finished, it continues deploying in the other environments.

In case the parent pipeline ignores the failure of the child pipeline, then it really is not a problem for me. The only problem is when the parent pipeline fails because of the failure of dev, which could either be immediate (failPipeline=true), or not (completeOtherBranchesThenFail=true).

@AbdulRahmanAlHamali
Copy link
Contributor Author

I think the above are the only two cases where a restart is necessary, and maybe we can add a more specific check just for them.

What do you think?

@marchello2000
Copy link
Contributor

Got it. I think we can keep the check as is for now (e.g. if the pipeline is not running restart/awaken it). I just think the more complex the check the harder it is to get it right and, more importantly, reason about it for the end users

@AbdulRahmanAlHamali
Copy link
Contributor Author

Hey mark, did you mean to keep it as TERMINAL?

@marchello2000
Copy link
Contributor

Just to capture our offline convo: I think the change here look good, thank you for pushing them through. I will help test it (today/tomorrow) and then mark it approved/merge

@AbdulRahmanAlHamali
Copy link
Contributor Author

@spinnakerbot cherry-pick 1.17

@spinnakerbot
Copy link
Contributor

Cherry pick failed: Command failed (cherry pick commit f1a0f89) with exit code 128:

fatal: bad object f1a0f89845283dff8382af6f937bbfd7f5fcc3c2

Copy link
Contributor

@marchello2000 marchello2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

@marchello2000 marchello2000 added the ready to merge Approved and ready for merge label Dec 11, 2019
@mergify mergify bot merged commit eef7087 into spinnaker:master Dec 11, 2019
@mergify mergify bot added the auto merged Merged automatically by a bot label Dec 11, 2019
@AbdulRahmanAlHamali
Copy link
Contributor Author

@spinnakerbot cherry-pick 1.17

spinnakerbot pushed a commit that referenced this pull request Dec 11, 2019
… pipeline restarts (#3317)

* restart parent stage when child restarts

* fix based on code review

* remove unused function

* better method name

* improvements based on feedback

* fix static analysis error

* replace terminal by iscomplete

* fetch the live version of the parent execution, and improve variable names
@spinnakerbot
Copy link
Contributor

Cherry pick successful: #3340

ttomsu pushed a commit that referenced this pull request Dec 12, 2019
… pipeline restarts (#3317) (#3340)

* restart parent stage when child restarts

* fix based on code review

* remove unused function

* better method name

* improvements based on feedback

* fix static analysis error

* replace terminal by iscomplete

* fetch the live version of the parent execution, and improve variable names
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto merged Merged automatically by a bot ready to merge Approved and ready for merge target-release/1.18
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants