-
Notifications
You must be signed in to change notification settings - Fork 805
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(execution): Resume parent pipeline when a failed stage in a child pipeline restarts #3317
Conversation
The following commits need their title changed:
Please format your commit title into the form:
This allows us to easily generate changelogs & determine semantic version numbers when cutting releases. You can read more about commit conventions here. |
repository.updateStatus(topStage.execution.type, topStage.execution.id, RUNNING) | ||
queue.push(StartStage(startMessage)) | ||
} | ||
} | ||
} | ||
} | ||
|
||
private fun restartParentPipelineIfNeeded(message: RestartStage, topStage: Stage) { | ||
val trigger = topStage.execution.trigger | ||
if (!topStage.execution.trigger.type.equals("pipeline")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are probably better of checking the type of the trigger in a type-safe way
if (!topStage.execution.trigger.type.equals("pipeline")) { | |
if (!topStage.execution.trigger instanceof PipelineTrigger) { |
return | ||
} | ||
|
||
val parentExecution = trigger.other["parentExecution"] as Execution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then you can do (trigger as PipelineTrigger).parentExecution
* Inform the parent stage when it restarts that the child is already running | ||
*/ | ||
private fun Stage.addAlreadyRunning(message: RestartStage) { | ||
message.withExecution { execution -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems weird, why add these props on the execution it should be on the stage... but also... let me think about a better way of doing this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so i think it would be better to do the following:
- add something to the stage context like
context["_skipPipelineRestart"] = true
- No need to add
executionId
orexecutionName
since those are already there - You will need to store the execution to persist this change to the stage in the executionRepository
PipelineStage.prepareStageForRestart
should check the_skipPipelineRestart
field, and iftrue
not delete theexecutionId/executionName
, from the context. It should also clear the_skipPipelineRestart
flag so that if the user wanted to actually restart it explicitly afterwards it would work as expected (e.g. restart the child)- the
PipelineStage.taskGraph
should only create the start pipelinetask if there is noexecutionId
present in the current stage context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright I have addressed your points, thanks for the feedback!
I did not really understand point 3 though. I think this is already being done elsewhere in the code, but I might be wrong.
Also, regarding point 5, the stage context passed to taskGraph
is the current one, right?
Let me know what you think, and thanks again!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AbdulRahmanAlHamali Thanks for the bias for action on this - really appreciate it.
In general, I think it's a good idea (see my comments on how i recommend proceeding with implementation).
However, I am quite concerned as to side-effects of a change like this. In particular there are a bunch of edge cases, such as:
- what happens if the user specified they don't want to wait for the child pipeline to complete?
- what happens if the parent pipeline is still running from the previous run?
I will mull this over a bit, but let me know what you think
Hi @marchello2000 I really appreciate the feedback. I will address your points and submit a fix. Regarding your concerns:
I think we can pass down in the trigger a flag that says: I'm not waiting for you, and based on that we simply ignore the whole process.
If the parent pipeline is waiting for the child pipeline, and it failed because of the child pipeline's failure, then this scenario wouldn't happen. Unless there is some specific case where it could happen? |
Oh, this can absolutely happen.
I am not sure what the best approach here. |
Hey @marchello2000, sorry for the late reply. OK, so we can say we have three principal cases:
So, as a solution, we can simply check before restarting the parent pipeline that it is either halted, or running a stage that is not dependent on that one (so the branch is halted). Thoughts? |
orca-queue/src/main/kotlin/com/netflix/spinnaker/orca/q/handler/RestartStageHandler.kt
Show resolved
Hide resolved
* Inform the parent stage when it restarts that the child is already running | ||
*/ | ||
private fun Stage.addSkipRestart() { | ||
context["_skipPipelineRestart"] = "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not
context["_skipPipelineRestart"] = "true" | |
context["_skipPipelineRestart"] = true |
@@ -65,8 +68,18 @@ public void taskGraph(Stage stage, TaskNode.Builder builder) { | |||
@Override | |||
public void prepareStageForRestart(Stage stage) { | |||
stage.getContext().remove("status"); | |||
stage.getContext().remove("executionName"); | |||
stage.getContext().remove("executionId"); | |||
if (!stage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: i, personally, find this really hard to read. I would maybe rewrite as:
StageContext context = (StageContext)stage.getContext();
boolean restartPipeline = (boolean)(context.getCurrentOnly("_skipPipelineRestart", false));
if (restartPipeline) {
..
} else {
..
}
Note the use of getCurrentOnly
- that way you are only inspecting the context of the current stage not any of the outputs of its parents (a bug that we've run into a bunch)
@AbdulRahmanAlHamali : thanks for pushing on this. I am leaning towards "only restart parent pipeline stage if the parent pipeline is halted" - it's simpler, more predictable, and is much easier to reason about (both in code and from user's perspective). I also think it will cover the large majority of the use cases. How do you feel about that? Also, would love input from @robfletcher or @ajordens - as they have a ton more operational experience in this area |
Hey Mark, thanks for the review. I will submit changes to address your comment in a bit. Regarding restarting only in the halted pipeline case. I don't mind that. We might have some twisted use cases for our users that are not covered by this. But for now I think this covers almost everything |
@@ -71,7 +71,7 @@ public Object get(@Nullable Object key) { | |||
* @param defaultValue default value to return if key is not present | |||
* @return value or null if not present | |||
*/ | |||
Object getCurrentOnly(@Nullable Object key, Object defaultValue) { | |||
public Object getCurrentOnly(@Nullable Object key, Object defaultValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made this public to be able to access it from PipelineStage
|
||
val trigger = topStage.execution.trigger as PipelineTrigger | ||
|
||
if (trigger.parentExecution.status != TERMINAL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used TERMINAL
here, but I could have used something else like STOPPED
. Not sure which is best
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only restart in the TERMINAL
case? I think it's ok to "restart" so long as the parent is not RUNNING, no?
In that case, you can
if (trigger.parentExecution.status != TERMINAL) { | |
if (trigger.parentExecution.status.isComplete()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for being diligent on this!
Looks good minus the one comment re TERMINAL
It would probably be prudent to make a unittest for this
From what I see, |
Regarding the testing, I agree; unit tests don't add much here, since the change is mostly about the interaction with the different components. But I will run some end-to-end tests on that before we merge it. |
I think so... I guess your team is running into these use cases - curious what your perspective is. For me, I just wanted to make sure that we don't "restart" a stage while the pipeline is running because that could lead to some weird situations that I am having a hard time visualizing... I was thinking that perhaps you have a child pipeline stage that is marked as "ignore failure"so the parent pipeline will succeed but you might still want to rerun the child if it failed |
Hey mark. My exact use case is this: I have multiple environments: dev, qa, prd, etc. each represented by its own pipeline. Then, I have a parent pipeline, which triggers those pipelines one after the other, and does some other extra logic. If the 10th stage in dev fails for example, the parent pipeline stops. And the only reason to make the parent pipeline resume is by restarting all of dev environment, which is not optimal, since 10 steps have already been executed in In case the parent pipeline ignores the failure of the child pipeline, then it really is not a problem for me. The only problem is when the parent pipeline fails because of the failure of |
I think the above are the only two cases where a restart is necessary, and maybe we can add a more specific check just for them. What do you think? |
Got it. I think we can keep the check as is for now (e.g. if the pipeline is not running restart/awaken it). I just think the more complex the check the harder it is to get it right and, more importantly, reason about it for the end users |
Hey mark, did you mean to keep it as |
Just to capture our offline convo: I think the change here look good, thank you for pushing them through. I will help test it (today/tomorrow) and then mark it approved/merge |
@spinnakerbot cherry-pick 1.17 |
Cherry pick failed: Command failed (cherry pick commit f1a0f89) with exit code 128:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good!
@spinnakerbot cherry-pick 1.17 |
… pipeline restarts (#3317) * restart parent stage when child restarts * fix based on code review * remove unused function * better method name * improvements based on feedback * fix static analysis error * replace terminal by iscomplete * fetch the live version of the parent execution, and improve variable names
Cherry pick successful: #3340 |
… pipeline restarts (#3317) (#3340) * restart parent stage when child restarts * fix based on code review * remove unused function * better method name * improvements based on feedback * fix static analysis error * replace terminal by iscomplete * fetch the live version of the parent execution, and improve variable names
If we have a pipeline, where one of the stages starts another pipeline, and then a stage fails in the inner pipeline. We can restart that inner stage, but the parent pipeline will not be notified of the restart, and thus will stay stuck. The only way to be able to actually move forward is to restart the whole failed inner pipeline, which is expensive and dangerous.
This PR provides a method to solve this problem. The main concept of the fix is to find and restart the stage in the parent pipeline, but to inform it that the child has already been restarted, and that it just needs to monitor it