
Fix 1059 - TaskRun controller update status and labels on error #1204

Merged: 1 commit, Oct 23, 2019

Conversation

@afrittoli (Member) commented Aug 16, 2019

Changes

The TaskRun controller now updates status and labels even when an error occurs.
As described in #1059, when an error is returned from reconcile, the TaskRun controller fails to update the status. The comments in the code even describe the intended behaviour, but the implementation does not match them. This PR fixes that.
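For context, a minimal sketch of the flow this PR aims for, assuming the hashicorp/go-multierror package; the TaskRun type and the reconcile/updateStatusAndLabels helpers below are hypothetical stand-ins, not the actual Tekton code:

```go
package taskrun

import (
	"context"

	"github.com/hashicorp/go-multierror"
)

// TaskRun is a hypothetical stand-in for the real TaskRun type.
type TaskRun struct {
	Name   string
	Labels map[string]string
}

type Reconciler struct{}

// Reconcile runs the sync, then always attempts to write status and labels
// back, returning the reconcile error combined with any update error.
func (c *Reconciler) Reconcile(ctx context.Context, tr *TaskRun) error {
	var merr *multierror.Error

	if err := c.reconcile(ctx, tr); err != nil {
		merr = multierror.Append(merr, err)
	}

	// Sync status and labels even when reconcile failed, so the API object
	// reflects whatever progress was made before the error.
	if err := c.updateStatusAndLabels(ctx, tr); err != nil {
		merr = multierror.Append(merr, err)
	}

	// nil when both steps succeeded, otherwise the combined error.
	return merr.ErrorOrNil()
}

// Placeholders so the sketch compiles; the real methods talk to the cluster.
func (c *Reconciler) reconcile(ctx context.Context, tr *TaskRun) error             { return nil }
func (c *Reconciler) updateStatusAndLabels(ctx context.Context, tr *TaskRun) error { return nil }
```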


Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

N/A - no API changes.

@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 16, 2019
@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Aug 16, 2019
@tekton-robot tekton-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 16, 2019
@tekton-robot (Collaborator)

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/v1alpha1/taskrun/taskrun.go 69.9% 69.8% -0.1

@afrittoli afrittoli changed the title from "WIP Fix 1059" to "Fix 1059 - TaskRun controller update status and labels on error" Aug 16, 2019
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 16, 2019
@tekton-robot (Collaborator)

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/v1alpha1/pipelinerun/pipelinerun.go Do not exist 82.4%

@afrittoli (Member, author)

/approve cancel

@tekton-robot tekton-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 16, 2019
@afrittoli (Member, author)

/test pull-tekton-pipeline-integration-tests

@gavinfish (Contributor) left a comment

LGTM

@bobcatfish (Collaborator) left a comment

Question about the use cases for this - from #1059:

When reconcile returns an error, the status, labels and annotations of the TaskRun/PipelineRun should still be sync'ed back.
In case of error, the error is returned, and no update happens.
There are only a small number of cases where this is relevant; nonetheless it should be fixed and unit test coverage added.

I want to double check that we actually want to update the status in this case. I can't really think of a reason not to; however, just to make sure we're on the same page about the behaviour: if you look at how Reconcile is called (https://github.com/knative/pkg/blob/1a3a36a996dec7645bee7084153ef8b7273e1028/controller/controller.go#L299), if an error is returned the item will get queued again (so theoretically it wouldn't stay in this state very long).

I think part of the reasoning behind not trying to update the status might also be that, if something is wrong in the underlying system, there's a good chance the attempt to update the status itself will fail. Using the multierror should at least avoid hiding that error, but I'm just not 100% clear that it's worth updating the status.

pkg/reconciler/taskrun/taskrun.go (outdated)

// In case of reconcile errors, we store the error in a multierror, attempt
// to update, and return the original error combined with any update error
merr error
@bobcatfish (Collaborator) commented Aug 29, 2019

I find that global variables tend to cause problems (http://wiki.c2.com/?GlobalVariablesAreBad seems like an okay source of info about this) - for example, if more than one Reconcile happens at a time (and I think it might, I'm not sure), then multiple goroutines could try to write to this simultaneously.

@afrittoli (Member, author)

Thanks for catching this and for the pointers - I've got some reading and investigating to do.
I think it's reasonable to assume that more than one Reconcile could happen at a time, so I should fix this.
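For illustration, a minimal sketch of the scoping difference under discussion; the names are hypothetical and this is not the actual patch:

```go
package taskrun

import "github.com/hashicorp/go-multierror"

// Problematic shape (sketch): a package-level accumulator is shared by every
// goroutine that runs Reconcile, so concurrent reconciles can interleave
// their appends and read each other's errors.
//
//	var merr *multierror.Error
//
// Safer shape (sketch): keep the accumulator local to each call, so there is
// no shared mutable state between concurrent reconciles.
func combineErrors(reconcileErr, updateErr error) error {
	var merr *multierror.Error // one accumulator per call
	if reconcileErr != nil {
		merr = multierror.Append(merr, reconcileErr)
	}
	if updateErr != nil {
		merr = multierror.Append(merr, updateErr)
	}
	return merr.ErrorOrNil()
}
```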

@afrittoli (Member, author)

Done

@afrittoli (Member, author)

Question about the use cases for this - from #1059:

When reconcile returns an error, the status, labels and annotations of the TaskRun/PipelineRun should still be sync'ed back.
In case of error, the error is returned, and no update happens.
There are only a small number of cases where this is relevant; nonetheless it should be fixed and unit test coverage added.

I want to double check that we actually want to update the status in this case. I can't really think of a reason not to; however, just to make sure we're on the same page about the behaviour: if you look at how Reconcile is called (https://github.com/knative/pkg/blob/1a3a36a996dec7645bee7084153ef8b7273e1028/controller/controller.go#L299), if an error is returned the item will get queued again (so theoretically it wouldn't stay in this state very long).

I think part of the reasoning behind not trying to update the status might also be that, if something is wrong in the underlying system, there's a good chance the attempt to update the status itself will fail. Using the multierror should at least avoid hiding that error, but I'm just not 100% clear that it's worth updating the status.

OK, I see your point. In my understanding there are two types of failures that may happen during a reconcile: recoverable and unrecoverable (or transient vs. permanent). All validation-type failures - e.g. a Task is not found or a Param does not match the spec - are considered unrecoverable. When they are encountered, reconcile updates the Condition and returns nil; a change in the status is detected and the status is sync'ed via the API. In very few cases reconcile returns an error:

  • failure to initialize the artifact storage - which could be transient
  • in case cancelPipelineRun fails
  • when createTaskRun fails
  • when c.makeConditionCheckContainer fails

Is it correct that the last three are "recoverable"? In any case, as you said, Reconcile will propagate the error and the item will be re-queued, so Reconcile will be invoked again.
I cannot think of an update that could have happened that we would miss in the next reconcile cycle, but I cannot say for sure that there is none, so I still think it may be worth having this.

It may be worth adding some documentation to reconcile so that it's more obvious, when writing and reviewing code, when reconcile should and should not return an error.

BTW, there is also the case of dag.GetSchedulable where the error is logged but reconcile continues. I need to check further to see if that's the expected behaviour.
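To make the distinction described above concrete, here is a rough sketch of the two branches; all types and helpers below are hypothetical stand-ins, not the real reconciler code:

```go
package taskrun

import (
	"context"
	"fmt"
)

// Hypothetical stand-ins for the real Tekton types and helpers.
type TaskRun struct{ failureReason, failureMessage string }
type Task struct{}
type Reconciler struct{}

func (c *Reconciler) getTask(ctx context.Context, tr *TaskRun) (*Task, error)   { return &Task{}, nil }
func (c *Reconciler) createPod(ctx context.Context, tr *TaskRun, t *Task) error { return nil }

// markFailed stands in for setting a failed Condition on the TaskRun status.
func markFailed(tr *TaskRun, reason, msg string) {
	tr.failureReason, tr.failureMessage = reason, msg
}

// reconcile sketch: validation failures are unrecoverable, so the Condition is
// updated and nil is returned (the status change gets synced, the key is not
// re-queued); infrastructure failures are treated as recoverable, so the error
// is returned and the key is re-queued for another attempt.
func (c *Reconciler) reconcile(ctx context.Context, tr *TaskRun) error {
	task, err := c.getTask(ctx, tr)
	if err != nil {
		// Unrecoverable: the referenced Task cannot be resolved; record it and stop.
		markFailed(tr, "TaskNotFound", fmt.Sprintf("failed to get task: %v", err))
		return nil
	}
	if err := c.createPod(ctx, tr, task); err != nil {
		// Possibly transient (e.g. API server hiccup): let the controller retry.
		return fmt.Errorf("failed to create pod: %w", err)
	}
	return nil
}
```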

@bobcatfish (Collaborator)

It may be worth adding some documentation to reconcile so that it's more obvious, when writing and reviewing code, when reconcile should and should not return an error.

I think that's a really good idea - I had to go and track down the code in knative/pkg to double check :S

BTW, there is also the case of dag.GetSchedulable where the error is logged but reconcile continues.

That doesn't sound right 😅

Is it correct that the last three are "recoverable"?

I guess it's hard to say! It would probably depend on what caused the issue. My rule of thumb would be: if the error was something related to the user's actions (i.e. something the user did incorrectly), it's "unrecoverable", but if it's the underlying system rather than the user (e.g. the service account running the pipelinerun controller can't create TaskRuns or something), then it's "recoverable".

@dibyom (Member) commented Oct 15, 2019

Is it correct that the last three are "recoverable"? In any case, as you said, Reconcile will propagate the error and the item will be re-queued, so Reconcile will be invoked again.

The code assumes that errors such as those from makeConditionCheckContainer are recoverable, and they are re-queued. In #1427 we ran into a case where the error was not recoverable (a validation error when creating the container). However, since we never updated the status, and the condition was the first thing to run, users saw a blank status until the pipelinerun timed out an hour later. This leads me to think that we should attempt to update the status even when we think the error is "recoverable".

All validation-type failures - e.g. a Task is not found or a Param does not match the spec - are considered unrecoverable. When they are encountered, reconcile updates the Condition and returns nil; a change in the status is detected and the status is sync'ed via the API.

I have not actually tried this out, but from looking at the code it seems like instead of returning nil we could also return a NewPermanentError, which would ensure that it does not get re-queued.
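A rough sketch of that idea, assuming knative.dev/pkg's controller.NewPermanentError; the surrounding types are hypothetical stand-ins:

```go
package taskrun

import (
	"context"
	"fmt"

	"knative.dev/pkg/controller"
)

// Hypothetical stand-ins for the real types.
type TaskRun struct{}
type Task struct{}
type Reconciler struct{}

func (c *Reconciler) getTask(ctx context.Context, tr *TaskRun) (*Task, error) { return &Task{}, nil }

// Sketch: wrapping a validation failure in controller.NewPermanentError still
// surfaces the error to the caller, but the workqueue will not re-queue the
// key for another reconcile attempt.
func (c *Reconciler) reconcile(ctx context.Context, tr *TaskRun) error {
	if _, err := c.getTask(ctx, tr); err != nil {
		return controller.NewPermanentError(fmt.Errorf("failed to get task: %w", err))
	}
	return nil
}
```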

@dibyom (Member) left a comment

I think this should be good to go once we ensure that merr is not a global variable.

@afrittoli (Member, author)

I think this should be good to go once we ensure that merr is not a global variable.

Fixed that and rebased

The TaskRun controller reconcile returns errors immediately.
It should instead first attempt to update status and labels and then
return both the original error and any further error raised during
the update.

Fixes tektoncd#1059
@tekton-robot (Collaborator)

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 83.5% 83.5% 0.1
pkg/reconciler/taskrun/taskrun.go 72.4% 72.5% 0.1

@afrittoli (Member, author)

I think this should be good to go once we ensure that merr is not a global variable.

Done!

@dibyom (Member) left a comment

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2019
@vdemeester (Member) left a comment

/lgtm

@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gavinfish, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 23, 2019
@tekton-robot tekton-robot merged commit 4df394e into tektoncd:master Oct 23, 2019