
Panic in controller when step fails before image digest exporter #2222

Merged
merged 1 commit into tektoncd:master on Mar 17, 2020

Conversation

@ghost ghost commented Mar 13, 2020

Changes

Fixes #2220

The image digest exporter (part of the Image Output Resource) is configured with `"terminationMessagePolicy": "FallbackToLogsOnError"`.

When a previous step has failed in a Task, our entrypoint wrapping the exporter emits a log line like `2020/03/13 12:03:26 Skipping step because a previous step failed`. Because the image digest exporter is set to FallbackToLogsOnError, Kubernetes slurps up this log line as the termination message.
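
For reference, here is a minimal sketch of a container configured this way, using the upstream corev1 types rather than the actual Tekton builder code (the container name is a placeholder):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// With FallbackToLogsOnError, the kubelet uses the tail of the
	// container log as the termination message whenever the container
	// exits with an error without writing /dev/termination-log, which
	// is how the "Skipping step" log line ends up there.
	exporter := corev1.Container{
		Name:                     "image-digest-exporter", // placeholder name
		TerminationMessagePolicy: corev1.TerminationMessageFallbackToLogsOnError,
	}
	fmt.Println(exporter.Name, string(exporter.TerminationMessagePolicy))
}
```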

That line gets read by the Tekton controller, which expects JSON in the termination message. The parse fails and the controller stops trying to read any further step statuses.

That in turn results in a mismatch between the length of the list of steps and the length of the list of step statuses. Finally, we attempt to sort the list of step statuses alongside the list of steps. This method panics with an out-of-bounds error because it assumes the two lists have the same length.
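
For illustration, a self-contained sketch (plain strings standing in for steps and statuses, not the Tekton sorter itself) of how that length mismatch turns into an out-of-bounds panic inside a sort:

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// Minimal reproduction of the failure mode: the sort assumes both
	// slices have the same length, so when a status is missing, the Less
	// function indexes past the end of the shorter slice and panics.
	steps := []string{"git-clone", "build", "image-digest-exporter"}
	statuses := []string{"git-clone", "build"} // one status was never parsed

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("panic:", r) // runtime error: index out of range [2] with length 2
		}
	}()
	sort.Slice(steps, func(i, j int) bool {
		// Pretend we need the matching status to decide the order.
		return statuses[i] < statuses[j] // panics when i or j == 2
	})
}
```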

So, this PR does the following things:

  1. The image digest exporter has the FallbackToLogsOnError policy removed. I can't think of a reason that we need this anymore.
  2. The Tekton controller no longer breaks out of the loop while it's parsing step statuses and instead simply ignores non-JSON termination messages (a rough sketch of this follows below).
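
A rough sketch of the behaviour described in point 2; the `result` type and function here are illustrative stand-ins, not the controller code in this PR:

```go
package status

import (
	"encoding/json"

	"go.uber.org/zap"
)

// result is a stand-in for the real result type carried in a step's
// termination message; only the shape matters for this sketch.
type result struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// parseMessages logs and skips a non-JSON termination message instead of
// aborting the whole loop, so the statuses of later steps are still read.
func parseMessages(logger *zap.SugaredLogger, messages []string) [][]result {
	var all [][]result
	for _, msg := range messages {
		var rs []result
		if err := json.Unmarshal([]byte(msg), &rs); err != nil {
			logger.Errorf("could not parse json message %q: %v", msg, err)
			continue // previously a break here dropped every later status
		}
		all = append(all, rs)
	}
	return all
}
```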

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you
review them:

See the contribution guide for more details.

@ghost ghost added kind/bug Categorizes issue or PR as related to a bug. needs-cherry-pick Indicates a PR needs to be cherry-pick to a release branch labels Mar 13, 2020
@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Mar 13, 2020
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 13, 2020
Collaborator

@bobcatfish bobcatfish left a comment

Thanks for fixing this so quickly!!!

  • Even though this fixes a panic, I'd prefer to get some tests in before merging if possible (can be convinced to merge anyway, though)
  • TerminationMessagePolicy is still referenced in some builders; I think we can remove that now?

logger.Errorf("Could not parse json message %q because of %w", msg, err)
break
// step generated non-JSON termination message; ignore
continue
Collaborator

This change seems a bit inconsistent with the other error handling in this function, which still logs and breaks. Two thoughts:

  1. Are we sure we want to continue processing messages when this happens? Although we don't want to panic, maybe it's reasonable to assume that encountering a badly formatted message indicates the rest are bogus, or is it that we could have some steps with good termination messages and some with bad?
  2. Maybe logging here still makes sense, if you want to track down what happened later?

Collaborator

Side note / scope creep: why are we creating a new logger in this function? If this is in the controller, can we pass in the (properly configured) logger instead? <-- probably material for a separate issue

Author

This change seems a bit inconsistent with the other error handling in this function, which still logs and breaks

This is true, though the other breaks in this function, lines 136 & 148, are inside another nested for loop. That means that when those lines are hit, the outer for loop still continues processing the rest of the steps.

Are we sure we want to continue processing messages when this happens? Although we don't want to panic, maybe it's reasonable to assume that encountering a badly formatted message indicates the rest are bogus, or is it that we could have some steps with good termination messages and some with bad?

My feeling is that this issue is evidence of some steps with good termination messages, and some with bad. There are other ways that a Step might inadvertently break the termination message and I think we can afford to be a little more robust in that scenario. Off the top of my head one example where this could happen again would be that a process spawned inside the container could race with the entrypoint to write the termination log and clobber it. That would be some awful user error but it isn't inconceivable, and I think Tekton should be able to handle that gracefully without abandoning the rest of the steps.

Maybe logging here still makes sense, if you want to track down what happened later?

Yeah that's totally fair. I'll add this back in and look at threading the Logger through from the caller.

Author

I've threaded the logger through from the controller and added the log line back in.

Author

I've factored the inner for loop into its own function. I think this makes the control flow a little clearer. I've also added a test for the panic-inducing case.

}
return false
})
return taskRunSteps
Collaborator

is this part of fixing the issue or is it a separate improvement? if the latter, maybe a separate PR makes sense?

Author

@ghost ghost Mar 13, 2020

So, the panic originates here. I have two thoughts related to this:

  1. we shouldn't really be leaving a sort func lying around that panics in the face of a length mismatch.
  2. this sorting code felt very complicated when compared to its intended purpose. I spent a good amount of time just trying to puzzle out what it was trying to achieve when the out-of-bounds error occurred. I would guess that the number of steps in a task is always going to be relatively small and I'm not convinced that the performance optimization this func was implementing was really worthwhile (even if the number of steps went into the hundreds or thousands... I don't think this would have changed the runtime really dramatically).

I guess we could approach this a couple of different ways (that I can think of, happy to go a different way if there are more I don't mention):

  • as done here, sort the statuses and push any that don't appear in taskRunSteps to the back of the list (roughly sketched at the end of this comment). I should probably also add a comment describing this behaviour.
  • check whether len(taskRunSteps) and len(taskSpecSteps) are equal; if not, return the list unsorted. Leave the stepStateSorter in place and comment about the constraint.
  • document the panic with a comment.
  • leave alone for now and revisit in a separate PR.

I'm kinda happy with the change as it stands (though it needs a comment) but I also don't mind if you think we should tackle this in a separate PR or a totally different way. WDYT?

Edit: secret option number 5 - change this func's signature to return an error as well
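
For what it's worth, a rough sketch of the first option above (order statuses by step order and push unmatched ones to the back); the function and names are illustrative, not the code in this change:

```go
package status

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// sortStatusesByStepOrder orders container statuses to match the order of
// the task's steps and pushes any status whose name has no matching step
// to the back, instead of indexing past the end of a shorter slice (the
// panic described in this thread).
func sortStatusesByStepOrder(statuses []corev1.ContainerStatus, stepNames []string) {
	rank := make(map[string]int, len(stepNames))
	for i, name := range stepNames {
		rank[name] = i
	}
	unmatched := len(stepNames) // unknown names sort after every known step
	pos := func(s corev1.ContainerStatus) int {
		if r, ok := rank[s.Name]; ok {
			return r
		}
		return unmatched
	}
	sort.SliceStable(statuses, func(i, j int) bool {
		return pos(statuses[i]) < pos(statuses[j])
	})
}
```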

Author

I've added a comment describing the new sort behaviour. Still happy to take this change out or modify it if preferred.

Member

Nit (and note): sort.Slice uses reflection (whereas sort.Sort doesn't, relying instead on the interface contract). This means it's going to be a bit slower — might be premature optimization to worry about that, though :stuck_out_tongue:
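
A tiny illustration of the distinction, using plain strings rather than anything Tekton-specific:

```go
package main

import (
	"fmt"
	"sort"
)

// byName implements sort.Interface, so sort.Sort can order it through the
// interface contract without reflection; sort.Slice takes a closure and
// relies on reflection to swap elements, which is a little slower, though
// hardly noticeable for the handful of steps in a typical task.
type byName []string

func (b byName) Len() int           { return len(b) }
func (b byName) Less(i, j int) bool { return b[i] < b[j] }
func (b byName) Swap(i, j int)      { b[i], b[j] = b[j], b[i] }

func main() {
	a := []string{"push", "build", "lint"}
	sort.Sort(byName(a)) // interface-based, no reflection

	b := []string{"push", "build", "lint"}
	sort.Slice(b, func(i, j int) bool { return b[i] < b[j] }) // reflection-based

	fmt.Println(a, b)
}
```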

Author

I've put the original sort func back. Don't want to hold up this PR if that change is controversial.

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 13, 2020
pkg/pod/status.go (outdated review thread, resolved)
@tekton-robot
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| test/builder/container.go | 86.8% | 91.7% | 4.8 |
| test/builder/step.go | 18.2% | 19.0% | 0.9 |


@afrittoli
Member

I wonder if #2029 was a different manifestation of this issue. In that case the overall Task failure status was set to that of the skipped image digest exporter step, which is incorrect since the failure was in a different step and the digest exporter step should only have been skipped - see the full yaml.


@ghost
Author

ghost commented Mar 16, 2020

/test pull-tekton-pipeline-integration-tests



@ghost
Author

ghost commented Mar 16, 2020

/retest

Member

@vdemeester vdemeester left a comment

/meow

@tekton-robot
Collaborator

@vdemeester: cat image

In response to this:

/meow


@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2020
Collaborator

@bobcatfish bobcatfish left a comment

Thanks for the updates and explanations! Looks great! I had two minor pieces of feedback but also happy to merge as is!

/lgtm

// updateStatusStartTime searches for a result called "StartedAt" in the JSON-formatted termination message
// of a step and sets the State.Terminated.StartedAt field to this time if it's found. The "StartedAt" result
// is also removed from the list of results in the container status.
func updateStatusStartTime(s *corev1.ContainerStatus) error {
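
As a reading aid, a rough sketch of what a helper with that docstring might do; the result type, field names, and timestamp format below are assumptions rather than the actual Tekton implementation:

```go
package status

import (
	"encoding/json"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// updateStatusStartTimeSketch illustrates the documented behaviour with a
// placeholder result type: find a "StartedAt" result in the JSON
// termination message, use it as the terminated state's start time, and
// drop that entry from the remaining results.
func updateStatusStartTimeSketch(s *corev1.ContainerStatus) error {
	if s.State.Terminated == nil {
		return nil
	}
	type result struct {
		Key   string `json:"key"`
		Value string `json:"value"`
	}
	var results []result
	if err := json.Unmarshal([]byte(s.State.Terminated.Message), &results); err != nil {
		return err
	}
	remaining := results[:0]
	for _, r := range results {
		if r.Key == "StartedAt" {
			t, err := time.Parse(time.RFC3339, r.Value) // the time format is an assumption
			if err != nil {
				return err
			}
			s.State.Terminated.StartedAt = metav1.NewTime(t)
			continue // the StartedAt entry is removed from the results
		}
		remaining = append(remaining, r)
	}
	msg, err := json.Marshal(remaining)
	if err != nil {
		return err
	}
	s.State.Terminated.Message = string(msg)
	return nil
}
```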
Collaborator

I love that this is a separate function! I don't mind if you want to ignore this, but since it's separate I'd expect to see unit tests for it, and also I find that once we have a private/unexported function that 1) can have meaningful tests and 2) has good reason to have a docstring, it often makes sense to move it into its own package and make it public/exported

@@ -600,6 +603,94 @@ func TestMakeTaskRunStatus(t *testing.T) {
}},
},
},
}, {
desc: "non-json-termination-message-with-steps-afterwards-shouldnt-panic",
Collaborator

🎉

@@ -624,7 +715,8 @@ func TestMakeTaskRunStatus(t *testing.T) {
},
}

got := MakeTaskRunStatus(tr, pod, v1alpha1.TaskSpec{})
logger, _ := logging.NewLogger("", "status")
Collaborator

Thanks for threading the logger through! It's kind of a pain but I think it's the right way to do it :D

break
}
if err := updateStatusStartTime(&s); err != nil {
logger.Errorf("error setting the start time of step %q in taskrun %q: %w", s.Name, tr.Name, err)
Collaborator

Ah okay, so we're just logging an error but continuing onward - I could see including a comment here about why, since it might look like an oversight when just reading the code

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 17, 2020
@tekton-robot tekton-robot merged commit 153f1d1 into tektoncd:master Mar 17, 2020
@ghost ghost mentioned this pull request Mar 17, 2020
3 tasks
@ghost ghost removed the needs-cherry-pick Indicates a PR needs to be cherry-pick to a release branch label Mar 24, 2020
Successfully merging this pull request may close these issues: panic when submitting TaskRun