
Panic in controller when step fails before image digest exporter #2222

Merged
merged 1 commit into tektoncd:master on Mar 17, 2020

Conversation

@ghost ghost commented Mar 13, 2020

Changes

Fixes #2220

The image digest exporter (part of the Image Output Resource) is configured with `"terminationMessagePolicy": "FallbackToLogsOnError"`.

When a previous step has failed in a Task, our entrypoint wrapping the exporter emits a log line like `2020/03/13 12:03:26 Skipping step because a previous step failed`. Because the image digest exporter is set to FallbackToLogsOnError, Kubernetes slurps up this log line as the termination message.
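
For reference, here is a minimal sketch of a container configured this way, using the upstream corev1 types rather than the actual Tekton builder code (the container name is a placeholder):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// With FallbackToLogsOnError, the kubelet uses the tail of the
	// container log as the termination message whenever the container
	// exits with an error without writing /dev/termination-log, which
	// is how the "Skipping step" log line ends up there.
	exporter := corev1.Container{
		Name:                     "image-digest-exporter", // placeholder name
		TerminationMessagePolicy: corev1.TerminationMessageFallbackToLogsOnError,
	}
	fmt.Println(exporter.Name, string(exporter.TerminationMessagePolicy))
}
```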

That line gets read by the Tekton controller, which expects JSON in the termination message. The parse fails and the controller stops trying to read any further step statuses.

That in turn results in a mismatch between the length of the list of steps and the length of the list of step statuses. Finally, we attempt to sort the list of step statuses alongside the list of steps. This method panics with an out-of-bounds error because it assumes the two lists have the same length.
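
For illustration, a self-contained sketch (plain strings standing in for steps and statuses, not the Tekton sorter itself) of how that length mismatch turns into an out-of-bounds panic inside a sort:

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// Minimal reproduction of the failure mode: the sort assumes both
	// slices have the same length, so when a status is missing, the Less
	// function indexes past the end of the shorter slice and panics.
	steps := []string{"git-clone", "build", "image-digest-exporter"}
	statuses := []string{"git-clone", "build"} // one status was never parsed

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("panic:", r) // runtime error: index out of range [2] with length 2
		}
	}()
	sort.Slice(steps, func(i, j int) bool {
		// Pretend we need the matching status to decide the order.
		return statuses[i] < statuses[j] // panics when i or j == 2
	})
}
```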

So, this PR does the following things:

  1. The image digest exporter has the FallbackToLogsOnError policy removed. I can't think of a reason that we need this anymore.
  2. The Tekton controller no longer breaks out of the loop while it's parsing step statuses and instead simply ignores non-JSON termination messages (a rough sketch of this follows below).
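
A rough sketch of the behaviour described in point 2; the `result` type and function here are illustrative stand-ins, not the controller code in this PR:

```go
package status

import (
	"encoding/json"

	"go.uber.org/zap"
)

// result is a stand-in for the real result type carried in a step's
// termination message; only the shape matters for this sketch.
type result struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// parseMessages logs and skips a non-JSON termination message instead of
// aborting the whole loop, so the statuses of later steps are still read.
func parseMessages(logger *zap.SugaredLogger, messages []string) [][]result {
	var all [][]result
	for _, msg := range messages {
		var rs []result
		if err := json.Unmarshal([]byte(msg), &rs); err != nil {
			logger.Errorf("could not parse json message %q: %v", msg, err)
			continue // previously a break here dropped every later status
		}
		all = append(all, rs)
	}
	return all
}
```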

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you
review them:

See the contribution guide for more details.

@ghost ghost added kind/bug Categorizes issue or PR as related to a bug. needs-cherry-pick Indicates a PR needs to be cherry-pick to a release branch labels Mar 13, 2020
@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Mar 13, 2020
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 13, 2020
Collaborator

@bobcatfish bobcatfish left a comment

Thanks for fixing this so quickly!!!

  • Even though this fixes a panic, I'd prefer to get some tests in before merging if possible (can be convinced to merge anyway, though)
  • TerminationMessagePolicy is still referenced in some builders; I think we can remove that now?

logger.Errorf("Could not parse json message %q because of %w", msg, err)
break
// step generated non-JSON termination message; ignore
continue
Collaborator

This change seems a bit inconsistent with the other error handling in this function, which still logs and breaks. Two thoughts:

  1. Are we sure we want to continue processing messages when this happens? Although we don't want to panic, maybe it's reasonable to assume that encountering a badly formatted message indicates the rest are bogus, or is it that we could have some steps with good termination messages and some with bad?
  2. Maybe logging here still makes sense, if you want to track down what happened later?

Collaborator

Side note / scope creep: why are we creating a new logger in this function? If this is in the controller, can we pass in the (properly configured) logger instead? <-- probably material for a separate issue

Author

This change seems a bit inconsistent with the other error handling in this function, which still logs and breaks

This is true, though the other breaks in this function, lines 136 & 148, are inside another nested for loop. That means that when those lines are hit, the outer for loop still continues processing the rest of the steps.

Are we sure we want to continue processing messages when this happens? Although we don't want to panic, maybe it's reasonable to assume that encountering a badly formatted message indicates the rest are bogus, or is it that we could have some steps with good termination messages and some with bad?

My feeling is that this issue is evidence of some steps with good termination messages, and some with bad. There are other ways that a Step might inadvertently break the termination message and I think we can afford to be a little more robust in that scenario. Off the top of my head one example where this could happen again would be that a process spawned inside the container could race with the entrypoint to write the termination log and clobber it. That would be some awful user error but it isn't inconceivable, and I think Tekton should be able to handle that gracefully without abandoning the rest of the steps.

Maybe logging here still makes sense, if you want to track down what happened later?

Yeah that's totally fair. I'll add this back in and look at threading the Logger through from the caller.

Author

I've threaded the logger through from the controller and added the log line back in.

Author

I've factored the inner for loop into its own function. I think this makes the control flow a little clearer. I've also added a test for the panic-inducing case.

}
return false
})
return taskRunSteps
Collaborator

is this part of fixing the issue or is it a separate improvement? if the latter, maybe a separate PR makes sense?

Author

@ghost ghost Mar 13, 2020

So, the panic originates here. I have two thoughts related to this:

  1. we shouldn't really be leaving a sort func lying around that panics in the face of a length mismatch.
  2. this sorting code felt very complicated when compared to its intended purpose. I spent a good amount of time just trying to puzzle out what it was trying to achieve when the out-of-bounds error occurred. I would guess that the number of steps in a task is always going to be relatively small and I'm not convinced that the performance optimization this func was implementing was really worthwhile (even if the number of steps went into the hundreds or thousands... I don't think this would have changed the runtime really dramatically).

I guess we could approach this a couple of different ways (that I can think of, happy to go a different way if there are more I don't mention):

  • as done here, sort the statuses and push any that don't appear in taskRunSteps to the back of the list (roughly sketched at the end of this comment). I should probably also add a comment describing this behaviour.
  • check whether len(taskRunSteps) and len(taskSpecSteps) are equal; if not, return the list unsorted. Leave the stepStateSorter in place and comment about the constraint.
  • document the panic with a comment.
  • leave alone for now and revisit in a separate PR.

I'm kinda happy with the change as it stands (though it needs a comment) but I also don't mind if you think we should tackle this in a separate PR or a totally different way. WDYT?

Edit: secret option number 5 - change this func's signature to return an error as well
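
For what it's worth, a rough sketch of the first option above (order statuses by step order and push unmatched ones to the back); the function and names are illustrative, not the code in this change:

```go
package status

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// sortStatusesByStepOrder orders container statuses to match the order of
// the task's steps and pushes any status whose name has no matching step
// to the back, instead of indexing past the end of a shorter slice (the
// panic described in this thread).
func sortStatusesByStepOrder(statuses []corev1.ContainerStatus, stepNames []string) {
	rank := make(map[string]int, len(stepNames))
	for i, name := range stepNames {
		rank[name] = i
	}
	unmatched := len(stepNames) // unknown names sort after every known step
	pos := func(s corev1.ContainerStatus) int {
		if r, ok := rank[s.Name]; ok {
			return r
		}
		return unmatched
	}
	sort.SliceStable(statuses, func(i, j int) bool {
		return pos(statuses[i]) < pos(statuses[j])
	})
}
```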

Author

I've added a comment describing the new sort behaviour. Still happy to take this change out or modify it if preferred.

Member

Nit (and note): sort.Slice uses reflection (whereas sort.Sort doesn't, relying instead on the interface contract). This means it's going to be a bit slower — might be premature optimization to worry about that, though :stuck_out_tongue:
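
A tiny illustration of the distinction, using plain strings rather than anything Tekton-specific:

```go
package main

import (
	"fmt"
	"sort"
)

// byName implements sort.Interface, so sort.Sort can order it through the
// interface contract without reflection; sort.Slice takes a closure and
// relies on reflection to swap elements, which is a little slower, though
// hardly noticeable for the handful of steps in a typical task.
type byName []string

func (b byName) Len() int           { return len(b) }
func (b byName) Less(i, j int) bool { return b[i] < b[j] }
func (b byName) Swap(i, j int)      { b[i], b[j] = b[j], b[i] }

func main() {
	a := []string{"push", "build", "lint"}
	sort.Sort(byName(a)) // interface-based, no reflection

	b := []string{"push", "build", "lint"}
	sort.Slice(b, func(i, j int) bool { return b[i] < b[j] }) // reflection-based

	fmt.Println(a, b)
}
```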

Author

I've put the original sort func back. Don't want to hold up this PR if that change is controversial.

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 13, 2020
pkg/pod/status.go (outdated review thread, resolved)
@tekton-robot
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| test/builder/container.go | 86.8% | 91.7% | 4.8 |
| test/builder/step.go | 18.2% | 19.0% | 0.9 |


@afrittoli
Member

I wonder if #2029 was a different manifestation of this issue. In that case the overall Task failure status was set to that of the skipped image digest exporter step, which is incorrect since the failure was in a different step and the digest exporter step should only have been skipped - see the full yaml.


@ghost
Author

ghost commented Mar 16, 2020

/test pull-tekton-pipeline-integration-tests



@ghost
Author

ghost commented Mar 16, 2020

/retest

Member

@vdemeester vdemeester left a comment

/meow

@tekton-robot
Collaborator

@vdemeester: cat image

In response to this:

/meow


@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2020
Collaborator

@bobcatfish bobcatfish left a comment

Thanks for the updates and explanations! Looks great! I had two minor pieces of feedback but also happy to merge as is!

/lgtm

// updateStatusStartTime searches for a result called "StartedAt" in the JSON-formatted termination message
// of a step and sets the State.Terminated.StartedAt field to this time if it's found. The "StartedAt" result
// is also removed from the list of results in the container status.
func updateStatusStartTime(s *corev1.ContainerStatus) error {
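
As a reading aid, a rough sketch of what a helper with that docstring might do; the result type, field names, and timestamp format below are assumptions rather than the actual Tekton implementation:

```go
package status

import (
	"encoding/json"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// updateStatusStartTimeSketch illustrates the documented behaviour with a
// placeholder result type: find a "StartedAt" result in the JSON
// termination message, use it as the terminated state's start time, and
// drop that entry from the remaining results.
func updateStatusStartTimeSketch(s *corev1.ContainerStatus) error {
	if s.State.Terminated == nil {
		return nil
	}
	type result struct {
		Key   string `json:"key"`
		Value string `json:"value"`
	}
	var results []result
	if err := json.Unmarshal([]byte(s.State.Terminated.Message), &results); err != nil {
		return err
	}
	remaining := results[:0]
	for _, r := range results {
		if r.Key == "StartedAt" {
			t, err := time.Parse(time.RFC3339, r.Value) // the time format is an assumption
			if err != nil {
				return err
			}
			s.State.Terminated.StartedAt = metav1.NewTime(t)
			continue // the StartedAt entry is removed from the results
		}
		remaining = append(remaining, r)
	}
	msg, err := json.Marshal(remaining)
	if err != nil {
		return err
	}
	s.State.Terminated.Message = string(msg)
	return nil
}
```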
Collaborator

I love that this is a separate function! I don't mind if you want to ignore this, but since it's separate I'd expect to see unit tests for it, and also I find that once we have a private/unexported function that 1) can have meaningful tests and 2) has good reason to have a docstring, it often makes sense to move it into its own package and make it public/exported

@@ -600,6 +603,94 @@ func TestMakeTaskRunStatus(t *testing.T) {
}},
},
},
}, {
desc: "non-json-termination-message-with-steps-afterwards-shouldnt-panic",
Collaborator

🎉

@@ -624,7 +715,8 @@ func TestMakeTaskRunStatus(t *testing.T) {
},
}

got := MakeTaskRunStatus(tr, pod, v1alpha1.TaskSpec{})
logger, _ := logging.NewLogger("", "status")
Collaborator

Thanks for threading the logger through! It's kind of a pain but I think it's the right way to do it :D

break
}
if err := updateStatusStartTime(&s); err != nil {
logger.Errorf("error setting the start time of step %q in taskrun %q: %w", s.Name, tr.Name, err)
Collaborator

Ah okay, so we're just logging an error but continuing onward - I could see including a comment here about why, since it might look like an oversight when just reading the code

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 17, 2020
@tekton-robot tekton-robot merged commit 153f1d1 into tektoncd:master Mar 17, 2020
@ghost ghost mentioned this pull request Mar 17, 2020
3 tasks
@ghost ghost removed the needs-cherry-pick Indicates a PR needs to be cherry-pick to a release branch label Mar 24, 2020
Successfully merging this pull request may close these issues: panic when submitting TaskRun