React on the errors and unknown events from k8s API #8
An exception sporadically happens for the watched objects:
It is, in turn, caused by unexpected event types coming from the Kubernetes API, specifically from the watch call (note that the object has metadata, but no uid or other standard k8s-object fields):
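For reference, such an ERROR event typically looks roughly like this (a hedged illustration with placeholder resource versions, not the original log; the `object` is a `v1.Status`, hence the missing uid and other standard fields):

```python
# Roughly what the watch stream yields in this case (placeholder values):
error_event = {
    "type": "ERROR",
    "object": {
        "kind": "Status",
        "apiVersion": "v1",
        "metadata": {},                 # metadata is present, but empty: no uid, etc.
        "status": "Failure",
        "message": "too old resource version: 12345 (67890)",  # placeholder versions
        "reason": "Expired",
        "code": 410,
    },
}
```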
This ERROR event, in turn, is caused by how the watch-calls to the API are implemented in the Kubernetes library (be that Python or any other language):

* The watch is performed as a regular GET-call with the `?watch=true` argument.
* `resourceVersion` is remembered from the latest event (technically, for every event, but the latest one overrides).
* `resourceVersion` is passed to the next GET-call, so that the stream continues from that point only.

However, because nothing happens for a few minutes, the `resourceVersion` becomes "too old", i.e. it is no longer remembered by Kubernetes. This behaviour is documented in the k8s docs (https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes).

So, the Kubernetes API returns and yields these ERROR events. In theory, the client should die with an exception (I would expect that instead of the "normal" ERROR events). As observed, these "too old resource version" errors are streamed very fast, non-stop (which is also strange).
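As a rough, illustrative sketch of this mechanism (not this project's actual implementation; it uses the official `kubernetes` Python client and watches pods purely as an example), the loop looks approximately like this:

```python
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

resource_version = None
w = watch.Watch()
while True:
    # Each iteration is effectively a GET-call with ?watch=true, resuming from
    # the last seen resourceVersion so the stream continues from that point only.
    for event in w.stream(v1.list_namespaced_pod, namespace="default",
                          resource_version=resource_version):
        obj = event["object"]
        # Remember the latest resourceVersion (overridden on every event).
        if getattr(obj, "metadata", None) and obj.metadata.resource_version:
            resource_version = obj.metadata.resource_version
        # Once this version expires on the server side, the stream starts
        # yielding ERROR events ("too old resource version") instead of
        # raising an exception.
```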
The only valid way out here is to restart the watch-call from scratch, i.e. with no `resourceVersion` provided. This is equivalent to an operator restart: every object is listed again, and goes through the handling cycle (usually a do-nothing handling). But the restart is "soft": the queues, the asyncio tasks, and generally the state of the operator are not lost, and no time is wasted (e.g. on pod allocation).
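A minimal sketch of such a "soft" restart (again assuming the official `kubernetes` Python client and pods as the watched resource, purely for illustration; the actual implementation in this PR differs) could look like this:

```python
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

def infinite_watch(namespace: str = "default"):
    """Yield watch-events forever, restarting the watch-call from scratch
    (no resourceVersion) whenever the server reports ERROR events."""
    while True:
        w = watch.Watch()
        for event in w.stream(v1.list_namespaced_pod, namespace=namespace):
            if event["type"] == "ERROR":
                # "too old resource version" and alike: stop this stream and
                # re-list everything from scratch instead of dying.
                w.stop()
                break
            yield event
        # The outer loop re-lists all objects again; they go through the
        # handling cycle once more (usually a do-nothing handling).
```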
This PR does these 3 things: