This repository has been archived by the owner on Sep 14, 2020. It is now read-only.

React on the errors and unknown events from k8s API #8

Merged: nolar merged 3 commits into master from error-events on Mar 27, 2019

Conversation

@nolar (Contributor) commented on Mar 27, 2019

Issues: #10

There is an exception that sporadically happens for the watched objects:

Traceback (most recent call last):
  ………
  File "/usr/local/lib/python3.7/dist-packages/kopf/reactor/queueing.py", line 83, in watcher
    key = (resource, event['object']['metadata']['uid'])
KeyError: 'uid'

It is, in turn, caused by unexpected event types coming from the Kubernetes API, specifically from the watch call (note that the object has metadata, but no uid or other standard k8s-object fields):

{'object': {'apiVersion': 'v1',
            'code': 410,
            'kind': 'Status',
            'message': 'too old resource version: 190491269 (208223535)',
            'metadata': {},
            'reason': 'Gone',
            'status': 'Failure'},
 'raw_object': {'apiVersion': 'v1',
                'code': 410,
                'kind': 'Status',
                'message': 'too old resource version: 190491269 (208223535)',
                'metadata': {},
                'reason': 'Gone',
                'status': 'Failure'},
 'type': 'ERROR'}

This ERROR event, in turn, is caused by how the watch-calls to the API are implemented in the Kubernetes client library (be that Python or any other language); a minimal sketch follows the list:

  • A GET call is made to get a list of objects, with the ?watch=true argument.
  • The response to such watch-calls is a JSON stream (one JSON object per line) of events.
  • The library parses ("unmarshalls") the events and yields them to the caller.
  • The resourceVersion is remembered from the latest event (technically, from every event, but the latest one overrides).
  • When one GET call is terminated by the server (usually within a few seconds), a new one is made, and this continues forever.
  • The latest known resourceVersion is passed to the next GET call, so that the stream continues only from that point.
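
For illustration only, here is a minimal sketch of such a watch loop in plain Python, using requests against an illustrative /api/v1/pods endpoint (this is not kopf's or the kubernetes client's actual code):

import json
import requests

def watch_stream(api_url, resource_version=None):
    # Each iteration of the outer loop is one GET ?watch=true call;
    # the server terminates the stream periodically, and we reconnect.
    while True:
        params = {'watch': 'true'}
        if resource_version is not None:
            params['resourceVersion'] = resource_version
        with requests.get(f'{api_url}/api/v1/pods', params=params, stream=True) as resp:
            for line in resp.iter_lines():
                if not line:
                    continue
                event = json.loads(line)  # one JSON event per line of the stream
                # Remember the latest seen resourceVersion for the next GET call.
                metadata = event.get('object', {}).get('metadata', {})
                resource_version = metadata.get('resourceVersion', resource_version)
                yield event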

However, when nothing happens for a few minutes, the resourceVersion becomes "too old", i.e. it is no longer remembered by Kubernetes. This behaviour is documented in the k8s docs (https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes):

A given Kubernetes server will only preserve a historical list of changes for a limited time. Clusters using etcd3 preserve changes in the last 5 minutes by default. When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a list operation, and starting the watch from the resourceVersion returned by that new list operation.
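
For clarity, this is how such a 410 Gone event can be recognized in the stream; a sketch based on the event structure shown above, not the PR's exact code:

def is_resource_version_gone(event):
    # The "too old resource version" case arrives as an ERROR event
    # whose object is a v1 Status with code 410 (Gone).
    obj = event.get('object', {})
    return (
        event.get('type') == 'ERROR' and
        obj.get('kind') == 'Status' and
        obj.get('code') == 410
    )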

So, the Kubernetes client returns and yields these ERROR events. In theory, it should raise an exception (I would expect that instead of "normal" ERROR events). As observed, these "too old resource version" errors are then streamed very fast and non-stop (which is also strange).

The only valid way out is to restart the watch call from scratch, i.e. with no resourceVersion provided.

This is equivalent to an operator restart: every object is listed again and goes through the handling cycle (usually a do-nothing handling). But this restart is "soft": the queues, the asyncio tasks, and generally the state of the operator are not lost, and no time is wasted on pod allocation.

This PR does three things (an illustrative dispatch follows the list):

  • Soft-restarts the watching cycle on the "too old resource version" errors.
  • Fails on the ERROR event types for the unknown errors.
  • Warns about unknown event types that may appear in the future, and ignores them.
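
To make the intended behaviour concrete, here is an illustrative dispatch loop for these three cases; the names stream_events and WatchingError and the logging setup are hypothetical, not kopf's actual API:

import logging

logger = logging.getLogger(__name__)

class WatchingError(Exception):
    """Raised for ERROR events other than the known 'too old' case."""

def watcher(stream_events):
    # `stream_events` is a hypothetical callable that starts a fresh
    # list+watch cycle and yields the raw events from the client library.
    while True:  # each iteration of this loop is one "soft restart"
        for event in stream_events():
            obj = event.get('object', {})
            if event['type'] == 'ERROR' and obj.get('code') == 410:
                logger.warning("Restarting the watch: the resourceVersion is too old.")
                break  # back to the while-loop: re-list and re-watch from scratch
            elif event['type'] == 'ERROR':
                raise WatchingError(f"Error in the watch-stream: {obj}")
            elif event['type'] not in ('ADDED', 'MODIFIED', 'DELETED'):
                logger.warning("Ignoring an unknown event type: %r", event)
            else:
                yield event  # a normal event: proceed with the handling cycle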

@nolar nolar requested a review from samurang87 as a code owner March 27, 2019 13:07
zincr bot commented Mar 27, 2019

🤖 zincr found 0 problems, 0 warnings

✅ Large Commits
✅ Approvals
✅ Specification
✅ Dependency Licensing

@nolar nolar merged commit 84a99f5 into master Mar 27, 2019
@nolar nolar deleted the error-events branch April 21, 2019 21:55
@nolar nolar added the bug Something isn't working label Apr 26, 2019
@nolar nolar added this to the 1.0 milestone Apr 30, 2019