This repository has been archived by the owner on Sep 14, 2020. It is now read-only.

React on the errors and unknown events from k8s API #8

Merged: nolar merged 3 commits into master from error-events on Mar 27, 2019

Conversation

@nolar (Contributor) commented on Mar 27, 2019

Issues: #10

There is an exception that sporadically happens for the watched objects:

Traceback (most recent call last):
  ………
  File "/usr/local/lib/python3.7/dist-packages/kopf/reactor/queueing.py", line 83, in watcher
    key = (resource, event['object']['metadata']['uid'])
KeyError: 'uid'

It is, in turn, caused by unexpected event types coming from the Kubernetes API, specifically from the watch call (note that the object has metadata, but no uid or other standard k8s-object fields):

{'object': {'apiVersion': 'v1',
            'code': 410,
            'kind': 'Status',
            'message': 'too old resource version: 190491269 (208223535)',
            'metadata': {},
            'reason': 'Gone',
            'status': 'Failure'},
 'raw_object': {'apiVersion': 'v1',
                'code': 410,
                'kind': 'Status',
                'message': 'too old resource version: 190491269 (208223535)',
                'metadata': {},
                'reason': 'Gone',
                'status': 'Failure'},
 'type': 'ERROR'}

This ERROR event, in turn, is caused by how the watch-calls to the API are implemented in the Kubernetes client library (be that Python or any other language); a minimal sketch follows the list:

  • A GET call is made to get a list of objects, with the ?watch=true argument.
  • The response to such watch-calls is a JSON stream (one JSON object per line) of events.
  • The library parses ("unmarshalls") the events and yields them to the caller.
  • The resourceVersion is remembered from the latest event (technically, from every event, but the latest one overrides).
  • When one GET call is terminated by the server (usually within a few seconds), a new one is made, and this continues forever.
  • The latest known resourceVersion is passed to the next GET call, so that the stream continues only from that point.
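
For illustration only, here is a minimal sketch of such a watch loop in plain Python, using requests against an illustrative /api/v1/pods endpoint (this is not kopf's or the kubernetes client's actual code):

import json
import requests

def watch_stream(api_url, resource_version=None):
    # Each iteration of the outer loop is one GET ?watch=true call;
    # the server terminates the stream periodically, and we reconnect.
    while True:
        params = {'watch': 'true'}
        if resource_version is not None:
            params['resourceVersion'] = resource_version
        with requests.get(f'{api_url}/api/v1/pods', params=params, stream=True) as resp:
            for line in resp.iter_lines():
                if not line:
                    continue
                event = json.loads(line)  # one JSON event per line of the stream
                # Remember the latest seen resourceVersion for the next GET call.
                metadata = event.get('object', {}).get('metadata', {})
                resource_version = metadata.get('resourceVersion', resource_version)
                yield event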

However, when nothing happens for a few minutes, the resourceVersion becomes "too old", i.e. it is no longer remembered by Kubernetes. This behaviour is documented in the k8s docs (https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes):

A given Kubernetes server will only preserve a historical list of changes for a limited time. Clusters using etcd3 preserve changes in the last 5 minutes by default. When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a list operation, and starting the watch from the resourceVersion returned by that new list operation.
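
For clarity, this is how such a 410 Gone event can be recognized in the stream; a sketch based on the event structure shown above, not the PR's exact code:

def is_resource_version_gone(event):
    # The "too old resource version" case arrives as an ERROR event
    # whose object is a v1 Status with code 410 (Gone).
    obj = event.get('object', {})
    return (
        event.get('type') == 'ERROR' and
        obj.get('kind') == 'Status' and
        obj.get('code') == 410
    )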

So, the Kubernetes client returns and yields these ERROR events. In theory, it should raise an exception (I would expect that instead of "normal" ERROR events). As observed, these "too old resource version" errors are then streamed very fast and non-stop (which is also strange).

The only valid way out is to restart the watch call from scratch, i.e. with no resourceVersion provided.

This is equivalent to an operator restart: every object is listed again and goes through the handling cycle (usually a do-nothing handling). But this restart is "soft": the queues, the asyncio tasks, and generally the state of the operator are not lost, and no time is wasted on pod allocation.

This PR does three things (an illustrative dispatch follows the list):

  • Soft-restarts the watching cycle on the "too old resource version" errors.
  • Fails on the ERROR event types for the unknown errors.
  • Warns about unknown event types that may appear in the future, and ignores them.
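
To make the intended behaviour concrete, here is an illustrative dispatch loop for these three cases; the names stream_events and WatchingError and the logging setup are hypothetical, not kopf's actual API:

import logging

logger = logging.getLogger(__name__)

class WatchingError(Exception):
    """Raised for ERROR events other than the known 'too old' case."""

def watcher(stream_events):
    # `stream_events` is a hypothetical callable that starts a fresh
    # list+watch cycle and yields the raw events from the client library.
    while True:  # each iteration of this loop is one "soft restart"
        for event in stream_events():
            obj = event.get('object', {})
            if event['type'] == 'ERROR' and obj.get('code') == 410:
                logger.warning("Restarting the watch: the resourceVersion is too old.")
                break  # back to the while-loop: re-list and re-watch from scratch
            elif event['type'] == 'ERROR':
                raise WatchingError(f"Error in the watch-stream: {obj}")
            elif event['type'] not in ('ADDED', 'MODIFIED', 'DELETED'):
                logger.warning("Ignoring an unknown event type: %r", event)
            else:
                yield event  # a normal event: proceed with the handling cycle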

@nolar nolar requested a review from samurang87 as a code owner March 27, 2019 13:07
zincr bot commented Mar 27, 2019

🤖 zincr found 0 problems, 0 warnings

✅ Large Commits
✅ Approvals
✅ Specification
✅ Dependency Licensing

@nolar nolar merged commit 84a99f5 into master Mar 27, 2019
@nolar nolar deleted the error-events branch April 21, 2019 21:55
@nolar nolar added the bug Something isn't working label Apr 26, 2019
@nolar nolar added this to the 1.0 milestone Apr 30, 2019