New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kubernetes: Rules get removed in case of transient API server fault #406

Closed
linki opened this Issue Aug 9, 2017 · 7 comments

Comments

4 participants
@linki
Copy link
Member

linki commented Aug 9, 2017

We had an issue with Skipper as Kubernetes Ingress Proxy which resulted in Rules being removed due to a temporary outage of the Kubernetes control plane (specifically the API server). Skipper kept running but every request was answered with a 404.

We identified the cause of the issue and explain it here:

Logs from Skipper (with some internal identifiers changed):

After fetching the Ingress objects, Skipper fails to retrieve the associated Services and removes the corresponding rules.

{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=error msg=\"error while getting service: Get https://10.3.0.1:443/api/v1/namespaces/default/services/my-nginx: dial tcp 10.3.0.1:443: getsockopt: connection refused\" \n","stream":"stderr","time":"2017-08-08T10:04:11.104944951Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=error msg=\"error while getting service: Get https://10.3.0.1:443/api/v1/namespaces/default/services/something-else: dial tcp 10.3.0.1:443: getsockopt: connection refused\" \n","stream":"stderr","time":"2017-08-08T10:04:11.106085158Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=error msg=\"error while getting service: Get https://10.3.0.1:443/api/v1/namespaces/default/services/pierone-test: dial tcp 10.3.0.1:443: getsockopt: connection refused\" \n","stream":"stderr","time":"2017-08-08T10:04:11.107442304Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=error msg=\"error while getting service: Get https://10.3.0.1:443/api/v1/namespaces/default/services/some-demo: dial tcp 10.3.0.1:443: getsockopt: connection refused\" \n","stream":"stderr","time":"2017-08-08T10:04:11.108444994Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=error msg=\"error while getting service: Get https://10.3.0.1:443/api/v1/namespaces/default/services/repo-manager: dial tcp 10.3.0.1:443: getsockopt: connection refused\" \n","stream":"stderr","time":"2017-08-08T10:04:11.109524745Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=error msg=\"error while getting service: Get https://10.3.0.1:443/api/v1/namespaces/default/services/some-reporting: dial tcp 10.3.0.1:443: getsockopt: connection refused\" \n","stream":"stderr","time":"2017-08-08T10:04:11.11079438Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=error msg=\"error while getting service: Get https://10.3.0.1:443/api/v1/namespaces/default/services/github-bot: dial tcp 10.3.0.1:443: getsockopt: connection refused\" \n","stream":"stderr","time":"2017-08-08T10:04:11.112227931Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=error msg=\"error while getting service: Get https://10.3.0.1:443/api/v1/namespaces/default/services/tiny-http-server: dial tcp 10.3.0.1:443: getsockopt: connection refused\" \n","stream":"stderr","time":"2017-08-08T10:04:11.113289809Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=error msg=\"error while getting service: Get https://10.3.0.1:443/api/v1/namespaces/default/services/user-deploy: dial tcp 10.3.0.1:443: getsockopt: connection refused\" \n","stream":"stderr","time":"2017-08-08T10:04:11.11456491Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=info msg=\"route settings, update, route: kube__healthz: Path(\"/kube-system/healthz\") \u0026\u0026 Source(\"10.0.0.0/8\", \"192.168.0.0/16\", \"172.16.0.0/12\", \"127.0.0.1/32\", \"fd00::/8\", \"::1/32\") -\u003e status() -\u003e \u003cshunt\u003e\" \n","stream":"stderr","time":"2017-08-08T10:04:11.114907187Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=info msg=\"route settings, update, deleted id: kube_default__some_reporting__some_reporting_stups_test_zalan_do____some_reporting\" \n","stream":"stderr","time":"2017-08-08T10:04:11.114936854Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=info msg=\"route settings, update, deleted id: kube_default__pierone_test__pierone_stups_test_zalan_do____pierone_test\" \n","stream":"stderr","time":"2017-08-08T10:04:11.114945546Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=info msg=\"route settings, update, deleted id: kube_default__some_github_bot__some_github_bot_stups_test_zalan_do____some_github_bot\" \n","stream":"stderr","time":"2017-08-08T10:04:11.114952445Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=info msg=\"route settings, update, deleted id: kube_default__my_nginx_v2__my_nginx_v2_stups_test_zalan_do____my_nginx\" \n","stream":"stderr","time":"2017-08-08T10:04:11.114958369Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=info msg=\"route settings, update, deleted id: kube_default__something-else__maven_stups_test_zalan_do____something-else\" \n","stream":"stderr","time":"2017-08-08T10:04:11.114964428Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=info msg=\"route settings, update, deleted id: kube_default__some_demo__something_stups_test_zalan_do____some_demo\" \n","stream":"stderr","time":"2017-08-08T10:04:11.115160816Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=info msg=\"route settings, update, deleted id: kube_default__some_manager__some_manager_stups_test_zalan_do____some_manager\" \n","stream":"stderr","time":"2017-08-08T10:04:11.115176511Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:11Z\" level=info msg=\"route settings, update, deleted id: kube_default__tiny_http_server__tiny_http_server_stups_test_zalan_do____tiny_http_server\" \n","stream":"stderr","time":"2017-08-08T10:04:11.11518326Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:17Z\" level=info msg=\"route settings received\" \n","stream":"stderr","time":"2017-08-08T10:04:17.118303058Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:17Z\" level=info msg=\"route settings applied\" \n","stream":"stderr","time":"2017-08-08T10:04:17.118941316Z"}  

After that, even fetching the Ingress objects itself fails. The currently present rules will be retained.

{"log":"[APP]time=\"2017-08-08T10:04:14Z\" level=error msg=\"error while receiveing updated ingress routes;Get https://10.3.0.1:443/apis/extensions/v1beta1/ingresses: dial tcp 10.3.0.1:443: getsockopt: connection refused\" \n","stream":"stderr","time":"2017-08-08T10:04:14.116887079Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:14Z\" level=info msg=\"route settings, update, route: kube__healthz: Path(\"/kube-system/healthz\") \u0026\u0026 Source(\"10.0.0.0/8\", \"192.168.0.0/16\", \"172.16.0.0/12\", \"127.0.0.1/32\", \"fd00::/8\", \"::1/32\") -\u003e status() -\u003e \u003cshunt\u003e\" \n","stream":"stderr","time":"2017-08-08T10:04:14.116919464Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:14Z\" level=info msg=\"route settings received\" \n","stream":"stderr","time":"2017-08-08T10:04:14.11693004Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:14Z\" level=info msg=\"route settings applied\" \n","stream":"stderr","time":"2017-08-08T10:04:14.203067835Z"}  

Later, Skipper can both fetch the Ingresses and Services and consequently adds the rules back.

{"log":"[APP]time=\"2017-08-08T10:04:20Z\" level=error msg=\"error while getting service: request failed, status: 404, 404 Not Found\" \n","stream":"stderr","time":"2017-08-08T10:04:20.333691Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:20Z\" level=info msg=\"route settings, update, route: kube_default__my_nginx_v2__my_nginx_v2_stups_test_zalan_do____my_nginx: Host(/^my-nginx-v2[.]stups-test[.]zalan[.]do$/) -\u003e \"http://10.3.0.194:80\"\" \n","stream":"stderr","time":"2017-08-08T10:04:20.333840941Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:20Z\" level=info msg=\"route settings, update, route: kube_default__something-else__maven_stups_test_zalan_do____something-else: Host(/^maven[.]stups-test[.]zalan[.]do$/) -\u003e \"http://10.3.0.232:80\"\" \n","stream":"stderr","time":"2017-08-08T10:04:20.333856506Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:20Z\" level=info msg=\"route settings, update, route: kube_default__some_demo__something_stups_test_zalan_do____some_demo: Host(/^something[.]stups-test[.]zalan[.]do$/) -\u003e \"http://10.3.0.40:80\"\" \n","stream":"stderr","time":"2017-08-08T10:04:20.333863811Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:20Z\" level=info msg=\"route settings, update, route: kube_default__some_reporting__some_reporting_stups_test_zalan_do____some_reporting: Host(/^some-reporting[.]stups-test[.]zalan[.]do$/) -\u003e \"http://10.3.0.218:80\"\" \n","stream":"stderr","time":"2017-08-08T10:04:20.333870814Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:20Z\" level=info msg=\"route settings, update, route: kube_default__pierone_test__pierone_stups_test_zalan_do____pierone_test: Host(/^pierone[.]stups-test[.]zalan[.]do$/) -\u003e \"http://10.3.0.59:80\"\" \n","stream":"stderr","time":"2017-08-08T10:04:20.333877425Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:20Z\" level=info msg=\"route settings, update, route: kube_default__some_github_bot__some_github_bot_stups_test_zalan_do____some_github_bot: Host(/^github-bot[.]stups-test[.]zalan[.]do$/) -\u003e \"http://10.3.0.234:80\"\" \n","stream":"stderr","time":"2017-08-08T10:04:20.334097526Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:20Z\" level=info msg=\"route settings, update, route: kube_default__some_manager__some_manager_stups_test_zalan_do____some_manager: Host(/^some-manager[.]stups-test[.]zalan[.]do$/) -\u003e \"http://10.3.0.85:80\"\" \n","stream":"stderr","time":"2017-08-08T10:04:20.33411531Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:20Z\" level=info msg=\"route settings, update, route: kube_default__tiny_http_server__tiny_http_server_stups_test_zalan_do____tiny_http_server: Host(/^tiny-http-server[.]stups-test[.]zalan[.]do$/) -\u003e \"http://10.3.0.196:80\"\" \n","stream":"stderr","time":"2017-08-08T10:04:20.334122493Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:20Z\" level=info msg=\"route settings, update, route: kube__healthz: Path(\"/kube-system/healthz\") \u0026\u0026 Source(\"10.0.0.0/8\", \"192.168.0.0/16\", \"172.16.0.0/12\", \"127.0.0.1/32\", \"fd00::/8\", \"::1/32\") -\u003e status() -\u003e \u003cshunt\u003e\" \n","stream":"stderr","time":"2017-08-08T10:04:20.334128909Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:20Z\" level=info msg=\"route settings received\" \n","stream":"stderr","time":"2017-08-08T10:04:20.334137009Z"}  
{"log":"[APP]time=\"2017-08-08T10:04:20Z\" level=info msg=\"route settings applied\" \n","stream":"stderr","time":"2017-08-08T10:04:20.335007483Z"}  
@linki

This comment has been minimized.

Copy link
Member Author

linki commented Aug 9, 2017

By now, we identified this as a commonly made programming mistake when writing these kind of Kubernetes controller loops. We made the conceptionally equivalent mistakes in other projects, too, independent from each other, across team boundaries:

  • Mate where a n+1 API request per ELB in order to detect that ELB's canonical hosted zone could fail (e.g. due to rate limiting) and would remove that ELB from the desired list, which would result in it being actively deleted if it existed because it was found successfully in the last sync loop.
  • ALB Ingress Controller where a n+1 request per ELB in order to fetch its listeners could fail (e.g. due to rate limiting) and would remove that ELB from the desired list. Consequences similar to above.
  • and somewhat similar in Cluster Autoscaler that fetched a non-complete list of running pods from the API server and subsequently lowered the desired size of an ASG just to restore it to its correct value in the next loop. However, at this time the ASG already terminated a lot of instances (while booting up new ones). This resulted in temporary drop of available nodes and thus Pods were evicted. This is now nicely handled by smoothing the downscaling process by limiting the allowed node difference per controller loop: hjacobs/kube-aws-autoscaler@20a7741#diff-3d86eab708633b65198caf26be6c8353R154
@hjacobs

This comment has been minimized.

Copy link
Member

hjacobs commented Aug 10, 2017

Is this issue resolved with #409 ?

@linki

This comment has been minimized.

Copy link
Member Author

linki commented Aug 10, 2017

We needed a follow up: #410 Currently testing it.

@szuecs

This comment has been minimized.

Copy link
Member

szuecs commented Aug 11, 2017

@linki so tests were successful or not?

@szuecs

This comment has been minimized.

Copy link
Member

szuecs commented Aug 11, 2017

Discussed offline

@szuecs szuecs closed this Aug 11, 2017

@Raffo

This comment has been minimized.

Copy link
Contributor

Raffo commented Aug 11, 2017

Can you please write here as well if this fixes the issue?

@szuecs

This comment has been minimized.

Copy link
Member

szuecs commented Aug 11, 2017

@Raffo sure it's fixed and tested also in production. That's why I closed the issue, sorry that this was not clear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment