
Incorrect Upstream causes endless error logs in gloo container #3761

Closed
lgadban opened this issue Oct 19, 2020 · 5 comments · Fixed by #4515

Labels
Type: Bug Something isn't working

Comments

lgadban (Contributor) commented Oct 19, 2020

Describe the bug
When there is an Upstream that contains an invalid destination, specifically a missing k8s Service, the gloo container will endlessly log errors.

To Reproduce
Steps to reproduce the behavior:

  1. Create an Upstream that points to a nonexistent Service
  2. Check gloo pod and see errors like:
{"level":"error","ts":1603124079.3074384,"logger":"gloo-ee.v1.event_loop.setup.v1.event_loop.syncer.kubernetes_eds","caller":"kubernetes/eds.go:206","msg":"upstream gloo-system.bad-upstream: port 8080 not found for service bad-petstore","version":"1.5.1","stacktrace":"github.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.filterEndpoints\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.5.1/projects/gloo/pkg/plugins/kubernetes/eds.go:206\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).List\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.5.1/projects/gloo/pkg/plugins/kubernetes/eds.go:121\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).watch.func1\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.5.1/projects/gloo/pkg/plugins/kubernetes/eds.go:130\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).watch.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.5.1/projects/gloo/pkg/plugins/kubernetes/eds.go:157"}
{"level":"error","ts":1603124079.8723693,"logger":"gloo-ee.v1.event_loop.setup.v1.event_loop.syncer.kubernetes_eds","caller":"kubernetes/eds.go:206","msg":"upstream gloo-system.bad-upstream: port 8080 not found for service bad-petstore","version":"1.5.1","stacktrace":"github.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.filterEndpoints\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.5.1/projects/gloo/pkg/plugins/kubernetes/eds.go:206\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).List\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.5.1/projects/gloo/pkg/plugins/kubernetes/eds.go:121\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).watch.func1\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.5.1/projects/gloo/pkg/plugins/kubernetes/eds.go:130\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).watch.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.5.1/projects/gloo/pkg/plugins/kubernetes/eds.go:157"}

Expected behavior
The error should not be logged endlessly.

Additionally, is there a possibility that this scenario could fall under invalid route replacement?
For example, if a route references an Upstream in the above state, i.e. when a "rejected" Upstream is used in a route, could it be treated the same as a route to a nonexistent Upstream?
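For illustration, a route exercising that scenario might look like the following (a sketch only; the VirtualService name and the exact matcher fields are illustrative assumptions, not taken from this issue):

apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: route-to-bad-upstream   # hypothetical name
  namespace: gloo-system
spec:
  virtualHost:
    domains:
    - '*'
    routes:
    - matchers:
      - prefix: /
      routeAction:
        single:
          upstream:
            name: bad-upstream        # the rejected Upstream described above
            namespace: gloo-system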

Additional context

Client: {"version":"1.5.0"}
Server: {"type":"Gateway","enterprise":true,"kubernetes":{"containers":[{"Tag":"1.5.1","Name":"grpcserver-ui","Registry":"quay.io/solo-io"},{"Tag":"1.5.1","Name":"grpcserver-ee","Registry":"quay.io/solo-io"},{"Tag":"1.5.1","Name":"grpcserver-envoy","Registry":"quay.io/solo-io"},{"Tag":"1.5.1","Name":"discovery","Registry":"quay.io/solo-io"},{"Tag":"1.5.1","Name":"extauth-ee","Registry":"quay.io/solo-io"},{"Tag":"1.5.1","Name":"gateway","Registry":"quay.io/solo-io"},{"Tag":"1.5.1","Name":"gloo-ee-envoy-wrapper","Registry":"quay.io/solo-io"},{"Tag":"1.5.1","Name":"gloo-ee","Registry":"quay.io/solo-io"},{"Tag":"1.5.1","Name":"rate-limit-ee","Registry":"quay.io/solo-io"},{"Tag":"5","Name":"redis","Registry":"docker.io"}],"namespace":"gloo-system"}}
kdorosh (Contributor) commented Mar 26, 2021

With respect to:

Additionally, is there a possibility that this scenario could fall under invalid route replacement?
For example if a route references an Upstream in the above state, i.e. when a "rejected" Upstream is used in a route, could it be treated the same as a route to a nonexistent Upstream?

Just tested this on 1.7.0-rc2 open source as part of #4497 and confirmed that this case is covered.

@kdorosh kdorosh self-assigned this Mar 26, 2021
kdorosh (Contributor) commented Mar 29, 2021

Sample Upstream to apply in order to reproduce:

apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: bad-upstream
  namespace: gloo-system
spec:
  kube:
    serviceName: doesnotexist
    serviceNamespace: default
    servicePort: 443

kdorosh (Contributor) commented Mar 29, 2021

One potential idea for a fix is to update the EDS cache logic to fire less often here (ignore endpoint updates that don't result in hash differences, just like we do for snapshot syncing).

We could do that here:

return filterEndpoints(ctx, writeNamespace, endpointList, serviceList, podList, c.upstreams), nil
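To make the hash-compare idea concrete, here is a minimal, self-contained Go sketch (the Endpoint and watcher types and the publish callback are hypothetical stand-ins, not Gloo's actual EDS types; it only shows skipping updates whose endpoint hash is unchanged):

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Endpoint is a hypothetical stand-in for a discovered endpoint.
type Endpoint struct {
	Address string
	Port    int
}

// hashEndpoints produces an order-independent hash of the endpoint list,
// so two identical lists always hash to the same value.
func hashEndpoints(eps []Endpoint) uint64 {
	sorted := append([]Endpoint(nil), eps...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].Address != sorted[j].Address {
			return sorted[i].Address < sorted[j].Address
		}
		return sorted[i].Port < sorted[j].Port
	})
	h := fnv.New64a()
	for _, ep := range sorted {
		fmt.Fprintf(h, "%s:%d|", ep.Address, ep.Port)
	}
	return h.Sum64()
}

// watcher remembers the hash of the last endpoint list it published.
type watcher struct {
	lastHash uint64
	seen     bool
}

// maybePublish forwards the update only when the filtered endpoint list
// actually changed, so an unchanged list no longer re-triggers the sync
// (and its error logging) on every poll.
func (w *watcher) maybePublish(eps []Endpoint, publish func([]Endpoint)) {
	h := hashEndpoints(eps)
	if w.seen && h == w.lastHash {
		return // no hash difference: skip this update
	}
	w.lastHash, w.seen = h, true
	publish(eps)
}

func main() {
	w := &watcher{}
	publish := func(eps []Endpoint) { fmt.Println("publishing", len(eps), "endpoints") }

	eps := []Endpoint{{Address: "10.0.0.1", Port: 8080}}
	w.maybePublish(eps, publish)                                                    // published
	w.maybePublish(eps, publish)                                                    // identical list: skipped
	w.maybePublish(append(eps, Endpoint{Address: "10.0.0.2", Port: 8080}), publish) // changed: published
}

Where exactly such a check would sit relative to filterEndpoints, and whether the log line itself should also be demoted or rate limited, is left to the actual fix.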

kdorosh (Contributor) commented Mar 30, 2021

I have a potential fix for k8s to push up; tested on Consul, and the refresh is a lot more sane. No need to bring the change to other EDS types.

solo-changelog-bot mentioned this issue Mar 30, 2021
EItanya (Member) commented Mar 30, 2021

One potential idea for a fix is to update the cache eds logic to fire less often here (ignore endpoint updates that don't result in hash differences, just like we do for the snapshot syncing)

we could do that here

return filterEndpoints(ctx, writeNamespace, endpointList, serviceList, podList, c.upstreams), nil

I really like that idea. Have you done any perf testing on it? I think we should test with a very large environment before going too far with this.
