[validation] failurePolicy makes the control-plane unstable #4468

Closed
bcollard opened this issue Mar 22, 2021 · 9 comments · Fixed by #6017
Labels
Area: Guardrails Size: M 3 - 5 days Type: Bug Something isn't working zendesk

Comments

@bcollard
Contributor

Describe the bug
Setting the validation webhook's failurePolicy to Fail makes the control plane unstable.

To Reproduce
Steps to reproduce the behavior:

  1. Run helm install for Gloo EE with these values:
cat << EOF > values.yaml
gloo:
  gateway:
    validation:
      allowWarnings: false
      alwaysAcceptResources: false
      failurePolicy: Fail
EOF
  2. The control-plane pods go into CrashLoopBackOff.
@bcollard bcollard added the Type: Bug Something isn't working label Mar 22, 2021
@guydc

guydc commented Aug 2, 2021

This also reproduces with Gloo 1.8.0.

Installation breaks:

❯ helm install gloo gloo/gloo --version=1.8.0 --namespace test-ns --set gateway.validation.alwaysAcceptResources=false --set gateway.validation.failurePolicy=Fail
Error: Internal error occurred: failed calling webhook "gateway.test-ns.svc": Post "https://gateway.test-ns.svc:443/validation?timeout=30s": dial tcp 10.96.51.174:443: connect: connection refused

The control-plane pods remain in CrashLoopBackOff:

❯ kubectl get pods -n test-ns                                                   
NAME                             READY   STATUS             RESTARTS   AGE
discovery-897f8c8cb-2xtpf        0/1     CrashLoopBackOff   6          10m
gateway-99bd84596-nnh68          0/1     CrashLoopBackOff   6          10m
gateway-proxy-7c8bf6fcb5-pb2xq   1/1     Running            0          10m
gloo-5ff796587c-bl4nh            0/1     CrashLoopBackOff   6          10m

@chrisgaun chrisgaun added Roadmap: NOW Type: Bug Something isn't working and removed Type: Bug Something isn't working Roadmap: NOW labels Oct 12, 2021
@sam-heilbron
Contributor

sam-heilbron commented Oct 20, 2021

Validation Webhook FailurePolicy: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#failure-policy

failurePolicy defines how unrecognized errors and timeout errors from the admission webhook are handled. Allowed values are Ignore or Fail.

Ignore means that an error calling the webhook is ignored and the API request is allowed to continue.
Fail means that an error calling the webhook causes the admission to fail and the API request to be rejected.
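The two policies quoted above can be sketched as a small decision function. This is a simplified model for illustration, not the Kubernetes API server's implementation; `FailurePolicy`, `admit`, and the `call_webhook` callable are hypothetical names introduced here.

```python
from enum import Enum


class FailurePolicy(Enum):
    IGNORE = "Ignore"
    FAIL = "Fail"


def admit(policy, call_webhook):
    """Return True if the API request is admitted.

    call_webhook is a hypothetical callable: it returns True or False for an
    explicit allow/deny verdict, and raises ConnectionError or TimeoutError
    when the webhook cannot be reached.
    """
    try:
        # The webhook answered, so its verdict is used regardless of policy.
        return call_webhook()
    except (ConnectionError, TimeoutError):
        # Error calling the webhook: the failure policy decides the outcome.
        return policy is FailurePolicy.IGNORE
```

In this model, a "connection refused" error like the one in the helm output above corresponds to `call_webhook` raising `ConnectionRefusedError`, so the request is rejected under `Fail` and allowed under `Ignore`.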

@sam-heilbron sam-heilbron added this to the Phase 2 - Guardrails milestone Oct 20, 2021
@sam-heilbron
Contributor

Logs from the gloo pod:

gloo-5ff796587c-q5qcm {"level":"info","ts":1634738583.8613143,"logger":"gloo.v1.event_loop","caller":"v1/setup_event_loop.sk.go:57","msg":"event loop started","version":"1.8.0"}
gloo-5ff796587c-q5qcm {"level":"fatal","ts":1634738583.9618957,"logger":"gloo","caller":"setuputils/main_setup.go:88","msg":"error in setup: finding bootstrap configuration: list did not find settings default.default","version":"1.8.0","stacktrace":"github.com/solo-io/gloo/pkg/utils/setuputils.Main\n\t/workspace/gloo/pkg/utils/setuputils/main_setup.go:88\ngithub.com/solo-io/gloo/projects/gloo/pkg/setup.startSetupLoop\n\t/workspace/gloo/projects/gloo/pkg/setup/setup.go:25\ngithub.com/solo-io/gloo/projects/gloo/pkg/setup.Main\n\t/workspace/gloo/projects/gloo/pkg/setup/setup.go:17\nmain.main\n\t/workspace/gloo/projects/gloo/cmd/main.go:14\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}

@sam-heilbron sam-heilbron added Size: M 3 - 5 days and removed Size: TBD labels Oct 20, 2021
@nrjpoddar

Here's a sequence of events that can possibly cause this failure:

  1. FailurePolicy: Fail is set
  2. Helm tries to create the default Custom Resources (CRs) before the control plane (which serves the validation webhook) is ready, so the webhook call fails and the CRs are rejected.
  3. The control plane (CP) starts up, but it depends on the default CRs. Because the CRs never get created, the CP may get stuck in CrashLoopBackOff.

There are a bunch of options to fix it, depending on the UX we want.

One option is to have Helm schedule a Job that creates the default CRs. The Job would wait for the CRD definitions, the webhook configuration, and the webhook service to be ready, in that order, before creating the CRs.
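A minimal sketch of the ordered wait such a Job could perform, assuming each dependency can be probed with a readiness function. `wait_in_order` and the `(name, is_ready)` pairs are hypothetical and only illustrate the ordering; this is not the fix that was eventually merged.

```python
import time


def wait_in_order(checks, timeout_s=60.0, interval_s=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Block until each (name, is_ready) check passes, strictly in order.

    Returns the names in the order they became ready. Raises TimeoutError
    naming the first dependency that did not become ready before the
    overall deadline.
    """
    ready = []
    deadline = clock() + timeout_s
    for name, is_ready in checks:
        # A later check is never probed until the earlier ones have passed.
        while not is_ready():
            if clock() >= deadline:
                raise TimeoutError(f"{name} not ready after {timeout_s}s")
            sleep(interval_s)
        ready.append(name)
    return ready
```

In a real Job, the three checks would query the cluster for the CRD definitions, the webhook configuration, and the webhook service, and the default CRs would be created only after `wait_in_order` returns.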

@guydc

guydc commented Oct 21, 2021

One possible way is to have a job scheduled via Helm to create default CRs which will wait for CRD definition, webhook config and webhook service to be ready in this order before creating the CRs.

@nrjpoddar this can also be a basis for a CRD upgrade solution, as mentioned here for gloo-mesh.

@byrdog55

Comment from Zendesk:
Zendesk: 492 linked successfully.
Zendesk: #492
By: Jim Hayner <jim.hayner@solo.io>

@soloio-bot

Zendesk ticket #492 has been linked to this issue.

@solo-io solo-io deleted a comment from soloio-bot Mar 1, 2022

@kdorosh kdorosh self-assigned this Mar 3, 2022
@kdorosh
Contributor

kdorosh commented Mar 3, 2022

PR that should help: #6017
