RKE2 failing to start: fatal, Failed to apply network policy default-network-ingress-webhook-policy to namespace kube-system #5693

Closed
tmmorin opened this issue Apr 8, 2024 · 10 comments
Labels: area/security, kind/bug


tmmorin commented Apr 8, 2024

Context:

  • a new RKE2 v1.28.8+rke2r1 server node (control plane node) is being added to a 1.27 cluster (which is being upgraded to 1.28)
  • this cluster has webhooks defined on the Namespace resource (the Rancher built-in rancher.cattle.io.namespaces webhook, and also a Kyverno admission webhook)
  • RKE2 fails to start and produces the following error message:
Apr 08 19:17:33 management-cluster-cp-d5098df345-mnpm4 rke2[3922111]: time="2024-04-08T19:17:33Z" level=fatal msg="Failed to apply network policy default-network-ingress-webhook-policy to namespace kube-system: Internal error occurred: failed calling webhook \"rancher.cattle.io.namespaces\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s\": context deadline exceeded"

This error is produced by this part of the RKE2 code:

rke2/pkg/rke2/np.go

Lines 213 to 225 in bbda824

ns.Annotations[template.annotationKey] = cisAnnotationValue

if err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
    if _, err := cs.CoreV1().Namespaces().Update(ctx, ns, metav1.UpdateOptions{}); err != nil {
        if apierrors.IsConflict(err) {
            return updateNamespaceRef(ctx, cs, ns)
        }
        return err
    }
    return nil
}); err != nil {
    logrus.Fatalf("Failed to apply network policy %s to namespace %s: %v", template.name, ns.Name, err)
}

After applying network policies to namespaces, this code annotates those namespaces. In the presence of webhooks that trigger on Namespace updates, this does not work at this early stage of RKE2 startup, because kube-proxy is not yet ready to set up connectivity to the webhook service (a long-standing issue, see #4781 (comment)).
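
For diagnosis, something like the following (a sketch, assuming jq is available) lists the fail-closed validating webhooks that intercept Namespace objects and can therefore block this early namespace update; the same query works for mutatingwebhookconfigurations:

# list validating webhooks that match namespaces and have failurePolicy: Fail
kubectl get validatingwebhookconfigurations -o json \
  | jq -r '.items[].webhooks[]
      | select(.failurePolicy == "Fail")
      | select([.rules[]?.resources[]?] | index("namespaces"))
      | .name'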


tmmorin commented Apr 8, 2024

For reference, the issue we have about this in the Sylva project: https://gitlab.com/sylva-projects/sylva-core/-/issues/1155


tmmorin commented Apr 9, 2024

Hello @brandond -- I see you commented at #4781 (comment), which is related to this issue.

It seems to me that the whole class of cases where "RKE2 startup is prevented by a webhook acting on some API operation done before kube-proxy is ready" needs to be addressed... could that be solved by changing when kube-proxy is set up?

brandond commented

RKE2 uses annotations on the system namespaces to track the state of various hardening processes that should only be performed once. Any products that deploy fail-closed webhooks that block modifications to the system namespaces are likely to break RKE2, if deployed before the hardening occurs, or during upgrades that make changes to the hardened policies.

I personally think deploying fail-closed webhooks that block changes to core types, and hosting the webhook on a pod in the cluster it is protecting, is a bad idea. It is super common to end up with chicken-and-egg problems like this during a cold cluster restart, but it seems to be a recurring pattern across the ecosystem.

We can evaluate changing how we track our hardening to avoid modifying the system namespaces, but this is unlikely to be changed soon.


tmmorin commented Apr 15, 2024

Any products that deploy fail-closed webhooks that block modifications to the system namespaces are likely to break RKE2, if deployed before the hardening occurs, or during upgrades that make changes to the hardened policies.

Rancher Server itself would, I think, fall into this category, right?

[...] if deployed before the hardening occurs, or during upgrades that make changes to the hardened policies.

This includes simple scenarios like:

  • cluster is installed
  • Rancher is installed (or any Kyverno policy that checks things on Namespaces)
  • cluster is then upgraded with a newer version of RKE2 that does additional hardening

My feeling here is that the central issue is that RKE2 won't start if some of the API actions it wants to perform trigger a fail-closed webhook. Addressing that seems necessary beyond this Namespace-hardening-specific issue, and solving it would resolve this issue among others.

I don't disagree that perhaps "webhooks that block changes to core types, and hosting the webhook on a pod in the cluster it is protecting, is a bad idea", but given that this is commonplace, in particular in the Rancher/RKE2 ecosystem, isn't it worth making RKE2 more robust to it?

Also, as a side note: the RKE2 hardening code annotates the Namespaces apparently just to keep track that the network policies have been applied.
I see some drawbacks to doing it this way:

  • it does not help with updating the content of an existing network policy
  • a platform engineering team using RKE2 might want to apply/update network policies with different tooling

Lastly, today some of those network policies are applied even if the component they relate to isn't enabled in RKE2 (e.g. the ingress-nginx network policies are applied even if deployment of ingress-nginx by RKE2 is disabled).


brandond commented Apr 15, 2024

  • it does not help with updating the content of an existing network policy

That is intentional. Once the policies are installed and the annotation added, RKE2 will not change them, so that administrators can modify them as necessary to suit their needs. The annotations can be removed to force RKE2 to re-sync the policies.

  • a platform engineering team using RKE2 might want to apply/update network policies with different tooling

You are welcome to do that; once RKE2 has created them it will no longer modify them as long as the annotations on the NS remain in place.

Like I said earlier, we can look at different ways to do this, but RKE2 has functioned like this for quite a while, and we are unlikely to refactor it on short notice.
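
For illustration, forcing such a re-sync could look like this (a minimal sketch; the annotation key shown is a placeholder, check the namespace's annotations to find the exact keys RKE2 sets):

# list the annotations RKE2 added to the namespace
kubectl get namespace kube-system -o jsonpath='{.metadata.annotations}{"\n"}'

# remove one of them (placeholder key) so RKE2 re-applies the corresponding policy on next startup
kubectl annotate namespace kube-system <rke2-policy-annotation-key>-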


tmmorin commented Apr 16, 2024

[...] we are unlikely to refactor it on short notice.

Of course, I understand this well, and would not ask for that.

We have already implemented a viable short-term workaround for this issue, by ensuring that these annotations are set before the RKE2 upgrade (https://gitlab.com/sylva-projects/sylva-core/-/issues/1155).
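
Roughly, the workaround boils down to something like this (a sketch; the annotation key and value are placeholders, copied from a cluster already running the target RKE2 version):

# pre-set the tracking annotation so the upgraded RKE2 skips the Namespace update
# that the fail-closed webhook would otherwise block
kubectl annotate namespace kube-system <rke2-policy-annotation-key>=<value> --overwrite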

a platform engineering team using RKE2 might want to apply/update network policies with different tooling

You are welcome to do that; once RKE2 has created them it will no longer modify them as long as the annotations on the NS remain in place.

Well, as said above, this works in the short term, but for each new version of RKE2 we'll have to check whether new annotations of this kind are necessary, and we have to maintain and test the code that ensures this is done prior to the upgrade.

I'd rather prefer an approach where we could "opt out" of this: a configuration flag allowing RKE2 not to handle these network policies. Or perhaps have them shipped as a Helm chart like some other base charts (e.g. the CNI). Or, for the particular case of network policies related to ingress-nginx, have them bundled in the ingress-nginx chart (so that we don't get the network policies if we set disableComponents.pluginComponents: [rke2-ingress-nginx]).

But again, the underlying issue looks more important to me: the fact that we can't have any fail-closed webhook on any resource that RKE2 touches during the early stages where kube-proxy isn't ready is seriously limiting. I of course wouldn't ask for a short-term fix on this either, but I'm interested to know what the plans are for this.

brandond commented

matchConditions are GA in 1.30; I'd like to see folks start using those to exclude system users or groups from webhooks.
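
As a rough sketch (the webhook configuration name, the webhook index, and the choice of the system:masters group are placeholders to adapt to your webhook and to whichever user/group RKE2's own requests come from in your setup):

# add a matchCondition so that requests from the system:masters group bypass the webhook
kubectl patch validatingwebhookconfiguration <your-webhook-config> --type=json \
  -p='[{"op":"add","path":"/webhooks/0/matchConditions","value":[
        {"name":"exclude-system-masters",
         "expression":"!(\"system:masters\" in request.userInfo.groups)"}]}]'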


fmoral2 commented Jun 21, 2024

Validated on Version:

rke2 version v1.30.2-rc5+rke2r1 (3f678f964ad849e24449e49f0c2c44e75d944c9f)

Environment Details

Infrastructure: Cloud EC2 instance

Node(s) CPU architecture, OS, and Version: Ubuntu, amd64

Cluster Configuration:
  • 3 servers, 1 agent
  • 1 server

Steps to validate the fix

  1. Install RKE2
  2. Install the webhooks via Helm (Rancher, Kyverno)
  3. Join a new node on an upgraded version
  4. Validate that RKE2 is up and running
  5. Validate that no error from a webhook is seen in the logs
  6. Validate pods

Reproduction of the issue:

rke2 version v1.27.2+rke2r1 (300a06dabe679c779970112a9cb48b289c17536c)

helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
helm install rancher rancher-latest/rancher \
    --namespace cattle-system \
    --set hostname=rancher.yourdomain.com
 
kubectl create namespace kyverno
 helm repo add kyverno https://kyverno.github.io/kyverno/
helm install kyverno kyverno/kyverno --namespace kyverno
 

 

kubectl get validatingwebhookconfigurations
NAME                               WEBHOOKS   AGE
cert-manager-webhook               1          3m12s
rke2-ingress-nginx-admission       1          21m
rke2-snapshot-validation-webhook   1          21m
validating-webhook-configuration   12         86s


 :~> kubectl get mutatingwebhookconfigurations
NAME                             WEBHOOKS   AGE
cert-manager-webhook             1          3m21s
mutating-webhook-configuration   9          95s



On a new node joining the cluster with the upgraded version:
sudo curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION=v1.28.8+rke2r1 sh -

 sudo journalctl -u rke2-server -f | grep "failed to call webhook"
Jun 21 12:01:31   rke2[2060]: time="2024-06-21T12:01:31Z" level=warning msg="Failed to create Kubernetes secret: Internal error occurred: failed calling webhook \"rancher.cattle.io.secrets\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/secrets?timeout=15s\": context deadline exceeded"




Validation Results:

 
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest


helm install rancher rancher-latest/rancher --version 2.8.5 --namespace cattle-system --create-namespace --set hostname=rancher.yourdomain.com


kubectl create namespace kyverno
 helm repo add kyverno https://kyverno.github.io/kyverno/
helm install kyverno kyverno/kyverno --namespace kyverno
 

 





On a new node joining the cluster with the upgraded version:
sudo curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION=v1.30.2-rc5+rke2r1 sh -

sudo journalctl -u rke2-server -f | grep "failed to call webhook"
(no output -- no webhook call failures)

 

 


> kubectl get mutatingwebhookconfigurations
NAME                                    WEBHOOKS   AGE
cert-manager-webhook                    1          9m22s
kyverno-policy-mutating-webhook-cfg     1          6m50s
kyverno-resource-mutating-webhook-cfg   0          6m49s
kyverno-verify-mutating-webhook-cfg     1          6m49s


 kubectl get validatingwebhookconfigurations
NAME                                            WEBHOOKS   AGE
cert-manager-webhook                            1          9m35s
kyverno-cleanup-validating-webhook-cfg          1          7m35s
kyverno-exception-validating-webhook-cfg        1          7m3s
kyverno-global-context-validating-webhook-cfg   1          7m3s
kyverno-policy-validating-webhook-cfg           1          7m3s
kyverno-resource-validating-webhook-cfg         0          7m2s
kyverno-ttl-validating-webhook-cfg              1          7m35s
rke2-ingress-nginx-admission                    1          51m
rke2-snapshot-validation-webhook                1          51m


fmoral2 closed this as completed Jun 21, 2024

Kellen275 commented Aug 3, 2024

It looks like this was backported to v1.28.11. Is there a recommended workaround for folks on earlier 1.28 versions?


brandond commented Aug 4, 2024

If possible, you can temporarily edit the webhook configuration to fail open so that rke2 can start up successfully. Once that's done you can revert it to the desired configuration.

Preferably you would upgrade though.
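
For example, something along these lines (a sketch; the webhook configuration name and webhook index are placeholders):

# temporarily let the blocking webhook fail open so rke2 can start
kubectl patch validatingwebhookconfiguration <blocking-webhook-config> --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'

# once rke2 is up, revert to the desired configuration
kubectl patch validatingwebhookconfiguration <blocking-webhook-config> --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Fail"}]'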
