RKE2 failing to start: fatal, Failed to apply network policy default-network-ingress-webhook-policy to namespace kube-system #5693

Closed
tmmorin opened this issue Apr 8, 2024 · 10 comments
Labels: area/security, kind/bug


tmmorin commented Apr 8, 2024

Context:

  • a new RKE2 v1.28.8+rke2r1 server node (control plane node) is being added to a 1.27 cluster (which is being upgraded to 1.28)
  • this cluster has webhooks defined on the Namespace resource (the Rancher built-in rancher.cattle.io.namespaces webhook, and also a Kyverno admission webhook)
  • RKE2 fails to start and produces the following error message:
Apr 08 19:17:33 management-cluster-cp-d5098df345-mnpm4 rke2[3922111]: time="2024-04-08T19:17:33Z" level=fatal msg="Failed to apply network policy default-network-ingress-webhook-policy to namespace kube-system: Internal error occurred: failed calling webhook \"rancher.cattle.io.namespaces\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s\": context deadline exceeded"

This error is produced by this part of the RKE2 code:

rke2/pkg/rke2/np.go

Lines 213 to 225 in bbda824

ns.Annotations[template.annotationKey] = cisAnnotationValue

if err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
    if _, err := cs.CoreV1().Namespaces().Update(ctx, ns, metav1.UpdateOptions{}); err != nil {
        if apierrors.IsConflict(err) {
            return updateNamespaceRef(ctx, cs, ns)
        }
        return err
    }
    return nil
}); err != nil {
    logrus.Fatalf("Failed to apply network policy %s to namespace %s: %v", template.name, ns.Name, err)
}

After applying network policies to namespaces, this code annotates those namespaces. In the presence of webhooks that trigger on Namespace updates, this does not work at this early stage of RKE2 startup, because kube-proxy is not yet ready to set up connectivity to the webhook service (a long-standing issue, see #4781 (comment)).
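
For diagnosis, something like the following (a sketch, assuming jq is available) lists the fail-closed validating webhooks that intercept Namespace objects and can therefore block this early namespace update; the same query works for mutatingwebhookconfigurations:

# list validating webhooks that match namespaces and have failurePolicy: Fail
kubectl get validatingwebhookconfigurations -o json \
  | jq -r '.items[].webhooks[]
      | select(.failurePolicy == "Fail")
      | select([.rules[]?.resources[]?] | index("namespaces"))
      | .name'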


tmmorin commented Apr 8, 2024

For reference, the issue we have about this in the Sylva project: https://gitlab.com/sylva-projects/sylva-core/-/issues/1155


tmmorin commented Apr 9, 2024

Hello @brandond -- I see you commented at #4781 (comment), which is related to this issue.

It seems to me that the whole class of cases where "RKE2 startup is prevented by a webhook acting on some API operation done before kube-proxy is ready" needs to be addressed... could that be solved by changing when kube-proxy is set up?

brandond commented

RKE2 uses annotations on the system namespaces to track the state of various hardening processes that should only be performed once. Any products that deploy fail-closed webhooks that block modifications to the system namespaces are likely to break RKE2, if deployed before the hardening occurs, or during upgrades that make changes to the hardened policies.

I personally think deploying fail-closed webhooks that block changes to core types, and hosting the webhook on a pod in the cluster it is protecting, is a bad idea. It is super common to end up with chicken-and-egg problems like this during a cold cluster restart, but it seems to be a recurring pattern across the ecosystem.

We can evaluate changing how we track our hardening to avoid modifying the system namespaces, but this is unlikely to be changed soon.


tmmorin commented Apr 15, 2024

Any products that deploy fail-closed webhooks that block modifications to the system namespaces are likely to break RKE2, if deployed before the hardening occurs, or during upgrades that make changes to the hardened policies.

Rancher Server itself would, I think, fall into this category, right?

[...] if deployed before the hardening occurs, or during upgrades that make changes to the hardened policies.

This includes simple scenarios like:

  • cluster is installed
  • Rancher is installed (or any Kyverno policy that checks things on Namespaces)
  • cluster is then upgraded with a newer version of RKE2 that does additional hardening

My feeling here is that the central issue is that RKE2 won't start if some of the API actions it wants to perform trigger a fail-closed webhook. Addressing that seems necessary beyond this Namespace-hardening-specific issue, and solving it would resolve this issue among others.

I don't disagree that perhaps "webhooks that block changes to core types, and hosting the webhook on a pod in the cluster it is protecting, is a bad idea", but given that this is commonplace, in particular in the Rancher/RKE2 ecosystem, isn't it worth making RKE2 more robust to it?

Also, as a side note: the RKE2 hardening code annotates the Namespaces apparently just to keep track that the network policies have been applied.
I see some drawbacks to doing it this way:

  • it does not help with updating the content of an existing network policy
  • a platform engineering team using RKE2 might want to apply/update network policies with different tooling

Lastly, today some of those network policies are applied even if the component they relate to isn't enabled in RKE2 (e.g. the ingress-nginx network policies are applied even if deployment of ingress-nginx by RKE2 is disabled).


brandond commented Apr 15, 2024

  • it does not help with updating the content of an existing network policy

That is intentional. Once the policies are installed and the annotation added, RKE2 will not change them, so that administrators can modify them as necessary to suit their needs. The annotations can be removed to force RKE2 to re-sync the policies.

  • a platform engineering team using RKE2 might want to apply/update network policies with different tooling

You are welcome to do that; once RKE2 has created them it will no longer modify them as long as the annotations on the NS remain in place.

Like I said earlier, we can look at different ways to do this, but RKE2 has functioned like this for quite a while, and we are unlikely to refactor it on short notice.
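
For illustration, forcing such a re-sync could look like this (a minimal sketch; the annotation key shown is a placeholder, check the namespace's annotations to find the exact keys RKE2 sets):

# list the annotations RKE2 added to the namespace
kubectl get namespace kube-system -o jsonpath='{.metadata.annotations}{"\n"}'

# remove one of them (placeholder key) so RKE2 re-applies the corresponding policy on next startup
kubectl annotate namespace kube-system <rke2-policy-annotation-key>-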


tmmorin commented Apr 16, 2024

[...] we are unlikely to refactor it on short notice.

Of course, I understand this well, and would not ask for that.

We have already implemented a viable short-term workaround for this issue, by ensuring that these annotations are set before the RKE2 upgrade (https://gitlab.com/sylva-projects/sylva-core/-/issues/1155).
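
Roughly, the workaround boils down to something like this (a sketch; the annotation key and value are placeholders, copied from a cluster already running the target RKE2 version):

# pre-set the tracking annotation so the upgraded RKE2 skips the Namespace update
# that the fail-closed webhook would otherwise block
kubectl annotate namespace kube-system <rke2-policy-annotation-key>=<value> --overwrite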

a platform engineering team using RKE2 might want to apply/update network policies with different tooling

You are welcome to do that; once RKE2 has created them it will no longer modify them as long as the annotations on the NS remain in place.

Well, as said above, this works in the short term, but for each new version of RKE2 we'll have to check whether new annotations of this kind are necessary, and we have to maintain and test the code that ensures this is done prior to the upgrade.

I'd rather prefer an approach where we could "opt out" of this: a configuration flag allowing RKE2 not to handle these network policies. Or perhaps have them shipped as a Helm chart like some other base charts (e.g. the CNI). Or, for the particular case of network policies related to ingress-nginx, have them bundled in the ingress-nginx chart (so that we don't get the network policies if we set disableComponents.pluginComponents: [rke2-ingress-nginx]).

But again, the underlying issue looks more important to me: the fact that we can't have any fail-closed webhook on any resource that RKE2 touches during the early stages where kube-proxy isn't ready is seriously limiting. I of course wouldn't ask for a short-term fix on this either, but I'm interested to know what the plans are for this.

brandond commented

matchConditions are GA in 1.30; I'd like to see folks start using those to exclude system users or groups from webhooks.
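
As a rough sketch (the webhook configuration name, the webhook index, and the choice of the system:masters group are placeholders to adapt to your webhook and to whichever user/group RKE2's own requests come from in your setup):

# add a matchCondition so that requests from the system:masters group bypass the webhook
kubectl patch validatingwebhookconfiguration <your-webhook-config> --type=json \
  -p='[{"op":"add","path":"/webhooks/0/matchConditions","value":[
        {"name":"exclude-system-masters",
         "expression":"!(\"system:masters\" in request.userInfo.groups)"}]}]'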


fmoral2 commented Jun 21, 2024

Validated on Version:

rke2 version v1.30.2-rc5+rke2r1 (3f678f964ad849e24449e49f0c2c44e75d944c9f)

Environment Details

Infrastructure: Cloud EC2 instance

Node(s) CPU architecture, OS, and Version: Ubuntu, amd64

Cluster Configuration:
  • 3 servers, 1 agent
  • 1 server

Steps to validate the fix

  1. Install RKE2
  2. Install the webhooks via Helm (Rancher, Kyverno)
  3. Join a new node on an upgraded version
  4. Validate that RKE2 is up and running
  5. Validate that no error from a webhook is seen in the logs
  6. Validate pods

Reproduction of the issue:

rke2 version v1.27.2+rke2r1 (300a06dabe679c779970112a9cb48b289c17536c)

helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
helm install rancher rancher-latest/rancher \
    --namespace cattle-system \
    --set hostname=rancher.yourdomain.com
 
kubectl create namespace kyverno
 helm repo add kyverno https://kyverno.github.io/kyverno/
helm install kyverno kyverno/kyverno --namespace kyverno
 

 

kubectl get validatingwebhookconfigurations
NAME                               WEBHOOKS   AGE
cert-manager-webhook               1          3m12s
rke2-ingress-nginx-admission       1          21m
rke2-snapshot-validation-webhook   1          21m
validating-webhook-configuration   12         86s


 :~> kubectl get mutatingwebhookconfigurations
NAME                             WEBHOOKS   AGE
cert-manager-webhook             1          3m21s
mutating-webhook-configuration   9          95s



On a new node joining the cluster with the upgraded version:
sudo curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION=v1.28.8+rke2r1 sh -

 sudo journalctl -u rke2-server -f | grep "failed to call webhook"
Jun 21 12:01:31   rke2[2060]: time="2024-06-21T12:01:31Z" level=warning msg="Failed to create Kubernetes secret: Internal error occurred: failed calling webhook \"rancher.cattle.io.secrets\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/secrets?timeout=15s\": context deadline exceeded"




Validation Results:

 
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest


helm install rancher rancher-latest/rancher --version 2.8.5 --namespace cattle-system --create-namespace --set hostname=rancher.yourdomain.com


kubectl create namespace kyverno
 helm repo add kyverno https://kyverno.github.io/kyverno/
helm install kyverno kyverno/kyverno --namespace kyverno
 

 





On a new node joining the cluster with the upgraded version:
sudo curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION=v1.30.2-rc5+rke2r1 sh -

sudo journalctl -u rke2-server -f | grep "failed to call webhook"
(no output -- no webhook call failures)

 

 


> kubectl get mutatingwebhookconfigurations
NAME                                    WEBHOOKS   AGE
cert-manager-webhook                    1          9m22s
kyverno-policy-mutating-webhook-cfg     1          6m50s
kyverno-resource-mutating-webhook-cfg   0          6m49s
kyverno-verify-mutating-webhook-cfg     1          6m49s


 kubectl get validatingwebhookconfigurations
NAME                                            WEBHOOKS   AGE
cert-manager-webhook                            1          9m35s
kyverno-cleanup-validating-webhook-cfg          1          7m35s
kyverno-exception-validating-webhook-cfg        1          7m3s
kyverno-global-context-validating-webhook-cfg   1          7m3s
kyverno-policy-validating-webhook-cfg           1          7m3s
kyverno-resource-validating-webhook-cfg         0          7m2s
kyverno-ttl-validating-webhook-cfg              1          7m35s
rke2-ingress-nginx-admission                    1          51m
rke2-snapshot-validation-webhook                1          51m


fmoral2 closed this as completed Jun 21, 2024

Kellen275 commented Aug 3, 2024

It looks like this was backported to v1.28.11. Is there a recommended workaround for folks on earlier 1.28 versions?


brandond commented Aug 4, 2024

If possible, you can temporarily edit the webhook configuration to fail open so that rke2 can start up successfully. Once that's done you can revert it to the desired configuration.

Preferably you would upgrade though.
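
For example, something along these lines (a sketch; the webhook configuration name and webhook index are placeholders):

# temporarily let the blocking webhook fail open so rke2 can start
kubectl patch validatingwebhookconfiguration <blocking-webhook-config> --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'

# once rke2 is up, revert to the desired configuration
kubectl patch validatingwebhookconfiguration <blocking-webhook-config> --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Fail"}]'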
