[BUG] Nodes are not added to the external load balancer backend pool after load balancer is active #38812
Labels
kind/bug
Rancher Server Setup
Information about the Cluster
User Information
Describe the bug
When creating a downstream RKE cluster in Azure using node pools (1 master pool with the etcd and control plane roles and 3 worker pools), the master node is created first, and the load balancer is then created by the user addon. The master registers, and the 1st (and sometimes the 2nd) worker registers as well, but most likely only because the load balancer is not yet active in the virtual network, so those workers still get a working gateway and can register.
Once the load balancer finally becomes active, any new workers do not get the load balancer as their gateway, which breaks their registration. The remaining worker nodes get stuck in the "Registering" state, and any worker node added through the Rancher UI scaling feature gets stuck in "IP Resolved" until it times out and is deleted.
The correct logic would be to create the load balancer first and have Rancher wait for, and verify, that it is active in the virtual network before it starts adding nodes.
Since that is not happening, I believe there is a logic bug in Rancher itself.
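The wait/verify step suggested above could be sketched as a generic polling helper. This is only an illustration, not Rancher's actual code: `check_status` is a hypothetical callable standing in for an Azure API query that returns the load balancer's `provisioningState` (e.g. "Updating" or "Succeeded").

```python
import time


def wait_until_active(check_status, timeout=600, interval=10, sleep=time.sleep):
    """Poll check_status() until it reports "Succeeded" or the timeout elapses.

    check_status is a callable returning the load balancer's provisioning
    state as a string; in a real implementation it would query the Azure API.
    Returns True if the load balancer became active, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_status() == "Succeeded":
            return True
        sleep(interval)  # back off before polling again
    return False
```

Node registration would only proceed once this returns True, so no worker ever comes up with a gateway that is about to change underneath it.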
To Reproduce
Result
Only the initial master nodes and the first worker node are registered into the Kubernetes cluster. The other worker nodes get stuck in the "Registering" state, and no additional nodes can be added through the Rancher UI; they get stuck in "IP Resolved".
Expected Result
All nodes are registered and I can scale up nodes through the Rancher UI.
Screenshots
Additional context
The following code uses Terraform to create the downstream RKE1 cluster with 1 master node pool (control plane and etcd) and 3 worker pools (system, kafka, and general), plus a user addon to create an external load balancer:
Addon used to expose the ingress controller using a cloud load balancer:
# external load balancer
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  externalTrafficPolicy: Cluster
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: http
    - name: https
      port: 443
      protocol: TCP
      targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx
  type: LoadBalancer
Information about nodes, pods, and services with the Rancher CLI
Provisioning Log for the cluster
DNS configuration
All nodes have the same DNS config.
Rancher agent container logs
Rancher agent logs of a node stuck in the "Registering" state.