NodeRef is missing from machines #1062

Open
uhthomas opened this issue Feb 22, 2023 · 10 comments

@uhthomas

I'm running into an issue with clusterctl move where it thinks the nodes are not provisioned. This is not the case, as the Kubernetes cluster is running and healthy.

❯ clusterctl move --kubeconfig-context=admin@unwind-bootstrap --to-kubeconfig=$HOME/.kube/config --to-kubeconfig-context=admin@unwind -v10
No default config file available
Performing move...
Discovering Cluster API objects
MetalMachineTemplate Count=2
TalosControlPlane Count=1
Secret Count=15
MachineDeployment Count=1
MetalCluster Count=1
TalosConfigTemplate Count=1
ServerBinding Count=5
Machine Count=5
MetalMachine Count=5
Cluster Count=1
Environment Count=1
ConfigMap Count=1
ServerClass Count=1
Server Count=5
TalosConfig Count=5
MachineSet Count=1
Total objects Count=51
Excluding secret from move (not linked with any Cluster) name="siderolink"
Error: failed to get object graph: failed to check for provisioned infrastructure: [cannot start the move operation while the control plane for "/, Kind=" default/unwind is not yet initialized, cannot start the move operation while "/, Kind=" default/unwind-cp-dbnpd is still provisioning the node, cannot start the move operation while "/, Kind=" default/unwind-cp-89sgs is still provisioning the node, cannot start the move operation while "/, Kind=" default/unwind-cp-wrc8j is still provisioning the node, cannot start the move operation while "/, Kind=" default/unwind-cp-49wdk is still provisioning the node, cannot start the move operation while "/, Kind=" default/unwind-cp-mmg97 is still provisioning the node]
sigs.k8s.io/cluster-api/cmd/clusterctl/client/cluster.(*objectMover).Move
        sigs.k8s.io/cluster-api/cmd/clusterctl/client/cluster/mover.go:96
sigs.k8s.io/cluster-api/cmd/clusterctl/client.(*clusterctlClient).move
        sigs.k8s.io/cluster-api/cmd/clusterctl/client/move.go:125
sigs.k8s.io/cluster-api/cmd/clusterctl/client.(*clusterctlClient).Move
        sigs.k8s.io/cluster-api/cmd/clusterctl/client/move.go:97
sigs.k8s.io/cluster-api/cmd/clusterctl/cmd.runMove
        sigs.k8s.io/cluster-api/cmd/clusterctl/cmd/move.go:101
sigs.k8s.io/cluster-api/cmd/clusterctl/cmd.glob..func16
        sigs.k8s.io/cluster-api/cmd/clusterctl/cmd/move.go:59
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/cobra@v1.6.1/command.go:916
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/cobra@v1.6.1/command.go:1044
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/cobra@v1.6.1/command.go:968
sigs.k8s.io/cluster-api/cmd/clusterctl/cmd.Execute
        sigs.k8s.io/cluster-api/cmd/clusterctl/cmd/root.go:99
main.main
        sigs.k8s.io/cluster-api/cmd/clusterctl/main.go:27
runtime.main
        runtime/proc.go:250
runtime.goexit
        runtime/asm_amd64.s:1594

This error message leads here, which suggests the machines are missing a NodeRef. Why would that be? What information can I give you to help diagnose this? The logs don't show anything interesting from what I can see.


Originally raised here.

@smira
Member

smira commented Feb 22, 2023

This should work as long as CAPI has access to the workload cluster. What you should see is that Node resources have a label set by Sidero, and Sidero should set ProviderID in the Node resource. After that CAPI core should pick this up and add a NodeRef to the Machine resource in the management cluster.
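
Something along these lines should show each step of that chain, assuming the kubeconfig contexts from your clusterctl invocation above (node and machine names are placeholders):

# On the workload cluster: the Sidero label and the provider ID on the Node
kubectl --context admin@unwind get node <node-name> --show-labels
kubectl --context admin@unwind get node <node-name> -o jsonpath='{.spec.providerID}'

# On the management cluster: the NodeRef that CAPI core should have added to the Machine
kubectl --context admin@unwind-bootstrap -n default get machine <machine-name> -o jsonpath='{.status.nodeRef}'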

@uhthomas
Author

I do see that Sidero has set a label on the node, but I do not see a ProviderID.

"metal.sidero.dev/uuid": "4c4c4544-0047-4410-8034-b9c04f575631",

The machine configuration shows the correct IP address, and there is otherwise no indication of communication issues between the management cluster and the workload cluster.

@uhthomas
Author

I still haven't been able to get this to work properly.

"conditions": [
    {
        "lastTransitionTime": "2023-03-18T20:10:06Z",
        "status": "True",
        "type": "Ready"
    },
    {
        "lastTransitionTime": "2023-03-18T20:09:46Z",
        "status": "True",
        "type": "BootstrapReady"
    },
    {
        "lastTransitionTime": "2023-03-18T20:10:06Z",
        "status": "True",
        "type": "InfrastructureReady"
    },
    {
        "lastTransitionTime": "2023-03-18T20:09:46Z",
        "reason": "WaitingForNodeRef",
        "severity": "Info",
        "status": "False",
        "type": "NodeHealthy"
    }
],

The IP addresses don't seem to match. Could this be the problem? I don't think there's much I can do about it, as the machines are assigned a DHCP IP when network booting but have a static IP in the config. Even if I leave DHCP enabled, the address is different when the Kubernetes cluster actually comes up, likely because of bonding?
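
For what it's worth, this is roughly how I've been comparing the addresses the two sides report (node name is a placeholder, machine name taken from the clusterctl output above):

# Addresses the Node reports in the workload cluster
kubectl --context admin@unwind get node <node-name> -o jsonpath='{.status.addresses}'

# Addresses recorded on the Machine in the management cluster
kubectl --context admin@unwind-bootstrap -n default get machine unwind-cp-dbnpd -o jsonpath='{.status.addresses}'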

@dhess

dhess commented Aug 20, 2023

@uhthomas I was having the same problem, namely no NodeRef on my Machines, and my search of issues on this repo led me here.

After some investigation, I noticed the following error being spammed in my capi-controller-manager logs:

E0820 16:24:17.507089       1 controller.go:324] "Reconciler error" err="failed to create cluster accessor: error creating client for remote cluster \"cluster-api-management-plane/management-plane\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://localhost:7445/api/v1?timeout=10s\": dial tcp [::1]:7445: connect: connection refused" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="cluster-api-management-plane/management-plane-workers-5886d6dcffxbl5c6-52vjt" namespace="cluster-api-management-plane" name="management-plane-workers-5886d6dcffxbl5c6-52vjt" reconcileID=3f451401-8a5e-4a64-93ae-6c82beb3adce

So the root cause was that capi-controller-manager was unable to communicate with the new management cluster. Once I resolved that, after a few minutes, the following showed up in the logs:

I0820 16:26:37.622182       1 machine_controller_noderef.go:95] "Infrastructure provider reporting spec.providerID, Kubernetes node is now available" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="cluster-api-management-plane/management-plane-cp-pn2f9" namespace="cluster-api-management-plane" name="management-plane-cp-pn2f9" reconcileID=2f545761-e67e-47d9-94ef-0ce07fdaa2f1 TalosControlPlane="cluster-api-management-plane/management-plane-cp" Cluster="cluster-api-management-plane/management-plane" MetalMachine="cluster-api-management-plane/management-plane-cp-rktxt" providerID="sidero://c41836fc-c063-498e-886d-9ab20e6ce1f2" node="mgmt-b"

Now all of my Machines have nodeRefs.

In my case, the issue was that I'm using the new KubePrism feature in Talos v1.5.0, so the server stanza in the generated kubeconfig for the new management cluster was https://localhost:7445. capi-controller-manager uses the generated kubeconfig to communicate with the cluster, and obviously this URL only works from within the cluster. I assume the Talos CAPI providers will need to be modified to support this scenario: currently, I believe there's only one setting for the cluster API endpoint, and when you want to use KubePrism you want to set that to localhost:7445, but that leads to the breakage I saw here.

Given when you filed this issue, it seems unlikely that my specific problem is the same as the one you're seeing, but my advice would be to check your capi-controller-manager logs for a similar error. If you see it, figure out why the bootstrap cluster might not be able to connect to the CAPI-generated workload cluster.
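
For example, something like this should surface the error quickly (assuming the default namespace and deployment name that clusterctl init uses):

kubectl -n capi-system logs deploy/capi-controller-manager | grep "failed to create cluster accessor"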

(By the way, since it took me a while to figure out what capi-controller-manager uses to connect to the workload cluster: it's a secret named <clustername>-kubeconfig in the same namespace as the workload cluster. I edited that secret by replacing the server property, and upon saving it, the controller picked up the change soon thereafter.)
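
If my understanding of the CAPI convention is right, the kubeconfig is stored under the value key of that secret, so something like this shows the server currently in use (names taken from my cluster; the data is base64-encoded, so re-encode after editing):

kubectl -n cluster-api-management-plane get secret management-plane-kubeconfig -o jsonpath='{.data.value}' | base64 -d | grep server
kubectl -n cluster-api-management-plane edit secret management-plane-kubeconfig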

@smira
Member

smira commented Aug 21, 2023

@dhess thanks for looking into it. I believe the original issue is unrelated, as it happened before Talos 1.5.0, but I will take a look into why the KubePrism endpoint got into the kubeconfig (this is not expected).

@smira smira self-assigned this Aug 21, 2023
@dhess

dhess commented Aug 21, 2023

@smira To be clear, I set CONTROL_PLANE_PORT=7445 and CONTROL_PLANE_ENDPOINT=localhost when I ran clusterctl generate cluster, so I was not surprised that these ended up in the kubeconfig. What needs clarification on my end is how to use the external IP & port for clusterctl generate cluster and still convince Talos to use the KubePrism endpoint once it's up: when I tried this, it appeared that the new cluster was using the external IP and port and not the KubePrism service for its own communications.

@smira
Member

smira commented Aug 21, 2023

@dhess I'm more confused then. When KubePrism is enabled, Talos uses that automatically for Kubernetes components running on the host network. You don't need to do anything specifically for it. Your external endpoint should still be "external".

@dhess

dhess commented Aug 21, 2023

@smira OK, so to be clear, when using clusterctl generate cluster I should specify the external endpoint details in the CONTROL_PLANE_* env vars, configure the kubePrism setting in each node's machine config, and then everything should work as expected?

@smira
Member

smira commented Aug 21, 2023

@dhess yes, KubePrism only needs to be enabled. If you're installing e.g. Cilium, you can point it at the KubePrism endpoint (and the same goes for any other host-network pods that need access to the Kubernetes API).
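
For reference, enabling it is just a small machine config fragment along these lines (Talos v1.5+), applied however you normally patch machine configs; CONTROL_PLANE_ENDPOINT stays pointed at the external endpoint:

machine:
  features:
    kubePrism:
      enabled: true
      port: 7445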

@dhess

dhess commented Aug 25, 2023

Here's another issue I encountered where NodeRefs were missing and the cluster couldn't be reconciled. If I specify a DNS name in CONTROL_PLANE_ENDPOINT, caps-controller-manager fails repeatedly when trying to set the provider ID on nodes:

2023-08-20T23:20:31Z	INFO	controllers.MetalMachine.machine=management-plane-workers-5886d6dcffxqq9dh-pkk8m.cluster=management-plane	Failed to set provider ID	{"metalmachine": "cluster-api-management-plane/management-plane-workers-m8lh4", "error": "failed to create cluster accessor: error creating dynamic rest mapper for remote cluster \"cluster-api-management-plane/management-plane\": Get \"https://mgmt.example.com:6443/api?timeout=10s\": dial tcp 10.0.8.34:6443: connect: connection refused", "errorVerbose": "Get \"https://mgmt.example.com:6443/api?timeout=10s\": dial tcp 10.0.8.34:6443: connect: connection refused\nerror creating dynamic rest mapper for remote cluster \"cluster-api-management-plane/management-plane\"\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).createClient\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.4.1/controllers/remote/cluster_cache_tracker.go:384\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).newClusterAccessor\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.4.1/controllers/remote/cluster_cache_tracker.go:254\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).getClusterAccessor\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.4.1/controllers/remote/cluster_cache_tracker.go:233\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).GetClient\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.4.1/controllers/remote/cluster_cache_tracker.go:151\ngithub.com/siderolabs/sidero/app/caps-controller-manager/controllers.(*MetalMachineReconciler).patchProviderID\n\t/src/app/caps-controller-manager/controllers/metalmachine_controller.go:395\ngithub.com/siderolabs/sidero/app/caps-controller-manager/controllers.(*MetalMachineReconciler).Reconcile\n\t/src/app/caps-controller-manager/controllers/metalmachine_controller.go:237\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235\nruntime.goexit\n\t/toolchain/go/src/runtime/asm_amd64.s:1598\nfailed to create cluster 
accessor\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).getClusterAccessor\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.4.1/controllers/remote/cluster_cache_tracker.go:235\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).GetClient\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.4.1/controllers/remote/cluster_cache_tracker.go:151\ngithub.com/siderolabs/sidero/app/caps-controller-manager/controllers.(*MetalMachineReconciler).patchProviderID\n\t/src/app/caps-controller-manager/controllers/metalmachine_controller.go:395\ngithub.com/siderolabs/sidero/app/caps-controller-manager/controllers.(*MetalMachineReconciler).Reconcile\n\t/src/app/caps-controller-manager/controllers/metalmachine_controller.go:237\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235\nruntime.goexit\n\t/toolchain/go/src/runtime/asm_amd64.s:1598"}

This is because cacppt-controller-manager can't get the cluster kubeconfig:

2023-08-25T10:46:32Z	INFO	controllers.TalosControlPlane	failed to get kubeconfig for the cluster	{"error": "failed to create cluster accessor: error creating client for remote cluster \"capi-cluster-management-plane/management-plane\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://mgmt.example.com:6443/api/v1?timeout=10s\": context deadline exceeded", "errorVerbose": "failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://mgmt.example.com:6443/api/v1?timeout=10s\": context deadline exceeded\nerror creating client for remote cluster \"capi-cluster-management-plane/management-plane\": error getting rest mapping\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).createClient\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:396\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).newClusterAccessor\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:299\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).getClusterAccessor\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:273\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).GetClient\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:180\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).updateStatus\n\t/src/controllers/taloscontrolplane_controller.go:562\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile.func1\n\t/src/controllers/taloscontrolplane_controller.go:155\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile\n\t/src/controllers/taloscontrolplane_controller.go:184\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/toolchain/go/src/runtime/asm_amd64.s:1598\nfailed to create cluster 
accessor\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).getClusterAccessor\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:275\nsigs.k8s.io/cluster-api/controllers/remote.(*ClusterCacheTracker).GetClient\n\t/.cache/mod/sigs.k8s.io/cluster-api@v1.5.0/controllers/remote/cluster_cache_tracker.go:180\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).updateStatus\n\t/src/controllers/taloscontrolplane_controller.go:562\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile.func1\n\t/src/controllers/taloscontrolplane_controller.go:155\ngithub.com/siderolabs/cluster-api-control-plane-provider-talos/controllers.(*TalosControlPlaneReconciler).Reconcile\n\t/src/controllers/taloscontrolplane_controller.go:184\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/.cache/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/toolchain/go/src/runtime/asm_amd64.s:1598"}

I added the DNS name (mgmt.example.com in the logs above, but a real, resolvable DNS name in my actual config) to .machine.certSANs via a config patch in the TalosControlPlane and TalosConfigTemplate configs for this cluster, but that didn't help. I'm fairly certain the config was correct because talosctl was able to connect to the cluster endpoint just fine using the DNS name. (In my experience, talosctl will complain if the DNS name isn't added to the .machine.certSANs config.) I tried both round-robin DNS pointing to all 3 control plane IPs, and a single static mapping to one of the control plane IPs, with the same result.
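
For concreteness, the patch I used was roughly along these lines, as one entry in the configPatches list of both the TalosControlPlane and the TalosConfigTemplate (exact op/path may differ depending on whether certSANs already exists in the generated config):

- op: add
  path: /machine/certSANs
  value:
    - mgmt.example.com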

I've tested this with multiple configs, and the only way I can get the Talos CAPI providers to properly resolve the cluster configuration is if I use an IP address in CONTROL_PLANE_ENDPOINT.
