
[BUG] Rancher can no longer provision harvester machines after restart #44912

Open

sarahhenkens opened this issue Mar 24, 2024 · 16 comments
Labels: kind/bug

@sarahhenkens

Rancher Server Setup

  • Rancher version: v2.8.0
  • Installation option (Docker install/Helm Chart): as a helm chart on a single-node k3s cluster
  • Proxy/Cert Details:

Information about the Cluster

  • Infrastructure Provider = Harvester

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • Admin

Describe the bug

After one of my Harvester nodes was unexpectedly rebooted, Rancher is no longer able to provision machines on the upstream Harvester HCI infrastructure.

Trying to scale up an existing managed RKE2 cluster from Rancher produces the following error:

 machine Downloading driver from https://192.168.20.10/assets/docker-machine-driver-harvester
 machine Doing /etc/rancher/ssl
 machine docker-machine-driver-harvester
 machine docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
 machine Trying to access option  which does not exist
 machine THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR
 machine Type assertion did not go smoothly to string for key
 machine Running pre-create checks...
 machine Error with pre-create check: "the server has asked for the client to provide credentials (get settings.harvesterhci.io server-version)"
 machine The default lines below are for a sh/bash shell, you can specify the shell you're using, with the --shell flag.

And creating a brand-new cluster fails with a different error:

 machine Downloading driver from https://192.168.20.10/assets/docker-machine-driver-harvester
 machine Doing /etc/rancher/ssl
 machine docker-machine-driver-harvester
 machine docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
 machine error loading host testing-pool1-31b05da3-dlchl: Docker machine "testing-pool1-31b05da3-dlchl" does not exist. Use "docker-machine ls" to list machines. Use "docker-machine create" to add a new one.

Looks like the connection between Rancher and Harvester is broken?

@sarahhenkens added the kind/bug label on Mar 24, 2024
@bpedersen2

Maybe related to #44929?

@bpedersen2

This seems to occur even after the fix for #44929, both when scaling and when creating a new cluster.

@bpedersen2

And I am on Rancher v2.8.2.

@bpedersen2

Looking at the created job (for a worker-node scale-up):

"args": [ 8 items
"--driver-download-url=https://<host>/assets/docker-machine-driver-harvester",
"--driver-hash=a9c2847eff3234df6262973cf611a91c3926f3e558118fcd3f4197172eda3434",
"--secret-namespace=fleet-default",
"--secret-name=staging-pool-worker-bbfc2798-d5jsj-machine-state",
"rm",
"-y",
"--update-config",
"staging-pool-worker-bbfc2798-d5jsj"

The first thing the driver tries is to delete the non-existing pod, and it fails... I would expect a create instead. I just don't know where this command is generated.
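For reference, here is a rough way to pull those args out of the generated job yourself. This is only a sketch based on this thread: the fleet-default namespace comes from the --secret-namespace arg above, and the exact job name may differ by Rancher version.

 # List the machine jobs Rancher generated in the cluster namespace
 kubectl -n fleet-default get jobs

 # Dump the args of one job to see whether it runs "rm" or "create"
 kubectl -n fleet-default get job <job-name> \
   -o jsonpath='{.spec.template.spec.containers[0].args}'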

@bpedersen2

I could manually fix it (see the kubectl sketch below):

  1. Go to the Harvester embedded Rancher and get the kubeconfig.
  2. Update the kubeconfig in the Harvester credential in the cattle-global-data namespace in the local cluster (running Rancher). It is probably named hv-cred.
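A minimal sketch of those two steps with kubectl, assuming the credential secret is named hv-cred and stores the kubeconfig under the harvestercredentialConfig-kubeconfigContent key (both the secret name and the key name are assumptions and can differ, so inspect the secret first):

 # Confirm which key of the credential secret holds the stale kubeconfig
 kubectl -n cattle-global-data get secret hv-cred -o jsonpath='{.data}'

 # Patch in the fresh kubeconfig downloaded from the Harvester UI,
 # base64-encoded as Secret data requires
 kubectl -n cattle-global-data patch secret hv-cred --type merge \
   -p "{\"data\":{\"harvestercredentialConfig-kubeconfigContent\":\"$(base64 -w0 < harvester.kubeconfig)\"}}"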

@sarahhenkens
Author

@bpedersen2 do you have Rancher running inside a nested VM or in the same Kubernetes cluster as Harvester itself?

@sarahhenkens
Author

sarahhenkens commented Mar 29, 2024

Following the manual fix steps by getting the kubeconfig and manually updating the secret in Rancher worked for me!

@bpedersen2

@bpedersen2 do you have Rancher running inside a nested VM or in the same Kubernetes cluster as Harvester itself?

No, it is running standalone.

@bpedersen2

What I observe is that the token in Harvester changes.

Rancher is configured to use OIDC, and in the Rancher logs I get:

Error refreshing token principals, skipping: oauth2: "invalid_grant" "Token is not active"
2024/04/02 11:43:26 [ERROR] [keycloak oidc] GetPrincipal: error creating new http client: oauth2: "invalid_grant" "Token is not active"
2024/04/02 11:43:26 [ERROR] error syncing 'user-XXX': handler mgmt-auth-userattributes-controller: oauth2: "invalid_grant" "Token is not active", requeuing

With a local user, it seems to work.
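A quick way to check whether your setup hits the same token errors is to grep the Rancher pod logs, assuming the standard Helm install in cattle-system with the app=rancher label:

 kubectl -n cattle-system logs -l app=rancher --tail=500 | grep invalid_grant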

@bpedersen2

I re-registered the Harvester cluster using a non-OIDC admin account, and now the connection seems to be stable again. It looks like a token-expiration problem to me.

@dawid10353

dawid10353 commented Apr 5, 2024

I have the same problem:

Failed creating server [fleet-default/rke2-rc-control-plane-2aae5bdf-2m48z] of kind (HarvesterMachine) for machine rke2-rc-control-plane-5b74797746x4dpcs-ncdxf in infrastructure provider: CreateError:

 Downloading driver from https://HOST/assets/docker-machine-driver-harvester
 Doing /etc/rancher/ssl
 docker-machine-driver-harvester
 docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
 Trying to access option which does not exist
 THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR
 Type assertion did not go smoothly to string for key
 Running pre-create checks...
 Error with pre-create check: "the server has asked for the client to provide credentials (get settings.harvesterhci.io server-version)"
 The default lines below are for a sh/bash shell, you can specify the shell you're using, with the --shell flag.

Rancher v2.8.2
Dashboard v2.8.0
Helm v2.16.8-rancher2
Machine v0.15.0-rancher106
Harvester: v1.2.1

@dawid10353

I have been stuck in a loop for many hours:
A new VM is created, it errors, the VM is deleted, then a new VM is created, errors again, and is deleted again...

@dawid10353

I could manually fix it:

  1. Go to the Harvester embedded Rancher and get the kubeconfig.
  2. Update the kubeconfig in the Harvester credential in the cattle-global-data namespace in the local cluster (running Rancher). It is probably named hv-cred.

OK, that worked for me. I have Rancher with users provided by Active Directory.

@dawid10353

Now I have this error:

Failed deleting server [fleet-default/rke2-rc-control-plane-3fba9236-dxptf] of kind (HarvesterMachine) for machine rke2-rc-control-plane-77f9455c9dx9xgsk-4kcwf in infrastructure provider: DeleteError:

 Downloading driver from https://HOST/assets/docker-machine-driver-harvester
 Doing /etc/rancher/ssl
 docker-machine-driver-harvester
 docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
 About to remove rke2-rc-control-plane-3fba9236-dxptf
 WARNING: This action will delete both local reference and remote instance.
 Error removing host "rke2-rc-control-plane-3fba9236-dxptf": the server has asked for the client to provide credentials (get virtualmachines.kubevirt.io rke2-rc-control-plane-3fba9236-dxptf)

@m-ildefons

Hi,
thanks for this bug report. May I ask which Harvester versions you were using, @bpedersen2 and @sarahhenkens, and when you last updated them?

@bpedersen2

I am on Harvester 1.2.1 and Rancher 2.8.3 (and waiting for 1.2.2 to be able to upgrade to 1.3.x eventually).
