
[BUG] Rancher can no longer provision harvester machines after restart #44912

Open

sarahhenkens opened this issue Mar 24, 2024 · 16 comments
Labels: kind/bug

@sarahhenkens

Rancher Server Setup

  • Rancher version: v2.8.0
  • Installation option (Docker install/Helm Chart): as a helm chart on a single-node k3s cluster
  • Proxy/Cert Details:

Information about the Cluster

  • Infrastructure Provider = Harvester

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • Admin

Describe the bug

After one of my Harvester nodes was unexpectedly rebooted, Rancher is no longer able to provision machines on the upstream Harvester HCI infrastructure.

Trying to scale up an existing managed RKE2 cluster from Rancher produces the following error:

 machine Downloading driver from https://192.168.20.10/assets/docker-machine-driver-harvester
 machine Doing /etc/rancher/ssl
 machine docker-machine-driver-harvester
 machine docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
 machine Trying to access option  which does not exist
 machine THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR
 machine Type assertion did not go smoothly to string for key
 machine Running pre-create checks...
 machine Error with pre-create check: "the server has asked for the client to provide credentials (get settings.harvesterhci.io server-version)"
 machine The default lines below are for a sh/bash shell, you can specify the shell you're using, with the --shell flag.

And creating a brand-new cluster fails with a different error:

 machine Downloading driver from https://192.168.20.10/assets/docker-machine-driver-harvester
 machine Doing /etc/rancher/ssl
 machine docker-machine-driver-harvester
 machine docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
 machine error loading host testing-pool1-31b05da3-dlchl: Docker machine "testing-pool1-31b05da3-dlchl" does not exist. Use "docker-machine ls" to list machines. Use "docker-machine create" to add a new one.

Looks like the connection between Rancher and Harvester is broken?

@sarahhenkens added the kind/bug label on Mar 24, 2024
@bpedersen2

Maybe related to #44929?

@bpedersen2

This seems to occur even after the fix for #44929, both when scaling and when creating a new cluster.

@bpedersen2

And I am on Rancher v2.8.2.

@bpedersen2

Looking at the created job (for a worker-node scale-up):

"args": [ 8 items
"--driver-download-url=https://<host>/assets/docker-machine-driver-harvester",
"--driver-hash=a9c2847eff3234df6262973cf611a91c3926f3e558118fcd3f4197172eda3434",
"--secret-namespace=fleet-default",
"--secret-name=staging-pool-worker-bbfc2798-d5jsj-machine-state",
"rm",
"-y",
"--update-config",
"staging-pool-worker-bbfc2798-d5jsj"

The first thing the driver tries is to delete the non-existing pod, and it fails... I would expect a create instead. I just don't know where this command is generated.
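For reference, here is a rough way to pull those args out of the generated job yourself. This is only a sketch based on this thread: the fleet-default namespace comes from the --secret-namespace arg above, and the exact job name may differ by Rancher version.

 # List the machine jobs Rancher generated in the cluster namespace
 kubectl -n fleet-default get jobs

 # Dump the args of one job to see whether it runs "rm" or "create"
 kubectl -n fleet-default get job <job-name> \
   -o jsonpath='{.spec.template.spec.containers[0].args}'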

@bpedersen2

I could manually fix it (see the kubectl sketch below):

  1. Go to the Harvester embedded Rancher and get the kubeconfig.
  2. Update the kubeconfig in the Harvester credential in the cattle-global-data namespace in the local cluster (running Rancher). It is probably named hv-cred.
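A minimal sketch of those two steps with kubectl, assuming the credential secret is named hv-cred and stores the kubeconfig under the harvestercredentialConfig-kubeconfigContent key (both the secret name and the key name are assumptions and can differ, so inspect the secret first):

 # Confirm which key of the credential secret holds the stale kubeconfig
 kubectl -n cattle-global-data get secret hv-cred -o jsonpath='{.data}'

 # Patch in the fresh kubeconfig downloaded from the Harvester UI,
 # base64-encoded as Secret data requires
 kubectl -n cattle-global-data patch secret hv-cred --type merge \
   -p "{\"data\":{\"harvestercredentialConfig-kubeconfigContent\":\"$(base64 -w0 < harvester.kubeconfig)\"}}"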

@sarahhenkens
Author

@bpedersen2 do you have Rancher running inside a nested VM or in the same Kubernetes cluster as Harvester itself?

@sarahhenkens
Author

sarahhenkens commented Mar 29, 2024

Following the manual fix steps by getting the kubeconfig and manually updating the secret in Rancher worked for me!

@bpedersen2

@bpedersen2 do you have Rancher running inside a nested VM or in the same Kubernetes cluster as Harvester itself?

No, it is running standalone.

@bpedersen2

What I observe is that the token in Harvester changes.

Rancher is configured to use OIDC, and in the Rancher logs I get:

Error refreshing token principals, skipping: oauth2: "invalid_grant" "Token is not active"
2024/04/02 11:43:26 [ERROR] [keycloak oidc] GetPrincipal: error creating new http client: oauth2: "invalid_grant" "Token is not active"
2024/04/02 11:43:26 [ERROR] error syncing 'user-XXX': handler mgmt-auth-userattributes-controller: oauth2: "invalid_grant" "Token is not active", requeuing

With a local user, it seems to work.
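A quick way to check whether your setup hits the same token errors is to grep the Rancher pod logs, assuming the standard Helm install in cattle-system with the app=rancher label:

 kubectl -n cattle-system logs -l app=rancher --tail=500 | grep invalid_grant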

@bpedersen2

I re-registered the Harvester cluster using a non-OIDC admin account, and now the connection seems to be stable again. It looks like a token-expiration problem to me.

@dawid10353

dawid10353 commented Apr 5, 2024

I have the same problem:

Failed creating server [fleet-default/rke2-rc-control-plane-2aae5bdf-2m48z] of kind (HarvesterMachine) for machine rke2-rc-control-plane-5b74797746x4dpcs-ncdxf in infrastructure provider: CreateError:

 Downloading driver from https://HOST/assets/docker-machine-driver-harvester
 Doing /etc/rancher/ssl
 docker-machine-driver-harvester
 docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
 Trying to access option which does not exist
 THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR
 Type assertion did not go smoothly to string for key
 Running pre-create checks...
 Error with pre-create check: "the server has asked for the client to provide credentials (get settings.harvesterhci.io server-version)"
 The default lines below are for a sh/bash shell, you can specify the shell you're using, with the --shell flag.

Rancher v2.8.2
Dashboard v2.8.0
Helm v2.16.8-rancher2
Machine v0.15.0-rancher106
Harvester: v1.2.1

@dawid10353

I have been stuck in a loop for many hours:
A new VM is created, it errors, the VM is deleted, then a new VM is created, errors again, and is deleted again...

@dawid10353

I could manually fix it:

  1. Go to the Harvester embedded Rancher and get the kubeconfig.
  2. Update the kubeconfig in the Harvester credential in the cattle-global-data namespace in the local cluster (running Rancher). It is probably named hv-cred.

OK, that worked for me. I have Rancher with users provided by Active Directory.

@dawid10353

Now I have this error:

Failed deleting server [fleet-default/rke2-rc-control-plane-3fba9236-dxptf] of kind (HarvesterMachine) for machine rke2-rc-control-plane-77f9455c9dx9xgsk-4kcwf in infrastructure provider: DeleteError:

 Downloading driver from https://HOST/assets/docker-machine-driver-harvester
 Doing /etc/rancher/ssl
 docker-machine-driver-harvester
 docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
 About to remove rke2-rc-control-plane-3fba9236-dxptf
 WARNING: This action will delete both local reference and remote instance.
 Error removing host "rke2-rc-control-plane-3fba9236-dxptf": the server has asked for the client to provide credentials (get virtualmachines.kubevirt.io rke2-rc-control-plane-3fba9236-dxptf)

@m-ildefons

Hi,
thanks for this bug report. May I ask which Harvester versions you were using, @bpedersen2 and @sarahhenkens, and when you last updated them?

@bpedersen2

I am on Harvester 1.2.1 and Rancher 2.8.3 (and waiting for 1.2.2 to be able to upgrade to 1.3.x eventually).
