Pending volumes, rook-operator is unable to find the config file #2274

Closed
kainlite opened this issue Nov 8, 2018 · 13 comments
Labels: ceph, operator

@kainlite

kainlite commented Nov 8, 2018

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
Created a PVC and it stays Pending forever.
Expected behavior:
Create a PVC and get the PV bound.

How to reproduce it (minimal and precise):
I do not know how to reproduce it; the cluster was working fine until today, when new volumes started staying Pending. The operator logs say it couldn't find the config file:

E1108 19:29:22.785432       7 goroutinemap.go:165] Operation for "provision-kube-system/test-pvc[36a8d25f-e38a-11e8-a835-f4034358a5bc]" failed. No retries permitted until 2018-11-08 19:31:24.785414964 +0000 UTC m=+2869.505673041 (durationBeforeRetry 2m2s). Error: Failed to create rook block image replicapool/pvc-36a8d25f-e38a-11e8-a835-f4034358a5bc: failed to create image pvc-36a8d25f-e38a-11e8-a835-f4034358a5bc in pool replicapool of size 5368709120: Failed to complete '': exit status 1. global_init: unable to open config file from search list /var/lib/rook/rook-ceph/rook-ceph.config

The workaround was to copy those files into the operator pod by hand; after that the volumes started working again. Ceph was reporting HEALTH_OK the whole time and the existing volumes kept working, but I could not get PVs from new PVCs. I deleted all pods from the operator and from the cluster (rook-ceph and rook-ceph-system) and nothing changed until I copied the config files into the operator container by hand.
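A rough sketch of what that manual copy might look like (not necessarily the exact commands used; the operator pod name is a placeholder, the destination paths are the ones from the error above, and the source files have to come from a pod or host that still has them):

# create the expected directory, then copy the config and keyring in (pod name is a placeholder)
kubectl -n rook-ceph-system exec rook-ceph-operator-xxxxxxxxxx-xxxxx -- mkdir -p /var/lib/rook/rook-ceph
kubectl -n rook-ceph-system cp ./rook-ceph.config rook-ceph-operator-xxxxxxxxxx-xxxxx:/var/lib/rook/rook-ceph/rook-ceph.config
kubectl -n rook-ceph-system cp ./client.admin.keyring rook-ceph-operator-xxxxxxxxxx-xxxxx:/var/lib/rook/rook-ceph/client.admin.keyring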

Environment:

  • OS (e.g. from /etc/os-release): CentOS Linux 7 (Core)
  • Kernel (e.g. uname -a): Linux mca 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: On-prem hardware, 4 physical servers with 1.5 TB SSD storage on LVM.
  • Rook version (use rook version inside of a Rook Pod): rook: v0.8.3
  • Kubernetes version (use kubectl version): v1.11.3
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): On-prem configured with kubespray
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK
@botzill

botzill commented Jan 24, 2019

Hi.

Having the same issue, any updates on this?

@travisn
Member

travisn commented Jan 24, 2019

@botzill

  • What version of Rook are you running?
  • Any other clues from your side on what might have changed to cause it so we can repro it?
  • Is your cluster still in this state or did you recover already? If it's still in this state, can you see what files are in the operator pod under /var/lib/rook (see the command sketch below)?
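For example, to list what is under /var/lib/rook in the operator pod (a sketch; the pod name is a placeholder and the namespace assumes the default rook-ceph-system deployment):

kubectl -n rook-ceph-system exec rook-ceph-operator-xxxxxxxxxx-xxxxx -- ls -laR /var/lib/rook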

@graham-web

I'm currently seeing this too. /var/lib/rook is an empty directory on the operator pod.

rook v0.9.2

We're mostly using the Ceph filesystem rather than persistent volumes, so I can't really pinpoint when or why this broke, I'm afraid.

@graham-web

graham-web commented Feb 19, 2019

The operator logs are kinda interesting during startup here, particularly this part:

unknown ceph major version. failed to complete version job. timed out waiting for the condition

Full output, before it starts failing to provision volumes:

"2019-02-13 09:35:30.792038 I | rookcmd: starting Rook v0.9.2 with arguments '/usr/local/bin/rook ceph operator'\n"
"2019-02-13 09:35:30.792162 I | rookcmd: flag values: --alsologtostderr=false, --help=false, --log-level=INFO, --log_backtrace_at=:0, --log_dir=, --logtostderr=true, --mon-healthcheck-interval=45s, --mon-out-timeout=5m0s, --stderrthreshold=2, --v=0, --vmodule=\n"
"2019-02-13 09:35:30.795501 I | cephcmd: starting operator\n"
"2019-02-13 09:35:30.887274 I | op-agent: getting flexvolume dir path from FLEXVOLUME_DIR_PATH env var\n"
"2019-02-13 09:35:30.887618 I | op-agent: flexvolume dir path env var FLEXVOLUME_DIR_PATH is not provided. Defaulting to: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/\n"
"2019-02-13 09:35:30.887698 I | op-agent: discovered flexvolume dir path from source default. value: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/\n"
"2019-02-13 09:35:30.887744 I | op-agent: no agent mount security mode given, defaulting to '%s' modeAny\n"
"2019-02-13 09:35:30.887779 W | op-agent: Invalid ROOK_ENABLE_FSGROUP value \"\". Defaulting to \"true\".\n"
"2019-02-13 09:35:30.960911 I | op-agent: rook-ceph-agent daemonset already exists, updating ...\n"
"2019-02-13 09:35:31.049641 I | op-discover: rook-discover daemonset already exists, updating ...\n"
"I0213 09:35:31.122595       5 controller.go:407] Starting provisioner controller b4615629-2f72-11e9-b0b0-ae610f41f11d!\n"
"2019-02-13 09:35:31.274960 I | operator: rook-provisioner ceph.rook.io/block started using ceph.rook.io flex vendor dir\n"
"2019-02-13 09:35:31.280410 I | operator: rook-provisioner rook.io/block started using rook.io flex vendor dir\n"
"2019-02-13 09:35:31.280523 I | op-cluster: start watching clusters in all namespaces\n"
"I0213 09:35:31.282313       5 controller.go:407] Starting provisioner controller b482b1cf-2f72-11e9-b0b0-ae610f41f11d!\n"
"2019-02-13 09:35:31.379151 I | op-cluster: skipping watching for legacy rook cluster events (legacy cluster CRD probably doesn't exist): the server could not find the requested resource (get clusters.ceph.rook.io)\n"
"2019-02-13 09:35:31.414356 I | op-cluster: starting cluster in namespace rook-ceph\n"
"2019-02-13 09:35:31.429050 I | op-k8sutil: verified the ownerref can be set on resources\n"
"2019-02-13 09:35:31.587563 I | op-k8sutil: waiting for job rook-ceph-detect-version to complete...\n"
"2019-02-13 09:50:31.595906 E | op-cluster: unknown ceph major version. failed to complete version job. timed out waiting for the condition\n"
"W0213 12:49:11.584444       5 reflector.go:341] github.com/rook/rook/pkg/operator/ceph/provisioner/controller/controller.go:411: watch of *v1.PersistentVolumeClaim ended with: too old resource version: 4755047 (4783217)\n"
"W0213 12:49:11.593658       5 reflector.go:341] github.com/rook/rook/pkg/operator/ceph/provisioner/controller/controller.go:411: watch of *v1.PersistentVolumeClaim ended with: too old resource version: 4755047 (4783217)\n"

@galexrt galexrt added the ceph and operator labels Mar 4, 2019
@lewismarshall

lewismarshall commented Mar 7, 2019

I've just seen the same problem after testing a complete kubernetes cluster restart. The rook-ceph cluster recovered but the operator came up with the same logs.

Environment:

  • Rook version (use rook version inside of a Rook Pod): rook: v0.9.3
  • Kubernetes version (use kubectl version): v1.12.3
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK

New PVCs were not created; to recover, I simply restarted the operator (see the sketch after the log below):

2019-03-07 09:11:21.984514 I | op-cluster: starting cluster in namespace rook-ceph
2019-03-07 09:11:22.447396 I | op-k8sutil: verified the ownerref can be set on resources
2019-03-07 09:11:22.543728 I | op-k8sutil: waiting for job rook-ceph-detect-version to complete...
2019-03-07 09:26:22.615975 E | op-cluster: unknown ceph major version. failed to complete version job. timed out waiting for the condition
I0307 10:56:25.194365       8 controller.go:1072] scheduleOperation[lock-provision-default/mysql-pv-claim[a6c09b7c-40c7-11e9-9ad0-0025b5111a0b]]
I0307 10:56:25.283862       8 controller.go:1072] scheduleOperation[lock-provision-default/mysql-pv-claim[a6c09b7c-40c7-11e9-9ad0-0025b5111a0b]]
I0307 10:56:25.286400       8 leaderelection.go:156] attempting to acquire leader lease...
I0307 10:56:25.300822       8 leaderelection.go:178] successfully acquired lease to provision for pvc default/mysql-pv-claim
I0307 10:56:25.300869       8 controller.go:1072] scheduleOperation[provision-default/mysql-pv-claim[a6c09b7c-40c7-11e9-9ad0-0025b5111a0b]]
2019-03-07 10:56:25.302474 I | op-provisioner: creating volume with configuration {blockPool:replicapool clusterNamespace:rook-ceph fstype: dataBlockPool:}
2019-03-07 10:56:25.302501 I | exec: Running command: rbd create replicapool/pvc-a6c09b7c-40c7-11e9-9ad0-0025b5111a0b --size 20480 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring
E0307 10:56:25.586471       8 controller.go:800] Failed to provision volume for claim "default/mysql-pv-claim" with StorageClass "rook-ceph-block": Failed to create rook block image replicapool/pvc-a6c09b7c-40c7-11e9-9ad0-0025b5111a0b: failed to create image pvc-a6c09b7c-40c7-11e9-9ad0-0025b5111a0b in pool replicapool of size 21474836480: Failed to complete '': exit status 1. global_init: unable to open config file from search list /var/lib/rook/rook-ceph/rook-ceph.config
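A restart of that kind can be done by deleting the operator pod and letting its Deployment recreate it (a sketch; the pod name and namespace are placeholders, and this may not be exactly how it was done here):

kubectl -n rook-ceph-system get pods | grep rook-ceph-operator
kubectl -n rook-ceph-system delete pod rook-ceph-operator-xxxxxxxxxx-xxxxx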

@dimm0
Contributor

dimm0 commented Mar 10, 2019

Seeing this after updating to 0.9.3:

I0310 07:09:11.711998       8 controller.go:1072] scheduleOperation[delete-pvc-eac8ed62-3f35-11e8-b93b-0cc47a6be994[efada617-3f35-11e8-b93b-0cc47a6be994]]
E0310 07:09:11.712002       8 controller.go:1079] Error scheduling operation "delete-pvc-eac8ed62-3f35-11e8-b93b-0cc47a6be994[efada617-3f35-11e8-b93b-0cc47a6be994]": Failed to create operation with name "delete-pvc-eac8ed62-3f35-11e8-b93b-0cc47a6be994[efada617-3f35-11e8-b93b-0cc47a6be994]". An operation with that name failed at 2019-03-10 07:08:11.929819005 +0000 UTC m=+510.509053035. No retries permitted until 2019-03-10 07:10:13.929819005 +0000 UTC m=+632.509053035 (2m2s). Last error: "Failed to delete rook block image rbd/pvc-eac8ed62-3f35-11e8-b93b-0cc47a6be994: failed to delete image pvc-eac8ed62-3f35-11e8-b93b-0cc47a6be994 in pool rbd: Failed to complete '': exit status 1. global_init: unable to open config file from search list /var/lib/rook/.config\n. output: ".
[root@rook-ceph-operator-578b478587-lm6cr /]# cd /var/lib/rook
[root@rook-ceph-operator-578b478587-lm6cr rook]# ls
rook
[root@rook-ceph-operator-578b478587-lm6cr rook]# ls^C
[root@rook-ceph-operator-578b478587-lm6cr rook]# cd rook
[root@rook-ceph-operator-578b478587-lm6cr rook]# ls
client.admin.keyring  rook.config
[root@rook-ceph-operator-578b478587-lm6cr rook]# cd ..
[root@rook-ceph-operator-578b478587-lm6cr rook]# ls -lah
total 4.0K
drwxrwxrwx.  3 root root  3 Mar 10 06:59 .
drwxr-xr-x. 20 root root 20 Mar 10 06:59 ..
drwxr--r--.  2 root root  4 Mar 10 06:59 rook

@yanivroz

I'm also having the same problem:

rook-ceph-operator logs:

2019-04-28 14:31:07.890947 I | op-provisioner: creating volume with configuration {blockPool:replicapool clusterNamespace:rook-ceph fstype:ext4 dataBlockPool:}
2019-04-28 14:31:07.890985 I | exec: Running command: rbd create replicapool/pvc-419986f5-69c2-11e9-8b2d-44a8424ac2cf --size 20480 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring
E0428 14:31:08.946945       8 controller.go:800] Failed to provision volume for claim "tutorial/mysql-pv-claim" with StorageClass "rook-ceph-block": Failed to create rook block image replicapool/pvc-419986f5-69c2-11e9-8b2d-44a8424ac2cf: failed to create image pvc-419986f5-69c2-11e9-8b2d-44a8424ac2cf in pool replicapool of size 21474836480: Failed to complete '': exit status 2. rbd: error opening pool 'replicapool': (2) No such file or directory
. output:
E0428 14:31:08.947071       8 goroutinemap.go:150] Operation for "provision-tutorial/mysql-pv-claim[419986f5-69c2-11e9-8b2d-44a8424ac2cf]" failed. No retries permitted until 2019-04-28 14:31:09.947017874 +0000 UTC m=+4929.071069813 (durationBeforeRetry 1s). Error: "Failed to create rook block image replicapool/pvc-419986f5-69c2-11e9-8b2d-44a8424ac2cf: failed to create image pvc-419986f5-69c2-11e9-8b2d-44a8424ac2cf in pool replicapool of size 21474836480: Failed to complete '': exit status 2. rbd: error opening pool 'replicapool': (2) No such file or directory\n. output: "

PVC:

Name:          mysql-pv-claim
Namespace:     tutorial
StorageClass:  rook-ceph-block
Status:        Pending
Volume:
Labels:        app=wordpress
Annotations:   control-plane.alpha.kubernetes.io/leader:
                 {"holderIdentity":"cb6e03a8-69b6-11e9-baf1-7e8a2d614181","leaseDurationSeconds":15,"acquireTime":"2019-04-28T14:31:05Z","renewTime":"2019-...
               kubectl.kubernetes.io/last-applied-configuration:
                 {"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{},"labels":{"app":"wordpress"},"name":"mysql-pv-claim","names...
               volume.beta.kubernetes.io/storage-provisioner: ceph.rook.io/block
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Events:
  Type     Reason              Age                  From                                                                                        Message
  ----     ------              ----                 ----                                                                                        -------
  Normal   Provisioning        12s (x8 over 2m26s)  ceph.rook.io/block rook-ceph-operator-cdc686667-cps7h cb6e03a8-69b6-11e9-baf1-7e8a2d614181  External provisioner is provisioning volume for claim "tutorial/mysql-pv-claim"
  Warning  ProvisioningFailed  11s (x8 over 2m25s)  ceph.rook.io/block rook-ceph-operator-cdc686667-cps7h cb6e03a8-69b6-11e9-baf1-7e8a2d614181  Failed to provision volume with StorageClass "rook-ceph-block": Failed to create rook block image replicapool/pvc-419986f5-69c2-11e9-8b2d-44a8424ac2cf: failed to create image pvc-419986f5-69c2-11e9-8b2d-44a8424ac2cf in pool replicapool of size 21474836480: Failed to complete '': exit status 2. rbd: error opening pool 'replicapool': (2) No such file or directory
. output:
  Normal     ExternalProvisioning  11s (x25 over 2m24s)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "ceph.rook.io/block" or manually created by system administrator
Mounted By:  wordpress-mysql-6887bf844f-lcrsf

rook: v0.9.3
kubernetes: v1.13.5

Help anyone?
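For the "error opening pool 'replicapool'" variant above, one thing worth checking from the Rook toolbox is whether the pool actually exists (a sketch; the toolbox pod name is a placeholder and assumes the standard toolbox deployment):

kubectl -n rook-ceph exec -it rook-ceph-tools-xxxxxxxxxx-xxxxx -- ceph osd pool ls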

@userguy

userguy commented May 28, 2019

Having the same issue.

@kariae

kariae commented May 30, 2019

same issue with v1.0.0

@hosseinsalahi

same issue with v1.0.1

@travisn travisn added this to the 1.0 milestone May 31, 2019
@travisn travisn added this to To do: v1.0.x patch release in v1.0 May 31, 2019
@davidkarlsen
Contributor

davidkarlsen commented Jun 17, 2019

And same with v1.0.2
edit: actually ok, #1228 is my problem

@travisn
Member

travisn commented Jul 2, 2019

Assuming this is caused by failing to detect the ceph version, this would be fixed by #3257, which was included in the v1.0.2 release. If someone can repro this on v1.0.2, let's see if we can get more info on the repro; otherwise we can close it.

@travisn
Member

travisn commented Jul 5, 2019

Ok, let's reopen if we can repro on v1.0.2 or newer!

@travisn travisn closed this as completed Jul 5, 2019
v1.0 automation moved this from To do to Done Jul 5, 2019