When OSD pods are removed from hosts they cannot be added back to the Ceph cluster #4238
Comments
@leseb If the osd keyrings are removed, how can they be re-created to allow the OSDs to start again? This is another instance of why osds should never automatically be removed... Unintentional changes to desired state in the CR with destructive side effects should be avoided. |
To generate a new key you can run: |
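A minimal sketch of that, run from the rook-ceph toolbox and assuming osd.0 with the standard OSD capability profile:

```sh
# re-create the auth entry for the OSD; substitute the real OSD id
ceph auth get-or-create osd.0 \
  mon 'allow profile osd' mgr 'allow profile osd' osd 'allow *'
```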
Thanks @leseb for your reply. How long will rook wait before marking OSDs out of the cluster? Yesterday, when a node was down for some period of time, rook marked its osds out of the cluster and marked them for destroy. Is this controlled by disruptionManagement:? |
@udayjalagam This can only be controlled with the removeOSDsIfOutAndSafeToRemove setting, either enabling or disabling it. We just recently disabled this functionality by default. |
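For reference, that setting sits at the top level of the CephCluster spec; a minimal fragment, assuming the field name from the Rook v1.1-era CRD:

```yaml
spec:
  removeOSDsIfOutAndSafeToRemove: false   # keep OSD deployments even after Ceph marks them out
```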
Thanks @travisn, so if we set removeOSDsIfOutAndSafeToRemove: false then it will also not mark the OSD out, right? Thanks, |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation. |
I think I ran into something similar to this. Specifically, a master was removed and the osd purged from ceph, then re-added to my Kubernetes cluster, so the rook-ceph operator tried to recreate it (and re-use existing data). That led to the recreated osd pod failing with "failed to fetch mon config (--no-mon-config to skip)".
https://access.redhat.com/solutions/3524771 ended up being useful. I fixed it by re-adding the keyring to the operator config. Seems like something that might be rolled into common issues docs, as it seems like it's come up a few times (https://github.com/rook/rook/issues?q=is%3Aissue+is%3Aclosed+%22failed+to+fetch+mon+config+%28--no-mon-config+to+skip%29%22). |
Reopening since this was hit again in the wild... Here are the steps that helped recover an OSD, following the RH solution previously linked:
Once done with the OSDs, restarting the operator will cause the pod specs to reset to the expected script and the key will no longer be written to the log. @leseb What if the init container were to always ensure the osd auth is correct when it starts? Any reason we shouldn't do that? |
@travisn Thanks for referring to it. But I'm at a step where the auth has been disabled, so there's something else missing here. Nevertheless, I did recover the auth for one OSD (osd.6) and it still wasn't recognized by the cephcluster. In my case, all OSD pods are up and running so the auth key is easy to retrieve. Based on the recovery procedure, once the cluster "fsid" was updated, the OSD should join the cluster. But that doesn't seem to be the case in my testing. |
#5914 sounds very similar to your issue as well |
Just a quick query on this as this is a "me too" moment... Where inside of step 2 am I adding this? I can see the initContainers definition:
A quick pointer of how it should look would be useful here! |
@CJCShadowsan In the "activate" init container, there is a script that initializes the OSD. At the end of it you could append the cat command. For example:
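A sketch of that, assuming OSD 0 and the default OSD data path inside the container:

```sh
# appended as the last line of the "activate" init container's script; it prints
# the OSD keyring to the container log so the key can be re-imported into Ceph
cat /var/lib/ceph/osd/ceph-0/keyring
```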
|
I successfully followed @travisn's workaround above after removing an OSD and then trying to re-add it. A few points that confused me in the instructions above: In step 2, I added |
@TomFletcher0 Thanks for the feedback. |
Are any updates/fixes planned for this? |
@leseb Any reason not to run this in an init container for the osd daemon? |
Not that I can think of. I can look into this. |
I can repeat this every time using minikube.
Minikube
Install minikube:
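For example, on Linux x86-64 (the download URL is the upstream minikube release, not anything specific to this issue):

```sh
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube
```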
Clean up any old install:
Start a new minikube:
Check that it is working:
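Roughly, with the kvm2 driver and the resource sizes as assumptions:

```sh
# wipe any previous minikube state
minikube delete

# start a fresh VM; driver, CPU and memory values are just examples
minikube start --driver=kvm2 --cpus=4 --memory=8g

# verify the node comes up Ready
kubectl get nodes
```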
Rook Ceph
Add an extra disk for rook-ceph
Stop and start the VM so that it sees the new disk
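One way to do this with the kvm2 driver (an assumption; recent minikube releases can also create the disk up front with --extra-disks):

```sh
# create a raw disk image and attach it to the minikube VM as a second device
sudo qemu-img create -f raw /var/lib/libvirt/images/minikube-extra.img 20G
sudo virsh attach-disk minikube /var/lib/libvirt/images/minikube-extra.img vdb --persistent

# restart the VM so the guest sees the new disk
minikube stop
minikube start
```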
Clone the rook repo and checkout the latest release
Apply the manifests
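A sketch of those two steps, assuming a release-1.5 era checkout where the example manifests live under cluster/examples/kubernetes/ceph:

```sh
git clone https://github.com/rook/rook.git
cd rook
git checkout release-1.5             # assumption: pick whatever the latest release branch is

cd cluster/examples/kubernetes/ceph
kubectl apply -f crds.yaml -f common.yaml -f operator.yaml
kubectl apply -f cluster-test.yaml   # single-node test cluster, suits minikube
```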
Watch the logs
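For example, to follow the operator while it prepares the OSDs:

```sh
kubectl -n rook-ceph logs -f deploy/rook-ceph-operator
```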
Break it
Restarting minikube breaks the osd auth.
Fixing
Work out which OSD it is; in this example it is OSD 0, and make sure you change all the following commands to match your OSD number. Append the cat of the OSD keyring to the end of the "activate" init container script in the OSD deployment (see the sketch below).
It should be similar to this:
When you save the above change it should restart the osd pod, but if not, restart the pod for the osd manually.
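Putting the step together, assuming OSD 0 and Rook's usual pod labels:

```sh
# open the OSD deployment and append `cat /var/lib/ceph/osd/ceph-0/keyring`
# as the last line of the "activate" init container's script
kubectl -n rook-ceph edit deployment rook-ceph-osd-0

# if the pod is not recreated automatically, delete it so the deployment replaces it
kubectl -n rook-ceph delete pod -l app=rook-ceph-osd,ceph-osd-id=0
```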
Get the key from the logs after the activate container exits
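For example, assuming OSD 0:

```sh
# the appended cat prints the keyring at the end of the activate container's log
kubectl -n rook-ceph logs deploy/rook-ceph-osd-0 -c activate | grep 'key ='
```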
Create an instance of rook-ceph-toolbox and get a bash shell in it
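For example, using the toolbox manifest from the same examples directory:

```sh
kubectl apply -f toolbox.yaml
kubectl -n rook-ceph exec -it "$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name)" -- bash
```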
Export the current auth config for the osd, change the key to the one printed by the activate container, and import it back.
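Inside the toolbox, assuming OSD 0:

```sh
# dump the OSD's current auth entry to a file
ceph auth export osd.0 -o osd.0.export

# edit osd.0.export and replace the `key = ...` value with the key printed by
# the activate container, then load the edited entry back into the cluster
ceph auth import -i osd.0.export
```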
Exit the toolbox and restart the pod for the osd.
The OSD should now start correctly.
Break again
Just repeat the restart of minikube and run through the process again. |
I also get a similar issue with an rgw for an object store. Not sure if this is the same issue or needs a new issue raised. Do all the same as above, but at the end of setting everything up, create an objectstore, storageclass and claim
Restart minikube The
and the
Update: Fix
Find the secret name for the rgw
Decode the secret
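Roughly like this (the secret name and data key are assumptions based on an object store named my-store; check with -o yaml first):

```sh
# find the rgw keyring secret for the object store
kubectl -n rook-ceph get secrets | grep rgw

# decode it
kubectl -n rook-ceph get secret rook-ceph-rgw-my-store-a-keyring \
  -o jsonpath='{.data.keyring}' | base64 -d
```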
Enter the toolbox container, create a file with the contents of the keyring and import it
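For example:

```sh
kubectl -n rook-ceph exec -it "$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name)" -- bash

# inside the toolbox: paste the decoded keyring into a file and import it
vi /tmp/rgw.keyring
ceph auth import -i /tmp/rgw.keyring
```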
Restart the rgw pod. |
I thought the code that creates/updates OSD deployments also created the keyring secret for the OSD after doing |
Adding in to report that in a similar minikube setup, on restart found the issue with rbd-mirror pods as well. This was with rook 1.5. Followed the same steps as rgw case and imported the auth for |
@rohan47 please investigate :) |
@timhughes @ShyamsundarR for the Minikube case I believe this is expected. My Minikube mounts with:
Because of |
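A quick way to confirm which paths are backed by tmpfs (and therefore lost on restart), assuming the default minikube profile:

```sh
# anything reported on tmpfs does not survive a minikube restart; /data does
minikube ssh "df -h /var/lib/rook /data"
```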
@leseb I do preserve
and still see the problem. Also, I would expect the OSD init container scripts to copy the new credentials over to the OSD keyring, but I see that it copies the older content over always (in the minikube case). Getting some more details on this in a short bit. |
I used to do that too but I just realized that on reboot the symlink goes away. Can you verify? |
Duh! 🤦🏾 yes the symlink is on Did you find a better way to preserve |
So far no, any changes in |
My current workaround to preserve
|
I still get this issue and haven't been able to get your workarounds working. |
Another (simpler?) workaround is to just set the dataDirHostPath in the ceph cluster to a path under /data (which is persistent between runs by minikube). E.g.:
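A minimal fragment of the CephCluster CR, set when the cluster is first created (dataDirHostPath is not meant to be changed on an existing cluster):

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dataDirHostPath: /data/rook   # /data is persisted across minikube restarts
```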
(see https://rook.io/docs/rook/v1.12/CRDs/Cluster/host-cluster/) |
Is this a bug report or feature request?
Deviation from expected behavior:
When I changed the placement in the cluster CRD by mistake, which removed all OSDs from the master nodes, and then added the CRD back, it tried to schedule the OSD pods back on those nodes but was not able to, since the auth keys had been deleted.
Expected behavior:
I would expect the operator or the prepare pod to re-create those keys and add the OSDs back to the cluster.
How to reproduce it (minimal and precise):
2.1) Wait long enough for the operator to clean up the deployments and other configuration.
Note: I have host networking enabled.
You will see the pods crashing.
File(s) to submit
current Cluster CRD
cluster.txt
Crashing OSD pod(s) logs
osd-prepare-pod.txt
Environment:
OS (e.g. from /etc/os-release): Ubuntu 16.04.6 LTS
Kernel (e.g. `uname -a`): 4.15.0-64-generic
Rook version (use `rook version` inside of a Rook Pod): v1.1.2
Storage backend version (e.g. for ceph do `ceph -v`): v14.2.3
Kubernetes version (use `kubectl version`): v1.15.3
Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): GKE
Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox):
[root@knode5 /]# ceph -s
cluster:
id: 109f1936-2aa7-4b80-8705-99e1fdf4e089
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 2d)
mgr: a(active, since 46h)
osd: 25 osds: 16 up (since 46h), 16 in (since 2d)
data:
pools: 0 pools, 0 pgs
objects: 0 objects, 0 B
usage: 16 GiB used, 28 TiB / 28 TiB avail
pgs:
[root@knode5 /]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 27.93750 root default
-13 0 host kmaster1-c1-am2-nskope-net
-7 0 host kmaster2-c1-am2-nskope-net
-9 0 host kmaster3-c1-am2-nskope-net
-3 6.98438 host knode1-c1-am2-nskope-net
1 ssd 1.74609 osd.1 up 1.00000 1.00000
7 ssd 1.74609 osd.7 up 1.00000 1.00000
14 ssd 1.74609 osd.14 up 1.00000 1.00000
21 ssd 1.74609 osd.21 up 1.00000 1.00000
-5 6.98438 host knode2-c1-am2-nskope-net
2 ssd 1.74609 osd.2 up 1.00000 1.00000
8 ssd 1.74609 osd.8 up 1.00000 1.00000
15 ssd 1.74609 osd.15 up 1.00000 1.00000
22 ssd 1.74609 osd.22 up 1.00000 1.00000
-15 6.98438 host knode3-c1-am2-nskope-net
6 ssd 1.74609 osd.6 up 1.00000 1.00000
13 ssd 1.74609 osd.13 up 1.00000 1.00000
19 ssd 1.74609 osd.19 up 1.00000 1.00000
25 ssd 1.74609 osd.25 up 1.00000 1.00000
-11 6.98438 host knode5-c1-am2-nskope-net
5 ssd 1.74609 osd.5 up 1.00000 1.00000
12 ssd 1.74609 osd.12 up 1.00000 1.00000
20 ssd 1.74609 osd.20 up 1.00000 1.00000
26 ssd 1.74609 osd.26 up 1.00000 1.00000
0 0 osd.0 down 0 1.00000
4 0 osd.4 down 0 1.00000
9 0 osd.9 down 0 1.00000
10 0 osd.10 down 0 1.00000
11 0 osd.11 down 0 1.00000
16 0 osd.16 down 0 1.00000
17 0 osd.17 down 0 1.00000
23 0 osd.23 down 0 1.00000
24 0 osd.24 down 0 1.00000