disaster recovery guide for PVCs #6452
## Adopting into a new Kubernetes cluster with PVCs

It is possible to migrate/restore a Rook/Ceph cluster from an existing Kubernetes cluster to a new one without resorting to SSH access or Ceph tooling. This allows the migration to be done using standard Kubernetes resources only. This guide assumes you have a CephCluster that uses PVCs to persist mon and OSD data.

1. Stop rook in the cluster cluster by scaling the operator deployment `rook-ceph-operator` down to zero and deleting the other deployments: `rook-ceph-mgr-a`, `rook-ceph-mon-a`, ..., `rook-ceph-osd-0`, ...
Suggested change:
```diff
- 1. Stop rook in the cluster cluster by scaling the operator deployment `rook-ceph-operator` down to zero and deleting the other deployments: `rook-ceph-mgr-a`, `rook-ceph-mon-a`, ..., `rook-ceph-osd-0`, ...
+ 1. Stop rook in the old cluster by scaling the operator deployment `rook-ceph-operator` down to zero and deleting the other deployments: `rook-ceph-mgr-a`, `rook-ceph-mon-a`, ..., `rook-ceph-osd-0`, ...
```
I actually meant the new cluster. I restructured the description to make it clearer.
Commitlint check is failing; please add a proper commit title and message. See commit-structure.
1. Copy the endpoints configmap from the old cluster: `rook-ceph-mon-endpoints`
1. Scale the operator back up and wait until the reconciliation is over.
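The configmap copy above can be sketched with kubectl alone. This is only a sketch: the kubeconfig context names `old` and `new` are placeholders for whatever contexts point at your two clusters.

```shell
# Sketch only: "old" and "new" are assumed kubeconfig context names.
# Export the mon endpoints configmap from the old cluster and re-create it
# in the new one. Cluster-specific metadata (uid, resourceVersion,
# creationTimestamp) may need to be stripped before applying.
kubectl --context old -n rook-ceph get configmap rook-ceph-mon-endpoints -o yaml \
  | kubectl --context new -n rook-ceph apply -f -
```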
How about showing kubectl or other commands that will help users get this working? Right now the document is high level and may be ambiguous about how to back up and restore the resources.
I see your point that the exact commands will depend on the tools. This is an advanced scenario, so it seems acceptable to leave the details to the reader.
Actually, I did add several kubectl commands as examples.
Do the following in the new cluster:
1. Stop the rook operator by scaling the deployment `rook-ceph-operator` down to zero (`kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas 0`) and deleting the other deployments. An example command to do this is `kubectl -n rook-ceph delete deployment -l operator!=rook`
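The step above can be expanded into a runnable sequence. This is a sketch, not taken verbatim from the guide: the `app=rook-ceph-operator` label and the timeout are assumptions, and the `operator!=rook` selector should be verified against your install before deleting anything.

```shell
# Stop the operator so it does not recreate the deployments deleted below.
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas 0

# Wait for the operator pod to terminate (label name is an assumption).
kubectl -n rook-ceph wait pod -l app=rook-ceph-operator --for=delete --timeout=120s

# Delete every deployment the operator manages; it will recreate them once
# it is scaled back up. Selector taken from the guide's example.
kubectl -n rook-ceph delete deployment -l operator!=rook
```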
Why not delete the operator deployment as well? Scaling it down to zero isn't much different than deleting the deployment. Or else why not scale down all the deployments to 0 instead of deleting them?
User convenience. All the other deployments are created and reconciled by the operator, so once the operator is scaled back up they will be restored automatically. The operator itself is installed as part of Rook, and deleting it would require re-installing rook-ceph (Helm, Argo, ...).
Agreed, when installed with the helm chart you can't just remove the deployment and re-create it easily.
A couple more small suggestions...
The content looks good, thanks. Now the CI checks are just complaining about a couple housekeeping items:
The complexity of the restore process was one of the big things holding me back from fully committing to Rook + Ceph, so it's really great to see a much simpler way of handling disaster recovery! I am curious, however, how the hostPath data of Rook (the data at /var/lib/rook) should be handled with this process. I can see three paths (after scaling down the operator deployment in the new cluster):
Judging by the guide not calling any attention to this problem, I'm guessing the answer is 3 and it's not something which needs to be worried about. However, IMHO it is worth calling this out explicitly somewhere in the guide. I can see this being the first point people think of because of how careful one needs to be about /var/lib/rook in general when setting up/tearing down clusters. Thanks for the great work here! Looking forward to trying this process out and starting to use Rook/Ceph for real!
@LalitMaganti This simpler backup/restore guide is really only for clusters that are deployed on PVCs (for example with cluster-on-pvc.yaml). The dataDirHostPath does not need to be backed up in that case since it will only have crush dumps, logging, and other non-critical information. Would you mind opening a new issue to update the disaster recovery guide for the case where clusters are not on PVCs? The main difference will be to back up the mon directories in /var/lib/rook and the raw devices for OSDs.
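For the non-PVC case described above, a backup might look roughly like the following. This is only a sketch: the output paths and the device name are placeholders, and the mons/OSDs should be stopped first so the data is consistent.

```shell
# On each mon node: archive the mon data under dataDirHostPath
# (default /var/lib/rook). Output path is a placeholder.
tar -czf /tmp/rook-mon-backup.tar.gz -C /var/lib/rook .

# For each OSD: image the raw device. /dev/sdX is a placeholder for the
# device backing that OSD.
dd if=/dev/sdX of=/tmp/osd-sdX.img bs=4M status=progress
```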
Oh cool! For the record, I plan on using PVCs for storing all mon and OSD data, so it seems like the simple guide will work just fine for me as is. I always thought that even in the PVC case, important data was stored on the hostPath volume. It would be good if we could update the docs stating this fact, as I don't think it's clear from the current documentation.
I tried out creating a PVC cluster today and I noticed that two files get created on each node in the
Just to confirm again: these don't need to be backed up, right? If not, I'm curious as to which part of Rook (re)creates them. The operator? Thanks again for all the help and answering my questions!
@LalitMaganti I noticed that such pieces of data are stored in the secrets as sources, and the components (mon, osd, mgr) mount the secrets as files to access them. Seems like the operator takes care of creating these secrets (and configmaps) - maybe @travisn can add more info here. Most of them are not reproducible automatically and need to be backed up - keyrings, fsid, ceph user/pass, ... In my test I simply backed up all the rook secrets and restored them to the new cluster. I think I'll update the document for this.
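The backup/restore described above can be sketched in two commands. The context names `old` and `new` are assumptions for the two clusters' kubeconfig contexts:

```shell
# Export all secrets and configmaps from the old cluster's rook-ceph
# namespace, then re-create them in the new cluster. Cluster-specific
# metadata (uid, resourceVersion) may need to be stripped before applying.
kubectl --context old -n rook-ceph get secret,configmap -o yaml > rook-backup.yaml
kubectl --context new -n rook-ceph apply -f rook-backup.yaml
```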
Signed-off-by: Tareq Sharafy <tareq.sha@gmail.com>
@LalitMaganti Correct, Rook stores the important metadata for the cluster in a few configmaps and secrets. When the cluster info is needed in the pods, it is generated from those CMs and secrets as the source of truth.
disaster recovery guide for PVCs (bp #6452)
Description of your changes:
Adds guidelines for performing disaster recovery for rook-ceph in Kubernetes clusters with PVCs. The existing guide contains many unnecessary steps for this case - it can be handled using only regular Kubernetes resources.
Which issue is resolved by this Pull Request:
Resolves #6442
Checklist:
- `make codegen` has been run to update object specifications, if necessary.