
disaster recovery guide for PVCs #6452

Merged: 1 commit merged on Oct 27, 2020
18 changes: 18 additions & 0 deletions Documentation/ceph-disaster-recovery.md
@@ -336,3 +336,21 @@ Assuming `dataHostPathData` is `/var/lib/rook`, and the `CephCluster` trying to
1. Bring the Rook Ceph operator back online by running `kubectl -n rook-ceph edit deploy/rook-ceph-operator` and set `replicas` to `1`.
1. Watch the operator logs with `kubectl -n rook-ceph logs -f rook-ceph-operator-xxxxxxx`, and wait until the orchestration has settled.
1. **STATE**: Now the new cluster should be up and running with authentication enabled. `ceph -s` output should not change much comparing to previous steps.

tareksha marked this conversation as resolved.
## Backing up and restoring a cluster based on PVCs into a new Kubernetes cluster

Member: How about showing kubectl or other commands that will help users get this working? Right now the document is high level and may be ambiguous for how to back up and restore the resources.

Member: I see your point that the exact commands will depend on the tools. This is an advanced scenario, so it seems acceptable to leave the details to the reader.

Contributor Author: Actually I did add several kubectl commands as examples.

It is possible to migrate/restore a Rook/Ceph cluster from an existing Kubernetes cluster to a new one without resorting to SSH access or Ceph tooling. This allows the migration to be done using standard Kubernetes resources only. This guide assumes the following:
1. You have a CephCluster that uses PVCs to persist mon and osd data. Let's call it the "old cluster".
1. You can restore the PVCs as-is in the new cluster. Usually this is done by taking regular snapshots of the PVC volumes and using a tool that can re-create PVCs from these snapshots in the underlying cloud provider. [Velero](https://github.com/vmware-tanzu/velero) is one such tool.
1. You have regular backups of the secrets and configmaps in the rook-ceph namespace. Velero provides this functionality too, as sketched below.
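
One possible way to take and restore such backups with Velero is sketched below. This is only an illustration, assuming Velero is installed in both clusters with a snapshot-capable volume provider and that both point at the same backup storage location; the backup name `rook-ceph-backup` is a placeholder.

```console
# In the old cluster: back up the rook-ceph namespace, including PVC volume snapshots
velero backup create rook-ceph-backup --include-namespaces rook-ceph

# In the new cluster: restore the namespace resources and PVCs from that backup
velero restore create --from-backup rook-ceph-backup
```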

Do the following in the new cluster (a consolidated command sketch follows this list):
1. Stop the Rook operator by scaling the `rook-ceph-operator` deployment down to zero: `kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas 0`,
and delete the other deployments. An example command to do this is `kubectl -n rook-ceph delete deployment -l operator!=rook`.
Member: Why not delete the operator deployment as well? Scaling it down to zero isn't much different from deleting the deployment. Or else, why not scale down all the deployments to 0 instead of deleting them?

Contributor Author: User convenience. All the other deployments are created and reconciled by the operator, so once the operator is scaled up they will be restored automatically. The operator itself is installed as part of Rook, and deleting it would require re-installing rook-ceph (Helm, Argo, ...).

Member: Agreed, when installed with the Helm chart you can't just remove the deployment and re-create it easily.

1. Restore the Rook PVCs to the new cluster.
1. Copy the keyring and fsid secrets from the old cluster: `rook-ceph-mgr-a-keyring`, `rook-ceph-mon`, `rook-ceph-mons-keyring`, `rook-ceph-osd-0-keyring`, ...
1. Delete the mon services and copy them from the old cluster: `rook-ceph-mon-a`, `rook-ceph-mon-b`, ... Note that simply re-applying won't work, because the goal here is to restore the `clusterIP` of each service, and that field is immutable in `Service` resources.
1. Copy the endpoints configmap from the old cluster: `rook-ceph-mon-endpoints`.
1. Scale the Rook operator up again: `kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas 1`.
1. Wait until the reconciliation is over.
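
A minimal command sketch for the steps above, assuming `kubectl` contexts named `old` and `new` for the two clusters. The context names, the exact list of secrets, and the number of mons are placeholders for your environment, and exported manifests may need server-side metadata such as `resourceVersion`, `uid`, and `ownerReferences` stripped before they apply cleanly.

```console
# Step 1: scale the operator down and delete the deployments it manages
kubectl --context new -n rook-ceph scale deployment rook-ceph-operator --replicas 0
kubectl --context new -n rook-ceph delete deployment -l operator!=rook

# Step 2 (restoring the Rook PVCs) is handled by the backup/restore tool,
# e.g. the Velero example earlier in this section.

# Step 3: copy the keyring and fsid secrets from the old cluster
for s in rook-ceph-mgr-a-keyring rook-ceph-mon rook-ceph-mons-keyring rook-ceph-osd-0-keyring; do
  kubectl --context old -n rook-ceph get secret "$s" -o yaml | kubectl --context new apply -f -
done

# Step 4: re-create the mon services so their original clusterIPs can be restored
kubectl --context new -n rook-ceph delete service rook-ceph-mon-a rook-ceph-mon-b rook-ceph-mon-c
kubectl --context old -n rook-ceph get service rook-ceph-mon-a rook-ceph-mon-b rook-ceph-mon-c -o yaml | kubectl --context new apply -f -

# Step 5: copy the mon endpoints configmap
kubectl --context old -n rook-ceph get configmap rook-ceph-mon-endpoints -o yaml | kubectl --context new apply -f -

# Step 6: scale the operator back up and watch it reconcile
kubectl --context new -n rook-ceph scale deployment rook-ceph-operator --replicas 1
kubectl --context new -n rook-ceph logs -f deployment/rook-ceph-operator
```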