
disaster recovery guide for PVCs #6452

Merged
merged 1 commit into rook:master from tareksha:patch-1 on Oct 27, 2020

Conversation

tareksha
Contributor

Description of your changes:

Adds guidelines for performing disaster recovery for rook-ceph in Kubernetes clusters that use PVCs. The existing guide contains many unnecessary steps for this case; it can be handled using only regular Kubernetes resources.

Which issue is resolved by this Pull Request:
Resolves #6442

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

## Adopting into a new Kubernetes cluster with PVCs

It is possible to migrate/restore a rook/ceph cluster from an existing Kubernetes cluster to a new one without resorting to SSH access or ceph tooling. This allows doing the migration using standard Kubernetes resources only. This guide assumes you have a CephCluster that uses PVCs to persist mon and osd data.
1. Stop rook in the cluster cluster by scaling the operator deployment `rook-ceph-operator` down to zero and deleting the other deployments: `rook-ceph-mgr-a`, `rook-ceph-mon-a`, ..., `rook-ceph-osd-0`, ...
Contributor

Suggested change
1. Stop rook in the cluster cluster by scaling the operator deployment `rook-ceph-operator` down to zero and deleting the other deployments: `rook-ceph-mgr-a`, `rook-ceph-mon-a`, ..., `rook-ceph-osd-0`, ...
1. Stop rook in the old cluster by scaling the operator deployment `rook-ceph-operator` down to zero and deleting the other deployments: `rook-ceph-mgr-a`, `rook-ceph-mon-a`, ..., `rook-ceph-osd-0`, ...

Contributor Author

I actually meant the new cluster. I restructured the description to make it clearer.
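For reference, the step being discussed boils down to a couple of kubectl commands. A minimal sketch, assuming the default `rook-ceph` namespace and the `operator!=rook` label selector that the guide itself uses further down:

```console
# Scale the Rook operator down to zero so it stops reconciling the cluster
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas 0

# Delete the operator-managed deployments (mgr, mons, osds); the operator
# will recreate them when it is scaled back up later
kubectl -n rook-ceph delete deployment -l operator!=rook
```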

@subhamkrai
Contributor

The commitlint check is failing; please add a proper commit title and message. See commit-structure.

1. Copy the endpoints configmap from the old cluster: `rook-ceph-mon-endpoints`
1. Scale the operator back up and wait until the reconciliation is over.


Member

How about showing kubectl or other commands that will help users get this working? Right now the document is high level and may be ambiguous about how to back up and restore the resources.

Member

I see your point that the exact commands will depend on the tools. This is an advanced scenario, so it seems acceptable to leave the details to the reader.

Contributor Author

Actually, I did add several kubectl commands as examples.
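For the configmap step specifically, the copy might look roughly like this. This is a sketch only; the `old-cluster`/`new-cluster` kubeconfig contexts are hypothetical and the exact workflow depends on your tooling:

```console
# Export the mon endpoints configmap from the old cluster
kubectl --context old-cluster -n rook-ceph get configmap rook-ceph-mon-endpoints -o yaml > rook-ceph-mon-endpoints.yaml

# Apply it in the new cluster (strip cluster-specific metadata such as
# resourceVersion and uid from the YAML first if apply rejects it)
kubectl --context new-cluster -n rook-ceph apply -f rook-ceph-mon-endpoints.yaml
```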


Do the following in the new cluster:
1. Stop the rook operator by scaling the deployment `rook-ceph-operator` down to zero: `kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas 0`
and deleting the other deployments. An example command to do this is `k -n rook-ceph delete deployment -l operator!=rook`
Member

Why not delete the operator deployment as well? Scaling it down to zero isn't much different than deleting the deployment. Or else why not scale down all the deployments to 0 instead of deleting them?

Contributor Author

User convenience. All the other deployments are created and reconciled by the operator, so once the operator is scaled back up they will be restored automatically. The operator itself is installed as part of rook, and deleting it would require re-installing rook-ceph (helm, argo, ...).

Member

Agreed, when installed with the helm chart you can't just remove the deployment and re-create it easily.
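Since the operator reconciles the other deployments, the last step of the guide is then just a scale-up followed by waiting for reconciliation. A minimal sketch, again assuming the default `rook-ceph` namespace:

```console
# Bring the operator back; it will recreate the mgr/mon/osd deployments
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas 1

# Watch the deployments reappear and the pods become Ready
kubectl -n rook-ceph get deployments -w
kubectl -n rook-ceph get pods -w
```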


@tareksha requested a review from travisn on October 20, 2020
Member

@travisn left a comment

A couple more small suggestions...


@tareksha requested a review from travisn on October 21, 2020
@travisn
Member

travisn commented Oct 21, 2020

The content looks good, thanks. Now the CI checks are just complaining about a couple housekeeping items:

  • DCO (your signature)
  • Commitlint: Please squash to a single commit and add the "docs:" prefix in the commit as described here
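
For anyone hitting the same two checks, one common way to fix both locally is an interactive rebase plus a signed-off amend. This is a sketch; the `upstream` remote name is an assumption, the `patch-1` branch name comes from this PR, and the commit message shown is only illustrative:

```console
# Squash everything on the branch into a single commit
git fetch upstream
git rebase -i upstream/master   # mark all but the first commit as "squash"

# Reword the commit with the required prefix and add the DCO sign-off
git commit --amend -s -m "docs: add disaster recovery guide for PVC-based clusters"

# Update the pull request
git push --force-with-lease origin patch-1
```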

@LalitMaganti
Contributor

The complexity of the restore process was one of the big things holding me back from fully committing to Rook + Ceph so it's really great to see a much simpler way of handling disaster recovery!

I am curious, however, about how Rook's hostPath data (the data at /var/lib/rook) should be handled with this process.

I can see three paths (after scaling down the operator deployment in the new cluster):

  1. Restore the data at this hostPath from the old cluster
  2. Completely purge all data in this path in the expectation that the operator will recreate the data.
  3. Keep whatever data was created by the operator before the scale down in the expectation that the operator will update the data here based on the restored CRDs.

Judging by the guide not calling any attention to this problem, I'm guessing the answer is 3 and it's not something which needs to be worried about.

However, IMHO it is worth calling this out explicitly somewhere in the guide. I can see this being the first point people think of because of how careful one needs to be about /var/lib/rook in general when setting up/tearing down clusters.

Thanks for the great work here! Looking forward to trying this process out and starting to use Rook/Ceph for real!

@travisn
Member

travisn commented Oct 23, 2020

@LalitMaganti This simpler backup/restore guide is really only for clusters that are deployed on PVCs (for example with cluster-on-pvc.yaml). The dataDirHostPath does not need to be backed up in that case since it will only have crush dumps, logging, and other non-critical information.

Would you mind opening a new issue to update the disaster recovery guide for the case where clusters are not on PVCs? The main difference will be to back up the mon directories in /var/lib/rook and the raw devices for OSDs.

@LalitMaganti
Contributor

Oh cool! For the record, I plan on using PVCs for storing all mon and OSD data, so it seems like the simple guide will work just fine for me as is.

I always thought that even in the PVC case, important data was stored on the hostpath volume. It would be good if we could update the docs to state this fact, as I don't think it's clear from the current documentation.

@LalitMaganti
Contributor

I tried out creating a PVC cluster today and I noticed that two files get created on each node in the dataDirHostPath folder:

  1. rook-ceph.config containing fsid, mon addresses etc.
  2. client.admin.keyring containing keyring info

Just to confirm again: these don't need to be backed up, right? If not, I'm curious as to which part of Rook (re)creates them. The operator?

Thanks again for all the help and answering my questions!

@tareksha
Contributor Author

@LalitMaganti I noticed that such pieces of data are stored in the secrets as sources, and the components (mon, osd, mgr) mount the secrets as files to access them. It seems the operator takes care of creating these secrets (and configmaps) - maybe @travisn can add more info here. Most of them are not reproducible automatically and need to be backed up - keyrings, fsid, ceph user/pass, ... In my test I simply backed up all the rook secrets and restored them to the new cluster. I think I'll update the document for this.
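Roughly, backing those objects up and restoring them with plain kubectl might look like the following. A sketch only; the `old-cluster`/`new-cluster` contexts are hypothetical, and you may need to prune cluster-specific metadata from the exported YAML before applying it:

```console
# Export all secrets and configmaps from the old cluster's rook-ceph namespace
kubectl --context old-cluster -n rook-ceph get secrets,configmaps -o yaml > rook-ceph-metadata.yaml

# Restore them into the new cluster (with the operator still scaled down)
kubectl --context new-cluster -n rook-ceph apply -f rook-ceph-metadata.yaml
```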

Signed-off-by: Tareq Sharafy <tareq.sha@gmail.com>
@travisn
Member

travisn commented Oct 27, 2020

@LalitMaganti Correct, Rook stores the important metadata for the cluster in a few configmaps and secrets. When the cluster info is needed in the pods, it is generated from those configmaps and secrets as the source of truth.
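For anyone curious which objects those are, a quick way to inspect them in a running cluster (assuming the default `rook-ceph` namespace) before deciding what to back up:

```console
# List the configmaps and secrets the operator maintains for the cluster
kubectl -n rook-ceph get configmaps,secrets
```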

@travisn travisn merged commit 3dccee9 into rook:master Oct 27, 2020
mergify bot added a commit that referenced this pull request Oct 27, 2020
@tareksha deleted the patch-1 branch on October 28, 2020
Labels: ceph, docs
Projects: None yet

Successfully merging this pull request may close these issues:

Disaster Recovery Without SSH to Nodes Not Working

6 participants