WIP: Document for cluster recovery when running on PVCs #4838
Conversation
The PVCs behind OSDs are not easily identifiable. This adds a label to the OSDs in order to query on app=rook-ceph-osd. Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
When the cluster CR is deleted, all the resources are also deleted. To prevent accidental removal of critical data, we don't want to remove PVCs behind MONs or OSDs automatically. The PVCs behind OSDs already do not have owner references for this reason. Now this change removes the owner reference from the MONs as well. Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
When catastrophe strikes and an entire Kubernetes cluster is destroyed, it is still possible to restore Rook in a new Kubernetes cluster as long as the PVs underneath the MONs and OSDs are still available. This guide walks through the restoration of a cluster. Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
## Scenario

1. The Kubernetes environment underlying a running Rook Ceph cluster failed catastrophically, requiring a new Kubernetes environment in which the user wishes to recover the previous Rook Ceph cluster.
2. The underlying PVs with the Ceph data (OSDs) and metadata (MONs) are still available in the cloud environment.
Suggested change:
- 2. The underlying PVs with the Ceph data (OSDs) and metadata (MONs) are still available in the cloud environment.
+ 2. The underlying PVs with the Ceph data (OSDs) and metadata (Monitors) are still available in the cloud environment.
### Exporting Critical Info

Critical keys and info about the mons must be exported from the original cluster. This info is not stored on the PVs by either the mons or osds. This info is necessary to restore the cluster in case of disaster.
Suggested change:
- or osds. This info is necessary to restore the cluster in case of disaster.
+ or OSDs. This info is necessary to restore the cluster in case of disaster.
kubectl -n ${namespace} get cm rook-ceph-mon-endpoints -o yaml > critical/rook-ceph-mon-endpoints.yaml
kubectl -n ${namespace} get svc -l app=rook-ceph-mon -o yaml > critical/rook-ceph-mon-svc.yaml
# information about PVCs and PVs to help reconstruct them later
# TODO: Can we just export these as yamls and import them again directly? At a minimum we would need to filter the PV list since more than Rook PVs would be included.
This would mean using `--export`, if the user's kubectl still supports it.
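Regarding the TODO about filtering the PV list: since PVs are cluster-scoped, an export would pick up volumes unrelated to Rook. A minimal sketch of that filtering step, assuming PV JSON as produced by `kubectl get pv -o json` (the function name and the sample data below are illustrative, not part of Rook):

```python
import json

def rook_pvs(pv_list_json, namespace="rook-ceph"):
    """Filter a PersistentVolume list down to volumes whose claims
    live in the Rook namespace (PVs are cluster-scoped, so the raw
    list contains more than just Rook volumes)."""
    pvs = json.loads(pv_list_json)["items"]
    return [
        pv for pv in pvs
        if pv.get("spec", {}).get("claimRef", {}).get("namespace") == namespace
    ]

# Fabricated sample standing in for `kubectl get pv -o json` output
sample = json.dumps({"items": [
    {"metadata": {"name": "pv-a"}, "spec": {"claimRef": {"namespace": "rook-ceph"}}},
    {"metadata": {"name": "pv-b"}, "spec": {"claimRef": {"namespace": "default"}}},
]})
print([pv["metadata"]["name"] for pv in rook_pvs(sample)])  # ['pv-a']
```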
1. Start the new Kubernetes cluster

2. Modify the critical resources before creating them
Fields to trim:
- `creationTimestamp`
- `namespace` (if different)
- `resourceVersion`
- `uid`
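A minimal sketch of trimming those fields from an exported manifest, using `grep` on top-level metadata lines; the function name is illustrative, and a YAML-aware tool such as `yq` would handle nested fields more robustly. `namespace` is deliberately left alone here since it should only be changed when the target namespace differs:

```shell
# Strip cluster-specific metadata fields before re-creating a resource
# in the new cluster. Assumes the fields appear as indented top-level
# metadata keys, as in typical `kubectl get -o yaml` output.
trim_metadata() {
  grep -vE '^[[:space:]]+(creationTimestamp|resourceVersion|uid|selfLink):'
}

printf 'metadata:\n  name: rook-ceph-mon-endpoints\n  uid: 1234\n  resourceVersion: "99"\n' | trim_metadata
```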
<TODO: Commands to create the PVs>

<TODO: How do we know which PVs belonged to the MONs or OSDs? The volumes just have random names. Do we need to rely on the PV size to indicate
yes!
On a running cluster, you might see these PVs:

```console
$ oc get pvc -l ceph.rook.io/DeviceSet=set1
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
```
The PVC name is `set1-data-0-w4bgt`; that's the new format now.
<TODO: Commands to bind the PVCs to the PVs>

7. Create PVCs for the OSD volumes.
   - The PVCs must follow the Rook naming convention `<device-set-name>-<index>-<type>-<suffix>` where
The PVC name is `set1-data-0-w4bgt`; that's the new format now.
So maybe we should present both.
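To make the "present both" idea concrete, here is a sketch that classifies a PVC name against the two conventions discussed: the older `<device-set-name>-<index>-<type>-<suffix>` form and the newer `<device-set-name>-<type>-<index>-<suffix>` form seen in `set1-data-0-w4bgt`. The regexes and character classes are illustrative assumptions, not taken from Rook source:

```python
import re

# Illustrative patterns for the two OSD PVC naming formats; the exact
# character classes and type names are assumptions for this sketch.
OLD_FORMAT = re.compile(
    r"^(?P<set>[a-z0-9-]+?)-(?P<index>\d+)-(?P<type>data|metadata|wal)-(?P<suffix>[a-z0-9]+)$")
NEW_FORMAT = re.compile(
    r"^(?P<set>[a-z0-9-]+?)-(?P<type>data|metadata|wal)-(?P<index>\d+)-(?P<suffix>[a-z0-9]+)$")

def classify_pvc_name(name):
    """Return which naming convention a PVC name matches, if any."""
    if OLD_FORMAT.match(name):
        return "old"
    if NEW_FORMAT.match(name):
        return "new"
    return None

print(classify_pvc_name("set1-data-0-w4bgt"))  # new
print(classify_pvc_name("set1-0-data-w4bgt"))  # old
```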
incorrect from the services that were imported from the previous cluster. The mon endpoints are part of their identity and cannot change. If they do need to change, see the section above on [restoring mon quorum](#restoring-mon-quorum).

12. Verify that the cluster is working. You should see three MONs, some number of OSDs, and one MGR daemon running.
Suggested change:
- 12. Verify that the cluster is working. You should see three MONs, some number of OSDs, and one MGR daemon running.
+ 12. Verify that the cluster is working. You should see three Monitors, some number of OSDs, and one MGR daemon running.
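As a sanity check for that verification step, the daemon counts can be tallied from the plain-text output of `kubectl -n rook-ceph get pods`. A sketch with fabricated sample output (the pod names below are made up for illustration):

```python
import re

def count_daemons(pods_output):
    """Count mon/osd/mgr pods from `kubectl get pods` text output."""
    counts = {"mon": 0, "osd": 0, "mgr": 0}
    for line in pods_output.splitlines():
        m = re.match(r"rook-ceph-(mon|osd|mgr)-", line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Fabricated sample standing in for real `kubectl -n rook-ceph get pods` output
sample = """\
rook-ceph-mon-a-6f8b9c-abcde   1/1  Running
rook-ceph-mon-b-7d9c0d-fghij   1/1  Running
rook-ceph-mon-c-8e0d1e-klmno   1/1  Running
rook-ceph-osd-0-9f1e2f-pqrst   1/1  Running
rook-ceph-mgr-a-0a2f3a-uvwxy   1/1  Running
"""
print(count_daemons(sample))  # {'mon': 3, 'osd': 1, 'mgr': 1}
```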
@travisn please rebase. |
This pull request has merge conflicts that must be resolved before it can be merged. @travisn please rebase it. https://rook.io/docs/rook/master/development-flow.html#updating-your-fork |
@travisn any updates on this one? |
This still needs testing |
There are too many open questions when assuming that the entire K8s cluster is lost. Closing this in favor of #6452, which requires the backup of the critical resources to be restored later in the new cluster. |
Description of your changes:
When catastrophe strikes and an entire Kubernetes cluster is destroyed, it is still possible to restore Rook in a new Kubernetes cluster as long as the PVs underneath the MONs and OSDs are still available and some critical metadata was backed up before the loss. This guide walks through the restoration of such a cluster.
My testing has not yet included loss of an entire cluster. Thus far it has only been tested on a cluster where the cluster CR was removed and the PVCs and PVs remained intact.
Checklist:
- Code generation (`make codegen`) has been run to update object specifications, if necessary.

[test ceph]