From c0b8f5a708913cf661f413fc75ff6e03f9c30383 Mon Sep 17 00:00:00 2001 From: Yug Gupta Date: Wed, 20 Oct 2021 09:12:09 +0530 Subject: [PATCH] docs: add documents for failover and failback add a document to track the steps for failover and failback in case of Async DR; for Planned Migration and Disaster Recovery use case. Signed-off-by: Yug Gupta --- Documentation/async-disaster-recovery.md | 112 +++++++++++++++++++++++ 1 file changed, 112 insertions(+) create mode 100644 Documentation/async-disaster-recovery.md diff --git a/Documentation/async-disaster-recovery.md b/Documentation/async-disaster-recovery.md new file mode 100644 index 000000000000..9577976b33c9 --- /dev/null +++ b/Documentation/async-disaster-recovery.md @@ -0,0 +1,112 @@ +--- +title: Failover and Failback +weight: 3245 +indent: true +--- + +# RBD Asynchronous DR Failover and Failback + +## Table of Contents + +* [Planned Migration and Disaster Recovery](#planned-migration-and-disaster-recovery) +* [Planned Migration](#planned-migration) + * [Relocation](#relocation) +* [Disaster Recovery](#disaster-recovery) + * [Failover](#failover-abrupt-shutdown) + * [Failback](#failback-post-disaster-recovery) + +## Planned Migration and Disaster Recovery + +Rook comes with the volume replication support, which allows users to perform disaster recovery and planned migration of clusters. + +The following document will help to track the procedure for failover and failback in case of a Disaster recovery or Planned migration use cases. + +> **Note**: The document assumes that RBD Mirroring is set up between the peer clusters. +> For information on rbd mirroring and how to set it up using rook, please refer to +> the [rbd-mirroring guide](rbd-mirroring.md). + +## Planned Migration + +> Use cases: Datacenter maintenance, technology refresh, disaster avoidance, etc. + +### Relocation + +The Relocation operation is the process of switching production to a + backup facility(normally your recovery site) or vice versa. For relocation, + access to the image on the primary site should be stopped. +The image should now be made *primary* on the secondary cluster so that + the access can be resumed there. + +> :memo: Periodic or one-time backup of +> the application should be available for restore on the secondary site (cluster-2). + +Follow the below steps for planned migration of workload from the primary + cluster to the secondary cluster: + +* Scale down all the application pods which are using the + mirrored PVC on the Primary Cluster. +* [Take a backup](rbd-mirroring.md#backup-&-restore) of PVC and PV object from the primary cluster. + This can be done using some backup tools like + [velero](https://velero.io/docs/main/). +* [Update VolumeReplication CR](rbd-mirroring.md#create-a-volumereplication-cr) to set `replicationState` to `secondary` at the Primary Site. + When the operator sees this change, it will pass the information down to the + driver via GRPC request to mark the dataSource as `secondary`. +* If you are manually recreating the PVC and PV on the secondary cluster, + remove the `claimRef` section in the PV objects. (See [this](rbd-mirroring.md#restore-the-backup-on-cluster-2) for details) +* Recreate the storageclass, PVC, and PV objects on the secondary site. +* As you are creating the static binding between PVC and PV, a new PV won’t + be created here, the PVC will get bind to the existing PV. +* [Create the VolumeReplicationClass](rbd-mirroring.md#create-a-volume-replication-class-cr) on the secondary site. +* [Create VolumeReplications](rbd-mirroring.md#create-a-volumereplication-cr) for all the PVC’s for which mirroring + is enabled + * `replicationState` should be `primary` for all the PVC’s on + the secondary site. +* [Check VolumeReplication CR status](rbd-mirroring.md#checking-replication-status) to verify if the image is marked `primary` on the secondary site. +* Once the Image is marked as `primary`, the PVC is now ready + to be used. Now, we can scale up the applications to use the PVC. + +>:memo: **WARNING**: In Async Disaster recovery use case, we don't get +> the complete data. +> We will only get the crash-consistent data based on the snapshot interval time. + +## Disaster Recovery + +> Use cases: Natural disasters, Power failures, System failures, and crashes, etc. + +> **NOTE:** To effectively resume operations after a failover/relocation, +> backup of the kubernetes artifacts like deployment, PVC, PV, etc need to be created beforehand by the admin; so that the application can be restored on the peer cluster. For more information, see [backup and restore](rbd-mirroring.md#backup-&-restore). +### Failover (abrupt shutdown) + +In case of Disaster recovery, create VolumeReplication CR at the Secondary Site. + Since the connection to the Primary Site is lost, the operator automatically + sends a GRPC request down to the driver to forcefully mark the dataSource as `primary` + on the Secondary Site. + +* If you are manually creating the PVC and PV on the secondary cluster, remove + the claimRef section in the PV objects. (See [this](rbd-mirroring.md#restore-the-backup-on-cluster-2) for details) +* Create the storageclass, PVC, and PV objects on the secondary site. +* As you are creating the static binding between PVC and PV, a new PV won’t be + created here, the PVC will get bind to the existing PV. +* [Create the VolumeReplicationClass](rbd-mirroring.md#create-a-volume-replication-class-cr) and [VolumeReplication CR](rbd-mirroring.md#create-a-volumereplication-cr) on the secondary site. +* [Check VolumeReplication CR status](rbd-mirroring.md#checking-replication-status) to verify if the image is marked `primary` on the secondary site. +* Once the Image is marked as `primary`, the PVC is now ready to be used. Now, + we can scale up the applications to use the PVC. + +### Failback (post-disaster recovery) + +Once the failed cluster is recovered on the primary site and you want to failback + from secondary site, follow the below steps: + +* Scale down the running applications (if any) on the primary site. + Ensure that all persistent volumes in use by the workload are no + longer in use on the primary cluster. +* [Update VolumeReplication CR](rbd-mirroring.md#create-a-volumereplication-cr) replicationState + from `primary` to `secondary` on the primary site. +* Scale down the applications on the secondary site. +* [Update VolumeReplication CR](rbd-mirroring.md#create-a-volumereplication-cr) replicationState state from `primary` to + `secondary` in secondary site. +* On the primary site, [verify the VolumeReplication status](rbd-mirroring.md#checking-replication-status) is marked as + volume ready to use. +* Once the volume is marked to ready to use, change the replicationState state + from `secondary` to `primary` in primary site. +* Scale up the applications again on the primary site.