127 changes: 127 additions & 0 deletions content/blog/2025-11-17-introducing-ramendr-starter-kit.adoc
@@ -0,0 +1,127 @@
---
date: 2025-11-17
title: Introducing RamenDR Starter Kit
summary: A new pattern to illustrate Regional DR for Virtualization workloads running on OpenShift Data Foundation
author: Martin Jackson
blog_tags:
- patterns
- announce
---
:toc:
:imagesdir: /images

We are excited to announce that the link:https://validatedpatterns.io/patterns/ramendr-starter-kit/[**validatedpatterns-sandbox/ramendr-starter-kit**] repository is now available and has reached the Sandbox tier of Validated Patterns.

== The Pattern

This Validated Pattern draws on previous work that models Regional Disaster Recovery, adds Virtualization to
the managed clusters, starts virtual machines, and can fail them over and back between the managed clusters.

The pattern ensures that all of the prerequisites are set up correctly and in order, including steps such as
copying the SSL CA certificates that both the Ceph replication and the OADP/Velero replication need in order to
work correctly.
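
As an illustration, the kind of step the pattern automates looks roughly like the following. This is a minimal sketch, assuming kubeconfigs for both managed clusters; the destination ConfigMap name is illustrative rather than the exact one the pattern's scripts create.

[source,bash]
----
# Read the ingress CA bundle from the first managed cluster...
oc --kubeconfig "$CLUSTER1_KUBECONFIG" -n openshift-config-managed \
  get configmap default-ingress-cert \
  -o jsonpath='{.data.ca-bundle\.crt}' > cluster1-ca.crt

# ...and make it available on the peer cluster so that replication traffic
# can trust the first cluster's endpoints. The ConfigMap name is a placeholder.
oc --kubeconfig "$CLUSTER2_KUBECONFIG" -n openshift-config \
  create configmap cluster1-ingress-ca \
  --from-file=ca-bundle.crt=cluster1-ca.crt
----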

The user is in control of when the failover happens; the pattern provides a script that performs the explicit failover
required for Ramen Regional DR of a discovered application.
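
For Ramen Regional DR, triggering the failover ultimately means setting the desired action on the application's DRPlacementControl resource. The following is a minimal sketch with placeholder names; the pattern's script wraps this step with validation and waiting logic.

[source,bash]
----
# Trigger a failover of the protected application to the surviving cluster.
# <drpc-name>, <namespace>, and <failover-cluster> are placeholders.
oc patch drpc <drpc-name> -n <namespace> --type merge \
  -p '{"spec":{"action":"Failover","failoverCluster":"<failover-cluster>"}}'

# Watch the DRPlacementControl until the failover completes.
oc get drpc <drpc-name> -n <namespace> -o wide -w
----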

== Why Does DR Matter?

In a perfect world, every application would have its own knowledge of where it is available and would shard and
replicate its own data. But many applications were built without these concepts in mind, and even if a company
wanted to and could afford to re-write every application, it could not re-write and re-deploy them all at once.

Thus, users benefit from being able to rely on technology products and solutions to enable a regional disaster
recovery capability when the application does not support it natively.

In several industries, the ability to recover a workload in the event of a regional disaster is considered a
requirement for applications that the user deems critical enough to need DR support, but that cannot provide it
natively.

== Learnings from Developing the Pattern: On the Use of AI to Generate Scripts

This pattern is also noteworthy in that all of the major shell scripts in the pattern were written by
Cursor. This was a major learning experience, both in the capabilities of modern AI coding tools, and in some
of their limitations.

=== The Good

* Error handling and visual output are better than in the shell scripts (or Ansible code) I would have written if
I had written all of this from scratch.
* The "inner loop" of development felt a lot faster using the generated code than if I had written it all from
scratch. The value in this pattern is in the use of the components together, not in finding new and novel
ways to retrieve certificate material from a running OpenShift cluster.

=== The Bad

* Even when the context "knew" it was working on OpenShift and Hive, it used different mechanisms to retrieve
kubeconfig files for the managed clusters. I had to remind it to use a known-good mechanism, which had already worked
for downloading kubeconfigs to the user workstation (see the sketch after this list).
* Several of these scripts are bash scripts wrapped in Kubernetes Jobs or CronJobs. The generator had some problems
with using local variables in places where that is not possible, and with using shell here documents in places where
YAML does not allow them. Eventually I set the context that we were better off using Helm's `.Files.Get` calls and
externalizing the scripts from the jobs altogether.
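
For the kubeconfig case, a sketch of one such mechanism, based on Hive's own ClusterDeployment API, is shown below; the cluster name and namespace are placeholders, and the pattern's scripts may differ in detail.

[source,bash]
----
CLUSTER_NAME=cluster1   # placeholder: the ClusterDeployment name
CLUSTER_NS=cluster1     # placeholder: the namespace Hive created it in

# Look up the admin kubeconfig secret that Hive attaches to the ClusterDeployment...
SECRET=$(oc get clusterdeployment "$CLUSTER_NAME" -n "$CLUSTER_NS" \
  -o jsonpath='{.spec.clusterMetadata.adminKubeconfigSecretRef.name}')

# ...and extract it to a local file for use with --kubeconfig.
oc extract "secret/$SECRET" -n "$CLUSTER_NS" \
  --keys=kubeconfig --to=- > "${CLUSTER_NAME}-kubeconfig"
----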

=== The Ugly

* I am uncomfortable with the level of duplication in the code. Time will tell whether some of these scripts become
problematic to maintain. A more rigorous analysis might find several opportunities to refactor code.
* The sheer volume of code makes it a bit daunting to look at. All of the major scripts in the pattern are over 150
lines long, and the longest (as of this publication) is over 1300 lines long.
* Some of the choices of technique and dependency loading were a bit too generic. We have images for Validated
Patterns that provide things like a Python interpreter with access to the YAML module, the AWS CLI, and other tools
that turned out to be useful. I left in the Cursor-generated scaffolding for downloading things like the AWS CLI,
because it correctly detects that those dependencies are already installed, and it may prove beneficial if we move to
different images.

== DR Terminology - What are we talking about?

**High Availability (“HA”)** includes all of the characteristics, qualities, and workflows of a system that prevent
unavailability events for workloads. This is a very broad category, and includes things like redundancy built into
individual disks, such that failure of a single drive does not result in an outage of the workload. Load balancing,
redundant power supplies, and running a workload across multiple fault domains are some of the techniques that belong
to HA, because they keep the workload from becoming unavailable in the first place. HA is usually completely automatic, in that it does not require a real-time human in the loop.

**Disaster Recovery (“DR”)** includes the characteristics, qualities and workflows of a system to recover from an
outage event when there has been data loss. DR events often also include things that are recognized as major
environmental disasters (such as weather events like hurricanes, tornadoes, fires), or other large-scale problems that
cause widespread devastation or disruption to a location where workloads run, such that critical personnel might also
be affected (i.e. unavailable because they are dead or disabled) and questions of how decisions will be made without
key decision makers are also considered. (This is often included under the heading of “Business Continuity,” which is
closely related to DR.) There are two critical differences between HA and DR: the first is the expectation of human
decision-making in the loop, and the other is the data loss aspect. That is, in a DR event we know we have lost data;
the questions are how much data is acceptable to lose and how quickly we can restore workloads. This is what makes DR
fundamentally different from HA; but some organizations do not really see or enforce this distinction, and that leads
to a lot of confusion. Some vendors also do not strongly make this distinction, which does nothing to discourage that
confusion.

DR policies can be driven by external regulatory or legal requirements, or by an organization’s internal understanding
of what such external legal and regulatory requirements mean. That is to say, the law may not specifically require a
particular level of DR, but the organization interprets the law to mean that this is what it must do to be compliant
with the law or regulation. The Sarbanes-Oxley Act (“SOX”) in the US was adopted after the Enron and WorldCom financial
scandals of the early 2000s and includes a number of requirements for accurate financial reporting, which many
organizations have used to justify and fund substantial BC/DR programs.

**Business Continuity (“BC”, usually used together with DR as “BCDR” or “BC/DR”)** refers primarily to the people
side of recovery from disasters. Large organizations will have teams that focus on BC/DR and use that term in the team
title or name. Such teams will be responsible for making sure that engineering and application groups are compliant
with the organization’s BC/DR policies. This can involve scheduling and running BC/DR “drills” as well as actual live
testing of BC/DR technologies.

**Recovery Time Objective (“RTO”)** is the amount of time it takes to restore a failed workload to service. This is
NOT the amount of data that is tolerable to lose - that is defined by the companion RPO.

**Recovery Point Objective (“RPO”)** is the amount of data a workload can stand to lose. One confusing aspect of RPO is that it can be defined as a time interval (as opposed to, say, a number of transactions). But an RPO of “5 minutes”
should be read as “we want to lose no more than 5 minutes’ worth of data.”

Lots of people want a 0/0 RPO/RTO, often without understanding what it takes to implement that. It can be
fantastically expensive, even for the world’s largest and best-funded organizations.

== Special Thanks

This pattern was an especially challenging one to design and complete, because of the number of elements in it
and the timing issues inherent in eventual-consistency models. Therefore, special thanks are due to the following
people, without whom this pattern would not exist:

* The authors of the original link:https://github.com/validatedpatterns/regional-resiliency-pattern[regional-resiliency-pattern], which provided the foundation for the ODF and RamenDR components, and building the managed clusters via Hive
* Aswin Suryanarayanan, who helped immensely with some late challenges with Submariner
* Annette Clewett, without whom this pattern would not exist. Annette took the time to thoroughly explain all of RamenDR's dependencies and how to orchestrate them all correctly.
75 changes: 75 additions & 0 deletions content/patterns/ramendr-starter-kit/_index.adoc
@@ -0,0 +1,75 @@
---
title: RamenDR Starter Kit
date: 2025-11-13
tier: sandbox
summary: This pattern demonstrates the use of Red Hat OpenShift Data Foundation for Regional Disaster Recovery of Virtualization workloads
rh_products:
- Red Hat OpenShift Container Platform
- Red Hat OpenShift Virtualization
- Red Hat Enterprise Linux
- Red Hat OpenShift Data Foundation
- Red Hat OpenShift Data Foundation MultiCluster Orchestrator
- Red Hat OpenShift Data Foundation DR Hub Operator
- Red Hat Advanced Cluster Management
industries: []
aliases: /ramendr-starter-kit/
pattern_logo: ansible-edge.png
links:
github: https://github.com/validatedpatterns-sandbox/ramendr-starter-kit/
install: getting-started
bugs: https://github.com/validatedpatterns-sandbox/ramendr-starter-kit/issues
feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform
ci: ramendr-starter-kit
---

:toc:
:imagesdir: /images
:_content-type: ASSEMBLY
include::modules/comm-attributes.adoc[]

== RamenDR Regional Disaster Recovery with Virtualization Starter Kit

This pattern sets up three clusters as recommended for OpenShift Data Foundation Regional Disaster Recovery, as
documented link:https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.18/html-single/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/index[here].

Of additional interest is that the workload it protects and can fail over consists of running virtual
machines.

The setup process is relatively intricate; the goal of this pattern is to handle all the intricate parts and present
a functional DR-capable starting point for Virtual Machine workloads. This pattern takes particular care to sequence
installations and validate prerequisites for all of the core components of the Disaster Recovery system.

In particular, this pattern must be customized to specify DNS basedomains for the managed clusters, which makes
forking the pattern (which we generally recommend anyway, in case you want to make other customizations) effectively
a requirement. The link:https://validatedpatterns-sandbox/patterns/getting-started[**Getting Started**] doc has
details on what needs to be changed and how to commit and push those changes.
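
A rough sketch of that workflow follows; the fork URL is a placeholder, and the exact values files and keys to edit are described in the Getting Started doc rather than here.

[source,bash]
----
# Clone your fork of the pattern (placeholder URL).
git clone git@github.com:<your-org>/ramendr-starter-kit.git
cd ramendr-starter-kit

# Edit the values files to set the DNS basedomains for your managed clusters
# (see the Getting Started doc for the exact files and keys).
$EDITOR values-*.yaml

# Commit and push so that the GitOps machinery picks up the customization.
git add values-*.yaml
git commit -m "Set managed cluster basedomains"
git push origin main
----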

=== Background

It would be ideal if all applications in the world understood availability concepts natively and had their own
integrated regional failover strategies. However, many workloads do not, and users who need regional disaster recovery
capabilities need to solve this problem for the applications that cannot solve it for themselves.

This pattern uses OpenShift Virtualization (the productization of KubeVirt) to run the virtual machine workloads that the pattern protects and fails over.

=== Solution elements

==== Red Hat Technologies

* Red Hat OpenShift Container Platform (Kubernetes)
* Red Hat Advanced Cluster Management (RHACM)
* Red Hat OpenShift Data Foundation (ODF, including Multicluster Orchestrator)
* Submariner (VPN)
* Red Hat OpenShift GitOps (ArgoCD)
* OpenShift Virtualization (Kubevirt)
* Red Hat Enterprise Linux 9 (on the VMs)

==== Other technologies this pattern uses

* HashiCorp Vault (Community Edition)
* External Secrets Operator (Community Edition)

=== Architecture

.ramendr-architecture-diagram
image::/images/ramendr-starter-kit/ramendr-architecture.drawio.png[ramendr-starter-kit-architecture,title="RamenDR Starter Kit Architecture"]
21 changes: 21 additions & 0 deletions content/patterns/ramendr-starter-kit/cluster-sizing.adoc
@@ -0,0 +1,21 @@
---
title: Cluster sizing
weight: 50
aliases: /ramendr-starter-kit/cluster-sizing/
---

:toc:
:imagesdir: /images
:_content-type: ASSEMBLY

include::modules/comm-attributes.adoc[]
include::modules/ramendr-starter-kit/metadata-ramendr-starter-kit.adoc[]

The OpenShift hub cluster is made up of 3 control plane nodes and 3 worker nodes; the 3 workers are standard
compute nodes. For the node size we used **m5.4xlarge** instances on AWS.

This pattern has been tested only on AWS right now, because of the integration of both Hive and OpenShift
Virtualization. We may publish a later revision that supports more hyperscalers.

include::modules/cluster-sizing-template.adoc[]
