2.25.2.0-b123

@mdbridge mdbridge tagged this 09 Mar 22:55
Summary:
With automatic mode xCluster replication, we need to bump up the (original) source database normal space OID counter during switchover.

In this diff we do that when we drop the replication.

Why we need to do this is explained in the new test's comment:
```
TEST_F(XClusterDDLReplicationSwitchoverTest, SwitchoverBumpsAboveUsedOids) {
  // To understand this test, it helps to picture the result of A->B replication before we do a
  // switchover.  The following is an example of the OID spaces of A and B for one database after A
  // has allocated three OIDs we don't care about preserving (the Ns) and one OID we do care about
  // preserving (P).  The [OID ptr]'s indicate where the next OID would be allocated in each space
  // modulo we skip OIDs already in use in that space on that universe.
  //
  // In particular, the next normal space OID that will be allocated on B is the one marked (*),
  // which conflicts with an OID already in use in cluster A.  While not a problem while B is a
  // target (targets only allocate in the secondary space), this will be a problem if we switch the
  // replication direction so B is now the source.
  //
  // Accordingly, xCluster is designed to bump up B's normal space [OID ptr] to after A's normal
  // space [OID ptr] as part of doing switchover; this test attempts to verify that that successfully
  // avoids the OID conflict problem described above.
  //
  //           A:                  B:
  //  Normal:
  //           N                [OID ptr] (*)
  //           N
  //           P                   P
  //           N
  //         [OID ptr]
  //
  //  Secondary:
  //         [OID ptr]             N
  //                               N
  //                               N
  //                            [OID ptr]
```
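
The scenario in the diagram can be replayed with a small toy model. This is only a sketch under stated assumptions, not the code in this diff: `NormalOidSpace`, `BumpAbove`, `FirstOidOnNewSource`, and the starting OID are all hypothetical, and the bump is assumed to mean "every later allocation is strictly greater than the given OID".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical toy model of the normal OID space of one database on one universe.
struct NormalOidSpace {
  uint32_t next_oid;          // the "[OID ptr]" from the diagram
  std::set<uint32_t> in_use;  // OIDs already in use in this space on this universe

  // Allocates the next OID, skipping any OID already in use locally.
  uint32_t Allocate() {
    while (in_use.count(next_oid) != 0) ++next_oid;
    in_use.insert(next_oid);
    return next_oid++;
  }

  // Assumed semantics: afterwards every new allocation is > oid_to_bump_above.
  // The pointer never moves backward.
  void BumpAbove(uint32_t oid_to_bump_above) {
    next_oid = std::max(next_oid, oid_to_bump_above + 1);
  }
};

// Replays the diagram and returns the first normal space OID that B allocates
// once it becomes the source. The OIDs in use on A are returned via oids_used_on_a.
uint32_t FirstOidOnNewSource(bool bump_during_switchover,
                             std::set<uint32_t>* oids_used_on_a) {
  const uint32_t kFirstOid = 16384;  // arbitrary starting point for this sketch
  NormalOidSpace a{kFirstOid, {}};
  NormalOidSpace b{kFirstOid, {}};

  a.Allocate();                             // N
  a.Allocate();                             // N
  const uint32_t preserved = a.Allocate();  // P
  a.Allocate();                             // N
  b.in_use.insert(preserved);  // P is preserved on B; B's pointer does not move
  assert(b.in_use.count(preserved) == 1);

  if (bump_during_switchover) {
    // A learns its current counter by allocating one more OID; B bumps above it.
    b.BumpAbove(a.Allocate());
  }
  *oids_used_on_a = a.in_use;
  return b.Allocate();
}
```

In this model, without the bump the first OID that B allocates is one A already uses (the `(*)` case); with the bump it lands past everything A has allocated.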

Implementation:
  * switchover drops the original-direction replication by calling DeleteOutboundReplicationGroup on the source universe
  * we modify this to get the current normal space OID counters for each namespace in the replication group
    * we do this by simply allocating new OIDs
  * this information is then passed to the target universe in the DeleteUniverseReplication RPC using a new field:
```
~/code/yugabyte-db/src/yb/master/master_replication.proto:
  // producer_namespace_id -> oid_to_bump_above
  map<string,uint32> producer_namespace_oids = 5;
}
```
  * on the target, the RPC handling code then bumps the normal space OID counters of the replication group's namespaces before actually proceeding with the drop
    * in the process, it needs to translate between source and target namespace IDs, which can differ across universes
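
The target-side steps above can be sketched as follows. This is an illustrative model only: `TranslateToTargetNamespaces`, `BumpCounters`, and the plain `std::map` standing in for the per-namespace counters are hypothetical names, not the functions added by this diff, and "bump above" is again assumed to mean the next allocated OID is strictly greater than the value sent.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

using NamespaceId = std::string;

// Hypothetical helper: rewrites producer_namespace_id -> oid_to_bump_above so it
// is keyed by target namespace ID. Returns false (leaving the output untouched)
// if some source namespace has no known target namespace.
bool TranslateToTargetNamespaces(
    const std::map<NamespaceId, uint32_t>& producer_namespace_oids,
    const std::map<NamespaceId, NamespaceId>& source_to_target,
    std::map<NamespaceId, uint32_t>* oids_by_target) {
  std::map<NamespaceId, uint32_t> result;
  for (const auto& entry : producer_namespace_oids) {
    const auto it = source_to_target.find(entry.first);
    if (it == source_to_target.end()) return false;
    result[it->second] = entry.second;
  }
  *oids_by_target = std::move(result);
  return true;
}

// Hypothetical helper: moves each listed namespace's next normal space OID above
// the given value. Counters never move backward, and an empty map (the RPC field
// being absent) changes nothing.
void BumpCounters(const std::map<NamespaceId, uint32_t>& oids_by_target,
                  std::map<NamespaceId, uint32_t>* next_normal_oid) {
  assert(next_normal_oid != nullptr);
  for (const auto& entry : oids_by_target) {
    uint32_t& next = (*next_normal_oid)[entry.first];
    next = std::max(next, entry.second + 1);
  }
}
```

Translating first and bumping second means an untranslatable namespace ID is detected before any counter is modified.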

Other:
  * the number of OIDs prefetched from master at a time is now exposed as a new gflag
    * this allows changing it in the test

Upgrade/Rollback safety:
  * in this diff, we add an optional field to an RPC
  * because automatic mode is first becoming available in this release (2.25.1), it is impossible to upgrade to this code while an automatic xCluster replication is running
  * YBA does not allow setting up automatic replication while doing this upgrade
  * the RPC field is only used when the replication being dropped is in automatic mode; thus the RPC field will not be used before the code is available
  * the absence of the RPC field on the target causes no behavior changes
  * summing up, there is no need for an auto flag here and we do not provide one
Jira: DB-15535

Test Plan:
A new test, XClusterDDLReplicationSwitchoverTest.SwitchoverBumpsAboveUsedOids, verifies that these changes solve the problem in question.  It has been verified to fail if the bumping is not done.
```
ybd --cxx-test xcluster_ddl_replication-test --gtest_filter '*.SwitchoverBumpsAboveUsedOids'
```

Reviewers: hsunder, xCluster

Reviewed By: hsunder

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D42185