Skip to content

2.25.2.0-b19

@mdbridge mdbridge tagged this 19 Feb 20:11
Summary:
In order to avoid OID collision, we are splitting the original Postgres OID space in half; the first part is the "normal" space, which continues to be used for most things and the second part is a new "secondary" space.  The split is:
```
~/code/yugabyte-db/src/yb/common/entity_ids.h:32:
static const uint32_t kPgFirstNormalObjectId = 16384; // Hardcoded in transam.h
// We include some OIDs that would be signed with int32_t to allow checking the Postgres logic that
// handles those.  See PgLibPqLargeOidTest.LargeOid test.
static const uint32_t kPgUpperBoundNormalObjectId = 2'199'999'999;; // upper bound is exclusive
// Secondary OID space is used by xCluster when a database is a target.
static const uint32_t kPgFirstSecondarySpaceObjectId = 2'200'000'000; // = 0x83'21'56'00
static const uint32_t kPgUpperBoundSecondarySpaceObjectId = 0xff'ff'ff'ff;
```
where OIDs in range 1..16383 are still considered part of the normal space but reserved for initdb and the like.

When automatic xCluster replication is running, the target universe will allocate OIDs from the secondary space instead of the normal space to avoid collisions with OIDs freshly allocated on the source universe.  That is the only time we allocate OIDs from the secondary space.

Implementation of the splitting:
  * we modify the existing per-database OID allocator
  * the system catalog entries for databases in addition to the OID pointer for the normal space have a new OID pointer for the secondary space
  * the RPC to the master to reserve a range of OIDs now takes a bool that determines which space we allocate from
  * the TServers cache the OIDs they get from the result of that RPC; each TServer caches OIDs from only one space, discarding their cache when it is necessary to switch spaces
    * we expect switching to only happen when xCluster replication is dropped so there is no need to maintain dual caches
  * Postgres backends do not cache OIDs unless --ysql_enable_pg_per_database_oid_allocator=false, which is not the default and has never been recommended
    * if for some reason that flag had been turned on and we need to allocate from the secondary space, an error is returned
    * we are considering a better guardrail for that in the future
  * when databases are copied from template databases, only their normal OID pointer is copied
    * this effectively gives a secondary space pointer pointing to the start of the secondary space
    * this is fine -- xCluster safety only requires the secondary space pointer not be reset while replication of a database is running
  * when a database is restored, both OID pointers are reset to the beginning of their respective spaces
    * as above, this is fine for the secondary space pointer
    * Postgres doesn't care about this because it expects and handles correctly OID counters that wrap; YugabyteDB expects the counter not to wrap during a Postgres session but we have not broken that expectation here
    * xCluster does care but we will handle the normal space pointer in a future diff by doing a scan of the OIDs in use when xCluster automatic mode replication is started up then setting the normal space OID pointer after that value

**Upgrade/Rollback safety:**

Renamed one proto-field, next_pg_oid->next_normal_pg_oid, and added one field, next_secondary_pg_oid to the system catalog entry for namespaces:
```
~/code/yugabyte-db/src/yb/master/catalog_entity_info.proto:277:
// The data part of a SysRowEntry in the sys.catalog table for a namespace.
message SysNamespaceEntryPB {
  ...
  optional uint32 next_normal_pg_oid = 3; // Next normal space oid to assign.
  ...
  optional uint32 next_secondary_pg_oid = 9; // Next secondary space oid to assign.
}
```
  * Note that the field rename is not persisted and serves only to verify that no one is incorrectly depending on this field
    * (the field ID is unchanged and that is what is persisted)
  * the code treats the absence of the second field as if it was there and containing the start of the secondary space

Added a new boolean argument to the RPC for allocating a range of OIDs:
```
~/code/yugabyte-db/src/yb/master/master_client.proto:185:
// Reserve Postgres oid
message ReservePgsqlOidsRequestPB {
  optional bytes namespace_id = 1; // The namespace id of the Postgres database.
  optional uint32 next_oid = 2;    // The next oid to reserve.
  optional uint32 count = 3;       // The number of oids to reserve.
  // use_secondary_space is used by xCluster when a database is a target.
  optional bool use_secondary_space = 4;
}
```
When omitted, this field defaults to false, which means to use the normal space as usual.

We are not planning to use any auto flags here in spite of these changes changing persistent data and RPCs
  * automatic mode will not be available in the wild until after this diff is landed
  * and the customer has upgraded to a version after that
    * YBA will not allow setting up automatic mode replication until after that version is finalized
  * this means that the RPC will never be called with the extra field until after the upgrade is finalized
  * this means that the new persisted field will likewise never be set until after the upgrade is finalized

Although no new fields will be used until the upgrade is finalized, the logic that limits the OIDs to the normal space applies immediately once the upgrade has started:
  * in theory, an upgrade could cause problems for a customer whose normal space OID pointer was already into the secondary space
    * this is very unlikely in practice because the customer would've have to gone through over a billion OIDs
    * worst case, we can safely reset their OID counter back to the beginning of the normal space
      * Postgres code comments basically says they can not handle more than a few million active OIDs at once due to poor algorithm choice
      * so the problem is the OID counter could get too high not that there are too many OIDs
      * vanilla Postgres is already designed to handle OID wrap and YugabyteDB can handle counter resets when xCluster is not running so the resetting should not be an issue
    * a rollback in the meantime will restore functionality because the previous version of the code does not honor the new lower normal space upper bound

Implementation for using the secondary space on xCluster automatic mode replication targets:
  * extended TserverXClusterContextIf with a function for determining if a namespace is the target of xCluster automatic replication
  * that returning true plus being in xCluster read-only mode determines that we should use the secondary space
  * we carefully do not use the secondary space if other modes of xCluster are being used
  * this required some plumbing changes where we now pass/store TserverXClusterContext instead of TserverXClusterContextIf in some places
Jira: DB-12928

Test Plan:
I added a new test suite to unit test the OID allocator:
```
ybd --cxx-test xcluster_secondary_oid_space-test
```
as well as a new test to verify that collisions have been avoided if there are extra OID allocations on the target:
```
ybd --cxx-test xcluster_ddl_replication-test --gtest_filter '*.ExtraOidAllocationsOnTarget'
```
This test fails if we do not use the secondary space on the target.

Note that there is additional work to remove collisions in all cases, particularly in the case of switch overs.  Future diffs will address that.

```
ybd --cxx-test pg_libpq-test --gtest_filter '*.LargeOid'
```

Reviewers: hsunder, xCluster

Reviewed By: hsunder

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D41872
Assets 2
Loading