Skip to content

2.25.0.0-b144

@hari90 hari90 tagged this 15 Oct 22:57
Summary:
This change adds the capability to create new indexes in a bi-directional xCluster setup without having to stop the user workload on the indexed (base) table. This guarantees eventual consistency of the indexes on both universes. This is specific to YSQL where stronger consistency is required.
In order to use the capability the user has to set `--auto_add_new_index_to_bidirectional_xcluster` (disabled by default), and then issue the DDL to create the Index on both universes concurrently within 10min `--add_new_index_to_bidirectional_xcluster_timeout_secs`. The Create Index will block and eventually timeout if it is only issued on one universe.

Online index creation in Yugabyte is based on the F1 paper, and has 4 stages: `Delete Only`, `Insert and Delete Only`, `Backfill`, `Read, Insert and delete`. Consistency is guaranteed as long as no two nodes in the system are more than 2 stages apart. Within a single universe the CatalogVersion is bumped to enforce this.
With bi-directional xCluster writes happen on both universes, each with its own CatalogVersion. The regular Create Index flow does not have a way to synchronize the stages across the universes, which before this change meant the user could not write to the indexed table while creating indexes. Its important to note that in bi-directional xCluster the users have the responsibility to insert into different key ranges on each universe. This is important as xCluster which operates at the physical layer does not enforce YSQL layer constraints like Foreign keys.
In order to allow Online Create Index we synchronize the two universe but only at the Backfill stage. We allow them to diverse by more than 2 stages, since for any given key range each universe will locally ensures the 2 stage apart policy. Only the backfill stage reads and writes data across the entire key range, so this alone needs to be synchronized across the two universe.

When the indexed table is part of bi-directional xCluster the new sequence of events when creating a new index is:
1. Universe A: User starts CREATE INDEX DDL.
1. Universe A: PG Created DocDB tables for the index.
1. Universe A: xCluster creates streams for the new index table and checkpoints it to OpId 0, so that all the data can be replicated.
1. Universe A: PG goes through `indislive` and `indisready` stages and then triggers the `backfill`.
1. Universe A: xCluster waits for index on Universe B to reach `backfill`.
1. Universe B: User starts CREATE INDEX DDL.
1. Universe B: PG Created DocDB tables for the index.
1. Universe B: xCluster creates streams for the new index table and checkpoints it to OpId 0, so that all the data can be replicated.
1. Universe B: PG goes through `indislive` and `indisready` stages and then triggers the backfill.
1. Universe A: xCluster gets the stream for index on Universe B and add its local index to replication.
1. Universe A: xCluster picks backfill time, and waits for its indexed table to catch up.
1. Universe A: Backfill operation runs.
1. Universe A: PG sets `indisvalid`.
1. Universe A: User CREATE INDEX DDL completes.
1. Universe B: xCluster gets the stream for index on Universe B and add its local index to replication.
1. Universe B: xCluster picks backfill time, and waits for its indexed table to catch up.
1. Universe B: Backfill operation runs.
1. Universe B: PG sets `indisvalid`.
1. Universe B: User CREATE INDEX DDL completes.

Steps 10-19 can get interweaved across the two universes.

For colocated tables, only the parent table is part of replication, so we skip the steps related to stream creation, and adding index to replication.

- For YSQL  `INDEX_PERM_DO_BACKFILL` is not used. Instead PG relies on  `indislive` and `indisready`. Adding `index_is_backfilling` to `GetTableSchemaResponsePB` which is set from the in-mem value of `TableInfo::IsBackfilling`. This is set while the backfill job is running.
- Added Client helpers `IsBackfillIndexStarted`, `WaitForReplicationDrain` and `GetXClusterStreams`.
- Moved all xCluster related code from `backfill_index.cc` to the `TryGetXClusterSafeTimeForBackfill` function.
- Added `AddBiDirectionalIndexToXClusterTargetTask` which waits for the index DocDB table on the other universe to get created, reach the backfill stage and then adds it to replication.
- Added `CreateXClusterStreamForBiDirectionalIndexTask` to crate stream and checkpoint it to OpId 0 on source side, when DocDB table for index is created.
- Added helper functions `IsTableBiDirectionallyReplicated`, universe replication `HasTable`.
- Converted `CreateNewXClusterStreamForTable` to run asynchronously.
- `PrepareAndGetBackfillTimeForBiDirectionalIndex` run the steps required to synchronize the two clusters and returns the safe backfill time.

#24362 tracks emitting a user friendly info in PG about running the DDL on both clusters when the index is created.

**Upgrade/Rollback safety:**
`is_backfilling` has been added to `GetTableSchemaResponsePB`. This is protected under the flag `auto_add_new_index_to_bidirectional_xcluster` which is turned OFF by default. Anyone intending to use this for other operations must ensure they have their own flag. This has been indicated in a comment in the proto file.

Fixes #22244
Jira: DB-11162

Test Plan:
XClusterBiDirectionalIndexTest, CreateIndex
XClusterBiDirectionalIndexTest, BlockBeforeBackfill
XClusterBiDirectionalIndexTest, PauseIndexedTable
XClusterBiDirectionalIndexTest, CreateIndexWithWorkload

Reviewers: jhe, slingam, xCluster

Reviewed By: jhe

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D38611
Assets 2
Loading