Release 2.25.0.0-b346: [#24905] xCluster: Ensure all Pollers are stopped on replication pause · yugabyte/yugabyte-db

2.25.0.0-b346
4e7696f

2.25.0.0-b346: [#24905] xCluster: Ensure all Pollers are stopped on replication pause


hari90 tagged this 21 Nov 15:15

Summary:
When performing a Disaster Recovery failover of a transactional xCluster replication, we first need to pause the Pollers. This ensures that during grey failures, where replication is partially or sporadically healthy, the xCluster safe time used for the cutover (PITR) does not move.

Prior to this change, Pollers of paused streams were stopped and removed by the xCluster Consumer. Now the Pollers first perform any post-apply work and set the tablet stream error to `REPLICATION_PAUSED`. yb-master waits for all the tablet streams to reach this state when processing the pause replication request.
For DDL replication, the DDL poller will run any pending DDLs as part of this post-apply work. That part will be done in #23957.
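
The pause handshake described above can be sketched as a small model. This is an illustration only, not the YugabyteDB implementation: the `Poller` class, `on_pause`, `all_streams_paused`, and the `OK` state are hypothetical names; only `REPLICATION_PAUSED` comes from this change.

```python
from dataclasses import dataclass, field
from enum import Enum


class StreamError(Enum):
    # Hypothetical stand-ins; only REPLICATION_PAUSED is named in this change.
    OK = "OK"
    REPLICATION_PAUSED = "REPLICATION_PAUSED"


@dataclass
class Poller:
    """Toy poller for one tablet stream (illustrative, not the real class)."""
    tablet_id: str
    pending_work: list = field(default_factory=list)
    error: StreamError = StreamError.OK

    def on_pause(self):
        # Finish any post-apply work before reporting the paused state.
        while self.pending_work:
            self.pending_work.pop(0)()
        self.error = StreamError.REPLICATION_PAUSED


def all_streams_paused(pollers):
    """Master-side check: the pause completes only when every stream is paused."""
    return all(p.error is StreamError.REPLICATION_PAUSED for p in pollers)


applied = []
pollers = [
    Poller("tablet-1", pending_work=[lambda: applied.append("ddl-1")]),
    Poller("tablet-2"),
]

pollers[0].on_pause()
partially_paused = all_streams_paused(pollers)  # tablet-2 has not paused yet
pollers[1].on_pause()
fully_paused = all_streams_paused(pollers)
```

In this model the pause request is not acknowledged while any single stream is still running, which is the property that keeps the safe time from moving during a partially healthy replication.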

The xCluster consumer tserver-master heartbeat now includes the modified poller states as part of regular heartbeat processing when a full report is not needed.
This ensures that the pause does not have to wait for the next full report. The full report is still handled by the metric heartbeat provider (every 5s).
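
One way to picture the split between modified-state reports and full reports is the sketch below. It is a simplified model under stated assumptions: `ConsumerHeartbeatState`, `set_state`, and `build_report` are hypothetical names, and the real heartbeat payload is not shown in this change description.

```python
class ConsumerHeartbeatState:
    """Toy tracker of poller states to send in the next heartbeat (hypothetical)."""

    def __init__(self):
        self.states = {}       # tablet_id -> latest reported state
        self.modified = set()  # tablet_ids changed since the last heartbeat

    def set_state(self, tablet_id, state):
        # Record a change only when the state actually differs.
        if self.states.get(tablet_id) != state:
            self.states[tablet_id] = state
            self.modified.add(tablet_id)

    def build_report(self, full_report_needed):
        # A full report carries every stream; a regular heartbeat carries
        # only the streams whose state changed since the previous heartbeat.
        if full_report_needed:
            report = dict(self.states)
        else:
            report = {t: self.states[t] for t in self.modified}
        self.modified.clear()
        return report


hb = ConsumerHeartbeatState()
hb.set_state("tablet-1", "OK")
hb.set_state("tablet-2", "OK")
full = hb.build_report(full_report_needed=True)

hb.set_state("tablet-2", "REPLICATION_PAUSED")
delta = hb.build_report(full_report_needed=False)
empty = hb.build_report(full_report_needed=False)
```

Here the paused state of `tablet-2` travels with the very next regular heartbeat instead of waiting for the periodic full report.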

XClusterConsumer will no longer write an invalid safe time to the safe time table, and it now only updates the safe time to a higher value. This ensures nodes with stale configs do not overwrite updates from nodes with the latest config.
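
The update rule can be sketched as a guarded write. This is a minimal illustration, assuming a dictionary in place of the safe time table and a hypothetical `INVALID_SAFE_TIME` sentinel; `maybe_update_safe_time` is not a real YugabyteDB function.

```python
INVALID_SAFE_TIME = 0  # hypothetical sentinel for "no valid safe time"


def maybe_update_safe_time(table, key, new_safe_time):
    """Write new_safe_time only if it is valid and moves the value forward.

    Returns True when the table was updated.
    """
    if new_safe_time == INVALID_SAFE_TIME:
        return False  # never persist an invalid safe time
    current = table.get(key)
    if current is not None and new_safe_time <= current:
        return False  # a stale writer must not move the safe time backwards
    table[key] = new_safe_time
    return True


safe_time_table = {}
key = ("replication-group", "tablet-1")  # illustrative key shape

first = maybe_update_safe_time(safe_time_table, key, 100)
stale = maybe_update_safe_time(safe_time_table, key, 90)
invalid = maybe_update_safe_time(safe_time_table, key, INVALID_SAFE_TIME)
newer = maybe_update_safe_time(safe_time_table, key, 120)
```

With this guard, a node holding a stale config can still attempt a write, but the stored value only ever increases.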

Moved RPC `SetUniverseReplicationEnabled` to XClusterManager.

Fixed a bug in the `GetXClusterSafeTimeForNamespace` RPC so that it always computes the safe time.

**Upgrade/Rollback safety:**
A new replication error is added.
This new error is used by the yb-admin command `set_universe_replication_enabled` and the RPC `SetUniverseReplicationEnabled`, neither of which is used in the middle of an upgrade.
Jira: DB-14030

Test Plan:
XClusterTest.ToggleReplicationEnabled
XClusterSafeTimeTest.SafeTimeInTableDoesNotGoBackwards

Reviewers: jhe, mlillibridge, xCluster

Reviewed By: jhe

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D39975
