Summary:
When performing a Disaster Recovery failover of transactional xCluster replication, we first need to pause the Pollers. This ensures that during grey failures, where replication is partially or sporadically healthy, the xCluster safe time used for the cutover (PITR) does not move.
Prior to this change, Pollers of paused streams were stopped and removed by the xCluster Consumer. This has been changed so that the Pollers can perform any post-apply work and set the tablet stream error to `REPLICATION_PAUSED`. When processing a pause replication request, yb-master waits for all the tablet streams to reach this state.
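A minimal sketch of the pause handshake described above, not the actual implementation. The `StreamError` enum, the `StreamErrors` map, and both function names are hypothetical stand-ins for illustration only:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the per-tablet stream error reported by Pollers.
enum class StreamError { kOk, kReplicationPaused, kOther };
using StreamErrors = std::unordered_map<std::string, StreamError>;

// Poller side: once post-apply work is done, record the paused state.
void MarkPaused(StreamErrors* streams, const std::string& tablet_id) {
  assert(streams != nullptr);
  (*streams)[tablet_id] = StreamError::kReplicationPaused;
}

// Master side: the pause request completes only when every tablet stream
// reports the paused state. An empty map is vacuously paused in this sketch.
bool AllStreamsPaused(const StreamErrors& streams) {
  return std::all_of(streams.begin(), streams.end(), [](const auto& entry) {
    return entry.second == StreamError::kReplicationPaused;
  });
}
```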
For DDL replication, the DDL poller will run any pending DDLs as part of this post-apply work. This will be done as part of #23957.
The xCluster consumer tserver-master heartbeat now includes the modified poller states as part of regular heartbeat processing when a full report is not needed.
This ensures that the pause does not have to wait for the next full report, which is still handled by the metric heartbeat provider (every 5s).
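The reporting rule above can be sketched roughly as follows. `PollerState` and `SelectPollersToReport` are hypothetical names used only for illustration; the real heartbeat code is not shown in this change description:

```cpp
#include <string>
#include <vector>

// Hypothetical per-poller record; `modified` marks a state change since the
// last report.
struct PollerState {
  std::string tablet_id;
  bool modified;
};

// Returns the tablet ids to include in the next heartbeat: every poller for
// a full report, otherwise only the pollers whose state was modified.
std::vector<std::string> SelectPollersToReport(
    const std::vector<PollerState>& pollers, bool full_report_needed) {
  std::vector<std::string> result;
  for (const auto& poller : pollers) {
    if (full_report_needed || poller.modified) {
      result.push_back(poller.tablet_id);
    }
  }
  return result;
}
```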
XClusterConsumer will no longer write an invalid safe time to the safe time table, and it will only update the safe time to a higher value. This ensures nodes with stale configs do not overwrite updates from nodes with the latest config.
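A minimal sketch of the update rule above, assuming the safe time is represented as an integer with `0` meaning invalid. The `SafeTime` alias, `kInvalidSafeTime`, and `MaybeAdvanceSafeTime` are hypothetical and do not mirror the real types:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical representation: safe time as an integer, with 0 as invalid.
using SafeTime = std::uint64_t;
constexpr SafeTime kInvalidSafeTime = 0;

// Writes `candidate` into `*stored` only if it is valid and strictly higher.
// Returns true if the stored value was advanced.
bool MaybeAdvanceSafeTime(SafeTime* stored, SafeTime candidate) {
  assert(stored != nullptr);
  if (candidate == kInvalidSafeTime) return false;  // never write invalid values
  if (candidate <= *stored) return false;           // never move backwards
  *stored = candidate;
  return true;
}
```

With this rule, a node holding a stale config that proposes an older safe time leaves the stored value unchanged.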
Moved RPC `SetUniverseReplicationEnabled` to XClusterManager.
Fixed a bug in the `GetXClusterSafeTimeForNamespace` RPC to always compute the safe time.
**Upgrade/Rollback safety:**
A new replication error is added.
This new error is used by the yb-admin command `set_universe_replication_enabled` and the RPC `SetUniverseReplicationEnabled`, neither of which is used in the middle of an upgrade.
Jira: DB-14030
Test Plan:
XClusterTest.ToggleReplicationEnabled
XClusterSafeTimeTest.SafeTimeInTableDoesNotGoBackwards
Reviewers: jhe, mlillibridge, xCluster
Reviewed By: jhe
Subscribers: ybase
Differential Revision: https://phorge.dev.yugabyte.com/D39975