[DocDB] Support Rollbacks with xCluster #19518

Closed
1 task done
hari90 opened this issue Oct 13, 2023 · 0 comments
Labels
area/docdb YugabyteDB core features kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue

hari90 commented Oct 13, 2023

Jira Link: DB-8311

Description

Related to #13686

Issue Type

kind/enhancement

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@hari90 hari90 added the area/docdb YugabyteDB core features label Oct 13, 2023
@hari90 hari90 self-assigned this Oct 13, 2023
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue labels Oct 13, 2023
hari90 added a commit that referenced this issue Nov 16, 2023
Summary:
AutoFlag config changes are broadcast to masters via `ChangeAutoFlagsConfigOperation`, and to tservers via the heartbeat. The tservers heartbeat at an interval of 1s (`heartbeat_interval_ms`), so it can take a while for all the tserver processes to get the new config.

As part of #19518, each tserver needs to send the latest AutoFlag config version to the xCluster target. This change adds a delay of 10s (`auto_flags_apply_delay_ms`) between an AutoFlag config update and the apply (setting the AutoFlags) of the new config. This delay guarantees that all tservers have the new config version before any of them flips the AutoFlags and starts generating new data, as long as they have heartbeated to the master within `auto_flags_apply_delay_ms`.

The master computes and stores `config_apply_time` in the AutoFlag config. All masters and tservers that get the new config persist it to disk immediately, but wait until `config_apply_time` before applying it. If a process restarts after getting a new AutoFlag config but before the apply time has passed, it synchronously waits out the remaining time during the next process startup.
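
The delayed apply described above can be sketched roughly as follows. This is an illustration only, not the YugabyteDB implementation: the function names `compute_config_apply_time` and `remaining_apply_wait_ms` are hypothetical, times are plain integers in milliseconds, and only the 10s default of `auto_flags_apply_delay_ms` is taken from the summary.

```python
# Default of auto_flags_apply_delay_ms according to the summary (10s).
AUTO_FLAGS_APPLY_DELAY_MS = 10_000


def compute_config_apply_time(now_ms: int,
                              delay_ms: int = AUTO_FLAGS_APPLY_DELAY_MS) -> int:
    """Master side (hypothetical helper): earliest time the new config may be applied."""
    return now_ms + delay_ms


def remaining_apply_wait_ms(config_apply_time_ms: int, now_ms: int) -> int:
    """Any process (hypothetical helper): how long to wait before applying.

    The config is assumed to be persisted already. The result is 0 when the
    apply time lies in the past, so a process that restarts late applies
    the config immediately.
    """
    return max(0, config_apply_time_ms - now_ms)
```

Under these assumptions, a process that restarts 4s after the master stamped the config waits out the remaining 6s at startup, and a process that restarts after the apply time does not wait at all.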

tservers store the last successful heartbeat time in `last_config_sync_time_`. The new `ValidateAndGetConfigVersion` makes sure that `last_config_sync_time_` is within `auto_flags_apply_delay_ms - max_clock_skew_usec` before returning the version.
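
A minimal sketch of that freshness check, under stated assumptions: all times are in milliseconds (the real `max_clock_skew_usec` flag is in microseconds), the 500ms skew value is only illustrative, the function signature is hypothetical, and returning `None` stands in for whatever error the real `ValidateAndGetConfigVersion` reports.

```python
from typing import Optional


def validate_and_get_config_version(
    config_version: int,
    last_config_sync_time_ms: int,
    now_ms: int,
    apply_delay_ms: int = 10_000,   # auto_flags_apply_delay_ms from the summary
    max_clock_skew_ms: int = 500,   # illustrative value, not taken from the source
) -> Optional[int]:
    """Return the config version only if the last heartbeat is recent enough.

    If the last successful heartbeat is older than the apply delay minus the
    clock skew, a newer config may already have been applied elsewhere, so
    the locally known version cannot be trusted.
    """
    allowed_staleness_ms = apply_delay_ms - max_clock_skew_ms
    if now_ms - last_config_sync_time_ms >= allowed_staleness_ms:
        return None  # stale: caller should retry after the next heartbeat
    return config_version
```

Whether the boundary comparison is strict is an assumption of this sketch; the point is that subtracting the clock skew makes the check conservative.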

The new `ValidateAutoFlagsConfig` master API checks whether the target universe AutoFlag config is compatible with the current universe config. If all the AutoFlags in the current universe's config are present in the target config, the target config is declared compatible.
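
The compatibility rule amounts to a subset test. The sketch below assumes a config can be modeled as a mapping from process name to the set of promoted AutoFlag names; that representation, the function name, and the flag names in the usage note are hypothetical and only illustrate the rule.

```python
from typing import Dict, Set

# Hypothetical model of a config: process name -> promoted AutoFlag names.
PromotedFlags = Dict[str, Set[str]]


def are_auto_flags_compatible(current: PromotedFlags, target: PromotedFlags) -> bool:
    """True when every flag promoted in the current universe is also promoted in the target."""
    for process, current_flags in current.items():
        target_flags = target.get(process, set())
        if not current_flags <= target_flags:  # subset test
            return False
    return True
```

For example, with `current = {"yb-tserver": {"flag_a"}}`, a target config `{"yb-tserver": {"flag_a", "flag_b"}}` is compatible, while `{"yb-tserver": {"flag_b"}}` is not because `flag_a` is missing.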

**Upgrade/Rollback safety:**
`config_apply_time` has been added to `AutoFlagsConfigPB`. Once it is set, we require all tservers to read and process this value.
It is set in `AutoFlagsConfigPB` when AutoFlags are promoted, which is only allowed to happen after all processes have been upgraded to the new version.
After an Upgrade, Rollback, Upgrade cycle we will see an old value in `config_apply_time`. The value is a lower bound on the apply time, so an old value is safe: a time that is already in the past imposes no wait.

`ValidateAutoFlagsConfig` API has been added to master. This API is always safe to call as long as the master is on a version that supports it.
Rollback with xCluster is not supported on versions before this change.
The xCluster target universe will call this API on the source, and if the source is running an older version it will return `rpc::ErrorStatusPB::ERROR_NO_SUCH_METHOD` which will be handled by the client.

Fixes #19932

Test Plan:
AutoFlagsTest.AreAutoFlagsCompatible
AutoFlagsMiniClusterTest.HeartbeatDelay
AutoFlagsMiniClusterTest.AddTserverBeforeApplyDelay
AutoFlagsMiniClusterTest.RollbackBeforeApply
AutoFlagsMiniClusterTest.ValidateAutoFlagsConfig
AutoFlagsExternalMiniClusterTest.DelayedApplyFlags

Reviewers: slingam, xCluster, rahuldesirazu

Reviewed By: rahuldesirazu

Subscribers: ybase, bogdan, xCluster

Differential Revision: https://phorge.dev.yugabyte.com/D30163
hari90 added a commit that referenced this issue Nov 30, 2023
Summary:
This change introduces support for Upgrade and Downgrade of YugabyteDB universes that contain xCluster links.

Two YugabyteDB universes are allowed to replicate data via xCluster only if the Target universe AutoFlags are compatible with the Source universe. To be compatible, the Target universe must have a superset of the Source universe's promoted AutoFlags of class kExternal and kNewInstallOnly.

The yb-master of the Target universe checks AutoFlag compatibility when the config changes on either universe, which happens during upgrades and rollbacks. The Target universe yb-master sends a `ValidateAutoFlagsConfig` RPC to the Source universe with its AutoFlags config and `min_flag_class` set to `kExternal`. The Source universe performs the compatibility check and responds with the result. yb-master stores its local AutoFlags config version in `validated_local_auto_flags_config_version` of `SysUniverseReplicationEntryPB`, and the Source universe AutoFlags config version that it was validated against in `validated_auto_flags_config_version` of `ProducerEntryPB`. If the AutoFlags are compatible, the target version is stored in `max_compatible_auto_flag_config_version` of `ProducerEntryPB`.

When the AutoFlag config of the Target universe changes, `NotifyAutoFlagsConfigChanged` is called, which sets `xcluster_auto_flags_revalidation_needed_`. In the xrepl background task this triggers `XClusterRefreshLocalAutoFlagConfig`, which reruns the AutoFlags check if `validated_local_auto_flags_config_version` differs from the current Target AutoFlags config version. On master leader changes, `xcluster_auto_flags_revalidation_needed_` is pessimistically reset to handle missed notifications. If no AutoFlags config has changed, no RPC calls are necessary.

`ProducerEntryPB` is part of the `consumer_registry`, which is sent to all yb-tservers via heartbeats. The `XClusterConsumer` passes the `max_compatible_auto_flag_config_version` to the `XClusterPoller`s, which in turn set it in `auto_flags_config_version` of `GetChangesRequestPB`. When the Source universe processes the `GetChanges` RPC, it checks whether the passed-in AutoFlags config version matches the current value. If it does not match, it returns an `AUTO_FLAGS_CONFIG_VERSION_MISMATCH` error and sets `auto_flags_config_version` in `GetChangesResponsePB` to the current value. The `XClusterPoller` then sets its error state to `REPLICATION_AUTO_FLAG_CONFIG_VERSION_MISMATCH` and reports the new AutoFlags config version to yb-master so that it can check the compatibility of the new config. Once yb-master completes its checks, and if the universes are compatible, the new `max_compatible_auto_flag_config_version` is passed to the `XClusterPoller`s.
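
The version handshake on `GetChanges` can be sketched as below. Only the two error names and the `auto_flags_config_version` field come from the summary; the dataclass and both function names are hypothetical simplifications of the real protobuf messages and classes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MISMATCH = "AUTO_FLAGS_CONFIG_VERSION_MISMATCH"


@dataclass
class GetChangesResponseSketch:
    """Hypothetical stand-in for GetChangesResponsePB."""
    error: Optional[str] = None
    auto_flags_config_version: Optional[int] = None


def source_check_version(requested_version: int,
                         current_version: int) -> GetChangesResponseSketch:
    """Source side: reject the request if the poller's version is not current."""
    if requested_version != current_version:
        # Return the current version so the target can revalidate against it.
        return GetChangesResponseSketch(MISMATCH, current_version)
    return GetChangesResponseSketch()


def poller_handle_response(
    resp: GetChangesResponseSketch,
) -> Tuple[Optional[str], Optional[int]]:
    """Target side: map a mismatch to (error state, version to report to yb-master)."""
    if resp.error == MISMATCH:
        return ("REPLICATION_AUTO_FLAG_CONFIG_VERSION_MISMATCH",
                resp.auto_flags_config_version)
    return (None, None)
```

In this sketch the poller makes no progress on a mismatch; it only surfaces the new version, and replication resumes once the master has validated that version and pushed it back through the heartbeat.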

The `AutoFlagsVersionHandler` class handles the tracking and reporting of the new AutoFlags config version. Each `XClusterConsumer` has one `AutoFlagsVersionHandler`, so each yb-tserver has a singleton object. `AutoFlagsVersionHandler` sends the `XClusterReportNewAutoFlagConfigVersion` RPC to the master, deduplicates reports across multiple `XClusterPoller`s, and skips the report if `validated_auto_flags_config_version` is already greater than or equal to the reported version. This ensures that each yb-tserver sends only one RPC to yb-master per config version per ReplicationGroup.
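
The deduplication can be illustrated with a small thread-safe sketch. The class name, method names, and the injected `send_report` callable are all hypothetical; the real `AutoFlagsVersionHandler` presumably also deals with RPC failures and retries, which are left out here.

```python
import threading
from typing import Callable, Dict


class AutoFlagsVersionHandlerSketch:
    """Hypothetical sketch: at most one report per config version per replication group."""

    def __init__(self, send_report: Callable[[str, int], None]) -> None:
        # Stand-in for the XClusterReportNewAutoFlagConfigVersion RPC.
        self._send_report = send_report
        self._lock = threading.Lock()
        self._validated: Dict[str, int] = {}  # group -> version validated by master
        self._reported: Dict[str, int] = {}   # group -> highest version already reported

    def set_validated_version(self, group_id: str, version: int) -> None:
        """Record the validated version received from the master."""
        with self._lock:
            self._validated[group_id] = version

    def report_new_version(self, group_id: str, new_version: int) -> bool:
        """Called by pollers on a version mismatch. Returns True if a report was sent."""
        with self._lock:
            if new_version <= self._validated.get(group_id, -1):
                return False  # master has already validated this version
            if new_version <= self._reported.get(group_id, -1):
                return False  # another poller already reported it
            self._reported[group_id] = new_version
        # Send outside the lock so other pollers are not blocked on the RPC.
        self._send_report(group_id, new_version)
        return True
```

With many pollers in the same replication group hitting the mismatch at once, only the first call for a given version reaches the master; the rest return `False` immediately.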

Other changes:
- Moved common xCluster YCQL test code and created a new `XClusterExternalMiniClusterBase` class
- The two test AutoFlags `TEST_auto_flags_initialized` and `TEST_auto_flags_new_install` have been removed from the compatibility validation
- Fixed a cds initialization bug in `TestThreadHolder`

**Upgrade/Rollback safety:**
- This commit contains the following additive Proto changes:
`ProducerEntryPB` - `compatible_auto_flag_config_version` and `validated_auto_flags_config_version`
`ReplicationErrorPb` - `REPLICATION_AUTO_FLAG_CONFIG_VERSION_MISMATCH`
`SysUniverseReplicationEntryPB` - `validated_local_auto_flags_config_version`
`XClusterReportNewAutoFlagConfigVersionRequestPB`, `XClusterReportNewAutoFlagConfigVersionResponsePB` and rpc `XClusterReportNewAutoFlagConfigVersion`
- Upgrade safety:
The new proto fields and rpcs are guarded with `enable_xcluster_auto_flag_validation` of class `kLocalPersisted`.
The Target universe will invoke rpc `ValidateAutoFlagsConfig` against the Source universe. The YbClient wrapper for this handles the error `ERROR_NO_SUCH_METHOD` which is returned when the Source is running a version that does not support this rpc call. This was introduced in D30163/f63fb345e134549a87e75eb03f6c965652f39f22
- Rollback safety:
Universes with xCluster replication are not safe to roll back without this commit. After universes have been upgraded to a version that contains this commit, future upgrades will support rollback.
Jira: DB-8311

Test Plan:
AutoFlagsVersionHandlerTest.TestInsertAndUpdate
AutoFlagsVersionHandlerTest.DuplicateReporting
AutoFlagsVersionHandlerTest.ConcurrentReporting
AutoFlagsVersionHandlerTest.NonBlockingReporting
XClusterProducerTest.GetChangesBasic
XClusterProducerTest.GetChangesWithAutoFlags
XClusterProducerTest.HeartbeatDelayWithoutData
XClusterProducerTest.HeartbeatDelayWithData
XClusterProducerTest.ProducerUpgrade
XClusterUpgradeTest.SetupWithLowerSourceUniverse
XClusterUpgradeTest.SetupWithLowerTargetUniverse
XClusterUpgradeTest.UpgradeTargetBeforeSource
XClusterUpgradeTest.UpgradeSourceBeforeTarget
XClusterUpgradeTest.RollbackTargetUniverse
XClusterUpgradeTest.RollbackSourceUniverse
XClusterUpgradeTest.DemoteSourceUniverseFlag
XClusterUpgradeTest.DemoteTargetUniverseFlag
XClusterUpgradeTest.SourceWithoutAutoFlagCompatiblity
XClusterUpgradeTest.TargetWithoutAutoFlagCompatiblity

Reviewers: jhe, xCluster, rahuldesirazu, slingam

Reviewed By: jhe, rahuldesirazu

Subscribers: ycdcxcluster, ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D30326
hari90 added a commit that referenced this issue Dec 5, 2023
Summary:
Original commit: f63fb34 / D30163

a70a11e/D30416, a minor test fix also applied.


Reviewers: slingam, xCluster, rahuldesirazu

Reviewed By: slingam

Subscribers: xCluster, bogdan, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30654
@hari90 hari90 closed this as completed Dec 6, 2023
hari90 added a commit that referenced this issue Dec 7, 2023
Summary:
Original commit: 4c254dd / D30326

Reviewers: jhe, xCluster, rahuldesirazu, slingam

Reviewed By: jhe

Subscribers: bogdan, ybase, ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30763