2.27.0.0-b429

@hulien22 hulien22 tagged this 09 Aug 03:58
Summary:
Problem 1:
When the target gets a ChangeMetadataOp (CMOP), it calls `UpdateConsumerOnProducerMetadata` to store the current and previous schema versions for mapping packing schemas.
If several CMOPs occur in quick succession on the source, they can be replicated to the target before the source updates its cdc_state checkpoint. If both sides then restart, the new poller starts at the previous cdc_state checkpoint, which could precede all of these CMOPs. In that case we could run into issues processing the older CMOPs, or processing records from older schema versions, since we won't be able to find the corresponding schema versions.

Solution 1:
Adding `force_update_cdc_state_checkpoint` to GetChangesRequestPB. Since each GetChanges batch contains at most one CMOP (the source cuts off a batch after sending a CMOP), the target can track whether the previous batch contained a CMOP. If it did, the target sets `force_update_cdc_state_checkpoint = true` so that the source updates its cdc_state checkpoint as part of processing the next request.
This is done on the target since it's the one that knows when a CMOP has been fully processed, and it can force the cdc_state update as part of the ack it sends on the next GetChanges.
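
The handshake above can be modelled with a small Python sketch. It is illustrative only and is not the C++ implementation: `Source`, `TargetPoller`, and `from_op_index` are hypothetical names, ops are plain strings, and the source's regular periodic checkpoint update is left out so that only the forced update is visible.

```python
from dataclasses import dataclass


@dataclass
class GetChangesRequest:
    # Loosely mirrors GetChangesRequestPB; only the fields needed here.
    from_op_index: int  # hypothetical: everything before this index is acked
    force_update_cdc_state_checkpoint: bool = False


@dataclass
class Source:
    # Hypothetical source: an op log plus a persisted checkpoint.
    ops: list
    cdc_state_checkpoint: int = 0

    def get_changes(self, req):
        if req.force_update_cdc_state_checkpoint:
            # The request acks all ops before from_op_index, so persist that now.
            self.cdc_state_checkpoint = req.from_op_index
        batch = []
        for op in self.ops[req.from_op_index:]:
            batch.append(op)
            if op == "CMOP":
                break  # cut the batch off right after a CMOP
        return batch


@dataclass
class TargetPoller:
    next_index: int = 0
    prev_batch_had_cmop: bool = False

    def poll(self, source):
        req = GetChangesRequest(self.next_index, self.prev_batch_had_cmop)
        batch = source.get_changes(req)
        # The batch would be applied on the target here.
        self.next_index += len(batch)
        self.prev_batch_had_cmop = "CMOP" in batch
        return batch
```

In this model, polling a source whose log is `["w1", "CMOP", "w2", "CMOP", "w3"]` moves the checkpoint past each CMOP on the request that follows the batch containing it, so a restart never resumes from before a CMOP that the target has already processed.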

Problem 2:
With automatic mode xCluster, we don't pause on CMOPs, so it's possible for one tablet to get further ahead of other tablets in the same table. This can cause issues since this tablet could overwrite the current and previous schema versions while another tablet is still on an older schema version.

Solution 2:
Changing `old_producer_schema_version` and `old_consumer_schema_version` to repeated fields that keep track of the last `xcluster_max_old_schema_versions` previous schemas (default 50). If we go over that limit, the oldest schema pair is evicted. (The reasoning is that if tablets are more than 50 schema versions apart, then something else is wrong and a replication restart is probably required.)
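
A minimal Python sketch of the bounded history described above. This only illustrates the eviction behaviour and is not the actual data structure: `SchemaVersionMap` and its methods are hypothetical names, and the real fields live in protobuf messages.

```python
from collections import deque


class SchemaVersionMap:
    """Current (producer, consumer) schema pair plus a bounded list of old pairs."""

    def __init__(self, max_old_schema_versions=50):
        self.current = None
        # A deque with maxlen silently drops its oldest entry when full.
        self.old = deque(maxlen=max_old_schema_versions)

    def update(self, producer_version, consumer_version):
        new_pair = (producer_version, consumer_version)
        if self.current is not None and self.current != new_pair:
            self.old.append(self.current)
        self.current = new_pair

    def consumer_version_for(self, producer_version):
        # Search the retained old pairs and the current pair.
        candidates = list(self.old)
        if self.current is not None:
            candidates.append(self.current)
        for producer, consumer in candidates:
            if producer == producer_version:
                return consumer
        return None  # version was evicted or never seen
```

With a limit of 2, applying producer versions 1 through 4 in order leaves version 4 as current and versions 2 and 3 as the retained old pairs; a lagging tablet still sending records for version 1 would no longer find a mapping.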

**Upgrade/Rollback safety:**
- `force_update_cdc_state_checkpoint` is a new field. If only one side is upgraded, then `force_update_cdc_state_checkpoint` won't do anything, so we'll still have today's behaviour.
- Changing `old_producer_schema_version` and `old_consumer_schema_version` from optional to repeated fields is safe, as the change is protected by the autoflag `xcluster_store_older_schema_versions`. In case of rollback, we use the existing behaviour of storing only one old schema version.
Jira: DB-14131, DB-16757

Test Plan:
```
ybd --cxx-test integration-tests_xcluster_ddl_replication-test --gtest_filter XClusterDDLReplicationTest.PackedSchemaLag
ybd --cxx-test integration-tests_xcluster_ysql-test --gtest_filter XClusterYsqlTest.ReplicationWithDropColumnDDLAndNodeRestart
```
First test fails without either force_update_cdc_state_checkpoint or xcluster_max_old_schema_versions. Second test fails without force_update_cdc_state_checkpoint.

Reviewers: xCluster, hsunder, mlillibridge

Reviewed By: mlillibridge

Subscribers: mlillibridge, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D45888