[cdc-base] Support ability to skip backfill in CDC #2553

Closed
2 tasks done
loserwang1024 opened this issue Oct 16, 2023 · 1 comment

Labels
enhancement New feature or request

Comments

@loserwang1024
Contributor

loserwang1024 commented Oct 16, 2023

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Support at-least-once semantics

At present, the CDC connectors only support exactly-once semantics. To achieve this, each CDC connector has to read the backfill log for each snapshot split.

However, reading the backfill log also increases the burden on the source database. For example, the Postgres CDC connector establishes many logical replication connections to the Postgres database, which can easily reach the max_wal_senders or max_replication_slots limit. Assuming there are 10 Postgres CDC sources and each runs with a parallelism of 4, a total of 10*(4+1) = 50 replication connections will be created.
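The arithmetic above can be written as a small helper. This is only a sketch: the class and method names are hypothetical, and the reading of the "+1" as one extra connection per source on top of its parallel readers is an assumption, not something the connector exposes.

```java
public class ReplicationConnectionEstimate {

    /**
     * Estimates replication connections using the formula from the text:
     * sources * (parallelism + 1). The "+1" is assumed to be one extra
     * connection per source in addition to its parallel readers.
     */
    static int estimateConnections(int sources, int parallelism) {
        return sources * (parallelism + 1);
    }

    public static void main(String[] args) {
        // 10 sources, each with a parallelism of 4
        System.out.println(estimateConnections(10, 4)); // prints 50
    }
}
```

Comparing the result against the replication limits configured on the Postgres server gives a quick sanity check before deploying many sources.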

In many situations, the sink database provides idempotence. Therefore, we can also support at-least-once semantics by skipping the backfill period, which reduces the burden on the source database. Users can choose between at-least-once and exactly-once based on their needs.
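To illustrate why an idempotent sink tolerates skipped backfill, here is a minimal sketch. It is not the connector's code: `ChangeEvent`, `apply`, and the sample event lists are hypothetical. It assumes the sink applies keyed upserts, and that skipping backfill causes changes made between the low watermark and the snapshot read to be replayed by the stream reader.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IdempotentSinkSketch {

    /** Hypothetical keyed upsert event: sets the row with this key to this value. */
    static final class ChangeEvent {
        final int key;
        final String value;

        ChangeEvent(int key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Applies upserts in order; replaying an event leaves the final state unchanged. */
    static Map<Integer, String> apply(List<ChangeEvent> events) {
        Map<Integer, String> table = new LinkedHashMap<>();
        for (ChangeEvent e : events) {
            table.put(e.key, e.value);
        }
        return table;
    }

    /** Exactly-once: each row state is delivered once. */
    static List<ChangeEvent> exactlyOnceEvents() {
        return Arrays.asList(
                new ChangeEvent(1, "c"), new ChangeEvent(2, "x"), new ChangeEvent(3, "y"));
    }

    /**
     * At-least-once (assumed scenario): the snapshot already saw row 1 as "c",
     * then the stream replays the earlier updates of row 1 ("b", then "c").
     */
    static List<ChangeEvent> atLeastOnceEvents() {
        return Arrays.asList(
                new ChangeEvent(1, "c"), new ChangeEvent(2, "x"), // snapshot rows
                new ChangeEvent(1, "b"), new ChangeEvent(1, "c"), // replayed changes
                new ChangeEvent(3, "y"));                         // new change
    }

    public static void main(String[] args) {
        // Both deliveries converge to the same final table.
        System.out.println(apply(exactlyOnceEvents()).equals(apply(atLeastOnceEvents())));
    }
}
```

Note that in this sketch row 1 briefly goes back to "b" during the replay, so the sink is only eventually consistent with the source; a sink that cannot tolerate that would still need exactly-once.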

Add snapshot hooks for better testing

Currently, there is no suitable way to mock what happens during a real snapshot split read, where data changes can occur while a split is being read. Most tests do not perform any database action during the snapshot split read, so the backfill process has nothing to do and is never really exercised.

Some tests, such as SqlServerScanFetchTaskTest or SnapshotSplitReaderTest, utilize a MakeChangeEventTaskContext or MakeBinlogEventTaskContext to execute SQL commands before reaching the high watermark. However, this approach not only results in redundant code but also lacks flexibility.

For example:

  • What if I want to insert a message after the low watermark? This becomes necessary when implementing the ability to skip backfill, as the logs between the low watermark and snapshot completion would be duplicated.
  • What if I want to perform some operations on only one specified split?

Only by adding hooks in the framework layer can we provide more flexible testing options.

Solution

Support at-least-once semantics.
Add a SnapshotPhaseHooks with 4 hooks:

  • preLowWatermarkAction

  • postLowWatermarkAction

  • preHighWatermarkAction

  • postHighWatermarkAction
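A rough sketch of how such hooks could be wired into a snapshot split read follows. Only the four hook names come from this proposal; the `Consumer<String>` hook type, the split ids, the `readSplit` helper, and the exact order of steps are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class SnapshotPhaseHooksSketch {

    /** Holds the four proposed hooks; each receives the split id (an assumption). */
    static final class SnapshotPhaseHooks {
        Consumer<String> preLowWatermarkAction = splitId -> {};
        Consumer<String> postLowWatermarkAction = splitId -> {};
        Consumer<String> preHighWatermarkAction = splitId -> {};
        Consumer<String> postHighWatermarkAction = splitId -> {};
    }

    /** Simulates reading one snapshot split, recording each step in the trace. */
    static void readSplit(String splitId, SnapshotPhaseHooks hooks, List<String> trace) {
        hooks.preLowWatermarkAction.accept(splitId);
        trace.add(splitId + ":low-watermark");
        hooks.postLowWatermarkAction.accept(splitId);
        trace.add(splitId + ":snapshot-read");
        hooks.preHighWatermarkAction.accept(splitId);
        trace.add(splitId + ":high-watermark");
        hooks.postHighWatermarkAction.accept(splitId);
    }

    /** A test-style scenario: change data after the low watermark, on one split only. */
    static List<String> demo() {
        List<String> trace = new ArrayList<>();
        SnapshotPhaseHooks hooks = new SnapshotPhaseHooks();
        hooks.postLowWatermarkAction = splitId -> {
            if (splitId.equals("split-1")) {
                // A real test would execute SQL against the source database here.
                trace.add(splitId + ":test-insert");
            }
        };
        readSplit("split-0", hooks, trace);
        readSplit("split-1", hooks, trace);
        return trace;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

With this shape, a test that needs a change between the low watermark and the snapshot read, or a change on a single split, only has to set one hook instead of subclassing a task context.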

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@loserwang1024
Contributor Author

Already done
