
CDC inserts 500k duplicates into Kafka #19196

Closed
yugabyte-ci opened this issue Sep 19, 2023 · 1 comment
Labels
area/cdcsdk CDC SDK jira-originated kind/bug This issue is a bug priority/highest Highest priority issue

@yugabyte-ci

yugabyte-ci commented Sep 19, 2023

Jira Link: DB-7993

@yugabyte-ci yugabyte-ci added jira-originated kind/bug This issue is a bug priority/low Low priority status/awaiting-triage Issue awaiting triage labels Sep 19, 2023
@yugabyte-ci yugabyte-ci added the area/cdcsdk CDC SDK label Sep 19, 2023
@yugabyte-ci yugabyte-ci added priority/highest Highest priority issue and removed priority/low Low priority status/awaiting-triage Issue awaiting triage labels Sep 27, 2023
@yugabyte-ci yugabyte-ci assigned vaibhav-yb and unassigned suranjan Sep 27, 2023
@vaibhav-yb vaibhav-yb added this to In progress in CDC Sep 27, 2023
@vaibhav-yb

vaibhav-yb commented Sep 27, 2023

Update

Context

GetTabletListToPollForCDC

The service exposes an API, GetTabletListToPollForCDC, that returns the checkpoint for each tablet as recorded in the cdc_state table. Once a parent tablet has been split, the client can also use this API to request its children; the response contains the children tablets along with their checkpoint values.

Assume a tablet T is split into t1 and t2. When the client asks for the children of T, it receives t1 and t2 along with their checkpoint values. To produce those values, the service copies the parent tablet's checkpoint to the children tablets. If the parent itself is not found in the state table, the service returns a checkpoint with OpId 0.0.

Invalid checkpoint

If a tablet has never been polled, the service returns an invalid checkpoint with OpId -1.-1.
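
The two checkpoint rules above can be illustrated with a small model. This is only a sketch, not the service's implementation: the `cdc_state` table is reduced to a plain dictionary, the function names `checkpoint_for_tablet` and `checkpoints_for_children` are hypothetical, the `term`/`index` field names for an OpId are an assumption, and "never polled" is modeled as "no entry in the dictionary".

```python
from typing import Dict, Iterable, NamedTuple


class OpId(NamedTuple):
    # Assumed field names; the issue only shows OpIds written as "a.b".
    term: int
    index: int

    def __str__(self) -> str:
        return f"{self.term}.{self.index}"


ZERO_OP_ID = OpId(0, 0)       # sent for children when the parent entry is missing
INVALID_OP_ID = OpId(-1, -1)  # sent for a tablet that has never been polled


def checkpoint_for_tablet(cdc_state: Dict[str, OpId], tablet_id: str) -> OpId:
    """Hypothetical lookup: a tablet without an entry counts as never polled."""
    return cdc_state.get(tablet_id, INVALID_OP_ID)


def checkpoints_for_children(
    cdc_state: Dict[str, OpId], parent_id: str, child_ids: Iterable[str]
) -> Dict[str, OpId]:
    """Hypothetical lookup: children inherit the parent's checkpoint, or 0.0."""
    parent_checkpoint = cdc_state.get(parent_id, ZERO_OP_ID)
    return {child: parent_checkpoint for child in child_ids}


state = {"T": OpId(5, 120)}
print(checkpoints_for_children(state, "T", ["t1", "t2"])["t1"])  # 5.120
print(checkpoints_for_children({}, "T", ["t1", "t2"])["t1"])     # 0.0
print(checkpoint_for_tablet({}, "t1"))                           # -1.-1
```

The point of the model is that 0.0 and -1.-1 are distinct answers: the first is what children get when the parent entry is missing, the second is what any never-polled tablet gets.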

Reason for duplicates

  1. A stream was created on a namespace.
  2. The table test_streaming is being polled. The namespace also contains other tables (test_2, test_3) that are not being polled for changes anywhere.
  3. Table test_2 is dropped.
  4. The stream ID's state is marked as DELETING_METADATA.
  5. Simultaneously, one of the tablets of table test_streaming gets split.
  6. A background thread looks up tablet + stream ID combinations whose state is DELETING_METADATA and deletes the tablet.
  7. The client detects the split around the same time and asks for the children. Since the parent tablet has been deleted, the checkpoint for the children is sent as 0.0. When GetChanges is called with this checkpoint, an INTERNAL_ERROR is thrown with a message that the checkpoint in the state table is invalid, because the children tablets have not been polled before.
  8. At the same time, the client restarts and again asks the service for the checkpoint of all tablets, including the children. The service sends back an invalid checkpoint, i.e. -1.-1. The client uses this checkpoint to decide whether to take a snapshot: if it is invalid, the client takes the snapshot.
  9. Since a snapshot is taken for the children tablets, this causes duplicates in Kafka.
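
The sequence above can be replayed in a toy simulation. This is a sketch under stated assumptions, not the actual service or connector code: `cdc_state` is a dictionary from tablet ID to checkpoint, `table_of`, `cleanup`, and `tablets_snapshotted` are hypothetical names, the tablet "U" stands in for a tablet of test_2, and children are assumed to get their own entries only when the parent entry still exists. The `scoped` flag models the behavior the later test case checks for, namely that only tablets belonging to the dropped table are deleted.

```python
from typing import Dict, List, Tuple

OpId = Tuple[int, int]
INVALID: OpId = (-1, -1)  # checkpoint reported for a never-polled tablet


def cleanup(cdc_state: Dict[str, OpId], table_of: Dict[str, str],
            dropped_table: str, scoped: bool) -> None:
    """Background cleanup (step 6). Unscoped cleanup removes every tablet of
    the stream; scoped cleanup removes only tablets of the dropped table."""
    for tablet in list(cdc_state):
        if not scoped or table_of[tablet] == dropped_table:
            del cdc_state[tablet]


def tablets_snapshotted(scoped: bool) -> List[str]:
    # Illustrative data: T belongs to the polled table, U to the dropped one.
    cdc_state: Dict[str, OpId] = {"T": (5, 120), "U": (1, 10)}
    table_of = {"T": "test_streaming", "U": "test_2"}

    # Steps 3-6: test_2 is dropped and the cleanup thread runs.
    cleanup(cdc_state, table_of, "test_2", scoped)

    # Steps 5 and 7: T splits and the client asks for its children.
    children = ["t1", "t2"]
    if "T" in cdc_state:
        for child in children:
            cdc_state[child] = cdc_state["T"]  # children inherit the checkpoint
    # Otherwise the service answers 0.0 and, in this model, stores nothing.

    # Step 8: the client restarts and re-reads every child checkpoint.
    after_restart = {c: cdc_state.get(c, INVALID) for c in children}

    # Step 9: an invalid checkpoint makes the client take a snapshot again.
    return [c for c, cp in after_restart.items() if cp == INVALID]


print(tablets_snapshotted(scoped=False))  # ['t1', 't2']
print(tablets_snapshotted(scoped=True))   # []
```

With unscoped cleanup both children come back with -1.-1 after the restart and are snapshotted again, which is where the duplicates come from in this model; with scoped cleanup the parent entry survives and no snapshot is triggered.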

[EDIT]

The fix was merged under ticket #19348.

@yugabyte-ci yugabyte-ci assigned dr0pdb and unassigned suranjan Sep 29, 2023
dr0pdb added a commit that referenced this issue Oct 6, 2023
Summary:
Add a test case which ensures that only tablets which belong to a dropped table are deleted from the cdc_state table.

This is an additional test case for the fix introduced in https://phorge.dev.yugabyte.com/D28929.
Jira: DB-7993

Test Plan:
Jenkins: test regex: .*TestStreamMetaDataCleanupDropTableAfterTabletSplit.*

```
./yb_build.sh --cxx-test cdcsdk_tablet_split-test --gtest_filter "*TestStreamMetaDataCleanupDropTableAfterTabletSplit*"
```

Reviewers: skumar, asrinivasan, vkushwaha

Reviewed By: skumar, vkushwaha

Subscribers: ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29120
CDC automation moved this from In progress to Done Oct 6, 2023
dr0pdb added a commit that referenced this issue Oct 9, 2023
…ed by Drop table

Summary:
Original commit: 51f3edf / D29120
Add a test case which ensures that only tablets which belong to a dropped table are deleted from the cdc_state table.

This is an additional test case for the fix introduced in https://phorge.dev.yugabyte.com/D28929.
Jira: DB-7993

Test Plan:
Jenkins: test regex: .*TestStreamMetaDataCleanupDropTableAfterTabletSplit.*

```
./yb_build.sh --cxx-test cdcsdk_tablet_split-test --gtest_filter "*TestStreamMetaDataCleanupDropTableAfterTabletSplit*"
```

Reviewers: skumar, asrinivasan, vkushwaha

Reviewed By: vkushwaha

Subscribers: ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29128
vaibhav-yb added a commit to yugabyte/debezium-connector-yugabytedb that referenced this issue Oct 23, 2023