Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[yugabyte/yugabyte-db#20136] Add tablets to snapshotCompletedTablets set in form of tableId.tabletId #300

Merged
merged 2 commits into from
Dec 6, 2023

Conversation

siddharth2411
Copy link
Collaborator

@siddharth2411 siddharth2411 commented Dec 4, 2023

Problem

Current behaviour:
Assume connector restarted after successful completion of snapshot bootstrap call.

After the GetCheckpoint call, we find that if we had already completed snapshot on a tablet, we add the tablet’s entry to snapshotCompletedTablets set in the form of TabletId.

During the snapshot consumption phase, we check for a tablet's entry in this set by looking for partition.getId(). But partition.getId() returns TableId.TabletID. Since the format is different, the check fails to find the entry and we tend to poll on tablets even though they we had already completed snapshot on them.

Due to this, the following issues pop-up:

  1. This tablet goes through the entire snapshot flow, gets added to tabletsWaitingForCallback set where we wait for kafka to send the snapshot complete marker as an acknowledgement for this tablet. Consequently, connector is stuck in the snapshot phase, and hence, it is never able to switch to the streaming phase
  2. Since you are polling on an already snapshotted tablet, cdc_state table entry for this tablet changes from SNAPSHOT MARKED DONE BY CLIENT to the transient state SNAPSHOT FULLY CONSUMED

Consider the below example:

  1. Create 2 non-colocated tables split into 1 tablets - t1(non-empty) & t2 (empty)
  2. Create a CDC stream in explicit mode
  3. Deploy connector
  4. Since snapshot.include.list was not passed, all tables will be considered for snapshot. Snapshot bootstrap on both tables (This step assumes snapshot was marked as done by client on t2).
  5. Restart connector
  6. After GetCheckpoint, tabletId of t2 will be added to snapshotCompletedTablets set.
  7. During Snapshot consumption, connector polls on tablets of t2 because of the above problem. This also changes the state table entry for the tablet.
  8. Since, tablets are not removed from shouldWaitForCallback set, they get added to tabletsWaitingForCallback set.
  9. Connector gets stuck in snapshot flow, waiting for kafka's acknowledgement.

Solution

After Getcheckpoint call, add entry in the set in the form TableId.TabletId. This will ensure that during consumption phase, we do not poll on those tablets on which snapshot is already completed.

Note: This solution itself isnt sufficient to switch to streaming phase. This, along with fixes for #20134 & #20135 will ensure the connector switches to streaming phase.

Testing

Performed Manual testing:

  • Ran a unit test with 2 tables (1 non-empty & another empty) and manually verified that after snapshot bootstrap, if the connector restarts, we no longer poll on tablets on which we have already completed the snapshot.
  • Also, verified the behaviour by setting up a CDC pipeline with the same set of tables.

Relevant Github Issue

[CDCSDK] Improve logic for adding tablets to snapshotCompletedTablets set

@siddharth2411 siddharth2411 changed the title Add tablets to snapshotCompletedTablets set in form of tableId.tabletId [yugabyte/yugabyted-db#20136] Add tablets to snapshotCompletedTablets set in form of tableId.tabletId Dec 4, 2023
@siddharth2411 siddharth2411 changed the title [yugabyte/yugabyted-db#20136] Add tablets to snapshotCompletedTablets set in form of tableId.tabletId [yugabyte/yugabyte-db#20136] Add tablets to snapshotCompletedTablets set in form of tableId.tabletId Dec 4, 2023
@siddharth2411 siddharth2411 merged commit 0382188 into yugabyte:main Dec 6, 2023
0 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants