[yugabyte/yugabyte-db#20636] Fetch for tablets from service in snapshot phase as well #320

Merged on Jan 31, 2024 (5 commits)

Conversation

vaibhav-yb
Collaborator

Problem

The connector currently works as follows:

  1. The connector obtains the list of tablets from the service.
  2. HashPartitions are created and the partition keys are passed down to the tasks.
  3. Each task receives an immutable list of tablets and starts processing it.
  4. In the snapshot phase, the task relies directly on the list passed down from the top-level connector and takes the snapshot. In the streaming phase, it validates the tablet list again by asking the service for the current tablets.
  5. If a tablet is split while the connector is in the streaming phase, the connector handles the split gracefully.
  6. If a task is restarted at this stage, the flow starts again from the snapshot phase, and the connector tries to get the checkpoint of each tablet in the task to decide whether to take a snapshot. The tablet list available here is stale, so getting the checkpoint fails because the parent tablet has already been deleted by this time.
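The failure in step 6 can be reproduced in miniature. The following is a minimal sketch, not the connector's code: `FakeService`, `snapshotPhaseWithAssignedList`, and the tablet IDs are hypothetical stand-ins, and the in-memory map only imitates how the service forgets a parent tablet after a split.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of the stale tablet list problem; not the real client API. */
final class StaleTabletListSketch {

    /** Stand-in for the service: tracks live tablets and their checkpoints. */
    static final class FakeService {
        private final Map<String, Long> checkpoints = new HashMap<>();

        void addTablet(String tabletId, long checkpoint) {
            checkpoints.put(tabletId, checkpoint);
        }

        /** A split deletes the parent tablet and registers two children. */
        void split(String parent, String childA, String childB) {
            Long checkpoint = checkpoints.remove(parent);
            if (checkpoint == null) {
                throw new IllegalStateException("Tablet not found: " + parent);
            }
            checkpoints.put(childA, checkpoint);
            checkpoints.put(childB, checkpoint);
        }

        long getCheckpoint(String tabletId) {
            Long checkpoint = checkpoints.get(tabletId);
            if (checkpoint == null) {
                throw new IllegalStateException("Tablet not found: " + tabletId);
            }
            return checkpoint;
        }
    }

    /** Snapshot-phase startup that trusts the list handed down when the task was created. */
    static void snapshotPhaseWithAssignedList(FakeService service, List<String> assignedTablets) {
        for (String tabletId : assignedTablets) {
            service.getCheckpoint(tabletId); // throws once the parent tablet is gone
        }
    }

    public static void main(String[] args) {
        FakeService service = new FakeService();
        service.addTablet("parent", 100L);
        List<String> assigned = List.of("parent"); // immutable list given to the task

        snapshotPhaseWithAssignedList(service, assigned); // first start: succeeds
        service.split("parent", "child-1", "child-2");    // split during streaming

        try {
            snapshotPhaseWithAssignedList(service, assigned); // task restart
        } catch (IllegalStateException e) {
            System.out.println("Restart failed: " + e.getMessage());
        }
    }
}
```

The first call succeeds because the assigned list still matches the service. The second call fails only because the list was captured before the split and never refreshed.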

Solution

This diff calls GetTabletListToPollForCDC in the snapshot phase as well, which ensures that the task only receives a valid set of tablets to poll.
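The idea can be sketched as follows. This is an illustration under assumptions, not the actual change: `FakeService` is a hypothetical in-memory stand-in, `getTabletListToPoll` only imitates what GetTabletListToPollForCDC is described as returning (the currently valid tablets), and the sketch ignores partition ranges by treating every live tablet as belonging to the task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of refreshing the tablet list in the snapshot phase. */
final class RefreshTabletsInSnapshotSketch {

    /** Stand-in for the service: tracks live tablets and their checkpoints. */
    static final class FakeService {
        private final Map<String, Long> checkpoints = new TreeMap<>();

        void addTablet(String tabletId, long checkpoint) {
            checkpoints.put(tabletId, checkpoint);
        }

        /** A split deletes the parent tablet and registers two children. */
        void split(String parent, String childA, String childB) {
            Long checkpoint = checkpoints.remove(parent);
            if (checkpoint == null) {
                throw new IllegalStateException("Tablet not found: " + parent);
            }
            checkpoints.put(childA, checkpoint);
            checkpoints.put(childB, checkpoint);
        }

        long getCheckpoint(String tabletId) {
            Long checkpoint = checkpoints.get(tabletId);
            if (checkpoint == null) {
                throw new IllegalStateException("Tablet not found: " + tabletId);
            }
            return checkpoint;
        }

        /** Imitates GetTabletListToPollForCDC: returns only tablets that currently exist. */
        List<String> getTabletListToPoll() {
            return new ArrayList<>(checkpoints.keySet());
        }
    }

    /** Snapshot-phase startup that asks the service for the current tablets first. */
    static List<String> snapshotPhaseWithRefresh(FakeService service) {
        List<String> currentTablets = service.getTabletListToPoll();
        for (String tabletId : currentTablets) {
            service.getCheckpoint(tabletId); // every ID is live, so this does not throw
        }
        return currentTablets;
    }

    public static void main(String[] args) {
        FakeService service = new FakeService();
        service.addTablet("parent", 100L);
        service.split("parent", "child-1", "child-2"); // split happened before the restart

        // Task restart: the refreshed list contains the children, not the deleted parent.
        System.out.println("Tablets to poll: " + snapshotPhaseWithRefresh(service));
    }
}
```

Because the list is fetched at the start of the snapshot phase instead of being reused from task creation, a restart after a split never asks for the checkpoint of a deleted tablet.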

Testing

Tested manually using the following steps:

  1. Start the connector in snapshot mode initial with a single tablet.
  2. Split the tablet once the connector reaches the streaming phase.
  3. Insert more data.
  4. Restart the task.
  5. Upon restart:
    i. Without the fix: the task fails while trying to get the checkpoint of the deleted parent tablet.
    ii. With the fix: the connector proceeds to the streaming phase without any error.

This fixes yugabyte/yugabyte-db#20636

@vaibhav-yb vaibhav-yb added the bug Something isn't working label Jan 17, 2024
@vaibhav-yb vaibhav-yb self-assigned this Jan 17, 2024
@vaibhav-yb vaibhav-yb changed the title [yugabyte/yugabyte-db#20636] Fetch for tablets in snapshot phase as well [yugabyte/yugabyte-db#20636] Fetch for tablets from service in snapshot phase as well Jan 17, 2024
@vaibhav-yb vaibhav-yb merged commit c75c81b into yugabyte:main Jan 31, 2024
0 of 2 checks passed
Development

Successfully merging this pull request may close these issues.

[CDCSDK] [Tablet Split] Too many errors while trying to get checkpoint for tablet
2 participants