Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[xCluster] Race condition with SPLIT_OP processing and CdcConsumer shutdown #12068

Closed
hulien22 opened this issue Apr 7, 2022 · 0 comments
Closed
Assignees
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue

Comments

@hulien22
Copy link
Contributor

hulien22 commented Apr 7, 2022

Jira Link: DB-836

Description

Noticed when working on XClusterAutomaticTabletSplitITest.AutomaticTabletSplitting.

There is a race occurring during the shutdown path, if we shutdown immediately after sending a DeleteUniverseReplication request and before waiting for all the CDCConsumers to be properly cleared.

From my investigations, we seem to be getting stuck in rpcs_->Abort({&write_handle_}); during ~TwoDCOutputClient(), but unclear as to what its waiting on..

Currently working around in tests by waiting for pollers to be deleted by DeleteUniverseReplication before continuing with the shutdown.

@hulien22 hulien22 added the area/docdb YugabyteDB core features label Apr 7, 2022
@hulien22 hulien22 self-assigned this Apr 7, 2022
@hulien22 hulien22 added this to To do in xCluster replication via automation Apr 7, 2022
@omkar-yb omkar-yb added this to Backlog in YBase features Apr 7, 2022
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Jun 8, 2022
hulien22 added a commit that referenced this issue Oct 11, 2022
Summary:
Making a variety of changes to clean up the shutdown/destructor path for CDCConsumer -> CDCPoller -> TwoDCOutputClient.

- Adding a separate Shutdown() function to CDCPoller and TwoDCOutputClient
  - CDCConsumer calls Poller shutdown on its Shutdown
  - Removing the `should_continue_polling` and `should_continue_polling` callbacks in cdc_poller, and having cdc_consumer handle the removals in `TriggerDeletionOfOldPollers`
- Fix `CDCPoller::CheckOffline()` to actually check if the poller is offline
- Ensure that CDCPoller uses `shared_from_this()` for all callbacks to ensure that the dtor doesn't run early
- Change TwoDCOutputClient to use `shared_from_this()` as well to ensure we handle shutdowns during it's callbacks as well
- Introduce threadpool to TwoDCOutputClient so that we don't run callbacks on the reactor threads anymore
- Change shutdown_ to an atomic bool in TwoDCOutputClient
- Make sure to unregister poll handles before submitting the callback to the threadpool to avoid deadlocks in shutdown

Test Plan: Jenkins

Reviewers: rahuldesirazu, hsunder, nicolas

Reviewed By: hsunder, nicolas

Subscribers: ycdcxcluster, ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D19175
hulien22 added a commit that referenced this issue Oct 28, 2022
…nsumer

Summary:
Original commit: b659155 / D19175
Making a variety of changes to clean up the shutdown/destructor path for CDCConsumer -> CDCPoller -> TwoDCOutputClient.

- Adding a separate Shutdown() function to CDCPoller and TwoDCOutputClient
  - CDCConsumer calls Poller shutdown on its Shutdown
  - Removing the `should_continue_polling` and `should_continue_polling` callbacks in cdc_poller, and having cdc_consumer handle the removals in `TriggerDeletionOfOldPollers`
- Fix `CDCPoller::CheckOffline()` to actually check if the poller is offline
- Ensure that CDCPoller uses `shared_from_this()` for all callbacks to ensure that the dtor doesn't run early
- Change TwoDCOutputClient to use `shared_from_this()` as well to ensure we handle shutdowns during it's callbacks as well
- Introduce threadpool to TwoDCOutputClient so that we don't run callbacks on the reactor threads anymore
- Change shutdown_ to an atomic bool in TwoDCOutputClient
- Make sure to unregister poll handles before submitting the callback to the threadpool to avoid deadlocks in shutdown

Test Plan: Jenkins

Reviewers: rahuldesirazu, nicolas, hsunder

Reviewed By: hsunder

Subscribers: bogdan, ybase, ycdcxcluster

Differential Revision: https://phabricator.dev.yugabyte.com/D20538
YBase features automation moved this from Backlog to Done Nov 29, 2022
xCluster replication automation moved this from To do to Done Nov 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue
Development

No branches or pull requests

2 participants