[DocDB] Repro / RCA for hung drop database command when CDC streams are active on a database #19879
Comments
…ropping a ysql database

Summary: The current logic adds the same stream once per table for a namespace-level CDC stream, which leads to a deadlock in the marking code when it acquires write locks for all streams at once. If there is more than one table in the db being dropped, the streams passed to `MarkCDCStreamsForMetadataCleanup` contain duplicate entries of the same stream info object. When `MarkCDCStreamsForMetadataCleanup` loops acquiring the write locks for the streams, it deadlocks with a previous iteration of the loop as our COW locks are not re-entrant.

To avoid the duplicates, I refactored `FindCDCStreamsForTableToDeleteMetadata` to take a set of table ids instead of an individual table id. I wonder if instead we could search for streams by namespace, but I wasn't confident enough in my understanding of the two kinds of CDC streams to make that change.

Test Plan:
```
ybd --cxx-test master_xrepl-test --gtest_filter 'MasterTestXRepl.DropNamespaceWithLiveCDCStream'
```

Reviewers: hsunder, xCluster, stiwary
Reviewed By: hsunder, stiwary
Subscribers: stiwary, ybase, bogdan, slingam
Differential Revision: https://phorge.dev.yugabyte.com/D30082
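The failure mode described in the summary can be sketched in a few lines. This is a minimal single-threaded illustration, not the yb-master code: `StreamInfo`, `TryLock`, `CollectPerTable`, `Dedup`, and `LockAll` are hypothetical names, and the lock is modeled as a plain flag whose `TryLock` returns `false` at the point where a blocking, non-re-entrant lock would hang.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a stream info object guarded by a
// non-re-entrant write lock. Not thread-safe; illustration only.
struct StreamInfo {
  std::string id;
  bool write_locked = false;

  // Returns false where a blocking, non-re-entrant lock would deadlock.
  bool TryLock() {
    if (write_locked) return false;
    write_locked = true;
    return true;
  }

  void Unlock() {
    assert(write_locked);
    write_locked = false;
  }
};

using StreamPtr = std::shared_ptr<StreamInfo>;

// Mimics "adds the same stream once per table": a namespace-level stream
// is collected once for every table in the database being dropped.
std::vector<StreamPtr> CollectPerTable(const StreamPtr& ns_stream,
                                       int num_tables) {
  std::vector<StreamPtr> out;
  for (int i = 0; i < num_tables; ++i) out.push_back(ns_stream);
  return out;
}

// Removes duplicate entries by pointer identity, keeping first-seen order.
std::vector<StreamPtr> Dedup(const std::vector<StreamPtr>& streams) {
  std::set<StreamPtr> seen;
  std::vector<StreamPtr> out;
  for (const auto& s : streams) {
    if (seen.insert(s).second) out.push_back(s);
  }
  return out;
}

// Tries to hold the write lock of every stream at once. Returns false if
// an acquisition would block on a lock taken by an earlier iteration.
bool LockAll(const std::vector<StreamPtr>& streams) {
  std::vector<StreamPtr> held;
  bool ok = true;
  for (const auto& s : streams) {
    if (!s->TryLock()) {
      ok = false;
      break;
    }
    held.push_back(s);
  }
  for (const auto& s : held) s->Unlock();  // release whatever was acquired
  return ok;
}
```

With one namespace-level stream and two tables, `LockAll(CollectPerTable(stream, 2))` fails on the second iteration, while `LockAll(Dedup(CollectPerTable(stream, 2)))` succeeds. Collecting streams for a set of table ids in one pass, as the refactor does, is one way to get the deduplicated list in the first place.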
Reopening for backports
…ion path when dropping a ysql database

Summary:
**Backport only Note**
Moved the test to `src/yb/integration-tests/cdcsdk_stream-test.cc` since the support to create a CDCSDK stream on a namespace exists in cdc_service in 2.20 and below branches. It was moved to yb-master in https://phorge.dev.yugabyte.com/D28678, which is not backported to previous stable branches.

**Original Description**
Original commit: 9e2b658 / D30082

The current logic adds the same stream once per table for a namespace-level CDC stream, which leads to a deadlock in the marking code when it acquires write locks for all streams at once. If there is more than one table in the db being dropped, the streams passed to `MarkCDCStreamsForMetadataCleanup` contain duplicate entries of the same stream info object. When `MarkCDCStreamsForMetadataCleanup` loops acquiring the write locks for the streams, it deadlocks with a previous iteration of the loop as our COW locks are not re-entrant.

To avoid the duplicates, I refactored `FindCDCStreamsForTableToDeleteMetadata` to take a set of table ids instead of an individual table id. I wonder if instead we could search for streams by namespace, but I wasn't confident enough in my understanding of the two kinds of CDC streams to make that change.

Jira: DB-8822

Test Plan:
```
ybd --cxx-test cdcsdk_stream-test --gtest_filter 'CDCSDKStreamTest.DropNamespaceWithLiveCDCStream'
```

Reviewers: hsunder, xCluster, zdrudi
Reviewed By: zdrudi
Subscribers: slingam, bogdan, ybase, stiwary
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D30122
…ion path when dropping a ysql database

Summary:
**Backport only Note**
Moved the test to `src/yb/integration-tests/cdcsdk_stream-test.cc` since the support to create a CDCSDK stream on a namespace exists in cdc_service in 2.20 and below branches. It was moved to yb-master in https://phorge.dev.yugabyte.com/D28678, which is not backported to previous stable branches.

**Original Description**
Original commit: 9e2b658 / D30082

The current logic adds the same stream once per table for a namespace-level CDC stream, which leads to a deadlock in the marking code when it acquires write locks for all streams at once. If there is more than one table in the db being dropped, the streams passed to `MarkCDCStreamsForMetadataCleanup` contain duplicate entries of the same stream info object. When `MarkCDCStreamsForMetadataCleanup` loops acquiring the write locks for the streams, it deadlocks with a previous iteration of the loop as our COW locks are not re-entrant.

To avoid the duplicates, I refactored `FindCDCStreamsForTableToDeleteMetadata` to take a set of table ids instead of an individual table id. I wonder if instead we could search for streams by namespace, but I wasn't confident enough in my understanding of the two kinds of CDC streams to make that change.

Jira: DB-8822

Test Plan:
```
ybd --cxx-test cdcsdk_stream-test --gtest_filter 'CDCSDKStreamTest.DropNamespaceWithLiveCDCStream'
```

Reviewers: hsunder, xCluster, zdrudi
Reviewed By: zdrudi
Subscribers: slingam, bogdan, ybase, stiwary
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D30123
…ion path when dropping a ysql database

Summary:
**2.14 backport specific description**
Due to refactors done in later versions, this backport is completely hand-written.

**Original description**
Original commit: 9e2b658 / D30082

The current logic adds the same stream once per table for a namespace-level CDC stream, which leads to a deadlock in the marking code when it acquires write locks for all streams at once. If there is more than one table in the db being dropped, the streams passed to `MarkCDCStreamsForMetadataCleanup` contain duplicate entries of the same stream info object. When `MarkCDCStreamsForMetadataCleanup` loops acquiring the write locks for the streams, it deadlocks with a previous iteration of the loop as our COW locks are not re-entrant.

To avoid the duplicates, I refactored `FindCDCStreamsForTableToDeleteMetadata` to take a set of table ids instead of an individual table id. I wonder if instead we could search for streams by namespace, but I wasn't confident enough in my understanding of the two kinds of CDC streams to make that change.

Jira: DB-8822

Test Plan:
```
ybd --cxx-test master-test_ent --gtest_filter 'MasterTestEnt.DropNamespaceWithLiveCDCStream'
```

Reviewers: hsunder, zdrudi
Reviewed By: zdrudi
Subscribers: bogdan
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D30128
Jira Link: DB-8822
Description
A customer ran `ysqlsh# drop database` on a YSQL database with CDC streams configured. The command timed out, and subsequent attempts to drop the database failed. On the yb-master side the namespace was stuck in the `DELETING` state. This state is transient, and the async job which tries to delete a database should terminate in a finite period of time, updating the namespace state to `DELETED` if the deletion was successful and `FAILED` if it was not.

See CE-257 for logs and additional information.
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information