Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] Repro / RCA for hung drop database command when CDC streams are active on a database #19879

Closed
1 task done
druzac opened this issue Nov 7, 2023 · 2 comments
Closed
1 task done

Comments

@druzac
Copy link
Contributor

druzac commented Nov 7, 2023

Jira Link: DB-8822

Description

A customer ran ysqlsh# drop database on a YSQL database with CDC streams configured. The command timed out, and subsequent attempts to drop the database failed. On the yb-master side the namespace was stuck in the DELETING state. This state is transient and the async job which tries to delete a database should terminate in a finite period of time, updating the namespace state to DELETED if the deletion was successful and FAILED if it was not.

See CE-257 for logs and additional information.

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@druzac druzac added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Nov 7, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Nov 7, 2023
@druzac
Copy link
Contributor Author

druzac commented Nov 8, 2023

Calling the DeleteNamespace RPC on a YSQL db w/ a namespace-level CDC stream configured on the db reliably deadlocks the background thread running CatalogManager::DeleteYsqlDatabaseAsync(). The only way to avoid this deadlock is if the number of tables in the db is 1 or less. The thread deadlocks with itself. Stack trace:

    @        0x11f307243  yb::RWCLock::WriteLock()
    @        0x106f4393b  yb::CowObject<>::StartMutation()
    @        0x106f4390f  yb::CowWriteLock<>::CowWriteLock()
    @        0x106f438d3  yb::CowWriteLock<>::CowWriteLock()
    @        0x106f43803  yb::master::MetadataCowWrapper<>::LockForWrite()
    @        0x10736e3db  yb::master::CatalogManager::MarkCDCStreamsForMetadataCleanup()
    @        0x10736ebc3  yb::master::CatalogManager::DeleteCDCStreamsMetadataForTables()
    @        0x106d3398f  yb::master::CatalogManager::DeleteYsqlDBTables()
    @        0x106d2d347  yb::master::CatalogManager::DeleteYsqlDatabaseAsync()

The DeleteCDCStreamsMetadataForTables and MarkCDCStreamsForMetadataCleanup functions are broken. DeleteCDCStreamsMetadataForTables computes a vector of streams for MarkCDCStreamsForMetadataCleanup to clean up, but this vector contains duplicates. Then MarkCDCStreamsForMetadataCleanup tries to acquire the write locks for all streams in a loop, but since the loop contains duplicates and our COW locks are not re-entrant, the function deadlocks itself.

@druzac druzac self-assigned this Nov 8, 2023
druzac added a commit that referenced this issue Nov 9, 2023
…ropping a ysql database

Summary:
The current logic adds the same stream once per table for a namespace-level CDC stream, which leads to a deadlock in the marking code when it acquires write locks for all streams at once. If there is one than one table in the db being dropped, the streams passed to `MarkCDCStreamsForMetadataCleanup` contain duplicate entries of the same stream info object. When `MarkCDCStreamsForMetadataCleanup` loops acquiring the write locks for the streams, it deadlocks with a previous iteration of the loop as our COW locks are not re-entrant.

I refactored `FindCDCStreamsForTableToDeleteMetadata` to take a set instead of an individual table id accomplish this. I wonder if instead we could search for streams by namespace, but I wasn't confident enough in my understanding of the two kinds of CDC streams to make that change.

Test Plan:
```
ybd --cxx-test master_xrepl-test --gtest_filter 'MasterTestXRepl.DropNamespaceWithLiveCDCStream'
```

Reviewers: hsunder, xCluster, stiwary

Reviewed By: hsunder, stiwary

Subscribers: stiwary, ybase, bogdan, slingam

Differential Revision: https://phorge.dev.yugabyte.com/D30082
@dr0pdb dr0pdb self-assigned this Nov 10, 2023
@yugabyte-ci yugabyte-ci added area/cdcsdk CDC SDK and removed area/docdb YugabyteDB core features labels Nov 14, 2023
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Nov 14, 2023
@dr0pdb
Copy link
Contributor

dr0pdb commented Nov 15, 2023

Reopening for backports

@dr0pdb dr0pdb reopened this Nov 15, 2023
dr0pdb added a commit that referenced this issue Nov 15, 2023
…ion path when dropping a ysql database

Summary:
**Backport only Note**

Moved the test to `src/yb/integration-tests/cdcsdk_stream-test.cc` since the support to create CDCSDK stream on namespace exists in cdc_service in 2.20 and below branches. It was moved to yb-master in https://phorge.dev.yugabyte.com/D28678 which is not backported in previous stable branches.

**Original Description**
Original commit: 9e2b658 / D30082
The current logic adds the same stream once per table for a namespace-level CDC stream, which leads to a deadlock in the marking code when it acquires write locks for all streams at once. If there is one than one table in the db being dropped, the streams passed to `MarkCDCStreamsForMetadataCleanup` contain duplicate entries of the same stream info object. When `MarkCDCStreamsForMetadataCleanup` loops acquiring the write locks for the streams, it deadlocks with a previous iteration of the loop as our COW locks are not re-entrant.

I refactored `FindCDCStreamsForTableToDeleteMetadata` to take a set instead of an individual table id accomplish this. I wonder if instead we could search for streams by namespace, but I wasn't confident enough in my understanding of the two kinds of CDC streams to make that change.
Jira: DB-8822

Test Plan:
```
ybd --cxx-test cdcsdk_stream-test --gtest_filter 'CDCSDKStreamTest.DropNamespaceWithLiveCDCStream'
```

Reviewers: hsunder, xCluster, zdrudi

Reviewed By: zdrudi

Subscribers: slingam, bogdan, ybase, stiwary

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30122
dr0pdb added a commit that referenced this issue Nov 15, 2023
…ion path when dropping a ysql database

Summary:
**Backport only Note**

Moved the test to `src/yb/integration-tests/cdcsdk_stream-test.cc` since the support to create CDCSDK stream on namespace exists in cdc_service in 2.20 and below branches. It was moved to yb-master in https://phorge.dev.yugabyte.com/D28678 which is not backported in previous stable branches.

**Original Description**
Original commit: 9e2b658 / D30082
The current logic adds the same stream once per table for a namespace-level CDC stream, which leads to a deadlock in the marking code when it acquires write locks for all streams at once. If there is one than one table in the db being dropped, the streams passed to `MarkCDCStreamsForMetadataCleanup` contain duplicate entries of the same stream info object. When `MarkCDCStreamsForMetadataCleanup` loops acquiring the write locks for the streams, it deadlocks with a previous iteration of the loop as our COW locks are not re-entrant.

I refactored `FindCDCStreamsForTableToDeleteMetadata` to take a set instead of an individual table id accomplish this. I wonder if instead we could search for streams by namespace, but I wasn't confident enough in my understanding of the two kinds of CDC streams to make that change.
Jira: DB-8822

Test Plan:
```
ybd --cxx-test cdcsdk_stream-test --gtest_filter 'CDCSDKStreamTest.DropNamespaceWithLiveCDCStream'
```

Reviewers: hsunder, xCluster, zdrudi

Reviewed By: zdrudi

Subscribers: slingam, bogdan, ybase, stiwary

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30123
dr0pdb added a commit that referenced this issue Nov 16, 2023
…ion path when dropping a ysql database

Summary:
**2.14 backport specific description**

Due to refactors done in later versions, this backport is completely hand-written.

**Original description**
Original commit: 9e2b658 / D30082
The current logic adds the same stream once per table for a namespace-level CDC stream, which leads to a deadlock in the marking code when it acquires write locks for
all streams at once. If there is one than one table in the db being dropped, the streams passed to `MarkCDCStreamsForMetadataCleanup` contain duplicate entries of the
same stream info object. When `MarkCDCStreamsForMetadataCleanup` loops acquiring the write locks for the streams, it deadlocks with a previous iteration of the loop as
our COW locks are not re-entrant.

I refactored `FindCDCStreamsForTableToDeleteMetadata` to take a set instead of an individual table id accomplish this. I wonder if instead we could search for streams
by namespace, but I wasn't confident enough in my understanding of the two kinds of CDC streams to make that change.
Jira: DB-8822

Test Plan:
```
ybd --cxx-test master-test_ent --gtest_filter 'MasterTestEnt.DropNamespaceWithLiveCDCStream'
```

Reviewers: hsunder, zdrudi

Reviewed By: zdrudi

Subscribers: bogdan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30128
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants