Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] Core dumps found when exercising ddl atomicity + cross db concurrent ddls #21773

Closed
1 task done
archit-rastogi opened this issue Apr 2, 2024 · 0 comments
Closed
1 task done
Assignees
Labels
2024.1 Backport Required area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue

Comments

@archit-rastogi
Copy link

archit-rastogi commented Apr 2, 2024

Jira Link: DB-10649

Description

http://stress.dev.yugabyte.com/stress_test/a26e46ed-0009-40eb-bbfe-e7e845a46a38
http://stress.dev.yugabyte.com/stress_test/0e143f1f-bb16-40c9-ad81-d97fac67bc2f

Build: 2024.1.0.0-b32

Core dump:
http://stress.dev.yugabyte.com/files/get?name=97b4f41a-e2e0-47ce-8ab0-106812ebf9cd-core[…].1.0.0-b32-almalinux8-aarch64!bin!yb-server.backtrace.txt
http://stress.dev.yugabyte.com/files/get?name=c27aa0ef-765d-40b2-ad5e-797fd6a6d5ff-core[…].1.0.0-b32-almalinux8-aarch64!bin!yb-server.backtrace.txt
http://stress.dev.yugabyte.com/files/get?name=17cde259-9f37-4958-8ae1-0d762618dc68-core[…].1.0.0-b32-almalinux8-aarch64!bin!yb-server.backtrace.txt

* thread #1, name = 'yb-master', stop reason = signal SIGABRT
  * frame #0: 0x0000ffff83836274 libc.so.6`raise + 172
    frame #1: 0x0000ffff83820a2c libc.so.6`abort + 276
    frame #2: 0x0000aaaacf5c79a4 yb-master`abort_message + 164
    frame #3: 0x0000aaaacf5c75a4 yb-master`__cxa_pure_virtual + 20
    frame #4: 0x0000aaaace0ce2dc yb-master`std::__1::__function::__func<yb::master::PollTransactionStatusBase::VerifyTransaction()::$_0, std::__1::allocator<yb::master::PollTransactionStatusBase::VerifyTransaction()::$_0>, void (yb::Status const&, yb::tserver::GetTransactionStatusResponsePB const&)>::operator()(yb::Status const&, yb::tserver::GetTransactionStatusResponsePB const&) [inlined] yb::master::PollTransactionStatusBase::TransactionReceived(this=0x000016a7acfe2aa0, txn_status=Status @ 0x0000ffff581069d0, resp=<unavailable>) at ysql_ddl_verification_task.cc:546:5
    frame #5: 0x0000aaaace0ce1cc yb-master`std::__1::__function::__func<yb::master::PollTransactionStatusBase::VerifyTransaction()::$_0, std::__1::allocator<yb::master::PollTransactionStatusBase::VerifyTransaction()::$_0>, void (yb::Status const&, yb::tserver::GetTransactionStatusResponsePB const&)>::operator()(yb::Status const&, yb::tserver::GetTransactionStatusResponsePB const&) [inlined] yb::master::PollTransactionStatusBase::VerifyTransaction()::$_0::operator()(this=<unavailable>, status=<unavailable>, resp=<unavailable>) const at ysql_ddl_verification_task.cc:528:7
....
....
....

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@archit-rastogi archit-rastogi added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Apr 2, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Apr 2, 2024
myang2021 added a commit that referenced this issue Apr 2, 2024
Summary:
The relevant code that can lead to rpc callback invocation on a deleted ddl
background task is in `PollTransactionStatusBase::VerifyTransaction()`

```
  *rpc_handle = client::GetTransactionStatus(
    TransactionRpcDeadline(),
    nullptr /* tablet */,
    client,
    &req,
    [this, rpc_handle]
        (Status status, const tserver::GetTransactionStatusResponsePB& resp) {
      auto retained = rpcs_.Unregister(rpc_handle);
      TransactionReceived(std::move(status), resp);
    });
```

The base class `PollTransactionStatusBase` has two subclasses:

```
class NamespaceVerificationTask : public MultiStepNamespaceTaskBase,
                                  public PollTransactionStatusBase {
```
and
```
class TableSchemaVerificationTask : public MultiStepTableTaskBase,
                                    public PollTransactionStatusBase {
```

Both tasks can be asynchronously deleted by `TasksTracker::CleanupOldTasks`.
Therefore it is possible that the background task is already deleted before the
callback shown above is invoked. This can lead to yb-master crash.

To fix this bug, we need to ensure that the bg task isn't deleted before the
callback is invoked. I added a new virtual function `GetSharedFromThis()` to
retrieve a `shared_ptr` to the bg task object and capture it within the callback
itself as a weak_ptr<MonitoredTask>.
Jira: DB-10649

Test Plan: jenkins

Reviewers: hsunder, tverona

Reviewed By: hsunder

Subscribers: ybase, yql, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D33731
myang2021 added a commit that referenced this issue Apr 5, 2024
…dl bg task

Summary:
The relevant code that can lead to rpc callback invocation on a deleted ddl
background task is in `PollTransactionStatusBase::VerifyTransaction()`

```
  *rpc_handle = client::GetTransactionStatus(
    TransactionRpcDeadline(),
    nullptr /* tablet */,
    client,
    &req,
    [this, rpc_handle]
        (Status status, const tserver::GetTransactionStatusResponsePB& resp) {
      auto retained = rpcs_.Unregister(rpc_handle);
      TransactionReceived(std::move(status), resp);
    });
```

The base class `PollTransactionStatusBase` has two subclasses:

```
class NamespaceVerificationTask : public MultiStepNamespaceTaskBase,
                                  public PollTransactionStatusBase {
```
and
```
class TableSchemaVerificationTask : public MultiStepTableTaskBase,
                                    public PollTransactionStatusBase {
```

Both tasks can be asynchronously deleted by `TasksTracker::CleanupOldTasks`.
Therefore it is possible that the background task is already deleted before the
callback shown above is invoked. This can lead to yb-master crash.

To fix this bug, we need to ensure that the bg task isn't deleted before the
callback is invoked. I added a new virtual function `GetSharedFromThis()` to
retrieve a `shared_ptr` to the bg task object and capture it within the callback
itself as a weak_ptr<MonitoredTask>.
Jira: DB-10649

Original commit: 70261b0 / D33731

Test Plan: jenkins

Reviewers: hsunder, tverona

Reviewed By: hsunder

Subscribers: jason, bogdan, yql, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33781
hari90 added a commit that referenced this issue Apr 6, 2024
Summary:
rpcs_.Shutdown() does not wait for all calls to complete. It has a max timeout after which it gives up. Usually this is not a problem since we only Shutdown rpcs on process exit, so its fine to end the process quickly.
In the case of PollTransactionStatusBase we need to explicitly wait for the callback completion when the Task ends.
Jira: DB-10649

Test Plan: Jenkins

Reviewers: myang

Reviewed By: myang

Subscribers: ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D33819
myang2021 pushed a commit that referenced this issue Apr 8, 2024
…tatusBase shutdown

Summary:
rpcs_.Shutdown() does not wait for all calls to complete. It has a max timeout after which it gives up. Usually this is not a problem since we only Shutdown rpcs on process exit, so its fine to end the process quickly.
In the case of PollTransactionStatusBase we need to explicitly wait for the callback completion when the Task ends.
Jira: DB-10649

Original commit: 70620c7 / D33819

Test Plan: Jenkins

Reviewers: tverona, kfranz, hsunder

Reviewed By: hsunder

Subscribers: bogdan, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33918
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2024.1 Backport Required area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue
Projects
None yet
Development

No branches or pull requests

5 participants