Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YSQL] Deadlock involving WaitForBackendsCatalogVersion #18711

Closed
1 task done
myang2021 opened this issue Aug 16, 2023 · 9 comments
Closed
1 task done

[YSQL] Deadlock involving WaitForBackendsCatalogVersion #18711

myang2021 opened this issue Aug 16, 2023 · 9 comments
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue

Comments

@myang2021
Copy link
Contributor

myang2021 commented Aug 16, 2023

Jira Link: DB-7616

Description

When I ran the test PgIndexBackfillTest.PgStatProgressCreateIndexPhase via command

./yb_build.sh debug --sjb --cxx-test pgwrapper_pg_index_backfill-test --gtest_filter PgIndexBackfillTest.PgStatProgressCreateIndexPhase --clang16

The test passed in master branch (commit dfc08e8) which does not have per-database catalog version mode. However, when I ran the test with per-database catalog version mode turned on by default: change the default value of --FLAGS_TEST_enable_db_catalog_version_mode to true, then the above test failed.

After debugging, I found there is a potential deadlock scenario. Following are the detailed steps that lead to the deadlock:

(1)

The tablet server (ts-1) runs WaitForBackendsCatalogVersion (called by PG DefineIndex function as part of create index workflow). There is a macro defined for this method

  #define PG_CLIENT_SESSION_METHOD_FORWARD(r, data, method) \
  Status method( \
      const BOOST_PP_CAT(BOOST_PP_CAT(Pg, method), RequestPB)& req, \
      BOOST_PP_CAT(BOOST_PP_CAT(Pg, method), ResponsePB)* resp, \
      rpc::RpcContext* context) { \
    return VERIFY_RESULT(GetSession(req))->method(req, resp, context); \
  }

Note that this macro has GetSession(req), which is defined as

  Result<PgClientSessionLocker> GetSession(uint64_t session_id) {
    return PgClientSessionLocker(VERIFY_RESULT(DoGetSession(session_id)));
  }

In particular, it returns a PgClientSessionLocker object.

  explicit PgClientSessionLocker(const LockablePtr& lockable)
      : lockable_(lockable), lock_(lockable->mutex_) {
  }

The constructor of PgClientSessionLocker automatically locks the lockable, in this case is the session object.

Secondly, the session object will be locked until method(req, resp, context) completes. method in this case is

Status PgClientSession::WaitForBackendsCatalogVersion(
    const PgWaitForBackendsCatalogVersionRequestPB& req,
    PgWaitForBackendsCatalogVersionResponsePB* resp,
    rpc::RpcContext* context) {
  // TODO(jason): send deadline to client.
  const int num_lagging_backends = VERIFY_RESULT(client().WaitForYsqlBackendsCatalogVersion(
      req.database_oid(), req.catalog_version(), context->GetClientDeadline()));
  resp->set_num_lagging_backends(num_lagging_backends);
  return Status::OK();
}

So we know that PgClientSession is now locked and will not be released until its method WaitForBackendsCatalogVersion completes.

(2)
Now let's consider how client().WaitForYsqlBackendsCatalogVersion is implemented and we will see the deadlock.

Result<int> YBClient::WaitForYsqlBackendsCatalogVersion(
    PgOid database_oid, uint64_t version, const CoarseTimePoint& deadline) {
  WaitForYsqlBackendsCatalogVersionRequestPB req;
  WaitForYsqlBackendsCatalogVersionResponsePB resp;

  req.set_database_oid(database_oid);
  req.set_catalog_version(version);

  DCHECK(deadline != CoarseTimePoint()) << ToString(deadline);

  CALL_SYNC_LEADER_MASTER_RPC_WITH_DEADLINE(Admin, req, resp, deadline,
                                            WaitForYsqlBackendsCatalogVersion);

So it is sending WaitForYsqlBackendsCatalogVersion RPC to master. At master side:

  MASTER_SERVICE_IMPL_ON_LEADER_WITHOUT_LOCK(
      YsqlBackendsManager,
      (WaitForYsqlBackendsCatalogVersion)
  )

It basically calls YsqlBackendsManager::WaitForYsqlBackendsCatalogVersion, where a BackendsCatalogVersionJob is created:

  std::shared_ptr<BackendsCatalogVersionJob> job;

(3)
The BackendsCatalogVersionJob job is started via BackendsCatalogVersionJob::Launch, which in turn launches a task for each tablet server:

  for (const auto& ts_uuid : ts_uuids) {
    RETURN_NOT_OK(LaunchTS(ts_uuid, -1 /* num_lagging_backends */, epoch_));
  }

BackendsCatalogVersionJob::LaunchTS starts a BackendsCatalogVersionTS task for every tablet server. This of course, includes ts-1 that we mentioned in step (1).

The BackendsCatalogVersionTS follows RetryingTSRpcTask work flow:

class BackendsCatalogVersionTS : public RetryingTSRpcTask {

It sends out RPC to a tablet server:

bool BackendsCatalogVersionTS::SendRequest(int attempt) {
  tserver::WaitForYsqlBackendsCatalogVersionRequestPB req;
  {
    auto job = job_.lock();
    req.set_database_oid(job->database_oid());
    req.set_catalog_version(job->target_version());
    req.set_prev_num_lagging_backends(prev_num_lagging_backends_);
  }

  ts_admin_proxy_->WaitForYsqlBackendsCatalogVersionAsync(req, &resp_, &rpc_, BindRpcCallback());
  VLOG(0) << "Send " << description() << " to " << permanent_uuid()
          << " (attempt " << attempt << "): " << req.ShortDebugString();
  return true;
}

Note WaitForYsqlBackendsCatalogVersionAsync is called to send a WaitForYsqlBackendsCatalogVersion RPC to ts-1.

(4) Now back in ts-1:

How ts-1 handles WaitForYsqlBackendsCatalogVersion?

TabletServiceAdminImpl::WaitForYsqlBackendsCatalogVersion

It builds a query:

  const std::string num_lagging_backends_query = Format(
      "SELECT count(*) FROM pg_stat_activity WHERE catalog_version < $0 AND datid = $1",
      catalog_version, database_oid);

Then a connection is made to its local PG master to start a new PG backend.

Next, pg_stat_activity is defined as a view which involves function pg_stat_get_activity:

CREATE VIEW pg_stat_activity AS
    SELECT
......
    FROM pg_stat_get_activity(NULL) AS S
        LEFT JOIN pg_database AS D ON (S.datid = D.oid)
        LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid)
        LEFT JOIN (pg_stat_get_backend_idset() beid CROSS JOIN
                   pg_stat_get_backend_pid(beid) pid) B ON B.pid = S.pid;

It was found that function pg_stat_get_activity is not always called, depending on the other JOIN clauses defined above. But if pg_stat_get_activity is called, then we have a problem. Inside pg_stat_get_activity it calls

        HandleYBStatus(YBCPgActiveTransactions(txn_infos, num_backends));
YBCStatus YBCPgActiveTransactions(YBCPgSessionTxnInfo *infos, size_t num_infos) {
  return ToYBCStatus(pgapi->GetActiveTransactions(infos, num_infos));
}
Status PgApiImpl::GetActiveTransactions(YBCPgSessionTxnInfo* infos, size_t num_infos) {
  std::unordered_map<uint64_t, Slice> txns;
  txns.reserve(num_infos);
  return pg_client_.EnumerateActiveTransactions(make_lw_function(
  Status EnumerateActiveTransactions(
      const ActiveTransactionCallback& callback, bool for_current_session_only) {
......
    RETURN_NOT_OK(proxy_->GetActiveTransactionList(req, &resp, PrepareController()));

We can see eventually it makes a RPC GetActiveTransactionList back to ts-1 itself.

(5)
In GetActiveTransactionList

    for (const auto& session : sessions_snapshot) {
      AddTransactionInfo(resp, PgClientSessionLocker(session));
    }

The call PgClientSessionLocker(session) makes a PgClientSessionLocker object, the constructor of PgClientSessionLocker automatically locks the lockable, in this case is the same session object that has already been locked above in step (1). The deadlock!

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@myang2021 myang2021 added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Aug 16, 2023
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue labels Aug 16, 2023
@myang2021
Copy link
Contributor Author

The function GetActiveTransactionsList is newly introduced by the following commit

d2a700a3a35edc3a0e74a5ee72d36e9cac7319c1
Author: Karthik Ramanathan <kramanathan@yugabyte.com>
Date:   Wed Aug 2 17:30:08 2023 -0700

    [#17087, #18054] YSQL: Exposes YB Transaction IDs in Postgres.

It is mentioned in the commit summary:

    This change introduces a new RPC `GetActiveTransactionsList` for this purpose.

@myang2021
Copy link
Contributor Author

    decltype(sessions_) sessions_snapshot;
    {
      std::lock_guard lock(mutex_);
      sessions_snapshot = sessions_;
    }

The purpose looks like we try to make a snapshot of the current sessions, but end up making a copy of references to those sessions in the sessions_. If we can truly make a copy of clones to those sessions, then we will not have deadlock.

@myang2021
Copy link
Contributor Author

  boost::multi_index_container<
      LockablePgClientSessionPtr,
      boost::multi_index::indexed_by<
          boost::multi_index::hashed_unique<
              boost::multi_index::const_mem_fun<PgClientSession, uint64_t, &PgClientSession::id>
          >
      >
  > sessions_ GUARDED_BY(mutex_);

Note LockablePgClientSessionPtr is a pointer:

using LockablePgClientSessionPtr = std::shared_ptr<LockablePgClientSession>;

@myang2021
Copy link
Contributor Author

myang2021 commented Aug 17, 2023

But how can we safely make a clone of LockablePgClientSession when its mutex_ is already locked? By lock semantics if an object is locked, it is not readable because it may be undergoing a critical change that needs to be atomic and that's why a lock is needed. So the sessions_snapshot did not work as expected and I don't see how to make it work as expected.

I think we can fix the bug by

    for (const auto& session : sessions_snapshot) {
      AddTransactionInfo(resp, session);
    }

This assumes that AddTransactionInfo will only access the immutable (read-only) part of session. It does look so:

void AddTransactionInfo(
    PgGetActiveTransactionListResponsePB* out, const PgClientSessionLocker& locker) {
  auto& session = *locker;
  const auto* txn_id = session.GetTransactionId();
  if (!txn_id) {
    return;
  }

  auto& entry = *out->add_entries();
  entry.set_session_id(session.id());
  txn_id->AsSlice().CopyToBuffer(entry.mutable_txn_id());
}

We can see that session.GetTransactionId(); reads a pointer back, I think it is safe without locking mutex_ of session.

@myang2021
Copy link
Contributor Author

Thinking more, it's tricky and hard to ensure safety without locking mutex_. We can consider adding an active_txns_ and each time we add a new txn to a session, we add also insert it there. Any time we disassociate a txn with a session, we remove it from there. active_txns_ will have its own mutex, in this way we avoid the deadlock.

@jasonyb
Copy link
Contributor

jasonyb commented Aug 30, 2023

Forget the deadlock: GetActiveTransactionList taking locks makes pg_stat_activity unusably slow. This kind of information could be pre-populated close to pgstat code instead of pgstat having to go all the way to the pggate layer to get the information on the fly. Worst is when a CREATE INDEX is running and someone selects from pg_stat_activity: that's going to hang for a while.

@sushantrmishra sushantrmishra removed the status/awaiting-triage Issue awaiting triage label Sep 5, 2023
@jasonyb
Copy link
Contributor

jasonyb commented Sep 5, 2023

For instance, master flag --TEST_simulate_slow_table_create_secs=20 plus CREATE TABLE t (i int); simultaneously with select * from pg_stat_activity hangs for up to 20s waiting on that. A more realistic example not using that flag is multiple sessions running lock-taking DDLs, where each will cause around 0.5-1s wait due to RPC latency, and multiply that by, say 10 sessions doing DDLs out of a total of 300 sessions, means pg_stat_activity takes 5-10s--too long.

karthik-ramanathan-3006 added a commit that referenced this issue Sep 28, 2023
… session

Summary:
**Background**
pg_stat_activity issues an RPC (GetActiveTransactionList) to the local t-server to fetch a mapping between currently active sessions and the transactions that these sessions are executing.
At the local t-server, this RPC results in the execution of the following pseudo-code:
```
acquire exclusive pg_client_service_mutex
make a shallow copy of active_sessions
release exclusive pg_client_service_mutex

for session_x in copied sessions:
    acquire exclusive per_session_lock for session_x
    read transaction info from session_x
    release exclusive per_session_lock for session_x
```

In a nutshell, the RPC seeks to acquire an exclusive lock on each of the sessions active at the t-server.
This causes contention on a given session's lock whenever the session is performing active work concurrent to a a pg_stat_activity request.
This manifests itself particularly during:
  # Long running DML operations (such as CreateTable/Index)
  # The WaitForBackendsCatalogVersion flow, where an exclusive lock is acquired on each of the active sessions at the t-server.

In theory, any operation which acquires a per-session lock at the t-server for a non-trivial duration will experience contention from a concurrently running pg_stat_activity request.

**Fix**
This revision introduces a transaction cache at the t-server, which stores the mapping between active sessions and their transactions.
The cache is written to, at the beginning and end of each transaction by the session initiating the transaction under an exclusive lock for the cache.
The cache is read from, by the session servicing the pg_stat_activity request, under a shared lock for the cache. This alleviates the need for pg_stat_activity to take an exclusive lock on the session.
Jira: DB-7616

Test Plan:
```
./yb_build.sh --cxx-test pgwrapper_pg_stat_activity-test --gtest_filter PgStatActivityDelayTest.SlowDDLOperation
```

Reviewers: jason, myang, pjain, dmitry

Reviewed By: myang

Subscribers: smishra, ybase, yql, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D28666
karthik-ramanathan-3006 added a commit that referenced this issue Sep 29, 2023
… lock on t-server session

Summary:
Original commit: f38b75f / D28666
**Background**
pg_stat_activity issues an RPC (GetActiveTransactionList) to the local t-server to fetch a mapping between currently active sessions and the transactions that these sessions are executing.
At the local t-server, this RPC results in the execution of the following pseudo-code:
```
acquire exclusive pg_client_service_mutex
make a shallow copy of active_sessions
release exclusive pg_client_service_mutex

for session_x in copied sessions:
    acquire exclusive per_session_lock for session_x
    read transaction info from session_x
    release exclusive per_session_lock for session_x
```

In a nutshell, the RPC seeks to acquire an exclusive lock on each of the sessions active at the t-server.
This causes contention on a given session's lock whenever the session is performing active work concurrent to a a pg_stat_activity request.
This manifests itself particularly during:
  # Long running DML operations (such as CreateTable/Index)
  # The WaitForBackendsCatalogVersion flow, where an exclusive lock is acquired on each of the active sessions at the t-server.

In theory, any operation which acquires a per-session lock at the t-server for a non-trivial duration will experience contention from a concurrently running pg_stat_activity request.

**Fix**
This revision introduces a transaction cache at the t-server, which stores the mapping between active sessions and their transactions.
The cache is written to, at the beginning and end of each transaction by the session initiating the transaction under an exclusive lock for the cache.
The cache is read from, by the session servicing the pg_stat_activity request, under a shared lock for the cache. This alleviates the need for pg_stat_activity to take an exclusive lock on the session.
Jira: DB-7616

Test Plan:
```
./yb_build.sh --cxx-test pgwrapper_pg_stat_activity-test --gtest_filter PgStatActivityDelayTest.SlowDDLOperation
```

Reviewers: jason, myang, pjain, dmitry

Reviewed By: myang

Subscribers: smishra, ybase, yql, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D28923
karthik-ramanathan-3006 added a commit that referenced this issue Sep 29, 2023
…ock on t-server session

Summary:
Original commit: f38b75f / D28666
**Background**
pg_stat_activity issues an RPC (GetActiveTransactionList) to the local t-server to fetch a mapping between currently active sessions and the transactions that these sessions are executing.
At the local t-server, this RPC results in the execution of the following pseudo-code:
```
acquire exclusive pg_client_service_mutex
make a shallow copy of active_sessions
release exclusive pg_client_service_mutex

for session_x in copied sessions:
    acquire exclusive per_session_lock for session_x
    read transaction info from session_x
    release exclusive per_session_lock for session_x
```

In a nutshell, the RPC seeks to acquire an exclusive lock on each of the sessions active at the t-server.
This causes contention on a given session's lock whenever the session is performing active work concurrent to a a pg_stat_activity request.
This manifests itself particularly during:
  # Long running DML operations (such as CreateTable/Index)
  # The WaitForBackendsCatalogVersion flow, where an exclusive lock is acquired on each of the active sessions at the t-server.

In theory, any operation which acquires a per-session lock at the t-server for a non-trivial duration will experience contention from a concurrently running pg_stat_activity request.

**Fix**
This revision introduces a transaction cache at the t-server, which stores the mapping between active sessions and their transactions.
The cache is written to, at the beginning and end of each transaction by the session initiating the transaction under an exclusive lock for the cache.
The cache is read from, by the session servicing the pg_stat_activity request, under a shared lock for the cache. This alleviates the need for pg_stat_activity to take an exclusive lock on the session.
Jira: DB-7616

Test Plan:
```
./yb_build.sh --cxx-test pgwrapper_pg_stat_activity-test --gtest_filter PgStatActivityDelayTest.SlowDDLOperation
```

Reviewers: jason, myang, pjain, dmitry

Reviewed By: myang

Subscribers: bogdan, yql, ybase, smishra

Differential Revision: https://phorge.dev.yugabyte.com/D28922
@rthallamko3
Copy link
Contributor

No others backports pending right?

@karthik-ramanathan-3006
Copy link
Contributor

No more backports left, all done! 👍

d-uspenskiy added a commit that referenced this issue Oct 12, 2023
Summary:
This diff reverts changes made in context of the https://phorge.dev.yugabyte.com/D28666 diff.
Instead of storing info related to session's transactions into separate map named `TransactionCache` such info is stored near the session itself in the `SessionInfo` structure.
```
struct SessionInfo {
  TxnAssignment txn_assignment;
  LockablePgClientSession session;
};
```

This structure is stored in the `PgClientServiceImpl::sessions_` field (instead of `LockablePgClientSessionPtr`).
Jira: DB-7616

Test Plan:
Jenkins

run existing tests
```
./yb_build.sh --cxx-test pgwrapper_pg_stat_activity-test --gtest_filter PgStatActivityDelayTest.SlowDDLOperation
./yb_build.sh --cxx-test pgwrapper_pg_stat_activity-test --gtest_filter PgStatActivityTest.AllBackendsTransaction
./yb_build.sh --cxx-test pgwrapper_pg_stat_activity-test --gtest_filter PgStatActivityTest.CurrentTransaction
./yb_build.sh --cxx-test pgwrapper_pg_stat_activity-test --gtest_filter PgStatActivityTest.DDLInsideDMLTransaction
```

Reviewers: sergei, myang, kramanathan

Reviewed By: sergei, myang

Subscribers: bogdan, yql, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29109
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue
Projects
Status: Done
Development

No branches or pull requests

6 participants