[DocDB] Eliminate the need for explicit aggressive poll interval - wait_queue_poll_interval_ms for contentious workloads #16440

robertsami · 2023-03-15T19:01:08Z

Jira Link: DB-5848

Description

We currently depend on a polling-based approach to resolve waiting transactions in the wait queue in order to achieve fairness under highly contentious workloads, e.g. a workload where tens of sessions concurrently lock the same row.

Without aggressive polling (e.g. setting wait_queue_poll_interval_ms=5), such highly contentious workloads suffer from high p99 latencies.

Once we have #13578, we should ensure that highly contentious workloads can function with predictable p99 performance even with wait_queue_poll_interval_ms=100 or larger. Otherwise, we are paying significant CPU overhead to get fairness.
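To make the trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes an idealized model in which a missed waiter is only noticed at the next poll and the blocker resolves at a uniformly random point in the poll cycle; `polling_tradeoff` is a hypothetical helper for illustration, not part of YugabyteDB.

```python
def polling_tradeoff(poll_interval_ms: float) -> dict:
    """Idealized cost model for a waiter that is only released by polling.

    Assumption (not taken from the YugabyteDB code): the blocker resolves at a
    uniformly random time within a poll cycle, so the waiter sits idle for half
    an interval on average and a full interval in the worst case.
    """
    return {
        "polls_per_second": 1000.0 / poll_interval_ms,
        "mean_extra_wait_ms": poll_interval_ms / 2.0,
        "worst_extra_wait_ms": float(poll_interval_ms),
    }


# Compare the aggressive setting with the one this issue wants to support.
for interval_ms in (5, 100):
    print(interval_ms, polling_tradeoff(interval_ms))
```

Under this model, going from 100 ms to 5 ms cuts the worst-case stall by 20x but also runs the poll 20x as often, which is the CPU-for-fairness trade described above.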



…t to wait queue

Summary: In case a transaction is committed, the transaction coordinator will send an UpdateTransaction request to each participating tablet. When the transaction participant processes this RPC, we can signal to the wait queue that such transaction was committed in case the wait queue is managing any waiting transactions blocked on this one. This should be more performant than relying on periodic call of WaitQueue::Poll to detect that a blocker is committed and unblock its waiters.

In case a transaction is aborted, the query layer client will send an UpdateTransaction request with status IMMEDIATE_CLEANUP to every involved transaction participant. In such a case we can similarly signal to the wait queue that this transaction was aborted.

In order to ensure a re-run of conflict resolution sees the latest signaled changes, we also modify the contract between conflict resolution and wait queue code to allow the wait queue to advance the resolution_ht used by conflict resolution beyond the ht of the commit/abort which triggered the waiter to be re-run.

Given these changes, we can achieve high fairness in most normal workloads. For sufficiently contentious workloads, we need to set `wait_queue_poll_interval_ms` to a fairly small setting to maintain fairness. Immediate follow-up work will reduce this dependency on `wait_queue_poll_interval_ms`: see #16440.

This commit includes another change to refactor the wait queue to be owned by the transaction participant to resolve some lifetime issues with this signaling approach.

Test Plan: Jenkins: hot

Reviewers: bkolagani, sergei

Reviewed By: sergei

Subscribers: pjain, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D23614
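The difference between signal-driven wakeup and a periodic poll can be sketched with a toy model. This is only an illustration under stated assumptions: `Blocker`, `commit`, `wait_on`, and `measure` are hypothetical names, a Python `threading.Event` stands in for the participant's signal to the wait queue, and the wait timeout plays the role of `wait_queue_poll_interval_ms`.

```python
import threading
import time


class Blocker:
    """Toy stand-in for a blocking transaction (hypothetical, for illustration)."""

    def __init__(self):
        self.committed = False
        self.signal = threading.Event()


def commit(blocker: Blocker, notify: bool) -> None:
    # Record the status first, then (optionally) signal waiters.
    blocker.committed = True
    if notify:
        blocker.signal.set()


def wait_on(blocker: Blocker, poll_interval_s: float) -> float:
    """Block until the blocker is committed; return how long we waited."""
    start = time.monotonic()
    while not blocker.committed:
        # Wakes early on a signal, otherwise after one poll interval.
        blocker.signal.wait(timeout=poll_interval_s)
    return time.monotonic() - start


def measure(notify: bool, poll_interval_s: float = 0.5,
            commit_after_s: float = 0.05) -> float:
    blocker = Blocker()
    timer = threading.Timer(commit_after_s, commit, args=(blocker, notify))
    timer.start()
    waited = wait_on(blocker, poll_interval_s)
    timer.join()
    return waited


if __name__ == "__main__":
    print("with signal:", round(measure(notify=True), 3), "s")
    print("poll only:  ", round(measure(notify=False), 3), "s")
```

With the signal, the waiter resumes shortly after the commit; without it, the waiter only notices the commit when the poll timeout expires.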

robertsami · 2023-04-04T20:03:58Z

For example, with the following transaction:

begin;
select * from foo where k = 1 for update;
commit;

If we run this in 8 parallel threads against a single-node local cluster on a 6-core machine, we see roughly 1 in 1,000 requests experience about 100x the average latency.
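A measurement of this kind could be scripted roughly as follows. This is a minimal sketch, not the harness used for the numbers above: `run_workload` and `tail_ratio` are hypothetical helpers, and `txn` is assumed to be a caller-supplied function that opens its own connection (with whatever PostgreSQL-compatible driver is available) and runs the begin / select ... for update / commit sequence.

```python
import statistics
import threading
import time


def run_workload(txn, threads: int = 8, iterations: int = 1000) -> list:
    """Run txn() repeatedly from several threads; return per-call latencies (s)."""
    latencies = []
    lock = threading.Lock()

    def worker():
        local = []
        for _ in range(iterations):
            start = time.perf_counter()
            txn()  # assumed to execute one full transaction
            local.append(time.perf_counter() - start)
        with lock:
            latencies.extend(local)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return latencies


def tail_ratio(latencies, quantile: float = 0.999) -> float:
    """Latency at the given quantile divided by the mean latency."""
    ordered = sorted(latencies)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index] / statistics.fmean(ordered)


if __name__ == "__main__":
    # Placeholder transaction so the sketch runs without a database.
    samples = run_workload(lambda: time.sleep(0.001), threads=8, iterations=50)
    print("p99.9 / mean:", round(tail_ratio(samples), 2))
```

A ratio near 100 at the 99.9th percentile would correspond to the "1/1000 requests at 100x average latency" symptom described here.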

robertsami · 2023-12-07T16:25:16Z

A tentative root cause has been identified, with a POC fix seemingly solving the issue. We have the following two code paths racing with each other:

Path 1 --

1. txn coordinator receives commit/abort for T1
2. txn coordinator sends UpdateTxn to participant for T1
3. participant calls SignalCommitted(T1)/SignalAborted(T1) on wait queue
4. wait queue signals all waiters currently known to be waiting on T1

Path 2 --

1. tserver receives WriteRequest for operation that is part of T2
2. write_query finds conflicts during conflict_resolution, including T1
3. conflict_resolution calls WaitQueue::WaitOn(waiter: T2, blockers: [T1])

Step 3 of path 2 and step 4 of path 1 need to be synchronized; otherwise T2 may miss the signal from the participant and remain stuck in the wait queue until Poll() is next called.
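One way to close this kind of lost-wakeup window, sketched here as a toy model, is to record a blocker's resolution and register waiters under the same lock, so that a late WaitOn sees the earlier signal. This is an assumption about one possible approach, not a description of the POC fix; the `WaitQueue` class below is hypothetical, handles a single blocker per waiter, and never prunes its resolved set.

```python
import threading


class WaitQueue:
    """Toy wait queue: signals and registrations share one lock (illustrative)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._resolved = set()   # blockers already committed/aborted
        self._waiters = {}       # blocker id -> list of waiter ids
        self.resumed = []        # waiters released to re-run conflict resolution

    def signal_resolved(self, blocker: str) -> None:
        """Path 1, steps 3-4: mark the blocker resolved and release its waiters."""
        with self._lock:
            self._resolved.add(blocker)
            self.resumed.extend(self._waiters.pop(blocker, []))

    def wait_on(self, waiter: str, blocker: str) -> bool:
        """Path 2, step 3: return True if the waiter was enqueued.

        If the blocker was resolved before we got here, resume immediately
        instead of waiting for a poll to notice.
        """
        with self._lock:
            if blocker in self._resolved:
                self.resumed.append(waiter)
                return False
            self._waiters.setdefault(blocker, []).append(waiter)
            return True


# Signal arrives first: the waiter is not left stranded in the queue.
late = WaitQueue()
late.signal_resolved("T1")
print(late.wait_on("T2", "T1"), late.resumed)

# Waiter arrives first: the signal releases it.
early = WaitQueue()
print(early.wait_on("T2", "T1"))
early.signal_resolved("T1")
print(early.resumed)
```

Either interleaving ends with T2 resumed, which is the property the two racing paths currently lack without Poll().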

robertsami · 2024-03-11T16:20:00Z

Depends on #21404

robertsami added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Mar 15, 2023

robertsami self-assigned this Mar 15, 2023

robertsami added this to Needs Triage in Wait-Queue Based Locking via automation Mar 15, 2023

yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Mar 15, 2023

robertsami moved this from Needs Triage to Backlog in Wait-Queue Based Locking Mar 15, 2023

yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Mar 17, 2023

robertsami moved this from Backlog to On-By-Default Blocking in Wait-Queue Based Locking Apr 4, 2023

robertsami moved this from EA Blocking to Tech-Preview Blocking in Wait-Queue Based Locking Jun 26, 2023

yugabyte-ci added kind/enhancement This is an enhancement of an existing feature and removed kind/bug This issue is a bug labels Jul 5, 2023

robertsami moved this from Tech-Preview (Due 7/21/23) to In progress in Wait-Queue Based Locking Jul 10, 2023

robertsami moved this from In progress to EA Blocking in Wait-Queue Based Locking Aug 14, 2023

rthallamko3 changed the title ~~[DocDB] Reduce dependence of wait queue on frequent Poll~~ [DocDB] Eliminate the need for explicit aggressive poll interval - wait_queue_poll_interval_ms for contentious workloads Dec 6, 2023

rthallamko3 assigned basavaraj29 and unassigned robertsami May 6, 2024
