
[DocDB] Ignore statuses from old status tablet on successful transaction promotion #19535

Closed
1 task done
basavaraj29 opened this issue Oct 16, 2023 · 1 comment
Assignees
Labels
2.18 Backport Required 2.20 Backport Required area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue

Comments


basavaraj29 commented Oct 16, 2023

Jira Link: DB-8325

Description

For transaction promotion, once we pick a new status tablet, we send `UpdateStatusTabletLocation` RPC(s) to all tablets involved up to that point (holding off the request for a tablet until it has processed an actual write), and start a new heartbeater thread pinging the new status tablet.

When the participant switches the status tablet for a `RunningTransaction`, it should ignore responses to status requests sent to the old status tablet. Otherwise, the participant could wrongly conclude that the transaction is aborted (since the promotion codepath sends an abort to the old status tablet at some point) and clean up the intents. The participant also does not complain at commit time (for other reasons that are themselves necessary for correctness), which leads to data inconsistency in this case. Hence, we should ignore status responses from the old status tablet.

Another issue: the transaction heartbeat thread at the query layer sends and processes heartbeat requests/responses, and for a promoted transaction there can be two threads heartbeating against different status tablets for a period. Once an abort is sent to the old status tablet, we stop sending heartbeat requests to it as well. But we should also not process the response to a request that was sent before the abort but whose response arrived later. Otherwise, we could wrongly conclude that the transaction is aborted (while it was in fact successfully promoted), send cleanup requests to the participants, and return an error code. Hence, we should stop processing heartbeat responses once we have sent an abort to the old status tablet.
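The heartbeat-response gating described above can be sketched as follows. This is a minimal illustration mirroring the `OldTransactionState` enum mentioned later in this issue; the helper function and its name are hypothetical, not the actual YugabyteDB client code:

```cpp
#include <cassert>

// Hypothetical mirror of OldTransactionState; the helper below is an
// illustrative sketch, not the actual client implementation.
enum class OldTransactionState { kNone, kRunning, kAborting, kAborted };

// A heartbeat response from the *old* status tablet should only be
// processed while no abort has been sent to that tablet yet. Once the
// state reaches kAborting/kAborted, late-arriving responses are dropped.
bool ShouldProcessOldStatusTabletResponse(OldTransactionState state) {
  return state == OldTransactionState::kNone ||
         state == OldTransactionState::kRunning;
}
```

The key point is that the check is made when the response arrives, not when the request is sent, so a response to a request issued before the abort is still discarded.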

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@basavaraj29 basavaraj29 added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Oct 16, 2023
@basavaraj29 basavaraj29 self-assigned this Oct 16, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue and removed status/awaiting-triage Issue awaiting triage labels Oct 16, 2023
basavaraj29 added a commit that referenced this issue Oct 19, 2023
…involved in write, Ignore statuses from old status tablet on promotion

Summary:
The diff fixes a few issues with transaction promotion.

1. `tablets_` maintained in `YBTransaction::Impl` is the list of tablets considered "involved" in a transaction, i.e., tablets where the transaction has successfully processed some sort of write, determined by the involved `YBOperation`'s `op` via `op.yb_op->applied() && op.yb_op->should_apply_intents(metadata_.isolation)`. The transaction promotion codepath ideally doesn't send promotion requests to tablets that haven't yet processed a write from this transaction.

In the current implementation, we update `num_completed_batches` for an entry of `tablets_` on every successfully completed operation, regardless of whether it processed a write at the tablet. This wrongly affects the transaction promotion codepath, where promotion requests are sent to all tablets in `transaction_status_move_tablets_` (a consistent copy of `tablets_`) with `num_completed_batches > 0`. When a txn participant sees a promotion request for an unknown transaction, it returns a 40001. In a highly conflicting workload with read committed isolation this is frequent, so we error out with 40001. With the existing code, we would also hit this error when a promotion request is sent to a tablet that only processed a vanilla read (without `row_mark_type` set).

This diff addresses the issue by updating `num_completed_batches` only when `op.yb_op->applied() && op.yb_op->should_apply_intents(metadata_.isolation)` is true, i.e., the op successfully wrote something to the tablet.
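The gating condition in point 1 can be sketched as follows. `OpState` and `TabletState` are illustrative stand-ins for the relevant `YBOperation` and `tablets_` entry fields, not the actual client types:

```cpp
#include <cassert>

// Illustrative stand-ins for op.yb_op->applied() and
// op.yb_op->should_apply_intents(metadata_.isolation).
struct OpState {
  bool applied;
  bool applies_intents;
};

struct TabletState {
  int num_completed_batches = 0;
};

// Old behavior: increment on every completed op. Fixed behavior: only
// count ops that actually wrote something to the tablet, so vanilla
// reads no longer mark the tablet as "involved" for promotion.
void OnOpCompleted(const OpState& op, TabletState* tablet) {
  if (op.applied && op.applies_intents) {
    ++tablet->num_completed_batches;
  }
}
```

With this gating, a tablet that only served a read keeps `num_completed_batches == 0` and is skipped by the promotion codepath.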

2. In the existing implementation, when processing the received status of a `RunningTransaction` at the participant, we don't track whether the response came from the `old_status_tablet` or the current `status_tablet`. For transaction promotion, this leads to additional issues. Consider a transaction that starts off as a local txn, successfully completes writes at a tablet T, and then undergoes promotion. In the process, an abort is sent to the old status tablet. Now if the `RunningTransaction` sees this abort, it initiates a cleanup at the participant. And when the promoted transaction later commits, we don't error out, and the updates get undone, leading to data inconsistency.

This diff addresses the issue by tracking the status tablet on which the status request was initiated. On receiving the status response, if we see that the transaction underwent promotion, we return an already existing old status for the transaction instead of using the newly arrived status. Since the status tablet is now updated, subsequent status requests will get the latest state. Additionally, we reject promotion if there's an active request to abort the transaction.
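The fix in point 2 can be sketched as follows. `RunningTxn` and `ResolveStatus` are hypothetical stand-ins for the actual `RunningTransaction` machinery at the participant:

```cpp
#include <cassert>
#include <string>

enum class TxnStatus { kRunning, kAborted };

// Hypothetical, simplified view of the participant's per-transaction state.
struct RunningTxn {
  std::string status_tablet;  // current (possibly post-promotion) status tablet
  TxnStatus last_known_status = TxnStatus::kRunning;
};

// Each status request records which status tablet it was sent to. If the
// transaction was promoted in the meantime, the response is stale: return
// the already known status instead of the newly arrived one. Subsequent
// requests go to the updated status tablet and fetch the latest state.
TxnStatus ResolveStatus(const RunningTxn& txn,
                        const std::string& request_status_tablet,
                        TxnStatus response_status) {
  if (request_status_tablet != txn.status_tablet) {
    return txn.last_known_status;  // stale response from the old status tablet
  }
  return response_status;
}
```

This is what prevents the abort sent to the old status tablet during promotion from triggering an intent cleanup at the participant.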

3. At the query layer (`client/transaction.cc`), we could have two active heartbeating threads for a promoted transaction. In the existing implementation, we stop sending requests to the old status tablet once we have sent an abort to it (`state != OldTransactionState::kNone && state != OldTransactionState::kRunning`). But if we receive a response to an already sent request, we still do the error handling and proactive cleanup (requests to involved txn participants), even if the abort to the old status tablet was already sent (`state == OldTransactionState::kAborting || state == OldTransactionState::kAborted`). This could lead to unnecessary failures for subsequent operations of the transaction.

This diff addresses the issue by dropping the status response if an abort to the old status tablet was already sent.

Additional context: Holding off promotion requests until the tablet has seen a write was introduced in [[ 21c5918 | commit ]]. Earlier, transaction promotion was being retried on failures which was removed in [[ 1ae14b8 | commit ]] (so we don't retry failed promotion anymore).

Test Plan:
Jenkins
./yb_build.sh --cxx-test pgwrapper_geo_transactions_promotion-test --gtest_filter GeoPartitionedReadCommittedTest.TestPromotionAmidstConflicts -n 20
./yb_build.sh --cxx-test pgwrapper_geo_transactions_promotion-test --gtest_filter GeoPartitionedReadCommittedTest.TestParticipantIgnoresAbortFromOldStatusTablet

The first test fails consistently without the changes, and it reproduces all 3 issues in the description.

Elaborating on what the test does: we have a couple of transactions starting off as local transactions by issuing a `PGSQL_WRITE` on a tablet (say T_P1) in partition P1. This is followed by a few more `PGSQL_READ` operations launched against all tablets (2 tablets in P1, 1 tablet in P2, and 1 tablet in P3). These read ops trigger a transaction promotion request. Depending on the order in which these ops get flushed, if the reads to T_P1 and T_P2 get flushed before the promotion requests are sent, the existing code inserts these tablets into the list of "involved tablets". Note that these read ops don't have `row_mark_type` set, so the promotion codepath errors out, returning a 40001 to the backend.

With the changes in the diff, since we now only update `num_completed_batches` when `op.yb_op->applied() && op.yb_op->should_apply_intents(metadata_.isolation)` holds, we shouldn't run into this issue.

We also put up another, simpler test that validates point 2 of the description and fails consistently without the current changes in place:
./yb_build.sh --cxx-test pgwrapper_geo_transactions_promotion-test --gtest_filter GeoPartitionedReadCommittedTest.TestParticipantIgnoresAbortFromOldStatusTablet

Reviewers: sergei, rsami, rthallam, esheng

Reviewed By: sergei, rsami

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D29252
basavaraj29 added a commit that referenced this issue Oct 23, 2023
…blets that are involved in write, Ignore statuses from old status tablet on promotion

Summary:
Original commit: 29bcfd1 / D29252
(Summary and test plan identical to the original commit above; omitted here.)
Reviewers: sergei, rsami, rthallam, esheng

Reviewed By: rthallam

Subscribers: ybase, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29544
basavaraj29 added a commit that referenced this issue Oct 28, 2023
…blets that are involved in write, Ignore statuses from old status tablet on promotion

Summary:
Original commit: 29bcfd1 / D29252
(Summary and test plan identical to the original commit above; omitted here.)

Reviewers: sergei, rsami, rthallam, esheng

Reviewed By: esheng

Subscribers: ybase, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29626
@basavaraj29
Contributor Author

Note: this is a correctness fix required to prevent the data inconsistency issue described above, and it has been backported only as far back as 2.18. Geo-partitioning workloads should not be run on versions prior to 2.18, since they could be prone to the data inconsistency issue described above.

basavaraj29 added a commit that referenced this issue Oct 31, 2023
…tablets that are involved in write, Ignore statuses from old status tablet on promotion

Summary:
Original commit: 0880949 / D29626
(Summary and test plan identical to the original commit above; omitted here.)

Reviewers: sergei, rsami, rthallam, esheng

Reviewed By: esheng

Subscribers: yql, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29787