[YSQL][Read committed][Geo-partitioning] Encountering 40001 transaction abortion errors while running conflicting transactions that do not result in deadlock #18081

Closed
Karvy-yb opened this issue Jul 5, 2023 · 12 comments
Labels: 2.18 Backport Required, 2.20 Backport Required, area/docdb (YugabyteDB core features), kind/bug (This issue is a bug), priority/high (High Priority), qa_automation (Bugs identified via itest-system, LST, Stress automation or causing automation failures), QA (QA filed bugs)

Karvy-yb commented Jul 5, 2023

Jira Link: DB-7123

Description

This is a geo-partitioned universe (regions: us-west-2, us-east-1, ap-south-1) with the yb_enable_read_committed_isolation, enable_deadlock_detection, and enable_wait_queues gflags enabled. While running contentious transactions, the test starts 4 threads in parallel, all trying to acquire different types of locks on the same row. From the logs we know that waiting happens properly and there is no deadlock, yet after one or more of the 4 transactions commit, some of the remaining transactions encounter the following error:

Move transaction status: Transaction 8bc626a9-de15-4b49-8d14-dcb8ddde47e1 expired or aborted by a conflict: 40001: . Errors from tablet servers: [Operation expired (yb/tablet/transaction_participant.cc:1202): Move transaction status: Transaction 8bc626a9-de15-4b49-8d14-dcb8ddde47e1 expired or aborted by a conflict: 40001 (pgsql error 40001)]

The test started failing recently on master (failing consistently on 2.19.1.0-237). Test summary:

testysqlrlgpwithtransactiontablesandwaitqueues-aws-rf3-geo-partition-multiregion: Start
	(     0.547s) User Login : Success
	(     0.187s) Refresh YB Version : Success
	(   120.519s) Setup Provider : Success
	(     0.048s) Updating Health Check Interval to 60000 sec : Success
	(  1781.252s) Create universe kmoh-6da6167592-20230704-063225 : Success
	(   101.247s) Create database transactions_db : Success
	(   208.163s) Create tablespace, parent table and geo partitioned tables : Success
	(     2.387s) Verify cluster is load balanced : Success
	(     2.471s) Verify geo partitioned transaction tables are created automatically : Success
	(    17.412s) Insert rows into the table : Success
	(     0.000s) Start running contentious transactions with Read committed isolation level : Success
	(     0.000s) Attempting Contentious transactions on transactions table : Success
	(    93.101s) Attempting Contentious transactions on transactions table : >>> Integration Test Failed <<< 
Move transaction status: Transaction 8bc626a9-de15-4b49-8d14-dcb8ddde47e1 expired or aborted by a conflict: 40001: . Errors from tablet servers: [Operation expired (yb/tablet/transaction_participant.cc:1202): Move transaction status: Transaction 8bc626a9-de15-4b49-8d14-dcb8ddde47e1 expired or aborted by a conflict: 40001 (pgsql error 40001)]

	(    55.005s) Saved server log files and keys at /share/jenkins/workspace/itest-system-developer/logs/2.19.1.0_testysqlrlgpwithtransactiontablesandwaitqueues-aws-rf3-geo-partition-multiregion_20230704_080548 : Success
	(   127.644s) Destroy universe : Success
	(     0.252s) Check and stop workloads : Success
testysqlrlgpwithtransactiontablesandwaitqueues-aws-rf3-geo-partition-multiregion: End

P.S. The same scenario passes with the repeatable read isolation level.
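
For context, a minimal sketch of the kind of contention the test creates, assuming each of the 4 threads opens a read committed transaction and takes a different explicit row lock on the same key (the table name, key, and lock clause below are illustrative placeholders, not the test's actual statements):

```sh
# Each of the 4 concurrent sessions runs a variant of this, differing only in
# the lock clause (e.g. FOR UPDATE / FOR NO KEY UPDATE / FOR SHARE / FOR KEY SHARE),
# so later sessions wait in the wait queue until the earlier ones commit or abort.
ysqlsh -d transactions_db <<'SQL'
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT * FROM geo_txn_table WHERE id = 1 FOR UPDATE;
-- ... rest of the transaction's work ...
COMMIT;
SQL
```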

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@Karvy-yb Karvy-yb added kind/bug This issue is a bug area/ysql Yugabyte SQL (YSQL) priority/high High Priority QA QA filed bugs status/awaiting-triage Issue awaiting triage qa_automation Bugs identified via itest-system, LST, Stress automation or causing automation failures labels Jul 5, 2023
@rthallamko3

cc @es1024

@rthallamko3

@Karvy-yb, can we enable verbose gflags before running the 4 threads that contend on the same row? cc @es1024 to provide the set of gflags that can help.


Karvy-yb commented Jul 6, 2023

@rthallamko3 sure, let me know the gflags. I'll trigger a run of the test with those gflags and keep_universe enabled.


es1024 commented Jul 11, 2023

@Karvy-yb: set the `vmodule` gflag to `transaction=2,transaction_coordinator=4,conflict_resolution=4`

@Karvy-yb

@es1024 this needs to be set on master/tserver?


es1024 commented Jul 11, 2023

@Karvy-yb just tserver is fine
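
A minimal sketch of how that could be applied, assuming the flag is added directly to the yb-tserver command line (on a managed universe it would typically be set through the platform's tserver gflag configuration instead):

```sh
# Hypothetical example: --fs_data_dirs stands in for the universe's existing
# tserver flags; only the --vmodule value is the actual suggestion from above.
yb-tserver \
  --fs_data_dirs=/mnt/d0 \
  --vmodule=transaction=2,transaction_coordinator=4,conflict_resolution=4
```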

@yugabyte-ci yugabyte-ci added area/docdb YugabyteDB core features and removed area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Jul 22, 2023
@bmatican

@Karvy-yb @es1024 were we ever able to get more signal on this?

@yugabyte-ci yugabyte-ci added priority/medium Medium priority issue and removed priority/high High Priority labels Sep 21, 2023
@yugabyte-ci yugabyte-ci assigned Karvy-yb and unassigned es1024 Sep 21, 2023
@Karvy-yb

@bmatican this is on my TO-DO for this week. Will share the results here soon. (cc: @es1024 )
My apologies for the delay in response, I missed this notification when I came back from my PTO.

@rthallamko3

@Karvy-yb , Any update on this?


Karvy-yb commented Oct 3, 2023


Karvy-yb commented Oct 4, 2023

@rthallamko3 @es1024 The test is failing on master due to some other issue, and I am unable to repro this issue now (with the flags) on the original version mentioned in the ticket. Trying a couple of other things; will update the thread as soon as I have some conclusion.

@yugabyte-ci yugabyte-ci added priority/high High Priority and removed priority/medium Medium priority issue labels Oct 11, 2023
basavaraj29 added a commit that referenced this issue Oct 19, 2023
…involved in write, Ignore statuses from old status tablet on promotion

Summary:
The diff fixes three issues with transaction promotion.

1. `tablets_`, maintained at `YBTransaction::Impl`, is a list of tablets considered "involved" in a transaction, i.e., tablets at which the transaction has successfully processed some sort of write, as determined by the involved `YBOperation`'s `op` (`op.yb_op->applied() && op.yb_op->should_apply_intents(metadata_.isolation)`). The transaction promotion codepath ideally doesn't send promotion requests to tablets that haven't processed a write from this transaction yet.

In the current implementation, we update `num_completed_batches` of an entry of `tablets_` on all successfully completed operations, regardless of whether a write was processed at the tablet. This wrongly affects the transaction promotion codepath, where promotion requests are sent to all tablets in `transaction_status_move_tablets_` (a consistent copy of `tablets_`) that have `num_completed_batches > 0`. When a txn participant sees a promotion request for an unknown transaction, it returns a 40001. So in a highly conflicting workload with read committed isolation, this is frequent and we error out with 40001. With the existing code, we would also hit this error when a promotion request is sent to a tablet that only processed a vanilla read (without `row_mark_type` set).

This diff addresses the issue by updating `num_completed_batches` only when `op.yb_op->applied() && op.yb_op->should_apply_intents(metadata_.isolation)` is true, i.e., when the op successfully wrote something to the tablet.

2. In the existing implementation, when processing the received status of a `RunningTransaction` at the participant, we don't track whether the response came from `old_status_tablet` or the current `status_tablet`. For transaction promotion, this leads to additional issues. Consider a transaction that starts off as a local txn, successfully completes writes at a tablet T, and then undergoes promotion. In the process, an abort is sent to the old status tablet. If the `RunningTransaction` sees this abort, it initiates a cleanup at the participant. And when the promoted transaction later commits, we don't error out, and the updates get undone, leading to data inconsistency.

This diff addresses the issue by tracking the status tablet on which the status request was initiated. On receiving the status response, if we see that the transaction underwent promotion, we return an already existing old status for the transaction instead of using the newly arrived status. Since the status tablet is now updated, subsequent status requests will get the latest state. Additionally, we reject promotion if there's an active request to abort the transaction.

3. At the query layer (`client/transaction.cc`), we could have two active heartbeating threads for a promoted transaction. In the existing implementation, we stop sending requests to the old status tablet once we have sent an abort to it (`state != OldTransactionState::kNone && state != OldTransactionState::kRunning`). But if we receive a response from an already-sent request, we still do the error handling and proactive cleanup (requests to involved txn participants) even if the abort to the old status tablet was already sent (`state == OldTransactionState::kAborting || state == OldTransactionState::kAborted`). This could lead to unnecessary failures for subsequent operations of the transaction.

This diff addresses the issue by dropping the status response if an abort to the old status tablet was already sent.

Additional context: Holding off promotion requests until the tablet has seen a write was introduced in [[ 21c5918 | commit ]]. Earlier, transaction promotion was being retried on failures which was removed in [[ 1ae14b8 | commit ]] (so we don't retry failed promotion anymore).

Test Plan:
Jenkins
./yb_build.sh --cxx-test pgwrapper_geo_transactions_promotion-test --gtest_filter GeoPartitionedReadCommiittedTest.TestPromotionAmidstConflicts -n 20
./yb_build.sh --cxx-test pgwrapper_geo_transactions_promotion-test --gtest_filter GeoPartitionedReadCommiittedTest.TestParticipantIgnoresAbortFromOldStatusTablet

The first test fails consistently without the changes, and it reproduces all three issues in the description.

Elaborating on what the test does: we have a couple of transactions starting off as local transactions by issuing a `PGSQL_WRITE` on a tablet (say T_P1) in partition P1. This is followed by a few more `PGSQL_READ` operations launched against all tablets (2 tablets in P1, 1 tablet in P2, and 1 tablet in P3). These read ops trigger a transaction promotion request. Depending on the order in which these ops get flushed, if the reads to T_P1 and T_P2 get flushed before the promotion requests are sent, the existing code inserts these tablets into the list of "involved tablets". Note that these read ops don't have `row_mark_type` set, so the promotion codepath errors out, returning a 40001 to the backend.

With the changes in the diff, since we now only update `num_completed_batches` when `op.yb_op->applied() && op.yb_op->should_apply_intents(metadata_.isolation)`, we shouldn't run into this issue.

Also put up another, simpler test to validate point 2 in the description; it fails consistently without the current changes in place.
./yb_build.sh --cxx-test pgwrapper_geo_transactions_promotion-test --gtest_filter GeoPartitionedReadCommiittedTest.TestParticipantIgnoresAbortFromOldStatusTablet

Reviewers: sergei, rsami, rthallam, esheng

Reviewed By: sergei, rsami

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D29252
basavaraj29 added a commit that referenced this issue Oct 23, 2023
…blets that are involved in write, Ignore statuses from old status tablet on promotion

Summary:
Original commit: 29bcfd1 / D29252

Reviewers: sergei, rsami, rthallam, esheng

Reviewed By: rthallam

Subscribers: ybase, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29544
basavaraj29 added a commit that referenced this issue Oct 28, 2023
…blets that are involved in write, Ignore statuses from old status tablet on promotion

Summary:
Original commit: 29bcfd1 / D29252

Reviewers: sergei, rsami, rthallam, esheng

Reviewed By: esheng

Subscribers: ybase, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29626
@yugabyte-ci yugabyte-ci reopened this Oct 30, 2023
basavaraj29 added a commit that referenced this issue Oct 31, 2023
…tablets that are involved in write, Ignore statuses from old status tablet on promotion

Summary:
Original commit: 0880949 / D29626

Reviewers: sergei, rsami, rthallam, esheng

Reviewed By: esheng

Subscribers: yql, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29787