
[YSQL][Wait-on-conflict] : Parallel inserts with same primary key don't throw error #19407

Closed · Karvy-yb opened this issue Oct 4, 2023 · 3 comments
Karvy-yb commented Oct 4, 2023

Jira Link: DB-8199

Description

Steps to repro:

  1. Create a universe on DB version 2.21.0.0-b12 with read committed, deadlock detection, and wait-on-conflict enabled
  2. create table tb (k int primary key, v int);
  3. insert into tb values (1,1), (2,2);
  4. Session-1: begin; select; insert into tb values (3,3);
  5. Session-2: insert into tb values (3,9); -- This will wait for txn in session-1
  6. Session-1: commit;

Notice that both inserts go through without throwing any error, and the table ends up with the (3,9) row (a condensed transcript follows below). I don't see this happening with wait queues disabled and read committed enabled.
The issue is also observed at the REPEATABLE READ (RR) isolation level.
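For reference, here is a condensed two-session transcript of the steps above, with expected versus observed results (the exact error text shown is an assumption based on the standard duplicate-key error):

    -- setup
    create table tb (k int primary key, v int);
    insert into tb values (1,1), (2,2);

    -- session 1
    begin;
    select;
    insert into tb values (3,3);

    -- session 2: conflicts with session 1's provisional write and waits
    insert into tb values (3,9);

    -- session 1
    commit;

    -- session 2 resumes; expected:
    --   ERROR:  duplicate key value violates unique constraint "tb_pkey"
    -- observed: the insert succeeds and the table ends up with (3,9)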

Issue Type

kind/bug

Karvy-yb added the kind/bug, area/ysql, priority/high, QA, and status/awaiting-triage labels on Oct 4, 2023
Karvy-yb assigned robertsami and rthallamko3 and unassigned robertsami on Oct 4, 2023
sushantrmishra removed the status/awaiting-triage label on Oct 4, 2023

rthallamko3 commented Oct 4, 2023

@Karvy-yb, can you confirm whether the isolation level is set to read committed on session 2, or whether it is set at the universe level? cc @basavaraj29


Karvy-yb commented Oct 5, 2023

@rthallamko3 the isolation level is set to read committed at the universe level (by setting yb_enable_read_committed_isolation to true), and session 2 uses the default. But note that the behaviour is the same when yb_enable_read_committed_isolation is set to false (in which case repeatable read becomes the default isolation level).
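As an aside, the isolation level a session is actually running under can be confirmed from the session itself. A minimal check, assuming standard PostgreSQL-compatible behaviour in YSQL:

    show transaction_isolation;
    -- returns read committed, repeatable read, or serializable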


pkj415 commented Oct 5, 2023

Root cause:

This happens because, for the single shard INSERT in session 2, the steps occur in the following order:
(1) pick a read time on PgClientSession
(2) do conflict resolution in intents db -> wait in wait queue -> wake up and repeat conflict resolution in intents db
(3) check for a duplicate row among committed data in regular db as of the read time picked in step (1) (see ApplyInsert() in pgsql_operation.cc)

Because the read time was picked on PgClientSession before conflict resolution, the INSERT ends up not seeing the duplicate row.

While we discuss the fix, another thing to note: single shard UPDATEs and DELETEs do not pick a read time on PgClientSession; instead, they pick a read time on the tserver. This difference arises because a single shard INSERT ends up buffering the operation and then flushing it in pg_session.cc, whereas UPDATE and DELETE flush directly without buffering. Because of this buffering, EnsureReadTimeIsSet is set to true for a single shard INSERT.

Perhaps we should do the same (i.e., pick a read time on docdb) for a single shard INSERT because picking the read time on docdb has some advantages: i) no need to wait for safe time to catch up with a read time picked on PgClientSession ii) kReadRestart errors can be handled on docdb itself without going back to the query layer for retries.

Moreover, for single shard operations, there might be other correctness issues if we pick a read time before conflict resolution but don't use it in the conflict resolution itself.
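To make the path distinction concrete, here is a rough SQL-level sketch of which statement shapes take which path, per the analysis above (the exact eligibility rules for the single shard fast path are an assumption here):

    -- outside an explicit transaction, a single-row write can take the
    -- single shard fast path discussed above:
    insert into tb values (3,9);        -- single shard INSERT: read time picked on PgClientSession
    update tb set v = 10 where k = 3;   -- single shard UPDATE: read time picked on the tserver
    delete from tb where k = 3;         -- single shard DELETE: read time picked on the tserver

    -- the same INSERT inside an explicit transaction block runs as part
    -- of a distributed transaction instead:
    begin;
    insert into tb values (3,9);
    commit;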

rthallamko3 assigned pkj415 and unassigned rthallamko3 on Oct 5, 2023
pkj415 added a commit that referenced this issue Oct 19, 2023
…ctional

Summary:
Before this diff, `YBCExecuteInsertForDb()` passes information about
yb_disable_transactional_writes to `YBCExecuteInsertInternal()` by using
the `is_single_row_txn` parameter. Until now, `is_single_row_txn` really
only meant non-transactional (since PgsqlWriteOp sets `need_transaction`
using `!is_single_row_txn_`), so it was okay to piggyback on
`is_single_row_txn` to indicate non-transactional writes (for COPY,
index backfill, or if `yb_disable_transactional_writes` was set).

However, as part of D29133 (which fixes correctness issue #19407), we need
to force non-bufferable operations for any single row txn. This should not
be done for multi-row non-transactional writes (otherwise, unit tests such as
PgOpBufferingTest.FKCheckWithNonTxnWrites will fail). Hence, we need to
stop using `is_single_row_txn` to indicate the non-transactional writes
required by `COPY`, index backfill, or `yb_disable_transactional_writes`.
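For context, `yb_disable_transactional_writes` mentioned above is a YSQL session variable that requests non-transactional (fast-path) writes, typically for bulk loads; a sketch of typical usage, with an illustrative file path:

    -- request non-transactional writes for a bulk load
    -- (the file path is illustrative only)
    set yb_disable_transactional_writes = on;
    copy tb from '/tmp/data.csv' with (format csv);
    set yb_disable_transactional_writes = off;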
Jira: DB-1870

Test Plan:
Jenkins

Tests exercising the non-transactional write paths should still pass:

    ./yb_build.sh --cxx-test pgwrapper_pg_op_buffering-test --gtest_filter="PgOpBufferingTest.FKCheckWithNonTxnWrites"
    ./yb_build.sh --java-test 'org.yb.pgsql.TestBatchCopyFrom#testNonTxnWriteSessionVariable'
    ./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressNonTxnCopy#testPgRegressNonTxnCopy'

Reviewers: jason, dmitry

Reviewed By: jason, dmitry

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D29336
pkj415 added a commit that referenced this issue Oct 20, 2023
…er conflict resolution

Summary:
Single shard UPDATEs and DELETEs pick a read time on docdb after
conflict resolution. However, single shard INSERTs pick a read
time on PgClientSession (same node as the query layer).

Unlike distributed transactions, conflict resolution for
single shard operations in YB is optimized to not check the
regular db for conflicting data that has already been committed. This
is correct as long as we don't pick a read time until
the operation is sure that no conflicts exist and holds
the in-memory locks (in other words, until after conflict resolution
has succeeded).

Since a single shard INSERT picks a read time in the query
layer (i.e., well before conflict resolution), it can
hit a correctness issue via the following interleavings:

In Fail-on-Conflict mode (incorrect behaviour with low probability):
-------------------------------------------------------------------

Assume two single shard INSERTs trying to insert the same row.

(1) Insert 1: pick a read time (rt1=5) on PgClientSession
(2) Insert 2: pick a read time (rt2=6) on PgClientSession
(3) Insert 1: acquire in-memory locks and do conflict checking (only check intents db).
(4) Insert 2: wait for in-memory lock acquisition
(5) Insert 1: check for duplicate row in ApplyInsert(). Since no duplicates exist as of rt1,
write row in regular db with commit timestamp 10.
(6) Insert 2: acquire in-memory locks and do conflict checking (only check intents db).
(7) Insert 2: check for duplicate row in ApplyInsert(). Since no duplicates exist as of rt2,
write row in regular db with commit timestamp 12. (Assume that clock skew is 2 time
units, so no kReadRestart is hit)

To repro this manually, add a 1000 ms sleep at the start of ApplyInsert()
and perform concurrent single shard INSERTs such that the above scenario is hit.
Use a sleep longer than max_clock_skew_usec: with a shorter sleep, Insert 2
would hit a kReadRestart, be retried by the query layer, and fail with the
expected duplicate key error.
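Sketched at the SQL level, assuming both statements run in autocommit mode and take the single shard path:

    -- sessions A and B race on the same primary key, both in autocommit:
    insert into tb values (3,3);   -- session A
    insert into tb values (3,9);   -- session B, started concurrently

    -- expected: exactly one succeeds, the other fails with
    --   ERROR:  duplicate key value violates unique constraint "tb_pkey"
    -- before this fix, both could succeed if the interleaving above lands
    -- within the clock skew window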

In Wait-on-Conflict mode (easier to repro manually):
----------------------------------------------------
[Refer to the newly added test concurrent-inserts-duplicate-key-error]

Assume one single shard INSERT and one distributed transaction INSERT trying to
insert the same row.

(1) Insert 1 (distributed): pick a read time (rt1=5) on PgClientSession
(2) Insert 2 (single shard): pick a read time (rt2=6) on PgClientSession
(3) Insert 1: acquire in-memory locks and do conflict checking (check intents db & regular db).
(4) Insert 1: check for duplicate row in ApplyInsert(). Since no duplicates exist as of rt1,
write row in intents db.
(5) Insert 2: acquire in-memory locks, find conflicting intent, release in-memory locks and
enter wait queue
(6) Insert 1: `commit;` of distributed txn writes data to regular db with commit timestamp 10.
(7) Insert 2: wake up from wait queue, acquire in-memory locks again, check for conflicting
intents again, none will be found.
(8) Insert 2: check for duplicate row in ApplyInsert(). Since no duplicates exist as of rt2,
write row in regular db with commit timestamp 12.

The crux of the issue is: since the read time for a single shard INSERT
is picked before conflict resolution, it will miss reading rows in regular db
that are concurrently committed by another transaction with a timestamp
higher than the read time.

Solution:
---------

(1) Add a check in docdb to error out if a read time exists before conflict
resolution in the single shard operation path.

(2) For single shard UPDATEs and DELETEs, we pick a read time on docdb
because EnsureReadTimeIsSet is false in pg_session.cc for these. For
single shard INSERTs we currently set EnsureReadTimeIsSet to true, which
is not necessary. Changing it to false fixes the above issue because the
read time will now be picked on docdb after conflict resolution is done.

Picking the read time on docdb has some extra advantages too:

(i) not having to wait for the safe time to catch up to a read time picked
on another node's PgClientSession.

(ii) docdb can retry various errors (such as kReadRestart, kConflict) itself
without going back to the query layer.

Apart from fixing a correctness issue, this change is similar to two earlier
commits that ensure we pick read times on docdb in as many cases as
possible: 8166695 and b223af9.
Jira: DB-8199

Test Plan:
Jenkins: test regex: .*TestPgWaitQueuesRegress.*

    ./yb_build.sh --java-test org.yb.pgsql.TestPgIsolationRegress (added concurrent-inserts-duplicate-key-error in this)
    ./yb_build.sh --cxx-test pgwrapper_pg_read_time-test --gtest_filter PgReadTimeTest.CheckReadTimePickingLocation

Reviewers: dmitry, bkolagani, rsami, sergei, tvesely

Reviewed By: dmitry, rsami

Subscribers: bogdan, ybase, bkolagani, rsami, yql

Differential Revision: https://phorge.dev.yugabyte.com/D29133
pkj415 added a commit that referenced this issue Oct 20, 2023
…m is_non_transactional

Same commit message as the D29336 commit above (original commit: 3890265 / D29336). Differential Revision: https://phorge.dev.yugabyte.com/D29523
pkj415 added a commit that referenced this issue Oct 23, 2023
…d time only after conflict resolution

Same commit message as the D29133 commit above (original commit: fc21068 / D29133). Differential Revision: https://phorge.dev.yugabyte.com/D29559
pkj415 added three commits that referenced this issue Nov 1, 2023
…m is_non_transactional

Same commit message as the D29336 commit above (original commit: 3890265 / D29336). Differential Revisions: https://phorge.dev.yugabyte.com/D29678, https://phorge.dev.yugabyte.com/D29680, https://phorge.dev.yugabyte.com/D29709
pkj415 added three commits that referenced this issue Nov 8, 2023
…d time only after conflict resolution

Same commit message as the D29133 commit above (original commit: fc21068 / D29133). Differential Revisions: https://phorge.dev.yugabyte.com/D29854, https://phorge.dev.yugabyte.com/D29856, https://phorge.dev.yugabyte.com/D29857
pkj415 closed this as completed Nov 8, 2023