[flaky test] org.yb.pgsql.TestPgRegressFeature.testPgRegressFeature #11645

Closed

bmatican opened this issue Mar 3, 2022 · 2 comments
Labels: area/ysql (Yugabyte SQL), kind/bug, kind/failing-test, priority/high

Comments

bmatican commented Mar 3, 2022

@bmatican bmatican added the area/ysql Yugabyte SQL (YSQL) label Mar 3, 2022
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Jun 8, 2022
@yugabyte-ci yugabyte-ci added the kind/failing-test Tests and testing infra label Sep 5, 2022
@m-iancu m-iancu assigned myang2021 and unassigned m-iancu and sushantrmishra Sep 7, 2022
@yugabyte-ci yugabyte-ci added priority/high High Priority and removed priority/medium Medium priority issue labels Sep 7, 2022
myang2021 commented Sep 8, 2022

There are two types of errors:

(1) Create database timed out
(2) Query error: Restart read required

For (1), the error only appeared in debug builds. Samples:

ts1|pid4892|:27485 W0907 02:06:21.790311  4929 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:47425 => 127.212.99.145:27485 (request call id 53) took 57375ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:07:12.885120  4930 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:47425 => 127.212.99.145:27485 (request call id 80) took 47181ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:11:01.036902  8142 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 59) took 45512ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:12:00.640125  7352 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 91) took 50357ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:12:59.858043  7345 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 104) took 58888ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:13:51.610272  7346 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 118) took 51351ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:14:53.788679  7347 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 131) took 60007ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:15:59.452118  7352 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 145) took 60004ms (client timeout 60000ms).

Note the last 2 have timed out; the others are not far below the 60000ms timeout threshold. Because a debug build runs slower than a release build, we should increase the timeout limit from 60000ms to, say, 120000ms. We already do the same for asan/tsan build types (the RPC timeout limit is tripled to 180000ms).

myang2021 added a commit that referenced this issue Sep 16, 2022
…estPgRegressFeature

Summary:
The test TestPgRegressFeature.testPgRegressFeature is flaky, and from the stability page I saw
that one of the reasons is `Timed out waiting for Create Database`.

After looking into several failed instances, I found that all of them used the debug build.

Some CreateDatabase RPC samples from one failed Jenkins test:

```
ts1|pid4892|:27485 W0907 02:06:21.790311  4929 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:47425 => 127.212.99.145:27485 (request call id 53) took 57375ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:07:12.885120  4930 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:47425 => 127.212.99.145:27485 (request call id 80) took 47181ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:11:01.036902  8142 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 59) took 45512ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:12:00.640125  7352 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 91) took 50357ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:12:59.858043  7345 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 104) took 58888ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:13:51.610272  7346 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 118) took 51351ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:14:53.788679  7347 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 131) took 60007ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:15:59.452118  7352 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 145) took 60004ms (client timeout 60000ms).
```

Note the last 2 have timed out; the others are not far below the 60000ms timeout threshold.
Because a debug build runs slower than a release build, we should increase the timeout limit from
60000ms to, say, 120000ms. We already do the same for asan/tsan build types (the RPC timeout limit
is tripled to 180000ms).

I made a change to use a separate timeout for the CreateDatabase RPC call and to take the debug
build type into consideration when deciding the new CreateDatabase RPC timeout limit (a rough
sketch follows).
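
As a rough illustration of that change, here is a minimal C++ sketch of build-type-aware timeout
selection; the function name and the boolean parameters are hypothetical stand-ins, not the actual
YugabyteDB code (only the 60000/120000/180000ms values come from the discussion above):

```cpp
#include <chrono>

// Sketch: pick the CreateDatabase RPC timeout based on the build type.
// Hypothetical helper; the real change plumbs this through test flags.
std::chrono::milliseconds CreateDatabaseRpcTimeout(bool debug_build,
                                                   bool sanitizer_build) {
  const auto base = std::chrono::milliseconds(60000);
  if (sanitizer_build) {
    return base * 3;  // asan/tsan: 180000ms, matching the existing policy.
  }
  if (debug_build) {
    return base * 2;  // debug: 120000ms, per the proposal above.
  }
  return base;        // release: keep the 60000ms default.
}
```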

I also split out a new test, TestPgRegressFeatureDatabase, to further improve test stability, as
the original test itself can occasionally run over 1800 seconds and fail.

Test Plan:
./yb_build.sh debug --java-test 'org.yb.pgsql.TestPgRegressFeature' --tp 1 -n 5
./yb_build.sh debug --java-test 'org.yb.pgsql.TestPgRegressFeatureDatabase' --tp 1 -n 5

No longer see the timeout error in CreateDatabase.

Reviewers: tverona, zyu

Reviewed By: zyu

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D19424
@myang2021
For the second type of error, the following is what I have found.

After looking into several failed instances, I extracted a test case that can reproduce the
bug in repeated runs:

```
create table atacc1 (test int not null);
alter table atacc1 alter column test drop not null;
insert into atacc1 values (null);
alter table atacc1 alter test set not null; -- statement A1
ERROR:  column "test" contains null values
delete from atacc1;
alter table atacc1 alter test set not null; -- statement A2
ERROR:  Query error: Restart read required at: { read: { days: 19249 time: 19:27:45.787447 }
local_limit: { days: 19249 time: 19:27:45.787447 } global_limit: <min> in_txn_limit: <max>
serial_no: 0 }
```

Notice that the statement `alter table atacc1 alter test set not null` is executed twice. For the
first execution (A1) we see an expected ERROR. However, for the second execution (A2) the
`Restart read required` ERROR is unexpected. It is caused by a race condition, described below.

When executing statement A1, a ddl transaction is started on the tserver side. Next, PG needs to
verify that the table `atacc1` does not contain any null values, so PG runs a sequential scan of
the table `atacc1`. During the sequential scan, YB makes `Perform` RPC calls to the tserver. YB
has an optimization called pre-fetching: before returning the current batch of rows to the PG
layer, another `Perform` RPC is issued to fetch the next batch of rows. This optimization speeds
up the scan: while the PG layer is verifying the rows already fetched, the next batch of rows is
fetched concurrently (see the sketch below).
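
To make the pre-fetching pattern concrete, here is a minimal, self-contained C++ sketch of the
idea; `FetchBatchRpc` and `ProcessBatch` are hypothetical stand-ins for the `Perform` RPC and the
PG-layer verification, not the actual YugabyteDB code:

```cpp
#include <future>
#include <vector>

using Batch = std::vector<int>;  // stand-in for a batch of rows

// Stand-in for one Perform RPC that fetches batch `i` (hypothetical).
Batch FetchBatchRpc(int i) { return Batch(100, i); }

// Stand-in for PG-layer verification of a fetched batch (hypothetical).
void ProcessBatch(const Batch& /*rows*/) {}

// The RPC for the next batch is issued before the current batch is
// processed, so fetching and verification overlap.
void ScanWithPrefetch(int num_batches) {
  std::future<Batch> next = std::async(std::launch::async, FetchBatchRpc, 0);
  for (int i = 0; i < num_batches; ++i) {
    Batch current = next.get();  // wait for the previously requested batch
    if (i + 1 < num_batches) {
      next = std::async(std::launch::async, FetchBatchRpc, i + 1);  // prefetch
    }
    ProcessBatch(current);  // verification runs while the prefetch is in flight
  }
}
```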

Because the table `atacc1` does contain a null value, the verification fails and statement A1 is
aborted before all the batches are fetched. As part of that, the ddl transaction is also aborted
via a `FinishTransaction` RPC with the ABORT argument from PG to the tserver. Once the tserver
receives `FinishTransaction` with the ABORT argument, it clears the transaction from the ddl
session and then aborts the ddl transaction. The relevant code:

```
  const auto txn_value = std::move(txn);
  Session(kind)->SetTransaction(nullptr);
```

and

```
  VLOG_WITH_PREFIX_AND_FUNC(2)
      << "ddl: " << req.ddl_mode() << ", txn: " << txn_value->id() << ", abort";
  txn_value->Abort();
```

There is a race condition between the pre-fetching `Perform` RPC and the `FinishTransaction` RPC
because pre-fetching is done asynchronously in a different thread of the PG process.

(1) If the tserver receives `Perform` first, the next `FinishTransaction` properly clears the ddl
transaction from the ddl session and aborts it, as shown above. This is the common case.
(2) If the tserver receives `FinishTransaction` first, it leads to the unexpected `Restart read
required` error, for the following reason.

After `FinishTransaction` terminates the ddl transaction of statement A1, the pre-fetching
`Perform` RPC (which belonged to the just-terminated ddl transaction of statement A1) arrives. It
re-establishes a new ddl transaction because it has `ddl_mode` set to true in its
`PgPerformOptionsPB`. This newly created ddl transaction gets a read point T1 when executing the
`Perform` operation. Following that, the statement `delete from atacc1` is executed, which causes
a write to table `atacc1` with a commit time of T2. Note that T1 < T2. Then, on the tserver side,
statement A2 reuses the newly created ddl transaction with read point T1. Because T1 < T2, in
docdb `DocRowwiseIterator::RestartReadHt` returns T2 instead of `HybridTime::kInvalid`, eventually
leading to the `Restart read required` error:

```
  if (max_seen_ht.is_valid() && max_seen_ht > db_iter_->read_time().read) {
    VLOG(4) << "Restart read: " << max_seen_ht << ", original: " << db_iter_->read_time();
    return max_seen_ht;
  }
  return HybridTime::kInvalid;
```

For non-ddl transactions, we use `txn_serial_no` to detect PG/tserver transaction mismatches
and abort the previous transaction. For ddl transactions, we do not use `txn_serial_no`.

We can add a similar `ddl_txn_serial_no` to detect PG/tserver ddl transaction mismatches and
abort the previous ddl transaction (a sketch follows).
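
A minimal C++ sketch of how such a check could look, in the spirit of the existing
`txn_serial_no` handling; every name below is hypothetical and none of it is the actual session
code:

```cpp
#include <cstdint>

// Hypothetical: a mismatch between the serial number carried by an incoming
// ddl-mode RPC and the one recorded for the current ddl transaction means the
// tserver is holding a stale transaction that must be aborted, not reused.
class DdlSessionSketch {
 public:
  // Called for every ddl-mode RPC with the serial number PG attached to it.
  void OnDdlRpc(uint64_t rpc_serial_no) {
    if (has_txn_ && rpc_serial_no != ddl_txn_serial_no_) {
      AbortCurrentDdlTransaction();  // drop the leftover transaction (and its
                                     // stale read point) from statement A1
    }
    if (!has_txn_) {
      StartNewDdlTransaction();  // fresh transaction, fresh read point
      ddl_txn_serial_no_ = rpc_serial_no;
    }
  }

 private:
  void AbortCurrentDdlTransaction() { has_txn_ = false; }
  void StartNewDdlTransaction() { has_txn_ = true; }

  bool has_txn_ = false;
  uint64_t ddl_txn_serial_no_ = 0;
};
```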

myang2021 added a commit that referenced this issue Oct 19, 2022
…ing YB ddl transaction

Summary:
The test TestPgRegressFeature.testPgRegressFeature is flaky, and from the stability page I saw
that one of the two reasons is `Restart read required`.

After looking into several failed instances, I extracted a test case that can reproduce the
bug in repeated runs:
```
create table atacc1 (test int not null);
alter table atacc1 alter column test drop not null;
insert into atacc1 values (null);
alter table atacc1 alter test set not null; -- statement A1
ERROR:  column "test" contains null values
delete from atacc1;
alter table atacc1 alter test set not null; -- statement A2
ERROR:  Query error: Restart read required at: { read: { days: 19249 time: 19:27:45.787447 }
local_limit: { days: 19249 time: 19:27:45.787447 } global_limit: <min> in_txn_limit: <max>
serial_no: 0 }

```

Notice that the statement `alter table atacc1 alter test set not null` is executed twice. For the
first execution (A1) we see an expected ERROR. However, for the second execution (A2) the `Restart
read required` ERROR is unexpected. It is caused by a race condition, described below.

When executing statement A1, a ddl transaction is started on the tserver side. Next, PG needs to
verify that the table `atacc1` does not contain any null values, so PG runs a sequential scan of
the table `atacc1`. During the sequential scan, YB makes `Perform` RPC calls to the tserver. YB
has an optimization called pre-fetching: before returning the current batch of rows to the PG
layer, another `Perform` RPC is issued to fetch the next batch of rows. This optimization speeds
up the scan: while the PG layer is verifying the rows already fetched, the next batch of rows is
fetched concurrently.

Because the table `atacc1` does contain a null value, the verification fails and statement A1 is
aborted before all the batches are fetched. As part of that, the ddl transaction is also aborted
via a `FinishTransaction` RPC with the ABORT argument from PG to the tserver. Once the tserver
receives `FinishTransaction` with the ABORT argument, it clears the transaction from the ddl
session and then aborts the ddl transaction. The relevant code:

```
  const auto txn_value = std::move(txn);
  Session(kind)->SetTransaction(nullptr);
```
and

```
  VLOG_WITH_PREFIX_AND_FUNC(2)
      << "ddl: " << req.ddl_mode() << ", txn: " << txn_value->id() << ", abort";
  txn_value->Abort();
```

There is a race condition between the pre-fetching `Perform` RPC and the `FinishTransaction` RPC
because pre-fetching is done asynchronously in a different thread of the PG process.

(1) If the tserver receives `Perform` first, the next `FinishTransaction` properly clears the ddl
transaction from the ddl session and aborts it, as shown above. This is the common case.
(2) If the tserver receives `FinishTransaction` first, it leads to the unexpected `Restart read
required` error, for the following reason.

After `FinishTransaction` terminates the ddl transaction of statement A1, the pre-fetching
`Perform` RPC (which belonged to the just-terminated ddl transaction of statement A1) arrives. It
re-establishes a new ddl transaction because it has `ddl_mode` set to true in its
`PgPerformOptionsPB`. This newly created ddl transaction gets a read point T1 when executing the
`Perform` operation. Following that, the statement `delete from atacc1` is executed, which causes
a write to table `atacc1` with a commit time of T2. Note that T1 < T2. Then, on the tserver side,
statement A2 reuses the newly created ddl transaction with read point T1. Because T1 < T2, in
docdb `DocRowwiseIterator::RestartReadHt` returns T2 instead of `HybridTime::kInvalid`, eventually
leading to the `Restart read required` error:

```
  if (max_seen_ht.is_valid() && max_seen_ht > db_iter_->read_time().read) {
    VLOG(4) << "Restart read: " << max_seen_ht << ", original: " << db_iter_->read_time();
    return max_seen_ht;
  }
  return HybridTime::kInvalid;
```

I made changes to prevent the race condition by waiting for all outstanding `Perform` operations
to complete their RPCs before aborting the ddl transaction (a sketch of the idea follows).
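
The following is a minimal, self-contained C++ sketch of that drain-before-abort idea; the class
and method names are illustrative, not the actual change in the PG client layer:

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical tracker: count in-flight Perform RPCs and let the abort path
// block until all of them have completed, so that a late response cannot
// re-create the ddl transaction after it has been aborted.
class InFlightPerformTracker {
 public:
  void OnPerformStarted() {
    std::lock_guard<std::mutex> lock(mu_);
    ++in_flight_;
  }

  void OnPerformFinished() {
    std::lock_guard<std::mutex> lock(mu_);
    if (--in_flight_ == 0) {
      cv_.notify_all();
    }
  }

  // Called on the abort path before FinishTransaction is sent.
  void WaitForAllPerforms() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return in_flight_ == 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int in_flight_ = 0;
};
```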

Test Plan:
./yb_build.sh debug --java-test 'org.yb.pgsql.TestPgRegressFeature' --tp 1 -n 5

No longer see the `Restart read required` error.

Reviewers: tverona, mihnea, pjain, dmitry

Reviewed By: pjain, dmitry

Subscribers: yql, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D19612
myang2021 added a commit that referenced this issue Oct 26, 2022
Summary:
The test yb_feature_temp in TestPgRegressFeature.testPgRegressFeature is flaky and can fail with
```
java.lang.AssertionError: pg_regress exited with error code: 1, failed tests: [yb_feature_temp]
```

After debugging, I found that the following output isn't stable:

```
\d temptest
            Table "pg_temp_2.temptest"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 k      | integer |           | not null |
 v1     | integer |           |          |
 v2     | integer |           |          |
Indexes:
    "temptest_pkey" PRIMARY KEY, btree (k ASC)
    "temptest_v1_idx" UNIQUE, btree (v1 ASC)
    "temptest_v2_idx" hash (v2)
```

The temporary schema name may be `pg_temp_1`, `pg_temp_2`, etc. As a result, the above output
isn't stable.

I made a change to make the test output stable by (see the sketch after this list):
(1) storing the output of `\d temptest` in a file that is auto-removed as part of test cleanup
    after the test completes,
(2) reading the file back into a ysqlsh variable as a text value, and
(3) using regexp_replace to change pg_temp_1, pg_temp_2, or the like to pg_temp_xxx.
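
As an illustration of those three steps, a minimal ysqlsh sketch; the file path and the variable
name are hypothetical, and the real test relies on its own cleanup mechanism for the file:

```sql
-- (1) Redirect the \d output to a scratch file (hypothetical path).
\o /tmp/temptest_describe.out
\d temptest
\o
-- (2) Read the file back into a ysqlsh variable as text.
\set describe_output `cat /tmp/temptest_describe.out`
-- (3) Normalize the temporary schema name so the output is stable.
SELECT regexp_replace(:'describe_output', 'pg_temp_\d+', 'pg_temp_xxx', 'g');
```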

Test Plan:
./yb_build.sh debug --java-test 'org.yb.pgsql.TestPgRegressFeature' --tp 1 -n 5

No longer see
```
java.lang.AssertionError: pg_regress exited with error code: 1, failed tests: [yb_feature_temp]
```

Reviewers: jason, tverona, dsrinivasan

Reviewed By: dsrinivasan

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D20465