[flaky test] org.yb.pgsql.TestPgRegressFeature.testPgRegressFeature #11645

Closed

bmatican opened this issue Mar 3, 2022 · 2 comments
Labels: area/ysql (Yugabyte SQL), kind/bug, kind/failing-test, priority/high

Comments

bmatican commented Mar 3, 2022

@bmatican bmatican added the area/ysql Yugabyte SQL (YSQL) label Mar 3, 2022
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Jun 8, 2022
@yugabyte-ci yugabyte-ci added the kind/failing-test Tests and testing infra label Sep 5, 2022
@m-iancu m-iancu assigned myang2021 and unassigned m-iancu and sushantrmishra Sep 7, 2022
@yugabyte-ci yugabyte-ci added priority/high High Priority and removed priority/medium Medium priority issue labels Sep 7, 2022
myang2021 commented Sep 8, 2022

There are two types of errors:

(1) Create database timed out
(2) Query error: Restart read required

For (1), the error only appeared in debug builds. Samples:

ts1|pid4892|:27485 W0907 02:06:21.790311  4929 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:47425 => 127.212.99.145:27485 (request call id 53) took 57375ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:07:12.885120  4930 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:47425 => 127.212.99.145:27485 (request call id 80) took 47181ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:11:01.036902  8142 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 59) took 45512ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:12:00.640125  7352 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 91) took 50357ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:12:59.858043  7345 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 104) took 58888ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:13:51.610272  7346 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 118) took 51351ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:14:53.788679  7347 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 131) took 60007ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:15:59.452118  7352 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 145) took 60004ms (client timeout 60000ms).

Note the last 2 have timed out; the others are not far below the 60000ms timeout threshold. Because a debug build runs slower than a release build, we should increase the timeout limit from 60000ms to, say, 120000ms. We already do the same for asan/tsan build types (the RPC timeout limit is tripled to 180000ms).

myang2021 added a commit that referenced this issue Sep 16, 2022
…estPgRegressFeature

Summary:
The test TestPgRegressFeature.testPgRegressFeature is flaky, and from the stability page I saw
that one of the reasons is `Timed out waiting for Create Database`.

After looking into several failed instances, I found that all of them used the debug build.

Some CreateDatabase RPC samples from one failed Jenkins test:

```
ts1|pid4892|:27485 W0907 02:06:21.790311  4929 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:47425 => 127.212.99.145:27485 (request call id 53) took 57375ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:07:12.885120  4930 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:47425 => 127.212.99.145:27485 (request call id 80) took 47181ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:11:01.036902  8142 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 59) took 45512ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:12:00.640125  7352 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 91) took 50357ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:12:59.858043  7345 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 104) took 58888ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:13:51.610272  7346 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 118) took 51351ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:14:53.788679  7347 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 131) took 60007ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:15:59.452118  7352 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 145) took 60004ms (client timeout 60000ms).
```

Note the last 2 have timed out; the others are not far below the 60000ms timeout threshold.
Because a debug build runs slower than a release build, we should increase the timeout limit from
60000ms to, say, 120000ms. We already do the same for asan/tsan build types (the RPC timeout limit
is tripled to 180000ms).

I made a change to use a separate timeout for the CreateDatabase RPC call and to take the debug
build type into consideration when deciding the new CreateDatabase RPC timeout limit (a rough
sketch follows).
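
As a rough illustration of that change, here is a minimal C++ sketch of build-type-aware timeout
selection; the function name and the boolean parameters are hypothetical stand-ins, not the actual
YugabyteDB code (only the 60000/120000/180000ms values come from the discussion above):

```cpp
#include <chrono>

// Sketch: pick the CreateDatabase RPC timeout based on the build type.
// Hypothetical helper; the real change plumbs this through test flags.
std::chrono::milliseconds CreateDatabaseRpcTimeout(bool debug_build,
                                                   bool sanitizer_build) {
  const auto base = std::chrono::milliseconds(60000);
  if (sanitizer_build) {
    return base * 3;  // asan/tsan: 180000ms, matching the existing policy.
  }
  if (debug_build) {
    return base * 2;  // debug: 120000ms, per the proposal above.
  }
  return base;        // release: keep the 60000ms default.
}
```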

I also split out a new test, TestPgRegressFeatureDatabase, to further improve test stability, as
the original test itself can occasionally run over 1800 seconds and fail.

Test Plan:
./yb_build.sh debug --java-test 'org.yb.pgsql.TestPgRegressFeature' --tp 1 -n 5
./yb_build.sh debug --java-test 'org.yb.pgsql.TestPgRegressFeatureDatabase' --tp 1 -n 5

No longer see the timeout error in CreateDatabase.

Reviewers: tverona, zyu

Reviewed By: zyu

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D19424
@myang2021
For the second type of error, the following is what I have found.

After looking into several failed instances, I extracted a test case that can reproduce the
bug in repeated runs:

```
create table atacc1 (test int not null);
alter table atacc1 alter column test drop not null;
insert into atacc1 values (null);
alter table atacc1 alter test set not null; -- statement A1
ERROR:  column "test" contains null values
delete from atacc1;
alter table atacc1 alter test set not null; -- statement A2
ERROR:  Query error: Restart read required at: { read: { days: 19249 time: 19:27:45.787447 }
local_limit: { days: 19249 time: 19:27:45.787447 } global_limit: <min> in_txn_limit: <max>
serial_no: 0 }
```

Notice that the statement `alter table atacc1 alter test set not null` is executed twice. For the
first execution (A1) we see an expected ERROR. However, for the second execution (A2) the
`Restart read required` ERROR is unexpected. It is caused by a race condition, described below.

When executing statement A1, a ddl transaction is started on the tserver side. Next, PG needs to
verify that the table `atacc1` does not contain any null values, so PG runs a sequential scan of
the table `atacc1`. During the sequential scan, YB makes `Perform` RPC calls to the tserver. YB
has an optimization called pre-fetching: before returning the current batch of rows to the PG
layer, another `Perform` RPC is issued to fetch the next batch of rows. This optimization speeds
up the scan: while the PG layer is verifying the rows already fetched, the next batch of rows is
fetched concurrently (see the sketch below).
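
To make the pre-fetching pattern concrete, here is a minimal, self-contained C++ sketch of the
idea; `FetchBatchRpc` and `ProcessBatch` are hypothetical stand-ins for the `Perform` RPC and the
PG-layer verification, not the actual YugabyteDB code:

```cpp
#include <future>
#include <vector>

using Batch = std::vector<int>;  // stand-in for a batch of rows

// Stand-in for one Perform RPC that fetches batch `i` (hypothetical).
Batch FetchBatchRpc(int i) { return Batch(100, i); }

// Stand-in for PG-layer verification of a fetched batch (hypothetical).
void ProcessBatch(const Batch& /*rows*/) {}

// The RPC for the next batch is issued before the current batch is
// processed, so fetching and verification overlap.
void ScanWithPrefetch(int num_batches) {
  std::future<Batch> next = std::async(std::launch::async, FetchBatchRpc, 0);
  for (int i = 0; i < num_batches; ++i) {
    Batch current = next.get();  // wait for the previously requested batch
    if (i + 1 < num_batches) {
      next = std::async(std::launch::async, FetchBatchRpc, i + 1);  // prefetch
    }
    ProcessBatch(current);  // verification runs while the prefetch is in flight
  }
}
```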

Because the table `atacc1` does contain a null value, the verification fails and statement A1 is
aborted before all the batches are fetched. As part of that, the ddl transaction is also aborted
via a `FinishTransaction` RPC with the ABORT argument from PG to the tserver. Once the tserver
receives `FinishTransaction` with the ABORT argument, it clears the transaction from the ddl
session and then aborts the ddl transaction. The relevant code:

```
  const auto txn_value = std::move(txn);
  Session(kind)->SetTransaction(nullptr);
```

and

```
  VLOG_WITH_PREFIX_AND_FUNC(2)
      << "ddl: " << req.ddl_mode() << ", txn: " << txn_value->id() << ", abort";
  txn_value->Abort();
```

There is a race condition between the pre-fetching `Perform` RPC and the `FinishTransaction` RPC
because pre-fetching is done asynchronously in a different thread of the PG process.

(1) If the tserver receives `Perform` first, the next `FinishTransaction` properly clears the ddl
transaction from the ddl session and aborts it, as shown above. This is the common case.
(2) If the tserver receives `FinishTransaction` first, it leads to the unexpected `Restart read
required` error, for the following reason.

After `FinishTransaction` terminates the ddl transaction of statement A1, the pre-fetching
`Perform` RPC (which belonged to the just-terminated ddl transaction of statement A1) arrives. It
re-establishes a new ddl transaction because it has `ddl_mode` set to true in its
`PgPerformOptionsPB`. This newly created ddl transaction gets a read point T1 when executing the
`Perform` operation. Following that, the statement `delete from atacc1` is executed, which causes
a write to table `atacc1` with a commit time of T2. Note that T1 < T2. Then, on the tserver side,
statement A2 reuses the newly created ddl transaction with read point T1. Because T1 < T2, in
docdb `DocRowwiseIterator::RestartReadHt` returns T2 instead of `HybridTime::kInvalid`, eventually
leading to the `Restart read required` error:

```
  if (max_seen_ht.is_valid() && max_seen_ht > db_iter_->read_time().read) {
    VLOG(4) << "Restart read: " << max_seen_ht << ", original: " << db_iter_->read_time();
    return max_seen_ht;
  }
  return HybridTime::kInvalid;
```

For non-ddl transactions, we use `txn_serial_no` to detect PG/tserver transaction mismatches
and abort the previous transaction. For ddl transactions, we do not use `txn_serial_no`.

We can add a similar `ddl_txn_serial_no` to detect PG/tserver ddl transaction mismatches and
abort the previous ddl transaction (a sketch follows).
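
A minimal C++ sketch of how such a check could look, in the spirit of the existing
`txn_serial_no` handling; every name below is hypothetical and none of it is the actual session
code:

```cpp
#include <cstdint>

// Hypothetical: a mismatch between the serial number carried by an incoming
// ddl-mode RPC and the one recorded for the current ddl transaction means the
// tserver is holding a stale transaction that must be aborted, not reused.
class DdlSessionSketch {
 public:
  // Called for every ddl-mode RPC with the serial number PG attached to it.
  void OnDdlRpc(uint64_t rpc_serial_no) {
    if (has_txn_ && rpc_serial_no != ddl_txn_serial_no_) {
      AbortCurrentDdlTransaction();  // drop the leftover transaction (and its
                                     // stale read point) from statement A1
    }
    if (!has_txn_) {
      StartNewDdlTransaction();  // fresh transaction, fresh read point
      ddl_txn_serial_no_ = rpc_serial_no;
    }
  }

 private:
  void AbortCurrentDdlTransaction() { has_txn_ = false; }
  void StartNewDdlTransaction() { has_txn_ = true; }

  bool has_txn_ = false;
  uint64_t ddl_txn_serial_no_ = 0;
};
```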

myang2021 added a commit that referenced this issue Oct 19, 2022
…ing YB ddl transaction

Summary:
The test TestPgRegressFeature.testPgRegressFeature is flaky, and from the stability page I saw
that one of the two reasons is `Restart read required`.

After looking into several failed instances, I extracted a test case that can reproduce the
bug in repeated runs:
```
create table atacc1 (test int not null);
alter table atacc1 alter column test drop not null;
insert into atacc1 values (null);
alter table atacc1 alter test set not null; -- statement A1
ERROR:  column "test" contains null values
delete from atacc1;
alter table atacc1 alter test set not null; -- statement A2
ERROR:  Query error: Restart read required at: { read: { days: 19249 time: 19:27:45.787447 }
local_limit: { days: 19249 time: 19:27:45.787447 } global_limit: <min> in_txn_limit: <max>
serial_no: 0 }

```

Notice that the statement `alter table atacc1 alter test set not null` is executed twice. For the
first execution (A1) we see an expected ERROR. However, for the second execution (A2) the `Restart
read required` ERROR is unexpected. It is caused by a race condition, described below.

When executing statement A1, a ddl transaction is started on the tserver side. Next, PG needs to
verify that the table `atacc1` does not contain any null values, so PG runs a sequential scan of
the table `atacc1`. During the sequential scan, YB makes `Perform` RPC calls to the tserver. YB
has an optimization called pre-fetching: before returning the current batch of rows to the PG
layer, another `Perform` RPC is issued to fetch the next batch of rows. This optimization speeds
up the scan: while the PG layer is verifying the rows already fetched, the next batch of rows is
fetched concurrently.

Because the table `atacc1` does contain a null value, the verification fails and statement A1 is
aborted before all the batches are fetched. As part of that, the ddl transaction is also aborted
via a `FinishTransaction` RPC with the ABORT argument from PG to the tserver. Once the tserver
receives `FinishTransaction` with the ABORT argument, it clears the transaction from the ddl
session and then aborts the ddl transaction. The relevant code:

```
  const auto txn_value = std::move(txn);
  Session(kind)->SetTransaction(nullptr);
```
and

```
  VLOG_WITH_PREFIX_AND_FUNC(2)
      << "ddl: " << req.ddl_mode() << ", txn: " << txn_value->id() << ", abort";
  txn_value->Abort();
```

There is a race condition between the pre-fetching `Perform` RPC and the `FinishTransaction` RPC
because pre-fetching is done asynchronously in a different thread of the PG process.

(1) If the tserver receives `Perform` first, the next `FinishTransaction` properly clears the ddl
transaction from the ddl session and aborts it, as shown above. This is the common case.
(2) If the tserver receives `FinishTransaction` first, it leads to the unexpected `Restart read
required` error, for the following reason.

After `FinishTransaction` terminates the ddl transaction of statement A1, the pre-fetching
`Perform` RPC (which belonged to the just-terminated ddl transaction of statement A1) arrives. It
re-establishes a new ddl transaction because it has `ddl_mode` set to true in its
`PgPerformOptionsPB`. This newly created ddl transaction gets a read point T1 when executing the
`Perform` operation. Following that, the statement `delete from atacc1` is executed, which causes
a write to table `atacc1` with a commit time of T2. Note that T1 < T2. Then, on the tserver side,
statement A2 reuses the newly created ddl transaction with read point T1. Because T1 < T2, in
docdb `DocRowwiseIterator::RestartReadHt` returns T2 instead of `HybridTime::kInvalid`, eventually
leading to the `Restart read required` error:

```
  if (max_seen_ht.is_valid() && max_seen_ht > db_iter_->read_time().read) {
    VLOG(4) << "Restart read: " << max_seen_ht << ", original: " << db_iter_->read_time();
    return max_seen_ht;
  }
  return HybridTime::kInvalid;
```

I made changes to prevent the race condition by waiting for all outstanding `Perform` operations
to complete their RPCs before aborting the ddl transaction (a sketch of the idea follows).
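
The following is a minimal, self-contained C++ sketch of that drain-before-abort idea; the class
and method names are illustrative, not the actual change in the PG client layer:

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical tracker: count in-flight Perform RPCs and let the abort path
// block until all of them have completed, so that a late response cannot
// re-create the ddl transaction after it has been aborted.
class InFlightPerformTracker {
 public:
  void OnPerformStarted() {
    std::lock_guard<std::mutex> lock(mu_);
    ++in_flight_;
  }

  void OnPerformFinished() {
    std::lock_guard<std::mutex> lock(mu_);
    if (--in_flight_ == 0) {
      cv_.notify_all();
    }
  }

  // Called on the abort path before FinishTransaction is sent.
  void WaitForAllPerforms() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return in_flight_ == 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int in_flight_ = 0;
};
```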

Test Plan:
./yb_build.sh debug --java-test 'org.yb.pgsql.TestPgRegressFeature' --tp 1 -n 5

No longer see the `Restart read required` error.

Reviewers: tverona, mihnea, pjain, dmitry

Reviewed By: pjain, dmitry

Subscribers: yql, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D19612
myang2021 added a commit that referenced this issue Oct 26, 2022
Summary:
The test yb_feature_temp in TestPgRegressFeature.testPgRegressFeature is flaky and can fail with
```
java.lang.AssertionError: pg_regress exited with error code: 1, failed tests: [yb_feature_temp]
```

After debugging, I found that the following output isn't stable:

```
\d temptest
            Table "pg_temp_2.temptest"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 k      | integer |           | not null |
 v1     | integer |           |          |
 v2     | integer |           |          |
Indexes:
    "temptest_pkey" PRIMARY KEY, btree (k ASC)
    "temptest_v1_idx" UNIQUE, btree (v1 ASC)
    "temptest_v2_idx" hash (v2)
```

The temporary schema name may be `pg_temp_1`, `pg_temp_2`, etc. As a result, the above output
isn't stable.

I made a change to make the test output stable by (see the sketch after this list):
(1) storing the output of `\d temptest` in a file that is auto-removed as part of test cleanup
    after the test completes,
(2) reading the file back into a ysqlsh variable as a text value, and
(3) using regexp_replace to change pg_temp_1, pg_temp_2, or the like to pg_temp_xxx.
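
As an illustration of those three steps, a minimal ysqlsh sketch; the file path and the variable
name are hypothetical, and the real test relies on its own cleanup mechanism for the file:

```sql
-- (1) Redirect the \d output to a scratch file (hypothetical path).
\o /tmp/temptest_describe.out
\d temptest
\o
-- (2) Read the file back into a ysqlsh variable as text.
\set describe_output `cat /tmp/temptest_describe.out`
-- (3) Normalize the temporary schema name so the output is stable.
SELECT regexp_replace(:'describe_output', 'pg_temp_\d+', 'pg_temp_xxx', 'g');
```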

Test Plan:
./yb_build.sh debug --java-test 'org.yb.pgsql.TestPgRegressFeature' --tp 1 -n 5

No longer see
```
java.lang.AssertionError: pg_regress exited with error code: 1, failed tests: [yb_feature_temp]
```

Reviewers: jason, tverona, dsrinivasan

Reviewed By: dsrinivasan

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D20465