[flaky test] org.yb.pgsql.TestPgRegressFeature.testPgRegressFeature #11645
There are two types of errors:

(1) Create database timed out
(2) Restart read required

For (1), the error only appeared in debug builds; samples are in the commit message below. Note that the last 2 RPCs have timed out, and the others are not far from the 60000ms timeout threshold. Because a debug build runs slower than a release build, we should increase the timeout limit from 60000ms to, say, 120000ms. We already do the same for asan/tsan build types (the RPC timeout limit is tripled to 180000ms).
…estPgRegressFeature

Summary: The test TestPgRegressFeature.testPgRegressFeature is flaky, and from the stability page I saw one of the reasons is `Timed out waiting for Create Database`. After looking into several failed instances, I found that all of them had debug builds. Some CreateDatabase RPC samples from one failed Jenkins test:

```
ts1|pid4892|:27485 W0907 02:06:21.790311 4929 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:47425 => 127.212.99.145:27485 (request call id 53) took 57375ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:07:12.885120 4930 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:47425 => 127.212.99.145:27485 (request call id 80) took 47181ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:11:01.036902 8142 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 59) took 45512ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:12:00.640125 7352 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 91) took 50357ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:12:59.858043 7345 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 104) took 58888ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:13:51.610272 7346 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 118) took 51351ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:14:53.788679 7347 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 131) took 60007ms (client timeout 60000ms).
ts1|pid4892|:27485 W0907 02:15:59.452118 7352 yb_rpc.cc:392] Call yb.tserver.PgClientService.CreateDatabase 127.212.99.145:37149 => 127.212.99.145:27485 (request call id 145) took 60004ms (client timeout 60000ms).
```

Note that the last 2 have timed out, and the others are not far from the 60000ms timeout threshold. Because a debug build runs slower than a release build, we should increase the timeout limit from 60000ms to, say, 120000ms. We already do the same for asan/tsan build types (the RPC timeout limit is tripled to 180000ms).

I made a change to use a separate timeout for the CreateDatabase RPC call and to take the debug build type into consideration when deciding the new CreateDatabase RPC timeout limit. I also split out a new test, TestPgRegressFeatureDatabase, to further improve test stability, as occasionally the original test itself can run over 1800 seconds and thus fail.

Test Plan:
./yb_build.sh debug --java-test 'org.yb.pgsql.TestPgRegressFeature' --tp 1 -n 5
./yb_build.sh debug --java-test 'org.yb.pgsql.TestPgRegressFeatureDatabase' --tp 1 -n 5

No longer see the timeout error in create database.

Reviewers: tverona, zyu

Reviewed By: zyu

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D19424
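The build-type-dependent timeout described above can be sketched as follows. This is a minimal illustration of the idea, not the actual YugabyteDB code; the function name and build-type strings are hypothetical, and only the multipliers (1x release, 2x debug, 3x asan/tsan) come from the commit message.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: slower build types get a larger CreateDatabase RPC
// timeout. Base is 60000ms; debug doubles it, asan/tsan triple it, matching
// the numbers discussed in the commit message.
int CreateDatabaseTimeoutMs(const std::string& build_type) {
  constexpr int kBaseTimeoutMs = 60000;
  if (build_type == "asan" || build_type == "tsan") {
    return kBaseTimeoutMs * 3;  // 180000ms for sanitizer builds
  }
  if (build_type == "debug") {
    return kBaseTimeoutMs * 2;  // 120000ms for debug builds
  }
  return kBaseTimeoutMs;  // 60000ms for release builds
}
```

Keeping the multiplier on a single base constant means a future change to the base timeout automatically scales all build types consistently.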
For the second type of error, `Restart read required`, the following is what I found. After looking into several failed instances, I extracted a test case that reproduces the bug: a race condition between the pre-fetching `Perform` RPC and the `FinishTransaction` RPC when aborting a ddl transaction. The full analysis and fix are in the commit below.
…ing YB ddl transaction

Summary: The test TestPgRegressFeature.testPgRegressFeature is flaky, and from the stability page I saw one of the two reasons is `Restart read required`. After looking into several failed instances, I extracted a test case that can reproduce the bug in repeated runs:

```
create table atacc1 (test int not null);
alter table atacc1 alter column test drop not null;
insert into atacc1 values (null);
alter table atacc1 alter test set not null;  -- statement A1
ERROR: column "test" contains null values
delete from atacc1;
alter table atacc1 alter test set not null;  -- statement A2
ERROR: Query error: Restart read required at: { read: { days: 19249 time: 19:27:45.787447 } local_limit: { days: 19249 time: 19:27:45.787447 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }
```

Notice that the statement `alter table atacc1 alter test set not null` is executed twice. For the first execution (A1) we see an expected ERROR. However, for the second execution (A2), the `Restart read required` ERROR is unexpected. It is caused by the race condition described below.

When executing statement A1, a ddl transaction is started at the tserver side. Next, PG needs to verify that the table `atacc1` does not contain any null values, so PG runs a sequential scan of `atacc1`. During the sequential scan, YB makes `Perform` RPC calls to the tserver. YB has an optimization called pre-fetching: before returning the current batch of rows to the PG layer, another `Perform` RPC is issued to fetch the next batch of rows. This can speed up the scan: while the PG layer is verifying the rows already fetched, the next batch of rows is fetched concurrently.

Because the table `atacc1` does have a null value, the verification fails and statement A1 is aborted before all the batches are fetched. As part of that, the ddl transaction is also aborted via a `FinishTransaction` RPC with the ABORT argument from PG to the tserver. Once the tserver receives `FinishTransaction` with the ABORT argument, it clears the transaction from the ddl session and then aborts the ddl transaction. The relevant code:

```
const auto txn_value = std::move(txn);
Session(kind)->SetTransaction(nullptr);
```

and

```
VLOG_WITH_PREFIX_AND_FUNC(2) << "ddl: " << req.ddl_mode() << ", txn: " << txn_value->id() << ", abort";
txn_value->Abort();
```

There is a race condition between the pre-fetching `Perform` RPC and the `FinishTransaction` RPC because pre-fetching is done asynchronously in a different thread of the PG process.

(1) If the tserver receives `Perform` first, the subsequent `FinishTransaction` properly clears the ddl transaction from the ddl session and aborts the ddl transaction as shown above. This is the common case.

(2) If the tserver receives `FinishTransaction` first, it leads to the unexpected `Restart read required` error for the following reason. After `FinishTransaction` terminates the ddl transaction of statement A1, the pre-fetching `Perform` RPC (which belonged to the just-terminated ddl transaction of statement A1) arrives. It re-establishes a new ddl transaction because it has `ddl_mode` in its `PgPerformOptionsPB` set to true. This newly created ddl transaction gets a read point T1 when executing the `Perform` operation. Following that, the statement `delete from atacc1` is executed, which causes a write to table `atacc1` with a commit time of T2. Note that T1 < T2. Then, at the tserver side, statement A2 reuses the newly created ddl transaction with read point T1 for its execution.

Because T1 < T2, in docdb `DocRowwiseIterator::RestartReadHt` returns T2 instead of HybridTime::kInvalid, eventually leading to the `Restart read required` error:

```
if (max_seen_ht.is_valid() && max_seen_ht > db_iter_->read_time().read) {
  VLOG(4) << "Restart read: " << max_seen_ht << ", original: " << db_iter_->read_time();
  return max_seen_ht;
}
return HybridTime::kInvalid;
```

I made changes to prevent the race condition by waiting for all outstanding `Perform` operations to complete their RPCs before aborting the ddl transaction.

Test Plan:
./yb_build.sh debug --java-test 'org.yb.pgsql.TestPgRegressFeature' --tp 1 -n 5

No longer see the `Restart read required` error.

Reviewers: tverona, mihnea, pjain, dmitry

Reviewed By: pjain, dmitry

Subscribers: yql, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D19612
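The shape of the fix — blocking the abort path until all in-flight `Perform` RPCs have drained — can be sketched with a counter guarded by a condition variable. This is an illustrative sketch only; the class and method names are hypothetical and do not correspond to the actual YugabyteDB implementation.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical tracker of in-flight Perform RPCs for a session. The abort
// path waits until the count drops to zero, so a late pre-fetching Perform
// can no longer arrive after FinishTransaction and re-create a ddl
// transaction with a stale read point.
class PerformTracker {
 public:
  void PerformStarted() {
    std::lock_guard<std::mutex> lock(mutex_);
    ++outstanding_;
  }

  void PerformFinished() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (--outstanding_ == 0) {
      cv_.notify_all();
    }
  }

  // Called before aborting the ddl transaction: blocks until no Perform
  // RPC is outstanding.
  void WaitForOutstandingPerforms() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return outstanding_ == 0; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  int outstanding_ = 0;
};
```

With this ordering, case (2) above cannot occur: by the time the ddl transaction is aborted, every `Perform` that belonged to it has already been processed, so no straggler can re-establish the transaction afterwards.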
Summary: The test yb_feature_temp in TestPgRegressFeature.testPgRegressFeature is flaky; it can fail with:

```
java.lang.AssertionError: pg_regress exited with error code: 1, failed tests: [yb_feature_temp]
```

After debugging, I found that the following output isn't stable:

```
\d temptest
               Table "pg_temp_2.temptest"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 k      | integer |           | not null |
 v1     | integer |           |          |
 v2     | integer |           |          |
Indexes:
    "temptest_pkey" PRIMARY KEY, btree (k ASC)
    "temptest_v1_idx" UNIQUE, btree (v1 ASC)
    "temptest_v2_idx" hash (v2)
```

The temporary schema name may be `pg_temp_1`, `pg_temp_2`, etc., so the above output isn't stable. I made a change to make the test output stable by:
(1) storing the output of `\d temptest` into a file that is auto-removed as part of test cleanup after the test completes;
(2) reading the file back into a ysqlsh variable as a text value;
(3) using regexp_replace to change pg_temp_1, pg_temp_2, and the like to pg_temp_xxx.

Test Plan:
./yb_build.sh debug --java-test 'org.yb.pgsql.TestPgRegressFeature' --tp 1 -n 5

No longer see:

```
java.lang.AssertionError: pg_regress exited with error code: 1, failed tests: [yb_feature_temp]
```

Reviewers: jason, tverona, dsrinivasan

Reviewed By: dsrinivasan

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D20465
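The normalization in step (3) can be illustrated with a small regex substitution. The actual test performs it with regexp_replace inside ysqlsh; this standalone sketch (with a hypothetical function name) only demonstrates the pattern being applied.

```cpp
#include <regex>
#include <string>

// Hypothetical helper mirroring the test's regexp_replace step: rewrite any
// unstable temporary schema name (pg_temp_1, pg_temp_2, ...) to the fixed
// placeholder pg_temp_xxx so the expected output no longer depends on which
// temporary schema the backend happened to use.
std::string NormalizeTempSchema(const std::string& text) {
  static const std::regex kTempSchemaPattern("pg_temp_[0-9]+");
  return std::regex_replace(text, kTempSchemaPattern, "pg_temp_xxx");
}
```

Applied to the unstable line above, `Table "pg_temp_2.temptest"` becomes `Table "pg_temp_xxx.temptest"`, which is stable across runs.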
Jira Link: DB-1008
Description
https://detective-gcp.dev.yugabyte.com/stability/test?branch=master&build_type=all&class=org.yb.pgsql.TestPgRegressFeature&name=testPgRegressFeature&platform=all