-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YSQL] flaky test: org.yb.pgsql.TestPgRegressIndex.testPgRegressIndex #16408
Comments
Re-assigning to @amitanandaiyer as based on some of the error messages this looks to be caused by #17342. |
As of 2023-10-30, frequently failing tests are yb_pg_indexing (sometimes) and yb_reindex. At the time of the original report, this test was flaky for other reasons (I believe it was transactions related).
|
The reindex test failure might be an actual regression. Here is a test case: derived from yb_reindex.sql CREATE TEMP TABLE tmp (i int PRIMARY KEY, j int);
CREATE INDEX ON tmp (j);
INSERT INTO tmp SELECT g, -g FROM generate_series(1, 10) g;
-- Disable reads/writes to the index.
UPDATE pg_index SET indislive = false, indisready = false, indisvalid = false
WHERE indexrelid = 'tmp_j_idx'::regclass;
--- Force cache refresh.
SELECT * from pg_yb_catalog_version;
SET yb_non_ddl_txn_for_sys_tables_allowed TO on;
UPDATE pg_yb_catalog_version SET current_version = current_version + 1;
UPDATE pg_yb_catalog_version SET last_breaking_version = current_version;
RESET yb_non_ddl_txn_for_sys_tables_allowed;
SELECT * from pg_yb_catalog_version; Run with debugger breakpoint For recent master cd69a84,
For recent 2.18 7c9798f,
Notice t_infomask2 differs
In upstream PG 15.2, the line moved somewhere else:
Notice t_infomask2 is 2, matching 2.18. So something happened in master that likely messed up this field. |
yb_reindex failure is a catalog_version and cache issue. Originating from commit 6fec2ec, the condition for doing
/*
* Set the current syscatalog version (will check that we are up to date).
* Avoid it for syscatalog tables so that we can still use this for
* refreshing the caches when we are behind.
* Note: This works because we do not allow modifying schemas (alter/drop)
* for system catalog tables.
*/
if (!IsSystemRelation(rel))
/* Set the current syscatalog version (will check that we are up to date) */ Since Andrei's commit removes foreign scan, direct system table reads no longer use foreign scan and instead use yb seq scan. So they don't send catalog version and don't notice catalog version mismatch. Here is the key snippet: SET yb_non_ddl_txn_for_sys_tables_allowed TO on;
UPDATE pg_yb_catalog_version SET current_version = current_version + 1;
UPDATE pg_yb_catalog_version SET last_breaking_version = current_version;
RESET yb_non_ddl_txn_for_sys_tables_allowed;
SELECT distinct(current_version = last_breaking_version) from pg_yb_catalog_version;
-- Show the corruption.
/*+SeqScan(tmp) */
SELECT i FROM tmp WHERE j = -5;
/*+IndexScan(tmp_j_idx) */
SELECT i FROM tmp WHERE j = -5; Before, SELECT from pg_yb_catalog_version gets catalog version mismatch and causes remaining scans to operate on up-to-date cache. After, it doesn't notice mismatch (except for the rare timing where catalog version propogates through heartbeat fast enough), and temp table scans also don't notice since they don't reach out to master/tserver, which is where catalog version mismatch checks happen. Putting a sleep before the cache-dependent select (the last select) causes the issue to go away. Putting an EXPLAIN instead almost always shows sequential scan being chosen instead of index scan because, operating off an old cache, it thinks indislive, indisvalid, indisready are still false. One fix in this case is to send catalog version in the system relation requests (it seems to me the comment justification is outdated). But if the SELECT to pg_yb_catalog_version never existed, then this would be a problem even with that fix. If we accept that the catalog changes can propagate slowly over heartbeat, then having a command to explicitly clear caches would be nice (though we have to be careful about tserver response cache, so this command should either clear both caches or clear just pg cache but also get latest catalog version from master -- though, now that I think about it, a more direct approach may be a command that force rechecks with master's catalog version which can hook up with the existing refresh logic). If we do not accept cases where queries are completely local and can get away with not checking catalog version, then a different more proper solution should be done (there are other similar cases besides select from temp table cc: @deeps1991). |
Summary: TestPgRegressIndex rarely fails with the following errror if READ COMMITTED isolation is enabled: ``` jenkins@jenkins-master ~/jobs/github-yugabyte-db-alma8-master-gcc11-fastdebug/builds$ find -name '*fatal_failure_details*txt' | xargs grep -l 'Restarting a DDL transaction not supported' ./4258/archive/java/yb-pgsql/target/surefire-reports_org.yb.pgsql.TestPgRegressIndex__testPgRegressIndex/org.yb.pgsql.TestPgRegressIndex.testPgRegressIndex.fatal_failure_details.ts-1.127.52.22.121-port13683.2023-11-19T05_58_16.pid13287.txt ./4248/archive/java/yb-pgsql/target/surefire-reports_org.yb.pgsql.TestPgRegressIndex__testPgRegressIndex/org.yb.pgsql.TestPgRegressIndex.testPgRegressIndex.fatal_failure_details.ts-1.127.29.54.114-port15898.2023-11-18T00_42_03.pid223695.txt ./4262/archive/java/yb-pgsql/target/surefire-reports_org.yb.pgsql.TestPgRegressIndex__testPgRegressIndex/org.yb.pgsql.TestPgRegressIndex.testPgRegressIndex.fatal_failure_details.ts-1.127.206.118.77-port16906.2023-11-20T11_42_24.pid40994.txt ``` Disabling read committed in this test to avoid this. Added a TODO to re-enable read committed for this test as part of #19975. Test Plan: ./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressIndex' Reviewers: jason, tvesely Reviewed By: tvesely Subscribers: tvesely, yql Differential Revision: https://phorge.dev.yugabyte.com/D30509
Summary: Test TestPgRegressIndex has been often failing since commit 465ee2c, titled [#18082] YSQL: Stop using ForeignScan for YB relations That commit changes user-initiated system table requests such as SELECT distinct(current_version = last_breaking_version) from pg_yb_catalog_version; to use YbSeqScan rather than ForeignScan. A difference between the two is that YbSeqScan does not set catalog_version in the request to DocDB. The function ybcBeginScan, which is shared by both YbSeqScan and internal system table scans, has logic to avoid setting catalog_version for system tables. This logic has been in place so that the internal system table scans don't hit catalog version mismatch error (which is actually less correct, but that is a discussion for another day). Therefore, on server-side (in this case, master), the catalog version check does not happen for this query, so catalog version mismatch is not detected. Then, the subsequent queries almost always run off an outdated catalog because the true catalog version doesn't propagate fast enough. The test relies on these queries to execute with an up-to-date catalog, so this results in failure. A simple fix for the test is to add sleeps so that the catalog version can propagate. This strategy is already being used by other tests, and this situation could be treated as not much different. Instead, bring back the old behavior of sending ysql_catalog_version in user-initiated system table requests. Do so by further requiring the scans to be internally-generated in order to skip setting catalog_version. The IsSystemRelation condition is still needed because internally-generated scans can scan user tables, such as in index build. Add the assumptions about queries causing catalog version mismatch and cache refresh in the test. Leave issue #16408 open since the overall java test is still failing due to issue #19807 among other reasons. Jira: DB-8984 Test Plan: TestPgRegressIndex is flaky for multiple reasons, such as - issue #19807 - DropTable RPC timed out - expired or aborted by a conflict: 40001 - Restart read required at - could not serialize access due to concurrent update Manually check the following: ./yb_build.sh fastdebug --gcc11 \ --java-test TestPgRegressIndex -n 10 Backport-through: 2.20 Close: #20017 Reviewers: myang Reviewed By: myang Subscribers: kfranz, amartsinchyk, yql Differential Revision: https://phorge.dev.yugabyte.com/D30412
Summary: Test TestPgRegressIndex has been often failing since commit 465ee2c, titled [#18082] YSQL: Stop using ForeignScan for YB relations That commit changes user-initiated system table requests such as SELECT distinct(current_version = last_breaking_version) from pg_yb_catalog_version; to use YbSeqScan rather than ForeignScan. A difference between the two is that YbSeqScan does not set catalog_version in the request to DocDB. The function ybcBeginScan, which is shared by both YbSeqScan and internal system table scans, has logic to avoid setting catalog_version for system tables. This logic has been in place so that the internal system table scans don't hit catalog version mismatch error (which is actually less correct, but that is a discussion for another day). Therefore, on server-side (in this case, master), the catalog version check does not happen for this query, so catalog version mismatch is not detected. Then, the subsequent queries almost always run off an outdated catalog because the true catalog version doesn't propagate fast enough. The test relies on these queries to execute with an up-to-date catalog, so this results in failure. A simple fix for the test is to add sleeps so that the catalog version can propagate. This strategy is already being used by other tests, and this situation could be treated as not much different. Instead, bring back the old behavior of sending ysql_catalog_version in user-initiated system table requests. Do so by further requiring the scans to be internally-generated in order to skip setting catalog_version. The IsSystemRelation condition is still needed because internally-generated scans can scan user tables, such as in index build. Add the assumptions about queries causing catalog version mismatch and cache refresh in the test. Leave issue #16408 open since the overall java test is still failing due to issue #19807 among other reasons. Jira: DB-8984 Test Plan: TestPgRegressIndex is flaky for multiple reasons, such as - issue #19807 - DropTable RPC timed out - expired or aborted by a conflict: 40001 - Restart read required at - could not serialize access due to concurrent update Manually check the following: ./yb_build.sh fastdebug --gcc11 \ --java-test TestPgRegressIndex -n 10 Backport-through: 2.20 Original commit: 1350cef / D30412 Reviewers: myang Reviewed By: myang Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D30851
Jira Link: DB-5819
Description
Keeps popping up on per-diff Detective for 2-3 build types.
report
Warning: Please confirm that this issue does not contain any sensitive information
The text was updated successfully, but these errors were encountered: