Summary:
The test PgDDLConcurrencyTest.IndexCreation is flaky. It runs concurrent
create index statements to trigger race conditions that can cause some of the create
index statements to fail. The test verifies that when create index aborts, the
PG backend's DDL state is properly reset. The test has a set of expected errors that
are suppressed. The test fails if an unexpected error is encountered, or the test
itself times out after 10 minutes.
When the test fails, the following error is reported:
```
Bad status: Network error (yb/yql/pgwrapper/libpq_utils.cc:457): Execute of 'CREATE INDEX IF NOT EXISTS t0_v ON t0(v)' failed: 7, message: ERROR: timed out waiting for postgres backends to catch up
DETAIL: 2 backends on database 13515 are still behind catalog version 15.
HINT: Run the following query on all tservers to find the lagging backends: SELECT * FROM pg_stat_activity WHERE backend_type != 'walsender' AND backend_type != 'yb-conn-mgr walsender' AND catalog_version < 15 AND datid = 13515; (pgsql error XX000) (aux msg ERROR: timed out waiting for postgres backends to catch up
```
I found two reasons for the test flakiness:
(1) The test can fail because `WaitForYsqlBackendsCatalogVersion` times out.
Because we do not officially support concurrent DDLs, a PG backend may be
unable to update its local catalog version while it is itself calling
`WaitForYsqlBackendsCatalogVersion`. If another PG backend is stuck for the
same reason, we get a deadlock-like situation that lasts until
`WaitForYsqlBackendsCatalogVersion` times out.
By default --ysql_yb_wait_for_backends_catalog_version_timeout=300000ms (5 min),
so it only takes two `WaitForYsqlBackendsCatalogVersion` timeouts before the test
itself times out.
(2) When `WaitForYsqlBackendsCatalogVersion` times out, the test checks the
returned error to decide whether it should be suppressed. PG usually appends
the following message to the error message:
```
[ts-1] 2025-07-24 11:58:42.372 GMT [56059] CONTEXT: Catalog Version Mismatch: A DDL occurred while processing this query. Try again.
```
The test has
```
Status SuppressAllowedErrors(const Status& s) {
  if (HasTransactionError(s) || IsRetryable(s)) {
    return Status::OK();
  }
  return s;
}
bool IsRetryable(const Status& status) {
  static const auto kExpectedErrors = {
    "Try again",
    "Catalog Version Mismatch",
    "Restart read required",
    "schema version mismatch for table"
  };
  return HasSubstring(status.message(), kExpectedErrors);
}
```
which means that on seeing "Try again" in the error message, the test continues
executing and fails only if it hits the timeout described in (1).
However, PG only appends the "Try again" message in the common case. In other,
less common situations (e.g., when `need_global_cache_refresh` is false), PG does
not append the "Try again" message. When that happens, the error is not
suppressed and the test fails with just the error shown above.
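As a rough check of the timing in (1), the arithmetic can be sketched as below.
The constant and function names are illustrative only and do not come from the
test; the values are the ones quoted above.

```cpp
#include <chrono>

// Values quoted in this summary; the names are hypothetical.
constexpr std::chrono::milliseconds kDefaultWaitTimeout{300000};  // 5 min default
constexpr std::chrono::minutes kTestTimeout{10};                  // test-level limit

// How many back-to-back backend-wait timeouts fit into the test's time budget.
constexpr long long WaitsUntilTestTimeout(std::chrono::milliseconds wait_timeout) {
  return std::chrono::duration_cast<std::chrono::milliseconds>(kTestTimeout) /
         wait_timeout;
}

// With the default flag value, two timeouts already consume the whole budget.
static_assert(WaitsUntilTestTimeout(kDefaultWaitTimeout) == 2,
              "two default timeouts fill the 10 minute test budget");
```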
To fix the test failure, I made two changes:
(1) changed two gflags to have smaller values:
--wait_for_ysql_backends_catalog_version_client_master_rpc_timeout_ms
from 20s to 2s
--ysql_yb_wait_for_backends_catalog_version_timeout
from 300s to 30s
(2) if the error contains "waiting for postgres backends to catch up",
suppress the error and let the test continue to execute.
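Change (2) can be sketched roughly as follows. This is a standalone
approximation, not the actual diff: `IsRetryableMessage` is a hypothetical
stand-in for the test's `IsRetryable` that takes a plain string instead of a
`Status`, and it replaces `HasSubstring` with an explicit loop.

```cpp
#include <array>
#include <string_view>

// Hypothetical stand-in for the test's IsRetryable, operating on the raw
// error text. The last entry is the substring named in change (2).
inline bool IsRetryableMessage(std::string_view message) {
  static constexpr std::array<std::string_view, 5> kExpectedErrors = {
      "Try again",
      "Catalog Version Mismatch",
      "Restart read required",
      "schema version mismatch for table",
      "waiting for postgres backends to catch up",
  };
  // Suppress the error if any expected substring appears in the message.
  for (std::string_view expected : kExpectedErrors) {
    if (message.find(expected) != std::string_view::npos) {
      return true;
    }
  }
  return false;
}
```

With the extra entry, the timeout error is treated as retryable even when PG
does not append the "Try again" context line.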
Jira: DB-16337
Test Plan:
./yb_build.sh release --cxx-test pgwrapper_pg_ddl_concurrency-test --gtest_filter PgDDLConcurrencyTest.IndexCreation -n 200
Backport-through: 2025.1
The test seems stable in 2024.2, 2024.1 and 2.20, so some later code change
probably introduced the flakiness. For example, some PG error handling code may
have changed: earlier the error text may always have contained "Try again", so
the error was always suppressed.
Reviewers: jason, sanketh
Reviewed By: jason
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D45593