Summary:
Concurrent index creation steps through a sequence of state changes and
requires that clients are never more than one state apart. The state is
tracked by the catalog version, and the requirement is enforced by making
sure all clients are at the catalog version corresponding to a state
change before moving on to the next state.
The problem is that the catalog version we wait on does not necessarily
correspond to the actual state change. The wait uses the local catalog
version, which is simply the previous version +1. That is inaccurate when
other DDLs bump the catalog version at the same time.
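To make the mismatch concrete, here is a toy sketch (not the actual
YugabyteDB code; `Master` and `commit_ddl` are hypothetical names, and the
version numbers are made up) of how "local +1" can fall behind the version
that carries the state change:

```python
class Master:
    """Toy stand-in for the master's catalog version counter (hypothetical)."""

    def __init__(self, version):
        self.version = version

    def commit_ddl(self):
        # Assumption: every committed DDL bumps the shared version by one.
        self.version += 1
        return self.version


master = Master(version=5)
local_version = master.version              # CREATE INDEX backend starts at 5

master.commit_ddl()                         # unrelated concurrent DDL: 6
state_change_version = master.commit_ddl()  # index state change: 7

local_version += 1                          # old behavior: assume "+1"
old_wait_target = local_version
new_wait_target = master.version            # fix: fetch after the commit

print(old_wait_target, state_change_version, new_wait_target)  # 6 7 7
```

In this scenario, waiting on 6 only proves that clients saw the unrelated
DDL, not the state change at 7.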
The fix has two parts:
1. Wait on a proper catalog version: instead of using the local catalog
version, fetch the master's catalog version after the previous state
change commits, then wait on that version. The downside is that there
is a time window between the commit and the fetch of the master's
catalog version, so we may end up waiting on a later version than
necessary. I didn't find an easy way to get the actual version
corresponding to the commit, especially considering issue #5030.
2. Avoid waiting on the backend that is running the CREATE INDEX: since
we may now wait on catalog versions beyond the version that the
CREATE INDEX backend is at, that backend must be exempted from
consideration during the wait on backends catalog version request.
This involves a lot of plumbing to send that information down. The
backend is identified by its tserver UUID and backend PID.
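The exemption in the second part can be sketched as follows. This is an
illustration under assumptions, not the real implementation: `Backend` and
`backends_behind` are hypothetical names, and the UUIDs, PIDs, and versions
are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    tserver_uuid: str
    pid: int
    catalog_version: int


def backends_behind(backends, target_version, requester=None):
    """Return backends still below target_version.

    requester is an optional (tserver_uuid, pid) pair to skip, standing in
    for the backend that runs the CREATE INDEX.
    """
    return [
        b for b in backends
        if (b.tserver_uuid, b.pid) != requester
        and b.catalog_version < target_version
    ]


backends = [
    Backend("ts-1", 100, 6),  # the backend running CREATE INDEX
    Backend("ts-1", 101, 7),
    Backend("ts-2", 100, 6),  # same PID, different tserver
]
requester = ("ts-1", 100)

without_exemption = backends_behind(backends, 7)
with_exemption = backends_behind(backends, 7, requester)
print([(b.tserver_uuid, b.pid) for b in without_exemption])
print([(b.tserver_uuid, b.pid) for b in with_exemption])
```

Without the exemption, the requester itself counts as lagging. With it,
only ts-2/100 remains, which also shows why the PID alone is not enough to
identify the backend.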
Some considerations:
- Setting the local catalog version to +1 is an existing optimization
that should be reconsidered. If a concurrent DDL makes the actual
catalog version higher, this backend may pass breaking catalog version
checks that it should fail. This commit does not address that issue.
Filed #25068.
- Two callers of wait on backends catalog version for the same
db+version can no longer share a job as freely because only one backend
is exempted. Whichever caller comes first registers its backend
tserver/pid for exemption from the check. A second caller with the
same db+version has to wait behind the existing information.
It is not safe to add the second caller's tserver/pid as an additional
exemption: the two callers may actually be waiting on different
catalog versions and only happened to pick up the same later catalog
version, since the picking is not perfect right now. In that case, it
is incorrect to ignore both backends if the first caller's actual
version is later than the second caller's actual version. So such
two-caller cases might end up blocking each other. Even if the two
callers were not combined and instead got separate jobs, they could
still be waiting on each other if their CREATE INDEXes started on old
versions.
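A toy model of the shared-job behavior described in the last
consideration, assuming jobs are keyed by (db, version) and hold exactly
one exempt backend. `WaitJobs`, `register`, and `is_exempt` are
hypothetical names, and the database OID and version are made up:

```python
class WaitJobs:
    """Toy registry of wait jobs keyed by (database OID, catalog version)."""

    def __init__(self):
        self._exempt = {}  # (db_oid, version) -> (tserver_uuid, pid)

    def register(self, db_oid, version, requester):
        # The first caller wins. A later caller with the same key joins the
        # existing job and does not add its own backend to the exemption.
        return self._exempt.setdefault((db_oid, version), requester)

    def is_exempt(self, db_oid, version, backend):
        return self._exempt.get((db_oid, version)) == backend


jobs = WaitJobs()
first = ("ts-1", 100)
second = ("ts-2", 200)

jobs.register(13000, 7, first)
jobs.register(13000, 7, second)

print(jobs.is_exempt(13000, 7, first))   # True
print(jobs.is_exempt(13000, 7, second))  # False
```

In this model the second caller's backend is still checked against the
target version, which is how the two callers can end up waiting on each
other.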
Jira: DB-13866
Test Plan:
On AlmaLinux 8:
./yb_build.sh fastdebug --gcc11 \
--gtest_filter PgIndexBackfillTest.CatVerBumps \
-n 100
./yb_build.sh release \
--gtest_filter PgIndexBackfillTest.CatVerBumps \
-n 100
./yb_build.sh tsan \
--gtest_filter PgIndexBackfillTest.CatVerBumps \
-n 300
./yb_build.sh fastdebug --gcc11 \
--cxx-test pg_backends-test
Backport-through: 2.20
Reviewers: myang, amartsinchyk
Reviewed By: amartsinchyk
Subscribers: amartsinchyk, ybase, yql
Differential Revision: https://phorge.dev.yugabyte.com/D39780