Summary:
Commit 43032537707c82693281c33574e21d1b35f2da2d changed incremental catalog
cache refresh so that, on detecting a newer shared catalog version, an
additional RPC to master is made to retrieve the latest catalog version.
This additional RPC handles the case where a parent ysqlsh forks a child ysqlsh
that runs a number of DDLs and then exits, returning control to the parent
ysqlsh. The parent ysqlsh's PG backend detects a newer shared catalog version
caused by the catalog version increments made by the DDLs executed in the child
ysqlsh. Without retrieving the latest master catalog version, heartbeat delay
can leave the parent ysqlsh's PG backend operating on a newer shared catalog
version that is already stale relative to the latest master catalog version, so
its next DML fails with the error:
```
The catalog snapshot used for this transaction has been invalidated:
```
Although this additional master RPC call avoids the error, it hurts performance.
In particular, in the TPCC benchmark with auto-analyze enabled, the `Connection
Acq Latency` increased more than 300% compared with the baseline, which has
auto-analyze disabled. After debugging, I found that this additional master RPC
call is the culprit.
Instead of making an additional RPC call to master, we can let the DDL
statement update the shared catalog version. Then, after the child ysqlsh exits,
the parent ysqlsh's PG backend sees the latest catalog version in shared memory
and does not need to send an RPC to master.
This diff adds `YbCheckNewSharedCatalogVersionOptimization`, which is called
after a DDL increments the catalog version successfully. It makes an RPC call to
the local tserver to set up the new catalog version together with the
invalidation messages. All the backends on the same node can then see the
latest catalog version earlier, before the next heartbeat response from master.
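The idea can be illustrated with a toy model. This is only a sketch, not the
actual implementation: `Master`, `TServer`, `run_ddl`, and
`set_catalog_version_locally` are hypothetical names invented for the example,
and invalidation messages are left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Master:
    """Toy source of truth for the catalog version."""
    catalog_version: int = 1


@dataclass
class TServer:
    """Toy tserver holding the node-local shared catalog version."""
    shared_catalog_version: int = 1

    def on_heartbeat_response(self, master: Master) -> None:
        # Heartbeat path: shared memory catches up with master, but only
        # when the next heartbeat response arrives.
        self.shared_catalog_version = max(
            self.shared_catalog_version, master.catalog_version)

    def set_catalog_version_locally(self, version: int) -> None:
        # Hypothetical stand-in for the local PG -> tserver call made
        # right after a DDL bumps the catalog version.
        self.shared_catalog_version = max(
            self.shared_catalog_version, version)


def run_ddl(master: Master, tserver: TServer, publish_locally: bool) -> int:
    """Increment the master catalog version, optionally publishing it locally."""
    master.catalog_version += 1
    if publish_locally:
        tserver.set_catalog_version_locally(master.catalog_version)
    return master.catalog_version


def is_stale(master: Master, tserver: TServer) -> bool:
    """True if the shared catalog version lags behind the master version."""
    return tserver.shared_catalog_version < master.catalog_version


def simulate(publish_locally: bool, num_ddls: int = 3) -> bool:
    """Run DDLs with no heartbeat in between; report whether shared memory is stale."""
    master, tserver = Master(), TServer()
    for _ in range(num_ddls):
        run_ddl(master, tserver, publish_locally)
    return is_stale(master, tserver)


print(simulate(publish_locally=False))  # True: stale until the next heartbeat
print(simulate(publish_locally=True))   # False: already up to date
```

In this model, a backend on the same node that reads the shared catalog version
right after the DDLs finish sees the latest value only when the DDL publishes it
locally; otherwise it has to wait for a heartbeat or ask master.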
A few unit tests are updated because they used to make two connections to the
same node and relied on heartbeat delay to pass. I changed the tests to make two
connections to different nodes so that heartbeat delay continues to work as
these tests expect. TestPgDdlConcurrency.java is also updated because the gflag
name was wrong.
**Upgrade/Rollback safety:**
The src/yb/tserver/pg_client.proto change is only used in PG -> tserver
communication, which is upgrade safe.
Jira: DB-16501
Test Plan:
(1) ./yb_build.sh --cxx-test pg_catalog_version-test --gtest_filter PgCatalogVersionTest.WaitForSharedCatalogVersionToCatchup
(2) Ran the TPCC benchmark with 10 warehouses on my local dev VM RF-3 cluster:
```
incremental refresh/auto-analyze off (without diff)
Transaction | Count | Avg. Latency | P99 Latency | Connection Acq Latency
NewOrder    | 3809  | 27.52        | 45.94       | 0.47

incremental refresh/auto-analyze on (without diff)
Transaction | Count | Avg. Latency | P99 Latency | Connection Acq Latency
NewOrder    | 3836  | 30.37        | 57.84       | 1.18

incremental refresh/auto-analyze off (with diff)
Transaction | Count | Avg. Latency | P99 Latency | Connection Acq Latency
NewOrder    | 3890  | 27.99        | 48.79       | 0.46

incremental refresh/auto-analyze on (with diff)
Transaction | Count | Avg. Latency | P99 Latency | Connection Acq Latency
NewOrder    | 3900  | 29.91        | 51.47       | 0.56
```
Consider `Connection Acq Latency`, comparing the "on" run against the "off" run:
Without the diff, we see 1.18/0.47 = 251%. With the diff, we see 0.56/0.46 = 122%.
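The ratios can be rechecked with a few lines of Python. This is just a
convenience sketch: the numbers are copied from the NewOrder rows in the table,
and the `latency` dict and `on_vs_off_percent` helper are names made up for the
example.

```python
# Connection Acq Latency for NewOrder, copied from the benchmark table.
# Keys are (incremental refresh/auto-analyze setting, build).
latency = {
    ("off", "without diff"): 0.47,
    ("on", "without diff"): 1.18,
    ("off", "with diff"): 0.46,
    ("on", "with diff"): 0.56,
}


def on_vs_off_percent(build: str) -> int:
    """Return the 'on' latency as a percentage of the 'off' latency."""
    ratio = latency[("on", build)] / latency[("off", build)]
    return round(ratio * 100)


for build in ("without diff", "with diff"):
    print(f"{build}: {on_vs_off_percent(build)}%")
# without diff: 251%
# with diff: 122%
```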
Reviewers: kfranz, sanketh, mihnea
Reviewed By: kfranz
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D43651