2.25.2.0-b158

tagged this 13 Mar 18:35
Summary:
The `active_pid` field in the `pg_replication_slots` view was always null, because the pid was never stored or updated for any slot. With this revision, `active_pid` is stored for each slot in the `cdc_state` table.

The new field, `active_pid` (uint64), is stored under the data column.

The value of `active_pid` in the `cdc_state` table is updated through the RPCs that initialise the virtual WAL (`YBCInitVirtualWal`) and destroy it (`YBCDestroyVirtualWal`). When the virtual WAL is initialised, `active_pid` is set to `MyProcPid`; whenever it is destroyed, `active_pid` is set to 0.
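The lifecycle described above can be modelled roughly as follows. This is only an illustrative Python sketch, not the actual pggate or tserver code: the in-memory `cdc_state` dict, the function names, and the mapping of 0 to NULL in the view are all assumptions made for the example.

```python
import os

# Hypothetical in-memory stand-in for the cdc_state table, keyed by stream_id.
cdc_state = {}


def init_virtual_wal(stream_id, pid=None):
    """Model of virtual WAL initialisation: record the backend pid.

    The real code uses MyProcPid; os.getpid() stands in for it here.
    """
    pid = os.getpid() if pid is None else pid
    cdc_state.setdefault(stream_id, {})["active_pid"] = pid


def destroy_virtual_wal(stream_id):
    """Model of virtual WAL teardown: reset active_pid to 0."""
    cdc_state.setdefault(stream_id, {})["active_pid"] = 0


def active_pid_for_view(stream_id):
    """Assumption: an active_pid of 0 (or a missing entry) is shown as NULL."""
    pid = cdc_state.get(stream_id, {}).get("active_pid", 0)
    return pid if pid != 0 else None
```

In this model a slot reports a pid only between the init and destroy calls, which matches the behaviour the `testActivePidNull` and `testActivePidPopulationOnStreamRestart` test names suggest.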

The `backend_xmin` field in the `pg_stat_replication` view was not being populated. With this revision, it is filled from the `xmin` value stored in the `cdc_state` table.

The `state` field in the `pg_stat_replication` view was always set to `catchup`. With this revision, `state` is always set to `streaming`, because the other two walsender states, `catchup` and `stopping`, can never be reached. When the walsender process goes down, its entry is simply removed, so the `stopping` state is not possible. Similarly, the entry is populated only while the walsender is active, and the walsender is used only for streaming, so the `catchup` state is not possible.

The three lag metrics (`flush_lag`, `write_lag` and `replay_lag`) in the `pg_stat_replication` view were not being populated. With this revision, each of them is set to the value of `cdcsdk_flush_lag`. For this, a new RPC, `YBCGetLagMetrics`, has been introduced; it fetches the value of the lag metric based on the `stream_id`.
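Taken together, the `pg_stat_replication` changes amount to the following mapping. This is a minimal Python sketch for illustration only: `build_stat_replication_row` and its arguments are hypothetical names, and the lag value is treated as opaque because its unit is not stated here.

```python
def build_stat_replication_row(cdc_state_row, cdcsdk_flush_lag):
    """Assemble the fields this revision populates for one walsender entry.

    cdc_state_row: hypothetical dict holding the slot's cdc_state values.
    cdcsdk_flush_lag: the lag value fetched for the slot's stream_id.
    """
    return {
        # Always "streaming": catchup and stopping are never reached.
        "state": "streaming",
        # Taken from the xmin value stored in cdc_state.
        "backend_xmin": cdc_state_row["xmin"],
        # All three lag columns carry the same cdcsdk_flush_lag value.
        "write_lag": cdcsdk_flush_lag,
        "flush_lag": cdcsdk_flush_lag,
        "replay_lag": cdcsdk_flush_lag,
    }


# Example with made-up values.
row = build_stat_replication_row({"xmin": 731}, cdcsdk_flush_lag=1500)
```

Because all three lag columns share one source, comparing any of them against `cdcsdk_flush_lag` is enough to check the population manually.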

**Upgrade / rollback safety:**

This diff introduces a new RPC, `GetLagMetrics`. This RPC is always sent from PG to the local tserver, so an auto flag is not needed.

This diff adds a field, `active_pid`, to the message `PgReplicationSlotInfoPB`. The new field is added to the tserver response (to pggate) proto. When the value is absent, the master (which is upgraded first) is expected to fill in the appropriate default value. This is upgrade and rollback safe. No new flags are added to guard the feature.
Jira: DB-14647

Test Plan:
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testActivePidNull'
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testActivePidPopulationOnStreamRestart'
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testActivePidPopulationFromDifferentTServers'
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testBackendXminAndStatePopulation'

Manual testing was performed to ensure that the lag metrics in the `pg_stat_replication` view are populated correctly, by comparing the output with the value of `cdcsdk_flush_lag`.

Reviewers: skumar, vkushwaha, utkarsh.munjal, sumukh.phalgaonkar, stiwary

Reviewed By: utkarsh.munjal, sumukh.phalgaonkar

Subscribers: yql, ybase, ycdcxcluster

Differential Revision: https://phorge.dev.yugabyte.com/D41819