Skip to content

2.25.0.0-b419

@kai-franz kai-franz tagged this 04 Dec 11:44
Summary:
### Overview

This revision addresses a regression introduced by D25586 which causes PG backend memory to spike to >1GB during connection startup on certain schemas.

### Background

During connection startup and during cache refresh, the PG backend process needs to build the relcache, a per-process in-memory cache that has an entry for each table. We build the relcache in the following scenarios:
1. During connection startup for the first connection after a DDL or after cluster startup.
2. During cache refresh.
3. During connection startup for any connection if preloading is enabled on the cluster.

Each relcache entry has a section that stores information about the triggers on the table. We currently build this by calling `RelationBuildTriggers` for each table—each invocation of `RelationBuildTriggers` scans `pg_trigger` for the triggers on the relation and loads them into the cache entry. In vanilla PG, `RelationBuildTriggers` uses an index scan to efficiently lookup the relevant triggers by the table's OID; however, in Yugabyte, the lookup works differently. Because we call `RelationBuildTriggers` many times, we prefetch the entire `pg_trigger` table and scan the prefetched table rather than going all the way to the master leader (which may be on a different node). However, the trigger relid index is not available on the prefetched table, so we have to scan the entire table into PG and then filter for the relevant rows.

This method of loading the relcache is slow and memory inefficient. As an example, one customer has about 700 tables and 4,000 triggers (mainly created automatically from foreign key constraints). This results in having to copy and scan 3,000,000 rows during relcache preloading. We create a new memory context for loading the relcache and only free it after the relcache is fully loaded—this context balloons to about 400MB due to all the copies of `pg_trigger`. Furthermore, `RelationBuildTriggers` accounts for about 70% of the time it takes to start up a connection in this case.

### RelationBuildRowSecurity

The same issue also affects `RelationBuildRowSecurity`. During connection startup, peak memory usage is Θ(# of row security policies * # of tables). Loading the row-level security policies into the relcache is also slow, with a time complexity of Θ(# of row security policies * # of tables).

### The Fix

To fix this issue, we copy `pg_trigger` into an in-memory hash table indexed by the relation oid and use this hash table to lookup triggers by relation OID when building the trigger entries in the relcache. This reduces the time of relcache trigger initialization from Θ(m * n) to Θ(m + n), and the memory usage from Θ(m * n) to Θ(m), where m is the total # of triggers and n is the total # of tables. This optimization is similar to the ones added in D25586 for `pg_attrdef` and `pg_constraint`.

We implement the same fix for the `pg_policy` table in `RelationBuildRowSecurity`.

This is optimization is guarded by the t-server flag `ysql_use_optimized_relcache_update` (default true).

### Root Cause

Previously, we used a a separate context for each tuple in `YBUpdateRelationsAttributes` (added in D25433). This per-tuple context was then used for `YbCompleteAttrProcessing`, which invokes `RelationBuildTriggers` and `RelationBuildRowSecurity`. So although these functions were inefficiently copying (# of triggers + policies * # tables) tuples, we would load the `pg_trigger` & `pg_policy` tables into memory once before deleting the context, keeping peak memory usage manageable.

D25586 removed this per-tuple context, causing the pg_trigger and pg_policy tables to be accumulated in memory *n* times over before being freed.
Jira: DB-14063

Test Plan:
```
./yb_build.sh --java-test 'org.yb.pgsql.TestRelcacheUpdate'
```

Reviewers: myang

Reviewed By: myang

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D40003
Assets 2
Loading