
Garbage Collector is very aggressive and kills some queries without waiting #3977

Open
@sharewax

Description


Describe the Bug

I can see in the PuppetDB logs that GC kills some connections to Postgres.
We have the main node located in the EU with the following settings (currently it runs GC once a day; it used to be every hour, and I can lower the value to reproduce more frequently):

node-ttl = 13d
node-purge-ttl = 14d
node-purge-gc-batch-limit = 50
gc-interval = 1440

and additional PuppetDB nodes pointing at the same database in the EU, the farthest of which is in Singapore with an RTT of 183 ms, configured as follows:

gc-interval = 0
migrate = false
maximum-pool-size = 30
conn-max-age = 27

So the issue: I observe GC dropping connections to Postgres without any grace period during the "Starting sweep of stale reports and resource events" phase:

69848-2024-05-09T05:53:43.280Z INFO  [p.p.c.services] Starting sweep of stale nodes (threshold: 13 days)
69850-2024-05-09T05:53:43.348Z INFO  [p.p.c.services] Finished sweep of stale nodes (threshold: 13 days)
69851-2024-05-09T05:53:43.353Z INFO  [p.p.c.services] Starting purge deactivated and expired nodes (threshold: 14 days)
69854-2024-05-09T05:53:43.867Z INFO  [p.p.c.services] Finished purge deactivated and expired nodes (threshold: 14 days)
69855-2024-05-09T05:53:43.889Z INFO  [p.p.c.services] Starting sweep of stale reports (threshold: 14 days) and resource events (threshold: 14 days)
69857:2024-05-09T05:53:44.112Z INFO  [p.p.s.storage] Partition GC terminated queries from the following PostgreSQL pids: [21453]
69858:2024-05-09T05:53:44.319Z INFO  [p.p.s.storage] Partition GC terminated queries from the following PostgreSQL pids: [23047]
69859:2024-05-09T05:53:44.523Z INFO  [p.p.s.storage] Partition GC terminated queries from the following PostgreSQL pids: [16534]
69860-2024-05-09T05:53:44.530Z INFO  [p.p.c.services] Finished sweep of stale reports (threshold: 14 days) and resource events (threshold: 14 days)
69861-2024-05-09T05:53:44.534Z INFO  [p.p.c.services] Starting gc packages
69862-2024-05-09T05:53:44.538Z INFO  [p.p.c.services] Finished gc packages
69863-2024-05-09T05:53:44.540Z INFO  [p.p.c.services] Starting sweep of stale catalog data

So it kills 3 connections in less than half a second (the three terminations span roughly 400 ms in the log).
These connections were actively sending commands from the remote location (log entries grepped by pid below):

{"timestamp":"2024-05-09 05:53:43.982 UTC","user":"puppetdb","dbname":"puppetdb","pid":21453,"remote_host":"10.169.2.254","remote_port":51616,"session_id":"663c603d.53cd","line_num":2438,"ps":"BIND","session_start":"2024-05-09 05:33:49 UTC","vxid":"133/20249","txid":115886112,"error_severity":"LOG","message":"duration: 0.085 ms  bind S_15: SELECT producer_timestamp FROM factsets WHERE certname = $1 ORDER BY producer_timestamp DESC LIMIT $2","detail":"parameters: $1 = 'sg2-sdb-a11.be.core.pw', $2 = '1'","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":-7343572058853576852}
{"timestamp":"2024-05-09 05:53:43.982 UTC","user":"puppetdb","dbname":"puppetdb","pid":21453,"remote_host":"10.169.2.254","remote_port":51616,"session_id":"663c603d.53cd","line_num":2439,"ps":"SELECT","session_start":"2024-05-09 05:33:49 UTC","vxid":"133/20249","txid":115886112,"error_severity":"LOG","message":"duration: 0.096 ms  execute S_15: SELECT producer_timestamp FROM factsets WHERE certname = $1 ORDER BY producer_timestamp DESC LIMIT $2","detail":"parameters: $1 = 'sg2-sdb-a11.be.core.pw', $2 = '1'","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":0}
{"timestamp":"2024-05-09 05:53:44.106 UTC","user":"puppetdb","dbname":"puppetdb","pid":21453,"remote_host":"10.169.2.254","remote_port":51616,"session_id":"663c603d.53cd","line_num":2440,"ps":"idle in transaction","session_start":"2024-05-09 05:33:49 UTC","vxid":"133/20249","txid":115886112,"error_severity":"FATAL","state_code":"57P01","message":"terminating connection due to administrator command","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":0}
{"timestamp":"2024-05-09 05:51:55.377 UTC","user":"puppetdb","dbname":"puppetdb","pid":23047,"remote_host":"10.169.2.254","remote_port":46560,"session_id":"663c606c.5a07","line_num":1215,"ps":"BIND","session_start":"2024-05-09 05:34:36 UTC","vxid":"111/39546","txid":0,"error_severity":"LOG","message":"duration: 0.097 ms  bind <unnamed>: insert into certnames (certname) values ($1)  on conflict (certname) do update set deactivated=null, expired=null  where (certnames.deactivated < $2 or certnames.expired < $3)","detail":"parameters: $1 = 'sg2-sl-b60.be.core.pw', $2 = '2024-05-09 05:51:48.896+00', $3 = '2024-05-09 05:51:48.896+00'","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":0}
{"timestamp":"2024-05-09 05:51:56.377 UTC","user":"puppetdb","dbname":"puppetdb","pid":23047,"remote_host":"10.169.2.254","remote_port":46560,"session_id":"663c606c.5a07","line_num":1216,"ps":"INSERT waiting","session_start":"2024-05-09 05:34:36 UTC","vxid":"111/39546","txid":0,"error_severity":"LOG","message":"process 23047 still waiting for ShareLock on transaction 115885248 after 1000.154 ms","detail":"Process holding the lock: 16534. Wait queue: 23047.","context":"while locking tuple (33,96) in relation \"certnames\"","statement":"insert into certnames (certname) values ($1)  on conflict (certname) do update set deactivated=null, expired=null  where (certnames.deactivated < $2 or certnames.expired < $3)","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":8056160586392606418}
{"timestamp":"2024-05-09 05:53:44.319 UTC","user":"puppetdb","dbname":"puppetdb","pid":23047,"remote_host":"10.169.2.254","remote_port":46560,"session_id":"663c606c.5a07","line_num":1217,"ps":"INSERT waiting","session_start":"2024-05-09 05:34:36 UTC","vxid":"111/39546","txid":0,"error_severity":"FATAL","state_code":"57P01","message":"terminating connection due to administrator command","context":"while locking tuple (33,96) in relation \"certnames\"","statement":"insert into certnames (certname) values ($1)  on conflict (certname) do update set deactivated=null, expired=null  where (certnames.deactivated < $2 or certnames.expired < $3)","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":8056160586392606418}
{"timestamp":"2024-05-09 05:53:44.378 UTC","user":"puppetdb","dbname":"puppetdb","pid":16534,"remote_host":"10.169.2.254","remote_port":62478,"session_id":"663c5fca.4096","line_num":4582,"ps":"BIND","session_start":"2024-05-09 05:31:54 UTC","vxid":"142/114381","txid":115885248,"error_severity":"LOG","message":"duration: 0.034 ms  bind S_30: INSERT INTO fact_paths ( path, value_type_id, path_array, name, depth ) VALUES ( $1, $2, $3, $4, $5 ) ON CONFLICT do nothing\nRETURNING *","detail":"parameters: $1 = 'processors#~models#~4', $2 = '0', $3 = '{processors,models,4}', $4 = 'processors', $5 = '2'","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":0}
{"timestamp":"2024-05-09 05:53:44.378 UTC","user":"puppetdb","dbname":"puppetdb","pid":16534,"remote_host":"10.169.2.254","remote_port":62478,"session_id":"663c5fca.4096","line_num":4583,"ps":"INSERT","session_start":"2024-05-09 05:31:54 UTC","vxid":"142/114381","txid":115885248,"error_severity":"LOG","message":"duration: 0.101 ms  execute S_30: INSERT INTO fact_paths ( path, value_type_id, path_array, name, depth ) VALUES ( $1, $2, $3, $4, $5 ) ON CONFLICT do nothing\nRETURNING *","detail":"parameters: $1 = 'processors#~models#~4', $2 = '0', $3 = '{processors,models,4}', $4 = 'processors', $5 = '2'","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":4553857851137822237}
{"timestamp":"2024-05-09 05:53:44.522 UTC","user":"puppetdb","dbname":"puppetdb","pid":16534,"remote_host":"10.169.2.254","remote_port":62478,"session_id":"663c5fca.4096","line_num":4584,"ps":"idle in transaction","session_start":"2024-05-09 05:31:54 UTC","vxid":"142/114381","txid":115885248,"error_severity":"FATAL","state_code":"57P01","message":"terminating connection due to administrator command","application_name":"PostgreSQL JDBC Driver","backend_type":"client backend","query_id":4553857851137822237}

Expected Behavior

I would expect GC to wait for some grace period and drop queries only if something really bad happens.
It could either be more patient, or at least explain in the logs why these Postgres connections had to be terminated during GC. In my view, GC shouldn't be destructive.

Steps to Reproduce

I can reproduce it on every GC run with two PuppetDB instances that are located far apart (for example, by adding some delay between the nodes with tc), as sketched below.
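
A minimal sketch of that reproduction, assuming the remote PuppetDB reaches Postgres over eth0 (the interface name and delay value are just examples):

tc qdisc add dev eth0 root netem delay 180ms   # emulate the WAN latency
# ...let the delayed node submit commands, then trigger GC on the main node...
tc qdisc del dev eth0 root netem               # remove the delay afterwards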

Environment

  • Version: 7.15.0
  • Platform: CentOS 7
