
2.29.0.0-b33

@iSignal iSignal tagged this 09 Oct 17:55
Summary:
Currently, txns are aborted when they fail to send 10 heartbeats 500 ms apart (5 secs in total).
Under load or even under temporary stalls, we see spurious aborts of txns even when there are no
long-term issues with the RAFT participants of the YSQL PG process. This diff raises the abort
timeout to 30 heartbeats 500 ms apart (15 secs in total) and adds some logging to help identify
such situations.
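
As a rough illustration of the arithmetic above, the following Python sketch models the abort timeout as a number of heartbeats multiplied by the heartbeat interval. The `AbortPolicy` class and its method names are hypothetical and are not taken from the actual change; only the 500 ms interval and the 10 and 30 heartbeat counts come from the summary.

```python
from dataclasses import dataclass

# Heartbeat interval stated in the summary above.
HEARTBEAT_INTERVAL_MS = 500


@dataclass
class AbortPolicy:
    """Hypothetical model: a txn is aborted after this many missed heartbeats."""
    missed_heartbeats: int

    def timeout_ms(self) -> int:
        # Total silence tolerated before the txn is considered expired.
        return self.missed_heartbeats * HEARTBEAT_INTERVAL_MS

    def is_expired(self, last_heartbeat_ms: int, now_ms: int) -> bool:
        # Expired once the time since the last heartbeat reaches the timeout.
        return now_ms - last_heartbeat_ms >= self.timeout_ms()


old_policy = AbortPolicy(missed_heartbeats=10)  # 5 secs
new_policy = AbortPolicy(missed_heartbeats=30)  # 15 secs
```

Under this model, a temporary stall of 7 secs would expire a transaction with the old setting but not with the new one.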

On its own, this change would mean that when a PG backend dies suddenly (SIGKILL, for example), it could take up to 15
secs for the corresponding transactions to be aborted instead of 5 secs, so other transactions
waiting for locks held by those transactions would have to wait longer or time out. To abort more quickly in this scenario, transactions attached to a PG backend that dies are explicitly aborted when the tserver notices the PG backend has exited (the tserver polls for the PG backend pid's status every 50 ms). So the death of a PG backend will still cause its transactions to be aborted quickly.
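
The polling approach described above can be sketched roughly as follows. This is an illustrative Python model, not the tserver's actual code: `poll_once`, `is_alive`, and `abort_txn` are hypothetical names, and the mapping from backend pid to transaction ids is an assumed data structure. Only the 50 ms polling interval comes from the text.

```python
from typing import Callable, Dict, List, Set

# Polling interval stated in the summary above (50 ms).
POLL_INTERVAL_S = 0.05


def poll_once(
    backend_txns: Dict[int, List[str]],
    is_alive: Callable[[int], bool],
    abort_txn: Callable[[str], None],
) -> Set[int]:
    """Abort all txns attached to backends that have exited.

    backend_txns maps a backend pid to the ids of its open txns
    (an assumed structure). Returns the pids found dead in this pass.
    """
    # Collect dead pids first so the dict is not mutated while iterating.
    dead = {pid for pid in backend_txns if not is_alive(pid)}
    for pid in dead:
        # Remove the backend's entry and abort each of its txns explicitly.
        for txn_id in backend_txns.pop(pid):
            abort_txn(txn_id)
    return dead
```

A caller would invoke `poll_once` in a loop, sleeping `POLL_INTERVAL_S` between passes, so a dead backend's transactions are aborted shortly after it exits rather than after the full heartbeat timeout.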

However, if a tserver dies, it can now take up to 15 secs for the corresponding transactions to be aborted instead of 5 secs. Given that this is not a common scenario, it seems acceptable to take this hit.

One side effect of this change could be slightly higher WAL consumption on transaction status tablets with many expired transactions.
Jira: DB-17288

Test Plan:
Verify this scenario manually.

```
Session 1                           Session 2
begin;
update abc set id = id + 1;
select pg_sleep(3600);              begin;
                                    update abc set id = id + 2; -- waits for lock
                                    select clock_timestamp();
                                    commit;
kill -9 backend pid of session 1
select clock_timestamp();
```

Verify that the last two clock timestamps are close enough (within 1 second).
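
For anyone scripting this check, a minimal Python sketch of the comparison might look like the following. The function name `within_tolerance` and the sample timestamps are hypothetical; the 1 second tolerance comes from the test plan.

```python
from datetime import datetime, timedelta


def within_tolerance(
    first: datetime,
    second: datetime,
    tolerance: timedelta = timedelta(seconds=1),
) -> bool:
    # True when the two timestamps differ by no more than the tolerance,
    # regardless of which one was captured first.
    return abs(first - second) <= tolerance


# Made-up example values standing in for the two clock_timestamp() results.
session2_ts = datetime(2024, 1, 1, 12, 0, 0, 250000)
session1_ts = datetime(2024, 1, 1, 12, 0, 0, 900000)
close_enough = within_tolerance(session2_ts, session1_ts)
```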

Also verified that for a backend that exits normally, with the client closing it after completing its statements, no transactions are left over by the time we reach this point in StartShutdown.

Reviewers: esheng, pjain, bkolagani, zdrudi

Reviewed By: bkolagani

Subscribers: rthallam, zdrudi, yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D46875