Description
Jira Link: DB-17288
During large transactions, such as those that result from table rewrites on large tables or on a large partition hierarchy, we run into errors such as:
ysqlsh:alter_table.sql:1: ERROR: could not serialize access due to concurrent update (query layer retry isn't possible, READ COMMITTED transaction was aborted and some data was already sent to the user) DETAIL: Heartbeat: Transaction 73389530-f34a-49e0-82f4-c408cfc6f770 expired or aborted by a conflict: YB001: . Errors from tablet servers: [Operation expired (yb/tablet/transaction_coordinator.cc:1766): Heartbeat: Transaction 73389530-f34a-49e0-82f4-c408cfc6f770 expired or aborted by a conflict: YB001 (pgsql error YB001) (transaction error 1)]
One way to simulate this is to trigger Raft leader failures while a table rewrite is running. It can be worked around by increasing the transaction timeout via --transaction_max_missed_heartbeat_periods=60, but it would be better to increase this timeout automatically for such transactions.
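For a rough sense of what the workaround changes, the sketch below estimates how long a transaction can go without a successful heartbeat before the coordinator expires it. It assumes the expiry window is simply the number of allowed missed periods multiplied by the heartbeat interval, and that the interval is 500 ms (the value I believe transaction_heartbeat_usec defaults to). Both are assumptions to check against the actual deployment, and the helper name is made up for illustration.

```python
# Hypothetical helper, not part of YugabyteDB.
# Assumption: expiry window = allowed missed heartbeat periods * heartbeat interval.
# Assumption: the heartbeat interval is 500 ms (500_000 microseconds).

def txn_expiry_seconds(max_missed_heartbeat_periods: float,
                       heartbeat_usec: int = 500_000) -> float:
    """Estimate how long a transaction survives without a successful heartbeat."""
    return max_missed_heartbeat_periods * heartbeat_usec / 1_000_000


if __name__ == "__main__":
    # Value used in the workaround above.
    print(txn_expiry_seconds(60))
```

Under these assumptions, the setting of 60 gives a window of about 30 seconds, which is the time a leader failover would need to complete within for the rewrite to survive.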