Performance of synchronous replication degrades badly as the number of parallel synchronous requests grows.
For example, on my machine with 6000 requests in parallel (`box.info.synchro.queue.len` reads 6000) the master was only able to pull about 8000 RPS, and messages like this were frequent in the log:

```
2024-04-08 17:23:11.047 [43663] main txn.c:830 W> too long WAL write: 1 rows at LSN 11794161: 0.520 sec
```
The issue manifested itself even with `replication_synchro_quorum = 1`, so it is not caused by network delay; the quorum size also had little effect on the results. The problem seems to lie in how synchronous transactions are processed in the queue.
Besides, the same 6000 concurrent requests replacing tuples in an asynchronous space reached about 300k RPS, so the issue is not related to batch finalization of transactions.
Most likely the cause of the degradation is the way our `txn_limbo_ack` traverses the whole transaction list. In the example above `txn_limbo_ack` is always called with the LSN of the last of the 6000 transactions, yet it still walks the entire list and updates `ack_count` separately for each transaction. We might improve this: persist a sorted array of the pending transactions' LSNs, and once the ack LSN increases, find the point up to which everything should be committed via binary search, for example.