Synchronous replication #980

Open
3 of 17 tasks
kostja opened this issue Aug 13, 2015 · 1 comment
Labels: feature (A new functionality), replication

Comments

kostja commented Aug 13, 2015

I am updating this ticket to list the RAFT implementation plan as I see it. It turns out the plan currently exists nowhere but in my head.

  • in-memory WAL. We need to run the RAFT state machine strictly in the WAL thread in order to maximally condense and simplify the implementation. There are in fact 3 state machines: append entries, recovery and leader election, so append entries and leader election have to reside in the WAL thread.
  • use SWIM for replication failure detection. Right now each node runs a heartbeat protocol over the relay/applier channels, but this heartbeat cannot be directly applied to RAFT, since it goes from the follower to the leader, while RAFT heartbeats go from the leader to the follower. Since the replication factor is per space, we will in fact have multiple RAFT instances running inside each Tarantool instance, and should use SWIM heartbeats for all of them.
  • automatically build a full-mesh topology, using SWIM for topology discovery. RAFT cannot work in non-full-mesh topologies, and the best bet is to assemble the full mesh automatically from all partial topologies.
  • implement replication filters, i.e. make sure a node only sends its own changes to followers. Now that we have a replication filter id in the protocol this is easy, and once all topologies are automatically upgraded to the full mesh it will become mandatory, since otherwise replication would stream a lot of unnecessary changes when the replication factor is large.
  • implement a durable commit log, also known as semi-sync mode: the WAL thread doesn't ack back to the TX thread before obtaining a majority of responses from replicas. This is what some people call qsync replication (see the sketch after this list).
  • make sure all system tables use semi-sync mode, to ensure replication topology changes do not break consistency. At this step also introduce the concept of groups and group properties. Each space from now on will belong to a named group, and group attributes include features like sync, async, and the node uuid list for the members of the group.
  • add a system table for the RAFT state and implement the RAFT leader election machinery.
  • implement a proxy module. With RAFT, we need to forward writes to the leader transparently to the user, and this is the task of the new listen implementation in the proxy.
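
A minimal, hypothetical sketch of the semi-sync ("qsync") ack rule from the durable commit log item above: the WAL thread counts replica acknowledgements for an entry's LSN and reports success to the TX thread only once a strict majority of the cluster has confirmed the write. The struct and function names are illustrative, not Tarantool's actual API.

```c
#include <stdbool.h>
#include <stdint.h>

/* One entry per remote replica. */
struct replica_ack {
    uint32_t replica_id;
    int64_t  acked_lsn;      /* last LSN confirmed by this replica */
};

/* Return true once a strict majority of the cluster (n_replicas remote
 * nodes plus the local one) has confirmed the entry with the given LSN,
 * so the WAL thread may finally ack the TX thread. */
static bool
wal_quorum_reached(const struct replica_ack *acks, int n_replicas,
                   int64_t entry_lsn)
{
    int confirmed = 1;                       /* the local WAL write counts */
    for (int i = 0; i < n_replicas; i++) {
        if (acks[i].acked_lsn >= entry_lsn)
            confirmed++;
    }
    return confirmed > (n_replicas + 1) / 2; /* strict majority */
}
```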

QA

  • basic functional tests - @rtsisyk (some tests already exist on bsync branch)
  • functional tests for slave rollback
  • functional tests for master rollback
  • fix the test system to allow writing cluster tests (Cluster testing, test-run#11)
  • error injection tests
  • functional tests and fixes for master rollback on conflict
  • functional tests for slave reconnect after a timeout
  • functional tests for re-election on master shutdown
  • functional tests for 3-2-1-0-1-2-3 restart of the cluster
@kostja kostja added this to the 1.6.6 milestone Aug 13, 2015
@kostja kostja modified the milestones: 1.6.7, 1.6.6, 1.7, 1.7.0 Aug 25, 2015
@kostja kostja added the feature A new functionality label Aug 31, 2015
@rtsisyk rtsisyk assigned rtsisyk and unassigned anton-barabanov Nov 13, 2015
@kostja kostja mentioned this issue Feb 18, 2016
@kostja kostja changed the title from "synchronous replication todo" to "bsync" Feb 18, 2016
@rtsisyk rtsisyk modified the milestones: 1.8.0, 1.7.0 Jun 7, 2016
@kostja kostja modified the milestones: 1.8.0, 1.8.1 Mar 31, 2017
@kostja kostja changed the title from "bsync" to "Synchronous replication" Aug 11, 2017
@tarantool tarantool deleted a comment from rtsisyk Aug 17, 2017
@kostja kostja modified the milestones: 1.8.3, 1.8.4, 1.8.5 Oct 27, 2017
@Gerold103 Gerold103 self-assigned this Feb 9, 2018
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Do not start a transaction for each local journal or final join row
but follow transaction boundaries instead.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
In the case of synchronous replication there are two instance
vclocks: the last written one and the last committed one.
Even though they are currently always equal, track them separately
as part of the preparation for synchronous replication.

Part of #980
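
A rough illustration (not the real data structures) of keeping the two instance vclocks mentioned in this commit: the written clock advances on every successful local WAL write, while the committed clock advances only once the entry is confirmed.

```c
#include <stdint.h>

#define VCLOCK_MAX 32

struct vclock {
    int64_t lsn[VCLOCK_MAX];
};

/* Hypothetical container for the two per-instance clocks. */
struct instance_clocks {
    struct vclock written;    /* advanced as soon as the WAL write succeeds */
    struct vclock committed;  /* advanced only once the entry is confirmed */
};

/* On a successful local WAL write only the "written" clock moves.
 * replica_id is assumed to be < VCLOCK_MAX. */
static void
clocks_on_write(struct instance_clocks *c, uint32_t replica_id, int64_t lsn)
{
    c->written.lsn[replica_id] = lsn;
}

/* On commit confirmation the "committed" clock catches up; without
 * synchronous replication the two clocks always stay equal. */
static void
clocks_on_commit(struct instance_clocks *c, uint32_t replica_id, int64_t lsn)
{
    c->committed.lsn[replica_id] = lsn;
}
```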
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Implement a cord region cache which allows getting a dedicated
memory region and putting it back. This is a more general
approach than the one the txn cache uses.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Since a cord has a region cache, there is no point in implementing
the same feature for transactions only.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
The WAL does not take any responsibility for transaction
commit or rollback. The only thing the wal module does is
write journal entries and report the write status.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
In the case of synchronous replication a transaction could be
written to the WAL but rolled back on conflict. So it is not
enough to check that the WAL was synced during checkpointing
or initial join feeding. To handle this the txn engine emits a
zero-rows transaction which is passed through the WAL and then
committed in the same queue as standard transactions. So a
successful txn_sync means that all previous transactions
were committed, not just written.

Part of #980
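
A rough sketch of the zero-rows barrier described above, with hypothetical helper names: txn_sync() submits an empty entry through the same commit queue and waits for it, so FIFO ordering guarantees that every earlier transaction was committed, not merely written.

```c
#include <stdbool.h>
#include <stddef.h>

struct journal_entry {
    size_t n_rows;        /* 0 for the barrier entry */
    bool   is_committed;
};

/* Assumed helpers: push an entry into the WAL commit queue and block the
 * caller until the queue reports it committed. */
extern void commit_queue_push(struct journal_entry *entry);
extern void commit_queue_wait(struct journal_entry *entry);

/* txn_sync(): returns only when every transaction submitted before the
 * barrier has been committed, not merely written to the WAL. */
static void
txn_sync(void)
{
    struct journal_entry barrier = { .n_rows = 0, .is_committed = false };
    commit_queue_push(&barrier);  /* goes through the same queue as writes */
    commit_queue_wait(&barrier);  /* FIFO: all earlier entries commit first */
}
```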
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
The WAL uses a matrix clock (mclock) to track the vclocks reported
by the relays. This allows the WAL to build the minimal boundary
vclock which can be used to collect unused WAL files. Box protects
logs from collection using the wal_set_first_checkpoint call.
In order to preserve logs while joining, gc tracks all join vclocks
as checkpoints with a special mark - is_join_readview set to true.

Also, there is no longer a gc consumer in the tx thread or gc consumer
info in the box.info output, and the corresponding lines were
commented out of the test output.

Part of #3795, #980
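
An illustrative sketch of the matrix clock idea, with made-up names: one vclock row per replica, and the garbage-collection boundary is the component-wise minimum over all rows; WAL files lying entirely below that boundary are not needed by any replica.

```c
#include <stdint.h>

#define VCLOCK_MAX  32
#define REPLICA_MAX 32

struct vclock {
    int64_t lsn[VCLOCK_MAX];
};

/* One vclock row per tracked replica, as last reported by its relay. */
struct mclock {
    int nrows;
    struct vclock row[REPLICA_MAX];
};

/* Build the boundary vclock: WAL files entirely below it may be
 * garbage-collected without breaking any replica. */
static void
mclock_boundary(const struct mclock *m, struct vclock *out)
{
    for (int c = 0; c < VCLOCK_MAX; c++) {
        int64_t min = INT64_MAX;
        for (int r = 0; r < m->nrows; r++) {
            if (m->row[r].lsn[c] < min)
                min = m->row[r].lsn[c];
        }
        out->lsn[c] = m->nrows > 0 ? min : 0;
    }
}
```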
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
* xrow buffer structure
  Introduce an xrow buffer which stores encoded xrows in memory
  after a transaction is finished. The WAL uses an xrow buffer
  object to encode transactions and then writes the encoded data
  to a log file. An xrow buffer consists of not more than
  XROW_BUF_CHUNK_COUNT rotating chunks organized in a ring. The
  rotation thresholds and XROW_BUF_CHUNK_COUNT are empiric values
  for now.

* xrow buffer cursor
  This structure allows finding an xrow buffer row with a vclock
  less than a given one and then fetching rows one by one, forward
  to the last appended row. An xrow buffer cursor is essential for
  from-memory replication and will be used by a relay to fetch all
  logged rows stored in the WAL memory (implemented as an xrow
  buffer) from a given position and then follow all new changes.

Part of #3974, #980
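
A sketch of the rotating chunk ring described above (all names, including the XROW_BUF_CHUNK_COUNT value, are illustrative): a fixed number of chunks is reused in a circle, and on rotation the oldest chunk is recycled, making its rows unavailable to in-memory readers.

```c
#include <stddef.h>
#include <stdint.h>

#define XROW_BUF_CHUNK_COUNT 8   /* illustrative, not the real value */

struct xrow_buf_chunk {
    char   *data;        /* encoded rows */
    size_t  used;
    int64_t first_lsn;   /* lowest LSN stored in this chunk, -1 if empty */
};

struct xrow_buf {
    struct xrow_buf_chunk chunk[XROW_BUF_CHUNK_COUNT];
    uint64_t first;      /* ever-growing index of the oldest kept chunk */
    uint64_t last;       /* ever-growing index of the newest chunk */
};

/* Rotate: start a new chunk; when the ring is full the oldest chunk is
 * recycled and its rows become unreadable for in-memory cursors. */
static struct xrow_buf_chunk *
xrow_buf_rotate(struct xrow_buf *buf)
{
    buf->last++;
    if (buf->last - buf->first >= XROW_BUF_CHUNK_COUNT)
        buf->first++;    /* drop the oldest chunk */
    struct xrow_buf_chunk *chunk =
        &buf->chunk[buf->last % XROW_BUF_CHUNK_COUNT];
    chunk->used = 0;
    chunk->first_lsn = -1;
    return chunk;
}
```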
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
The WAL uses an xrow buffer object to encode transactions and then
writes the encoded data to a log file, so the encoded data still
lives in memory for some time after a transaction is finished.

Part of #3794, #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Fetch data from the WAL in-memory buffer. The WAL allows starting a
fiber which creates an xrow buffer cursor at a given vclock and then
fetches rows from the xrow buffer one by one, calling a given callback
for each row. The WAL relaying fiber also sends a heartbeat message if
all rows were processed and no rows were written for the replication
timeout period.
In case of an outdated vclock (the WAL could not create a cursor
or fetch a new row from the cursor) the relay switches to reading
logged data from file up to the current vclock and then makes the
next attempt to fetch data from WAL memory.
In file mode there is always data to send to a replica, so the relay
does not have to send heartbeat messages.
From this point on, a relay creates a cord only when it switches to
reading from file. Frequent memory-file oscillation is not very
likely for two reasons:
 1. If a replica is too slow (slower than the master writes), it will
    switch to disk and then fall behind.
 2. If a replica is fast enough, it will catch up with memory and then
    consume the memory buffer before its rotation.
In order to split the WAL and relay logic, a relay filter function
was introduced which should be passed when the relay attaches
to the WAL.

Note: WAL exit is not graceful - tx sends a break loop message and
the WAL just stops cbus processing without any care about other
fibers which could still use cbus. To overcome this there is a
special trigger which is signaled just before the cbus pipe is
destroyed.

Close #3794
Part of #980
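
A schematic of the memory-first, file-fallback relay loop described in this commit; the functions are assumed placeholders, not Tarantool's real API.

```c
#include <stdbool.h>

struct vclock;

/* Serve the replica from the WAL in-memory buffer; returns false when the
 * requested position has already been rotated out of memory. */
extern bool relay_send_from_memory(struct vclock *pos);
/* Read already-logged data from .xlog files up to the current vclock. */
extern void relay_send_from_file(struct vclock *pos);

/* The relay keeps trying the memory path and only falls back to files when
 * the replica lags behind the in-memory buffer; after catching up from
 * files it retries the memory path. */
static void
relay_loop(struct vclock *pos)
{
    for (;;) {
        if (relay_send_from_memory(pos))
            continue;               /* the replica keeps up with memory */
        relay_send_from_file(pos);  /* catch up from disk, then retry */
    }
}
```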
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Synchronous WAL writes require a queue which contains written but
not yet confirmed journal entries. WAL checkpoint requests as well
as WAL sync requests should also be processed through this queue.
So, in order to make write, checkpoint and sync requests as similar
as possible, we can eliminate WAL rotation from the checkpoint
logic.

This patch also adds a box.internal.wal_rotate() function which
rotates the WAL and is used to preserve previous test behavior.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
The synchronous replication implementation suggests a queue which
collects all written journal entries until they are committed,
so WAL synchronization requests should be processed using that
queue.
This patch introduces journal entry flags and allows the
WAL to process checkpoint requests and writes through the
common write code.
The patch also adds a vclock to the journal entry in order to
return a checkpoint vclock. Currently this looks like weird
overhead, but dedicated journal entries will be used by the
synchronous replication implementation to decide which
journal entry is committed.

Part of #980
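
An illustrative take on the journal entry flags idea, with hypothetical names and values: checkpoint and sync requests travel through the same write queue as ordinary entries and differ only by a flag, and a checkpoint entry carries a vclock pointer so the caller gets the checkpoint vclock back.

```c
#include <stdint.h>

struct vclock;

/* Hypothetical flag values; ordinary writes carry no flags. */
enum journal_entry_flags {
    JOURNAL_ENTRY_CHECKPOINT = 1 << 0,  /* report a checkpoint vclock */
    JOURNAL_ENTRY_SYNC       = 1 << 1,  /* barrier: just pass the queue */
};

struct journal_entry {
    uint32_t flags;
    struct vclock *vclock;   /* filled in for checkpoint requests */
};

/* Assumed low-level helpers. */
extern void wal_write_rows(struct journal_entry *entry);
extern void wal_fill_checkpoint_vclock(struct journal_entry *entry);

/* Every request kind goes through the common write path; flags only add
 * extra steps on top of it. */
static void
wal_process_entry(struct journal_entry *entry)
{
    wal_write_rows(entry);
    if (entry->flags & JOURNAL_ENTRY_CHECKPOINT)
        wal_fill_checkpoint_vclock(entry);
    /* JOURNAL_ENTRY_SYNC needs no extra work: completing through the
     * queue is itself the synchronization point. */
}
```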
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
In some cases (for example when there is no data update during
recovery) a vy_tx could be inserted twice into the corresponding
writers list because the vy_tx has an empty log. So check whether
a vy_tx is already inserted.
This was not detected before because we did not preserve
transaction boundaries during recovery until now.

Part of #980
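
A tiny sketch of the fix, using a made-up intrusive list rather than the real vy_tx structures: link the transaction into the writers list only if it is not already there.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for vy_tx, keeping only what the fix needs. */
struct tx_sketch {
    struct tx_sketch *next_writer;
    bool in_writers;     /* already linked into the writers list? */
};

/* The fix in miniature: never insert the same transaction twice. */
static void
writers_add_once(struct tx_sketch **head, struct tx_sketch *tx)
{
    if (tx->in_writers)
        return;
    tx->next_writer = *head;
    *head = tx;
    tx->in_writers = true;
}
```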
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Refactoring: track the recovery journal vclock instead of using
the recovery one. The replicaset vclock now relies on the recovery
stream content instead of the WAL directory content (xlog names and
meta). This enables the applier to use this journal and generalizes
WAL recovery and applier final join handling.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Add an internal routine which copies an xrow to a given region.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
An anonymous replica's vclock could be behind the master's, so
we should process the final join stage from the replica's vclock,
not the master's, in order not to skip master transactions between
the replica and master vclocks.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
There are minor changes in the joining process, made in order to
unify anonymous replica registration and the full join process.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
The fetch snapshot, register and full join routines implement
the same logic and can be refactored to share common routines.
This helps the further synchronous replication implementation.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
This patch introduces a special request with a vclock. Such a
request means that all queued transactions up to the ack vclock
are committed. Any non-read-only instance can emit such an ack
request after the corresponding transactions have been written to
the WAL. An ack request is also processed through relays and
appliers, and in the case of a read-only replica it is the only
way to commit a replicated transaction.
This patch prepares tarantool to emit an ack request only after
the required majority has been reached.

Part of #980
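
A sketch of how such an ack request could be processed, with hypothetical types: every queued entry whose vclock does not exceed the ack vclock is committed, the rest stay queued until the next ack.

```c
#include <stdbool.h>
#include <stdint.h>

#define VCLOCK_MAX 32

struct vclock {
    int64_t lsn[VCLOCK_MAX];
};

/* A written-but-unconfirmed transaction waiting in the commit queue. */
struct queued_entry {
    struct queued_entry *next;
    struct vclock vclock;    /* WAL position of this entry */
};

static bool
vclock_le(const struct vclock *a, const struct vclock *b)
{
    for (int i = 0; i < VCLOCK_MAX; i++)
        if (a->lsn[i] > b->lsn[i])
            return false;
    return true;
}

/* Assumed helper that finishes the commit on the TX side. */
extern void txn_complete_commit(struct queued_entry *entry);

/* Commit every entry at or below the acked vclock; the rest stay queued
 * until the next ack request arrives. */
static void
queue_process_ack(struct queued_entry **head, const struct vclock *ack)
{
    while (*head != NULL && vclock_le(&(*head)->vclock, ack)) {
        struct queued_entry *entry = *head;
        *head = entry->next;
        txn_complete_commit(entry);
    }
}
```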
@kyukhin kyukhin modified the milestones: 2.6.1, wishlist Oct 23, 2020