relay: in-memory relay directly from WAL thread #3794

Closed
kostja opened this issue Nov 8, 2018 · 0 comments
Labels
feature (A new functionality), replication

Comments

@kostja
Contributor

kostja commented Nov 8, 2018

Currently, Tarantool first writes rows to the local .xlog file, then reads and unpacks this file in a separate relay thread to send them to the replica.
We can speed this up quite a bit by relaying the rows to all subscribed replicas directly from the WAL thread using non-blocking I/O. When a replica starts to lag too far behind, we could spawn a dedicated relay thread for it, just like we do today.
We will need to centralize transaction validation when we add synchronous replication, and this is a prerequisite for that step: the leader's WAL thread will be able to validate all changes from all replicas in the replica set.
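To make the proposed flow concrete, here is a minimal sketch, not Tarantool code: relay_slot, MAX_PENDING and spawn_file_relay are invented names. It shows how the WAL thread could fan freshly written rows out over non-blocking sockets and demote a lagging replica to today's file-reading relay thread.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical per-replica slot kept by the WAL thread. */
struct relay_slot {
	int fd;			/* non-blocking socket to the replica */
	size_t pending;		/* bytes not yet delivered (would shrink on ACKs) */
	bool in_file_mode;	/* demoted to a dedicated file-reading relay */
};

enum { MAX_PENDING = 16 * 1024 * 1024 };	/* illustrative lag threshold */

/* Start the classic relay thread that reads .xlog files (provided elsewhere). */
void spawn_file_relay(struct relay_slot *slot);

/* Called right after a WAL write: fan the encoded rows out to replicas. */
void
wal_relay_broadcast(struct relay_slot *slots, int n_slots,
		    const char *rows, size_t size)
{
	for (int i = 0; i < n_slots; i++) {
		struct relay_slot *s = &slots[i];
		if (s->in_file_mode)
			continue;	/* a separate relay thread serves it */
		ssize_t sent = send(s->fd, rows, size, MSG_DONTWAIT);
		if (sent < 0)
			sent = 0;	/* EAGAIN or error: count everything as lag */
		s->pending += size - (size_t)sent;
		if (s->pending > MAX_PENDING) {
			/* The replica lags too far behind: hand it over to
			 * its own relay thread reading .xlog, as today. */
			s->in_file_mode = true;
			spawn_file_relay(s);
		}
	}
}
```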

@kostja kostja added replication feature A new functionality labels Nov 8, 2018
@kyukhin kyukhin added this to the 2.2.0 milestone Dec 7, 2018
GeorgyKirichenko pushed a commit that referenced this issue Jan 21, 2019
Use the WAL thread to relay rows. The relay writer and reader fibers live
in the WAL thread. The writer fiber uses WAL memory to send rows to the peer
when possible, or spawns cords to recover rows from a file.

The WAL writer stores rows in two memory buffers, swapped when a buffer
threshold is reached. Memory buffers are split into chunks holding rows with
the same server id, bounded by a chunk threshold. To search for a position
in WAL memory, all chunks are indexed by the wal_mem_index array. Each
wal_mem_index entry contains the corresponding replica_id and the vclock
just before the first row of the chunk, as well as the chunk's buffer
number, position, and size in memory.

Closes: #3794
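For illustration only, a sketch of what one entry of that index could look like; the fields follow the list above (replica_id, the vclock before the chunk's first row, buffer number, position, size), while the concrete types and the VCLOCK_MAX constant are assumptions.

```c
#include <stdint.h>
#include <stddef.h>

enum { VCLOCK_MAX = 32 };	/* max replica ids, an assumed limit */

/* Simplified vclock: one LSN per replica id. */
struct vclock {
	int64_t lsn[VCLOCK_MAX];
};

/*
 * One entry of the wal_mem_index array.  A chunk groups consecutive rows
 * with the same server id; the entry records which of the two in-memory
 * buffers holds the chunk and where, plus the vclock just before the
 * chunk's first row, so a relay can locate the position matching a
 * subscriber's vclock.
 */
struct wal_mem_index {
	uint32_t replica_id;	/* server id of the rows in the chunk */
	struct vclock vclock;	/* vclock just before the chunk's first row */
	uint8_t buf_no;		/* which of the two memory buffers (0 or 1) */
	size_t pos;		/* chunk offset inside that buffer */
	size_t size;		/* chunk size in bytes */
};
```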
GeorgyKirichenko pushed a commit that referenced this issue Feb 27, 2019
Encode transactions into a memory buffer and then write the encoded data
from the buffer to the xlog file. WAL memory is organized as a number of
rotating memory buffers, each containing an array of row headers and a raw
data storage. More than two buffers are used in order to increase average
buffer utilization. Rows are kept this way to make it possible to filter
data while relaying in the future.

Need for: #3794
@kyukhin kyukhin added the prio1 label Apr 8, 2019
@kyukhin kyukhin modified the milestones: 2.2.0, 2.3.0 Jul 29, 2019
GeorgyKirichenko pushed a commit that referenced this issue Aug 13, 2019
The relay uses both xlog files and WAL memory to feed a replica with
transaction data. The new workflow is to read xlog files up to the last
one and then switch to WAL memory. If the relay runs out of WAL in-memory
data, it returns to file mode. A heartbeat is sent from the WAL thread
only while in in-memory mode, because in file mode there is always data
to send.

Closes #3794
GeorgyKirichenko pushed a commit that referenced this issue Sep 18, 2019
Don't use the on_close_log trigger to track xlog file boundaries. Since we
intend to implement in-memory replication, the relay may perform no xlog
file operations at all and thus cannot rely on that trigger being invoked.
Now the consumer state is advanced together with the relay vclock. After
the parallel applier implementation the relay will not receive an ACK
packet for each transaction (because the applier groups them), so it
should not be too expensive to advance gc on each relay vclock update.

Prerequisites: #3794
GeorgyKirichenko pushed a commit that referenced this issue Sep 18, 2019
This structure makes it possible to find an xrow buffer row positioned
before a given vclock and then fetch rows one by one, forward to the last
appended one. An xrow buffer cursor is essential for from-memory
replication: the relay must be able to fetch all logged rows stored in
WAL memory (implemented as the xrow buffer) from a given position and
then follow all new changes.

Prerequisites: #3794
GeorgyKirichenko pushed a commit that referenced this issue Sep 18, 2019
The relay tries to create a WAL memory cursor and follow it to relay data
to its replica. If the relay fails to attach to the WAL memory buffer, or
falls out of the buffer, it recovers xlogs from files and then makes a new
attempt to attach.

Closes: #3794
GeorgyKirichenko pushed a commit that referenced this issue Oct 9, 2019
Don't use the on_close_log trigger to track xlog file boundaries. Since we
intend to implement in-memory replication, the relay may perform no xlog
file operations at all and thus cannot rely on that trigger being invoked.
Now the consumer state is advanced together with the relay vclock. After
the parallel applier implementation the relay will not receive an ACK
packet for each transaction (because the applier groups them), so it
should not be too expensive to advance gc on each relay vclock update.

Note: this changes cluster gc behavior - an instance's gc will hold only
its locally generated transactions. It is also only a temporary solution
until relay processing is merged into the WAL writer context, where WAL
will process relay ACK requests as well as log writing and redundancy
evaluation.

Part of #3794
GeorgyKirichenko pushed a commit that referenced this issue Oct 9, 2019
Introduce an xrow buffer to store encoded xrows in memory after a
transaction has finished. WAL uses an xrow buffer object to encode
transactions and then writes the encoded data to a log file, so the
encoded data stays in memory for some time after the transaction is
finished and cleared by the engine.
The xrow buffer consists of at most XROW_BUF_CHUNK_COUNT rotating chunks
organized in a ring. The rotation thresholds and XROW_BUF_CHUNK_COUNT are
empirical values and are hardcoded.

Part of #3794
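As a rough mental model of that ring: XROW_BUF_CHUNK_COUNT is taken from the message above, while the field names, the chunk threshold value, and the rotation helper are illustrative guesses, not the actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

enum {
	XROW_BUF_CHUNK_COUNT = 8,		/* assumed ring size */
	XROW_BUF_CHUNK_THRESHOLD = 1 << 20,	/* assumed rotation limit */
};

/* One stored row: a header describing where its encoded body lives. */
struct xrow_rec {
	uint32_t replica_id;
	int64_t lsn;
	size_t offset;		/* offset of the encoded row in chunk data */
	size_t size;		/* encoded size in bytes */
};

/* One chunk: an array of row headers plus the raw encoded data. */
struct xrow_buf_chunk {
	struct xrow_rec *rows;
	size_t row_count;
	char *data;
	size_t data_size;
};

/* The buffer: a ring of chunks; the newest chunk receives new rows. */
struct xrow_buf {
	struct xrow_buf_chunk chunk[XROW_BUF_CHUNK_COUNT];
	size_t first;		/* oldest chunk still kept in memory */
	size_t last;		/* chunk currently being written */
};

/* Rotate to the next chunk once the current one crosses the threshold,
 * dropping the oldest chunk when the ring is full. */
void
xrow_buf_rotate(struct xrow_buf *buf)
{
	buf->last = (buf->last + 1) % XROW_BUF_CHUNK_COUNT;
	if (buf->last == buf->first)
		buf->first = (buf->first + 1) % XROW_BUF_CHUNK_COUNT;
	buf->chunk[buf->last].row_count = 0;
	buf->chunk[buf->last].data_size = 0;
}
```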
GeorgyKirichenko pushed a commit that referenced this issue Oct 9, 2019
This structure makes it possible to find an xrow buffer row positioned
before a given vclock and then fetch rows one by one, forward to the last
appended one. An xrow buffer cursor is essential for from-memory
replication: the relay must be able to fetch all logged rows stored in
WAL memory (implemented as the xrow buffer) from a given position and
then follow all new changes.

Part of #3794
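A self-contained toy model of the cursor idea, with all names invented: the real cursor works over the chunked xrow buffer, while here a flat row array stands in for it. The point is the pattern: position on the first row the replica has not confirmed, then keep fetching forward.

```c
#include <stddef.h>
#include <stdint.h>

enum { VCLOCK_MAX = 32 };

struct vclock {
	int64_t lsn[VCLOCK_MAX];
};

struct mem_row {
	uint32_t replica_id;
	int64_t lsn;
	/* the encoded row body would follow in the real buffer */
};

/* A cursor over an in-memory row array standing in for the xrow buffer. */
struct mem_cursor {
	const struct mem_row *rows;
	size_t row_count;
	size_t pos;
};

/*
 * Position the cursor on the first row not yet covered by the replica's
 * vclock.  The real cursor creation can fail when the wanted position
 * has already been rotated out of memory; the relay then falls back to
 * reading .xlog files.
 */
void
mem_cursor_create(struct mem_cursor *cur, const struct mem_row *rows,
		  size_t row_count, const struct vclock *replica_clock)
{
	cur->rows = rows;
	cur->row_count = row_count;
	for (cur->pos = 0; cur->pos < row_count; cur->pos++) {
		const struct mem_row *row = &rows[cur->pos];
		if (row->lsn > replica_clock->lsn[row->replica_id])
			break;
	}
}

/* Fetch the next row to relay, or NULL once the cursor has caught up. */
const struct mem_row *
mem_cursor_next(struct mem_cursor *cur)
{
	if (cur->pos >= cur->row_count)
		return NULL;
	return &cur->rows[cur->pos++];
}
```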
GeorgyKirichenko pushed a commit that referenced this issue Oct 9, 2019
Fetch data from the WAL in-memory buffer. WAL allows starting a fiber
which creates an xrow buffer cursor at a given vclock, then fetches rows
from the xrow buffer one by one and calls a given callback for each row.
The WAL relaying fiber also sends a heartbeat message if all rows were
processed and no rows were written for the replication timeout period.
The relay connects to WAL with the vclock known for the replica and tries
to relay data. In case of an outdated vclock (WAL could not create a
cursor or fetch a new row from it), the relay falls back to reading logged
data from file and then makes another attempt to connect to WAL with the
updated vclock, and so on.
In file mode the relay already has data to send to the replica, so from
now on the relay has no duty to send heartbeat messages - that is done by
the WAL relay fiber while it waits for new transactions written by WAL.

Closes #3794
GeorgyKirichenko pushed a commit that referenced this issue Nov 19, 2019
This refactoring helps to implement wal memory buffer.

Part of #980, #3794
GeorgyKirichenko pushed a commit that referenced this issue Nov 19, 2019
A matrix clock allows maintaining a set of vclocks and the ordering of
their components. The main goal is to be able to build a vclock in which
each lsn is greater than or equal to the corresponding lsn in at least
n vclocks of the matrix clock (an n-majority).

For instance, if we have vclocks:
 1: {1: 10, 2: 12, 3: 0}
 2: {1: 10, 2: 14, 3: 1}
 3: {1: 0, 2: 8, 3: 4}
then majority 1 builds {1: 10, 2: 14, 3: 4},
whereas majority 2 builds {1: 10, 2: 12, 3: 1},
and majority 3 builds {1: 0, 2: 8, 3: 0}.

The purpose of the matrix clock is to evaluate a vclock which has already
been processed by WAL consumers such as relays, or to obtain a majority
vclock for committing journal entries in the case of synchronous
replication.

Part of #980, #3794
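The n-majority from that example can be reproduced by taking, for every vclock component, the n-th largest value across the tracked vclocks, so each component of the result is reached by at least n vclocks. A small standalone sketch (not the Tarantool mclock API) that prints exactly the three majorities above:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

enum { NODES = 3, COMPONENTS = 4 };	/* component 0 unused, ids 1..3 */

static int cmp_desc(const void *a, const void *b)
{
	int64_t x = *(const int64_t *)a, y = *(const int64_t *)b;
	return (x < y) - (x > y);	/* sort in descending order */
}

/* For every component take the n-th largest value across the vclocks. */
static void
mclock_majority(int64_t clocks[NODES][COMPONENTS], int n, int64_t *out)
{
	for (int c = 1; c < COMPONENTS; c++) {
		int64_t column[NODES];
		for (int i = 0; i < NODES; i++)
			column[i] = clocks[i][c];
		qsort(column, NODES, sizeof(column[0]), cmp_desc);
		out[c] = column[n - 1];
	}
}

int main(void)
{
	/* The vclocks from the commit message. */
	int64_t clocks[NODES][COMPONENTS] = {
		{0, 10, 12, 0},
		{0, 10, 14, 1},
		{0,  0,  8, 4},
	};
	for (int n = 1; n <= NODES; n++) {
		int64_t out[COMPONENTS] = {0};
		mclock_majority(clocks, n, out);
		printf("majority %d: {1: %lld, 2: %lld, 3: %lld}\n",
		       n, (long long)out[1], (long long)out[2],
		       (long long)out[3]);
	}
	return 0;
}
```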
GeorgyKirichenko pushed a commit that referenced this issue Nov 20, 2019
WAL uses a matrix clock (mclock) to store all vclocks acknowledged by
relays. This allows WAL to build the minimal vclock and then collect all
files that precede it. Box protects logs from collection with the
wal_set_first_checkpoint call, which instructs WAL not to unlink logs
after the given vclock.
To preserve logs while a replica is joining, gc tracks all join vclocks
as checkpoints with a special mark - is_join_readview set to true.

Part of #3794, #980
GeorgyKirichenko pushed a commit that referenced this issue Nov 20, 2019
WAL uses an xrow buffer object to encode transactions and then writes the
encoded data to a log file, so the encoded data stays in memory for some
time after a transaction is finished.

Part of #3794, #980
@kyukhin kyukhin added this to the 2.4.1 milestone Feb 12, 2020
GeorgyKirichenko pushed a commit that referenced this issue Mar 4, 2020
Fetch data from the WAL in-memory buffer. WAL allows starting a fiber
which creates an xrow buffer cursor at a given vclock, then fetches rows
from the xrow buffer one by one and calls a given callback for each row.
The WAL relaying fiber also sends a heartbeat message if all rows were
processed and no rows were written for the replication timeout period.
In case of an outdated vclock (WAL could not create a cursor or fetch a
new row from it), the relay switches to reading logged data from file up
to the current vclock and then makes the next attempt to fetch data from
WAL memory (see the sketch below).
In file mode there is always data to send to the replica, so the relay
does not have to send heartbeat messages.
From this point on, the relay creates a cord only when it switches to
reading from file. Frequent memory-file oscillation is not very likely,
for two reasons:
 1. If the replica is too slow (slower than the master writes), it will
    switch to disk and then fall behind.
 2. If the replica is fast enough, it will catch up with the memory
    buffer and then consume it before it is rotated.
To split the WAL and relay logic, a relay filter function was introduced
which should be passed when the relay attaches to WAL.

Note: WAL exit is not graceful - tx sends a break-loop message and WAL
just stops cbus processing without any care for other fibers which could
still be using cbus. To overcome this there is a special trigger which is
signaled just before the cbus pipe is destroyed.

Close #3794
Part of #980
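The resulting control flow, reduced to a skeleton; wal_relay_from_memory, relay_recover_from_files and relay_is_stopped are made-up names standing for the behavior described in the message, not real functions.

```c
#include <stdbool.h>

struct vclock;	/* the replica's current replication position */

/* Follow the WAL in-memory buffer, sending rows and heartbeats to the
 * replica.  Returns when the wanted position is no longer available in
 * the buffer (the replica's vclock is outdated) or the relay stops. */
void wal_relay_from_memory(struct vclock *replica_clock);

/* Catch up from .xlog files in a separate cord up to the current vclock;
 * there is always data to send here, so no heartbeats are needed. */
void relay_recover_from_files(struct vclock *replica_clock);

bool relay_is_stopped(void);

/* Alternate between the in-memory and the file modes until stopped. */
void
relay_subscribe_loop(struct vclock *replica_clock)
{
	while (!relay_is_stopped()) {
		wal_relay_from_memory(replica_clock);
		if (relay_is_stopped())
			break;
		/* Outdated vclock: read the missing range from files,
		 * then retry the in-memory mode with the updated vclock. */
		relay_recover_from_files(replica_clock);
	}
}
```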
sergepetrenko pushed a commit that referenced this issue Mar 13, 2020
A matrix clock allows maintaining a set of vclocks and the ordering of
their components. The main goal is to be able to build a vclock in which
each lsn is less than or equal to the corresponding lsn in at least
n vclocks of the matrix clock.

The purpose of the matrix clock is to evaluate a vclock which has already
been processed by WAL consumers such as relays, or to obtain a majority
vclock for committing journal entries in the case of synchronous
replication.

@sergepetrenko: refactoring & rewrite comments to doxygen style.

Part of #980, #3794
Prerequisite #4114
sergepetrenko pushed a commit that referenced this issue Mar 13, 2020
WAL uses a matrix clock (mclock) to track vclocks reported by relays.
This allows WAL to build the minimal boundary vclock, which is used to
collect unneeded files. Box protects logs from collection using the
wal_set_first_checkpoint() call.
To preserve logs while a replica is joining, gc tracks all join readview
vclocks as checkpoints with a special mark - is_join_readview set to
true.

Also, there is no longer a gc consumer in the tx thread; gc consumer info
was removed from box.info output and the corresponding lines were
commented out in the tests.

@sergepetrenko: reword some comments and do a bit of refactoring.

Part of #3794, #980
Prerequisite #4114
@kyukhin kyukhin modified the milestones: 2.4.1, 2.5.1 Mar 26, 2020
cyrillos pushed a commit that referenced this issue Apr 27, 2020
recovery_stop_local raises an exception in case of a recovery error, so
it is not safe to stop recovery from the recovery delete routine or from
the guard inside local_recovery. So call recovery_stop_local manually.

Part of #980, #3794
cyrillos added a commit that referenced this issue Apr 27, 2020
The class is not needed here; we are moving from C++ to C. This is a
first step toward making the recovery code fully C compliant.

Part of #3794

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
cyrillos added a commit that referenced this issue Apr 28, 2020
We're ready now to compile the recovery
code in plain C mode.

Part of #3794

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Introduce a routine which transfers journal entries from an input queue
to an output queue, writing them to an xlog file. On an xlog write the
routine breaks the transfer loop and returns the write result code. After
that the output queue contains the entries which were handed to the xlog
(regardless of the disk write status), whereas the input queue contains
untouched entries. If the input queue is processed without an actual xlog
write, the xlog file is flushed manually.
This refactoring helps to implement the WAL memory buffer.

Part of #980, #3794
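A schematic version of such a routine over a plain singly-linked queue; only the control flow (encode and move entries, stop after a real disk write, flush when none happened) mirrors the description, and the types and helper names are invented.

```c
#include <stdbool.h>
#include <stddef.h>

struct journal_entry {
	struct journal_entry *next;
	/* encoded rows would live here */
};

struct entry_queue {
	struct journal_entry *first, *last;
};

/* Encode one entry into the xlog write buffer.  Returns true when the
 * buffer was actually written to disk as a side effect (stub). */
bool xlog_write_entry(struct journal_entry *entry, int *write_rc);
int xlog_flush(void);

static struct journal_entry *
queue_pop(struct entry_queue *q)
{
	struct journal_entry *e = q->first;
	if (e != NULL) {
		q->first = e->next;
		if (q->first == NULL)
			q->last = NULL;
		e->next = NULL;
	}
	return e;
}

static void
queue_push(struct entry_queue *q, struct journal_entry *e)
{
	if (q->last != NULL)
		q->last->next = e;
	else
		q->first = e;
	q->last = e;
}

/*
 * Move entries from the input to the output queue, encoding them into
 * the xlog.  Stop as soon as a real disk write happened and return its
 * result; entries already handed to the xlog stay in the output queue
 * regardless of the write status.  If the input queue drains without a
 * write, flush the xlog manually.
 */
int
wal_transfer_entries(struct entry_queue *in, struct entry_queue *out)
{
	int rc = 0;
	struct journal_entry *e;
	while ((e = queue_pop(in)) != NULL) {
		bool wrote = xlog_write_entry(e, &rc);
		queue_push(out, e);
		if (wrote)
			return rc;
	}
	return xlog_flush();
}
```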
@kyukhin kyukhin modified the milestones: 2.5.1, 2.6.1 Jun 10, 2020
@kyukhin kyukhin modified the milestones: 2.6.1, wishlist Oct 23, 2020
@kyukhin kyukhin removed the prio1 label Nov 18, 2020
@kyukhin kyukhin removed this from the wishlist milestone Dec 25, 2020
@kyukhin kyukhin closed this as completed Dec 25, 2020