relay: in-memory relay directly from WAL thread #3794
Currently Tarantool first writes rows to the local .xlog file, then reads and unpacks this file in a separate thread to relay the rows to a replica.

We can speed this up quite a bit by relaying the rows to all subscribed replicas using non-blocking I/O directly from the WAL thread. When a replica starts to lag too far behind, we could spawn a dedicated relay thread for it, just like we do today.

We will also need to centralize transaction validation when we add synchronous replication, and this is a prerequisite for that step: the WAL thread of the leader will be able to validate all changes from all replicas in the replica set.
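To make the proposal concrete, here is a minimal sketch of what relaying straight from the WAL thread with non-blocking I/O could look like. Everything here - `wal_relay_row()`, `struct replica_conn`, the `lagging` flag - is a hypothetical illustration of the idea, not Tarantool's actual code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical per-replica state kept by the WAL thread. */
struct replica_conn {
	int fd;       /* replica socket, switched to non-blocking mode */
	bool lagging; /* fell behind: served by a dedicated relay thread */
};

/*
 * Hypothetical hook run in the WAL thread right after a row (already
 * encoded at row/len) has been appended to the xlog: try a
 * non-blocking send to every subscribed replica.
 */
static void
wal_relay_row(struct replica_conn *replicas, int count,
	      const char *row, size_t len)
{
	for (int i = 0; i < count; i++) {
		if (replicas[i].lagging)
			continue; /* already handed off to a relay thread */
		ssize_t sent = send(replicas[i].fd, row, len, MSG_DONTWAIT);
		if (sent == (ssize_t)len)
			continue; /* fast path: the replica keeps up */
		/*
		 * Error, short write or EWOULDBLOCK: the replica lags.
		 * Hand it off to its own relay thread, which re-reads
		 * the missing rows from the .xlog files (a real
		 * implementation would also have to track how much of
		 * the current row was already sent).
		 */
		replicas[i].lagging = true;
	}
}
```

The fast path never blocks the WAL thread; a replica that cannot keep up is demoted to the existing file-based relay, exactly as the description above suggests.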
GeorgyKirichenko pushed a commit that referenced this issue on Jan 21, 2019:
…-boundaries' into g.kirichenko/gh-3794-relay-from-wal
GeorgyKirichenko pushed a commit that referenced this issue on Jan 21, 2019:
…ring' into g.kirichenko/gh-3794-relay-from-wal
GeorgyKirichenko pushed a commit that referenced this issue on Jan 21, 2019:
Use the WAL thread to relay rows. The relay writer and reader fibers live in the WAL thread. The writer fiber uses WAL memory to send rows to the peer when possible, or spawns cords to recover rows from a file. The WAL writer stores rows in two memory buffers, swapped when a buffer threshold is reached. The memory buffers are split into chunks of rows with the same server id, bounded by a chunk threshold. To search for a position in WAL memory, all chunks are indexed by the wal_mem_index array. Each wal_mem_index entry contains the corresponding replica_id and the vclock just before the first row of the chunk, as well as the chunk's buffer number, position, and size in memory. Closes: #3794
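Taken literally, the description above implies an index entry of roughly the following shape. This is a sketch inferred from the commit message - the field names follow the text, the vclock stand-in is simplified, and the real definition may differ.

```c
#include <stdint.h>

enum { VCLOCK_MAX = 32 };

/* Simplified stand-in for tarantool's vclock: one lsn per replica id. */
struct vclock {
	int64_t lsn[VCLOCK_MAX];
};

/*
 * Sketch of one wal_mem_index entry: it locates a chunk of rows (all
 * sharing the same server id) inside one of the two memory buffers.
 */
struct wal_mem_index {
	uint32_t replica_id;  /* server id of every row in the chunk */
	struct vclock vclock; /* vclock just before the chunk's first row */
	uint8_t buf_no;       /* which of the two memory buffers */
	uint32_t pos;         /* chunk offset within that buffer */
	uint32_t size;        /* chunk size in bytes */
};
```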
GeorgyKirichenko pushed a commit that referenced this issue on Jan 22, 2019.
GeorgyKirichenko pushed a commit that referenced this issue on Feb 27, 2019:
Encode transactions into a memory buffer and only then write the encoded data from the buffer to the xlog file. WAL memory is organized as a number of rotating memory buffers, each containing a row-header array and raw data storage. More than two buffers are used to increase average buffer utilization. Rows are written this way to make it possible to filter data while relaying in the future. Needed for: #3794
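The per-buffer layout described above (a row-header array plus raw data storage) could be sketched like this; `struct wal_mem_buf` and its fields are illustrative, not the actual declaration.

```c
#include <stddef.h>

struct xrow_header; /* tarantool's decoded row header type */

/* One rotating WAL memory buffer: row headers kept apart from payload. */
struct wal_mem_buf {
	struct xrow_header *rows; /* headers of the buffered rows */
	size_t row_count;         /* number of rows currently buffered */
	char *data;               /* raw encoded row data */
	size_t used;              /* bytes used; compared against the */
	size_t capacity;          /*   rotation threshold */
};
```

Keeping the headers apart from the payload is what would later make it possible to filter rows while relaying, as the message notes.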
GeorgyKirichenko pushed a commit that referenced this issue on Aug 13, 2019:
Relay uses both xlog files and WAL memory to feed a replica with transaction data. The new workflow is to read xlog files up to the last one and then switch to WAL memory. If the relay runs out of WAL in-memory data, it returns to file mode. A heartbeat is sent from the WAL thread only in in-memory mode, because in file mode there is always data to send. Closes #3794
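The switching workflow boils down to a loop like the following sketch, where `recover_xlogs()` and `wal_mem_follow()` are hypothetical names for the file and memory modes.

```c
struct vclock; /* tarantool's vector clock, opaque here */

enum wal_mem_rc { WAL_MEM_STOPPED, WAL_MEM_OUT_OF_RANGE };

/* Hypothetical primitives: replay rows from .xlog files and advance
 * `pos` to the vclock reached; follow the in-memory buffer until the
 * relay stops or the cursor falls out of the buffer. */
extern void recover_xlogs(struct vclock *pos);
extern enum wal_mem_rc wal_mem_follow(struct vclock *pos);

static void
relay_loop(struct vclock *pos)
{
	for (;;) {
		/* File mode: replay everything already on disk. */
		recover_xlogs(pos);
		/* Memory mode: follow the buffer from where files ended. */
		if (wal_mem_follow(pos) == WAL_MEM_OUT_OF_RANGE)
			continue; /* buffer rotated past us: back to files */
		break; /* relay stopped normally */
	}
}
```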
GeorgyKirichenko pushed a commit that referenced this issue on Sep 18, 2019:
Don't use the on_close_log trigger to track xlog file boundaries. Since we intend to implement in-memory replication, the relay may perform no xlog file operations at all and can no longer rely on those trigger invocations. Now the consumer state is advanced together with the relay vclock. After the parallel applier is implemented, the relay won't receive an ACK packet for every transaction (the applier groups them), so advancing gc on each relay vclock update should not be too expensive. Prerequisite for: #3794
GeorgyKirichenko pushed a commit that referenced this issue on Sep 18, 2019:
This structure makes it possible to find the xrow buffer row just below a given vclock and then fetch rows one by one, forward to the last appended one. An xrow buffer cursor is essential for from-memory replication, because the relay must be able to fetch all logged rows stored in WAL memory (implemented as an xrow buffer) from a given position and then follow all new changes. Prerequisite for: #3794
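A hedged sketch of how such a cursor would be used by the relay; the function names mirror the description above rather than the actual header.

```c
struct vclock;          /* tarantool's vector clock (opaque here) */
struct xrow_header;     /* decoded row header (opaque here) */
struct xrow_buf_cursor; /* cursor over the in-memory xrow buffer */

/* Position the cursor just below `from`; fails if that position has
 * already been rotated out of the buffer. */
extern int xrow_buf_cursor_create(struct xrow_buf_cursor *cur,
				  const struct vclock *from);
/* Fetch the next row; returns non-zero once past the newest row. */
extern int xrow_buf_cursor_next(struct xrow_buf_cursor *cur,
				struct xrow_header **row);

static int
relay_from_memory(struct xrow_buf_cursor *cur, const struct vclock *from)
{
	if (xrow_buf_cursor_create(cur, from) != 0)
		return -1; /* too old: fall back to reading xlog files */
	struct xrow_header *row;
	while (xrow_buf_cursor_next(cur, &row) == 0)
		; /* send `row` to the replica, then follow new rows */
	return 0;
}
```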
GeorgyKirichenko pushed a commit that referenced this issue on Sep 18, 2019:
A relay tries to create a WAL memory cursor and follow it to relay data to its replica. If the relay fails to attach to the WAL memory buffer, or falls out of the buffer, it recovers xlogs from files and then makes a new attempt to attach. Closes: #3794
GeorgyKirichenko pushed a commit that referenced this issue on Oct 9, 2019:
Don't use the on_close_log trigger to track xlog file boundaries. Since we intend to implement in-memory replication, the relay may perform no xlog file operations at all and can no longer rely on those trigger invocations. Now the consumer state is advanced together with the relay vclock. After the parallel applier is implemented, the relay won't receive an ACK packet for every transaction (the applier groups them), so advancing gc on each relay vclock update should not be too expensive. Note: this changes cluster gc behavior: an instance's gc will now hold only its locally generated transactions. This is also only a temporary solution until relay processing is merged into the WAL writer context, when WAL will handle relay ACK requests as well as log writing and redundancy evaluation. Part of #3794
GeorgyKirichenko pushed a commit that referenced this issue on Oct 9, 2019:
Introduce an xrow buffer to keep encoded xrows in memory after a transaction has finished. WAL uses an xrow buffer object to encode transactions and then writes the encoded data to a log file, so the encoded data stays in memory for some time after the transaction is finished and cleared by an engine. The xrow buffer consists of at most XROW_BUF_CHUNK_COUNT rotating chunks organized in a ring. The rotation thresholds and XROW_BUF_CHUNK_COUNT are empirical values and are hardcoded. Part of #3794
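The ring organization could look roughly like this. XROW_BUF_CHUNK_COUNT comes from the message (its value below is a placeholder); the rest is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

enum { XROW_BUF_CHUNK_COUNT = 16 }; /* placeholder for the hardcoded,
                                       empirically chosen constant */

struct xrow_buf_chunk {
	/* row headers + raw data, as in the earlier buffer sketch */
	size_t used;
};

/*
 * Chunks form a ring: monotonically growing chunk counters map onto
 * the fixed array by modulo, so rotation overwrites the oldest chunk.
 */
struct xrow_buf {
	struct xrow_buf_chunk chunks[XROW_BUF_CHUNK_COUNT];
	uint64_t first; /* counter of the oldest chunk still in memory */
	uint64_t last;  /* counter of the chunk currently written to */
};

static inline struct xrow_buf_chunk *
xrow_buf_chunk(struct xrow_buf *buf, uint64_t index)
{
	return &buf->chunks[index % XROW_BUF_CHUNK_COUNT];
}
```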
GeorgyKirichenko pushed a commit that referenced this issue on Oct 9, 2019:
This structure makes it possible to find the xrow buffer row just below a given vclock and then fetch rows one by one, forward to the last appended one. An xrow buffer cursor is essential for from-memory replication, because the relay must be able to fetch all logged rows stored in WAL memory (implemented as an xrow buffer) from a given position and then follow all new changes. Part of #3794
GeorgyKirichenko pushed a commit that referenced this issue on Oct 9, 2019:
Fetch data from the WAL in-memory buffer. WAL allows starting a fiber which creates an xrow buffer cursor at a given vclock, fetches rows from the xrow buffer one by one, and calls a given callback for each row. The WAL relaying fiber also sends a heartbeat message if all rows have been processed and nothing was written for the replication timeout period. The relay connects to WAL with the replica's known vclock and tries to relay data. In case of an outdated vclock (WAL could not create a cursor or fetch a new row from it), the relay falls back to reading logged data from file, then makes another attempt to connect to WAL with the updated vclock, and so on. In file mode the relay always has data to send to the replica, so it has no duty to send heartbeat messages there - that is done by the WAL relay fiber while it waits for new transactions written by WAL. Closes #3794
GeorgyKirichenko pushed a commit that referenced this issue on Nov 19, 2019:
A matrix clock, which maintains a set of vclocks and the ordering of their components. The main goal is to be able to build a vclock whose every lsn is less than or equal to the corresponding lsn in at least n of the tracked vclocks (an n-majority). For instance, given the vclocks:
1: {1: 10, 2: 12, 3: 0}
2: {1: 10, 2: 14, 3: 1}
3: {1: 0, 2: 8, 3: 4}
majority 1 builds {1: 10, 2: 14, 3: 4}, majority 2 builds {1: 10, 2: 12, 3: 1}, and majority 3 builds {1: 0, 2: 8, 3: 0}. The purpose of the matrix clock is to evaluate a vclock that has already been processed by WAL consumers such as relays, or to obtain a majority vclock for committing journal entries in case of synchronous replication. Part of #980, #3794
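To make the arithmetic concrete: the n-majority vclock takes, per component, the n-th largest value among the tracked vclocks, so at least n of them have reached it. A self-contained sketch - not tarantool's incremental mclock implementation:

```c
#include <stdint.h>
#include <stdlib.h>

enum { VCLOCK_MAX = 32 };

static int
cmp_lsn_desc(const void *a, const void *b)
{
	int64_t l = *(const int64_t *)a, r = *(const int64_t *)b;
	return (l < r) - (l > r); /* sort in descending order */
}

/*
 * Build the n-majority vclock: each component of dst is the n-th
 * largest value of that component across the `count` tracked vclocks,
 * i.e. at least n of them have reached it. With the three vclocks from
 * the message above, n = 2 yields {1: 10, 2: 12, 3: 1}.
 */
static void
mclock_majority(int64_t dst[VCLOCK_MAX],
		int64_t clocks[][VCLOCK_MAX], int count, int n)
{
	for (int id = 0; id < VCLOCK_MAX; id++) {
		int64_t column[count];
		for (int row = 0; row < count; row++)
			column[row] = clocks[row][id];
		qsort(column, count, sizeof(column[0]), cmp_lsn_desc);
		dst[id] = column[n - 1]; /* n-th largest */
	}
}
```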
GeorgyKirichenko pushed a commit that referenced this issue on Nov 20, 2019:
WAL uses a matrix clock (mclock) to store all vclocks acknowledged by relays. This allows WAL to build the minimal vclock and then collect all files that precede it. Box protects logs from collection with the wal_set_first_checkpoint() call, which instructs WAL not to unlink logs after the given vclock. To preserve logs while a replica is joining, gc tracks all join vclocks as checkpoints with a special mark, is_join_readview, set to true. Part of #3794, #980
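In gc terms, the minimal vclock is just the all-relay majority: with the mclock_majority() sketch above, calling it with n equal to the number of tracked vclocks yields the per-component minimum - the boundary every relay has acknowledged, below which xlog files are collectible, except for logs pinned by wal_set_first_checkpoint() or a join readview.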
GeorgyKirichenko pushed a commit that referenced this issue on Mar 4, 2020:
Fetch data from the WAL in-memory buffer. WAL allows starting a fiber which creates an xrow buffer cursor at a given vclock, fetches rows from the xrow buffer one by one, and calls a given callback for each row. The WAL relaying fiber also sends a heartbeat message if all rows have been processed and nothing was written for the replication timeout period. In case of an outdated vclock (WAL could not create a cursor or fetch a new row from it), the relay switches to reading logged data from file up to the current vclock and then makes the next attempt to fetch data from WAL memory. In file mode there is always data to send to the replica, so the relay does not have to send heartbeat messages. From this point on, the relay creates a cord only when it switches to reading from file. Frequent memory-file oscillation is unlikely, for two reasons:
1. If the replica is too slow (slower than the master writes), it will switch to disk and then fall behind.
2. If the replica is fast enough, it will catch up with memory and then consume it before the memory buffer rotates.
To separate WAL and relay logic, a relay filter function was introduced, which is passed when the relay attaches to WAL. Note: WAL exit is not graceful - tx sends a break-loop message and WAL just stops cbus processing without any care for other fibers which could still use cbus. To overcome this, there is a special trigger which is signaled just before the cbus pipe is destroyed. Closes #3794 Part of #980
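The memory-mode part of this workflow, including the heartbeat rule, could be sketched as follows; all names (`wal_relay_next()`, `wal_wait_write()`, the `relay_send_*` helpers) are hypothetical.

```c
#include <stdbool.h>

struct vclock;      /* opaque */
struct xrow_header; /* opaque */

enum relay_rc { RELAY_ROW, RELAY_CAUGHT_UP, RELAY_OUT_OF_BUF };

/* Hypothetical primitives: fetch the next buffered row, wait for a new
 * WAL write for at most the replication timeout (false = timed out),
 * and send a row or a heartbeat to the replica. */
extern enum relay_rc wal_relay_next(struct vclock *pos,
				    struct xrow_header **row);
extern bool wal_wait_write(double replication_timeout);
extern void relay_send_row(const struct xrow_header *row);
extern void relay_send_heartbeat(void);

/*
 * Memory-mode body: stream buffered rows; when fully caught up, send a
 * heartbeat on timeout. Returning hands control back to file mode,
 * where a cord re-reads the xlogs - and where no heartbeats are
 * needed, since there is always data to send.
 */
static void
wal_relay_fiber(struct vclock *pos, double replication_timeout)
{
	struct xrow_header *row;
	for (;;) {
		switch (wal_relay_next(pos, &row)) {
		case RELAY_ROW:
			relay_send_row(row);
			break;
		case RELAY_CAUGHT_UP:
			/* Nothing new: heartbeat if the wait times out. */
			if (!wal_wait_write(replication_timeout))
				relay_send_heartbeat();
			break;
		case RELAY_OUT_OF_BUF:
			/* Buffer rotated past us: fall back to reading
			 * xlog files, then re-attach with the updated
			 * vclock. */
			return;
		}
	}
}
```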
sergepetrenko pushed a commit that referenced this issue on Mar 13, 2020:
A matrix clock, which maintains a set of vclocks and the ordering of their components. The main goal is to be able to build a vclock whose every lsn is less than or equal to the corresponding lsn in at least n of the tracked vclocks. The purpose of the matrix clock is to evaluate a vclock that has already been processed by WAL consumers such as relays, or to obtain a majority vclock for committing journal entries in case of synchronous replication. @sergepetrenko: refactoring & rewrote comments in doxygen style. Part of #980, #3794 Prerequisite for #4114
sergepetrenko pushed a commit that referenced this issue on Mar 13, 2020:
WAL uses a matrix clock (mclock) to track the vclocks reported by relays. This allows WAL to build the minimal boundary vclock, which is used to collect unneeded files. Box protects logs from collection using the wal_set_first_checkpoint() call. To preserve logs while a replica is joining, gc tracks all join readview vclocks as checkpoints with a special mark, is_join_readview, set to true. Also, there is no longer a gc consumer in the tx thread or gc consumer info in box.info output, and the corresponding lines were commented out of the test output. @sergepetrenko: reworded some comments and did a bit of refactoring. Part of #3794, #980 Prerequisite for #4114
cyrillos added a commit that referenced this issue on Apr 27, 2020:
The class is not needed here; we are moving from C++ to plain C. This is a first step toward making the recovery code fully C compliant. Part of #3794 Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
cyrillos added a commit that referenced this issue on Apr 28, 2020:
We're now ready to compile the recovery code in plain C mode. Part of #3794 Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020:
Introduce a routine which transfers journal entries from an input queue to an output queue while writing them to an xlog file. On xlog output the routine breaks the transfer loop and returns the write result code. After this, the output queue contains the entries that were handed to the xlog (regardless of the disk write status), whereas the input queue contains the untouched entries. If the input queue is processed without an actual xlog write, the xlog file is flushed manually. This refactoring helps to implement the WAL memory buffer; a sketch follows below. Part of #980, #3794
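A simplified sketch of such a transfer routine, using a plain singly linked queue instead of tarantool's stailq; `xlog_write_entry()` and `xlog_flush()` are hypothetical stand-ins for the xlog API (here, a positive return means the xlog flushed its buffer to disk).

```c
#include <stddef.h>

/* Hypothetical minimal journal entry and queue types. */
struct journal_entry {
	struct journal_entry *next;
	/* ... encoded rows ... */
};

struct entry_queue {
	struct journal_entry *first, *last;
};

/* Stand-ins: positive return = xlog flushed its buffer to disk,
 * negative = write error, zero = buffered only. */
extern int xlog_write_entry(struct journal_entry *e);
extern int xlog_flush(void);

/*
 * Transfer entries from `in` to `out`, appending each to the xlog.
 * When the xlog reports output (or an error), the loop breaks and the
 * result is returned: `out` then holds every entry handed to the xlog
 * regardless of the disk status, while `in` keeps the untouched tail.
 * If the whole input is consumed without output, flush manually.
 */
static int
journal_queue_transfer(struct entry_queue *in, struct entry_queue *out)
{
	while (in->first != NULL) {
		struct journal_entry *e = in->first;
		in->first = e->next;
		if (in->first == NULL)
			in->last = NULL;
		e->next = NULL;
		if (out->first == NULL)
			out->first = e;
		else
			out->last->next = e;
		out->last = e;
		int rc = xlog_write_entry(e);
		if (rc != 0)
			return rc; /* disk output happened or failed */
	}
	return xlog_flush(); /* no automatic output: flush manually */
}
```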