relay: in-memory relay directly from WAL thread #3794

Closed
kostja opened this issue Nov 8, 2018 · 0 comments
Labels
feature (A new functionality), replication

Comments

@kostja
Contributor

kostja commented Nov 8, 2018

Currently, Tarantool first writes rows to the local .xlog file, then reads and unpacks this file in a separate relay thread to send them to the replica.
We can speed this up quite a bit by relaying the rows to all subscribed replicas directly from the WAL thread using non-blocking I/O. When a replica starts to lag too far behind, we could spawn a dedicated relay thread for it, just like we do today.
We will need to centralize transaction validation when we add synchronous replication, and this is a prerequisite for that step: the leader's WAL thread will be able to validate all changes from all replicas in the replica set.
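To make the proposed flow concrete, here is a minimal sketch, not Tarantool code: relay_slot, MAX_PENDING and spawn_file_relay are invented names. It shows how the WAL thread could fan freshly written rows out over non-blocking sockets and demote a lagging replica to today's file-reading relay thread.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical per-replica slot kept by the WAL thread. */
struct relay_slot {
	int fd;			/* non-blocking socket to the replica */
	size_t pending;		/* bytes not yet delivered (would shrink on ACKs) */
	bool in_file_mode;	/* demoted to a dedicated file-reading relay */
};

enum { MAX_PENDING = 16 * 1024 * 1024 };	/* illustrative lag threshold */

/* Start the classic relay thread that reads .xlog files (provided elsewhere). */
void spawn_file_relay(struct relay_slot *slot);

/* Called right after a WAL write: fan the encoded rows out to replicas. */
void
wal_relay_broadcast(struct relay_slot *slots, int n_slots,
		    const char *rows, size_t size)
{
	for (int i = 0; i < n_slots; i++) {
		struct relay_slot *s = &slots[i];
		if (s->in_file_mode)
			continue;	/* a separate relay thread serves it */
		ssize_t sent = send(s->fd, rows, size, MSG_DONTWAIT);
		if (sent < 0)
			sent = 0;	/* EAGAIN or error: count everything as lag */
		s->pending += size - (size_t)sent;
		if (s->pending > MAX_PENDING) {
			/* The replica lags too far behind: hand it over to
			 * its own relay thread reading .xlog, as today. */
			s->in_file_mode = true;
			spawn_file_relay(s);
		}
	}
}
```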

@kostja kostja added replication feature A new functionality labels Nov 8, 2018
@kyukhin kyukhin added this to the 2.2.0 milestone Dec 7, 2018
GeorgyKirichenko pushed a commit that referenced this issue Jan 21, 2019
Use the WAL thread to relay rows. The relay writer and reader fibers live
in the WAL thread. The writer fiber uses WAL memory to send rows to the peer
when possible, or spawns cords to recover rows from a file.

The WAL writer stores rows in two memory buffers, swapped when a buffer
threshold is reached. Memory buffers are split into chunks holding rows with
the same server id, bounded by a chunk threshold. To search for a position
in WAL memory, all chunks are indexed by the wal_mem_index array. Each
wal_mem_index entry contains the corresponding replica_id and the vclock
just before the first row of the chunk, as well as the chunk's buffer
number, position, and size in memory.

Closes: #3794
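For illustration only, a sketch of what one entry of that index could look like; the fields follow the list above (replica_id, the vclock before the chunk's first row, buffer number, position, size), while the concrete types and the VCLOCK_MAX constant are assumptions.

```c
#include <stdint.h>
#include <stddef.h>

enum { VCLOCK_MAX = 32 };	/* max replica ids, an assumed limit */

/* Simplified vclock: one LSN per replica id. */
struct vclock {
	int64_t lsn[VCLOCK_MAX];
};

/*
 * One entry of the wal_mem_index array.  A chunk groups consecutive rows
 * with the same server id; the entry records which of the two in-memory
 * buffers holds the chunk and where, plus the vclock just before the
 * chunk's first row, so a relay can locate the position matching a
 * subscriber's vclock.
 */
struct wal_mem_index {
	uint32_t replica_id;	/* server id of the rows in the chunk */
	struct vclock vclock;	/* vclock just before the chunk's first row */
	uint8_t buf_no;		/* which of the two memory buffers (0 or 1) */
	size_t pos;		/* chunk offset inside that buffer */
	size_t size;		/* chunk size in bytes */
};
```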
GeorgyKirichenko pushed a commit that referenced this issue Feb 27, 2019
Encode transactions into a memory buffer and then write the encoded data
from the buffer to the xlog file. WAL memory is organized as a number of
rotating memory buffers, each containing an array of row headers and a raw
data storage. More than two buffers are used in order to increase average
buffer utilization. Rows are kept this way to make it possible to filter
data while relaying in the future.

Need for: #3794
@kyukhin kyukhin added the prio1 label Apr 8, 2019
@kyukhin kyukhin modified the milestones: 2.2.0, 2.3.0 Jul 29, 2019
GeorgyKirichenko pushed a commit that referenced this issue Aug 13, 2019
The relay uses both xlog files and WAL memory to feed a replica with
transaction data. The new workflow is to read xlog files up to the last
one and then switch to WAL memory. If the relay runs out of WAL in-memory
data, it returns to file mode. A heartbeat is sent from the WAL thread
only while in in-memory mode, because in file mode there is always data
to send.

Closes #3794
GeorgyKirichenko pushed a commit that referenced this issue Sep 18, 2019
Don't use the on_close_log trigger to track xlog file boundaries. Since we
intend to implement in-memory replication, the relay may perform no xlog
file operations at all and thus cannot rely on that trigger being invoked.
Now the consumer state is advanced together with the relay vclock. After
the parallel applier implementation the relay will not receive an ACK
packet for each transaction (because the applier groups them), so it
should not be too expensive to advance gc on each relay vclock update.

Prerequisites: #3794
GeorgyKirichenko pushed a commit that referenced this issue Sep 18, 2019
This structure makes it possible to find an xrow buffer row positioned
before a given vclock and then fetch rows one by one, forward to the last
appended one. An xrow buffer cursor is essential for from-memory
replication: the relay must be able to fetch all logged rows stored in
WAL memory (implemented as the xrow buffer) from a given position and
then follow all new changes.

Prerequisites: #3794
GeorgyKirichenko pushed a commit that referenced this issue Sep 18, 2019
The relay tries to create a WAL memory cursor and follow it to relay data
to its replica. If the relay fails to attach to the WAL memory buffer, or
falls out of the buffer, it recovers xlogs from files and then makes a new
attempt to attach.

Closes: #3794
GeorgyKirichenko pushed a commit that referenced this issue Oct 9, 2019
Don't use the on_close_log trigger to track xlog file boundaries. Since we
intend to implement in-memory replication, the relay may perform no xlog
file operations at all and thus cannot rely on that trigger being invoked.
Now the consumer state is advanced together with the relay vclock. After
the parallel applier implementation the relay will not receive an ACK
packet for each transaction (because the applier groups them), so it
should not be too expensive to advance gc on each relay vclock update.

Note: this changes cluster gc behavior - an instance's gc will hold only
its locally generated transactions. It is also only a temporary solution
until relay processing is merged into the WAL writer context, where WAL
will process relay ACK requests as well as log writing and redundancy
evaluation.

Part of #3794
GeorgyKirichenko pushed a commit that referenced this issue Oct 9, 2019
Introduce an xrow buffer to store encoded xrows in memory after a
transaction has finished. WAL uses an xrow buffer object to encode
transactions and then writes the encoded data to a log file, so the
encoded data stays in memory for some time after the transaction is
finished and cleared by the engine.
The xrow buffer consists of at most XROW_BUF_CHUNK_COUNT rotating chunks
organized in a ring. The rotation thresholds and XROW_BUF_CHUNK_COUNT are
empirical values and are hardcoded.

Part of #3794
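As a rough mental model of that ring: XROW_BUF_CHUNK_COUNT is taken from the message above, while the field names, the chunk threshold value, and the rotation helper are illustrative guesses, not the actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

enum {
	XROW_BUF_CHUNK_COUNT = 8,		/* assumed ring size */
	XROW_BUF_CHUNK_THRESHOLD = 1 << 20,	/* assumed rotation limit */
};

/* One stored row: a header describing where its encoded body lives. */
struct xrow_rec {
	uint32_t replica_id;
	int64_t lsn;
	size_t offset;		/* offset of the encoded row in chunk data */
	size_t size;		/* encoded size in bytes */
};

/* One chunk: an array of row headers plus the raw encoded data. */
struct xrow_buf_chunk {
	struct xrow_rec *rows;
	size_t row_count;
	char *data;
	size_t data_size;
};

/* The buffer: a ring of chunks; the newest chunk receives new rows. */
struct xrow_buf {
	struct xrow_buf_chunk chunk[XROW_BUF_CHUNK_COUNT];
	size_t first;		/* oldest chunk still kept in memory */
	size_t last;		/* chunk currently being written */
};

/* Rotate to the next chunk once the current one crosses the threshold,
 * dropping the oldest chunk when the ring is full. */
void
xrow_buf_rotate(struct xrow_buf *buf)
{
	buf->last = (buf->last + 1) % XROW_BUF_CHUNK_COUNT;
	if (buf->last == buf->first)
		buf->first = (buf->first + 1) % XROW_BUF_CHUNK_COUNT;
	buf->chunk[buf->last].row_count = 0;
	buf->chunk[buf->last].data_size = 0;
}
```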
GeorgyKirichenko pushed a commit that referenced this issue Oct 9, 2019
This structure makes it possible to find an xrow buffer row positioned
before a given vclock and then fetch rows one by one, forward to the last
appended one. An xrow buffer cursor is essential for from-memory
replication: the relay must be able to fetch all logged rows stored in
WAL memory (implemented as the xrow buffer) from a given position and
then follow all new changes.

Part of #3794
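A self-contained toy model of the cursor idea, with all names invented: the real cursor works over the chunked xrow buffer, while here a flat row array stands in for it. The point is the pattern: position on the first row the replica has not confirmed, then keep fetching forward.

```c
#include <stddef.h>
#include <stdint.h>

enum { VCLOCK_MAX = 32 };

struct vclock {
	int64_t lsn[VCLOCK_MAX];
};

struct mem_row {
	uint32_t replica_id;
	int64_t lsn;
	/* the encoded row body would follow in the real buffer */
};

/* A cursor over an in-memory row array standing in for the xrow buffer. */
struct mem_cursor {
	const struct mem_row *rows;
	size_t row_count;
	size_t pos;
};

/*
 * Position the cursor on the first row not yet covered by the replica's
 * vclock.  The real cursor creation can fail when the wanted position
 * has already been rotated out of memory; the relay then falls back to
 * reading .xlog files.
 */
void
mem_cursor_create(struct mem_cursor *cur, const struct mem_row *rows,
		  size_t row_count, const struct vclock *replica_clock)
{
	cur->rows = rows;
	cur->row_count = row_count;
	for (cur->pos = 0; cur->pos < row_count; cur->pos++) {
		const struct mem_row *row = &rows[cur->pos];
		if (row->lsn > replica_clock->lsn[row->replica_id])
			break;
	}
}

/* Fetch the next row to relay, or NULL once the cursor has caught up. */
const struct mem_row *
mem_cursor_next(struct mem_cursor *cur)
{
	if (cur->pos >= cur->row_count)
		return NULL;
	return &cur->rows[cur->pos++];
}
```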
GeorgyKirichenko pushed a commit that referenced this issue Oct 9, 2019
Fetch data from the WAL in-memory buffer. WAL allows starting a fiber
which creates an xrow buffer cursor at a given vclock, then fetches rows
from the xrow buffer one by one and calls a given callback for each row.
The WAL relaying fiber also sends a heartbeat message if all rows were
processed and no rows were written for the replication timeout period.
The relay connects to WAL with the vclock known for the replica and tries
to relay data. In case of an outdated vclock (WAL could not create a
cursor or fetch a new row from it), the relay falls back to reading logged
data from file and then makes another attempt to connect to WAL with the
updated vclock, and so on.
In file mode the relay already has data to send to the replica, so from
now on the relay has no duty to send heartbeat messages - that is done by
the WAL relay fiber while it waits for new transactions written by WAL.

Closes #3794
GeorgyKirichenko pushed a commit that referenced this issue Nov 19, 2019
This refactoring helps to implement wal memory buffer.

Part of #980, #3794
GeorgyKirichenko pushed a commit that referenced this issue Nov 19, 2019
A matrix clock allows maintaining a set of vclocks and the ordering of
their components. The main goal is to be able to build a vclock in which
each lsn is greater than or equal to the corresponding lsn in at least
n vclocks of the matrix clock (an n-majority).

For instance, if we have vclocks:
 1: {1: 10, 2: 12, 3: 0}
 2: {1: 10, 2: 14, 3: 1}
 3: {1: 0, 2: 8, 3: 4}
then majority 1 builds {1: 10, 2: 14, 3: 4},
whereas majority 2 builds {1: 10, 2: 12, 3: 1},
and majority 3 builds {1: 0, 2: 8, 3: 0}.

The purpose of the matrix clock is to evaluate a vclock which has already
been processed by WAL consumers such as relays, or to obtain a majority
vclock for committing journal entries in the case of synchronous
replication.

Part of #980, #3794
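The n-majority from that example can be reproduced by taking, for every vclock component, the n-th largest value across the tracked vclocks, so each component of the result is reached by at least n vclocks. A small standalone sketch (not the Tarantool mclock API) that prints exactly the three majorities above:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

enum { NODES = 3, COMPONENTS = 4 };	/* component 0 unused, ids 1..3 */

static int cmp_desc(const void *a, const void *b)
{
	int64_t x = *(const int64_t *)a, y = *(const int64_t *)b;
	return (x < y) - (x > y);	/* sort in descending order */
}

/* For every component take the n-th largest value across the vclocks. */
static void
mclock_majority(int64_t clocks[NODES][COMPONENTS], int n, int64_t *out)
{
	for (int c = 1; c < COMPONENTS; c++) {
		int64_t column[NODES];
		for (int i = 0; i < NODES; i++)
			column[i] = clocks[i][c];
		qsort(column, NODES, sizeof(column[0]), cmp_desc);
		out[c] = column[n - 1];
	}
}

int main(void)
{
	/* The vclocks from the commit message. */
	int64_t clocks[NODES][COMPONENTS] = {
		{0, 10, 12, 0},
		{0, 10, 14, 1},
		{0,  0,  8, 4},
	};
	for (int n = 1; n <= NODES; n++) {
		int64_t out[COMPONENTS] = {0};
		mclock_majority(clocks, n, out);
		printf("majority %d: {1: %lld, 2: %lld, 3: %lld}\n",
		       n, (long long)out[1], (long long)out[2],
		       (long long)out[3]);
	}
	return 0;
}
```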
GeorgyKirichenko pushed a commit that referenced this issue Nov 20, 2019
WAL uses a matrix clock (mclock) to store all vclocks acknowledged by
relays. This allows WAL to build the minimal vclock and then collect all
files that precede it. Box protects logs from collection with the
wal_set_first_checkpoint call, which instructs WAL not to unlink logs
after the given vclock.
To preserve logs while a replica is joining, gc tracks all join vclocks
as checkpoints with a special mark - is_join_readview set to true.

Part of #3794, #980
GeorgyKirichenko pushed a commit that referenced this issue Nov 20, 2019
WAL uses an xrow buffer object to encode transactions and then writes the
encoded data to a log file, so the encoded data stays in memory for some
time after a transaction is finished.

Part of #3794, #980
@kyukhin kyukhin added this to the 2.4.1 milestone Feb 12, 2020
GeorgyKirichenko pushed a commit that referenced this issue Mar 4, 2020
Fetch data from the WAL in-memory buffer. WAL allows starting a fiber
which creates an xrow buffer cursor at a given vclock, then fetches rows
from the xrow buffer one by one and calls a given callback for each row.
The WAL relaying fiber also sends a heartbeat message if all rows were
processed and no rows were written for the replication timeout period.
In case of an outdated vclock (WAL could not create a cursor or fetch a
new row from it), the relay switches to reading logged data from file up
to the current vclock and then makes the next attempt to fetch data from
WAL memory (see the sketch below).
In file mode there is always data to send to the replica, so the relay
does not have to send heartbeat messages.
From this point on, the relay creates a cord only when it switches to
reading from file. Frequent memory-file oscillation is not very likely,
for two reasons:
 1. If the replica is too slow (slower than the master writes), it will
    switch to disk and then fall behind.
 2. If the replica is fast enough, it will catch up with the memory
    buffer and then consume it before it is rotated.
To split the WAL and relay logic, a relay filter function was introduced
which should be passed when the relay attaches to WAL.

Note: WAL exit is not graceful - tx sends a break-loop message and WAL
just stops cbus processing without any care for other fibers which could
still be using cbus. To overcome this there is a special trigger which is
signaled just before the cbus pipe is destroyed.

Close #3794
Part of #980
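The resulting control flow, reduced to a skeleton; wal_relay_from_memory, relay_recover_from_files and relay_is_stopped are made-up names standing for the behavior described in the message, not real functions.

```c
#include <stdbool.h>

struct vclock;	/* the replica's current replication position */

/* Follow the WAL in-memory buffer, sending rows and heartbeats to the
 * replica.  Returns when the wanted position is no longer available in
 * the buffer (the replica's vclock is outdated) or the relay stops. */
void wal_relay_from_memory(struct vclock *replica_clock);

/* Catch up from .xlog files in a separate cord up to the current vclock;
 * there is always data to send here, so no heartbeats are needed. */
void relay_recover_from_files(struct vclock *replica_clock);

bool relay_is_stopped(void);

/* Alternate between the in-memory and the file modes until stopped. */
void
relay_subscribe_loop(struct vclock *replica_clock)
{
	while (!relay_is_stopped()) {
		wal_relay_from_memory(replica_clock);
		if (relay_is_stopped())
			break;
		/* Outdated vclock: read the missing range from files,
		 * then retry the in-memory mode with the updated vclock. */
		relay_recover_from_files(replica_clock);
	}
}
```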
sergepetrenko pushed a commit that referenced this issue Mar 13, 2020
A matrix clock allows maintaining a set of vclocks and the ordering of
their components. The main goal is to be able to build a vclock in which
each lsn is less than or equal to the corresponding lsn in at least
n vclocks of the matrix clock.

The purpose of the matrix clock is to evaluate a vclock which has already
been processed by WAL consumers such as relays, or to obtain a majority
vclock for committing journal entries in the case of synchronous
replication.

@sergepetrenko: refactoring & rewrite comments to doxygen style.

Part of #980, #3794
Prerequisite #4114
sergepetrenko pushed a commit that referenced this issue Mar 13, 2020
WAL uses a matrix clock (mclock) to track vclocks reported by relays.
This allows WAL to build the minimal boundary vclock, which is used to
collect unneeded files. Box protects logs from collection using the
wal_set_first_checkpoint() call.
To preserve logs while a replica is joining, gc tracks all join readview
vclocks as checkpoints with a special mark - is_join_readview set to
true.

Also, there is no longer a gc consumer in the tx thread; gc consumer info
was removed from box.info output and the corresponding lines were
commented out in the tests.

@sergepetrenko: reword some comments and do a bit of refactoring.

Part of #3794, #980
Prerequisite #4114
@kyukhin kyukhin modified the milestones: 2.4.1, 2.5.1 Mar 26, 2020
cyrillos pushed a commit that referenced this issue Apr 27, 2020
recovery_stop_local raises an exception in case of a recovery error, so
it is not safe to stop recovery from the recovery delete routine or from
the guard inside local_recovery. So call recovery_stop_local manually.

Part of #980, #3794
cyrillos added a commit that referenced this issue Apr 27, 2020
The class is not needed here; we are moving from C++ to C. This is a
first step toward making the recovery code fully C compliant.

Part of #3794

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
cyrillos added a commit that referenced this issue Apr 28, 2020
We're ready now to compile the recovery
code in plain C mode.

Part of #3794

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Introduce a routine which transfers journal entries from an input queue
to an output queue, writing them to an xlog file. On an xlog write the
routine breaks the transfer loop and returns the write result code. After
that the output queue contains the entries which were handed to the xlog
(regardless of the disk write status), whereas the input queue contains
untouched entries. If the input queue is processed without an actual xlog
write, the xlog file is flushed manually.
This refactoring helps to implement the WAL memory buffer.

Part of #980, #3794
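A schematic version of such a routine over a plain singly-linked queue; only the control flow (encode and move entries, stop after a real disk write, flush when none happened) mirrors the description, and the types and helper names are invented.

```c
#include <stdbool.h>
#include <stddef.h>

struct journal_entry {
	struct journal_entry *next;
	/* encoded rows would live here */
};

struct entry_queue {
	struct journal_entry *first, *last;
};

/* Encode one entry into the xlog write buffer.  Returns true when the
 * buffer was actually written to disk as a side effect (stub). */
bool xlog_write_entry(struct journal_entry *entry, int *write_rc);
int xlog_flush(void);

static struct journal_entry *
queue_pop(struct entry_queue *q)
{
	struct journal_entry *e = q->first;
	if (e != NULL) {
		q->first = e->next;
		if (q->first == NULL)
			q->last = NULL;
		e->next = NULL;
	}
	return e;
}

static void
queue_push(struct entry_queue *q, struct journal_entry *e)
{
	if (q->last != NULL)
		q->last->next = e;
	else
		q->first = e;
	q->last = e;
}

/*
 * Move entries from the input to the output queue, encoding them into
 * the xlog.  Stop as soon as a real disk write happened and return its
 * result; entries already handed to the xlog stay in the output queue
 * regardless of the write status.  If the input queue drains without a
 * write, flush the xlog manually.
 */
int
wal_transfer_entries(struct entry_queue *in, struct entry_queue *out)
{
	int rc = 0;
	struct journal_entry *e;
	while ((e = queue_pop(in)) != NULL) {
		bool wrote = xlog_write_entry(e, &rc);
		queue_push(out, e);
		if (wrote)
			return rc;
	}
	return xlog_flush();
}
```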
@kyukhin kyukhin modified the milestones: 2.5.1, 2.6.1 Jun 10, 2020
@kyukhin kyukhin modified the milestones: 2.6.1, wishlist Oct 23, 2020
@kyukhin kyukhin removed the prio1 label Nov 18, 2020
@kyukhin kyukhin removed this from the wishlist milestone Dec 25, 2020
@kyukhin kyukhin closed this as completed Dec 25, 2020