Synchronous replication #980

Open
3 of 17 tasks
kostja opened this issue Aug 13, 2015 · 1 comment
Labels: feature (A new functionality), replication

Comments

kostja commented Aug 13, 2015

I am updating this ticket to list the RAFT implementation plan as I see it. It turns out the plan currently exists nowhere but in my head.

  • in-memory WAL. We need to run the RAFT state machine strictly in the WAL thread in order to maximally condense and simplify the implementation. There are in fact 3 state machines: append entries, recovery and leader election, so append entries and leader election have to reside in the WAL thread.
  • use SWIM for replication failure detection. Right now each node runs a heartbeat protocol over the relay/applier channels, but this heartbeat cannot be directly applied to RAFT, since it goes from the follower to the leader, while RAFT heartbeats go from the leader to the follower. Since the replication factor is per space, we will in fact have multiple RAFT instances running inside each Tarantool instance, and should use SWIM heartbeats for all of them.
  • automatically build a full-mesh topology, using SWIM for topology discovery. RAFT cannot work in non-full-mesh topologies, and the best bet is to assemble the full mesh automatically from all partial topologies.
  • implement replication filters, i.e. make sure a node only sends its own changes to followers. Now that we have a replication filter id in the protocol this is easy, and once all topologies are automatically upgraded to the full mesh it will become mandatory, since otherwise replication would stream a lot of unnecessary changes when the replication factor is large.
  • implement a durable commit log, also known as semi-sync mode: the WAL thread doesn't ack back to the TX thread before obtaining a majority of responses from replicas. This is what some people call qsync replication (see the sketch after this list).
  • make sure all system tables use semi-sync mode, to ensure replication topology changes do not break consistency. At this step also introduce the concept of groups and group properties. Each space from now on will belong to a named group, and group attributes include features like sync, async, and the node uuid list for the members of the group.
  • add a system table for the RAFT state and implement the RAFT leader election machinery.
  • implement a proxy module. With RAFT, we need to forward writes to the leader transparently to the user, and this is the task of the new listen implementation in the proxy.
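
A minimal, hypothetical sketch of the semi-sync ("qsync") ack rule from the durable commit log item above: the WAL thread counts replica acknowledgements for an entry's LSN and reports success to the TX thread only once a strict majority of the cluster has confirmed the write. The struct and function names are illustrative, not Tarantool's actual API.

```c
#include <stdbool.h>
#include <stdint.h>

/* One entry per remote replica. */
struct replica_ack {
    uint32_t replica_id;
    int64_t  acked_lsn;      /* last LSN confirmed by this replica */
};

/* Return true once a strict majority of the cluster (n_replicas remote
 * nodes plus the local one) has confirmed the entry with the given LSN,
 * so the WAL thread may finally ack the TX thread. */
static bool
wal_quorum_reached(const struct replica_ack *acks, int n_replicas,
                   int64_t entry_lsn)
{
    int confirmed = 1;                       /* the local WAL write counts */
    for (int i = 0; i < n_replicas; i++) {
        if (acks[i].acked_lsn >= entry_lsn)
            confirmed++;
    }
    return confirmed > (n_replicas + 1) / 2; /* strict majority */
}
```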

QA

  • basic functional tests - @rtsisyk (some tests already exist on bsync branch)
  • functional tests for slave rollback
  • functional tests for master rollback
  • fix the test system to allow writing cluster tests (Cluster testing, test-run#11)
  • error injection tests
  • functional tests and fixes for master rollback on conflict
  • functional tests for slave reconnect after a timeout
  • functional tests for re-election on master shutdown
  • functional tests for 3-2-1-0-1-2-3 restart of the cluster
@kostja kostja added this to the 1.6.6 milestone Aug 13, 2015
@kostja kostja modified the milestones: 1.6.7, 1.6.6, 1.7, 1.7.0 Aug 25, 2015
@kostja kostja added the feature A new functionality label Aug 31, 2015
@rtsisyk rtsisyk assigned rtsisyk and unassigned anton-barabanov Nov 13, 2015
@kostja kostja mentioned this issue Feb 18, 2016
@kostja kostja changed the title from "synchronous replication todo" to "bsync" Feb 18, 2016
@rtsisyk rtsisyk modified the milestones: 1.8.0, 1.7.0 Jun 7, 2016
@kostja kostja modified the milestones: 1.8.0, 1.8.1 Mar 31, 2017
@kostja kostja changed the title from "bsync" to "Synchronous replication" Aug 11, 2017
@tarantool tarantool deleted a comment from rtsisyk Aug 17, 2017
@kostja kostja modified the milestones: 1.8.3, 1.8.4, 1.8.5 Oct 27, 2017
@Gerold103 Gerold103 self-assigned this Feb 9, 2018
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Do not start a transaction for each local journal or final join row
but follow transaction boundaries instead.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
In the case of synchronous replication there are two instance
vclocks: the last written one and the last committed one.
Even though they are currently always equal, track them separately
as part of the preparation for synchronous replication.

Part of #980
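
A rough illustration (not the real data structures) of keeping the two instance vclocks mentioned in this commit: the written clock advances on every successful local WAL write, while the committed clock advances only once the entry is confirmed.

```c
#include <stdint.h>

#define VCLOCK_MAX 32

struct vclock {
    int64_t lsn[VCLOCK_MAX];
};

/* Hypothetical container for the two per-instance clocks. */
struct instance_clocks {
    struct vclock written;    /* advanced as soon as the WAL write succeeds */
    struct vclock committed;  /* advanced only once the entry is confirmed */
};

/* On a successful local WAL write only the "written" clock moves.
 * replica_id is assumed to be < VCLOCK_MAX. */
static void
clocks_on_write(struct instance_clocks *c, uint32_t replica_id, int64_t lsn)
{
    c->written.lsn[replica_id] = lsn;
}

/* On commit confirmation the "committed" clock catches up; without
 * synchronous replication the two clocks always stay equal. */
static void
clocks_on_commit(struct instance_clocks *c, uint32_t replica_id, int64_t lsn)
{
    c->committed.lsn[replica_id] = lsn;
}
```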
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Implement a cord region cache which allows getting a dedicated
memory region and putting it back. This is a more general
approach than the one the txn cache uses.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Since a cord has a region cache, there is no point in implementing
the same feature for transactions only.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
The WAL does not take any responsibility for transaction
commit or rollback. The only thing the wal module does is
write journal entries and report the write status.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
In the case of synchronous replication a transaction could be
written to the WAL but rolled back on conflict. So it is not
enough to check that the WAL was synced during checkpointing
or initial join feeding. To handle this the txn engine emits a
zero-rows transaction which is passed through the WAL and then
committed in the same queue as standard transactions. So a
successful txn_sync means that all previous transactions
were committed, not just written.

Part of #980
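
A rough sketch of the zero-rows barrier described above, with hypothetical helper names: txn_sync() submits an empty entry through the same commit queue and waits for it, so FIFO ordering guarantees that every earlier transaction was committed, not merely written.

```c
#include <stdbool.h>
#include <stddef.h>

struct journal_entry {
    size_t n_rows;        /* 0 for the barrier entry */
    bool   is_committed;
};

/* Assumed helpers: push an entry into the WAL commit queue and block the
 * caller until the queue reports it committed. */
extern void commit_queue_push(struct journal_entry *entry);
extern void commit_queue_wait(struct journal_entry *entry);

/* txn_sync(): returns only when every transaction submitted before the
 * barrier has been committed, not merely written to the WAL. */
static void
txn_sync(void)
{
    struct journal_entry barrier = { .n_rows = 0, .is_committed = false };
    commit_queue_push(&barrier);  /* goes through the same queue as writes */
    commit_queue_wait(&barrier);  /* FIFO: all earlier entries commit first */
}
```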
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
The WAL uses a matrix clock (mclock) to track the vclocks reported
by the relays. This allows the WAL to build the minimal boundary
vclock which can be used to collect unused WAL files. Box protects
logs from collection using the wal_set_first_checkpoint call.
In order to preserve logs while joining, gc tracks all join vclocks
as checkpoints with a special mark - is_join_readview set to true.

Also, there is no longer a gc consumer in the tx thread or gc consumer
info in the box.info output, and the corresponding lines were
commented out of the test output.

Part of #3795, #980
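
An illustrative sketch of the matrix clock idea, with made-up names: one vclock row per replica, and the garbage-collection boundary is the component-wise minimum over all rows; WAL files lying entirely below that boundary are not needed by any replica.

```c
#include <stdint.h>

#define VCLOCK_MAX  32
#define REPLICA_MAX 32

struct vclock {
    int64_t lsn[VCLOCK_MAX];
};

/* One vclock row per tracked replica, as last reported by its relay. */
struct mclock {
    int nrows;
    struct vclock row[REPLICA_MAX];
};

/* Build the boundary vclock: WAL files entirely below it may be
 * garbage-collected without breaking any replica. */
static void
mclock_boundary(const struct mclock *m, struct vclock *out)
{
    for (int c = 0; c < VCLOCK_MAX; c++) {
        int64_t min = INT64_MAX;
        for (int r = 0; r < m->nrows; r++) {
            if (m->row[r].lsn[c] < min)
                min = m->row[r].lsn[c];
        }
        out->lsn[c] = m->nrows > 0 ? min : 0;
    }
}
```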
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
* xrow buffer structure
  Introduce an xrow buffer which stores encoded xrows in memory
  after a transaction is finished. The WAL uses an xrow buffer
  object to encode transactions and then writes the encoded data
  to a log file. An xrow buffer consists of not more than
  XROW_BUF_CHUNK_COUNT rotating chunks organized in a ring. The
  rotation thresholds and XROW_BUF_CHUNK_COUNT are empiric values
  for now.

* xrow buffer cursor
  This structure allows finding an xrow buffer row with a vclock
  less than a given one and then fetching rows one by one, forward
  to the last appended row. An xrow buffer cursor is essential for
  from-memory replication and will be used by a relay to fetch all
  logged rows stored in the WAL memory (implemented as an xrow
  buffer) from a given position and then follow all new changes.

Part of #3974, #980
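
A sketch of the rotating chunk ring described above (all names, including the XROW_BUF_CHUNK_COUNT value, are illustrative): a fixed number of chunks is reused in a circle, and on rotation the oldest chunk is recycled, making its rows unavailable to in-memory readers.

```c
#include <stddef.h>
#include <stdint.h>

#define XROW_BUF_CHUNK_COUNT 8   /* illustrative, not the real value */

struct xrow_buf_chunk {
    char   *data;        /* encoded rows */
    size_t  used;
    int64_t first_lsn;   /* lowest LSN stored in this chunk, -1 if empty */
};

struct xrow_buf {
    struct xrow_buf_chunk chunk[XROW_BUF_CHUNK_COUNT];
    uint64_t first;      /* ever-growing index of the oldest kept chunk */
    uint64_t last;       /* ever-growing index of the newest chunk */
};

/* Rotate: start a new chunk; when the ring is full the oldest chunk is
 * recycled and its rows become unreadable for in-memory cursors. */
static struct xrow_buf_chunk *
xrow_buf_rotate(struct xrow_buf *buf)
{
    buf->last++;
    if (buf->last - buf->first >= XROW_BUF_CHUNK_COUNT)
        buf->first++;    /* drop the oldest chunk */
    struct xrow_buf_chunk *chunk =
        &buf->chunk[buf->last % XROW_BUF_CHUNK_COUNT];
    chunk->used = 0;
    chunk->first_lsn = -1;
    return chunk;
}
```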
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
The WAL uses an xrow buffer object to encode transactions and then
writes the encoded data to a log file, so the encoded data still
lives in memory for some time after a transaction is finished.

Part of #3794, #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Fetch data from the WAL in-memory buffer. The WAL allows starting a
fiber which creates an xrow buffer cursor at a given vclock and then
fetches rows from the xrow buffer one by one, calling a given callback
for each row. The WAL relaying fiber also sends a heartbeat message if
all rows were processed and no rows were written for the replication
timeout period.
In case of an outdated vclock (the WAL could not create a cursor
or fetch a new row from the cursor) the relay switches to reading
logged data from file up to the current vclock and then makes the
next attempt to fetch data from WAL memory.
In file mode there is always data to send to a replica, so the relay
does not have to send heartbeat messages.
From this point on, a relay creates a cord only when it switches to
reading from file. Frequent memory-file oscillation is not very
likely for two reasons:
 1. If a replica is too slow (slower than the master writes), it will
    switch to disk and then fall behind.
 2. If a replica is fast enough, it will catch up with memory and then
    consume the memory buffer before its rotation.
In order to split the WAL and relay logic, a relay filter function
was introduced which should be passed when the relay attaches
to the WAL.

Note: WAL exit is not graceful - tx sends a break loop message and
the WAL just stops cbus processing without any care about other
fibers which could still use cbus. To overcome this there is a
special trigger which is signaled just before the cbus pipe is
destroyed.

Close #3794
Part of #980
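
A schematic of the memory-first, file-fallback relay loop described in this commit; the functions are assumed placeholders, not Tarantool's real API.

```c
#include <stdbool.h>

struct vclock;

/* Serve the replica from the WAL in-memory buffer; returns false when the
 * requested position has already been rotated out of memory. */
extern bool relay_send_from_memory(struct vclock *pos);
/* Read already-logged data from .xlog files up to the current vclock. */
extern void relay_send_from_file(struct vclock *pos);

/* The relay keeps trying the memory path and only falls back to files when
 * the replica lags behind the in-memory buffer; after catching up from
 * files it retries the memory path. */
static void
relay_loop(struct vclock *pos)
{
    for (;;) {
        if (relay_send_from_memory(pos))
            continue;               /* the replica keeps up with memory */
        relay_send_from_file(pos);  /* catch up from disk, then retry */
    }
}
```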
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Synchronous WAL writes require a queue which contains written but
not yet confirmed journal entries. WAL checkpoint requests as well
as WAL sync requests should also be processed through this queue.
So, in order to make write, checkpoint and sync requests as similar
as possible, we can eliminate WAL rotation from the checkpoint
logic.

This patch also adds a box.internal.wal_rotate() function which
rotates the WAL and is used to preserve previous test behavior.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
The synchronous replication implementation suggests a queue which
collects all written journal entries until they are committed,
so WAL synchronization requests should be processed using that
queue.
This patch introduces journal entry flags and allows the
WAL to process checkpoint requests and writes through the
common write code.
The patch also adds a vclock to the journal entry in order to
return a checkpoint vclock. Currently this looks like weird
overhead, but dedicated journal entries will be used by the
synchronous replication implementation to decide which
journal entry is committed.

Part of #980
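
An illustrative take on the journal entry flags idea, with hypothetical names and values: checkpoint and sync requests travel through the same write queue as ordinary entries and differ only by a flag, and a checkpoint entry carries a vclock pointer so the caller gets the checkpoint vclock back.

```c
#include <stdint.h>

struct vclock;

/* Hypothetical flag values; ordinary writes carry no flags. */
enum journal_entry_flags {
    JOURNAL_ENTRY_CHECKPOINT = 1 << 0,  /* report a checkpoint vclock */
    JOURNAL_ENTRY_SYNC       = 1 << 1,  /* barrier: just pass the queue */
};

struct journal_entry {
    uint32_t flags;
    struct vclock *vclock;   /* filled in for checkpoint requests */
};

/* Assumed low-level helpers. */
extern void wal_write_rows(struct journal_entry *entry);
extern void wal_fill_checkpoint_vclock(struct journal_entry *entry);

/* Every request kind goes through the common write path; flags only add
 * extra steps on top of it. */
static void
wal_process_entry(struct journal_entry *entry)
{
    wal_write_rows(entry);
    if (entry->flags & JOURNAL_ENTRY_CHECKPOINT)
        wal_fill_checkpoint_vclock(entry);
    /* JOURNAL_ENTRY_SYNC needs no extra work: completing through the
     * queue is itself the synchronization point. */
}
```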
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
In some cases (for example when there is no data update during
recovery) a vy_tx could be inserted twice into the corresponding
writers list because the vy_tx has an empty log. So check whether
a vy_tx is already inserted.
This was not detected before because we did not preserve
transaction boundaries during recovery until now.

Part of #980
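
A tiny sketch of the fix, using a made-up intrusive list rather than the real vy_tx structures: link the transaction into the writers list only if it is not already there.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for vy_tx, keeping only what the fix needs. */
struct tx_sketch {
    struct tx_sketch *next_writer;
    bool in_writers;     /* already linked into the writers list? */
};

/* The fix in miniature: never insert the same transaction twice. */
static void
writers_add_once(struct tx_sketch **head, struct tx_sketch *tx)
{
    if (tx->in_writers)
        return;
    tx->next_writer = *head;
    *head = tx;
    tx->in_writers = true;
}
```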
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Refactoring: track the recovery journal vclock instead of using
the recovery one. The replicaset vclock now relies on the recovery
stream content instead of the WAL directory content (xlog names and
meta). This enables the applier to use this journal and generalizes
WAL recovery and applier final join handling.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
Add an internal routine which copies an xrow to a given region.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
An anonymous replica's vclock could be behind the master's, so
we should process the final join stage from the replica's vclock,
not the master's, in order not to skip master transactions between
the replica and master vclocks.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
There are minor changes in the joining process, made in order to
unify anonymous replica registration and the full join process.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
The fetch snapshot, register and full join routines implement
the same logic and can be refactored to share common routines.
This helps the further synchronous replication implementation.

Part of #980
GeorgyKirichenko pushed a commit that referenced this issue May 2, 2020
This patch introduces a special request with a vclock. Such a
request means that all queued transactions up to the ack vclock
are committed. Any non-read-only instance can emit such an ack
request after the corresponding transactions have been written to
the WAL. An ack request is also processed through relays and
appliers, and in the case of a read-only replica it is the only
way to commit a replicated transaction.
This patch prepares tarantool to emit an ack request only after
the required majority has been reached.

Part of #980
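
A sketch of how such an ack request could be processed, with hypothetical types: every queued entry whose vclock does not exceed the ack vclock is committed, the rest stay queued until the next ack.

```c
#include <stdbool.h>
#include <stdint.h>

#define VCLOCK_MAX 32

struct vclock {
    int64_t lsn[VCLOCK_MAX];
};

/* A written-but-unconfirmed transaction waiting in the commit queue. */
struct queued_entry {
    struct queued_entry *next;
    struct vclock vclock;    /* WAL position of this entry */
};

static bool
vclock_le(const struct vclock *a, const struct vclock *b)
{
    for (int i = 0; i < VCLOCK_MAX; i++)
        if (a->lsn[i] > b->lsn[i])
            return false;
    return true;
}

/* Assumed helper that finishes the commit on the TX side. */
extern void txn_complete_commit(struct queued_entry *entry);

/* Commit every entry at or below the acked vclock; the rest stay queued
 * until the next ack request arrives. */
static void
queue_process_ack(struct queued_entry **head, const struct vclock *ack)
{
    while (*head != NULL && vclock_le(&(*head)->vclock, ack)) {
        struct queued_entry *entry = *head;
        *head = entry->next;
        txn_complete_commit(entry);
    }
}
```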
@kyukhin kyukhin modified the milestones: 2.6.1, wishlist Oct 23, 2020