Synchronous replication #980
Closed
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
Do not start a transaction for each local journal row or final join row; follow transaction boundaries instead. Part of #980
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
With synchronous replication an instance has two vclocks: the last written one and the last committed one. Although the two are currently equal, track them separately in preparation for synchronous replication. Part of #980
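A minimal sketch of tracking the two vclocks separately. The class and method names are hypothetical and are not taken from the tarantool source; a vclock is modelled here as a plain dict from replica id to LSN.

```python
class InstanceClocks:
    """Hypothetical holder for the written and committed vclocks."""

    def __init__(self):
        self.written = {}    # replica_id -> LSN of the last row written to the WAL
        self.committed = {}  # replica_id -> LSN of the last committed row

    def on_wal_write(self, replica_id, lsn):
        # The written vclock advances as soon as the WAL reports success.
        self.written[replica_id] = max(self.written.get(replica_id, 0), lsn)

    def on_commit(self, replica_id, lsn):
        # A row cannot be committed before it has been written.
        if lsn > self.written.get(replica_id, 0):
            raise ValueError("commit is ahead of the written vclock")
        self.committed[replica_id] = max(self.committed.get(replica_id, 0), lsn)


clocks = InstanceClocks()
clocks.on_wal_write(1, 10)
clocks.on_commit(1, 8)
```

Without synchronous replication every write is committed immediately, so the two dicts would always match; the gap between them only appears once commits wait for confirmation.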
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
Implement a cord region cache which allows getting a dedicated memory region from the cache and putting it back. This is a more general approach than the one the txn cache uses. Part of #980
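The get/put pattern described above can be illustrated with a small object pool. This is only a sketch of the idea, assuming a bounded free list; `RegionCache`, its `limit` parameter, and the use of `bytearray` as a stand-in for a memory region are all hypothetical.

```python
class RegionCache:
    """Hypothetical per-cord cache of reusable memory regions."""

    def __init__(self, limit=8):
        self._free = []      # regions that were put back and can be reused
        self._limit = limit  # upper bound on the number of cached regions

    def get(self):
        # Reuse a cached region if there is one, otherwise allocate a new one.
        return self._free.pop() if self._free else bytearray()

    def put(self, region):
        # Reset the region so the next user starts from an empty one.
        region.clear()
        if len(self._free) < self._limit:
            self._free.append(region)


cache = RegionCache()
region = cache.get()
region.extend(b"transaction data")
cache.put(region)
reused = cache.get()
```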
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
Since a cord has a region cache, there is no point in implementing the same feature for transactions only. Part of #980
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
The WAL does not take any responsibility for transaction commit or rollback. The only thing the WAL module does is write journal entries and report the write status. Part of #980
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
With synchronous replication a transaction could be written to the WAL but then rolled back in case of a conflict. So it is not enough to check that the WAL was synced during a checkpoint or while feeding an initial join. To handle this, the txn engine emits a zero-row transaction which is passed through the WAL and then committed in the same queue as standard transactions. A successful txn_sync therefore means that all previous transactions were committed, not just written. Part of #980
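The ordering guarantee behind the zero-row transaction can be sketched with a FIFO queue. This is an illustration under simplifying assumptions (no rollbacks, a single queue), and `TxnQueue` and its methods are hypothetical names rather than tarantool APIs.

```python
from collections import deque


class TxnQueue:
    """Hypothetical queue of transactions written to the WAL but not yet committed."""

    def __init__(self):
        self.pending = deque()
        self.committed = []

    def write(self, name, rows):
        # The transaction is written and now waits for its commit in FIFO order.
        self.pending.append((name, rows))

    def sync(self):
        # A zero-row transaction goes through the same queue as regular ones.
        self.write("sync", [])

    def commit_next(self):
        name, _rows = self.pending.popleft()
        self.committed.append(name)
        return name


queue = TxnQueue()
queue.write("t1", ["row a"])
queue.write("t2", ["row b"])
queue.sync()
order = [queue.commit_next() for _ in range(3)]
```

Because the queue is strictly ordered, the sync entry cannot be committed until every transaction queued before it has been resolved.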
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
The WAL uses a matrix clock (mclock) to track the vclocks reported by relays. This lets the WAL build the minimal boundary vclock, which can be used to collect unused WAL files. Box protects logs from collection with the wal_set_first_checkpoint call. To preserve logs while a replica is joining, gc tracks all join vclocks as checkpoints with a special mark: is_join_readview set to true. Also, there is no longer a gc consumer in the tx thread or gc consumer info in the box.info output, and the corresponding lines were commented out of the test. Part of #3795, #980
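One way to read "minimal boundary vclock" is as the component-wise minimum over all vclocks reported by relays. The sketch below assumes that reading; `min_boundary` is a hypothetical helper, and the matrix clock is modelled as a dict from replica id to that replica's reported vclock.

```python
def min_boundary(mclock):
    """Component-wise minimum over all vclocks tracked in the matrix clock."""
    components = set()
    for vclock in mclock.values():
        components.update(vclock)
    # A component missing from a reported vclock counts as 0.
    return {
        c: min(vclock.get(c, 0) for vclock in mclock.values())
        for c in sorted(components)
    }


# Two replicas (ids 2 and 3) report how far they have consumed each component.
mclock = {
    2: {1: 100, 2: 50},
    3: {1: 80, 2: 70},
}
boundary = min_boundary(mclock)
```

Under this reading, every row at or below the boundary has been consumed by all tracked replicas, so log files that only contain such rows are candidates for collection.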
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
* xrow buffer structure

Introduce an xrow buffer which stores encoded xrows in memory after a transaction has finished. The WAL uses an xrow buffer object to encode transactions and then writes the encoded data to a log file. The xrow buffer consists of at most XROW_BUF_CHUNK_COUNT rotating chunks organized in a ring. The rotation thresholds and XROW_BUF_CHUNK_COUNT are empirical values for now.

* xrow buffer cursor

This structure allows finding an xrow buffer row with a vclock less than the given one and then fetching row by row from that xrow forwards to the last appended row. An xrow buffer cursor is essential for in-memory replication and will be used by a relay to fetch all logged rows stored in WAL memory (implemented as an xrow buffer) from a given position and then follow all new changes.

Part of #3974 #980
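A rough sketch of a ring of rotating chunks with a read position. It simplifies the description above in two ways that are assumptions of this example: positions are scalar LSNs rather than vclocks, and rotation is triggered by a row count rather than by a size threshold. The class and parameter names are hypothetical.

```python
from collections import deque


class XrowBuffer:
    """Hypothetical in-memory ring of chunks holding (lsn, row) pairs."""

    def __init__(self, chunk_count=4, chunk_rows=2):
        # A deque with maxlen silently drops the oldest chunk on rotation.
        self.chunks = deque(maxlen=chunk_count)
        self.chunk_rows = chunk_rows

    def append(self, lsn, row):
        if not self.chunks or len(self.chunks[-1]) >= self.chunk_rows:
            self.chunks.append([])  # rotate to a fresh chunk
        self.chunks[-1].append((lsn, row))

    def read_from(self, lsn):
        """Return all rows after `lsn`, or fail if they were rotated out."""
        rows = [item for chunk in self.chunks for item in chunk]
        if rows and rows[0][0] > lsn + 1:
            raise LookupError("position is no longer in memory")
        return [item for item in rows if item[0] > lsn]


buf = XrowBuffer(chunk_count=2, chunk_rows=2)
for lsn in range(1, 7):
    buf.append(lsn, f"row {lsn}")
tail = buf.read_from(4)
```

With two chunks of two rows each, rows 1 and 2 are gone by the time row 5 arrives, so a reader positioned at LSN 1 can no longer be served from memory.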
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
Fetch data from the WAL in-memory buffer.

The WAL allows starting a fiber which creates an xrow buffer cursor with a given vclock, then fetches rows from the xrow buffer one by one and calls the given callback for each row. The WAL relaying fiber also sends a heartbeat message if all rows were processed and no rows were written during the replication timeout period.

In case of an outdated vclock (the WAL could not create a cursor or fetch a new row from the cursor), the relay switches to reading logged data from file up to the current vclock and then makes another attempt to fetch data from WAL memory. In file mode there is always data to send to a replica, so the relay does not have to send heartbeat messages. From this point on, the relay creates a cord only when it switches to reading from file.

Frequent memory-file oscillation is not very likely, for two reasons:
1. If a replica is too slow (slower than the master writes), it will switch to disk and then fall behind.
2. If a replica is fast enough, it will catch up with memory and then consume rows from memory before the memory buffer rotates.

To separate WAL and relay logic, a relay filter function was introduced, which should be passed when the relay attaches to the WAL.

Note: WAL exit is not graceful. tx sends a break-loop message and the WAL just stops cbus processing without any care for other fibers which could still be using cbus. To overcome this, there is a special trigger which is signaled just before the cbus pipe is destroyed.

Close #3794
Part of #980
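The switch between memory and file described above can be sketched as a single decision step. This is a simplified model, not the relay code: positions are scalar LSNs, `memory` and `disk` are plain lists of LSNs, and `relay_step` is a hypothetical function.

```python
def relay_step(replica_lsn, memory, disk):
    """One relay iteration: pick a source and the LSNs to send from it."""
    if not memory or memory[0] <= replica_lsn + 1:
        # The replica position is still covered by the in-memory buffer.
        # An empty result here is the case where a heartbeat would be sent.
        return "memory", [lsn for lsn in memory if lsn > replica_lsn]
    # The position was rotated out of memory: read the log file up to the
    # current (last written) LSN, then retry memory on the next step.
    return "file", [lsn for lsn in disk if lsn > replica_lsn]


disk = list(range(1, 11))      # LSNs 1..10 are in the log file
memory = [7, 8, 9, 10]         # only the newest rows are still in memory
first = relay_step(3, memory, disk)

# The master writes two more rows while the replica catches up from file.
disk += [11, 12]
memory = [9, 10, 11, 12]
second = relay_step(10, memory, disk)
idle = relay_step(12, memory, disk)
```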
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
Synchronous WAL writes require a queue which contains written but not yet confirmed journal entries. WAL checkpoint requests as well as WAL sync requests should also be processed through the queue. So, to make write, checkpoint and sync requests as similar as possible, we can eliminate WAL rotation from the checkpoint logic. This patch also adds the box.internal.wal_rotate() function, which rotates the WAL and is used to preserve the previous test behavior. Part of #980
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
The synchronous replication implementation calls for a queue which collects all written journal entries until they are committed, so WAL synchronization requests should be processed through the queue. This patch introduces journal entry flags and allows the WAL to process checkpoint requests and writes using common write code. The patch also adds a vclock to the journal entry in order to return a checkpoint vclock. Currently this looks like odd overhead, but dedicated journal entries will be used by the synchronous replication implementation to decide which journal entry is committed. Part of #980
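To illustrate how a checkpoint request can share the write path with ordinary writes, here is a small sketch. The flag name, the `JournalEntry` fields and `wal_write` are assumptions made for this example and do not mirror the actual structures in the patch.

```python
from dataclasses import dataclass, field
from enum import IntFlag


class EntryFlag(IntFlag):
    CHECKPOINT = 1  # hypothetical flag marking a checkpoint request


@dataclass
class JournalEntry:
    rows: list                           # (replica_id, lsn) pairs; empty for a checkpoint
    flags: EntryFlag = EntryFlag(0)
    vclock: dict = field(default_factory=dict)


def wal_write(entry, wal_vclock, log):
    """Common write path: append rows, then stamp the entry with the WAL vclock."""
    for replica_id, lsn in entry.rows:
        log.append((replica_id, lsn))
        wal_vclock[replica_id] = lsn
    entry.vclock = dict(wal_vclock)
    return entry


wal_vclock, log = {}, []
wal_write(JournalEntry(rows=[(1, 1), (1, 2)]), wal_vclock, log)
checkpoint = wal_write(JournalEntry(rows=[], flags=EntryFlag.CHECKPOINT), wal_vclock, log)
```

The checkpoint entry carries no rows, so the vclock it gets back is simply the position of everything written before it.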
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
In some cases (for example, when there is no data update during recovery) a vy_tx could be inserted twice into the corresponding writers list because the vy_tx has an empty log. So check whether a vy_tx is already inserted. This was not detected before because recovery did not previously preserve transaction boundaries. Part of #980
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
Refactoring: track the recovery journal vclock instead of using the recovery one. Now the replicaset vclock relies on the recovery stream content instead of the WAL directory content (xlog names and meta). This enables the applier to use this journal and generalizes WAL recovery and applier final join handling. Part of #980
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
Add an internal routine which copies an xrow to a given region. Part of #980
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
An anonymous replica's vclock could be behind the master's, so the final join stage should be processed from the replica vclock rather than the master vclock in order not to skip master transactions between the replica and master vclocks. Part of #980
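A small illustration of why the starting point matters. The helper below is hypothetical; the log is modelled as a list of (replica_id, lsn) pairs and a vclock as a dict.

```python
def final_join_rows(log, start_vclock):
    """Rows the replica still needs when the final join starts at start_vclock."""
    return [
        (replica_id, lsn)
        for replica_id, lsn in log
        if lsn > start_vclock.get(replica_id, 0)
    ]


master_log = [(1, lsn) for lsn in range(1, 6)]  # the master has written LSNs 1..5
replica_vclock = {1: 3}                         # the anonymous replica has seen 1..3
master_vclock = {1: 5}

from_replica = final_join_rows(master_log, replica_vclock)
from_master = final_join_rows(master_log, master_vclock)
```

Starting from the master vclock sends nothing, so the replica would never receive LSNs 4 and 5; starting from the replica vclock sends exactly those rows.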
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
Minor changes to the joining process, made in order to unify anonymous replica registration and the full join process. Part of #980
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
The fetch snapshot, register and full join routines implement the same logic and can be refactored to use common routines. This helps with the further synchronous replication implementation. Part of #980
GeorgyKirichenko pushed a commit that referenced this issue on May 2, 2020
This patch introduces a special request with a vclock. Such a request means that all queued transactions up to the ack vclock are committed. Any non-read-only instance can emit such an ack request after the corresponding transactions were written to the WAL. An ack request is also processed through relays and appliers, and for a read-only replica it is the only way to commit a replicated transaction. This patch prepares tarantool to emit an ack request only after the required majority is reached. Part of #980
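The effect of an ack request on the queue of written transactions can be sketched as follows. `apply_ack` is a hypothetical helper, and each queued transaction is reduced to the (replica_id, lsn) pair of its last row.

```python
from collections import deque


def apply_ack(queue, ack_vclock):
    """Commit queued transactions, in order, up to the acknowledged vclock."""
    committed = []
    while queue:
        replica_id, lsn = queue[0]
        if lsn > ack_vclock.get(replica_id, 0):
            break  # this transaction and everything after it stay queued
        committed.append(queue.popleft())
    return committed


queue = deque([(1, 5), (1, 6), (1, 7)])
committed = apply_ack(queue, {1: 6})
```

A read-only replica would run the same step when the ack arrives through its applier, since it has no other way to learn that a replicated transaction is committed.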
I am updating this ticket to list the RAFT implementation plan as I see it. It turns out the implementation plan exists nowhere but in my head.
QA