Cannot recover from an anonymous replica after writing to a local space due to an on_replace trigger for a global space. #8746

Closed
yanshtunder opened this issue Jun 7, 2023 · 0 comments · Fixed by #9032
Assignees
Labels
2.10 Target is 2.10 and all newer release/master branches bug Something isn't working replication

yanshtunder commented Jun 7, 2023

If an on_replace trigger on a global space writes to a local space on an anonymous replica, the replica can no longer recover from its own xlog files. To reproduce, run Tarantool as shown below:

------------
-- Master
------------

-- Step 1
box.cfg{
    listen = 3301,
    replication = {3301, 3302},
}
box.schema.user.grant('guest', 'replication')

-- Step 3   
box.schema.space.create('loc', {is_local = true})
box.space.loc:create_index('pk')
box.schema.space.create('glob')
box.space.glob:create_index('pk')

-- Step 5
box.space.glob:insert{1}


---------------------
-- Anonymous replica
---------------------

-- Step 2
box.cfg{
    listen = 3302,
    replication = {3301, 3302},
    replication_anon = true,
    read_only = true,
}

-- Step 4
box.space.glob:on_replace(function()
    box.space.loc:replace{42}
end)

Now try to recover the anonymous replica. Without force_recovery, recovery fails with this error:

2023-06-07 10:53:32.787 [8125] main/103/interactive I> recover from `./00000000000000000004.xlog'
2023-06-07 10:53:32.788 [8125] main/103/interactive box.cc:822 E> error at request: {type: 'INSERT', replica_id: 1, lsn: 6, space_id: 513, index_id: 0, tuple: [42]}
2023-06-07 10:53:32.788 [8125] main/103/interactive box.cc:765 E> XlogError: found a first global row in a transaction with LSN/TSN mismatch
2023-06-07 10:53:32.788 [8125] main/103/interactive F> can't initialize storage: found a first global row in a transaction with LSN/TSN mismatch
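
The error message suggests the invariant that recovery checks: the first global (non-local) row of a transaction is expected to carry a TSN equal to its own LSN. Below is a minimal Python model of such a check. It is a sketch, not Tarantool's code: `Row`, `is_local`, and `check_first_global_row` are hypothetical names, and the LSN/TSN values in the usage lines are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


class XlogError(Exception):
    """Stand-in for the XlogError seen in the log (hypothetical class)."""


@dataclass(frozen=True)
class Row:
    replica_id: int
    lsn: int
    tsn: int                 # transaction id carried by the row
    is_local: bool = False   # assumption: marks a row of a local space


def check_first_global_row(rows: List[Row]) -> None:
    """Raise XlogError if the first non-local row has lsn != tsn."""
    for row in rows:
        if row.is_local:
            continue  # local rows are skipped when looking for the first global row
        if row.lsn != row.tsn:
            raise XlogError(
                "found a first global row in a transaction with LSN/TSN mismatch"
            )
        return  # only the first global row is checked


# A well-formed transaction: global row first, local row appended by a trigger.
check_first_global_row([Row(1, 5, 5), Row(0, 1, 5, is_local=True)])

# A row whose TSN points at another LSN is rejected.
try:
    check_first_global_row([Row(1, 6, 5)])
except XlogError as exc:
    print("rejected:", exc)
```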

With force_recovery = true, recovery skips the row and then hits an assertion failure:

2023-06-07 12:32:19.031 [16600] main/103/interactive I> recover from `./00000000000000000004.xlog'
2023-06-07 12:32:19.031 [16600] main/103/interactive box.cc:822 E> error at request: {type: 'INSERT', replica_id: 1, lsn: 6, space_id: 513, index_id: 0, tuple: [42]}
2023-06-07 12:32:19.031 [16600] main/103/interactive recovery.cc:292 E> skipping row {1: 6}
2023-06-07 12:32:19.031 [16600] main/103/interactive box.cc:765 E> XlogError: found a first global row in a transaction with LSN/TSN mismatch
tarantool: ./src/box/recovery.cc:280: void recover_xlog(recovery*, xstream*, const vclock*): Assertion `row.replica_id != 0 || row.group_id == GROUP_LOCAL' failed.
@yanshtunder yanshtunder added the bug Something isn't working label Jun 7, 2023
@yanshtunder yanshtunder changed the title from "Cannot recover from an anonymous replica due to writing to a local space in an on_replace trigger for a global space." to "Cannot recover from an anonymous replica after writing to a local space due to an on_replace trigger for a global space." Jun 7, 2023
@sergepetrenko sergepetrenko self-assigned this Jun 7, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 27, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Jul 31, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Aug 23, 2023
@sergepetrenko sergepetrenko added replication 2.10 Target is 2.10 and all newer release/master branches labels Aug 24, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Aug 24, 2023
Transaction boundaries were not updated correctly for transactions in
which local space writes were made from a replication trigger. Fix
transaction boundary calculations for such cases.

Closes tarantool#8746

NO_DOC=bugfix
NO_CHANGELOG=covered by the next commit
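
To illustrate what recalculating transaction boundaries could mean here, the following Python sketch models a master transaction to which the replica appends a local row from a trigger. This is an assumption-laden model rather than the actual patch: `Row`, `is_commit`, and `recalc_boundaries` are hypothetical names, and the rule "all rows share the TSN of the first row, only the last row is the commit row" is my reading of the commit message.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Row:
    lsn: int
    tsn: int
    is_commit: bool
    is_local: bool = False  # assumption: marks a row of a local space


def recalc_boundaries(rows: List[Row]) -> List[Row]:
    """Return copies of rows that form exactly one transaction.

    Every row gets the TSN of the first row, and only the last row
    keeps the commit flag.
    """
    if not rows:
        return []
    tsn = rows[0].tsn
    last = len(rows) - 1
    return [replace(r, tsn=tsn, is_commit=(i == last)) for i, r in enumerate(rows)]


# The master's single-row transaction, plus a local row added by a trigger.
master_rows = [Row(lsn=5, tsn=5, is_commit=True)]
local_row = Row(lsn=1, tsn=1, is_commit=True, is_local=True)

for row in recalc_boundaries(master_rows + [local_row]):
    print(row)
```

Written as is, the two rows above would both claim to be commit rows with different TSNs, which is the kind of inconsistency recovery later trips over.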
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Aug 24, 2023
Force recovery first tries to collect all rows of a transaction into a
single list, and only then applies those rows.

The problem was that it collected rows based on the row replica_id. For
local rows replica_id is set to 0, but actually such rows can be part
of a transaction coming from any instance.

Fix recovery of such rows

Follow-up tarantool#8746
Follow-up tarantool#7932

NO_DOC=bugfix
NO_CHANGELOG=the broken behaviour couldn't be seen due to bug tarantool#8746
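
The grouping problem described above can be shown with a small Python model. It is a deliberately simplified illustration, not a reconstruction of recovery.cc: `Row`, `collect_by_replica_id`, and `collect_by_commit_flag` are hypothetical names, and grouping by the commit flag is only one assumed way to keep local rows inside their transaction.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Row:
    replica_id: int   # 0 for rows of local spaces
    lsn: int
    is_commit: bool


def collect_by_replica_id(rows: List[Row]) -> List[List[Row]]:
    """Fragile model: a change of replica_id starts a new group."""
    groups: List[List[Row]] = []
    for row in rows:
        if groups and groups[-1][-1].replica_id == row.replica_id:
            groups[-1].append(row)
        else:
            groups.append([row])
    return groups


def collect_by_commit_flag(rows: List[Row]) -> List[List[Row]]:
    """Assumed alternative: a group ends only at a commit row."""
    groups: List[List[Row]] = []
    current: List[Row] = []
    for row in rows:
        current.append(row)
        if row.is_commit:
            groups.append(current)
            current = []
    if current:
        groups.append(current)  # trailing rows without a commit row
    return groups


# One transaction from instance 1 that ends with a local row (replica_id 0),
# followed by an ordinary single-row transaction.
rows = [Row(1, 5, False), Row(0, 1, True), Row(1, 6, True)]
print(len(collect_by_replica_id(rows)))   # the first transaction is split apart
print(len(collect_by_commit_flag(rows)))  # the local row stays with its transaction
```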
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Aug 30, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Aug 30, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Aug 31, 2023
Transaction boundaries were not updated correctly for transactions in
which local space writes were made from a replication trigger. Existing
transaction boundaries and row flags from the master were written as is
on the replica. Actually, the replica should recalculate transaction
boundaries and even WAIT_SYNC/WAIT_ACK flags.

Transaction boundaries should be recalculated when a replica appends a
local write at the end of the master's transaction, and
WAIT_SYNC/WAIT_ACK should be overwritten when nopifying synchronous
transactions coming from an old term.

The latter fix has uncovered the bug in skipping outdated synchronous
transactions: if one replica replaces a transaction from an old term
with NOPs and then passes that transaction to the other replica, the
other replica raises a split brain error. It believes the NOPs are an
async transaction from an old term. This worked before the fix, because
the rows were written with the original WAIT_ACK = true bit. Now this
is fixed properly: we allow fully NOP async transactions from the old
term.

Closes tarantool#8746

NO_DOC=bugfix
NO_CHANGELOG=covered by the next commit
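
The handling of transactions from an old term, as described in this commit message, can be modelled roughly as follows. This Python sketch is an interpretation, not the applier's real logic: `Row`, `nopify`, `classify`, and the returned strings are hypothetical, and passing the terms in as plain integers is a simplification.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Row:
    type: str          # e.g. "INSERT" or "NOP"
    wait_sync: bool
    wait_ack: bool


def nopify(rows: List[Row]) -> List[Row]:
    """Replace rows with NOPs and clear the synchronous flags."""
    return [replace(r, type="NOP", wait_sync=False, wait_ack=False) for r in rows]


def classify(rows: List[Row], txn_term: int, current_term: int) -> str:
    """Decide what a replica does with an incoming transaction (model only)."""
    if txn_term >= current_term:
        return "apply"
    if any(r.wait_ack for r in rows):
        return "nopify"        # outdated synchronous transaction
    if all(r.type == "NOP" for r in rows):
        return "apply"         # fully NOP async transaction from an old term
    return "split-brain"       # real async data from an old term


old_sync = [Row("INSERT", wait_sync=True, wait_ack=True)]
relayed = nopify(old_sync)     # what the first replica writes and passes on
print(classify(old_sync, txn_term=1, current_term=2))
print(classify(relayed, txn_term=1, current_term=2))
```

In this model the second replica accepts the relayed NOPs even though their WAIT_ACK bit is cleared, which corresponds to the case the commit message says is now allowed.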
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Aug 31, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Sep 1, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Sep 8, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Sep 8, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Sep 11, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Sep 11, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Sep 13, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Sep 13, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Sep 15, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Sep 15, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Sep 18, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Sep 18, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Oct 2, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Oct 2, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Oct 5, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Oct 5, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Oct 6, 2023
sergepetrenko added a commit that referenced this issue Oct 9, 2023
sergepetrenko added a commit that referenced this issue Oct 9, 2023
sergepetrenko added a commit that referenced this issue Oct 9, 2023
(cherry picked from commit 099cb2d)
sergepetrenko added a commit that referenced this issue Oct 9, 2023
(cherry picked from commit 85df1c9)
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Oct 9, 2023
(cherry picked from commit 099cb2d)