
Assertion fail in vclock_follow() #4739

Closed
Totktonada opened this issue Jan 22, 2020 · 4 comments

@Totktonada (Member)

Tarantool version: 2.4.0-16-gcdf502c66.
OS version: Linux.

How to reproduce

Add the file:

$ cat test/replication/upsert_stress.test.lua
test_run = require('test_run').new()
fiber = require('fiber')

SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
test_run:create_cluster(SERVERS, "replication", {args='0.1'})
test_run:wait_fullmesh(SERVERS)

_ = test_run:cmd("switch autobootstrap1")
test_run = require('test_run').new()
engine = test_run:get_cfg('engine')
_ = pcall(function() box.space.test:drop() end)
s = box.schema.space.create('test', {engine = engine})
_ = s:create_index('pk')
_ = s:create_index('sk', {parts = {{2, 'string'}}, unique = false})

-- Wait for schema update on all instances.
_ = test_run:cmd('switch default')
vclock = test_run:get_cluster_vclock(SERVERS)
_ = test_run:wait_cluster_vclock(SERVERS, vclock)

test_run:cmd("stop server autobootstrap1 with signal=KILL")
test_run:cmd("start server autobootstrap1")
_ = test_run:cmd("switch autobootstrap1")
fiber = require('fiber')
_ = fiber.create(function()                                                  \
    for j = 1, 10 do                                                         \
        for i = 1, 1000 do                                                   \
            box.space.test:upsert({i, tostring(i)}, {{'=', 2, tostring(i)}}) \
        end                                                                  \
    end                                                                      \
end)

test_run:cmd("stop server autobootstrap2 with signal=KILL")
test_run:cmd("start server autobootstrap2")
_ = test_run:cmd("switch autobootstrap2")
fiber = require('fiber')
_ = fiber.create(function()                                                  \
    for j = 1, 10 do                                                         \
        box.begin()                                                          \
        for i = 1, 1000 do                                                   \
            box.space.test:upsert({i, tostring(i)}, {{'=', 2, tostring(i)}}) \
        end                                                                  \
        box.commit()                                                         \
    end                                                                      \
end)

test_run:cmd("stop server autobootstrap3 with signal=KILL")
test_run:cmd("start server autobootstrap3")
_ = test_run:cmd("switch autobootstrap3")
fiber = require('fiber')
_ = fiber.create(function()                                                  \
    for j = 1, 10 do                                                         \
        for i = 1, 1000 do                                                   \
            box.space.test:upsert({i, tostring(i)}, {{'=', 2, tostring(i)}}) \
        end                                                                  \
    end                                                                      \
end)

test_run:cmd("stop server autobootstrap1 with signal=KILL")
test_run:cmd("start server autobootstrap1")
_ = test_run:cmd("switch autobootstrap1")
fiber = require('fiber')
_ = fiber.create(function()                                                  \
    for j = 1, 10 do                                                         \
        for i = 1, 1000 do                                                   \
            box.space.test:upsert({i, tostring(i)}, {{'=', 2, tostring(i)}}) \
        end                                                                  \
    end                                                                      \
end)

test_run:cmd("stop server autobootstrap2 with signal=KILL")
test_run:cmd("start server autobootstrap2")
_ = test_run:cmd("switch autobootstrap2")
fiber = require('fiber')
_ = fiber.create(function()                                                  \
    for j = 1, 10 do                                                         \
        for i = 1, 1000 do                                                   \
            box.space.test:upsert({i, tostring(i)}, {{'=', 2, tostring(i)}}) \
        end                                                                  \
    end                                                                      \
end)

test_run:cmd("stop server autobootstrap3 with signal=KILL")
test_run:cmd("start server autobootstrap3")
_ = test_run:cmd("switch autobootstrap3")
fiber = require('fiber')
_ = fiber.create(function()                                                  \
    for j = 1, 10 do                                                         \
        box.begin()                                                          \
        for i = 1, 1000 do                                                   \
            box.space.test:upsert({i, tostring(i)}, {{'=', 2, tostring(i)}}) \
        end                                                                  \
        box.commit()                                                         \
    end                                                                      \
end)

test_run:cmd("stop server autobootstrap1 with signal=KILL")
test_run:cmd("start server autobootstrap1")
_ = test_run:cmd("switch autobootstrap1")
fiber = require('fiber')
_ = fiber.create(function()                                                  \
    for j = 1, 10 do                                                         \
        box.begin()                                                          \
        for i = 1, 1000 do                                                   \
            box.space.test:upsert({i, tostring(i)}, {{'=', 2, tostring(i)}}) \
        end                                                                  \
        box.commit()                                                         \
    end                                                                      \
end)

test_run:cmd("stop server autobootstrap2 with signal=KILL")
test_run:cmd("start server autobootstrap2")
_ = test_run:cmd("switch autobootstrap2")
fiber = require('fiber')
_ = fiber.create(function()                                                  \
    for j = 1, 10 do                                                         \
        for i = 1, 1000 do                                                   \
            box.space.test:upsert({i, tostring(i)}, {{'=', 2, tostring(i)}}) \
        end                                                                  \
    end                                                                      \
end)

test_run:cmd("stop server autobootstrap3 with signal=KILL")
test_run:cmd("start server autobootstrap3")
_ = test_run:cmd("switch autobootstrap3")
fiber = require('fiber')
_ = fiber.create(function()                                                  \
    for j = 1, 10 do                                                         \
        for i = 1, 1000 do                                                   \
            box.space.test:upsert({i, tostring(i)}, {{'=', 2, tostring(i)}}) \
        end                                                                  \
    end                                                                      \
end)

_ = test_run:cmd("switch autobootstrap1")
box.space.test:drop()

_ = test_run:cmd("switch default")

test_run:drop_cluster(SERVERS)

Generate the result file:

$ (cd test && ./test-run.py upsert_stress --conf memtx)

Run in parallel (memtx or vinyl, it does not matter):

$ (cd test && ./test-run.py -j 20 $(yes upsert_stress | head -n 100) --conf memtx)

Result

[014] replication/upsert_stress.test.lua              memtx           
[014] 
[014] [Instance "autobootstrap1" killed by signal: 6 (SIGABRT)]
[014] Found assertion fail in the results file [/home/alex/projects/tarantool-meta/tarantool/test/var/014_replication/autobootstrap1.log]:
2020-01-22 14:42:10.561 [28419] main/110/main I> remote vclock {1: 9017, 2: 8962, 3: 10000} local vclock {1: 9017, 2: 8961, 3: 10000}
2020-01-22 14:42:10.620 [28419] relay/unix/:(socket)/101/main I> recover from `/home/alex/projects/tarantool-meta/tarantool/test/var/014_replication/autobootstrap1/00000000000000009035.xlog'
2020-01-22 14:42:10.647 [28419] relay/unix/:(socket)/101/main I> recover from `/home/alex/projects/tarantool-meta/tarantool/test/var/014_replication/autobootstrap1/00000000000000009035.xlog'
2020-01-22 14:42:10.910 [28419] main/112/applier/cluster@unix/:/home/ale I> can't read row
2020-01-22 14:42:10.910 [28419] main/112/applier/cluster@unix/:/home/ale coio.cc:378 !> SystemError unexpected EOF when reading from socket, called on fd 22, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2020-01-22 14:42:10.910 [28419] main/112/applier/cluster@unix/:/home/ale I> will retry every 1.00 second
2020-01-22 14:42:10.929 [28419] relay/unix/:(socket)/101/main sio.c:268 !> SystemError writev(2), called on fd 32, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2020-01-22 14:42:10.929 [28419] relay/unix/:(socket)/101/main C> exiting the relay loop
2020-01-22 14:42:13.595 [28419] main/113/applier/cluster@unix/:/home/ale I> can't read row
2020-01-22 14:42:13.595 [28419] main/113/applier/cluster@unix/:/home/ale xrow.c:1082 E> ER_SYSTEM: timed out
2020-01-22 14:42:13.595 [28419] main/113/applier/cluster@unix/:/home/ale I> will retry every 1.00 second
2020-01-22 14:42:14.596 [28419] main/113/applier/cluster@unix/:/home/ale I> authenticated
2020-01-22 14:42:14.597 [28419] main/113/applier/cluster@unix/:/home/ale I> subscribed
2020-01-22 14:42:14.597 [28419] main/113/applier/cluster@unix/:/home/ale I> remote vclock {1: 12991, 2: 9198, 3: 10000} local vclock {1: 12990, 2: 9198, 3: 10000}
tarantool: /home/alex/p/tarantool-meta/tarantool/src/box/vclock.c:43: vclock_follow: Assertion `lsn >= 0' failed.
[014] [ fail ]

Backtrace:

(gdb) bt
#0  0x00007fa78c6b61f1 in raise () from /lib64/libc.so.6
#1  0x00007fa78c69e55b in abort () from /lib64/libc.so.6
#2  0x00007fa78c69e42f in __assert_fail_base.cold () from /lib64/libc.so.6
#3  0x00007fa78c6ad9e2 in __assert_fail () from /lib64/libc.so.6
#4  0x000055889f0e40c8 in vclock_follow (vclock=0x7fa7884fe320, replica_id=1, lsn=-5929) at /home/alex/p/tarantool-meta/tarantool/src/box/vclock.c:43
#5  0x000055889ef27221 in wal_assign_lsn (vclock_diff=0x7fa7884fe320, base=0x55889f3ef518 <wal_writer_singleton+5976>, row=0x7fa755e3d2e0, end=0x7fa755e3d2e8)
    at /home/alex/p/tarantool-meta/tarantool/src/box/wal.c:954
#6  0x000055889ef27438 in wal_write_to_disk (msg=0x7fa78bc343a0) at /home/alex/p/tarantool-meta/tarantool/src/box/wal.c:1029
#7  0x000055889ef76efd in cmsg_deliver (msg=0x7fa78bc343a0) at /home/alex/p/tarantool-meta/tarantool/src/lib/core/cbus.c:353
#8  0x000055889ef77780 in cbus_process (endpoint=0x7fa7884ffe90) at /home/alex/p/tarantool-meta/tarantool/src/lib/core/cbus.c:635
#9  0x000055889ef777cf in cbus_loop (endpoint=0x7fa7884ffe90) at /home/alex/p/tarantool-meta/tarantool/src/lib/core/cbus.c:642
#10 0x000055889ef277c9 in wal_writer_f (ap=0x7fa788400278) at /home/alex/p/tarantool-meta/tarantool/src/box/wal.c:1127
#11 0x000055889ee25bc9 in fiber_cxx_invoke(fiber_func, typedef __va_list_tag __va_list_tag *) (f=0x55889ef27751 <wal_writer_f>, ap=0x7fa788400278)
    at /home/alex/p/tarantool-meta/tarantool/src/lib/core/fiber.h:761
#12 0x000055889ef7096b in fiber_loop (data=0x0) at /home/alex/p/tarantool-meta/tarantool/src/lib/core/fiber.c:830
#13 0x000055889f17e09f in coro_init () at /home/alex/p/tarantool-meta/tarantool/third_party/coro/coro.c:110

Observations

It seems that the LSN is not sequential for the replica with id 1 (the instance that crashed), or the replica_id in struct row is incorrect.
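
For context, here is a minimal self-contained sketch (simplified, with assumed names, not the actual Tarantool sources) of how re-applying an already-written remote row drives the per-replica LSN delta negative before it hits the `lsn >= 0` assertion from frame #4 of the backtrace:

```c
#include <assert.h>
#include <stdint.h>

/* Stubbed vclock: one LSN slot per replica id (a simplification). */
struct vclock { int64_t lsn[32]; };

static void
vclock_follow(struct vclock *vclock, uint32_t replica_id, int64_t lsn)
{
	assert(lsn >= 0);                      /* vclock.c:43, fires here */
	assert(lsn > vclock->lsn[replica_id]); /* the "lsn > prev_lsn" check */
	vclock->lsn[replica_id] = lsn;
}

/*
 * Hypothetical stand-in for wal_assign_lsn() handling a remote row: the
 * WAL thread records the per-replica delta between the row's own LSN and
 * the vclock the batch started from. If a row the instance already has
 * (row_lsn <= base LSN) gets applied again, the delta goes negative,
 * matching lsn=-5929 in frame #4 of the backtrace.
 */
static void
follow_remote_row(struct vclock *vclock_diff, const struct vclock *base,
		  uint32_t replica_id, int64_t row_lsn)
{
	vclock_follow(vclock_diff, replica_id,
		      row_lsn - base->lsn[replica_id]);
}

int
main(void)
{
	struct vclock base = { .lsn = { [1] = 12990 } };
	struct vclock diff = { { 0 } };
	follow_remote_row(&diff, &base, 1, 12991); /* ok: delta = 1 */
	follow_remote_row(&diff, &base, 1, 7061);  /* re-applied row: the
	                                              delta is -5929 and the
	                                              assertion aborts */
	return 0;
}
```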

Totktonada added the bug, crash and replication labels on Jan 22, 2020
kyukhin added this to the 2.4.1 milestone on Jan 24, 2020
@kyukhin (Contributor) commented Jan 25, 2020

I've reproduced it on 1.10.
On the 4th server, the command is:

./test-run.py --builddir ../../bld -j 200 $(yes replication/upsert-stress | head -n 200) --conf memtx

@Totktonada (Member, Author)

A RelWithDebInfo build does not fail on the assertion, of course, but it writes duplicate entries into an xlog file:

$ cat test/var/014_replication/autobootstrap1.log
2020-01-31 19:42:04.884 [23055] main/102/autobootstrap1 C> Tarantool 2.2.1-117-gb62c11108
2020-01-31 19:42:04.884 [23055] main/102/autobootstrap1 C> log level 5
2020-01-31 19:42:04.884 [23055] main/102/autobootstrap1 I> mapping 268435456 bytes for memtx tuple arena...
2020-01-31 19:42:04.884 [23055] main/102/autobootstrap1 I> mapping 134217728 bytes for vinyl tuple arena...
2020-01-31 19:42:05.027 [23055] main/102/autobootstrap1 I> instance uuid a62762e5-5437-4a65-9223-ad19747f0527
2020-01-31 19:42:05.030 [23055] main/102/autobootstrap1 F> LSN for 3 is used twice or COMMIT order is broken: confirmed: 7835, new: 7541, req: {type: 'UPSERT', replica_id: 3, lsn: 7541, space_id: 513, index_id: 0, tuple: [533, "533"], ops: [["=", 2, "533"]]}
2020-01-31 19:42:05.030 [23055] main/102/autobootstrap1 F> LSN for 3 is used twice or COMMIT order is broken: confirmed: 7835, new: 7541, req: {type: 'UPSERT', replica_id: 3, lsn: 7541, space_id: 513, index_id: 0, tuple: [533, "533"], ops: [["=", 2, "533"]]}
$ tarantoolctl cat --show-system test/var/014_replication/autobootstrap1/00000000000000000017.xlog | grep -A 9 -B 2 'lsn: 7541'
Processing file 'test/var/014_replication/autobootstrap1/00000000000000000017.xlog'
---
HEADER:
  lsn: 7541
  replica_id: 3
  type: UPSERT
  timestamp: 1580488922.5749
BODY:
  space_id: 513
  operations: [['=', 2, '533']]
  index_base: 1
  tuple: [533, '533']
---
--
---
HEADER:
  lsn: 7541
  replica_id: 3
  type: UPSERT
  timestamp: 1580488922.6664
BODY:
  space_id: 513
  operations: [['=', 2, '533']]
  index_base: 1
  tuple: [533, '533']
---
--
---
HEADER:
  lsn: 7541
  replica_id: 1
  type: UPSERT
  tsn: 7010
  timestamp: 1580488923.1669
BODY:
  space_id: 513
  operations: [['=', 2, '532']]
  index_base: 1
  tuple: [532, '532']

It seems we should panic in this case in a RelWithDebInfo build rather than write an incorrect xlog file.
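
A minimal sketch of what such a guard could look like (an illustration of the suggestion, not the committed fix; the real code would use Tarantool's panic() rather than abort()):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct vclock { int64_t lsn[32]; };

/*
 * Unlike assert(), this check survives RelWithDebInfo and Release builds:
 * instead of silently writing a row with a duplicate LSN into the xlog,
 * stop the instance with the same message the recovery code prints.
 */
static void
vclock_follow_or_die(struct vclock *vclock, uint32_t replica_id, int64_t lsn)
{
	int64_t prev = vclock->lsn[replica_id];
	if (lsn <= prev) {
		fprintf(stderr, "LSN for %u is used twice or COMMIT order "
			"is broken: confirmed: %lld, new: %lld\n",
			replica_id, (long long)prev, (long long)lsn);
		abort(); /* stand-in for Tarantool's panic() */
	}
	vclock->lsn[replica_id] = lsn;
}

int
main(void)
{
	struct vclock vclock = { { 0 } };
	vclock_follow_or_die(&vclock, 3, 7835); /* first write: ok */
	vclock_follow_or_die(&vclock, 3, 7541); /* duplicate LSN: aborts with
	                                           the message from the log */
	return 0;
}
```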

The symptoms look similar to #4749.

@sergepetrenko (Collaborator)

Bisecting shows that the problem lies in commit 8c84932.

@sergepetrenko (Collaborator)

A different error appears on 1.10: `XlogError: invalid magic 0x0`.
This probably has to be investigated separately.

sergepetrenko added a commit that referenced this issue Feb 12, 2020
Fix replicaset.applier.vclock initialization issues: it wasn't
initialized at all previously. Moreover, there is no valid point in the
code to initialize it, since it may get stale right away if new entries
are written to the WAL. So, check both the applier and replicaset vclocks.
The greater one protects the instance from applying the rows it has
already applied or has already scheduled to write.
Also remove an unnecessary applier vclock initialization from
replication_init().

Closes #4739
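
For illustration, a rough sketch (names are assumptions, not the actual applier code) of the check this commit message describes:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct vclock { int64_t lsn[32]; };

/*
 * A row is stale if it is already covered either by the replicaset vclock
 * (already written to the WAL) or by the applier vclock (already scheduled
 * to be written); the greater of the two per-replica LSNs is what counts.
 */
static bool
applier_row_is_stale(const struct vclock *applier_vclock,
		     const struct vclock *replicaset_vclock,
		     uint32_t replica_id, int64_t row_lsn)
{
	int64_t known = applier_vclock->lsn[replica_id];
	if (replicaset_vclock->lsn[replica_id] > known)
		known = replicaset_vclock->lsn[replica_id];
	return row_lsn <= known;
}

int
main(void)
{
	struct vclock applier = { .lsn = { [1] = 12990 } };
	struct vclock replicaset = { .lsn = { [1] = 12985 } };
	/* The row with lsn 12990 was already scheduled: skip it. */
	printf("%d\n", applier_row_is_stale(&applier, &replicaset, 1, 12990));
	/* The row with lsn 12991 is new: apply it. */
	printf("%d\n", applier_row_is_stale(&applier, &replicaset, 1, 12991));
	return 0;
}
```
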
sergepetrenko added a commit that referenced this issue Feb 12, 2020
There is an assertion in vclock_follow `lsn > prev_lsn`, which doesn't
fire in release builds, of course. So we'd better panic on an attempt to
write a record with a duplicate or otherwise broken lsn.

Follow-up #4739
sergepetrenko added a commit that referenced this issue Feb 13, 2020
When a master processes a subscribe request, it responds with its vclock
at the moment of receiving the request. However, the fiber processing
the request may yield on coio_write_xrow when sending the response to
the replica. In the meantime, the master may apply additional rows coming
from the replica after it has issued SUBSCRIBE.
Then, in relay_subscribe, the master sets its local vclock_at_subscribe to
a possibly updated value of replicaset.vclock.
So, set local_vclock_at_subscribe to a remembered value, rather than an
updated one.

Part of #4739
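
A toy single-threaded sketch (assumed names, no real fibers) of the ordering this commit enforces: snapshot the vclock before the response write that may yield, and hand the snapshot to the relay:

```c
#include <stdint.h>
#include <stdio.h>

struct vclock { int64_t lsn[4]; };

static struct vclock replicaset_vclock = { { 0, 100, 0, 0 } };

/* Pretend that writing the SUBSCRIBE response yields and lets an applier
 * advance the live replicaset vclock in the meantime. */
static void
send_subscribe_response_and_yield(void)
{
	replicaset_vclock.lsn[1] = 105;
}

int
main(void)
{
	/* Remember the vclock at the moment the request is processed ... */
	struct vclock vclock_at_subscribe = replicaset_vclock;
	/* ... because this call may yield and the live vclock may move on. */
	send_subscribe_response_and_yield();
	/* The relay must be given the snapshot, not the updated live value. */
	printf("snapshot: %lld, live: %lld\n",
	       (long long)vclock_at_subscribe.lsn[1],
	       (long long)replicaset_vclock.lsn[1]); /* 100 vs. 105 */
	return 0;
}
```
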
sergepetrenko added a commit that referenced this issue Feb 13, 2020
is_orphan status check is needed by applier in order to not re-apply
local instance rows coming from the replica after replication has
synced.

Prerequisite #4739
sergepetrenko added a commit that referenced this issue Feb 13, 2020
Remove applier vclock initialization from replication_init(), where it
is zeroed out, and place it at the end of box_cfg_xc(), where the
replicaset vclock already has a meaningful value.
Do not apply rows originating from the current instance if replication
sync has ended.

Closes #4739
sergepetrenko added a commit that referenced this issue Feb 13, 2020
There is an assertion in vclock_follow `lsn > prev_lsn`, which doesn't
fire in release builds, of course. Let's at least warn the user on an
attempt to write a record with a duplicate or otherwise broken lsn.

Follow-up #4739
sergepetrenko added a commit that referenced this issue Feb 13, 2020
When a master processes a subscribe request, it responds with its vclock
at the moment of receiving the request. However, the fiber processing
the request may yield on coio_write_xrow when sending the response to
the replica. In the meantime, the master may apply additional rows coming
from the replica after it has issued SUBSCRIBE.
Then, in relay_subscribe, the master sets its local vclock_at_subscribe to
a possibly updated value of replicaset.vclock.
So, set local_vclock_at_subscribe to a remembered value, rather than an
updated one.

Follow-up #4739
sergepetrenko added a commit that referenced this issue Feb 14, 2020
is_orphan status check is needed by applier in order to tell relay
whether to send the instance's own rows back or not.

Prerequisite #4739
sergepetrenko added a commit that referenced this issue Feb 14, 2020
We have a mechanism for restoring rows originating from an instance that
suffered a sudden power loss: remote masters resend the instance's rows
received before a certain point in time, defined by the remote master's
vclock at the moment of subscribe.
However, this is useful only on initial replication configuration, when
an instance has just recovered, so that it can receive what it has
relayed but hasn't synced to disk.
In other cases, when an instance is operating normally and master-master
replication is configured, the mechanism described above may lead to the
instance re-applying its own rows, coming from a master it has just
subscribed to.
To fix the problem, do not relay rows coming from a remote instance if
the instance has already recovered.

Closes #4739
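
A rough sketch (hypothetical names, not the actual relay code) of the decision this commit describes:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct row { uint32_t replica_id; int64_t lsn; };

/*
 * During its initial subscribe after a crash, a replica still needs its
 * own rows back (it may have relayed them without syncing them to disk);
 * once it has recovered, relaying its own rows back only makes it
 * re-apply them.
 */
static bool
relay_should_send_row(const struct row *row, uint32_t subscriber_id,
		      bool subscriber_has_recovered)
{
	if (row->replica_id == subscriber_id && subscriber_has_recovered)
		return false; /* skip the subscriber's own row */
	return true;
}

int
main(void)
{
	struct row r = { .replica_id = 1, .lsn = 12991 };
	printf("%d\n", relay_should_send_row(&r, 1, true));  /* 0: skip */
	printf("%d\n", relay_should_send_row(&r, 1, false)); /* 1: send */
	return 0;
}
```
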
sergepetrenko added a commit that referenced this issue Feb 14, 2020
There is an assertion in vclock_follow `lsn > prev_lsn`, which doesn't
fire in release builds, of course. Let's at least warn the user on an
attempt to write a record with a duplicate or otherwise broken lsn.

Follow-up #4739
sergepetrenko added a commit that referenced this issue Feb 18, 2020
is_orphan status check is needed by applier in order to tell relay
whether to send the instance's own rows back or not.

Prerequisite #4739
sergepetrenko added a commit that referenced this issue Feb 18, 2020
We have a mechanism for restoring rows originating from an instance that
suffered a sudden power loss: remote masters resend the instance's rows
received before a certain point in time, defined by the remote master's
vclock at the moment of subscribe.
However, this is useful only on initial replication configuration, when
an instance has just recovered, so that it can receive what it has
relayed but hasn't synced to disk.
In other cases, when an instance is operating normally and master-master
replication is configured, the mechanism described above may lead to the
instance re-applying its own rows, coming from a master it has just
subscribed to.
To fix the problem, do not relay rows coming from a remote instance if
the instance has already recovered.

Closes #4739
sergepetrenko added a commit that referenced this issue Feb 18, 2020
There is an assertion in vclock_follow `lsn > prev_lsn`, which doesn't
fire in release builds, of course. Let's at least warn the user on an
attempt to write a record with a duplicate or otherwise broken lsn, and
not follow such an lsn.

Follow-up #4739
sergepetrenko added a commit that referenced this issue Feb 28, 2020
is_orphan status check is needed by applier in order to tell relay
whether to send the instance's own rows back or not.

Prerequisite #4739
sergepetrenko added a commit that referenced this issue Feb 28, 2020
There is an assertion in vclock_follow `lsn > prev_lsn`, which doesn't
fire in release builds, of course. Let's at least warn the user on an
attempt to write a record with a duplicate or otherwise broken lsn, and
not follow such an lsn.

Follow-up #4739
sergepetrenko added a commit that referenced this issue Feb 28, 2020
Add a filter for relay to skip rows coming from unwanted instances.
A list of instance ids whose rows replica doesn't want to fetch is encoded
together with SUBSCRIBE request after a freshly introduced flag IPROTO_ID_FILTER.

Filtering rows is needed to prevent an instance from fetching its own
rows from a remote master, which is useful on initial configuration and
harmful on resubscribe.

Prerequisite #4739, #3294

@TarantoolBot document

Title: document new binary protocol key and subscribe request changes

Add key `IPROTO_ID_FILTER = 0x51` to the internals reference.
This is an optional key used in SUBSCRIBE request followed by an array
of ids of instances whose rows won't be relayed to the replica.

SUBSCRIBE request is supplemented with an optional field of the
following structure:
```
+====================+
|      ID_FILTER     |
|   0x51 : ID LIST   |
| MP_INT : MP_ARRAY  |
|                    |
+====================+
```
The field is encoded only when the id list is not empty.
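
For illustration, a small self-contained sketch (not the actual encoder; it hand-rolls only the MessagePack fixint/fixarray cases) of how the optional ID_FILTER field could be appended to a SUBSCRIBE body:

```c
#include <stdint.h>
#include <stdio.h>

enum { IPROTO_ID_FILTER = 0x51 };

/*
 * Encode "0x51: [id, id, ...]". Replica ids and the key fit in MessagePack
 * positive fixints (< 0x80) and the list in a fixarray (< 16 elements),
 * so raw byte emission is enough for this illustration. As stated above,
 * the field is encoded only when the id list is not empty.
 */
static size_t
encode_id_filter(uint8_t *buf, const uint8_t *ids, uint8_t n_ids)
{
	size_t p = 0;
	if (n_ids == 0)
		return 0;                /* omit the field entirely */
	buf[p++] = IPROTO_ID_FILTER;     /* MP_INT key (positive fixint) */
	buf[p++] = 0x90 | n_ids;         /* MP_ARRAY header (fixarray) */
	for (uint8_t i = 0; i < n_ids; i++)
		buf[p++] = ids[i];       /* each id as a positive fixint */
	return p;
}

int
main(void)
{
	/* A replica with id 1 asking the master not to relay its own rows. */
	uint8_t buf[18], self[] = { 1 };
	size_t len = encode_id_filter(buf, self, 1);
	for (size_t i = 0; i < len; i++)
		printf("%02x ", buf[i]); /* prints: 51 91 01 */
	printf("\n");
	return 0;
}
```
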
sergepetrenko added a commit that referenced this issue Feb 28, 2020
We have a mechanism for restoring rows originating from an instance that
suffered a sudden power loss: remote masters resend the instance's rows
received before a certain point in time, defined by the remote master's
vclock at the moment of subscribe.
However, this is useful only on initial replication configuration, when
an instance has just recovered, so that it can receive what it has
relayed but hasn't synced to disk.
In other cases, when an instance is operating normally and master-master
replication is configured, the mechanism described above may lead to the
instance re-applying its own rows, coming from a master it has just
subscribed to.
To fix the problem, do not relay rows coming from a remote instance if
the instance has already recovered.

Closes #4739
sergepetrenko added a commit that referenced this issue Feb 28, 2020
We have a mechanism for restoring rows originating from an instance that
suffered a sudden power loss: remote masters resend the instance's rows
received before a certain point in time, defined by the remote master's
vclock at the moment of subscribe.
However, this is useful only on initial replication configuration, when
an instance has just recovered, so that it can receive what it has
relayed but hasn't synced to disk.
In other cases, when an instance is operating normally and master-master
replication is configured, the mechanism described above may lead to the
instance re-applying its own rows, coming from a master it has just
subscribed to.
To fix the problem, do not relay rows coming from a remote instance if
the instance has already recovered.

Closes #4739
sergepetrenko added a commit that referenced this issue Feb 28, 2020
…g.replication

When checking whether a rejoin is needed, the replica loops through all the
instances in box.cfg.replication, which makes it believe that there is a
master holding the files it needs, since it accounts for itself just like
all the other instances.
So make the replica skip itself when looking for an instance which holds
the files it needs, and when determining whether rebootstrap is needed.

We already have a working test for the issue; it missed the issue due to
replica.lua settings. Fix replica.lua to include the instance itself in
box.cfg.replication.

Closes #4739
Gerold103 pushed a commit that referenced this issue Feb 28, 2020
There is an assertion in vclock_follow `lsn > prev_lsn`, which doesn't
fire in release builds, of course. Let's at least warn the user on an
attempt to write a record with a duplicate or otherwise broken lsn, and
not follow such an lsn.

Follow-up #4739
Gerold103 pushed a commit that referenced this issue Feb 28, 2020
Add a filter for relay to skip rows coming from unwanted instances.
A list of instance ids whose rows replica doesn't want to fetch is encoded
together with SUBSCRIBE request after a freshly introduced flag IPROTO_ID_FILTER.

Filtering rows is needed to prevent an instance from fetching its own
rows from a remote master, which is useful on initial configuration and
harmful on resubscribe.

Prerequisite #4739, #3294

@TarantoolBot document

Title: document new binary protocol key and subscribe request changes

Add key `IPROTO_ID_FILTER = 0x51` to the internals reference.
This is an optional key used in SUBSCRIBE request followed by an array
of ids of instances whose rows won't be relayed to the replica.

SUBSCRIBE request is supplemented with an optional field of the
following structure:
```
+====================+
|      ID_FILTER     |
|   0x51 : ID LIST   |
| MP_INT : MP_ARRAY  |
|                    |
+====================+
```
The field is encoded only when the id list is not empty.
Gerold103 pushed a commit that referenced this issue Feb 28, 2020
We have a mechanism for restoring rows originating from an instance that
suffered a sudden power loss: remote masters resend the instance's rows
received before a certain point in time, defined by the remote master's
vclock at the moment of subscribe.
However, this is useful only on initial replication configuration, when
an instance has just recovered, so that it can receive what it has
relayed but hasn't synced to disk.
In other cases, when an instance is operating normally and master-master
replication is configured, the mechanism described above may lead to the
instance re-applying its own rows, coming from a master it has just
subscribed to.
To fix the problem, do not relay rows coming from a remote instance if
the instance has already recovered.

Closes #4739
sergepetrenko added a commit that referenced this issue Feb 29, 2020
is_orphan status check is needed by applier in order to tell relay
whether to send the instance's own rows back or not.

Prerequisite #4739
sergepetrenko added a commit that referenced this issue Feb 29, 2020
There is an assertion in vclock_follow `lsn > prev_lsn`, which doesn't
fire in release builds, of course. Let's at least warn the user on an
attempt to write a record with a duplicate or otherwise broken lsn, and
not follow such an lsn.

Follow-up #4739
sergepetrenko added a commit that referenced this issue Feb 29, 2020
Add a filter for relay to skip rows coming from unwanted instances.
A list of instance ids whose rows replica doesn't want to fetch is encoded
together with SUBSCRIBE request after a freshly introduced flag IPROTO_ID_FILTER.

Filtering rows is needed to prevent an instance from fetching its own
rows from a remote master, which is useful on initial configuration and
harmful on resubscribe.

Prerequisite #4739, #3294

@TarantoolBot document

Title: document new binary protocol key and subscribe request changes

Add key `IPROTO_ID_FILTER = 0x51` to the internals reference.
This is an optional key used in SUBSCRIBE request followed by an array
of ids of instances whose rows won't be relayed to the replica.

SUBSCRIBE request is supplemented with an optional field of the
following structure:
```
+====================+
|      ID_FILTER     |
|   0x51 : ID LIST   |
| MP_INT : MP_ARRRAY |
|                    |
+====================+
```
The field is encoded only when the id list is not empty.
sergepetrenko added a commit that referenced this issue Feb 29, 2020
We have a mechanism for restoring rows originating from an instance that
suffered a sudden power loss: remote masters resend the instance's rows
received before a certain point in time, defined by the remote master's
vclock at the moment of subscribe.
However, this is useful only on initial replication configuration, when
an instance has just recovered, so that it can receive what it has
relayed but hasn't synced to disk.
In other cases, when an instance is operating normally and master-master
replication is configured, the mechanism described above may lead to the
instance re-applying its own rows, coming from a master it has just
subscribed to.
To fix the problem, do not relay rows coming from a remote instance if
the instance has already recovered.

Closes #4739
kyukhin pushed a commit that referenced this issue Mar 2, 2020
is_orphan status check is needed by applier in order to tell relay
whether to send the instance's own rows back or not.

Prerequisite #4739
kyukhin pushed a commit that referenced this issue Mar 2, 2020
There is an assertion in vclock_follow `lsn > prev_lsn`, which doesn't
fire in release builds, of course. Let's at least warn the user on an
attempt to write a record with a duplicate or otherwise broken lsn, and
not follow such an lsn.

Follow-up #4739
kyukhin pushed a commit that referenced this issue Mar 2, 2020
Add a filter for relay to skip rows coming from unwanted instances.
A list of instance ids whose rows replica doesn't want to fetch is encoded
together with SUBSCRIBE request after a freshly introduced flag IPROTO_ID_FILTER.

Filtering rows is needed to prevent an instance from fetching its own
rows from a remote master, which is useful on initial configuration and
harmful on resubscribe.

Prerequisite #4739, #3294

@TarantoolBot document

Title: document new binary protocol key and subscribe request changes

Add key `IPROTO_ID_FILTER = 0x51` to the internals reference.
This is an optional key used in SUBSCRIBE request followed by an array
of ids of instances whose rows won't be relayed to the replica.

SUBSCRIBE request is supplemented with an optional field of the
following structure:
```
+====================+
|      ID_FILTER     |
|   0x51 : ID LIST   |
| MP_INT : MP_ARRAY  |
|                    |
+====================+
```
The field is encoded only when the id list is not empty.
kyukhin closed this as completed in ed2e143 on Mar 2, 2020
kyukhin pushed a commit that referenced this issue Mar 2, 2020
is_orphan status check is needed by applier in order to tell relay
whether to send the instance's own rows back or not.

Prerequisite #4739

(cherry picked from commit 7b83b73)
kyukhin pushed a commit that referenced this issue Mar 2, 2020
There is an assertion in vclock_follow `lsn > prev_lsn`, which doesn't
fire in release builds, of course. Let's at least warn the user on an
attempt to write a record with a duplicate or otherwise broken lsn, and
not follow such an lsn.

Follow-up #4739

(cherry picked from commit e075026)
kyukhin pushed a commit that referenced this issue Mar 2, 2020
Add a filter for relay to skip rows coming from unwanted instances.
A list of instance ids whose rows replica doesn't want to fetch is encoded
together with SUBSCRIBE request after a freshly introduced flag IPROTO_ID_FILTER.

Filtering rows is needed to prevent an instance from fetching its own
rows from a remote master, which is useful on initial configuration and
harmful on resubscribe.

Prerequisite #4739, #3294

@TarantoolBot document

Title: document new binary protocol key and subscribe request changes

Add key `IPROTO_ID_FILTER = 0x51` to the internals reference.
This is an optional key used in SUBSCRIBE request followed by an array
of ids of instances whose rows won't be relayed to the replica.

SUBSCRIBE request is supplemented with an optional field of the
following structure:
```
+====================+
|      ID_FILTER     |
|   0x51 : ID LIST   |
| MP_INT : MP_ARRAY  |
|                    |
+====================+
```
The field is encoded only when the id list is not empty.

(cherry picked from commit 45de990)
kyukhin pushed a commit that referenced this issue Mar 2, 2020
We have a mechanism for restoring rows originating from an instance that
suffered a sudden power loss: remote masters resend the instance's rows
received before a certain point in time, defined by the remote master's
vclock at the moment of subscribe.
However, this is useful only on initial replication configuration, when
an instance has just recovered, so that it can receive what it has
relayed but hasn't synced to disk.
In other cases, when an instance is operating normally and master-master
replication is configured, the mechanism described above may lead to the
instance re-applying its own rows, coming from a master it has just
subscribed to.
To fix the problem, do not relay rows coming from a remote instance if
the instance has already recovered.

Closes #4739

(cherry picked from commit ed2e143)
kyukhin pushed a commit that referenced this issue Mar 2, 2020
is_orphan status check is needed by applier in order to tell relay
whether to send the instance's own rows back or not.

Prerequisite #4739

(cherry picked from commit 7b83b73)
kyukhin pushed a commit that referenced this issue Mar 2, 2020
There is an assertion in vclock_follow `lsn > prev_lsn`, which doesn't
fire in release builds, of course. Let's at least warn the user on an
attempt to write a record with a duplicate or otherwise broken lsn, and
not follow such an lsn.

Follow-up #4739

(cherry picked from commit e075026)
kyukhin pushed a commit that referenced this issue Mar 2, 2020
Add a filter for relay to skip rows coming from unwanted instances.
A list of instance ids whose rows replica doesn't want to fetch is encoded
together with SUBSCRIBE request after a freshly introduced flag IPROTO_ID_FILTER.

Filtering rows is needed to prevent an instance from fetching its own
rows from a remote master, which is useful on initial configuration and
harmful on resubscribe.

Prerequisite #4739, #3294

@TarantoolBot document

Title: document new binary protocol key and subscribe request changes

Add key `IPROTO_ID_FILTER = 0x51` to the internals reference.
This is an optional key used in SUBSCRIBE request followed by an array
of ids of instances whose rows won't be relayed to the replica.

SUBSCRIBE request is supplemented with an optional field of the
following structure:
```
+====================+
|      ID_FILTER     |
|   0x51 : ID LIST   |
| MP_INT : MP_ARRAY  |
|                    |
+====================+
```
The field is encoded only when the id list is not empty.

(cherry picked from commit 45de990)
kyukhin pushed a commit that referenced this issue Mar 2, 2020
We have a mechanism for restoring rows originating from an instance that
suffered a sudden power loss: remote masters resend the instance's rows
received before a certain point in time, defined by the remote master's
vclock at the moment of subscribe.
However, this is useful only on initial replication configuration, when
an instance has just recovered, so that it can receive what it has
relayed but hasn't synced to disk.
In other cases, when an instance is operating normally and master-master
replication is configured, the mechanism described above may lead to the
instance re-applying its own rows, coming from a master it has just
subscribed to.
To fix the problem, do not relay rows coming from a remote instance if
the instance has already recovered.

Closes #4739

(cherry picked from commit ed2e143)
kyukhin modified the milestones: 1.10.6, 2.2.3 on Mar 2, 2020