
test: flaky replication/replica_rejoin.test.lua test #222

Description

@avtikhon

Tarantool version:
Tarantool 2.3.2-24-g94d54e9
Target: Linux-x86_64-RelWithDebInfo
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=ON
Compiler: /usr/bin/cc /usr/bin/c++
C_FLAGS: -static-libstdc++ -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -fno-gnu89-inline -Wno-cast-function-type
CXX_FLAGS: -static-libstdc++ -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-invalid-offsetof -Wno-cast-function-type

OS version:
Linux

Bug description:

  1. https://gitlab.com/tarantool/tarantool/-/jobs/554861247

It is not possible to compare the result file by checksum, because the failure output embeds run-specific data (uuids, lsns, timings) in the error message; a sketch of a deterministic check follows the diffs below:

 [045] --- replication/replica_rejoin.result	Thu May  7 11:06:02 2020
 [045] +++ replication/replica_rejoin.reject	Fri May 15 17:30:14 2020
 [045] @@ -47,7 +47,39 @@
 [045]  ...
 [045]  box.info.replication[1].upstream.status == 'follow' or box.info
 [045]  ---
 [045] -- true
 [045] +- version: 2.3.2-24-g94d54e9
 [045] +  id: 2
 [045] +  ro: false
 [045] +  uuid: a22acccf-f442-47c6-9717-c2fe0ef1435f
 [045] +  package: Tarantool
 [045] +  cluster:
 [045] +    uuid: c706ab36-96b9-4ba7-a4e8-b10969106914
 [045] +  replication:
 [045] +    1:
 [045] +      id: 1
 [045] +      uuid: dda72cb7-1c86-49f7-b7e1-00b8e8478b23
 [045] +      lsn: 14
 [045] +      upstream:
 [045] +        peer: unix/:/tarantool/test/var/045_replication/master.socket-iproto
 [045] +        lag: 0.00059795379638672
 [045] +        status: disconnected
 [045] +        idle: 1.1095711551607
 [045] +        message: timed out
 [045] +        system_message: Connection timed out
 [045] +    2:
 [045] +      id: 2
 [045] +      uuid: a22acccf-f442-47c6-9717-c2fe0ef1435f
 [045] +      lsn: 0
 [045] +  signature: 14
 [045] +  status: running
 [045] +  vinyl: []
 [045] +  uptime: 4
 [045] +  lsn: 0
 [045] +  sql: []
 [045] +  gc: []
 [045] +  pid: 2430
 [045] +  memory: []
 [045] +  vclock: {1: 14}
 [045]  ...
 [045]  box.space.test:select()
 [045]  ---
[151] --- replication/replica_rejoin.result     Tue Sep 29 10:57:26 2020
[151] +++ replication/replica_rejoin.reject     Tue Sep 29 10:57:48 2020
[151] @@ -230,7 +230,12 @@
[151]      return box.info ~= nil and box.info.replication[1] ~= nil                   \
[151]  end)
[151]  ---
[151] -- true
[151] +- error: "builtin/box/load_cfg.lua:601: Please call box.cfg{} first\nstack traceback:\n\tbuiltin/box/load_cfg.lua:601:
[151] +    in function '__index'\n\t[string \"return test_run:wait_cond(function()         ...\"]:1:
[151] +    in function 'cond'\n\t/tmp/tnt/151_replication/test_run.lua:411: in function </tmp/tnt/151_replication/test_run.lua:404>\n\t[C]:
[151] +    in function 'pcall'\n\tbuiltin/box/console.lua:402: in function 'eval'\n\tbuiltin/box/console.lua:708:
[151] +    in function 'repl'\n\tbuiltin/box/console.lua:842: in function <builtin/box/console.lua:828>\n\t[C]:
[151] +    in function 'pcall'\n\tbuiltin/socket.lua:1081: in function <builtin/socket.lua:1079>"
[151]  ...
[151]  test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
[151]  ---
[151]
[125] replication/replica_rejoin.test.lua             vinyl           [ pass ]
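
The [151] hunk fails even though the check is already wrapped in wait_cond: on the restarted replica, merely reading box.info raises "Please call box.cfg{} first" until the instance is configured, and the traceback shows the error propagating out of wait_cond (the pcall in the traceback belongs to the console, not to the helper). A minimal sketch of a guard, my assumption rather than the committed fix:

-- Swallow the "Please call box.cfg{} first" error so that wait_cond
-- keeps polling until box is configured and the master shows up in
-- box.info.replication.
test_run:wait_cond(function()                                               \
    local ok, info = pcall(function() return box.info end)                  \
    return ok and info.replication[1] ~= nil                                \
end)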
[010] replication/replica_rejoin.test.lua             memtx           [ fail ]
[010]
[010] Test failed! Result content mismatch:
[010] --- replication/replica_rejoin.result     Wed Jun 30 18:12:15 2021
[010] +++ var/rejects/replication/replica_rejoin.reject Wed Jun 30 18:32:21 2021
[010] @@ -160,7 +160,7 @@
[010]  ...
[010]  box.space.test:replace{1}
[010]  ---
[010] -- [1]
[010] +- error: Can't modify data because this instance is in read-only mode.
[010]  ...
[010]  -- Cleanup.
[010]  --test_run:cmd("switch default")
[010]
[014] @@ -138,7 +138,8 @@
[014]  ...
[014]  box.info.replication[1].upstream.status == 'follow' or log.error(box.info)
[014]  ---
[014] -- true
[014] +- error: '[string "return box.info.replication[1].upstream.statu..."]:1: variable
[014] +    ''log'' is not declared'
[014]  ...
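
The [045] and [014] hunks come from one-shot checks of the form box.info.replication[1].upstream.status == 'follow' or ..., which race against the applier and, when they lose the race, dump run-specific data into the result file. A minimal sketch of a deterministic variant, using the same wait_upstream helper the test already calls elsewhere (the exact rewrite is my assumption, not the committed fix):

-- 'log' must be required on the instance that evaluates the check; the
-- [014] hunk fails precisely because 'log' was only declared on the
-- default instance.
log = require('log')

-- Poll until the upstream reaches 'follow' instead of sampling it once:
-- wait_upstream retries internally, so a transient 'disconnected' state
-- on a loaded CI host no longer leaks uuids/lsns into the output.
test_run:wait_upstream(1, {status = 'follow'})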

To reproduce, run on FreeBSD under VMware (see the attached artifacts.zip):

./test-run.py `for r in {1..40} ; do echo replication/replica_rejoin_failing.test.lua ; done` -j25 --force

cat replication/replica_rejoin_failing.test.lua

env = require('test_run')
test_run = env.new()
log = require('log')
engine = test_run:get_cfg('engine') 

test_run:cleanup_cluster()

--
-- tarantool/tarantool#461: check that a replica refetches the last checkpoint
-- in case it fell behind the master.
--
box.schema.user.grant('guest', 'replication')
_ = box.schema.space.create('test', {engine = engine})
_ = box.space.test:create_index('pk')
_ = box.space.test:insert{1}
_ = box.space.test:insert{2}
_ = box.space.test:insert{3}

-- Create a replica; it is started below, after the master restart.
test_run:cmd("create server replica with rpl_master=default, script='replication/replica_rejoin.lua'")

-- Restart the server to purge the replica from
-- the garbage collection state.
test_run:cmd("restart server default")
box.cfg{wal_cleanup_delay = 0}

-- Make some checkpoints to remove old xlogs.
checkpoint_count = box.cfg.checkpoint_count
box.cfg{checkpoint_count = 1}
_ = box.space.test:delete{1}
_ = box.space.test:insert{10}
box.snapshot()
_ = box.space.test:delete{2}
_ = box.space.test:insert{20}
box.snapshot()
_ = box.space.test:delete{3}
_ = box.space.test:insert{30}

-- Restart the replica. Since xlogs have been removed,
-- it is supposed to rejoin without changing id.
test_run:cmd("start server replica")
test_run:wait_downstream(2, {status = 'follow'})
box.info.replication[2].downstream.vclock ~= nil or log.error(box.info)
test_run:cmd("switch replica")
box.space.test:select()

--
-- tarantool/tarantool#3740: rebootstrap crashes if the master has rows originating
-- from the replica.
--

-- Bootstrap a new replica.
test_run:cmd("switch default")
test_run:cmd("stop server replica")
test_run:cmd("cleanup server replica")
test_run:cleanup_cluster()
box.space.test:truncate()
test_run:cmd("start server replica")
-- Subscribe the master to the replica.
replica_listen = test_run:cmd("eval replica 'return box.cfg.listen'")
replica_listen ~= nil
box.cfg{replication = replica_listen}
-- Unsubscribe the replica from the master.
test_run:cmd("switch replica")
test_run:wait_upstream(1, {status = 'follow'})
box.cfg{replication = ''}
-- Bump vclock on the master.
test_run:cmd("switch default")
box.space.test:replace{1}

-- Cleanup.
--test_run:cmd("switch default")
box.cfg{replication = ''}
test_run:cmd("stop server replica")
test_run:cmd("cleanup server replica")
test_run:cmd("delete server replica")
test_run:cleanup_cluster()
box.space.test:drop()
box.schema.user.revoke('guest', 'replication')

Found that the same issue can be reproduced with:

./test-run.py `for r in {1..100} ; do echo replication/replica_rejoin_short.test.lua ; done` -j70 --force

cat replication/replica_rejoin_short.test.lua

test_run = require('test_run').new()
engine = test_run:get_cfg('engine') 

box.schema.user.grant('guest', 'replication')
_ = box.schema.space.create('test', {engine = engine})
_ = box.space.test:create_index('pk')
_ = box.space.test:insert{1}

test_run:cmd("create server replica with rpl_master=default, script='replication/replica_rejoin.lua'")
test_run:cmd("start server replica")
test_run:wait_cond(function() return box.info.replication[2].downstream.vclock ~= nil end)
test_run:cmd("switch replica")
box.space.test:select()
-- The issue is triggered by the truncate call below.
box.space.test:truncate()

test_run:cmd("switch default")
test_run:cmd("stop server replica")
test_run:cmd("cleanup server replica")
test_run:cmd("delete server replica")
test_run:cleanup_cluster()

The root cause of the issue seems to be the following error (see the attached artifacts.zip and full_timelapsed.zip):

2021-07-10 16:48:44.719 [5613] main/112/applier/unix/:/home/vagrant/tarantool/test/var/002_replication/master.iproto box.cc:2797 E> ER_CFG: Incorrect value for option 'replication': duplicate connection with the same replica UUID

The same error was also seen in the flaky-test issue #12.
The issue is the same as tarantool/tarantool#4668, and a fix will be provided within tarantool/tarantool#4669.
The solution suggested in tarantool/tarantool#4668 (comment) is to tune the replication_*timeout values, as sketched below.
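
A minimal sketch of that tuning, with illustrative values (my assumption, not the actual patch): the heartbeat timeout has to be generous enough that a loaded host does not make the applier drop and re-establish its connection while the old one is still registered on the master, which is exactly what raises the duplicate-UUID error.

-- Illustrative values only; tests often lower replication_timeout to
-- ~0.1 s, which is too tight under CI load.
box.cfg{
    replication_timeout = 1.0,
    replication_connect_timeout = 30.0,
}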

