test: fix replication/gc flaky failures
Two problems are fixed here. The first one is about correctness of the
test case. The second is about flaky failures.

About correctness. The test case contains the following lines:

 | test_run:cmd("switch replica")
 | -- Unblock the replica and break replication.
 | box.error.injection.set("ERRINJ_WAL_DELAY", false)
 | box.cfg{replication = {}}

Usually the rows are applied and the new vclock is sent to the master
before replication is disabled. So the master removes the old xlog before
the replica restarts, and the next case tests nothing.

This commit uses test-run's new ability to stop a tarantool instance
with a custom signal and stops the replica with SIGKILL without dropping
ERRINJ_WAL_DELAY. This change fixes the race between applying rows and
disabling replication, and so makes the test case correct.

About flaky failures. They looked like this:

 | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
 | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
 | [029] @@ -290,7 +290,12 @@
 | [029] ...
 | [029] wait_xlog(1) or fio.listdir('./master')
 | [029] ---
 | [029] -- true
 | [029] +- - 00000000000000000305.vylog
 | [029] + - 00000000000000000305.xlog
 | [029] + - '512'
 | [029] + - 00000000000000000310.xlog
 | [029] + - 00000000000000000310.vylog
 | [029] + - 00000000000000000310.snap
 | [029] ...
 | [029] -- Stop the replica.
 | [029] test_run:cmd("stop server replica")
 | <...next cases could have induced mismatches too...>

The reason for the failure is that the replica applied all rows from the
old xlog but did not send an ACK with the new vclock to the master,
because replication was disabled before that. The master stops the relay
and keeps the old xlog. When the replica starts again, it subscribes with
a vclock value that instructs the relay to open the new xlog.

Tarantool can remove an old xlog right after a replica's ACK, when it
observes that the xlog was fully read by all replicas. But tarantool
does not remove xlogs when a replica subscribes. This is not a big
problem, because such a 'stuck' xlog file will be removed along with the
next xlog removal.

There was an attempt to fix this behaviour and remove old xlogs at
subscribe, see the following commits:

* b5b4809 ('replication: update replica
  gc state on subscribe');
* 766cd3e ('Revert "replication: update
  replica gc state on subscribe"').

Anyway, this commit fixes these flaky failures, because it stops the
replica before it applies rows from the old xlog. So when the replica
starts, it continues reading from the old xlog, and the xlog file is
removed once it has been fully read.
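
The retention rule and the two scenarios described in this message can be
sketched as a toy model in plain Lua. This is only an illustration under
simplifying assumptions (one replica, xlogs identified by a single LSN);
new_master, collect, on_ack and on_subscribe are hypothetical names, not
Tarantool functions.

```lua
-- Toy model of the xlog retention rule. This is NOT Tarantool code:
-- every name below is hypothetical and exists only for illustration.

-- An xlog is identified by the LSN it starts from; the master keeps the
-- list sorted in ascending order. `acked` is the last LSN confirmed by
-- the (single) replica.
local function new_master(xlog_names)
    local xlogs = {}
    for i, name in ipairs(xlog_names) do xlogs[i] = name end
    return {xlogs = xlogs, acked = xlogs[1]}
end

-- Drop every xlog the replica has fully read, i.e. whose successor starts
-- at or below the acknowledged LSN. The newest xlog always stays.
local function collect(master)
    while #master.xlogs > 1 and master.xlogs[2] <= master.acked do
        table.remove(master.xlogs, 1)
    end
end

-- An ACK advances the confirmed position and triggers collection.
local function on_ack(master, lsn)
    if lsn > master.acked then master.acked = lsn end
    collect(master)
end

-- A subscribe only tells the relay where to start reading; in this model
-- it neither updates `acked` nor triggers collection.
local function on_subscribe(master, lsn)
    master.relay_from = lsn
end

-- Flaky scenario: the replica applied rows up to 310 but never sent the
-- ACK, then came back and subscribed from 310.
local flaky = new_master({305, 310})
on_subscribe(flaky, 310)
local flaky_xlogs = #flaky.xlogs

-- Fixed scenario: the replica was killed before applying the rows, so it
-- subscribes from 305, re-reads the old xlog and then ACKs 310.
local fixed = new_master({305, 310})
on_subscribe(fixed, 305)
on_ack(fixed, 310)
local fixed_xlogs = #fixed.xlogs

print(flaky_xlogs, fixed_xlogs)
```

In this model the first scenario leaves two xlogs on the master, while the
second leaves one, which is what wait_xlog(1) expects.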

Closes #4162

(cherry picked from commit 35b5095)
avtikhon authored and Totktonada committed May 2, 2019
1 parent 5e04497 commit 8f954a7
Showing 2 changed files with 13 additions and 15 deletions.
16 changes: 7 additions & 9 deletions test/replication/gc.result
@@ -252,20 +252,18 @@ wait_xlog(2) or fio.listdir('./master')
 ---
 - true
 ...
-test_run:cmd("switch replica")
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("stop server replica with signal=KILL")
 ---
 - true
 ...
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
----
-- ok
-...
-box.cfg{replication = {}}
+test_run:cmd("start server replica")
 ---
+- true
 ...
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
 -- Wait for the replica to catch up.
 test_run:cmd("switch replica")
 ---
12 changes: 6 additions & 6 deletions test/replication/gc.test.lua
@@ -122,12 +122,12 @@ fiber.sleep(0.1) -- wait for master to relay data
 -- the old snapshot.
 wait_gc(1) or box.info.gc()
 wait_xlog(2) or fio.listdir('./master')
-test_run:cmd("switch replica")
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
-box.cfg{replication = {}}
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("stop server replica with signal=KILL")
+test_run:cmd("start server replica")
 -- Wait for the replica to catch up.
 test_run:cmd("switch replica")
 test_run:wait_cond(function() return box.space.test:count() == 310 end, 10)
