test: gc.test.lua test fails on *.xlog files cleanup #4162

avtikhon · 2019-04-16T06:26:20Z

Tarantool version:
All

OS version:
All

Bug description:
gc.test.lua test needs default 60 secs timeout for wait_cond routine instead of local 10 secs

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Steps to reproduce:
./test-run.py -j 50 --force replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua

Optional (but very desirable):

coredump
backtrace
netstat

gc.test.lua test cleaned up to use default 60 secs timeout for wait_cond routine instead of local 10 secs, because it doesn't need to have its own special value. Also the diagnostic messages added on wait_* routines fails. [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 [029] @@ -290,7 +290,12 @@ [029] ... [029] wait_xlog(1) or fio.listdir('./master') [029] --- [048] replication/gc.test.lua vinyl [ fail ] [048] [048] Test failed! Result content mismatch: [029] -- true [029] +- - 00000000000000000305.vylog [029] + - 00000000000000000305.xlog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000310.vylog [029] + - 00000000000000000310.snap [029] ... [029] -- Stop the replica. [029] test_run:cmd("stop server replica") [029] @@ -326,7 +331,13 @@ [029] ... [029] wait_xlog(2) or fio.listdir('./master') [029] --- [029] -- true [029] +- - 00000000000000000305.xlog [029] + - 00000000000000000316.xlog [029] + - 00000000000000000316.vylog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000317.vylog [029] + - 00000000000000000317.snap [029] ... [029] -- The xlog should only be deleted after the replica [029] -- is unregistered. [029] Close #4162

locker · 2019-04-16T17:19:58Z

It might be a dup of #4034, which I reopened recently.

Imitate the replica crash and, then, wake up. Just 'stop server replica' (SIGTERM) is not sufficient to stop a tarantool instance when ERRINJ_WAL_DELAY is set, because "tarantool" thread wait for paused "wal" thread infinitely. Changed server stop routine to to kill routine to be able to use SIGKILL instead of SIGTERM to the replica server. In this way the server replica will be killed immediately and *.xlog files will be removed as it has to be. [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 [029] @@ -290,7 +290,12 @@ [029] ... [029] wait_xlog(1) or fio.listdir('./master') [029] --- [048] replication/gc.test.lua vinyl [ fail ] [048] [048] Test failed! Result content mismatch: [029] -- true [029] +- - 00000000000000000305.vylog [029] + - 00000000000000000305.xlog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000310.vylog [029] + - 00000000000000000310.snap [029] ... [029] -- Stop the replica. [029] test_run:cmd("stop server replica") [029] @@ -326,7 +331,13 @@ [029] ... [029] wait_xlog(2) or fio.listdir('./master') [029] --- [029] -- true [029] +- - 00000000000000000305.xlog [029] + - 00000000000000000316.xlog [029] + - 00000000000000000316.vylog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000317.vylog [029] + - 00000000000000000317.snap [029] ... [029] -- The xlog should only be deleted after the replica [029] -- is unregistered. [029] Close #4162

Used the signal option set to SIGKILL to stop server replica routine to be able to stop the replica imediately to imitate the replica crash and, then, wake up. Just 'stop server replica' (SIGTERM) is not sufficient to stop a tarantool instance when ERRINJ_WAL_DELAY is set, because "tarantool" thread wait for paused "wal" thread infinitely. Changed server stop routine to to kill routine to be able to use SIGKILL instead of SIGTERM to the replica server. In this way the server replica will be killed immediately and *.xlog files will be removed as it has to be. [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 [029] @@ -290,7 +290,12 @@ [029] ... [029] wait_xlog(1) or fio.listdir('./master') [029] --- [048] replication/gc.test.lua vinyl [ fail ] [048] [048] Test failed! Result content mismatch: [029] -- true [029] +- - 00000000000000000305.vylog [029] + - 00000000000000000305.xlog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000310.vylog [029] + - 00000000000000000310.snap [029] ... [029] -- Stop the replica. [029] test_run:cmd("stop server replica") [029] @@ -326,7 +331,13 @@ [029] ... [029] wait_xlog(2) or fio.listdir('./master') [029] --- [029] -- true [029] +- - 00000000000000000305.xlog [029] + - 00000000000000000316.xlog [029] + - 00000000000000000316.vylog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000317.vylog [029] + - 00000000000000000317.snap [029] ... [029] -- The xlog should only be deleted after the replica [029] -- is unregistered. [029] Close #4162

Used the signal option set to SIGKILL to stop server replica routine to be able to stop the replica imediately to imitate the replica crash and, then, wake up. Just 'stop server replica' (SIGTERM) is not sufficient to stop a tarantool instance when ERRINJ_WAL_DELAY is set, because "tarantool" thread wait for paused "wal" thread infinitely. Changed server stop routine to to kill routine to be able to use SIGKILL instead of SIGTERM to the replica server. In this way the server replica will be killed immediately and *.xlog files will be removed as it has to be. The logic of the replication was tried to change, but met the new issues, so the suggested fix at commit: b5b4809 was reverted at commit: 766cd3e [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 [029] @@ -290,7 +290,12 @@ [029] ... [029] wait_xlog(1) or fio.listdir('./master') [029] --- [048] replication/gc.test.lua vinyl [ fail ] [048] [048] Test failed! Result content mismatch: [029] -- true [029] +- - 00000000000000000305.vylog [029] + - 00000000000000000305.xlog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000310.vylog [029] + - 00000000000000000310.snap [029] ... [029] -- Stop the replica. [029] test_run:cmd("stop server replica") [029] @@ -326,7 +331,13 @@ [029] ... [029] wait_xlog(2) or fio.listdir('./master') [029] --- [029] -- true [029] +- - 00000000000000000305.xlog [029] + - 00000000000000000316.xlog [029] + - 00000000000000000316.vylog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000317.vylog [029] + - 00000000000000000317.snap [029] ... [029] -- The xlog should only be deleted after the replica [029] -- is unregistered. [029] Close #4162

Used the signal option set to SIGKILL to stop server replica routine to be able to stop the replica immediately to imitate the replica crash and, then, wake up. The current case happened when we wanted to set ERRINJ_WAL_DELAY for a tarantool instance and then stop it. By default the SIGTERM was used and was not sufficient there, because the main thread stil waited for the stuck WAL thread after the signal. In that case the replica finished reading the *.xlog file, but master server didn't know about it and saved the previous *.xlog file for replica for its restart. When the signal was changed from default to 9 (SIGKILL) replica didn't have a chance to read all data from *.xlog file due to it was killed immediately. So after replica restart it removed the previous *.xlog file after its reading. The logic of the replication was tried to change, but met the new issues, so the suggested fix at commit: b5b4809 was reverted at commit: 766cd3e [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 [029] @@ -290,7 +290,12 @@ [029] ... [029] wait_xlog(1) or fio.listdir('./master') [029] --- [048] replication/gc.test.lua vinyl [ fail ] [048] [048] Test failed! Result content mismatch: [029] -- true [029] +- - 00000000000000000305.vylog [029] + - 00000000000000000305.xlog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000310.vylog [029] + - 00000000000000000310.snap [029] ... [029] -- Stop the replica. [029] test_run:cmd("stop server replica") [029] @@ -326,7 +331,13 @@ [029] ... [029] wait_xlog(2) or fio.listdir('./master') [029] --- [029] -- true [029] +- - 00000000000000000305.xlog [029] + - 00000000000000000316.xlog [029] + - 00000000000000000316.vylog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000317.vylog [029] + - 00000000000000000317.snap [029] ... [029] -- The xlog should only be deleted after the replica [029] -- is unregistered. [029] Close #4162 u

Used the signal option set to SIGKILL to stop server replica routine to be able to stop the replica immediately to imitate the replica crash and, then, wake up. The current case happened when we wanted to set ERRINJ_WAL_DELAY for a tarantool instance and then stop it. By default the SIGTERM was used and was not sufficient there, because the main thread stil waited for the stuck WAL thread after the signal. In that case the replica finished reading the *.xlog file, but master server didn't know about it and saved the previous *.xlog file for replica for its restart. When the signal was changed from default to 9 (SIGKILL) replica didn't have a chance to read all data from *.xlog file due to it was killed immediately. So after replica restart it removed the previous *.xlog file after its reading. The logic of the replication was tried to change, but met the new issues, so the suggested fix at commit: b5b4809 was reverted at commit: 766cd3e [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 [029] @@ -290,7 +290,12 @@ [029] ... [029] wait_xlog(1) or fio.listdir('./master') [029] --- [048] replication/gc.test.lua vinyl [ fail ] [048] [048] Test failed! Result content mismatch: [029] -- true [029] +- - 00000000000000000305.vylog [029] + - 00000000000000000305.xlog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000310.vylog [029] + - 00000000000000000310.snap [029] ... [029] -- Stop the replica. [029] test_run:cmd("stop server replica") [029] @@ -326,7 +331,13 @@ [029] ... [029] wait_xlog(2) or fio.listdir('./master') [029] --- [029] -- true [029] +- - 00000000000000000305.xlog [029] + - 00000000000000000316.xlog [029] + - 00000000000000000316.vylog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000317.vylog [029] + - 00000000000000000317.snap [029] ... [029] -- The xlog should only be deleted after the replica [029] -- is unregistered. [029] Close #4162

Added the signal option into 'stop server' command. How to use: | test_run:cmd('stop server foo with signal=KILL') The 'stop server foo' command without the option sends SIGTERM as before. This feature is intended to be used in a fix of #4162 ('test: gc.test.lua test fails on *.xlog files cleanup').

Added the signal option into 'stop server' command. How to use: | test_run:cmd('stop server foo with signal=KILL') The 'stop server foo' command without the option sends SIGTERM as before. This feature is intended to be used in a fix of #4162 ('test: gc.test.lua test fails on *.xlog files cleanup'). (cherry picked from commit 71f7ecf)

Two problems are fixed here. The first one is about correctness of the test case. The second is about flaky failures. About correctness. The test case contains the following lines: | test_run:cmd("switch replica") | -- Unblock the replica and break replication. | box.error.injection.set("ERRINJ_WAL_DELAY", false) | box.cfg{replication = {}} Usually rows are applied and the new vclock is sent to the master before replication will be disabled. So the master removes old xlog before the replica restart and the next case tests nothing. This commit uses the new test-run's ability to stop a tarantool instance with a custom signal and stops the replica with SIGKILL w/o dropping ERRINJ_WAL_DELAY. This change fixes the race between applying rows and disabling replication and so makes the test case correct. About flaky failures. They were look like so: | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 | [029] @@ -290,7 +290,12 @@ | [029] ... | [029] wait_xlog(1) or fio.listdir('./master') | [029] --- | [029] -- true | [029] +- - 00000000000000000305.vylog | [029] + - 00000000000000000305.xlog | [029] + - '512' | [029] + - 00000000000000000310.xlog | [029] + - 00000000000000000310.vylog | [029] + - 00000000000000000310.snap | [029] ... | [029] -- Stop the replica. | [029] test_run:cmd("stop server replica") | <...next cases could have induced mismathes too...> The reason of the fail is that a replica applied all rows from the old xlog, but didn't sent an ACK with a new vclock to a master, because the replication was disabled before that. The master stops relay and keeps the old xlog. When the replica starts again it subscribes with the vclock value that instructs a relay to open the new xlog. Tarantool can remove an old xlog just after a replica's ACK when observes that the xlog was fully read by all replicas. But tarantool does not remove xlogs when a replica is subscribed. This is not a big problem, because such 'stuck' xlog file will be removed with a next xlog removal. There was the attempt to fix this behaviour and remove old xlogs at subscribe, see the following commits: * b5b4809 ('replication: update replica gc state on subscribe'); * 766cd3e ('Revert "replication: update replica gc state on subscribe"'). Anyway, this commit fixes this flaky failures, because stops the replica before applying rows from the old xlog. So when the replica starts it continues reading from the old xlog and the xlog file will be removed when will be fully read. Closes #4162

Two problems are fixed here. The first one is about correctness of the test case. The second is about flaky failures. About correctness. The test case contains the following lines: | test_run:cmd("switch replica") | -- Unblock the replica and break replication. | box.error.injection.set("ERRINJ_WAL_DELAY", false) | box.cfg{replication = {}} Usually rows are applied and the new vclock is sent to the master before replication will be disabled. So the master removes old xlog before the replica restart and the next case tests nothing. This commit uses the new test-run's ability to stop a tarantool instance with a custom signal and stops the replica with SIGKILL w/o dropping ERRINJ_WAL_DELAY. This change fixes the race between applying rows and disabling replication and so makes the test case correct. About flaky failures. They were look like so: | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 | [029] @@ -290,7 +290,12 @@ | [029] ... | [029] wait_xlog(1) or fio.listdir('./master') | [029] --- | [029] -- true | [029] +- - 00000000000000000305.vylog | [029] + - 00000000000000000305.xlog | [029] + - '512' | [029] + - 00000000000000000310.xlog | [029] + - 00000000000000000310.vylog | [029] + - 00000000000000000310.snap | [029] ... | [029] -- Stop the replica. | [029] test_run:cmd("stop server replica") | <...next cases could have induced mismathes too...> The reason of the fail is that a replica applied all rows from the old xlog, but didn't sent an ACK with a new vclock to a master, because the replication was disabled before that. The master stops relay and keeps the old xlog. When the replica starts again it subscribes with the vclock value that instructs a relay to open the new xlog. Tarantool can remove an old xlog just after a replica's ACK when observes that the xlog was fully read by all replicas. But tarantool does not remove xlogs when a replica is subscribed. This is not a big problem, because such 'stuck' xlog file will be removed with a next xlog removal. There was the attempt to fix this behaviour and remove old xlogs at subscribe, see the following commits: * b5b4809 ('replication: update replica gc state on subscribe'); * 766cd3e ('Revert "replication: update replica gc state on subscribe"'). Anyway, this commit fixes this flaky failures, because stops the replica before applying rows from the old xlog. So when the replica starts it continues reading from the old xlog and the xlog file will be removed when will be fully read. Closes #4162 (cherry picked from commit 35b5095)

gc.test.lua test cleaned up to use default 60 secs timeout for wait_cond routine instead of local 10 secs, because it doesn't need to have its own special value. Also the diagnostic messages added on wait_* routines fails. [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 [029] @@ -290,7 +290,12 @@ [029] ... [029] wait_xlog(1) or fio.listdir('./master') [029] --- [048] replication/gc.test.lua vinyl [ fail ] [048] [048] Test failed! Result content mismatch: [029] -- true [029] +- - 00000000000000000305.vylog [029] + - 00000000000000000305.xlog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000310.vylog [029] + - 00000000000000000310.snap [029] ... [029] -- Stop the replica. [029] test_run:cmd("stop server replica") [029] @@ -326,7 +331,13 @@ [029] ... [029] wait_xlog(2) or fio.listdir('./master') [029] --- [029] -- true [029] +- - 00000000000000000305.xlog [029] + - 00000000000000000316.xlog [029] + - 00000000000000000316.vylog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000317.vylog [029] + - 00000000000000000317.snap [029] ... [029] -- The xlog should only be deleted after the replica [029] -- is unregistered. [029] Close #4162

…com/tarantool/tarantool into avtikhon/gh-4162-gc-default-timeout

gc.test.lua test cleaned up to use default 60 secs timeout for wait_cond routine instead of local 10 secs, because it doesn't need to have its own special value. Also the diagnostic messages added on wait_* routines fails. [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 [029] @@ -290,7 +290,12 @@ [029] ... [029] wait_xlog(1) or fio.listdir('./master') [029] --- [048] replication/gc.test.lua vinyl [ fail ] [048] [048] Test failed! Result content mismatch: [029] -- true [029] +- - 00000000000000000305.vylog [029] + - 00000000000000000305.xlog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000310.vylog [029] + - 00000000000000000310.snap [029] ... [029] -- Stop the replica. [029] test_run:cmd("stop server replica") [029] @@ -326,7 +331,13 @@ [029] ... [029] wait_xlog(2) or fio.listdir('./master') [029] --- [029] -- true [029] +- - 00000000000000000305.xlog [029] + - 00000000000000000316.xlog [029] + - 00000000000000000316.vylog [029] + - '512' [029] + - 00000000000000000310.xlog [029] + - 00000000000000000317.vylog [029] + - 00000000000000000317.snap [029] ... [029] -- The xlog should only be deleted after the replica [029] -- is unregistered. [029] Close #4162

avtikhon · 2019-05-08T14:17:50Z

The issues already resolved and the left clean up patch is not really needed - closing.

Two problems are fixed here. The first one is about correctness of the test case. The second is about flaky failures. About correctness. The test case contains the following lines: | test_run:cmd("switch replica") | -- Unblock the replica and break replication. | box.error.injection.set("ERRINJ_WAL_DELAY", false) | box.cfg{replication = {}} Usually rows are applied and the new vclock is sent to the master before replication will be disabled. So the master removes old xlog before the replica restart and the next case tests nothing. This commit uses the new test-run's ability to stop a tarantool instance with a custom signal and stops the replica with SIGKILL w/o dropping ERRINJ_WAL_DELAY. This change fixes the race between applying rows and disabling replication and so makes the test case correct. About flaky failures. They were look like so: | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 | [029] @@ -290,7 +290,12 @@ | [029] ... | [029] wait_xlog(1) or fio.listdir('./master') | [029] --- | [029] -- true | [029] +- - 00000000000000000305.vylog | [029] + - 00000000000000000305.xlog | [029] + - '512' | [029] + - 00000000000000000310.xlog | [029] + - 00000000000000000310.vylog | [029] + - 00000000000000000310.snap | [029] ... | [029] -- Stop the replica. | [029] test_run:cmd("stop server replica") | <...next cases could have induced mismathes too...> The reason of the fail is that a replica applied all rows from the old xlog, but didn't sent an ACK with a new vclock to a master, because the replication was disabled before that. The master stops relay and keeps the old xlog. When the replica starts again it subscribes with the vclock value that instructs a relay to open the new xlog. Tarantool can remove an old xlog just after a replica's ACK when observes that the xlog was fully read by all replicas. But tarantool does not remove xlogs when a replica is subscribed. This is not a big problem, because such 'stuck' xlog file will be removed with a next xlog removal. There was the attempt to fix this behaviour and remove old xlogs at subscribe, see the following commits: * b5b4809 ('replication: update replica gc state on subscribe'); * 766cd3e ('Revert "replication: update replica gc state on subscribe"'). Anyway, this commit fixes this flaky failures, because stops the replica before applying rows from the old xlog. So when the replica starts it continues reading from the old xlog and the xlog file will be removed when will be fully read. Closes #4162

Two problems are fixed here. The first one is about correctness of the test case. The second is about flaky failures. About correctness. The test case contains the following lines: | test_run:cmd("switch replica") | -- Unblock the replica and break replication. | box.error.injection.set("ERRINJ_WAL_DELAY", false) | box.cfg{replication = {}} Usually rows are applied and the new vclock is sent to the master before replication will be disabled. So the master removes old xlog before the replica restart and the next case tests nothing. This commit uses the new test-run's ability to stop a tarantool instance with a custom signal and stops the replica with SIGKILL w/o dropping ERRINJ_WAL_DELAY. This change fixes the race between applying rows and disabling replication and so makes the test case correct. About flaky failures. They were look like so: | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019 | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019 | [029] @@ -290,7 +290,12 @@ | [029] ... | [029] wait_xlog(1) or fio.listdir('./master') | [029] --- | [029] -- true | [029] +- - 00000000000000000305.vylog | [029] + - 00000000000000000305.xlog | [029] + - '512' | [029] + - 00000000000000000310.xlog | [029] + - 00000000000000000310.vylog | [029] + - 00000000000000000310.snap | [029] ... | [029] -- Stop the replica. | [029] test_run:cmd("stop server replica") | <...next cases could have induced mismathes too...> The reason of the fail is that a replica applied all rows from the old xlog, but didn't sent an ACK with a new vclock to a master, because the replication was disabled before that. The master stops relay and keeps the old xlog. When the replica starts again it subscribes with the vclock value that instructs a relay to open the new xlog. Tarantool can remove an old xlog just after a replica's ACK when observes that the xlog was fully read by all replicas. But tarantool does not remove xlogs when a replica is subscribed. This is not a big problem, because such 'stuck' xlog file will be removed with a next xlog removal. There was the attempt to fix this behaviour and remove old xlogs at subscribe, see the following commits: * b5b4809 ('replication: update replica gc state on subscribe'); * 766cd3e ('Revert "replication: update replica gc state on subscribe"'). Anyway, this commit fixes this flaky failures, because stops the replica before applying rows from the old xlog. So when the replica starts it continues reading from the old xlog and the xlog file will be removed when will be fully read. Closes #4162 (cherry picked from commit 35b5095)

avtikhon self-assigned this Apr 16, 2019

avtikhon added flaky test qa Issues related to tests or testing subsystem labels Apr 16, 2019

avtikhon changed the title ~~test: gc.test.lua test needs default timeout for wait_cond routine~~ test: gc.test.lua test fails on *.xlog files cleanup Apr 25, 2019

Totktonada mentioned this issue Apr 29, 2019

Add the signal option into stop server routine tarantool/test-run#164

Merged

locker mentioned this issue Apr 30, 2019

replication/gc fails sporadically #4034

Closed

Totktonada closed this as completed in 35b5095 May 2, 2019

avtikhon added a commit that referenced this issue May 6, 2019

Merge branch 'avtikhon/gh-4162-gc-default-timeout' of https://github.…

9fe4057

…com/tarantool/tarantool into avtikhon/gh-4162-gc-default-timeout

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: gc.test.lua test fails on *.xlog files cleanup #4162

test: gc.test.lua test fails on *.xlog files cleanup #4162

avtikhon commented Apr 16, 2019

locker commented Apr 16, 2019

avtikhon commented May 8, 2019

test: gc.test.lua test fails on *.xlog files cleanup #4162

test: gc.test.lua test fails on *.xlog files cleanup #4162

Comments

avtikhon commented Apr 16, 2019

locker commented Apr 16, 2019

avtikhon commented May 8, 2019