Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: gc.test.lua test fails on *.xlog files cleanup #4162

Closed
avtikhon opened this issue Apr 16, 2019 · 2 comments
Closed

test: gc.test.lua test fails on *.xlog files cleanup #4162

avtikhon opened this issue Apr 16, 2019 · 2 comments
Assignees
Labels
flaky test qa Issues related to tests or testing subsystem

Comments

@avtikhon
Copy link
Contributor

Tarantool version:
All

OS version:
All

Bug description:
gc.test.lua test needs default 60 secs timeout for wait_cond routine instead of local 10 secs

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Steps to reproduce:
./test-run.py -j 50 --force replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua replication/gc.test.lua

Optional (but very desirable):

  • coredump
  • backtrace
  • netstat
@avtikhon avtikhon self-assigned this Apr 16, 2019
@avtikhon avtikhon added flaky test qa Issues related to tests or testing subsystem labels Apr 16, 2019
avtikhon added a commit that referenced this issue Apr 16, 2019
gc.test.lua test cleaned up to use default 60 secs timeout for
wait_cond routine instead of local 10 secs, because it doesn't
need to have its own special value. Also the diagnostic messages
added on wait_* routines fails.

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Close #4162
avtikhon added a commit that referenced this issue Apr 16, 2019
gc.test.lua test cleaned up to use default 60 secs timeout for
wait_cond routine instead of local 10 secs, because it doesn't
need to have its own special value. Also the diagnostic messages
added on wait_* routines fails.

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Close #4162
@locker
Copy link
Member

locker commented Apr 16, 2019

It might be a dup of #4034, which I reopened recently.

avtikhon added a commit that referenced this issue Apr 25, 2019
Imitate the replica crash and, then, wake up.
Just 'stop server replica' (SIGTERM) is not sufficient to stop
a tarantool instance when ERRINJ_WAL_DELAY is set, because
"tarantool" thread wait for paused "wal" thread infinitely.
Changed server stop routine to to kill routine to be able
to use SIGKILL instead of SIGTERM to the replica server. In
this way the server replica will be killed immediately and
*.xlog files will be removed as it has to be.

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Close #4162
@avtikhon avtikhon changed the title test: gc.test.lua test needs default timeout for wait_cond routine test: gc.test.lua test fails on *.xlog files cleanup Apr 25, 2019
avtikhon added a commit that referenced this issue Apr 29, 2019
Used the signal option set to SIGKILL to stop server replica
routine to be able to stop the replica imediately to imitate
the replica crash and, then, wake up.
Just 'stop server replica' (SIGTERM) is not sufficient to stop
a tarantool instance when ERRINJ_WAL_DELAY is set, because
"tarantool" thread wait for paused "wal" thread infinitely.
Changed server stop routine to to kill routine to be able
to use SIGKILL instead of SIGTERM to the replica server. In
this way the server replica will be killed immediately and
*.xlog files will be removed as it has to be.

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Close #4162
avtikhon added a commit that referenced this issue Apr 30, 2019
Used the signal option set to SIGKILL to stop server replica
routine to be able to stop the replica imediately to imitate
the replica crash and, then, wake up.
Just 'stop server replica' (SIGTERM) is not sufficient to stop
a tarantool instance when ERRINJ_WAL_DELAY is set, because
"tarantool" thread wait for paused "wal" thread infinitely.
Changed server stop routine to to kill routine to be able
to use SIGKILL instead of SIGTERM to the replica server. In
this way the server replica will be killed immediately and
*.xlog files will be removed as it has to be.
The logic of the replication was tried to change, but met
the new issues, so the suggested fix at commit:
b5b4809
was reverted at commit:
766cd3e

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Close #4162
avtikhon added a commit that referenced this issue May 1, 2019
Used the signal option set to SIGKILL to stop server replica
routine to be able to stop the replica immediately to imitate
the replica crash and, then, wake up.
The current case happened when we wanted to set ERRINJ_WAL_DELAY
for a tarantool instance and then stop it. By default the
SIGTERM was used and was not sufficient there, because the
main thread stil waited for the stuck WAL thread after the
signal. In that case the replica finished reading the *.xlog
file, but master server didn't know about it and saved the
previous *.xlog file for replica for its restart. When the
signal was changed from default to 9 (SIGKILL) replica didn't
have a chance to read all data from *.xlog file due to it was
killed immediately. So after replica restart it removed the previous
*.xlog file after its reading.
The logic of the replication was tried to change, but met
the new issues, so the suggested fix at commit:
b5b4809
was reverted at commit:
766cd3e

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Close #4162

u
avtikhon added a commit that referenced this issue May 1, 2019
Used the signal option set to SIGKILL to stop server replica
routine to be able to stop the replica immediately to imitate
the replica crash and, then, wake up.
The current case happened when we wanted to set ERRINJ_WAL_DELAY
for a tarantool instance and then stop it. By default the
SIGTERM was used and was not sufficient there, because the
main thread stil waited for the stuck WAL thread after the
signal. In that case the replica finished reading the *.xlog
file, but master server didn't know about it and saved the
previous *.xlog file for replica for its restart. When the
signal was changed from default to 9 (SIGKILL) replica didn't
have a chance to read all data from *.xlog file due to it was
killed immediately. So after replica restart it removed the previous
*.xlog file after its reading.
The logic of the replication was tried to change, but met
the new issues, so the suggested fix at commit:
b5b4809
was reverted at commit:
766cd3e

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Close #4162
Totktonada added a commit that referenced this issue May 2, 2019
Added the signal option into 'stop server' command.

How to use:

 | test_run:cmd('stop server foo with signal=KILL')

The 'stop server foo' command without the option sends SIGTERM as
before.

This feature is intended to be used in a fix of #4162 ('test:
gc.test.lua test fails on *.xlog files cleanup').
Totktonada added a commit that referenced this issue May 2, 2019
Added the signal option into 'stop server' command.

How to use:

 | test_run:cmd('stop server foo with signal=KILL')

The 'stop server foo' command without the option sends SIGTERM as
before.

This feature is intended to be used in a fix of #4162 ('test:
gc.test.lua test fails on *.xlog files cleanup').

(cherry picked from commit 71f7ecf)
Totktonada added a commit that referenced this issue May 2, 2019
Added the signal option into 'stop server' command.

How to use:

 | test_run:cmd('stop server foo with signal=KILL')

The 'stop server foo' command without the option sends SIGTERM as
before.

This feature is intended to be used in a fix of #4162 ('test:
gc.test.lua test fails on *.xlog files cleanup').

(cherry picked from commit 71f7ecf)
Totktonada pushed a commit that referenced this issue May 2, 2019
Two problems are fixed here. The first one is about correctness of the
test case. The second is about flaky failures.

About correctness. The test case contains the following lines:

 | test_run:cmd("switch replica")
 | -- Unblock the replica and break replication.
 | box.error.injection.set("ERRINJ_WAL_DELAY", false)
 | box.cfg{replication = {}}

Usually rows are applied and the new vclock is sent to the master before
replication will be disabled. So the master removes old xlog before the
replica restart and the next case tests nothing.

This commit uses the new test-run's ability to stop a tarantool instance
with a custom signal and stops the replica with SIGKILL w/o dropping
ERRINJ_WAL_DELAY. This change fixes the race between applying rows and
disabling replication and so makes the test case correct.

About flaky failures. They were look like so:

 | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
 | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
 | [029] @@ -290,7 +290,12 @@
 | [029] ...
 | [029] wait_xlog(1) or fio.listdir('./master')
 | [029] ---
 | [029] -- true
 | [029] +- - 00000000000000000305.vylog
 | [029] + - 00000000000000000305.xlog
 | [029] + - '512'
 | [029] + - 00000000000000000310.xlog
 | [029] + - 00000000000000000310.vylog
 | [029] + - 00000000000000000310.snap
 | [029] ...
 | [029] -- Stop the replica.
 | [029] test_run:cmd("stop server replica")
 | <...next cases could have induced mismathes too...>

The reason of the fail is that a replica applied all rows from the old
xlog, but didn't sent an ACK with a new vclock to a master, because the
replication was disabled before that. The master stops relay and keeps
the old xlog. When the replica starts again it subscribes with the
vclock value that instructs a relay to open the new xlog.

Tarantool can remove an old xlog just after a replica's ACK when
observes that the xlog was fully read by all replicas. But tarantool
does not remove xlogs when a replica is subscribed. This is not a big
problem, because such 'stuck' xlog file will be removed with a next xlog
removal.

There was the attempt to fix this behaviour and remove old xlogs at
subscribe, see the following commits:

* b5b4809 ('replication: update replica
  gc state on subscribe');
* 766cd3e ('Revert "replication: update
  replica gc state on subscribe"').

Anyway, this commit fixes this flaky failures, because stops the replica
before applying rows from the old xlog. So when the replica starts it
continues reading from the old xlog and the xlog file will be removed
when will be fully read.

Closes #4162
Totktonada pushed a commit that referenced this issue May 2, 2019
Two problems are fixed here. The first one is about correctness of the
test case. The second is about flaky failures.

About correctness. The test case contains the following lines:

 | test_run:cmd("switch replica")
 | -- Unblock the replica and break replication.
 | box.error.injection.set("ERRINJ_WAL_DELAY", false)
 | box.cfg{replication = {}}

Usually rows are applied and the new vclock is sent to the master before
replication will be disabled. So the master removes old xlog before the
replica restart and the next case tests nothing.

This commit uses the new test-run's ability to stop a tarantool instance
with a custom signal and stops the replica with SIGKILL w/o dropping
ERRINJ_WAL_DELAY. This change fixes the race between applying rows and
disabling replication and so makes the test case correct.

About flaky failures. They were look like so:

 | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
 | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
 | [029] @@ -290,7 +290,12 @@
 | [029] ...
 | [029] wait_xlog(1) or fio.listdir('./master')
 | [029] ---
 | [029] -- true
 | [029] +- - 00000000000000000305.vylog
 | [029] + - 00000000000000000305.xlog
 | [029] + - '512'
 | [029] + - 00000000000000000310.xlog
 | [029] + - 00000000000000000310.vylog
 | [029] + - 00000000000000000310.snap
 | [029] ...
 | [029] -- Stop the replica.
 | [029] test_run:cmd("stop server replica")
 | <...next cases could have induced mismathes too...>

The reason of the fail is that a replica applied all rows from the old
xlog, but didn't sent an ACK with a new vclock to a master, because the
replication was disabled before that. The master stops relay and keeps
the old xlog. When the replica starts again it subscribes with the
vclock value that instructs a relay to open the new xlog.

Tarantool can remove an old xlog just after a replica's ACK when
observes that the xlog was fully read by all replicas. But tarantool
does not remove xlogs when a replica is subscribed. This is not a big
problem, because such 'stuck' xlog file will be removed with a next xlog
removal.

There was the attempt to fix this behaviour and remove old xlogs at
subscribe, see the following commits:

* b5b4809 ('replication: update replica
  gc state on subscribe');
* 766cd3e ('Revert "replication: update
  replica gc state on subscribe"').

Anyway, this commit fixes this flaky failures, because stops the replica
before applying rows from the old xlog. So when the replica starts it
continues reading from the old xlog and the xlog file will be removed
when will be fully read.

Closes #4162

(cherry picked from commit 35b5095)
avtikhon added a commit that referenced this issue May 6, 2019
gc.test.lua test cleaned up to use default 60 secs timeout for
wait_cond routine instead of local 10 secs, because it doesn't
need to have its own special value. Also the diagnostic messages
added on wait_* routines fails.

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Close #4162
avtikhon added a commit that referenced this issue May 7, 2019
gc.test.lua test cleaned up to use default 60 secs timeout for
wait_cond routine instead of local 10 secs, because it doesn't
need to have its own special value. Also the diagnostic messages
added on wait_* routines fails.

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Close #4162
avtikhon added a commit that referenced this issue May 7, 2019
gc.test.lua test cleaned up to use default 60 secs timeout for
wait_cond routine instead of local 10 secs, because it doesn't
need to have its own special value. Also the diagnostic messages
added on wait_* routines fails.

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Close #4162
avtikhon added a commit that referenced this issue May 7, 2019
gc.test.lua test cleaned up to use default 60 secs timeout for
wait_cond routine instead of local 10 secs, because it doesn't
need to have its own special value. Also the diagnostic messages
added on wait_* routines fails.

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Close #4162
avtikhon added a commit that referenced this issue May 7, 2019
gc.test.lua test cleaned up to use default 60 secs timeout for
wait_cond routine instead of local 10 secs, because it doesn't
need to have its own special value. Also the diagnostic messages
added on wait_* routines fails.

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Close #4162
avtikhon added a commit that referenced this issue May 7, 2019
gc.test.lua test cleaned up to use default 60 secs timeout for
wait_cond routine instead of local 10 secs, because it doesn't
need to have its own special value. Also the diagnostic messages
added on wait_* routines fails.

[029] --- replication/gc.result Mon Apr 15 14:58:09 2019
[029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
[029] @@ -290,7 +290,12 @@
[029] ...
[029] wait_xlog(1) or fio.listdir('./master')
[029] ---
[048] replication/gc.test.lua vinyl [ fail ]
[048]
[048] Test failed! Result content mismatch:
[029] -- true
[029] +- - 00000000000000000305.vylog
[029] + - 00000000000000000305.xlog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000310.vylog
[029] + - 00000000000000000310.snap
[029] ...
[029] -- Stop the replica.
[029] test_run:cmd("stop server replica")
[029] @@ -326,7 +331,13 @@
[029] ...
[029] wait_xlog(2) or fio.listdir('./master')
[029] ---
[029] -- true
[029] +- - 00000000000000000305.xlog
[029] + - 00000000000000000316.xlog
[029] + - 00000000000000000316.vylog
[029] + - '512'
[029] + - 00000000000000000310.xlog
[029] + - 00000000000000000317.vylog
[029] + - 00000000000000000317.snap
[029] ...
[029] -- The xlog should only be deleted after the replica
[029] -- is unregistered.
[029]

Close #4162
@avtikhon
Copy link
Contributor Author

avtikhon commented May 8, 2019

The issues already resolved and the left clean up patch is not really needed - closing.

avtikhon added a commit that referenced this issue Sep 16, 2019
Two problems are fixed here. The first one is about correctness of the
test case. The second is about flaky failures.

About correctness. The test case contains the following lines:

 | test_run:cmd("switch replica")
 | -- Unblock the replica and break replication.
 | box.error.injection.set("ERRINJ_WAL_DELAY", false)
 | box.cfg{replication = {}}

Usually rows are applied and the new vclock is sent to the master before
replication will be disabled. So the master removes old xlog before the
replica restart and the next case tests nothing.

This commit uses the new test-run's ability to stop a tarantool instance
with a custom signal and stops the replica with SIGKILL w/o dropping
ERRINJ_WAL_DELAY. This change fixes the race between applying rows and
disabling replication and so makes the test case correct.

About flaky failures. They were look like so:

 | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
 | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
 | [029] @@ -290,7 +290,12 @@
 | [029] ...
 | [029] wait_xlog(1) or fio.listdir('./master')
 | [029] ---
 | [029] -- true
 | [029] +- - 00000000000000000305.vylog
 | [029] + - 00000000000000000305.xlog
 | [029] + - '512'
 | [029] + - 00000000000000000310.xlog
 | [029] + - 00000000000000000310.vylog
 | [029] + - 00000000000000000310.snap
 | [029] ...
 | [029] -- Stop the replica.
 | [029] test_run:cmd("stop server replica")
 | <...next cases could have induced mismathes too...>

The reason of the fail is that a replica applied all rows from the old
xlog, but didn't sent an ACK with a new vclock to a master, because the
replication was disabled before that. The master stops relay and keeps
the old xlog. When the replica starts again it subscribes with the
vclock value that instructs a relay to open the new xlog.

Tarantool can remove an old xlog just after a replica's ACK when
observes that the xlog was fully read by all replicas. But tarantool
does not remove xlogs when a replica is subscribed. This is not a big
problem, because such 'stuck' xlog file will be removed with a next xlog
removal.

There was the attempt to fix this behaviour and remove old xlogs at
subscribe, see the following commits:

* b5b4809 ('replication: update replica
  gc state on subscribe');
* 766cd3e ('Revert "replication: update
  replica gc state on subscribe"').

Anyway, this commit fixes this flaky failures, because stops the replica
before applying rows from the old xlog. So when the replica starts it
continues reading from the old xlog and the xlog file will be removed
when will be fully read.

Closes #4162
avtikhon added a commit that referenced this issue Oct 31, 2019
Two problems are fixed here. The first one is about correctness of the
test case. The second is about flaky failures.

About correctness. The test case contains the following lines:

 | test_run:cmd("switch replica")
 | -- Unblock the replica and break replication.
 | box.error.injection.set("ERRINJ_WAL_DELAY", false)
 | box.cfg{replication = {}}

Usually rows are applied and the new vclock is sent to the master before
replication will be disabled. So the master removes old xlog before the
replica restart and the next case tests nothing.

This commit uses the new test-run's ability to stop a tarantool instance
with a custom signal and stops the replica with SIGKILL w/o dropping
ERRINJ_WAL_DELAY. This change fixes the race between applying rows and
disabling replication and so makes the test case correct.

About flaky failures. They were look like so:

 | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
 | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
 | [029] @@ -290,7 +290,12 @@
 | [029] ...
 | [029] wait_xlog(1) or fio.listdir('./master')
 | [029] ---
 | [029] -- true
 | [029] +- - 00000000000000000305.vylog
 | [029] + - 00000000000000000305.xlog
 | [029] + - '512'
 | [029] + - 00000000000000000310.xlog
 | [029] + - 00000000000000000310.vylog
 | [029] + - 00000000000000000310.snap
 | [029] ...
 | [029] -- Stop the replica.
 | [029] test_run:cmd("stop server replica")
 | <...next cases could have induced mismathes too...>

The reason of the fail is that a replica applied all rows from the old
xlog, but didn't sent an ACK with a new vclock to a master, because the
replication was disabled before that. The master stops relay and keeps
the old xlog. When the replica starts again it subscribes with the
vclock value that instructs a relay to open the new xlog.

Tarantool can remove an old xlog just after a replica's ACK when
observes that the xlog was fully read by all replicas. But tarantool
does not remove xlogs when a replica is subscribed. This is not a big
problem, because such 'stuck' xlog file will be removed with a next xlog
removal.

There was the attempt to fix this behaviour and remove old xlogs at
subscribe, see the following commits:

* b5b4809 ('replication: update replica
  gc state on subscribe');
* 766cd3e ('Revert "replication: update
  replica gc state on subscribe"').

Anyway, this commit fixes this flaky failures, because stops the replica
before applying rows from the old xlog. So when the replica starts it
continues reading from the old xlog and the xlog file will be removed
when will be fully read.

Closes #4162

(cherry picked from commit 35b5095)
avtikhon added a commit that referenced this issue Nov 19, 2019
Two problems are fixed here. The first one is about correctness of the
test case. The second is about flaky failures.

About correctness. The test case contains the following lines:

 | test_run:cmd("switch replica")
 | -- Unblock the replica and break replication.
 | box.error.injection.set("ERRINJ_WAL_DELAY", false)
 | box.cfg{replication = {}}

Usually rows are applied and the new vclock is sent to the master before
replication will be disabled. So the master removes old xlog before the
replica restart and the next case tests nothing.

This commit uses the new test-run's ability to stop a tarantool instance
with a custom signal and stops the replica with SIGKILL w/o dropping
ERRINJ_WAL_DELAY. This change fixes the race between applying rows and
disabling replication and so makes the test case correct.

About flaky failures. They were look like so:

 | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
 | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
 | [029] @@ -290,7 +290,12 @@
 | [029] ...
 | [029] wait_xlog(1) or fio.listdir('./master')
 | [029] ---
 | [029] -- true
 | [029] +- - 00000000000000000305.vylog
 | [029] + - 00000000000000000305.xlog
 | [029] + - '512'
 | [029] + - 00000000000000000310.xlog
 | [029] + - 00000000000000000310.vylog
 | [029] + - 00000000000000000310.snap
 | [029] ...
 | [029] -- Stop the replica.
 | [029] test_run:cmd("stop server replica")
 | <...next cases could have induced mismathes too...>

The reason of the fail is that a replica applied all rows from the old
xlog, but didn't sent an ACK with a new vclock to a master, because the
replication was disabled before that. The master stops relay and keeps
the old xlog. When the replica starts again it subscribes with the
vclock value that instructs a relay to open the new xlog.

Tarantool can remove an old xlog just after a replica's ACK when
observes that the xlog was fully read by all replicas. But tarantool
does not remove xlogs when a replica is subscribed. This is not a big
problem, because such 'stuck' xlog file will be removed with a next xlog
removal.

There was the attempt to fix this behaviour and remove old xlogs at
subscribe, see the following commits:

* b5b4809 ('replication: update replica
  gc state on subscribe');
* 766cd3e ('Revert "replication: update
  replica gc state on subscribe"').

Anyway, this commit fixes this flaky failures, because stops the replica
before applying rows from the old xlog. So when the replica starts it
continues reading from the old xlog and the xlog file will be removed
when will be fully read.

Closes #4162

(cherry picked from commit 35b5095)
kyukhin pushed a commit that referenced this issue Nov 26, 2019
Two problems are fixed here. The first one is about correctness of the
test case. The second is about flaky failures.

About correctness. The test case contains the following lines:

 | test_run:cmd("switch replica")
 | -- Unblock the replica and break replication.
 | box.error.injection.set("ERRINJ_WAL_DELAY", false)
 | box.cfg{replication = {}}

Usually rows are applied and the new vclock is sent to the master before
replication will be disabled. So the master removes old xlog before the
replica restart and the next case tests nothing.

This commit uses the new test-run's ability to stop a tarantool instance
with a custom signal and stops the replica with SIGKILL w/o dropping
ERRINJ_WAL_DELAY. This change fixes the race between applying rows and
disabling replication and so makes the test case correct.

About flaky failures. They were look like so:

 | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
 | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
 | [029] @@ -290,7 +290,12 @@
 | [029] ...
 | [029] wait_xlog(1) or fio.listdir('./master')
 | [029] ---
 | [029] -- true
 | [029] +- - 00000000000000000305.vylog
 | [029] + - 00000000000000000305.xlog
 | [029] + - '512'
 | [029] + - 00000000000000000310.xlog
 | [029] + - 00000000000000000310.vylog
 | [029] + - 00000000000000000310.snap
 | [029] ...
 | [029] -- Stop the replica.
 | [029] test_run:cmd("stop server replica")
 | <...next cases could have induced mismathes too...>

The reason of the fail is that a replica applied all rows from the old
xlog, but didn't sent an ACK with a new vclock to a master, because the
replication was disabled before that. The master stops relay and keeps
the old xlog. When the replica starts again it subscribes with the
vclock value that instructs a relay to open the new xlog.

Tarantool can remove an old xlog just after a replica's ACK when
observes that the xlog was fully read by all replicas. But tarantool
does not remove xlogs when a replica is subscribed. This is not a big
problem, because such 'stuck' xlog file will be removed with a next xlog
removal.

There was the attempt to fix this behaviour and remove old xlogs at
subscribe, see the following commits:

* b5b4809 ('replication: update replica
  gc state on subscribe');
* 766cd3e ('Revert "replication: update
  replica gc state on subscribe"').

Anyway, this commit fixes this flaky failures, because stops the replica
before applying rows from the old xlog. So when the replica starts it
continues reading from the old xlog and the xlog file will be removed
when will be fully read.

Closes #4162

(cherry picked from commit 35b5095)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flaky test qa Issues related to tests or testing subsystem
Projects
None yet
Development

No branches or pull requests

2 participants