test: box/net.box.test.lua flaky fails #4273
To be sure, I bisected the history and additionally checked that the following commit is the cause of the failure:
A bit different error:
One more:
Iproto already listens for requests during recovery, so yielding at this point allows early requests that arrived during recovery to be processed while the data is still in an unfinished state. This caused box/net.box test failures, and is potentially harmful. Besides, there is no need to yield during recovery. Closes #4273
The code causing the failure first appeared in 2.2 and was never backported to 2.1. Moving to 2.2.
The test used to fail occasionally with the following error:

```
[001] box/net.box.test.lua [ fail ]
[001]
[001] Test failed! Result content mismatch:
[001] --- box/net.box.result Thu Jun 13 06:16:34 2019
[001] +++ box/net.box.reject Fri Jun 14 04:50:55 2019
[001] @@ -3774,23 +3774,23 @@
[001]  ...
[001]  test_run:grep_log('default', 'Got a corrupted row.*')
[001]  ---
[001] -- 'Got a corrupted row:'
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000000:.*')
[001]  ---
[001] -- '00000000: A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 60 5F 20 3F '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000010:.*')
[001]  ---
[001] -- '00000010: D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000020:.*')
[001]  ---
[001] -- '00000020: 60 5F 20 3F D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000030:.*')
[001]  ---
[001] -- '00000030: A1 53 8D 53 60 5F 20 3F D8 E2 D6 E2 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000040:.*')
[001]  ---
```

This happened because we used `grep_log` right after `socket:write`, which should cause the expected log messages. Change to `wait_log`.

Follow-up #4273
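The difference between a one-shot log check and a polling one can be sketched in Python, the language of the test-run harness. This is a simplified illustration, not test-run's actual implementation: the `grep_log` and `wait_log` functions below are hypothetical stand-ins that take a file path, and the timeout, delay and `demo` helper are assumptions made for the example.

```python
import os
import re
import tempfile
import threading
import time


def grep_log(path, pattern):
    """One-shot check: return the last match of pattern in the file, or None."""
    with open(path, "r") as f:
        matches = re.findall(pattern, f.read())
    return matches[-1] if matches else None


def wait_log(path, pattern, timeout=5.0, delay=0.05):
    """Poll grep_log() until the pattern shows up or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        found = grep_log(path, pattern)
        if found is not None or time.monotonic() >= deadline:
            return found
        time.sleep(delay)


def demo():
    """Simulate a server that writes its log line slightly after the request."""
    fd, path = tempfile.mkstemp(suffix=".log")
    os.close(fd)

    def late_writer():
        # Stand-in for the server logging some time after socket:write().
        time.sleep(0.2)
        with open(path, "a") as f:
            f.write("Got a corrupted row:\n")

    writer = threading.Thread(target=late_writer)
    writer.start()
    immediate = grep_log(path, r"Got a corrupted row.*")  # too early
    waited = wait_log(path, r"Got a corrupted row.*")     # polls until present
    writer.join()
    os.remove(path)
    return immediate, waited
```

Calling `demo()` shows the race in miniature: the immediate check runs before the line is written, while the polling check picks it up once it appears.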
I recently observed such failures on 2.1 on Travis-CI (no 'Got a corrupted row' message in the log). Reduced the test to just one case:

test_run = require('test_run').new()
LISTEN = require('uri').parse(box.cfg.listen)
-- related to gh-4040: log corrupted rows
--
socket = require("socket")
log_level = box.cfg.log_level
box.cfg{log_level=6}
sock = socket.tcp_connect(LISTEN.host, LISTEN.service)
-- we need to have a packet with correctly encoded length,
-- so that it bypasses iproto length check, but cannot be
-- decoded in xrow_header_decode
-- 0x3C = 60, sha1 digest is 20 bytes long
data = string.fromhex('3C'..string.rep(require('digest').sha1_hex('bcde'), 3))
sock:write(data)
sock:close()
test_run:grep_log('default', 'Got a corrupted row.*')
test_run:grep_log('default', '00000000:.*')
test_run:grep_log('default', '00000010:.*')
test_run:grep_log('default', '00000020:.*')
test_run:grep_log('default', '00000030:.*')
test_run:grep_log('default', '00000040:.*')
box.cfg{log_level=log_level}

Replaced the first grep_log() with wait_log(), but still observe such failures when run like so:

$ ./test-run.py $(yes box/net.box.test.lua | head -n 1000)

When the test fails, the following line appears in the logs instead of 'Got a corrupted row':
Collected a stack backtrace at the point where the error is reported:
So the reason is that iproto cannot write a response. If we'll move

So the fixed reduced test case is the following:

test_run = require('test_run').new()
LISTEN = require('uri').parse(box.cfg.listen)
-- related to gh-4040: log corrupted rows
--
socket = require("socket")
log_level = box.cfg.log_level
box.cfg{log_level=6}
sock = socket.tcp_connect(LISTEN.host, LISTEN.service)
-- we need to have a packet with correctly encoded length,
-- so that it bypasses iproto length check, but cannot be
-- decoded in xrow_header_decode
-- 0x3C = 60, sha1 digest is 20 bytes long
data = string.fromhex('3C'..string.rep(require('digest').sha1_hex('bcde'), 3))
sock:write(data)
test_run:wait_log('default', 'Got a corrupted row.*', nil, 30)
test_run:grep_log('default', '00000000:.*')
test_run:grep_log('default', '00000010:.*')
test_run:grep_log('default', '00000020:.*')
test_run:grep_log('default', '00000030:.*')
test_run:grep_log('default', '00000040:.*')
sock:close()
box.cfg{log_level=log_level}

I'll reopen the issue against Serge.
master, 2.1 and 1.10 seem to be affected, so I changed the milestone to 1.10.4.
The test case has two problems that appear from time to time and lead to flaky failures. These failures look as shown below in the test-run output.

 | Test failed! Result content mismatch:
 | --- box/net.box.result Mon Jun 24 17:23:49 2019
 | +++ box/net.box.reject Mon Jun 24 17:51:52 2019
 | @@ -1404,7 +1404,7 @@
 |  ...
 |  test_run:grep_log('default', 'ER_INVALID_MSGPACK.*')
 |  ---
 | -- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet body'
 | +- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet length'
 |  ...
 |  -- gh-983 selecting a lot of data crashes the server or hangs the
 |  -- connection

The 'ER_INVALID_MSGPACK.*' regexp should match the 'ER_INVALID_MSGPACK: Invalid MsgPack - packet body' log message, but if it is not in the log file at the time of the grep_log() call (it is simply not flushed to the file yet), a message produced by another test case can be matched instead ('ER_INVALID_MSGPACK: Invalid MsgPack - packet length'). The fix here is to match the entire message and check for the message periodically for several seconds (use wait_log() instead of grep_log()).

Another problem is the race between writing a response to an iproto socket on the server side and closing the socket on the client end. If tarantool is unable to write a response, it does not produce the warning regarding invalid msgpack, but shows a 'broken pipe' message instead. We need to first grep for the message in the logs and only then close the socket on the client.

A similar problem (with another test case) is described in [1].

[1]: #4273 (comment)

Closes: #4311
(cherry picked from commit 0f9fdd7)
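The stale-match problem described in this commit message can be reproduced on plain strings. The sketch below is only an illustration: `last_match` is a hypothetical helper that returns the last regexp match in a log, and the two log snapshots are made up from the messages quoted above.

```python
import re


def last_match(log_text, pattern):
    """Return the last match of pattern in log_text, or None (hypothetical helper)."""
    matches = re.findall(pattern, log_text)
    return matches[-1] if matches else None


# Log state at the moment of the check: the line from an earlier test case
# is present, the line expected by the current test case is not flushed yet.
log_before_flush = "ER_INVALID_MSGPACK: Invalid MsgPack - packet length\n"
log_after_flush = (
    log_before_flush
    + "ER_INVALID_MSGPACK: Invalid MsgPack - packet body\n"
)

loose = r"ER_INVALID_MSGPACK.*"
exact = r"ER_INVALID_MSGPACK: Invalid MsgPack - packet body"

# The loose pattern silently matches the other test case's message.
stale = last_match(log_before_flush, loose)
# The exact pattern reports "not there yet", so a polling check can retry.
missing = last_match(log_before_flush, exact)
found = last_match(log_after_flush, exact)
```

Matching the entire message turns a wrong answer into "no answer yet", which is what makes polling for it safe.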
The test regarding logging corrupted rows failed occasionally with

```
[016]  test_run:grep_log('default', 'Got a corrupted row.*')
[016]  ---
[016] -- 'Got a corrupted row:'
[016] +- null
[016]  ...
```

The logs then had

```
[010] 2019-07-06 19:36:16.857 [13046] iproto sio.c:261 !> SystemError writev(1), called on fd 23, aka unix/:(socket), peer of unix/:(socket): Broken pipe
```

instead of the expected message. This happened because we closed the socket before tarantool could write a greeting to the client; the connection was then closed, and execution never got to processing the malformed request and thus printing the desired message to the log. To fix this, actually read the greeting prior to writing new data and closing the socket.

Follow-up #4273
(cherry picked from commit eb0cc50)
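The ordering this commit message asks for (read the greeting, then write, then close) can be sketched with plain sockets. This is a minimal sketch under stated assumptions, not the actual test change: `toy_server`, `recv_exact` and `run_client` are hypothetical names, and the toy server only imitates the shape of the exchange by sending a 128-byte greeting (the size of Tarantool's iproto greeting) before reading the request.

```python
import socket
import threading

GREETING_SIZE = 128  # Tarantool's iproto greeting is 128 bytes long


def recv_exact(sock, size):
    """Read exactly size bytes, or raise if the peer closes early."""
    buf = b""
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before sending all bytes")
        buf += chunk
    return buf


def toy_server(listener, received):
    """Hypothetical server: send a greeting, then read the request until EOF."""
    conn, _ = listener.accept()
    with conn:
        conn.sendall(b"G" * GREETING_SIZE)
        data = b""
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
        received.append(data)


def run_client(payload):
    """Connect, consume the greeting first, then write the payload and close."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    received = []
    server = threading.Thread(target=toy_server, args=(listener, received))
    server.start()

    client = socket.create_connection(listener.getsockname())
    # Reading the greeting guarantees the server's write has already succeeded
    # before the client sends anything or closes the socket.
    greeting = recv_exact(client, GREETING_SIZE)
    client.sendall(payload)
    client.close()

    server.join()
    listener.close()
    return greeting, received[0]
```

Because the client blocks on the greeting, the server can no longer hit a closed socket while writing it, and the request is always delivered for processing.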
Reproducer:
Test:
The issue seems to have something in common with #3851.
An excess error message in logs, which indicates the error causing additional disconnect and connect.
This leftover error is caused by #2763.
This last error

```
[035]  ...
[035]  disconnected_cnt
[035]  ---
[035] -- 1
[035] +- 2
[035]  ...
[035]  conn:close()
[035]  ---
[035]  ...
[035]  disconnected_cnt
[035]  ---
[035] -- 2
[035] +- 3
[035]  ...
[035]  test_run:cmd('stop server connecter')
[035]  ---
[035]
```

happens because net.box is able to connect to tarantool before it has finished bootstrap. When connecting, net.box tries to fetch the schema by executing a couple of selects, but fails to pass the access check since grants aren't applied yet. This is described in detail in #2763 (comment)

So, alter the test so that it tolerates multiple connection failures.

Closes #4273
(cherry picked from commit 1a2addb)
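One way to picture a test that tolerates multiple connection failures is a retry loop that reports how many attempts failed instead of expecting a fixed number. The sketch below is an assumption-laden illustration, not the change made in the commit: `connect_with_retries` and `make_flaky_server` are hypothetical names, and the stub server simply rejects the first few attempts, as a server that has not finished bootstrap would.

```python
def connect_with_retries(try_connect, max_attempts=10):
    """Call try_connect() until it succeeds; return the number of failures seen."""
    failures = 0
    for _ in range(max_attempts):
        if try_connect():
            return failures
        failures += 1
    raise RuntimeError("server did not become ready")


def make_flaky_server(failures_before_ready):
    """Hypothetical server stub: rejects the first N attempts (bootstrap not done)."""
    state = {"left": failures_before_ready}

    def try_connect():
        if state["left"] > 0:
            state["left"] -= 1
            return False
        return True

    return try_connect
```

A check written against this loop passes whether the server rejected zero, one or several early attempts, so it does not depend on how far bootstrap had progressed when the client connected.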
Tarantool version: 2.2.0
OS version: Ubuntu 16.04
Bug description: Test box/net.box.test.lua flaky fails
Steps to reproduce:
This issue seems to be related to commit 527b02a.
Error: