test: box/net.box.test.lua flaky fails #4273

ImeevMA · 2019-06-05T12:12:35Z

Tarantool version: 2.2.0

OS version: Ubuntu 16.04

Bug description: Test box/net.box.test.lua flaky fails

Steps to reproduce:

TEST_RUN_TESTS=$(for i in $(seq 1 10); do echo -n 'box/net.box. '; done) make test

This issue seems to be related to commit 527b02a .

Error:

[002] box/net.box.test.lua                                            [ fail ]
[002] 
[002] Test failed! Result content mismatch:
[002] --- box/net.box.result	Wed Jun  5 15:00:46 2019
[002] +++ box/net.box.reject	Wed Jun  5 15:06:48 2019
[002] @@ -2339,14 +2339,14 @@
[002]  ...
[002]  disconnected_cnt
[002]  ---
[002] -- 1
[002] +- 2
[002]  ...
[002]  conn:close()
[002]  ---
[002]  ...
[002]  disconnected_cnt
[002]  ---
[002] -- 2
[002] +- 3
[002]  ...
[002]  test_run:cmd('stop server connecter')
[002]  ---

The text was updated successfully, but these errors were encountered:

This is the regression from a7c855e (net.box: fetch '_vcollation' sysview into the module). The result file was corrupted, because the test sometimes produces incorrect result in the test case for on_connect / on_disconnect triggers. This issue is tracked by #4273. Follow up #3941.

avtikhon · 2019-06-07T10:03:52Z

To be sure bisected the history and additionally checked that the following commit is the cause of the fail:

commit 527b02a2ee6a9a205d8e2c8f38bbb84edf0d6557
Author: Serge Petrenko <sergepetrenko@tarantool.org>
Date:   Tue May 28 14:42:12 2019 +0300

    memtx: add yields during index build
    
    Memtx index build used to stall event loop for all the build period.
    Add occasional yields so that the loop is not blocked for too long.
    Also make index build set on_replace triggers so that concurrent
    replaces are also correctly handled during the build.
    
    Part of #3976
    
    @TarantoolBot document
    Title: memtx indices are now built in background
    
    Memtx engine no longer stalls the event loop during an index build.
    You may insert tuples into a space while an index build is in progress:
    the tuples will be correctly added to the new index. If such tuple
    violates the new indexes unique constraint or doesn't match the new
    index format, the index build will be aborted.

ImeevMA · 2019-06-08T10:23:24Z

A bit different error:

[002] box/net.box.test.lua                                            [ fail ]
[002] 
[002] Test failed! Result content mismatch:
[002] --- box/net.box.result	Sat Jun  8 13:10:58 2019
[002] +++ box/net.box.reject	Sat Jun  8 13:21:17 2019
[002] @@ -1404,7 +1404,7 @@
[002]  ...
[002]  test_run:grep_log('default', 'ER_INVALID_MSGPACK.*')
[002]  ---
[002] -- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet body'
[002] +- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet length'
[002]  ...
[002]  -- gh-983 selecting a lot of data crashes the server or hangs the
[002]  -- connection
[002] @@ -2339,14 +2339,14 @@
[002]  ...
[002]  disconnected_cnt
[002]  ---
[002] -- 1
[002] +- 2
[002]  ...
[002]  conn:close()
[002]  ---
[002]  ...
[002]  disconnected_cnt
[002]  ---
[002] -- 2
[002] +- 3
[002]  ...
[002]  test_run:cmd('stop server connecter')
[002]  ---

ImeevMA · 2019-06-08T11:42:53Z

One more:

[016] box/net.box.test.lua                                            [ fail ]
[016] 
[016] Test failed! Result content mismatch:
[016] --- box/net.box.result	Sat Jun  8 13:10:58 2019
[016] +++ box/net.box.reject	Sat Jun  8 14:39:28 2019
[016] @@ -2339,14 +2339,14 @@
[016]  ...
[016]  disconnected_cnt
[016]  ---
[016] -- 1
[016] +- 2
[016]  ...
[016]  conn:close()
[016]  ---
[016]  ...
[016]  disconnected_cnt
[016]  ---
[016] -- 2
[016] +- 3
[016]  ...
[016]  test_run:cmd('stop server connecter')
[016]  ---
[016] @@ -3774,23 +3774,23 @@
[016]  ...
[016]  test_run:grep_log('default', 'Got a corrupted row.*')
[016]  ---
[016] -- 'Got a corrupted row:'
[016] +- null
[016]  ...
[016]  test_run:grep_log('default', '00000000:.*')
[016]  ---
[016] -- '00000000: A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 60 5F 20 3F '
[016] +- null
[016]  ...
[016]  test_run:grep_log('default', '00000010:.*')
[016]  ---
[016] -- '00000010: D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 '
[016] +- null
[016]  ...
[016]  test_run:grep_log('default', '00000020:.*')
[016]  ---
[016] -- '00000020: 60 5F 20 3F D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 '
[016] +- null
[016]  ...
[016]  test_run:grep_log('default', '00000030:.*')
[016]  ---
[016] -- '00000030: A1 53 8D 53 60 5F 20 3F D8 E2 D6 E2 '
[016] +- null
[016]  ...
[016]  test_run:grep_log('default', '00000040:.*')
[016]  ---

Iproto already listens for requests during recovery, so yielding at this point of time allows such early requests, which arrived during recovery, be processed while data is in unfinished state. This caused box/net.box test failures, and is potentially harmful. Besides, there is no need to yield during recovery. Closes #4273

locker · 2019-06-13T16:22:59Z

The code causing the failure first appeared in 2.2 and was never backported to 2.1. Moving to 2.2.

The test used to fail occasionally with a following error: ``` [001] box/net.box.test.lua [ fail ] [001] [001] Test failed! Result content mismatch: [001] —- box/net.box.result Thu Jun 13 06:16:34 2019 [001] +++ box/net.box.reject Fri Jun 14 04:50:55 2019 [001] @@ -3774,23 +3774,23 @@ [001] ... [001] test_run:grep_log('default', 'Got a corrupted row.*') [001] —- [001] — 'Got a corrupted row:' [001] +- null [001] ... [001] test_run:grep_log('default', '00000000:.*') [001] —- [001] — '00000000: A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 60 5F 20 3F ' [001] +- null [001] ... [001] test_run:grep_log('default', '00000010:.*') [001] —- [001] — '00000010: D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 ' [001] +- null [001] ... [001] test_run:grep_log('default', '00000020:.*') [001] —- [001] — '00000020: 60 5F 20 3F D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 ' [001] +- null [001] ... [001] test_run:grep_log('default', '00000030:.*') [001] —- [001] — '00000030: A1 53 8D 53 60 5F 20 3F D8 E2 D6 E2 ' [001] +- null [001] ... [001] test_run:grep_log('default', '00000040:.*') [001] —- ``` This happened because we used `grep_log` right after `socket:write`, which should cause the expected log messages. Change to `wait_log`. Follow-up #4273

Totktonada · 2019-07-06T16:51:41Z

I recently observed such fails on 2.1 on Travis-CI (no 'Got a corrupted row' message in a log).

Reduced the test to just one case:

test_run = require('test_run').new()
LISTEN = require('uri').parse(box.cfg.listen)

-- related to gh-4040: log corrupted rows
--
socket = require("socket")
log_level = box.cfg.log_level
box.cfg{log_level=6}
sock = socket.tcp_connect(LISTEN.host, LISTEN.service)
-- we need to have a packet with correctly encoded length,
-- so that it bypasses iproto length check, but cannot be
-- decoded in xrow_header_decode
-- 0x3C = 60, sha1 digest is 20 bytes long
data = string.fromhex('3C'..string.rep(require('digest').sha1_hex('bcde'), 3))
sock:write(data)
sock:close()

test_run:grep_log('default', 'Got a corrupted row.*')
test_run:grep_log('default', '00000000:.*')
test_run:grep_log('default', '00000010:.*')
test_run:grep_log('default', '00000020:.*')
test_run:grep_log('default', '00000030:.*')
test_run:grep_log('default', '00000040:.*')

box.cfg{log_level=log_level}

Replaced first grep_log() with wait_log(), but still observe such fails when run like so:

$ ./test-run.py $(yes box/net.box.test.lua | head -n 1000)

When the test fails the following line appears in logs instead of 'Got a corrupted row':

[010] 2019-07-06 19:36:16.857 [13046] iproto sio.c:261 !> SystemError writev(1), called on fd 23, aka unix/:(socket), peer of unix/:(socket): Broken pipe

Collected a stack backtrace at the point where the error is reported:

[010] #0  0x55ee32c84b4c in print_backtrace+9
[010] #1  0x55ee32c8bdbf in sio_writev+121
[010] #2  0x55ee32b61c30 in _ZL12iproto_flushP17iproto_connection+1f1
[010] #3  0x55ee32b61db9 in _ZL27iproto_connection_on_outputP7ev_loopP5ev_ioi+2c
[010] #4  0x55ee32e52725 in ev_invoke_pending+ce
[010] #5  0x55ee32e5361f in ev_run+590
[010] #6  0x55ee32c835e6 in cord_costart_thread_func+ef
[010] #7  0x55ee32c82852 in cord_thread_func+32b
[010] #8  0x7f0661d27458 in __libpthread_freeres+1788
[010] #9  0x7f066180d71f in clone+3f

So the reason is that iproto cannot write a response. If we'll move sock:close() after a log check, then the test case works fine.

So fixed reduced test case is the following:

test_run = require('test_run').new()
LISTEN = require('uri').parse(box.cfg.listen)

-- related to gh-4040: log corrupted rows
--
socket = require("socket")
log_level = box.cfg.log_level
box.cfg{log_level=6}
sock = socket.tcp_connect(LISTEN.host, LISTEN.service)
-- we need to have a packet with correctly encoded length,
-- so that it bypasses iproto length check, but cannot be
-- decoded in xrow_header_decode
-- 0x3C = 60, sha1 digest is 20 bytes long
data = string.fromhex('3C'..string.rep(require('digest').sha1_hex('bcde'), 3))
sock:write(data)

test_run:wait_log('default', 'Got a corrupted row.*', nil, 30)
test_run:grep_log('default', '00000000:.*')
test_run:grep_log('default', '00000010:.*')
test_run:grep_log('default', '00000020:.*')
test_run:grep_log('default', '00000030:.*')
test_run:grep_log('default', '00000040:.*')

sock:close()

box.cfg{log_level=log_level}

I'll reopen the issue against Serge.

Totktonada · 2019-07-06T16:53:25Z

master, 2.1 and 1.10 seems to be affected, so I changed the milestone to 1.10.4.

The test case has two problems that appear from time to time and lead to flaky fails. Those fails are look as shown below in a test-run output. | Test failed! Result content mismatch: | --- box/net.box.result Mon Jun 24 17:23:49 2019 | +++ box/net.box.reject Mon Jun 24 17:51:52 2019 | @@ -1404,7 +1404,7 @@ | ... | test_run:grep_log('default', 'ER_INVALID_MSGPACK.*') | --- | -- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet body' | +- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet length' | ... | -- gh-983 selecting a lot of data crashes the server or hangs the | -- connection 'ER_INVALID_MSGPACK.*' regexp should match 'ER_INVALID_MSGPACK: Invalid MsgPack - packet body' log message, but if it is not in a log file at a time of grep_log() call (just don't flushed to the file yet) a message produced by another test case can be matched ('ER_INVALID_MSGPACK: Invalid MsgPack - packet length'). The fix here is to match the entire message and check for the message periodically during several seconds (use wait_log() instead of grep_log()). Another problem is the race between writing a response to an iproto socket on a server side and closing the socket on a client end. If tarantool is unable to write a response, it does not produce the warning re invalid msgpack, but shows 'broken pipe' message instead. We need first grep for the message in logs and only then close the socket on a client. The similar problem (with another test case) is described in [1]. [1]: #4273 (comment) Closes: #4311

The test case has two problems that appear from time to time and lead to flaky fails. Those fails are look as shown below in a test-run output. | Test failed! Result content mismatch: | --- box/net.box.result Mon Jun 24 17:23:49 2019 | +++ box/net.box.reject Mon Jun 24 17:51:52 2019 | @@ -1404,7 +1404,7 @@ | ... | test_run:grep_log('default', 'ER_INVALID_MSGPACK.*') | --- | -- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet body' | +- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet length' | ... | -- gh-983 selecting a lot of data crashes the server or hangs the | -- connection 'ER_INVALID_MSGPACK.*' regexp should match 'ER_INVALID_MSGPACK: Invalid MsgPack - packet body' log message, but if it is not in a log file at a time of grep_log() call (just don't flushed to the file yet) a message produced by another test case can be matched ('ER_INVALID_MSGPACK: Invalid MsgPack - packet length'). The fix here is to match the entire message and check for the message periodically during several seconds (use wait_log() instead of grep_log()). Another problem is the race between writing a response to an iproto socket on a server side and closing the socket on a client end. If tarantool is unable to write a response, it does not produce the warning re invalid msgpack, but shows 'broken pipe' message instead. We need first grep for the message in logs and only then close the socket on a client. The similar problem (with another test case) is described in [1]. [1]: #4273 (comment) Closes: #4311 (cherry picked from commit 0f9fdd7)

The test regarding logging corrupted rows failed occasionally with ``` [016] test_run:grep_log('default', 'Got a corrupted row.*') [016] --- [016] -- 'Got a corrupted row:' [016] +- null [016] ... ``` The logs then had ``` [010] 2019-07-06 19:36:16.857 [13046] iproto sio.c:261 !> SystemError writev(1), called on fd 23, aka unix/:(socket), peer of unix/:(socket): Broken pipe ``` instead of the expected message. This happened, because we closed a socket before tarantool could write a greeting to the client, the connection was then closed, and execution never got to processing the malformed request and thus printing the desired message to the log. To fix this, actually read the greeting prior to writing new data and closing the socket. Follow-up #4273

The test regarding logging corrupted rows failed occasionally with ``` [016] test_run:grep_log('default', 'Got a corrupted row.*') [016] --- [016] -- 'Got a corrupted row:' [016] +- null [016] ... ``` The logs then had ``` [010] 2019-07-06 19:36:16.857 [13046] iproto sio.c:261 !> SystemError writev(1), called on fd 23, aka unix/:(socket), peer of unix/:(socket): Broken pipe ``` instead of the expected message. This happened, because we closed a socket before tarantool could write a greeting to the client, the connection was then closed, and execution never got to processing the malformed request and thus printing the desired message to the log. To fix this, actually read the greeting prior to writing new data and closing the socket. Follow-up #4273 (cherry picked from commit eb0cc50)

avtikhon · 2019-07-11T11:17:46Z

Reproducer:

l=0 ; while ./test-run.py -j40 `for r in {1..100} ; do echo box/net.box.2642.test.lua ; done` 2>/dev/null ; do l=$(($l+1)) ; echo ======== $l ============= ; done

Test:

net = require('net.box')
fiber = require 'fiber'
test_run = require('test_run').new()

--
-- On_connect/disconnect triggers.
--
test_run:cmd('create server connecter with script = "box/proxy.lua"')
test_run:cmd('start server connecter')
test_run:cmd("set variable connect_to to 'connecter.listen'")
conn = net.connect(connect_to, { reconnect_after = 0.1 })
conn.state
connected_cnt = 0
disconnected_cnt = 0
function on_connect() connected_cnt = connected_cnt + 1 end
function on_disconnect() disconnected_cnt = disconnected_cnt + 1 end
conn:on_connect(on_connect)
conn:on_disconnect(on_disconnect)
test_run:cmd('stop server connecter')
test_run:cmd('start server connecter')
while conn.state ~= 'active' do fiber.sleep(0.1) end
connected_cnt
disconnected_cnt
conn:close()
disconnected_cnt
test_run:cmd('stop server connecter')

avtikhon · 2019-07-15T07:42:39Z

The issue seems to have the common from #3851

sergepetrenko · 2019-07-18T10:53:32Z

An excess error message in logs, which indicates the error causing additional disconnect and connect.

2019-07-18 13:30:27.738 [17973] main/126/unix/:/Users/s.petrenko/Source/t net_box.lua:927 W> unix/:/Users/s.petrenko/Source/tarantool/test/var/014_box/proxy.socket-iproto: Usage access to universe '' is denied for user 'guest'

sergepetrenko · 2019-07-18T13:06:52Z

[035] --- box/net.box.2642.result	Thu Jul 18 15:58:50 2019
[035] +++ box/net.box.2642.reject	Thu Jul 18 15:59:21 2019
[035] @@ -64,14 +64,14 @@
[035]  ...
[035]  disconnected_cnt
[035]  ---
[035] -- 1
[035] +- 2
[035]  ...
[035]  conn:close()
[035]  ---
[035]  ...
[035]  disconnected_cnt
[035]  ---
[035] -- 2
[035] +- 3
[035]  ...
[035]  test_run:cmd('stop server connecter')
[035]  ---
[035]

This leftover error is caused by #2763.

This last error ``` [035] ... [035] disconnected_cnt [035] --- [035] -- 1 [035] +- 2 [035] ... [035] conn:close() [035] --- [035] ... [035] disconnected_cnt [035] --- [035] -- 2 [035] +- 3 [035] ... [035] test_run:cmd('stop server connecter') [035] --- [035] ``` Happens because net.box is able to connect to tarantool before it has finished bootstrap. When connecting, net.box tries to fetch schema executing a couple of selects, but fails to pass access check since grants aren't applied yet. This is described in detail in #2763 (comment) So, alter the test so that it tolerates multiple connection failures. Closes #4273

This last error ``` [035] ... [035] disconnected_cnt [035] --- [035] -- 1 [035] +- 2 [035] ... [035] conn:close() [035] --- [035] ... [035] disconnected_cnt [035] --- [035] -- 2 [035] +- 3 [035] ... [035] test_run:cmd('stop server connecter') [035] --- [035] ``` Happens because net.box is able to connect to tarantool before it has finished bootstrap. When connecting, net.box tries to fetch schema executing a couple of selects, but fails to pass access check since grants aren't applied yet. This is described in detail in #2763 (comment) So, alter the test so that it tolerates multiple connection failures. Closes #4273 (cherry picked from commit 1a2addb)

sergepetrenko self-assigned this Jun 5, 2019

sergepetrenko added the flaky test label Jun 5, 2019

kyukhin added this to the 2.1.3 milestone Jun 6, 2019

locker closed this as completed in f824d09 Jun 13, 2019

locker modified the milestones: 2.1.3, 2.2.0 Jun 13, 2019

Totktonada reopened this Jul 6, 2019

Totktonada modified the milestones: 2.2.0, 1.10.4 Jul 6, 2019

kyukhin closed this as completed in 1a2addb Jul 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: box/net.box.test.lua flaky fails #4273

test: box/net.box.test.lua flaky fails #4273

ImeevMA commented Jun 5, 2019

avtikhon commented Jun 7, 2019 •

edited

Loading

ImeevMA commented Jun 8, 2019

ImeevMA commented Jun 8, 2019 •

edited

Loading

locker commented Jun 13, 2019

Totktonada commented Jul 6, 2019

Totktonada commented Jul 6, 2019

avtikhon commented Jul 11, 2019

avtikhon commented Jul 15, 2019

sergepetrenko commented Jul 18, 2019

sergepetrenko commented Jul 18, 2019 •

edited

Loading

test: box/net.box.test.lua flaky fails #4273

test: box/net.box.test.lua flaky fails #4273

Comments

ImeevMA commented Jun 5, 2019

avtikhon commented Jun 7, 2019 • edited Loading

ImeevMA commented Jun 8, 2019

ImeevMA commented Jun 8, 2019 • edited Loading

locker commented Jun 13, 2019

Totktonada commented Jul 6, 2019

Totktonada commented Jul 6, 2019

avtikhon commented Jul 11, 2019

avtikhon commented Jul 15, 2019

sergepetrenko commented Jul 18, 2019

sergepetrenko commented Jul 18, 2019 • edited Loading

avtikhon commented Jun 7, 2019 •

edited

Loading

ImeevMA commented Jun 8, 2019 •

edited

Loading

sergepetrenko commented Jul 18, 2019 •

edited

Loading