Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: box/net.box.test.lua flaky fails #4273

Closed
ImeevMA opened this issue Jun 5, 2019 · 10 comments
Closed

test: box/net.box.test.lua flaky fails #4273

ImeevMA opened this issue Jun 5, 2019 · 10 comments
Assignees
Milestone

Comments

@ImeevMA
Copy link
Collaborator

ImeevMA commented Jun 5, 2019

Tarantool version: 2.2.0

OS version: Ubuntu 16.04

Bug description: Test box/net.box.test.lua flaky fails

Steps to reproduce:

TEST_RUN_TESTS=$(for i in $(seq 1 10); do echo -n 'box/net.box. '; done) make test

This issue seems to be related to commit 527b02a .

Error:

[002] box/net.box.test.lua                                            [ fail ]
[002] 
[002] Test failed! Result content mismatch:
[002] --- box/net.box.result	Wed Jun  5 15:00:46 2019
[002] +++ box/net.box.reject	Wed Jun  5 15:06:48 2019
[002] @@ -2339,14 +2339,14 @@
[002]  ...
[002]  disconnected_cnt
[002]  ---
[002] -- 1
[002] +- 2
[002]  ...
[002]  conn:close()
[002]  ---
[002]  ...
[002]  disconnected_cnt
[002]  ---
[002] -- 2
[002] +- 3
[002]  ...
[002]  test_run:cmd('stop server connecter')
[002]  ---
@sergepetrenko sergepetrenko self-assigned this Jun 5, 2019
@kyukhin kyukhin added this to the 2.1.3 milestone Jun 6, 2019
Totktonada added a commit that referenced this issue Jun 7, 2019
This is the regression from a7c855e
(net.box: fetch '_vcollation' sysview into the module). The result file
was corrupted, because the test sometimes produces incorrect result in
the test case for on_connect / on_disconnect triggers. This issue is
tracked by #4273.

Follow up #3941.
Totktonada added a commit that referenced this issue Jun 7, 2019
This is the regression from a7c855e
(net.box: fetch '_vcollation' sysview into the module). The result file
was corrupted, because the test sometimes produces incorrect result in
the test case for on_connect / on_disconnect triggers. This issue is
tracked by #4273.

Follow up #3941.
@avtikhon
Copy link
Contributor

avtikhon commented Jun 7, 2019

To be sure bisected the history and additionally checked that the following commit is the cause of the fail:

commit 527b02a2ee6a9a205d8e2c8f38bbb84edf0d6557
Author: Serge Petrenko <sergepetrenko@tarantool.org>
Date:   Tue May 28 14:42:12 2019 +0300

    memtx: add yields during index build
    
    Memtx index build used to stall event loop for all the build period.
    Add occasional yields so that the loop is not blocked for too long.
    Also make index build set on_replace triggers so that concurrent
    replaces are also correctly handled during the build.
    
    Part of #3976
    
    @TarantoolBot document
    Title: memtx indices are now built in background
    
    Memtx engine no longer stalls the event loop during an index build.
    You may insert tuples into a space while an index build is in progress:
    the tuples will be correctly added to the new index. If such tuple
    violates the new indexes unique constraint or doesn't match the new
    index format, the index build will be aborted.

@ImeevMA
Copy link
Collaborator Author

ImeevMA commented Jun 8, 2019

A bit different error:

[002] box/net.box.test.lua                                            [ fail ]
[002] 
[002] Test failed! Result content mismatch:
[002] --- box/net.box.result	Sat Jun  8 13:10:58 2019
[002] +++ box/net.box.reject	Sat Jun  8 13:21:17 2019
[002] @@ -1404,7 +1404,7 @@
[002]  ...
[002]  test_run:grep_log('default', 'ER_INVALID_MSGPACK.*')
[002]  ---
[002] -- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet body'
[002] +- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet length'
[002]  ...
[002]  -- gh-983 selecting a lot of data crashes the server or hangs the
[002]  -- connection
[002] @@ -2339,14 +2339,14 @@
[002]  ...
[002]  disconnected_cnt
[002]  ---
[002] -- 1
[002] +- 2
[002]  ...
[002]  conn:close()
[002]  ---
[002]  ...
[002]  disconnected_cnt
[002]  ---
[002] -- 2
[002] +- 3
[002]  ...
[002]  test_run:cmd('stop server connecter')
[002]  ---

@ImeevMA
Copy link
Collaborator Author

ImeevMA commented Jun 8, 2019

One more:

[016] box/net.box.test.lua                                            [ fail ]
[016] 
[016] Test failed! Result content mismatch:
[016] --- box/net.box.result	Sat Jun  8 13:10:58 2019
[016] +++ box/net.box.reject	Sat Jun  8 14:39:28 2019
[016] @@ -2339,14 +2339,14 @@
[016]  ...
[016]  disconnected_cnt
[016]  ---
[016] -- 1
[016] +- 2
[016]  ...
[016]  conn:close()
[016]  ---
[016]  ...
[016]  disconnected_cnt
[016]  ---
[016] -- 2
[016] +- 3
[016]  ...
[016]  test_run:cmd('stop server connecter')
[016]  ---
[016] @@ -3774,23 +3774,23 @@
[016]  ...
[016]  test_run:grep_log('default', 'Got a corrupted row.*')
[016]  ---
[016] -- 'Got a corrupted row:'
[016] +- null
[016]  ...
[016]  test_run:grep_log('default', '00000000:.*')
[016]  ---
[016] -- '00000000: A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 60 5F 20 3F '
[016] +- null
[016]  ...
[016]  test_run:grep_log('default', '00000010:.*')
[016]  ---
[016] -- '00000010: D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 '
[016] +- null
[016]  ...
[016]  test_run:grep_log('default', '00000020:.*')
[016]  ---
[016] -- '00000020: 60 5F 20 3F D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 '
[016] +- null
[016]  ...
[016]  test_run:grep_log('default', '00000030:.*')
[016]  ---
[016] -- '00000030: A1 53 8D 53 60 5F 20 3F D8 E2 D6 E2 '
[016] +- null
[016]  ...
[016]  test_run:grep_log('default', '00000040:.*')
[016]  ---

sergepetrenko added a commit that referenced this issue Jun 13, 2019
Iproto already listens for requests during recovery, so yielding at this
point of time allows such early requests, which arrived during recovery,
be processed while data is in unfinished state. This caused box/net.box
test failures, and is potentially harmful.
Besides, there is no need to yield during recovery.

Closes #4273
@locker locker closed this as completed in f824d09 Jun 13, 2019
@locker
Copy link
Member

locker commented Jun 13, 2019

The code causing the failure first appeared in 2.2 and was never backported to 2.1. Moving to 2.2.

@locker locker modified the milestones: 2.1.3, 2.2.0 Jun 13, 2019
sergepetrenko added a commit that referenced this issue Jun 14, 2019
The test used to fail occasionally with a following error:

```
[001] box/net.box.test.lua                                            [ fail ]
[001]
[001] Test failed! Result content mismatch:
[001] —- box/net.box.result  Thu Jun 13 06:16:34 2019
[001] +++ box/net.box.reject  Fri Jun 14 04:50:55 2019
[001] @@ -3774,23 +3774,23 @@
[001]  ...
[001]  test_run:grep_log('default', 'Got a corrupted row.*')
[001]  —-
[001] — 'Got a corrupted row:'
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000000:.*')
[001]  —-
[001] — '00000000: A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 60 5F 20 3F '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000010:.*')
[001]  —-
[001] — '00000010: D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000020:.*')
[001]  —-
[001] — '00000020: 60 5F 20 3F D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000030:.*')
[001]  —-
[001] — '00000030: A1 53 8D 53 60 5F 20 3F D8 E2 D6 E2 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000040:.*')
[001]  —-
```

This happened because we used `grep_log` right after `socket:write`,
which should cause the expected log messages. Change to `wait_log`.

Follow-up #4273
avtikhon pushed a commit that referenced this issue Jun 14, 2019
The test used to fail occasionally with a following error:

```
[001] box/net.box.test.lua                                            [ fail ]
[001]
[001] Test failed! Result content mismatch:
[001] —- box/net.box.result  Thu Jun 13 06:16:34 2019
[001] +++ box/net.box.reject  Fri Jun 14 04:50:55 2019
[001] @@ -3774,23 +3774,23 @@
[001]  ...
[001]  test_run:grep_log('default', 'Got a corrupted row.*')
[001]  —-
[001] — 'Got a corrupted row:'
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000000:.*')
[001]  —-
[001] — '00000000: A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 60 5F 20 3F '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000010:.*')
[001]  —-
[001] — '00000010: D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000020:.*')
[001]  —-
[001] — '00000020: 60 5F 20 3F D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000030:.*')
[001]  —-
[001] — '00000030: A1 53 8D 53 60 5F 20 3F D8 E2 D6 E2 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000040:.*')
[001]  —-
```

This happened because we used `grep_log` right after `socket:write`,
which should cause the expected log messages. Change to `wait_log`.

Follow-up #4273
avtikhon pushed a commit that referenced this issue Jun 14, 2019
The test used to fail occasionally with a following error:

```
[001] box/net.box.test.lua                                            [ fail ]
[001]
[001] Test failed! Result content mismatch:
[001] —- box/net.box.result  Thu Jun 13 06:16:34 2019
[001] +++ box/net.box.reject  Fri Jun 14 04:50:55 2019
[001] @@ -3774,23 +3774,23 @@
[001]  ...
[001]  test_run:grep_log('default', 'Got a corrupted row.*')
[001]  —-
[001] — 'Got a corrupted row:'
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000000:.*')
[001]  —-
[001] — '00000000: A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 60 5F 20 3F '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000010:.*')
[001]  —-
[001] — '00000010: D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000020:.*')
[001]  —-
[001] — '00000020: 60 5F 20 3F D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000030:.*')
[001]  —-
[001] — '00000030: A1 53 8D 53 60 5F 20 3F D8 E2 D6 E2 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000040:.*')
[001]  —-
```

This happened because we used `grep_log` right after `socket:write`,
which should cause the expected log messages. Change to `wait_log`.

Follow-up #4273
avtikhon pushed a commit that referenced this issue Jun 14, 2019
The test used to fail occasionally with a following error:

```
[001] box/net.box.test.lua                                            [ fail ]
[001]
[001] Test failed! Result content mismatch:
[001] —- box/net.box.result  Thu Jun 13 06:16:34 2019
[001] +++ box/net.box.reject  Fri Jun 14 04:50:55 2019
[001] @@ -3774,23 +3774,23 @@
[001]  ...
[001]  test_run:grep_log('default', 'Got a corrupted row.*')
[001]  —-
[001] — 'Got a corrupted row:'
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000000:.*')
[001]  —-
[001] — '00000000: A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 60 5F 20 3F '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000010:.*')
[001]  —-
[001] — '00000010: D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000020:.*')
[001]  —-
[001] — '00000020: 60 5F 20 3F D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000030:.*')
[001]  —-
[001] — '00000030: A1 53 8D 53 60 5F 20 3F D8 E2 D6 E2 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000040:.*')
[001]  —-
```

This happened because we used `grep_log` right after `socket:write`,
which should cause the expected log messages. Change to `wait_log`.

Follow-up #4273
locker pushed a commit that referenced this issue Jun 17, 2019
The test used to fail occasionally with a following error:

```
[001] box/net.box.test.lua                                            [ fail ]
[001]
[001] Test failed! Result content mismatch:
[001] —- box/net.box.result  Thu Jun 13 06:16:34 2019
[001] +++ box/net.box.reject  Fri Jun 14 04:50:55 2019
[001] @@ -3774,23 +3774,23 @@
[001]  ...
[001]  test_run:grep_log('default', 'Got a corrupted row.*')
[001]  —-
[001] — 'Got a corrupted row:'
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000000:.*')
[001]  —-
[001] — '00000000: A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 60 5F 20 3F '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000010:.*')
[001]  —-
[001] — '00000010: D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000020:.*')
[001]  —-
[001] — '00000020: 60 5F 20 3F D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000030:.*')
[001]  —-
[001] — '00000030: A1 53 8D 53 60 5F 20 3F D8 E2 D6 E2 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000040:.*')
[001]  —-
```

This happened because we used `grep_log` right after `socket:write`,
which should cause the expected log messages. Change to `wait_log`.

Follow-up #4273
avtikhon pushed a commit that referenced this issue Jun 18, 2019
The test used to fail occasionally with a following error:

```
[001] box/net.box.test.lua                                            [ fail ]
[001]
[001] Test failed! Result content mismatch:
[001] —- box/net.box.result  Thu Jun 13 06:16:34 2019
[001] +++ box/net.box.reject  Fri Jun 14 04:50:55 2019
[001] @@ -3774,23 +3774,23 @@
[001]  ...
[001]  test_run:grep_log('default', 'Got a corrupted row.*')
[001]  —-
[001] — 'Got a corrupted row:'
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000000:.*')
[001]  —-
[001] — '00000000: A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 60 5F 20 3F '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000010:.*')
[001]  —-
[001] — '00000010: D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 A1 53 8D 53 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000020:.*')
[001]  —-
[001] — '00000020: 60 5F 20 3F D8 E2 D6 E2 A3 02 D6 5A E4 D9 E7 68 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000030:.*')
[001]  —-
[001] — '00000030: A1 53 8D 53 60 5F 20 3F D8 E2 D6 E2 '
[001] +- null
[001]  ...
[001]  test_run:grep_log('default', '00000040:.*')
[001]  —-
```

This happened because we used `grep_log` right after `socket:write`,
which should cause the expected log messages. Change to `wait_log`.

Follow-up #4273
@Totktonada
Copy link
Member

I recently observed such fails on 2.1 on Travis-CI (no 'Got a corrupted row' message in a log).

Reduced the test to just one case:

test_run = require('test_run').new()
LISTEN = require('uri').parse(box.cfg.listen)

-- related to gh-4040: log corrupted rows
--
socket = require("socket")
log_level = box.cfg.log_level
box.cfg{log_level=6}
sock = socket.tcp_connect(LISTEN.host, LISTEN.service)
-- we need to have a packet with correctly encoded length,
-- so that it bypasses iproto length check, but cannot be
-- decoded in xrow_header_decode
-- 0x3C = 60, sha1 digest is 20 bytes long
data = string.fromhex('3C'..string.rep(require('digest').sha1_hex('bcde'), 3))
sock:write(data)
sock:close()

test_run:grep_log('default', 'Got a corrupted row.*')
test_run:grep_log('default', '00000000:.*')
test_run:grep_log('default', '00000010:.*')
test_run:grep_log('default', '00000020:.*')
test_run:grep_log('default', '00000030:.*')
test_run:grep_log('default', '00000040:.*')

box.cfg{log_level=log_level}

Replaced first grep_log() with wait_log(), but still observe such fails when run like so:

$ ./test-run.py $(yes box/net.box.test.lua | head -n 1000)

When the test fails the following line appears in logs instead of 'Got a corrupted row':

[010] 2019-07-06 19:36:16.857 [13046] iproto sio.c:261 !> SystemError writev(1), called on fd 23, aka unix/:(socket), peer of unix/:(socket): Broken pipe

Collected a stack backtrace at the point where the error is reported:

[010] #0  0x55ee32c84b4c in print_backtrace+9
[010] #1  0x55ee32c8bdbf in sio_writev+121
[010] #2  0x55ee32b61c30 in _ZL12iproto_flushP17iproto_connection+1f1
[010] #3  0x55ee32b61db9 in _ZL27iproto_connection_on_outputP7ev_loopP5ev_ioi+2c
[010] #4  0x55ee32e52725 in ev_invoke_pending+ce
[010] #5  0x55ee32e5361f in ev_run+590
[010] #6  0x55ee32c835e6 in cord_costart_thread_func+ef
[010] #7  0x55ee32c82852 in cord_thread_func+32b
[010] #8  0x7f0661d27458 in __libpthread_freeres+1788
[010] #9  0x7f066180d71f in clone+3f

So the reason is that iproto cannot write a response. If we'll move sock:close() after a log check, then the test case works fine.

So fixed reduced test case is the following:

test_run = require('test_run').new()
LISTEN = require('uri').parse(box.cfg.listen)

-- related to gh-4040: log corrupted rows
--
socket = require("socket")
log_level = box.cfg.log_level
box.cfg{log_level=6}
sock = socket.tcp_connect(LISTEN.host, LISTEN.service)
-- we need to have a packet with correctly encoded length,
-- so that it bypasses iproto length check, but cannot be
-- decoded in xrow_header_decode
-- 0x3C = 60, sha1 digest is 20 bytes long
data = string.fromhex('3C'..string.rep(require('digest').sha1_hex('bcde'), 3))
sock:write(data)

test_run:wait_log('default', 'Got a corrupted row.*', nil, 30)
test_run:grep_log('default', '00000000:.*')
test_run:grep_log('default', '00000010:.*')
test_run:grep_log('default', '00000020:.*')
test_run:grep_log('default', '00000030:.*')
test_run:grep_log('default', '00000040:.*')

sock:close()

box.cfg{log_level=log_level}

I'll reopen the issue against Serge.

@Totktonada Totktonada reopened this Jul 6, 2019
@Totktonada
Copy link
Member

master, 2.1 and 1.10 seems to be affected, so I changed the milestone to 1.10.4.

@Totktonada Totktonada modified the milestones: 2.2.0, 1.10.4 Jul 6, 2019
Totktonada pushed a commit that referenced this issue Jul 9, 2019
The test case has two problems that appear from time to time and lead to
flaky fails. Those fails are look as shown below in a test-run output.

 | Test failed! Result content mismatch:
 | --- box/net.box.result	Mon Jun 24 17:23:49 2019
 | +++ box/net.box.reject	Mon Jun 24 17:51:52 2019
 | @@ -1404,7 +1404,7 @@
 |  ...
 |  test_run:grep_log('default', 'ER_INVALID_MSGPACK.*')
 | ---
 | -- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet body'
 | +- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet length'
 | ...
 | -- gh-983 selecting a lot of data crashes the server or hangs the
 | -- connection

'ER_INVALID_MSGPACK.*' regexp should match 'ER_INVALID_MSGPACK: Invalid
MsgPack - packet body' log message, but if it is not in a log file at a
time of grep_log() call (just don't flushed to the file yet) a message
produced by another test case can be matched ('ER_INVALID_MSGPACK:
Invalid MsgPack - packet length'). The fix here is to match the entire
message and check for the message periodically during several seconds
(use wait_log() instead of grep_log()).

Another problem is the race between writing a response to an iproto
socket on a server side and closing the socket on a client end. If
tarantool is unable to write a response, it does not produce the warning
re invalid msgpack, but shows 'broken pipe' message instead. We need
first grep for the message in logs and only then close the socket on a
client. The similar problem (with another test case) is described in
[1].

[1]: #4273 (comment)

Closes: #4311
Totktonada pushed a commit that referenced this issue Jul 9, 2019
The test case has two problems that appear from time to time and lead to
flaky fails. Those fails are look as shown below in a test-run output.

 | Test failed! Result content mismatch:
 | --- box/net.box.result	Mon Jun 24 17:23:49 2019
 | +++ box/net.box.reject	Mon Jun 24 17:51:52 2019
 | @@ -1404,7 +1404,7 @@
 |  ...
 |  test_run:grep_log('default', 'ER_INVALID_MSGPACK.*')
 | ---
 | -- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet body'
 | +- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet length'
 | ...
 | -- gh-983 selecting a lot of data crashes the server or hangs the
 | -- connection

'ER_INVALID_MSGPACK.*' regexp should match 'ER_INVALID_MSGPACK: Invalid
MsgPack - packet body' log message, but if it is not in a log file at a
time of grep_log() call (just don't flushed to the file yet) a message
produced by another test case can be matched ('ER_INVALID_MSGPACK:
Invalid MsgPack - packet length'). The fix here is to match the entire
message and check for the message periodically during several seconds
(use wait_log() instead of grep_log()).

Another problem is the race between writing a response to an iproto
socket on a server side and closing the socket on a client end. If
tarantool is unable to write a response, it does not produce the warning
re invalid msgpack, but shows 'broken pipe' message instead. We need
first grep for the message in logs and only then close the socket on a
client. The similar problem (with another test case) is described in
[1].

[1]: #4273 (comment)

Closes: #4311

(cherry picked from commit 0f9fdd7)
Totktonada pushed a commit that referenced this issue Jul 9, 2019
The test case has two problems that appear from time to time and lead to
flaky fails. Those fails are look as shown below in a test-run output.

 | Test failed! Result content mismatch:
 | --- box/net.box.result	Mon Jun 24 17:23:49 2019
 | +++ box/net.box.reject	Mon Jun 24 17:51:52 2019
 | @@ -1404,7 +1404,7 @@
 |  ...
 |  test_run:grep_log('default', 'ER_INVALID_MSGPACK.*')
 | ---
 | -- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet body'
 | +- 'ER_INVALID_MSGPACK: Invalid MsgPack - packet length'
 | ...
 | -- gh-983 selecting a lot of data crashes the server or hangs the
 | -- connection

'ER_INVALID_MSGPACK.*' regexp should match 'ER_INVALID_MSGPACK: Invalid
MsgPack - packet body' log message, but if it is not in a log file at a
time of grep_log() call (just don't flushed to the file yet) a message
produced by another test case can be matched ('ER_INVALID_MSGPACK:
Invalid MsgPack - packet length'). The fix here is to match the entire
message and check for the message periodically during several seconds
(use wait_log() instead of grep_log()).

Another problem is the race between writing a response to an iproto
socket on a server side and closing the socket on a client end. If
tarantool is unable to write a response, it does not produce the warning
re invalid msgpack, but shows 'broken pipe' message instead. We need
first grep for the message in logs and only then close the socket on a
client. The similar problem (with another test case) is described in
[1].

[1]: #4273 (comment)

Closes: #4311

(cherry picked from commit 0f9fdd7)
sergepetrenko added a commit that referenced this issue Jul 9, 2019
The test regarding logging corrupted rows failed occasionally with
```
[016]  test_run:grep_log('default', 'Got a corrupted row.*')
[016]  ---
[016] -- 'Got a corrupted row:'
[016] +- null
[016]  ...
```
The logs then had
```
[010] 2019-07-06 19:36:16.857 [13046] iproto sio.c:261 !> SystemError writev(1),
called on fd 23, aka unix/:(socket), peer of unix/:(socket): Broken pipe
```
instead of the expected message.

This happened, because we closed a socket before tarantool could write a
greeting to the client, the connection was then closed, and execution
never got to processing the malformed request and thus printing the
desired message to the log.

To fix this, actually read the greeting prior to writing new data and
closing the socket.

Follow-up #4273
locker pushed a commit that referenced this issue Jul 9, 2019
The test regarding logging corrupted rows failed occasionally with
```
[016]  test_run:grep_log('default', 'Got a corrupted row.*')
[016]  ---
[016] -- 'Got a corrupted row:'
[016] +- null
[016]  ...
```
The logs then had
```
[010] 2019-07-06 19:36:16.857 [13046] iproto sio.c:261 !> SystemError writev(1),
called on fd 23, aka unix/:(socket), peer of unix/:(socket): Broken pipe
```
instead of the expected message.

This happened, because we closed a socket before tarantool could write a
greeting to the client, the connection was then closed, and execution
never got to processing the malformed request and thus printing the
desired message to the log.

To fix this, actually read the greeting prior to writing new data and
closing the socket.

Follow-up #4273
locker pushed a commit that referenced this issue Jul 9, 2019
The test regarding logging corrupted rows failed occasionally with
```
[016]  test_run:grep_log('default', 'Got a corrupted row.*')
[016]  ---
[016] -- 'Got a corrupted row:'
[016] +- null
[016]  ...
```
The logs then had
```
[010] 2019-07-06 19:36:16.857 [13046] iproto sio.c:261 !> SystemError writev(1),
called on fd 23, aka unix/:(socket), peer of unix/:(socket): Broken pipe
```
instead of the expected message.

This happened, because we closed a socket before tarantool could write a
greeting to the client, the connection was then closed, and execution
never got to processing the malformed request and thus printing the
desired message to the log.

To fix this, actually read the greeting prior to writing new data and
closing the socket.

Follow-up #4273

(cherry picked from commit eb0cc50)
locker pushed a commit that referenced this issue Jul 9, 2019
The test regarding logging corrupted rows failed occasionally with
```
[016]  test_run:grep_log('default', 'Got a corrupted row.*')
[016]  ---
[016] -- 'Got a corrupted row:'
[016] +- null
[016]  ...
```
The logs then had
```
[010] 2019-07-06 19:36:16.857 [13046] iproto sio.c:261 !> SystemError writev(1),
called on fd 23, aka unix/:(socket), peer of unix/:(socket): Broken pipe
```
instead of the expected message.

This happened, because we closed a socket before tarantool could write a
greeting to the client, the connection was then closed, and execution
never got to processing the malformed request and thus printing the
desired message to the log.

To fix this, actually read the greeting prior to writing new data and
closing the socket.

Follow-up #4273

(cherry picked from commit eb0cc50)
@avtikhon
Copy link
Contributor

Reproducer:

l=0 ; while ./test-run.py -j40 `for r in {1..100} ; do echo box/net.box.2642.test.lua ; done` 2>/dev/null ; do l=$(($l+1)) ; echo ======== $l ============= ; done

Test:

net = require('net.box')
fiber = require 'fiber'
test_run = require('test_run').new()

--
-- On_connect/disconnect triggers.
--
test_run:cmd('create server connecter with script = "box/proxy.lua"')
test_run:cmd('start server connecter')
test_run:cmd("set variable connect_to to 'connecter.listen'")
conn = net.connect(connect_to, { reconnect_after = 0.1 })
conn.state
connected_cnt = 0
disconnected_cnt = 0
function on_connect() connected_cnt = connected_cnt + 1 end
function on_disconnect() disconnected_cnt = disconnected_cnt + 1 end
conn:on_connect(on_connect)
conn:on_disconnect(on_disconnect)
test_run:cmd('stop server connecter')
test_run:cmd('start server connecter')
while conn.state ~= 'active' do fiber.sleep(0.1) end
connected_cnt
disconnected_cnt
conn:close()
disconnected_cnt
test_run:cmd('stop server connecter')

@avtikhon
Copy link
Contributor

The issue seems to have the common from #3851

@sergepetrenko
Copy link
Collaborator

An excess error message in logs, which indicates the error causing additional disconnect and connect.

2019-07-18 13:30:27.738 [17973] main/126/unix/:/Users/s.petrenko/Source/t net_box.lua:927 W> unix/:/Users/s.petrenko/Source/tarantool/test/var/014_box/proxy.socket-iproto: Usage access to universe '' is denied for user 'guest'

@sergepetrenko
Copy link
Collaborator

sergepetrenko commented Jul 18, 2019

[035] --- box/net.box.2642.result	Thu Jul 18 15:58:50 2019
[035] +++ box/net.box.2642.reject	Thu Jul 18 15:59:21 2019
[035] @@ -64,14 +64,14 @@
[035]  ...
[035]  disconnected_cnt
[035]  ---
[035] -- 1
[035] +- 2
[035]  ...
[035]  conn:close()
[035]  ---
[035]  ...
[035]  disconnected_cnt
[035]  ---
[035] -- 2
[035] +- 3
[035]  ...
[035]  test_run:cmd('stop server connecter')
[035]  ---
[035] 

This leftover error is caused by #2763.

sergepetrenko added a commit that referenced this issue Jul 19, 2019
This last error
```
[035]  ...
[035]  disconnected_cnt
[035]  ---
[035] -- 1
[035] +- 2
[035]  ...
[035]  conn:close()
[035]  ---
[035]  ...
[035]  disconnected_cnt
[035]  ---
[035] -- 2
[035] +- 3
[035]  ...
[035]  test_run:cmd('stop server connecter')
[035]  ---
[035]
```
Happens because net.box is able to connect to tarantool before it has
finished bootstrap. When connecting, net.box tries to fetch schema
executing a couple of selects, but fails to pass access check since
grants aren't applied yet. This is described in detail in
#2763 (comment)
So, alter the test so that it tolerates multiple connection failures.

Closes #4273
kyukhin pushed a commit that referenced this issue Jul 19, 2019
This last error
```
[035]  ...
[035]  disconnected_cnt
[035]  ---
[035] -- 1
[035] +- 2
[035]  ...
[035]  conn:close()
[035]  ---
[035]  ...
[035]  disconnected_cnt
[035]  ---
[035] -- 2
[035] +- 3
[035]  ...
[035]  test_run:cmd('stop server connecter')
[035]  ---
[035]
```
Happens because net.box is able to connect to tarantool before it has
finished bootstrap. When connecting, net.box tries to fetch schema
executing a couple of selects, but fails to pass access check since
grants aren't applied yet. This is described in detail in
#2763 (comment)
So, alter the test so that it tolerates multiple connection failures.

Closes #4273

(cherry picked from commit 1a2addb)
kyukhin pushed a commit that referenced this issue Jul 19, 2019
This last error
```
[035]  ...
[035]  disconnected_cnt
[035]  ---
[035] -- 1
[035] +- 2
[035]  ...
[035]  conn:close()
[035]  ---
[035]  ...
[035]  disconnected_cnt
[035]  ---
[035] -- 2
[035] +- 3
[035]  ...
[035]  test_run:cmd('stop server connecter')
[035]  ---
[035]
```
Happens because net.box is able to connect to tarantool before it has
finished bootstrap. When connecting, net.box tries to fetch schema
executing a couple of selects, but fails to pass access check since
grants aren't applied yet. This is described in detail in
#2763 (comment)
So, alter the test so that it tolerates multiple connection failures.

Closes #4273

(cherry picked from commit 1a2addb)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants