test-run worker hangs when an instance does not enter the event loop #276

ligurio · 2021-03-19T15:17:49Z

How to reproduce

test reproducer (test path xlog/hang.test.lua):

test_run = require('test_run').new()
test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
test_run:cmd('start server replica')
test_run:cmd('cleanup server replica')
test_run:cmd('delete server replica')

The test hangs because the replica lacks read access (see the log var/001_xlog/replica.log):

...
2021-03-19 18:11:44.865 [1987357] main/103/replica I> connected to 1 replicas
2021-03-19 18:11:44.865 [1987357] main/103/replica I> bootstrapping replica from 8a89e9f8-7364-4d6a-96cd-d2c8fe5b93bb at unix/:/home/sergeyb/sources/MRG/tarantool/build/test/var/001_xlog/xlog.socket-iproto
2021-03-19 18:11:44.865 [1987357] main/112/applier/unix/:/home/sergeyb/sources/MRG/tarantool/build/test/var/001_xlog/xlog.socket-iproto I> can't read row
2021-03-19 18:11:44.865 [1987357] main/112/applier/unix/:/home/sergeyb/sources/MRG/tarantool/build/test/var/001_xlog/xlog.socket-iproto session.cc:332 E> ER_ACCESS_DENIED: Read access to universe '' is denied for user 'guest'
2021-03-19 18:11:44.865 [1987357] main/112/applier/unix/:/home/sergeyb/sources/MRG/tarantool/build/test/var/001_xlog/xlog.socket-iproto I> will retry every 1.00 second

The test passes if the line box.schema.user.grant('guest', 'replication') is added at the top of the test.

Versions

test-run 5941741

tarantool --version:

Tarantool 2.8.0-134-g81c663335
Target: Linux-x86_64-Debug
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=ON
Compiler: /usr/bin/cc /usr/bin/c++
C_FLAGS: -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-gnu-alignof-expression -fno-gnu89-inline -Wno-cast-function-type -Werror
CXX_FLAGS: -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Wno-cast-function-type -Werror

Totktonada · 2021-03-19T15:41:38Z

This test should hang: we are unable to bootstrap the replica, because it cannot join the master (it lacks the grant).

However, the problem is that the test timeout does not work and the test-run worker 'hangs'. This seems to be caused by the use of time.sleep(<...>), which does not work well with gevent's cooperative greenlets. At least, the following patch seems to fix the problem:

diff --git a/lib/tarantool_server.py b/lib/tarantool_server.py
index f611dba..ee83489 100644
--- a/lib/tarantool_server.py
+++ b/lib/tarantool_server.py
@@ -449,7 +449,7 @@ class TarantoolLog(object):
         while True:
             if os.path.exists(self.path):
                 break
-            time.sleep(0.001)
+            gevent.sleep(0.001)
 
         with open(self.path, 'r') as f:
             f.seek(self.log_begin, os.SEEK_SET)
@@ -460,7 +460,7 @@ class TarantoolLog(object):
                         raise TarantoolStartError(name)
                 log_str = f.readline()
                 if not log_str:
-                    time.sleep(0.001)
+                    gevent.sleep(0.001)
                     f.seek(cur_pos, os.SEEK_SET)
                     continue
                 if re.findall(msg, log_str):

I guess similar behaviour can be reproduced around other time.sleep(<...>) usages.
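To illustrate why a blocking sleep defeats a cooperative timeout, here is a minimal sketch. It deliberately uses the standard library's asyncio instead of gevent so that it runs without extra dependencies; the mechanism is analogous, not identical. The names `poll` and `run_with_timeout` are made up for this example and do not exist in test-run.

```python
import asyncio
import time


async def poll(cooperative, iterations=20, step=0.01):
    """A polling loop whose condition is never met (like a log line that never appears)."""
    for _ in range(iterations):
        if cooperative:
            await asyncio.sleep(step)  # yields control, so a timeout can fire
        else:
            time.sleep(step)           # blocks the thread, the scheduler never runs
    return "finished"


async def run_with_timeout(cooperative, timeout=0.05):
    """Run poll() under a timeout that is shorter than the loop's total duration."""
    try:
        return await asyncio.wait_for(poll(cooperative), timeout)
    except asyncio.TimeoutError:
        return "timed out"


if __name__ == "__main__":
    print(asyncio.run(run_with_timeout(cooperative=False)))
    print(asyncio.run(run_with_timeout(cooperative=True)))
```

With the blocking sleep the loop runs to completion and the timeout is never delivered; with the cooperative sleep the timeout interrupts the loop. In the gevent case the loop in TarantoolLog is unbounded, so the worker never gets a chance to handle the timeout at all.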

Totktonada · 2021-03-19T16:03:37Z

However, it would be good to have a timeout for tarantool instance startup and to report a meaningful error in that case.

This update fixes a sporadic problem with hanging test-run workers. The cause is an incorrect garbage collector handler; see [1] for details. This is not the last test-run problem that leads to a hanging worker: at least one more problem is known [2]. [1]: tarantool/test-run#275 [2]: tarantool/test-run#276 Part of tarantool/tarantool-qa#96

This update fixes a sporadic problem with hanging test-run workers. The cause is an incorrect garbage collector handler; see [1] for details. This is not the last test-run problem that leads to a hanging worker: at least one more problem is known [2]. [1]: tarantool/test-run#275 [2]: tarantool/test-run#276 Part of tarantool/tarantool-qa#96 (cherry picked from commit 680990a)

test-run checks that a tarantool server has started by searching for the pattern 'entering the event loop|will retry binding|hot standby mode' in the xlog. If the server hangs, it can be killed after the test timeout. This patch adds start-server-timeout: the pattern is now searched for until this timeout expires. If the pattern is not found, the wait_until_started function returns False (otherwise True), and TarantoolServer.start() returns the same value. If an instance hangs, the preprocessor kills the test. The default value of start-server-timeout is 90 seconds. Fixes: #276

test-run checks that a tarantool server has started by searching for the pattern 'entering the event loop|will retry binding|hot standby mode' in the xlog. If the server hangs, it can be killed after the test timeout. This patch adds start-server-timeout: the pattern is now searched for until this timeout expires. If the pattern is not found, the wait_until_started function returns False (otherwise True), and TarantoolServer.start() returns the same value. The default value of start-server-timeout is 90 seconds. Fixes: #276
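The waiting logic described above can be sketched roughly as follows. This is a simplified stand-in, not test-run's actual seek_wait or wait_until_started: the function name `wait_for_pattern` and its parameters are hypothetical. The `sleep` argument is injectable so that a cooperative sleep (for example gevent.sleep) can be passed in when running under gevent.

```python
import os
import re
import time


def wait_for_pattern(path, pattern, timeout, sleep=time.sleep, poll_interval=0.01):
    """Return True if a line matching `pattern` appears in `path` before `timeout` expires."""
    regex = re.compile(pattern)
    deadline = time.monotonic() + timeout
    offset = 0  # byte offset of the first line that has not been checked yet
    while True:
        if os.path.exists(path):
            with open(path, "rb") as f:
                f.seek(offset)
                for raw in f:
                    if not raw.endswith(b"\n"):
                        break  # partially written line, re-read it on the next pass
                    offset += len(raw)
                    if regex.search(raw.decode("utf-8", "replace")):
                        return True
        if time.monotonic() >= deadline:
            return False  # the caller can report a meaningful error
        sleep(poll_interval)
```

A caller would treat a False result as "the server failed to start within the timeout" instead of waiting forever.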

Found ASAN error:

[001] + ok 206 - =================================================================
[001] +==6889==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x604000000031 at pc 0x0000005a72e7 bp 0x7ffe47c30c80 sp 0x7ffe47c30c78
[001] +WRITE of size 1 at 0x604000000031 thread T0
[001] +    #0 0x5a72e6 in mp_store_u8 /tarantool/src/lib/msgpuck/msgpuck.h:258:1
[001] +    #1 0x5a72e6 in mp_encode_uint /tarantool/src/lib/msgpuck/msgpuck.h:1768
[001] +    #2 0x4fa657 in test_mp_print /tarantool/src/lib/msgpuck/test/msgpuck.c:957:16
[001] +    #3 0x509024 in main /tarantool/src/lib/msgpuck/test/msgpuck.c:1331:2
[001] +    #4 0x7f3658fd909a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a)
[001] +    #5 0x41f339 in _start (/tnt/test/unit/msgpack.test+0x41f339)
[001] +
[001] +0x604000000031 is located 0 bytes to the right of 33-byte region [0x604000000010,0x604000000031)
[001] +allocated by thread T0 here:
[001] +    #0 0x4cace3 in malloc (/tnt/test/unit/msgpack.test+0x4cace3)
[001] +    #1 0x4fa5db in test_mp_print /tarantool/src/lib/msgpuck/test/msgpuck.c:945:18
[001] +    #2 0x509024 in main /tarantool/src/lib/msgpuck/test/msgpuck.c:1331:2
[001] +    #3 0x7f3658fd909a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a)
[001] +
[001] +SUMMARY: AddressSanitizer: heap-buffer-overflow /tarantool/src/lib/msgpuck/msgpuck.h:258:1 in mp_store_u8
[001] +Shadow bytes around the buggy address:
[001] +  0x0c087fff7fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[001] +  0x0c087fff7fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[001] +  0x0c087fff7fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[001] +  0x0c087fff7fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[001] +  0x0c087fff7ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[001] +=>0x0c087fff8000: fa fa 00 00 00 00[01]fa fa fa fa fa fa fa fa fa
[001] +  0x0c087fff8010: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
[001] +  0x0c087fff8020: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
[001] +  0x0c087fff8030: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
[001] +  0x0c087fff8040: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
[001] +  0x0c087fff8050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
[001] +Shadow byte legend (one shadow byte represents 8 application bytes):
[001] +  Addressable: 00
[001] +  Partially addressable: 01 02 03 04 05 06 07
[001] +  Heap left redzone: fa
[001] +  Freed heap region: fd
[001] +  Stack left redzone: f1
[001] +  Stack mid redzone: f2
[001] +  Stack right redzone: f3
[001] +  Stack after return: f5
[001] +  Stack use after scope: f8
[001] +  Global redzone: f9
[001] +  Global init order: f6
[001] +  Poisoned by user: f7
[001] +  Container overflow: fc
[001] +  Array cookie: ac
[001] +  Intra object redzone: bb
[001] +  ASan internal: fe
[001] +  Left alloca redzone: ca

Investigation showed that the allocated buffer was 33 bytes, but 34 were needed. The fix was to enlarge the buffer for another mp_encode_array(1). Part of tarantool/tarantool#4360 Reviewed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>

test: obuf test refactoring
Added slab_arena_destroy for graceful release of resources, removed the global seed value, removed an unused value from the enum.

Merge pull request #136 from tbeu/patch-1
Update README.rst

test: move unit/ to test/
This virtually reverts commit 436218defd4c284134f59975d4642405bdf2d918 ('move unit tests to unit'), which was made in the scope of #106. Although the connector is tested with the `unittest` framework, the testing is functional (and integration) by its nature: most of the test cases verify that the public API of the connector works properly with tarantool. It seems meaningful to locate such test cases in the `test/` directory, not `unit/`, regardless of the framework used. Follows up #106.

Add timeout for starting tarantool server
test-run checks that a tarantool server has started by searching for the pattern 'entering the event loop|will retry binding|hot standby mode' in the xlog. If the server hangs, it can be killed after the test timeout. This patch adds start-server-timeout: the pattern is now searched for until this timeout expires. If the pattern is not found, the wait_until_started function returns False (otherwise True), and TarantoolServer.start() returns the same value. The default value of start-server-timeout is 90 seconds. Fixes: #276

RELEASE-NOTES: synced curl 7.76.0 release

Use rawset() when exporting functions to _G

test: fix directory detection in lua-Harness suite
The test <314-regex.t> uses `arg[0]:find'314'` to determine the name of the directory where the rx_* files are located. This makes the test fail when the lua-Harness suite runs in a directory containing "314" in its name, because the found path does not contain the required files. This patch fixes directory name detection. Follows up tarantool/tarantool#5844 Reviewed-by: Igor Munkin <imun@tarantool.org> Reviewed-by: Sergey Ostanevich <sergos@tarantool.org> Signed-off-by: Igor Munkin <imun@tarantool.org>

Add the chdir option for make
The --chdir flag (with help) has been added to the make command. It makes it possible to specify the source directory of the rock when running make.

Merge pull request #2435 from facebook/dev v1.4.8 hotfix

The wait_until_started function in TarantoolServer calls seek_wait, which waits for a pattern in the log file. If the pattern never appears, the server is hanging. This patch adds start-server-time (90 seconds by default). The pattern is sought until the time runs out; wait_until_started returns True if the pattern was found (False otherwise), and TarantoolServer.start() returns the same value. It also adds a log message reporting that the instance was not started. Fixes: #276

`time.sleep` was changed to `gevent.sleep` to let the current greenlet sleep while others run. With `time.sleep()`, the greenlet context is not switched from the main process to the test greenlet. As a result, the main process received no data while tarantool was hanging, and the suite failed by the common timeout (no output timeout). Using the greenlet timeout lets the test fail by the test timeout. Fixes: #276

The wait_until_started function in TarantoolServer calls seek_wait, which waits for a pattern in the log file. If the pattern never appears, the server is hanging. This patch adds start-server-time (90 seconds by default). The pattern is sought until the time runs out, and wait_until_started returns True if the pattern was found (False otherwise). It also adds a log message reporting that the instance was not started. Fixes: #276

`time.sleep` was changed to `gevent.sleep` to let the current greenlet sleep while others run. With `time.sleep()`, the greenlet context is not switched from the main process to the test greenlet. As a result, the main process received no data while tarantool was hanging, and the suite failed by the common timeout (no output timeout). Using the greenlet timeout lets the test fail by the test timeout. Part of #276

Added another way of checking patterns in the log: the test sequentially finds the expected patterns and checks that there are no unexpected lines. For this, the TarantoolLog.seek_wait() function was changed so that it can search for a pattern from a position other than the beginning (needed for a sequence of patterns). When a pattern is found, the position of its last character is saved as the starting point for the next search. This approach helps to compare the result of a hanging test. The pytest script test_hanging_xlog.py executes hang.test.lua and compares the test-run log against the expected patterns. It also checks that all subprocesses were killed; otherwise, it raises an exception with information about the remaining processes. Follows up #276
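The idea of resuming each search from the end of the previous match can be sketched on an in-memory string. The function name `find_in_order` is hypothetical and this is not the modified TarantoolLog.seek_wait(); it only shows the position bookkeeping.

```python
import re


def find_in_order(text, patterns):
    """Return True if all patterns occur in `text` in the given order, without overlapping."""
    pos = 0  # where the next search starts
    for pattern in patterns:
        match = re.compile(pattern).search(text, pos)
        if match is None:
            return False
        pos = match.end()  # continue right after the last matched character
    return True
```

For example, `find_in_order(log, ["bootstrapping replica", "ER_ACCESS_DENIED"])` succeeds only if the access error is logged after the bootstrap message.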

`time.sleep` was changed to `gevent.sleep` to let the current greenlet sleep while others run. When `time.sleep` is used, the greenlet context is not switched from the main process to the test greenlet. As a result, the main process receives no data while the tarantool server process hangs, and the test fails by the common timeout (NO_OUTPUT_TIMEOUT). Moreover, the process is not killed by test-run. With `gevent.sleep`, the test fails by the test timeout and test-run kills the tarantool server process. Part of #276

When test-run starts a tarantool server, it waits for a special pattern in the log file before proceeding. If the pattern never appears, the server is hanging, and the test fails after the test timeout runs out. This patch adds the `--start-server-timeout` option (90 seconds by default). Now, when the server hangs and the time runs out, a comprehensible exception is raised with a message saying that the server failed to start within the timeout. Fixes: #276

It was found that processes of tarantool servers that failed to start are not killed by test-run and are left hanging. This can be reproduced by creating the main server and then a replica server that is unable to join the master, for example, due to a lack of user permissions. In this case the test fails by the server start timeout, and only the main server process is killed. This patch fixes that. Follows up #276

`time.sleep` was changed to `gevent.sleep` to let the current greenlet sleep while others run. When `time.sleep` is used, the greenlet context is not switched from the main process to the test's greenlet. As a result, the main process receives no data while the tarantool server process hangs, and the test fails by the common timeout (NO_OUTPUT_TIMEOUT). Even worse, the tarantool server process is not killed by test-run. With `gevent.sleep`, the test fails by the test timeout and test-run kills the tarantool server process. Part of #276

When test-run starts a tarantool server, it waits for a special pattern in the log file before proceeding. If the pattern never appears, the server is hanging, and the test fails after the test timeout (TEST_TIMEOUT) runs out. This patch adds the `--server-start-timeout` option to test-run (90 seconds by default). Now, when the server hangs and the time (SERVER_START_TIMEOUT) runs out, a comprehensible exception is raised with a message saying that the server failed to start within the timeout. Fixes: #276
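For illustration, an option with this name and default could be declared as below. This is a hypothetical parser, not test-run's actual option handling: the fallback to a SERVER_START_TIMEOUT environment variable and the float type are assumptions made for the sketch.

```python
import argparse
import os


def build_parser(env=os.environ):
    """Build a toy parser with a --server-start-timeout option (default: 90 seconds)."""
    parser = argparse.ArgumentParser(prog="test-run-sketch")
    parser.add_argument(
        "--server-start-timeout",
        type=float,
        # Assumption: the environment variable overrides the built-in default.
        default=float(env.get("SERVER_START_TIMEOUT", 90)),
        help="seconds to wait for a server to start (default: 90)",
    )
    return parser
```

An explicit command-line value takes precedence over the environment, which in turn takes precedence over the built-in default.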

It was found that processes of tarantool servers that failed to start are not killed by test-run and are left hanging. This can be reproduced by creating the main server and then a replica server that is unable to join the master, for example, due to a lack of user permissions. In this case, the test fails by the server start timeout and test-run kills only the main server process. This patch fixes the issue. Follows up #276

It was found that processes of tarantool servers that failed to start are not killed by test-run and are left hanging. This can be reproduced by creating the main server and then a replica server that is unable to join the master, for example, due to a lack of user permissions. In this case, the test fails by the server start timeout and test-run kills only the main server process. This patch fixes the issue. Fixes #256 Follows up #276

This patch adds a simple unit test checking that if a tarantool server fails to start within a certain number of seconds, test-run raises a comprehensible exception and kills the server process. Follows up #256 Follows up #276
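The behaviour such a test checks (raise a clear error and leave no process behind) can be sketched with the standard library alone. The names `ServerStartTimeout` and `wait_started_or_kill` are hypothetical, and a sleeping Python child process stands in for a tarantool server that never becomes ready.

```python
import subprocess
import sys
import time


class ServerStartTimeout(Exception):
    """Hypothetical exception type, not test-run's actual class."""


def wait_started_or_kill(proc, is_ready, timeout, poll_interval=0.01):
    """Wait until is_ready() is true; on timeout kill `proc` and raise."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return
        time.sleep(poll_interval)
    if proc.poll() is None:  # still running: do not leave it hanging
        proc.kill()
    proc.wait()              # reap the child so no zombie remains
    raise ServerStartTimeout(
        "server failed to start within %s seconds" % timeout)


if __name__ == "__main__":
    # Stand-in for a server that never reports readiness.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    try:
        wait_started_or_kill(child, is_ready=lambda: False, timeout=0.2)
    except ServerStartTimeout as exc:
        print(exc, "| child exited:", child.poll() is not None)
```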