test-run does not kill server that hangs in an instance script #256

Totktonada · 2020-12-15T20:07:03Z

How to reproduce

Mangle a supplementary (non-default) instance like so:

diff --git a/test/app-tap/debug/server.lua b/test/app-tap/debug/server.lua
index 3362d46cc..8dbd74b66 100644
--- a/test/app-tap/debug/server.lua
+++ b/test/app-tap/debug/server.lua
@@ -9,3 +9,5 @@ box.schema.func.create('debug.sourcefile')
 box.schema.func.create('debug.sourcedir')
 box.schema.user.grant('guest','execute','function','debug.sourcefile')
 box.schema.user.grant('guest','execute','function','debug.sourcedir')
+
+while true do end

And run a test (remove --test-timeout 5 for test-run prior PR #244):

./test/test-run.py --test-timeout 5 -j1 app-tap/debug.test.lua

Expected behaviour: the test will fail at reaching the test timeout and all relevant instances will be terminated / killed.

Observed behaviour: the test timeout works, but the instance is not stopped. If the test is marked as fragile and we make attempt to rerun it, we'll get the 'The daemon is already running' error.

Investigation

It seems, the reason is the following check in the TarantoolServer.stop() method:

test-run/lib/tarantool_server.py

Lines 845 to 849 in 1f6d7ba

    
           if self.status == 'started': 
        
               if not silent: 
        
                   color_stdout('The server is already started.\n', 
        
                                schema='lerror') 
        
               return

Given server is not considered as 'started', so test-run does not stop it. However the process exists and should be killed.

The attempt to just remove the check fails:

The entire output (without this check)

$ ./test/test-run.py --test-timeout 5 -j1 --force app-tap/debug.test.lua
Started ./test/test-run.py --test-timeout 5 -j1 --force app-tap/debug.test.lua
Running in parallel with 1 workers

Collecting tests in 'app'            (Found 0   tests): application server tests.
Collecting tests in 'app-tap'        (Found 1   tests): application server tests (TAP).
Collecting tests in 'box'            (Found 0   tests): Database tests.
Collecting tests in 'box-py'         (Found 0   tests): legacy python tests.
Collecting tests in 'box-tap'        (Found 0   tests): Database tests with #! using TAP.
Collecting tests in 'engine'         (Found 0   tests): tarantool multiengine tests.
Collecting tests in 'engine_long'    (Found 0   tests): tarantool engine stress tests.
Collecting tests in 'long_run-py'    (Found 0   tests): long running tests.
Collecting tests in 'luajit-tap'     (Found 0   tests): Luajit tests.
Collecting tests in 'replication'    (Found 0   tests): tarantool/box, replication.
Collecting tests in 'replication-py' (Found 0   tests): tarantool/box, replication.
Collecting tests in 'small'          (Found 0   tests): libsmall unit tests.
Collecting tests in 'sql'            (Found 0   tests): sql tests.
Collecting tests in 'sql-tap'        (Found 0   tests): Database tests with #! using TAP.
Collecting tests in 'swim'           (Found 0   tests): SWIM tests.
Collecting tests in 'unit'           (Found 0   tests): unit tests.
Collecting tests in 'vinyl'          (Found 0   tests): vinyl integration tests.
Collecting tests in 'wal_off'        (Found 0   tests): tarantool/box, wal_mode = none.
Collecting tests in 'xlog'           (Found 0   tests): tarantool write ahead log tests.
Collecting tests in 'xlog-py'        (Found 0   tests): legacy python tests.

Tarantool server information
 | Found executable at /home/alex/projects/tarantool-meta/r/t-3/src/tarantool
 | Found tarantoolctl at /home/alex/projects/tarantool-meta/r/t-3/extra/dist/tarantoolctl

 | Tarantool 2.7.0-98-g54667f6fa
 | Target: Linux-x86_64-Debug
 | Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=ON
 | Compiler: /usr/bin/cc /usr/bin/c++
 | C_FLAGS: -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-gnu-alignof-expression -fno-gnu89-inline -Wno-cast-function-type -Werror
 | CXX_FLAGS: -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Wno-cast-function-type -Werror


======================================================================================
WORKR TEST                                            PARAMS          RESULT
---------------------------------------------------------------------------------
[001] app-tap/debug.test.lua                                          Test timeout of 5 secs reached	
[001] [Instance "server"] Tarantool server failed to start
[001] 
[001] Last 15 lines of Tarantool Log file [Instance "server"][/home/alex/projects/tarantool-meta/r/t-3/test/var/001_app-tap/server.log]:
[001] 2020-12-15 22:37:46.033 [15737] main/103/server I> mapping 134217728 bytes for vinyl tuple arena...
[001] 2020-12-15 22:37:46.038 [15737] main/103/server I> instance uuid 69e0dc32-5dee-4d07-ae65-e7753c86689d
[001] 2020-12-15 22:37:46.038 [15737] iproto/101/main I> binary: bound to unix/:(socket)
[001] 2020-12-15 22:37:46.039 [15737] main/103/server I> initializing an empty data directory
[001] 2020-12-15 22:37:46.067 [15737] main/103/server I> assigned id 1 to replica 69e0dc32-5dee-4d07-ae65-e7753c86689d
[001] 2020-12-15 22:37:46.067 [15737] main/103/server I> cluster uuid d4a453ad-6d80-49bd-8350-c7ec495db6ef
[001] 2020-12-15 22:37:46.068 [15737] snapshot/101/main I> saving snapshot `/home/alex/projects/tarantool-meta/r/t-3/test/var/001_app-tap/server/00000000000000000000.snap.inprogress'
[001] 2020-12-15 22:37:46.071 [15737] snapshot/101/main I> done
 2 l/tarantool_server.py  2 l/app_server.py  l/server.py  l/test.py                                                                                                                          X
[001] 2020-12-15 22:37:46.072 [15737] main/103/server I> ready to accept requests
[001] 2020-12-15 22:37:46.072 [15737] main/103/server I> set 'log_level' configuration option to 5
[001] 2020-12-15 22:37:46.072 [15737] main/106/checkpoint_daemon I> scheduled next checkpoint for Wed Dec 16 00:35:02 2020
[001] 2020-12-15 22:37:46.072 [15737] main/103/server I> set 'listen' configuration option to "\/home\/alex\/projects\/tarantool-meta\/r\/t-3\/test\/var\/001_app-tap\/server.socket-iproto"
[001] 2020-12-15 22:37:46.072 [15737] main/103/server I> set 'log_format' configuration option to "plain"
[001] 2020-12-15 22:37:46.073 [15737] main/103/server I> set 'replication_sync_timeout' configuration option to 100
[001] 2020-12-15 22:37:46.073 [15737] main/117/console/::1:8670 I> started
[001] 
[001] Worker "001_app-tap" received the following error; stopping...
[001] Traceback (most recent call last):
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/worker.py", line 316, in run_task
[001]     task, self.server, self.inspector)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/test_suite.py", line 261, in run_test
[001]     short_status, result_checksum = test.run(server)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/test.py", line 182, in run
[001]     self.execute(server)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/app_server.py", line 77, in execute
[001]     self.finish(server, ts)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/app_server.py", line 87, in finish
[001]     ts.stop_nondefault(signal=signal.SIGKILL)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/preprocessor.py", line 442, in stop_nondefault
[001]     v.stop(silent=True, signal=signal)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/tarantool_server.py", line 1059, in stop
[001]     save_join(self.crash_detector)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/tarantool_server.py", line 52, in save_join
[001]     green_obj.get(timeout=timeout)
[001]   File "src/gevent/greenlet.py", line 790, in gevent._gevent_cgreenlet.Greenlet.get
[001]   File "src/gevent/greenlet.py", line 778, in gevent._gevent_cgreenlet.Greenlet.get
[001]   File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
[001]   File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
[001]   File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
[001]   File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
[001]   File "src/gevent/greenlet.py", line 854, in gevent._gevent_cgreenlet.Greenlet.run
[001]   File "/usr/lib64/python2.7/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
[001]     return handle(*args_tuple)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/inspector.py", line 101, in handle
[001]     result = self.parser.parse_preprocessor(line)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/preprocessor.py", line 122, in parse_preprocessor
[001]     return self.server(stype, sname, options)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/preprocessor.py", line 348, in server
[001]     return getattr(self, attr)(ctype, sname, opts)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/preprocessor.py", line 209, in server_start
[001]     wait_load=wait_load, args=args)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/tarantool_server.py", line 881, in start
[001]     self.wait_until_started(wait_load)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/tarantool_server.py", line 1106, in wait_until_started
[001]     self.logfile_pos.seek_wait(msg, p, self.name)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/tarantool_server.py", line 455, in seek_wait
[001]     raise TarantoolStartError(name)
[001] TarantoolStartError
[001] 
[001] Exception: 
---------------------------------------------------------------------------------
[Internal test-run error] The following tasks were dispatched to some worker task queue, but were not reported as done (does not matters success or fail):
- [app-tap/debug.test.lua, null]

Background information

I doubt that it is possible to get this situation in a real testing, so let's consider this issue with lowest priority. Take care to it if you'll perform some refactoring around stopping / killing tarantool instances.

Follows up #65.
Follows up #157.

The text was updated successfully, but these errors were encountered:

See a test case in #252. The idea is the following: the default server (it is the app server) executes a script, which starts a non-default server and hangs. The non-default instance fails after some time (before --test-timeout), test-run's crash detector observed this and stops the test (kills the greenlet). We kill all other non-default servers in the case, but prior to this commit, we don't kill the app server itself. Regarding the implementation. I updated AppServer.stop() to follow the way how TarantoolServer.stop() works: verbose logging, use SIGKILL after a timeout, wait for a process termination. And reused this code in the finalization of an app test. Introduced the AppTest.teardown() method. The idea is to have a method, where all finalization after AppTest.execute() occurs. Maybe we'll refactor process management in future and presence an explicit finalization method will make things more clear. I think, it is good point to close #65 and #157: most of cases, where tarantool instances are not terminated or killed, are fixed. There is kill_old_server(), which only sends SIGTERM, but we should not find alive instances here after the 'Wait until residual servers will be stopped' commit. There is #256, but it unlikely will hit us on real tests. If there will be other cases of this kind, we should handle them separately: file issue with a reproducer, investigate and fix them. I see no reason to track 'kill hung server' as the general task anymore. Fixes #65 Fixes #157

It was found that processes of non-started tarantool servers are not killed by test-run and leave to hang. This situation can be reproduced by creating the main server, then creating a replica server, but the replica server is unable to join the master, for example, due to lack of user permissions. In this case, the test fails by the server start timeout and test-run kills the main server process only. This patch fixes the issue. Fixes #256 Follows up #276

This patch adds a simple unit test checking that if a tarantool server failed to start within a certain amount of seconds, test-tun raises a comprehensible exception and kills the server process. Follows up #256 Follows up #276

It was found that processes of non-started tarantool servers are not killed by test-run and leave to hang. This situation can be reproduced by creating the main server, then creating a replica server, but the replica server is unable to join the master, for example, due to lack of user permissions. In this case, the test fails by the server start timeout and test-run kills the main server process only. This patch fixes the issue. Fixes #256 Follows up #276

This patch adds a simple unit test checking that if a tarantool server failed to start within a certain amount of seconds, test-tun raises a comprehensible exception and kills the server process. Follows up #256 Follows up #276

It was found that processes of non-started tarantool servers are not killed by test-run and leave to hang. This situation can be reproduced by creating the main server, then creating a replica server, but the replica server is unable to join the master, for example, due to lack of user permissions. In this case, the test fails by the server start timeout and test-run kills the main server process only. This patch fixes the issue. Fixes #256 Follows up #276

This patch adds a simple unit test checking that if a tarantool server failed to start within a certain amount of seconds, test-tun raises a comprehensible exception and kills the server process. Follows up #256 Follows up #276

It was found that processes of non-started tarantool servers are not killed by test-run and leave to hang. This situation can be reproduced by creating the main server, then creating a replica server, but the replica server is unable to join the master, for example, due to lack of user permissions. In this case, the test fails by the server start timeout and test-run kills the main server process only. This patch fixes the issue. Fixes #256 Follows up #276

This patch adds a simple unit test checking that if a tarantool server failed to start within a certain amount of seconds, test-tun raises a comprehensible exception and kills the server process. Follows up #256 Follows up #276

ylobankov · 2022-03-17T14:35:06Z

The provided fix in 9c2cc2d is wrong. Many tests in tarantool just install a server (without starting) and in such tests I can see the following errors:

[001] [2022-03-17 18:20:36.437767] DEBUG: [Instance proxy] Stopping the server...
[001]
[001] Test.run() received the following error:
[001] Traceback (most recent call last):
[001]   File "/Users/y.lobankov/Workspace/tarantool/test-run/lib/test.py", line 195, in run
[001]     self.execute(server)
[001]   File "/Users/y.lobankov/Workspace/tarantool/test-run/lib/tarantool_server.py", line 389, in execute
[001]     ts.stop_nondefault(signal=signal.SIGKILL)
[001]   File "/Users/y.lobankov/Workspace/tarantool/test-run/lib/preprocessor.py", line 443, in stop_nondefault
[001]     v.stop(silent=True, signal=signal)
[001]   File "/Users/y.lobankov/Workspace/tarantool/test-run/lib/tarantool_server.py", line 1026, in stop
[001]     if self.process is not None and self.process.returncode is None:
[001] AttributeError: 'TarantoolServer' object has no attribute 'process'
[001]

If the server doesn't start, the server object doesn't have the process attribute.

It was found that hanging processes of not started tarantool servers are not killed by test-run and leave to hang. This situation can be reproduced by creating the main server, then creating a replica server, but the replica server is unable to join the master, for example, due to lack of user permissions. In this case, the test fails by the server start timeout and test-run kills the main server process only. This patch fixes the issue. Fixes #256 Follows up #276

Totktonada added the bug Something isn't working label Dec 15, 2020

Totktonada mentioned this issue Dec 15, 2020

Stop instances properly #257

Merged

kyukhin added backlog teamQ labels Apr 28, 2021

kyukhin added this to the Q2-21 milestone Apr 28, 2021

kyukhin assigned VitaliyaIoffe Apr 28, 2021

Totktonada mentioned this issue Feb 18, 2022

Add timeout for starting tarantool server #302

Merged

ylobankov assigned ylobankov and unassigned VitaliyaIoffe Feb 18, 2022

ylobankov added 1sp and removed backlog labels Feb 18, 2022

Totktonada mentioned this issue Feb 23, 2022

test: fix flaky 'test_ignore_with_force_recovery' tarantool/tarantool#6843

Merged

NickVolynkin changed the title ~~test-run does not kill server that hungs in an instance script~~ test-run does not kill server that hangs in an instance script Feb 28, 2022

ylobankov closed this as completed in #302 Mar 15, 2022

ylobankov reopened this Mar 17, 2022

ylobankov mentioned this issue Mar 17, 2022

Kill hanging processes of not started servers (reworked fix) #332

Merged

ylobankov closed this as completed in #332 Mar 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test-run does not kill server that hangs in an instance script #256

test-run does not kill server that hangs in an instance script #256

Totktonada commented Dec 15, 2020

ylobankov commented Mar 17, 2022

test-run does not kill server that hangs in an instance script #256

test-run does not kill server that hangs in an instance script #256

Comments

Totktonada commented Dec 15, 2020

How to reproduce

Investigation

Background information

ylobankov commented Mar 17, 2022