Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test-run does not kill server that hangs in an instance script #256

Closed
Totktonada opened this issue Dec 15, 2020 · 1 comment · Fixed by #302 or #332
Closed

test-run does not kill server that hangs in an instance script #256

Totktonada opened this issue Dec 15, 2020 · 1 comment · Fixed by #302 or #332
Assignees
Labels
bug Something isn't working
Milestone

Comments

@Totktonada
Copy link
Member

How to reproduce

Mangle a supplementary (non-default) instance like so:

diff --git a/test/app-tap/debug/server.lua b/test/app-tap/debug/server.lua
index 3362d46cc..8dbd74b66 100644
--- a/test/app-tap/debug/server.lua
+++ b/test/app-tap/debug/server.lua
@@ -9,3 +9,5 @@ box.schema.func.create('debug.sourcefile')
 box.schema.func.create('debug.sourcedir')
 box.schema.user.grant('guest','execute','function','debug.sourcefile')
 box.schema.user.grant('guest','execute','function','debug.sourcedir')
+
+while true do end

And run a test (remove --test-timeout 5 for test-run prior PR #244):

./test/test-run.py --test-timeout 5 -j1 app-tap/debug.test.lua

Expected behaviour: the test will fail at reaching the test timeout and all relevant instances will be terminated / killed.

Observed behaviour: the test timeout works, but the instance is not stopped. If the test is marked as fragile and we make attempt to rerun it, we'll get the 'The daemon is already running' error.

Investigation

It seems, the reason is the following check in the TarantoolServer.stop() method:

if self.status == 'started':
if not silent:
color_stdout('The server is already started.\n',
schema='lerror')
return

Given server is not considered as 'started', so test-run does not stop it. However the process exists and should be killed.

The attempt to just remove the check fails:

The entire output (without this check)
$ ./test/test-run.py --test-timeout 5 -j1 --force app-tap/debug.test.lua
Started ./test/test-run.py --test-timeout 5 -j1 --force app-tap/debug.test.lua
Running in parallel with 1 workers

Collecting tests in 'app'            (Found 0   tests): application server tests.
Collecting tests in 'app-tap'        (Found 1   tests): application server tests (TAP).
Collecting tests in 'box'            (Found 0   tests): Database tests.
Collecting tests in 'box-py'         (Found 0   tests): legacy python tests.
Collecting tests in 'box-tap'        (Found 0   tests): Database tests with #! using TAP.
Collecting tests in 'engine'         (Found 0   tests): tarantool multiengine tests.
Collecting tests in 'engine_long'    (Found 0   tests): tarantool engine stress tests.
Collecting tests in 'long_run-py'    (Found 0   tests): long running tests.
Collecting tests in 'luajit-tap'     (Found 0   tests): Luajit tests.
Collecting tests in 'replication'    (Found 0   tests): tarantool/box, replication.
Collecting tests in 'replication-py' (Found 0   tests): tarantool/box, replication.
Collecting tests in 'small'          (Found 0   tests): libsmall unit tests.
Collecting tests in 'sql'            (Found 0   tests): sql tests.
Collecting tests in 'sql-tap'        (Found 0   tests): Database tests with #! using TAP.
Collecting tests in 'swim'           (Found 0   tests): SWIM tests.
Collecting tests in 'unit'           (Found 0   tests): unit tests.
Collecting tests in 'vinyl'          (Found 0   tests): vinyl integration tests.
Collecting tests in 'wal_off'        (Found 0   tests): tarantool/box, wal_mode = none.
Collecting tests in 'xlog'           (Found 0   tests): tarantool write ahead log tests.
Collecting tests in 'xlog-py'        (Found 0   tests): legacy python tests.

Tarantool server information
 | Found executable at /home/alex/projects/tarantool-meta/r/t-3/src/tarantool
 | Found tarantoolctl at /home/alex/projects/tarantool-meta/r/t-3/extra/dist/tarantoolctl

 | Tarantool 2.7.0-98-g54667f6fa
 | Target: Linux-x86_64-Debug
 | Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=ON
 | Compiler: /usr/bin/cc /usr/bin/c++
 | C_FLAGS: -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-gnu-alignof-expression -fno-gnu89-inline -Wno-cast-function-type -Werror
 | CXX_FLAGS: -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Wno-cast-function-type -Werror


======================================================================================
WORKR TEST                                            PARAMS          RESULT
---------------------------------------------------------------------------------
[001] app-tap/debug.test.lua                                          Test timeout of 5 secs reached	
[001] [Instance "server"] Tarantool server failed to start
[001] 
[001] Last 15 lines of Tarantool Log file [Instance "server"][/home/alex/projects/tarantool-meta/r/t-3/test/var/001_app-tap/server.log]:
[001] 2020-12-15 22:37:46.033 [15737] main/103/server I> mapping 134217728 bytes for vinyl tuple arena...
[001] 2020-12-15 22:37:46.038 [15737] main/103/server I> instance uuid 69e0dc32-5dee-4d07-ae65-e7753c86689d
[001] 2020-12-15 22:37:46.038 [15737] iproto/101/main I> binary: bound to unix/:(socket)
[001] 2020-12-15 22:37:46.039 [15737] main/103/server I> initializing an empty data directory
[001] 2020-12-15 22:37:46.067 [15737] main/103/server I> assigned id 1 to replica 69e0dc32-5dee-4d07-ae65-e7753c86689d
[001] 2020-12-15 22:37:46.067 [15737] main/103/server I> cluster uuid d4a453ad-6d80-49bd-8350-c7ec495db6ef
[001] 2020-12-15 22:37:46.068 [15737] snapshot/101/main I> saving snapshot `/home/alex/projects/tarantool-meta/r/t-3/test/var/001_app-tap/server/00000000000000000000.snap.inprogress'
[001] 2020-12-15 22:37:46.071 [15737] snapshot/101/main I> done
 2 l/tarantool_server.py  2 l/app_server.py  l/server.py  l/test.py                                                                                                                          X
[001] 2020-12-15 22:37:46.072 [15737] main/103/server I> ready to accept requests
[001] 2020-12-15 22:37:46.072 [15737] main/103/server I> set 'log_level' configuration option to 5
[001] 2020-12-15 22:37:46.072 [15737] main/106/checkpoint_daemon I> scheduled next checkpoint for Wed Dec 16 00:35:02 2020
[001] 2020-12-15 22:37:46.072 [15737] main/103/server I> set 'listen' configuration option to "\/home\/alex\/projects\/tarantool-meta\/r\/t-3\/test\/var\/001_app-tap\/server.socket-iproto"
[001] 2020-12-15 22:37:46.072 [15737] main/103/server I> set 'log_format' configuration option to "plain"
[001] 2020-12-15 22:37:46.073 [15737] main/103/server I> set 'replication_sync_timeout' configuration option to 100
[001] 2020-12-15 22:37:46.073 [15737] main/117/console/::1:8670 I> started
[001] 
[001] Worker "001_app-tap" received the following error; stopping...
[001] Traceback (most recent call last):
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/worker.py", line 316, in run_task
[001]     task, self.server, self.inspector)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/test_suite.py", line 261, in run_test
[001]     short_status, result_checksum = test.run(server)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/test.py", line 182, in run
[001]     self.execute(server)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/app_server.py", line 77, in execute
[001]     self.finish(server, ts)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/app_server.py", line 87, in finish
[001]     ts.stop_nondefault(signal=signal.SIGKILL)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/preprocessor.py", line 442, in stop_nondefault
[001]     v.stop(silent=True, signal=signal)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/tarantool_server.py", line 1059, in stop
[001]     save_join(self.crash_detector)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/tarantool_server.py", line 52, in save_join
[001]     green_obj.get(timeout=timeout)
[001]   File "src/gevent/greenlet.py", line 790, in gevent._gevent_cgreenlet.Greenlet.get
[001]   File "src/gevent/greenlet.py", line 778, in gevent._gevent_cgreenlet.Greenlet.get
[001]   File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
[001]   File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
[001]   File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
[001]   File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
[001]   File "src/gevent/greenlet.py", line 854, in gevent._gevent_cgreenlet.Greenlet.run
[001]   File "/usr/lib64/python2.7/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
[001]     return handle(*args_tuple)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/inspector.py", line 101, in handle
[001]     result = self.parser.parse_preprocessor(line)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/preprocessor.py", line 122, in parse_preprocessor
[001]     return self.server(stype, sname, options)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/preprocessor.py", line 348, in server
[001]     return getattr(self, attr)(ctype, sname, opts)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/preprocessor.py", line 209, in server_start
[001]     wait_load=wait_load, args=args)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/tarantool_server.py", line 881, in start
[001]     self.wait_until_started(wait_load)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/tarantool_server.py", line 1106, in wait_until_started
[001]     self.logfile_pos.seek_wait(msg, p, self.name)
[001]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/tarantool_server.py", line 455, in seek_wait
[001]     raise TarantoolStartError(name)
[001] TarantoolStartError
[001] 
[001] Exception: 
---------------------------------------------------------------------------------
[Internal test-run error] The following tasks were dispatched to some worker task queue, but were not reported as done (does not matters success or fail):
- [app-tap/debug.test.lua, null]

Background information

I doubt that it is possible to get this situation in a real testing, so let's consider this issue with lowest priority. Take care to it if you'll perform some refactoring around stopping / killing tarantool instances.

Follows up #65.
Follows up #157.

@Totktonada Totktonada added the bug Something isn't working label Dec 15, 2020
Totktonada added a commit that referenced this issue Dec 15, 2020
See a test case in #252. The idea is the following: the default server
(it is the app server) executes a script, which starts a non-default
server and hangs. The non-default instance fails after some time (before
--test-timeout), test-run's crash detector observed this and stops the
test (kills the greenlet). We kill all other non-default servers in the
case, but prior to this commit, we don't kill the app server itself.

Regarding the implementation. I updated AppServer.stop() to follow the
way how TarantoolServer.stop() works: verbose logging, use SIGKILL after
a timeout, wait for a process termination. And reused this code in the
finalization of an app test.

Introduced the AppTest.teardown() method. The idea is to have a method,
where all finalization after AppTest.execute() occurs. Maybe we'll
refactor process management in future and presence an explicit
finalization method will make things more clear.

I think, it is good point to close #65 and #157: most of cases, where
tarantool instances are not terminated or killed, are fixed. There is
kill_old_server(), which only sends SIGTERM, but we should not find
alive instances here after the 'Wait until residual servers will be
stopped' commit. There is #256, but it unlikely will hit us on real
tests. If there will be other cases of this kind, we should handle them
separately: file issue with a reproducer, investigate and fix them. I
see no reason to track 'kill hung server' as the general task anymore.

Fixes #65
Fixes #157
avtikhon pushed a commit that referenced this issue Dec 16, 2020
See a test case in #252. The idea is the following: the default server
(it is the app server) executes a script, which starts a non-default
server and hangs. The non-default instance fails after some time (before
--test-timeout), test-run's crash detector observed this and stops the
test (kills the greenlet). We kill all other non-default servers in the
case, but prior to this commit, we don't kill the app server itself.

Regarding the implementation. I updated AppServer.stop() to follow the
way how TarantoolServer.stop() works: verbose logging, use SIGKILL after
a timeout, wait for a process termination. And reused this code in the
finalization of an app test.

Introduced the AppTest.teardown() method. The idea is to have a method,
where all finalization after AppTest.execute() occurs. Maybe we'll
refactor process management in future and presence an explicit
finalization method will make things more clear.

I think, it is good point to close #65 and #157: most of cases, where
tarantool instances are not terminated or killed, are fixed. There is
kill_old_server(), which only sends SIGTERM, but we should not find
alive instances here after the 'Wait until residual servers will be
stopped' commit. There is #256, but it unlikely will hit us on real
tests. If there will be other cases of this kind, we should handle them
separately: file issue with a reproducer, investigate and fix them. I
see no reason to track 'kill hung server' as the general task anymore.

Fixes #65
Fixes #157
avtikhon pushed a commit that referenced this issue Dec 16, 2020
See a test case in #252. The idea is the following: the default server
(it is the app server) executes a script, which starts a non-default
server and hangs. The non-default instance fails after some time (before
--test-timeout), test-run's crash detector observed this and stops the
test (kills the greenlet). We kill all other non-default servers in the
case, but prior to this commit, we don't kill the app server itself.

Regarding the implementation. I updated AppServer.stop() to follow the
way how TarantoolServer.stop() works: verbose logging, use SIGKILL after
a timeout, wait for a process termination. And reused this code in the
finalization of an app test.

Introduced the AppTest.teardown() method. The idea is to have a method,
where all finalization after AppTest.execute() occurs. Maybe we'll
refactor process management in future and presence an explicit
finalization method will make things more clear.

I think, it is good point to close #65 and #157: most of cases, where
tarantool instances are not terminated or killed, are fixed. There is
kill_old_server(), which only sends SIGTERM, but we should not find
alive instances here after the 'Wait until residual servers will be
stopped' commit. There is #256, but it unlikely will hit us on real
tests. If there will be other cases of this kind, we should handle them
separately: file issue with a reproducer, investigate and fix them. I
see no reason to track 'kill hung server' as the general task anymore.

Fixes #65
Fixes #157
Totktonada added a commit that referenced this issue Dec 18, 2020
See a test case in #252. The idea is the following: the default server
(it is the app server) executes a script, which starts a non-default
server and hangs. The non-default instance fails after some time (before
--test-timeout), test-run's crash detector observed this and stops the
test (kills the greenlet). We kill all other non-default servers in the
case, but prior to this commit, we don't kill the app server itself.

Regarding the implementation. I updated AppServer.stop() to follow the
way how TarantoolServer.stop() works: verbose logging, use SIGKILL after
a timeout, wait for a process termination. And reused this code in the
finalization of an app test.

Introduced the AppTest.teardown() method. The idea is to have a method,
where all finalization after AppTest.execute() occurs. Maybe we'll
refactor process management in future and presence an explicit
finalization method will make things more clear.

I think, it is good point to close #65 and #157: most of cases, where
tarantool instances are not terminated or killed, are fixed. There is
kill_old_server(), which only sends SIGTERM, but we should not find
alive instances here after the 'Wait until residual servers will be
stopped' commit. There is #256, but it unlikely will hit us on real
tests. If there will be other cases of this kind, we should handle them
separately: file issue with a reproducer, investigate and fix them. I
see no reason to track 'kill hung server' as the general task anymore.

Fixes #65
Fixes #157
@kyukhin kyukhin added this to the Q2-21 milestone Apr 28, 2021
ylobankov added a commit that referenced this issue Feb 18, 2022
It was found that processes of non-started tarantool servers are not
killed by test-run and leave to hang. This situation can be reproduced
by creating the main server, then creating a replica server, but the
replica server is unable to join the master, for example, due to lack
of user permissions. In this case, the test fails by the server start
timeout and test-run kills the main server process only. This patch
fixes the issue.

Fixes #256
Follows up #276
@ylobankov ylobankov assigned ylobankov and unassigned VitaliyaIoffe Feb 18, 2022
@ylobankov ylobankov added 1sp and removed backlog labels Feb 18, 2022
@NickVolynkin NickVolynkin changed the title test-run does not kill server that hungs in an instance script test-run does not kill server that hangs in an instance script Feb 28, 2022
ylobankov added a commit that referenced this issue Mar 10, 2022
It was found that processes of non-started tarantool servers are not
killed by test-run and leave to hang. This situation can be reproduced
by creating the main server, then creating a replica server, but the
replica server is unable to join the master, for example, due to lack
of user permissions. In this case, the test fails by the server start
timeout and test-run kills the main server process only. This patch
fixes the issue.

Fixes #256
Follows up #276
ylobankov added a commit that referenced this issue Mar 10, 2022
This patch adds a simple unit test checking that if a tarantool server
failed to start within a certain amount of seconds, test-tun raises a
comprehensible exception and kills the server process.

Follows up #256
Follows up #276
ylobankov added a commit that referenced this issue Mar 10, 2022
It was found that processes of non-started tarantool servers are not
killed by test-run and leave to hang. This situation can be reproduced
by creating the main server, then creating a replica server, but the
replica server is unable to join the master, for example, due to lack
of user permissions. In this case, the test fails by the server start
timeout and test-run kills the main server process only. This patch
fixes the issue.

Fixes #256
Follows up #276
ylobankov added a commit that referenced this issue Mar 10, 2022
This patch adds a simple unit test checking that if a tarantool server
failed to start within a certain amount of seconds, test-tun raises a
comprehensible exception and kills the server process.

Follows up #256
Follows up #276
ylobankov added a commit that referenced this issue Mar 10, 2022
It was found that processes of non-started tarantool servers are not
killed by test-run and leave to hang. This situation can be reproduced
by creating the main server, then creating a replica server, but the
replica server is unable to join the master, for example, due to lack
of user permissions. In this case, the test fails by the server start
timeout and test-run kills the main server process only. This patch
fixes the issue.

Fixes #256
Follows up #276
ylobankov added a commit that referenced this issue Mar 10, 2022
This patch adds a simple unit test checking that if a tarantool server
failed to start within a certain amount of seconds, test-tun raises a
comprehensible exception and kills the server process.

Follows up #256
Follows up #276
ylobankov added a commit that referenced this issue Mar 11, 2022
This patch adds a simple unit test checking that if a tarantool server
failed to start within a certain amount of seconds, test-tun raises a
comprehensible exception and kills the server process.

Follows up #256
Follows up #276
ylobankov added a commit that referenced this issue Mar 15, 2022
It was found that processes of non-started tarantool servers are not
killed by test-run and leave to hang. This situation can be reproduced
by creating the main server, then creating a replica server, but the
replica server is unable to join the master, for example, due to lack
of user permissions. In this case, the test fails by the server start
timeout and test-run kills the main server process only. This patch
fixes the issue.

Fixes #256
Follows up #276
ylobankov added a commit that referenced this issue Mar 15, 2022
This patch adds a simple unit test checking that if a tarantool server
failed to start within a certain amount of seconds, test-tun raises a
comprehensible exception and kills the server process.

Follows up #256
Follows up #276
@ylobankov
Copy link
Contributor

The provided fix in 9c2cc2d is wrong. Many tests in tarantool just install a server (without starting) and in such tests I can see the following errors:

[001] [2022-03-17 18:20:36.437767] DEBUG: [Instance proxy] Stopping the server...
[001]
[001] Test.run() received the following error:
[001] Traceback (most recent call last):
[001]   File "/Users/y.lobankov/Workspace/tarantool/test-run/lib/test.py", line 195, in run
[001]     self.execute(server)
[001]   File "/Users/y.lobankov/Workspace/tarantool/test-run/lib/tarantool_server.py", line 389, in execute
[001]     ts.stop_nondefault(signal=signal.SIGKILL)
[001]   File "/Users/y.lobankov/Workspace/tarantool/test-run/lib/preprocessor.py", line 443, in stop_nondefault
[001]     v.stop(silent=True, signal=signal)
[001]   File "/Users/y.lobankov/Workspace/tarantool/test-run/lib/tarantool_server.py", line 1026, in stop
[001]     if self.process is not None and self.process.returncode is None:
[001] AttributeError: 'TarantoolServer' object has no attribute 'process'
[001]

If the server doesn't start, the server object doesn't have the process attribute.

@ylobankov ylobankov reopened this Mar 17, 2022
ylobankov added a commit that referenced this issue Mar 17, 2022
It was found that hanging processes of not started tarantool servers
are not killed by test-run and leave to hang. This situation can be
reproduced by creating the main server, then creating a replica server,
but the replica server is unable to join the master, for example, due
to lack of user permissions. In this case, the test fails by the server
start timeout and test-run kills the main server process only.
This patch fixes the issue.

Fixes #256
Follows up #276
ylobankov added a commit that referenced this issue Mar 17, 2022
It was found that hanging processes of not started tarantool servers
are not killed by test-run and leave to hang. This situation can be
reproduced by creating the main server, then creating a replica server,
but the replica server is unable to join the master, for example, due
to lack of user permissions. In this case, the test fails by the server
start timeout and test-run kills the main server process only.
This patch fixes the issue.

Fixes #256
Follows up #276
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
4 participants