test-run does not kill server that hangs in an instance script #256
Totktonada added a commit that referenced this issue on Dec 15, 2020:
See a test case in #252. The idea is the following: the default server (the app server) executes a script that starts a non-default server and hangs. The non-default instance fails after some time (before --test-timeout), test-run's crash detector observes this and stops the test (kills the greenlet). In this case we kill all other non-default servers, but prior to this commit we did not kill the app server itself.

Regarding the implementation: I updated AppServer.stop() to follow the way TarantoolServer.stop() works: verbose logging, SIGKILL after a timeout, waiting for process termination. This code is reused in the finalization of an app test.

Introduced the AppTest.teardown() method. The idea is to have a method where all finalization after AppTest.execute() occurs. Maybe we'll refactor process management in the future, and the presence of an explicit finalization method will make things clearer.

I think this is a good point to close #65 and #157: most cases where tarantool instances are not terminated or killed are fixed. There is kill_old_server(), which only sends SIGTERM, but we should not find alive instances there after the 'Wait until residual servers will be stopped' commit. There is #256, but it is unlikely to hit us in real tests. If there are other cases of this kind, we should handle them separately: file an issue with a reproducer, investigate, and fix them. I see no reason to track 'kill hung server' as a general task anymore.

Fixes #65
Fixes #157
Merged
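The SIGTERM-then-SIGKILL stop sequence the commit message describes can be sketched as follows. This is a minimal standalone sketch, not test-run's actual code: the helper name `stop_process` and its default timeout are hypothetical.

```python
import subprocess


def stop_process(proc: subprocess.Popen, timeout: float = 5.0) -> None:
    """Politely terminate a server process; escalate to SIGKILL after a
    timeout and wait until the process has actually exited.

    Hypothetical helper illustrating the stop sequence described in the
    commit message; test-run's real implementation lives in its server
    classes.
    """
    if proc.poll() is not None:
        return  # the process has already exited, nothing to do
    proc.terminate()  # SIGTERM: give the process a chance to shut down
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL after the timeout: the process ignored SIGTERM
        proc.wait()  # reap the process so it does not linger as a zombie
```

A process that ignores SIGTERM (as a hung instance script effectively does) is killed once the timeout expires; a well-behaved process exits on the first signal.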
avtikhon pushed a commit that referenced this issue on Dec 16, 2020 (same commit message as above).
avtikhon pushed a commit that referenced this issue on Dec 16, 2020 (same commit message as above).
Totktonada added a commit that referenced this issue on Dec 18, 2020 (same commit message as above).
ylobankov added a commit that referenced this issue on Feb 18, 2022:
It was found that processes of non-started tarantool servers are not killed by test-run and are left hanging. This situation can be reproduced by creating the main server and then creating a replica server that is unable to join the master, for example, due to a lack of user permissions. In this case the test fails on the server start timeout, and test-run kills only the main server process. This patch fixes the issue.

Fixes #256
Follows up #276
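The idea of the fix can be illustrated with a small sketch (hypothetical names; this is not test-run's actual code): on test failure, walk over every registered server, started or not, and kill any live process instead of stopping only the main server.

```python
import subprocess


def kill_residual_servers(servers) -> None:
    """Kill the process of every server that still has one running,
    whether or not the server ever reached the 'started' state.

    Hypothetical sketch of the fix's idea; `servers` is any iterable of
    objects with an optional `process` attribute holding a Popen.
    """
    for srv in servers:
        proc = getattr(srv, "process", None)
        if proc is not None and proc.poll() is None:
            proc.kill()  # SIGKILL: the server may never answer SIGTERM
            proc.wait()  # reap the process to avoid zombies
```

The key point is that the loop does not consult any 'started' flag: a replica that failed to join the master still owns a live process, and that process must be killed too.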
NickVolynkin changed the title from "test-run does not kill server that hungs in an instance script" to "test-run does not kill server that hangs in an instance script" on Feb 28, 2022.
ylobankov added a commit that referenced this issue on Mar 10, 2022 (same commit message as above).
ylobankov added a commit that referenced this issue on Mar 10, 2022 (same commit message as above).
ylobankov added a commit that referenced this issue on Mar 10, 2022 (same commit message as above).
ylobankov added a commit that referenced this issue on Mar 15, 2022 (same commit message as above).
The fix provided in 9c2cc2d is wrong. Many tests in tarantool just install a server (without starting it), and in such tests I can see the following errors: if the server doesn't start, the server object doesn't have the …
ylobankov added a commit that referenced this issue on Mar 17, 2022 (same commit message as above, reworded slightly).
ylobankov added a commit that referenced this issue on Mar 17, 2022 (same commit message as above, reworded slightly).
How to reproduce
Mangle a supplementary (non-default) instance script like so:
And run a test (remove `--test-timeout 5` for test-run prior to PR #244).

Expected behaviour: the test fails on reaching the test timeout, and all relevant instances are terminated or killed.

Observed behaviour: the test timeout works, but the instance is not stopped. If the test is marked as fragile and we attempt to rerun it, we get the 'The daemon is already running' error.
Investigation
It seems the reason is the following check in the `TarantoolServer.stop()` method (test-run/lib/tarantool_server.py, lines 845 to 849 at 1f6d7ba): the given server is not considered 'started', so test-run does not stop it. However, the process exists and should be killed.
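The shape of the problem can be modeled with a toy sketch (these are not test-run's real classes; all names here are illustrative): `stop()` returns early when the server never reached the 'started' state, leaking a live process, while a fixed variant kills the process whenever one exists.

```python
import subprocess


class ToyServer:
    """Toy model of the guard in question (illustrative only)."""

    def __init__(self, process=None, status="installed"):
        self.process = process  # a subprocess.Popen, or None
        self.status = status    # e.g. 'installed' vs 'started'

    def stop_buggy(self):
        # Models the check at issue: bail out for a non-'started' server.
        if self.status != "started":
            return  # bug: a live self.process is leaked here
        self.process.terminate()
        self.process.wait()

    def stop_fixed(self):
        # Kill the process whenever one exists, regardless of status.
        if self.process is not None and self.process.poll() is None:
            self.process.kill()
            self.process.wait()  # reap the process
```

With the buggy variant the process of a never-started server survives the test run, which is exactly the 'The daemon is already running' symptom on rerun.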
An attempt to simply remove the check fails (the entire output without this check was attached to the issue).
Background information
I doubt that it is possible to get this situation in real testing, so let's consider this issue lowest priority. Take care of it if you perform some refactoring around stopping / killing tarantool instances.
Follows up #65.
Follows up #157.