Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Occasional fail of zproc test, reading stdout #2007

Closed
jimklimov opened this issue May 21, 2019 · 15 comments
Closed

Occasional fail of zproc test, reading stdout #2007

jimklimov opened this issue May 21, 2019 · 15 comments
Labels

Comments

@jimklimov
Copy link
Member

From https://api.travis-ci.org/v3/job/533238563/log.txt :

/bin/sh ./libtool --mode=execute ./src/czmq_selftest
Assertion failed: (stdout_read), function zproc_test, file src/zproc.c, line 1242.
Running czmq selftests...
...
 * zargs: OK

 * zproc: make[2]: *** [check-local] Abort trap: 6

make[1]: *** [check-am] Error 2
make: *** [check-recursive] Error 1

travis_time:end:01892487:start=1558002346156497000,finish=1558002550603141000,duration=204446644000
@mbakke
Copy link

mbakke commented Jul 19, 2019

This test consistently fails for me on a 3.5 GHz Haswell Xeon system, but I am unable to reproduce it on a 2.5GHz Sandy Bridge i5.

[...]
 * zauth: OK
 * zbeacon: OK (skipping test, no UDP broadcasting)                                                
E: (czmq_selftest) 19-07-19 14:53:18 No broadcast interface found, (ZSYS_INTERFACE=)               
./src/zsp: line 150: ls: command not found                                                         
czmq_selftest: src/zproc.c:1263: zproc_test: Assertion `!zproc_running (self)' failed.
make[2]: *** [Makefile:2295: check-local] Aborted

In this particular environment, ls is only available deep in PATH, so it looks like there is some kind of race going on.

@stale
Copy link

stale bot commented Oct 17, 2019

This issue has been automatically marked as stale because it has not had recent activity for 90 days. It will be closed if no further activity occurs within 21 days. Thank you for your contributions.

@stale stale bot added the stale label Oct 17, 2019
@vyskocilm
Copy link
Contributor

Hi @jimklimov is it still the case? I've checked latest build failures at TravisCI, however zproject was not a problem.

@mbakke this is weird, AFAIK ls is not used anywhere inside.

Anyway output of make check-verbose would be nice to have.

@stale stale bot removed the stale label Oct 17, 2019
@jimklimov
Copy link
Member Author

@mvyskocil : gotta check, did not look at this for a while. However a link between faults and system speed (in test or possibly production too?) sou ds worrysome and un-21st century like ;\

@vyskocilm
Copy link
Contributor

I don't have 3.5GHz system to test. However if there is really a race condition, it must happen on slower systems as well. I wrote simple tool running zproc test million times szproc.go. So far I figured out that testcase #7 is timing senstitive and I got very weird failure on line 1239

Anyway any hints about the bug would be nice

  1. Either make check-verbose output
  2. Or even better run make debug after a failure and print me the stack trace and frames

I'll keep running the test to see what will happen

@vyskocilm
Copy link
Contributor

Hi,

I've asked a friend with an access to more CPU powerful machine to make a test. I did a small Alpine based Docker container for him, so @mbakke if you are around, feel free to run it as well

docker run --ulimit core=99999999999:99999999999 --cap-add=SYS_PTRACE vyskocilm/zsdk:alpine-3.8 make check

If reproduced, I will need the core file stored inside the container. The exact naming and location depends on a kernel.

@stale
Copy link

stale bot commented Feb 2, 2020

This issue has been automatically marked as stale because it has not had recent activity for 90 days. It will be closed if no further activity occurs within 21 days. Thank you for your contributions.

@stale stale bot added the stale label Feb 2, 2020
@bluca bluca removed the stale label Feb 2, 2020
@stale
Copy link

stale bot commented May 2, 2020

This issue has been automatically marked as stale because it has not had recent activity for 90 days. It will be closed if no further activity occurs within 21 days. Thank you for your contributions.

@stale stale bot added the stale label May 2, 2020
@bluca bluca removed the stale label May 2, 2020
@pyhalov
Copy link
Contributor

pyhalov commented Jun 23, 2020

Have 100% fails on zproc test on illumos (OpenIndiana).
It looks like timer checking for process exit status is not run. Look - there's no waitid() calls

$ truss -t waitid,kill -f ./libtool --mode=execute ./src/czmq_selftest -t zproc -v
....
27187/7:        waitid(P_PID, 27320, 0xFE54EE20, WEXITED|WTRAPPED|WNOHANG) = 0
D: 20-06-23 18:43:07 zproc_wait [27320]:        WIFEXITED, self->return_code=0
D: 20-06-23 18:43:07 zproc_wait [27320]: timeout=0
D: 20-06-23 18:43:07 zproc_wait [27320]:        !self->running=false
D: 20-06-23 18:43:07 zproc_wait [27320]: timeout=0
D: 20-06-23 18:43:07 zproc_wait [27320]:        !self->running=false
D: 20-06-23 18:43:07 zproc_test() : got stdout from helper, 10000 msec was remaining to retry
D: 20-06-23 18:43:07 zproc_test[1170] czmq is great

czmq is great

czmq is great

czmq is great

czmq is gr...
D: 20-06-23 18:43:08 zproc_test() : did not get stdout from helper, but we already have some (9200 msec remaining to retry)
D: 20-06-23 18:43:08 zproc_wait [27320]: timeout=0
D: 20-06-23 18:43:08 zproc_wait [27320]:        !self->running=false
D: 20-06-23 18:43:08 zproc_wait [27320]: timeout=5000
D: 20-06-23 18:43:08 zproc_wait [27320]:        !self->running=false
D: 20-06-23 18:43:08 zproc_wait [27320]: timeout=0
D: 20-06-23 18:43:08 zproc_wait [27320]:        !self->running=false
D: 20-06-23 18:43:08 zproc_test() : launching helper './src/zsp'
D: 20-06-23 18:43:08 API command=RUN
D: 20-06-23 18:43:08 process ./src/zsp with pid 27360 started
D: 20-06-23 18:43:08 zproc_wait [27360]: timeout=1000
D: 20-06-23 18:43:08 zproc_wait [27360]:        !self->running=true
D: 20-06-23 18:43:08 zproc_wait [27360]:        waitpid
D: 20-06-23 18:43:08 zproc_wait [27360]: timeout=0
D: 20-06-23 18:43:08 zproc_wait [27360]:        !self->running=true
D: 20-06-23 18:43:08 zproc_wait [27360]:        waitpid
D: 20-06-23 18:43:08 zproc_wait [27360]: timeout=0
D: 20-06-23 18:43:08 zproc_wait [27360]:        !self->running=true
D: 20-06-23 18:43:08 zproc_wait [27360]:        waitpid
D: 20-06-23 18:43:08 zproc_wait [27360]: timeout=0
D: 20-06-23 18:43:08 zproc_wait [27360]:        !self->running=true
D: 20-06-23 18:43:08 zproc_wait [27360]:        waitpid
27360:      Received signal #15, SIGTERM [default]
27360:        siginfo: SIGTERM pid=27187 uid=101
27187/1:        kill(27360, SIGTERM)                            = 0
D: 20-06-23 18:43:08 zproc_wait [27360]: timeout=0
D: 20-06-23 18:43:08 zproc_wait [27360]:        !self->running=true
D: 20-06-23 18:43:08 zproc_wait [27360]:        waitpid
D: 20-06-23 18:43:08 zproc_wait [27360]: timeout=1000
D: 20-06-23 18:43:08 zproc_wait [27360]:        !self->running=true
D: 20-06-23 18:43:08 zproc_wait [27360]:        waitpid
D: 20-06-23 18:43:08 zproc_wait [27360]: timeout=0
D: 20-06-23 18:43:08 zproc_wait [27360]:        !self->running=true
D: 20-06-23 18:43:08 zproc_wait [27360]:        waitpid
D: 20-06-23 18:43:08 zproc_wait [27360]: timeout=0
27187/1:            Received signal #18, SIGCLD [default]
27187/1:              siginfo: SIGCLD CLD_KILLED pid=27360 status=0x000F
D: 20-06-23 18:43:08 zproc_wait [27360]:        !self->running=true
D: 20-06-23 18:43:08 zproc_wait [27360]:        waitpid
D: 20-06-23 18:43:08 zproc_wait [27360]: timeout=0
D: 20-06-23 18:43:08 zproc_wait [27360]:        !self->running=true
D: 20-06-23 18:43:08 zproc_wait [27360]:        waitpid
27187/1:        kill(27360, SIGKILL)                            = 0
D: 20-06-23 18:43:08 zproc_wait [27360]: timeout=0
D: 20-06-23 18:43:08 zproc_wait [27360]:        !self->running=true
D: 20-06-23 18:43:08 zproc_wait [27360]:        waitpid
D: 20-06-23 18:43:08 zproc_wait [27360]: timeout=1000
D: 20-06-23 18:43:08 zproc_wait [27360]:        !self->running=true
D: 20-06-23 18:43:08 zproc_wait [27360]:        waitpid
D: 20-06-23 18:43:08 zproc_wait [27360]: timeout=0
D: 20-06-23 18:43:08 zproc_wait [27360]:        !self->running=true
D: 20-06-23 18:43:08 zproc_wait [27360]:        waitpid
D: 20-06-23 18:43:08 zproc_wait [27360]: timeout=0
D: 20-06-23 18:43:08 zproc_wait [27360]:        !self->running=true
D: 20-06-23 18:43:08 zproc_wait [27360]:        waitpid
Assertion failed: !zproc_running (self), file src/zproc.c, line 1263, function zproc_test
27187/1:            Received signal #6, SIGABRT [default]
27187/1:              siginfo: SIGABRT pid=27187 uid=101 code=-1

From core file we see that actor thread is still waiting for something:

> ::walk thread | ::findstack 
stack pointer for thread 1: 80472e8
[ 080472e8 libc.so.1`_lwp_kill+0x15() ]
  08047308 libc.so.1`raise+0x2b()
  08047358 libc.so.1`abort+0xed()
  08047588 0xfee52a44()
  08047a18 libczmq.so.4.2.0`zproc_test+0xfcf()
  08047a68 main+0x1e8()
  08047aa8 _start_crt+0x96()
  08047ac8 _start+0x1a()
stack pointer for thread 2 [ZMQbg/Reaper]: fe78eed8
[ fe78eed8 libc.so.1`__pollsys+0x15() ]
  fe78ef08 libc.so.1`poll+0x61()
  fe78ef48 libzmq.so.5.2.2`_ZN3zmq6poll_t4loopEv+0x77()
  fe78efc8 libzmq.so.5.2.2`thread_routine+0xf6()
  fe78efe8 libc.so.1`_thrp_setup+0x81()
  fe78eff8 libc.so.1`_lwp_start()
stack pointer for thread 3 [ZMQbg/IO/0]: fe67eed8
[ fe67eed8 libc.so.1`__pollsys+0x15() ]
  fe67ef08 libc.so.1`poll+0x61()
  fe67ef48 libzmq.so.5.2.2`_ZN3zmq6poll_t4loopEv+0x77()
  fe67efc8 libzmq.so.5.2.2`thread_routine+0xf6()
  fe67efe8 libc.so.1`_thrp_setup+0x81()
  fe67eff8 libc.so.1`_lwp_start()
stack pointer for thread 8: fe54edb8
[ fe54edb8 libc.so.1`__pollsys+0x15() ]
  fe54ede8 libc.so.1`poll+0x61()
  fe54eee8 libzmq.so.5.2.2`zmq_poll+0x276()
  fe54ef48 libczmq.so.4.2.0`zloop_start+0x303()
  fe54ef98 libczmq.so.4.2.0`s_zproc_actor+0x6c()
  fe54efc8 libczmq.so.4.2.0`s_thread_shim+0x2c()
  fe54efe8 libc.so.1`_thrp_setup+0x81()
  fe54eff8 libc.so.1`_lwp_start()

@pyhalov
Copy link
Contributor

pyhalov commented Jun 26, 2020

I found an issue:
https://github.com/zeromq/czmq/blob/master/src/zproc.c#L847 has

  int quit = zclock_mono () + timeout;
        while (true) {
            if (! zproc_running (self))
                break;
            if (zclock_mono () >= quit)
                break;
            zclock_sleep (200);
        }

but zclock_mono() is int64_t , so we have integer overflow here.

@sappo
Copy link
Member

sappo commented Jun 26, 2020 via email

@vyskocilm
Copy link
Contributor

@pyhalov are you sure this is the case? overflow is indeed possible, but it'd require zclock_mono() to return value close to INT64_MAX or a huge timeout value. And even if the value of quit will overflow, the zclock_mono() >= quit will end the loop immediately.

What might be a problem is the zclock_mono () go above INT64_MAX itself 🤷

@vyskocilm
Copy link
Contributor

@pyhalov sorry - I did not notice your PR 🤦 so was testing code you've fixed. Sorry.

@stale
Copy link

stale bot commented Dec 5, 2020

This issue has been automatically marked as stale because it has not had recent activity for 90 days. It will be closed if no further activity occurs within 21 days. Thank you for your contributions.

@stale stale bot added the stale label Dec 5, 2020
@stale stale bot closed this as completed Dec 26, 2020
@Apteryks
Copy link

Apteryks commented Jun 6, 2021

Confirming this as fixed in 4.2.1. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants