PyPy3 tests break #32

icemac · 2024-02-01T08:32:59Z

PROBLEM REPORT

What I did:

Run the tests for PyPy3 on GHA.

What I expect to happen:

green tests

What actually happened:

one test fails, see https://github.com/zopefoundation/zdaemon/actions/runs/7738159321/job/21098442355

What version of Python and Zope/Addons I am using:

PyPy 3.9
The difference in the dependencies are:
- cffi==1.15.1 -> 1.16.0
- hpy==0.0.4 -> 0.9.0

Any ideas what is happening here?

icemac · 2024-03-11T08:42:54Z

@zopefoundation/developers – Sorry for the broad notification: I need help maintaining zdaemon.

Is someone actually using zdaemon on PyPy?
Would someone volunteer to look into the PyPy test failure? (It did not go away by waiting for some weeks.)

Otherwise I'll have to drop PyPy support to get the tests green again.

malthe · 2024-03-11T09:03:03Z

Seems weird that this should break based on those dependency changes.

d-maurer · 2024-03-11T09:50:18Z

Michael Howitz wrote at 2024-3-11 01:43 -0700:

@zopefoundation/developers – Sorry for the broad notification: I need help maintaining `zdaemon`. Is someone actually using `zdaemon` on PyPy? Would someone volunteer to look into the PyPy test failure? (It did not go away by waiting for some weeks.) Otherwise I'll have to drop PyPy support to get the tests green again.

I assume this is a race condition. I suggest to introduce a `sleep(...)` to give the started process more time to do its job before we check that it has done so.

do3cc · 2024-03-11T10:17:30Z

This is not a race condition.
This testcase tests a function where zdaemon waits for the file x to exist.
This is a real bug, somewhere.
But I also have no solution to offer :-(

jensens · 2024-03-11T11:33:59Z

Looking at the test-code it does not make much sense.

The generated t.py in the test does an initial wait of 1sec - whats the reason?.
The system call OTOH return immediately after the zdaemon is executed.
Given now, for some reason, the system-call returns faster than the 1sec wait, the x-file does not exist.

I am not using nor have experience with PyPI, so I can not tell why.

I would try to sleep(2) in the test after the system call.

zdaemon/src/zdaemon/tests/tests.py

Lines 273 to 296 in adb0dc9

    
               >>> write('t.py', 
        
               ... ''' 
        
               ... import time 
        
               ... time.sleep(1) 
        
               ... open('x', 'w').close() 
        
               ... time.sleep(99) 
        
               ... ''') 
        
               >>> write('conf', 
        
               ... ''' 
        
               ... <runner> 
        
               ...   program %s t.py 
        
               ...   start-test-program cat x 
        
               ... </runner> 
        
               ... ''' % sys.executable) 
        
               >>> import os 
        
               >>> system("./zdaemon -Cconf start") 
        
               . . 
        
               daemon process started, pid=21446 
        
               >>> os.path.exists('x') 
        
               True

malthe · 2024-03-11T11:51:43Z

The reason seems to be that PyPy either starts too slow, or doesn't start properly at all – that would explain why the file x doesn't exist.

I think the test needs to check that the process has not exited early, or otherwise signaled an error.

d-maurer · 2024-03-12T09:25:37Z

Jens W. Klein wrote at 2024-3-11 04:34 -0700:

Looking at the test-code it does not make much sense. - The generated `t.py` in the test does an initial wait of 1sec - whats the reason?. - The `system` call OTOH return immediately after the `zdaemon` is executed. - Given now, for some reason, the system-call returns faster than the 1sec wait, the x-file does not exist. I am not using nor have experience with PyPI, so I can not tell why. I would try to `sleep(2)` in the test after the system call.

It is definitely a race condition: I have executed 4 runs: all tests, test_start_test_program, test_start_test_program, all tests Only the first run showed the observed ``` File ".../zdaemon/src/zdaemon/tests/tests.py", line 295, in zdaemon.tests.tests.test_start_test_program Failed example: os.path.exists('x') Expected: True Got: False ``` All following runs succeeded without any errors or failures. I thinks Jens' suggestion (adding a `sleep`) is good and should reduce the race risk. As an alternative, we might introduce a function `wait_until(predicate, timeout=10)` which waits at most *timeout* seconds for *predicate* to get true, implemented via a `sleep` in a loop checking the predicate. Something like this is used in `ZEO` to give test a high race condition resilience but avoiding long test runs in the normal case. I could come up with a corresponding PR.

do3cc · 2024-03-12T10:30:57Z

I shouldn't answer any more as I can't offer a solution, but I can't stop.

The test case wants to test the option start-test-program
which:

zdaemon/src/zdaemon/README.txt

Lines 296 to 298 in f299b77

    
           You can optionally supply a test program, via the ``start-test-program`` 
        
           configuration option, that is called repeatedly until it returns a 0 
        
           exit status or until a time limit, ``start-timeout``, has been reached.

So this zdaemon config should make the daemon start command not return while the file x does not exist.

zdaemon/src/zdaemon/tests/tests.py

Lines 189 to 192 in f299b77

    
               ... <runner> 
        
               ...   program %s t.py 
        
               ...   start-test-program cat x 
        
               ... </runner>

https://github.com/zopefoundation/zdaemon/blob/master/src/zdaemon/tests/tests.py#L285

There is already a sleep in the code. It runs cat x, if this fails, it sleeps 1 second and runs it again. Only if cat x does not fail any more, will it return from the 'zdaemon start` command.

This is also the reason for the sleep(1) in the daemon code before creating the file, this is on purpose so that it is likely the first cat x fails, and the loop actually gets tested.

zdaemon/src/zdaemon/README.txt

Lines 296 to 298 in f299b77

    
           You can optionally supply a test program, via the ``start-test-program`` 
        
           configuration option, that is called repeatedly until it returns a 0 
        
           exit status or until a time limit, ``start-timeout``, has been reached.

d-maurer · 2024-03-12T11:22:28Z

Patrick Gerken wrote at 2024-3-12 03:31 -0700:

I shouldn't answer any more as I can't offer a solution, but I can't stop. The test case wants to test the option `start-test-program` which: https://github.com/zopefoundation/zdaemon/blob/f299b77eaa0cbe7512657188bef8ae14220dae2b/src/zdaemon/README.txt#L296-L298 So this zdaemon config should make the `daemon start` command not return while the file `x` does not exist. https://github.com/zopefoundation/zdaemon/blob/f299b77eaa0cbe7512657188bef8ae14220dae2b/src/zdaemon/tests/tests.py#L189-L192 https://github.com/zopefoundation/zdaemon/blob/master/src/zdaemon/tests/tests.py#L285 There is already a sleep in the code. It runs `cat x`, if this fails, it sleeps 1 second and runs it again. Only if `cat x` does not fail any more, will it return from the 'zdaemon start` command. This is also the reason for the `sleep(1)` in the daemon code before creating the file, this is on purpose so that it is likely the first `cat x` fails, and the loop actually gets tested. https://github.com/zopefoundation/zdaemon/blob/f299b77eaa0cbe7512657188bef8ae14220dae2b/src/zdaemon/README.txt#L296-L298

Occasional failures (as I have observed them) is a clear indication for some kind of race condition. The race might be elsewhere and the proposed `sleep` might not resolve the problem. But I assume, the "1 s" sleep already present is simply not enough: under special conditions, things take more than 1 s. Unfortunately, only occasional (in contrast to deterministic) failures are difficult to analyze. I suggest, you play with the already present `sleep`. If my assumption is correct, reducing the sleep time should increase the failure probability, increasing it should reduce it. It has been this kind of approach @navytux used to track down some `ZEO' race conditions. It may help with `zdaemon` as well.

d-maurer · 2024-03-12T14:16:31Z

***@***.*** wrote at 2024-3-12 12:22 +0100:

Patrick Gerken wrote at 2024-3-12 03:31 -0700: > ... ... Occasional failures (as I have observed them) is a clear indication for some kind of race condition. The race might be elsewhere and the proposed `sleep` might not resolve the problem. But I assume, the "1 s" sleep already present is simply not enough: under special conditions, things take more than 1 s.

I have now looked in detail at the references you have provided and I begin to understand your point. You expect that the `system` call only returns after "cat x" has been successful. However, we see that this is ocasionally not the case. One explanation could be that `zdaemon` does not run in the foreground but in the background. Then the `system` call could return as soon as the daemon has been spawned. The documentation tells us that daemon (aka background) mode is the default. The configuration file creation you reference does not set "daemon off".

d-maurer · 2024-03-12T14:50:21Z

***@***.*** wrote at 2024-3-12 15:16 +0100:

***@***.*** wrote at 2024-3-12 12:22 +0100: > Patrick Gerken wrote at 2024-3-12 03:31 -0700: >> ... > ... >Occasional failures (as I have observed them) is a clear indication >for some kind of race condition. >The race might be elsewhere and the proposed `sleep` might not resolve the >problem. >But I assume, the "1 s" sleep already present is simply not enough: >under special conditions, things take more than 1 s. I have now looked in detail at the references you have provided and I begin to understand your point. You expect that the `system` call only returns after "cat x" has been successful. However, we see that this is ocasionally not the case. One explanation could be that `zdaemon` does not run in the foreground but in the background. Then the `system` call could return as soon as the daemon has been spawned. The documentation tells us that daemon (aka background) mode is the default. The configuration file creation you reference does not set "daemon off".

In a local copy, I have added "daemon off" to the created configuration and the failure is now deterministic: an exception is raised after 60 s (likely the default "start-timeout"). I can see that the file `x` has been created but apparently the "start-test-program" check nevertheless was not successful. I will investigate.

do3cc · 2024-03-12T15:24:44Z

I have also tried to reproduce it. In my case, there is no failure however. I injected various faults into the code, but nothing created a similar problem. Also because the daemon restart part of the test does not fail any more. There the x file is already created. Locally, on ubuntu 20.04 in wsl on windows, the code works as expected. A manual "logging" to a temporary file shows that the test gets run three times and on the third time, the x file exists. The code waits for this.

d-maurer · 2024-03-13T09:37:33Z

***@***.*** wrote at 2024-3-12 15:50 +0100:

... In a local copy, I have added "daemon off" to the created configuration and the failure is now deterministic: an exception is raised after 60 s (likely the default "start-timeout"). I can see that the file `x` has been created but apparently the "start-test-program" check nevertheless was not successful. I will investigate.

Explanation: `test-start-program` did not work for non deamon mode due to a bug in `zdaemon.zdctl.ZDCmd.do_start`. `do_start` essentially had this structure: ``` get_status() # determine the `zdrun` state if `zdrun` does not run: flag = P_NOWAIT id daemon_mode else P_WAIT spawn `zdrun` with flag else: send_action("start") if daemon_mode: await_start_condition() ``` This obviously is wrong because in non deamon_mode we wait for `zdrun` to derminate which will happen only after it has been explicitly shut down or it gave up to start the program. I changed this locally to: ``` get_status() # determine the `zdrun` state if `zdrun` does not run: spawn `zdrun` with flag P_NOWAIT else: send_action("start") await_start_condition() ``` i.e. we no longer wait for `zdrun` to terminate and always check the start condition. It is possible that we want to spawn `zdrun` with flag P_WAIT in daemon_mode and with flag P_NOWAIT in non daemon_mode, because in daemon_mode, `zdrun` respawns itself as a true daemon and then exits. In non daemon_mode, the files are not redirected to "/dev/null" and therefore I can observe the `cat` output. I see: ``` . cat: x: Datei oder Verzeichnis nicht gefunden . cat: x: Datei oder Verzeichnis nicht gefunden . daemon process started, pid=4877 ``` The "test-start-program" logic has worked correctly for the observation above. Thus, we still do not have an explanation for the occasional test failures. The most likely assupmtion is still a race condition -- but not one we should work around in the test. The "test-program-logic" works as follows: `zdrun` maintains a variable `testing`, telling whether the "start-program" test has not yet been successful. `zdctl` sends a `get_status` request to `zdrun` to learn about its state, among them the program's pid and the value of `testing`. `zdctl` assumes the start condition is satisfied if it gets a non zero pid and `testing` is false. Our occasional test failures show that this logic can break sometimes. Unfortunately, I do not yet see how this is possible. I will try to use the `transcript` option (`-t`) to get information about the activity of a daemonize `zdrun`.

do3cc · 2024-03-13T09:51:03Z

@d-maurer do you see a sensible way to transmit exception information from the test code to reach the logs?
I suspect we sometimes get an exception in the test code, the finally clause still empties the testing variable irrespective of the cause we hit the finally clause.

d-maurer · 2024-03-13T10:40:15Z

***@***.*** wrote at 2024-3-13 10:37 +0100:

... Our occasional test failures show that this logic can break sometimes. Unfortunately, I do not yet see how this is possible. I will try to use the `transcript` option (`-t`) to get information about the activity of a daemonize `zdrun`.

The transcript for a failing test run contains: ``` cat: x: Datei oder Verzeichnis nicht gefunden unknown pid 5654: exit status 1 cat: x: Datei oder Verzeichnis nicht gefunden cat: x: Datei oder Verzeichnis nicht gefunden ``` Almost surely, the problem is related to the "unknown pid" message. It is created in the `zdrun.Daemonizer.reportstatus` method when a child with an unknown pid (different from `self.proc.pid`) has terminated. I will add more debug output in the hope to find out what goes on in this case.

d-maurer · 2024-03-13T10:55:24Z

Patrick Gerken wrote at 2024-3-13 02:51 -0700:

@d-maurer do you see a sensible way to transmit exception information from the test code to reach the logs?

I do not know which exception information you speak of. Exceptions and log records can come from the test code, `zdctl` and `zdrun`. Exception information from the test code is typically printed. Likely, the same happens for exception information from `zdctl`. Exception information from `zdrun` (the daemon manager) is lost unless it has written a log entry. `zope.testrunner` supports logging. The configuration can come from the file "log.ini" in the current working directory or from a file named via the envvor "ZOPE_TESTRUNNER_LOG_INI". Test code can then use normal logging; processing is controlled by the configuration above. The logging configuration above only affects test code; logging for `zdrun` is controlled by an `eventlog` section in its configuration file.

do3cc · 2024-03-13T11:01:42Z

Are we stumbling over unknown PID from the cat command?
Disregard my previous message, it seems the transcript would solve this. I tested logging locally and the test methods log messages do not get shown. Likely because it is in the daemon that killed off stdout? That was the log target I defined.

d-maurer · 2024-03-13T11:11:07Z

Patrick Gerken wrote at 2024-3-13 04:02 -0700:

Are we stumbling over unknown PID from the cat command?

That might be possible. The 'cat` command is activated via `subprocess.call`. I assume that it contains a `wait` by itself. However, we might have a race between this `wait` and the `sigchild` handler. Potentially, sometimes the `sigchild` handler is activated before of the local `wait` is reached. A fix could be that we only handle the known pid in `sigchild`.

d-maurer · 2024-03-13T13:43:12Z

Patrick Gerken wrote at 2024-3-13 04:02 -0700:

Are we stumbling over unknown PID from the cat command?

You are right! The erroneous behavior can occur in the following situation: The "SIGCHLD" signal handler is activated for a "test-start-program" run before its `subprocess.call` has reached its `wait`. In this case, the signal handler records the status for an unknown pid which is later reported as a warning in `reportstatue`. When `subprocess.call` reaches its `wait`, the child status has already been fetched by the signal handler. The `waitpid` call (in `subprocess.Popen._try_wait`) will fail with a `ChildProcessError` which `try_wait` maps (erroneously) to a successful call. Therefore, the test ends even though `x` has not yet been created. I have verified this hypothesis by changing `subprocess.call` as follows: ``` def call(*popenargs, timeout=None, **kwargs): """Run command with arguments. Wait for command to complete or timeout, then return the returncode attribute. The arguments are the same as for the Popen constructor. Example: retcode = call(["ls", "-l"]) """ with Popen(*popenargs, **kwargs) as p: try: import time; time.sleep(0.01) return p.wait(timeout=timeout) except: # Including KeyboardInterrupt, wait handled that. p.kill() # We don't call p.wait() again as p.__exit__ does that for us. raise ``` i.e. I increased the likelyhood that the `SIGCHLD` signal handler is activated before the `wait` by introducing `import time; time.sleep(0.01)`. This makes the failure deterministic. My idea to let the signal handler only handle the pid it knows about did not work: in this case, the signal handler does not remove the SIGCHLD condition and is activated in an endless loop. My next idea is to let the SIGCHLD signal handler record all child exit statuses in a global structure, indexed by pid. `zdrun.Subprocess.test` no longer uses `subprocess.call` but instead `subprocess.Popen`. If its `wait` returns "success", the new structure is consulted whether the true status has been taken out by the SIGCHLD handler and then this one is used.

icemac added the question label Feb 1, 2024

icemac changed the title ~~PyPy tests break~~ PyPy3 tests break Feb 1, 2024

d-maurer mentioned this issue Mar 14, 2024

Fix start test program race condition #33

Merged

d-maurer closed this as completed in #33 Mar 14, 2024

d-maurer mentioned this issue Mar 15, 2024

Bug or feature?: daemon off mode handling by zdctl start #34

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PyPy3 tests break #32

PyPy3 tests break #32

icemac commented Feb 1, 2024 •

edited

icemac commented Mar 11, 2024

malthe commented Mar 11, 2024

d-maurer commented Mar 11, 2024 via email

do3cc commented Mar 11, 2024

jensens commented Mar 11, 2024

malthe commented Mar 11, 2024 •

edited

d-maurer commented Mar 12, 2024 via email

do3cc commented Mar 12, 2024

d-maurer commented Mar 12, 2024 via email

d-maurer commented Mar 12, 2024 via email

d-maurer commented Mar 12, 2024 via email

do3cc commented Mar 12, 2024

d-maurer commented Mar 13, 2024 via email

do3cc commented Mar 13, 2024

d-maurer commented Mar 13, 2024 via email

d-maurer commented Mar 13, 2024 via email

do3cc commented Mar 13, 2024

d-maurer commented Mar 13, 2024 via email

d-maurer commented Mar 13, 2024 via email

PyPy3 tests break #32

PyPy3 tests break #32

Comments

icemac commented Feb 1, 2024 • edited

PROBLEM REPORT

What I did:

What I expect to happen:

What actually happened:

What version of Python and Zope/Addons I am using:

icemac commented Mar 11, 2024

malthe commented Mar 11, 2024

d-maurer commented Mar 11, 2024 via email

do3cc commented Mar 11, 2024

jensens commented Mar 11, 2024

malthe commented Mar 11, 2024 • edited

d-maurer commented Mar 12, 2024 via email

do3cc commented Mar 12, 2024

d-maurer commented Mar 12, 2024 via email

d-maurer commented Mar 12, 2024 via email

d-maurer commented Mar 12, 2024 via email

do3cc commented Mar 12, 2024

d-maurer commented Mar 13, 2024 via email

do3cc commented Mar 13, 2024

d-maurer commented Mar 13, 2024 via email

d-maurer commented Mar 13, 2024 via email

do3cc commented Mar 13, 2024

d-maurer commented Mar 13, 2024 via email

d-maurer commented Mar 13, 2024 via email

icemac commented Feb 1, 2024 •

edited

malthe commented Mar 11, 2024 •

edited