-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PyPy3 tests break #32
Comments
@zopefoundation/developers – Sorry for the broad notification: I need help maintaining Is someone actually using Otherwise I'll have to drop PyPy support to get the tests green again. |
Seems weird that this should break based on those dependency changes. |
Michael Howitz wrote at 2024-3-11 01:43 -0700:
@zopefoundation/developers – Sorry for the broad notification: I need help maintaining `zdaemon`.
Is someone actually using `zdaemon` on PyPy?
Would someone volunteer to look into the PyPy test failure? (It did not go away by waiting for some weeks.)
Otherwise I'll have to drop PyPy support to get the tests green again.
I assume this is a race condition.
I suggest to introduce a `sleep(...)` to give the started process
more time to do its job before we check that it has done so.
|
This is not a race condition. |
Looking at the test-code it does not make much sense.
I am not using nor have experience with PyPI, so I can not tell why. I would try to zdaemon/src/zdaemon/tests/tests.py Lines 273 to 296 in adb0dc9
|
The reason seems to be that PyPy either starts too slow, or doesn't start properly at all – that would explain why the file I think the test needs to check that the process has not exited early, or otherwise signaled an error. |
Jens W. Klein wrote at 2024-3-11 04:34 -0700:
Looking at the test-code it does not make much sense.
- The generated `t.py` in the test does an initial wait of 1sec - whats the reason?.
- The `system` call OTOH return immediately after the `zdaemon` is executed.
- Given now, for some reason, the system-call returns faster than the 1sec wait, the x-file does not exist.
I am not using nor have experience with PyPI, so I can not tell why.
I would try to `sleep(2)` in the test after the system call.
It is definitely a race condition:
I have executed 4 runs:
all tests, test_start_test_program, test_start_test_program, all tests
Only the first run showed the observed
```
File ".../zdaemon/src/zdaemon/tests/tests.py", line 295, in zdaemon.tests.tests.test_start_test_program
Failed example:
os.path.exists('x')
Expected:
True
Got:
False
```
All following runs succeeded without any errors or failures.
I thinks Jens' suggestion (adding a `sleep`) is good and should
reduce the race risk.
As an alternative, we might introduce a function
`wait_until(predicate, timeout=10)` which waits at most
*timeout* seconds for *predicate* to get true,
implemented via a `sleep` in a loop checking the predicate.
Something like this is used in `ZEO` to give test a high
race condition resilience but avoiding long test runs in the normal case.
I could come up with a corresponding PR.
|
I shouldn't answer any more as I can't offer a solution, but I can't stop. The test case wants to test the option zdaemon/src/zdaemon/README.txt Lines 296 to 298 in f299b77
So this zdaemon config should make the zdaemon/src/zdaemon/tests/tests.py Lines 189 to 192 in f299b77
https://github.com/zopefoundation/zdaemon/blob/master/src/zdaemon/tests/tests.py#L285 There is already a sleep in the code. It runs This is also the reason for the zdaemon/src/zdaemon/README.txt Lines 296 to 298 in f299b77
|
Patrick Gerken wrote at 2024-3-12 03:31 -0700:
I shouldn't answer any more as I can't offer a solution, but I can't stop.
The test case wants to test the option `start-test-program`
which:
https://github.com/zopefoundation/zdaemon/blob/f299b77eaa0cbe7512657188bef8ae14220dae2b/src/zdaemon/README.txt#L296-L298
So this zdaemon config should make the `daemon start` command not return while the file `x` does not exist.
https://github.com/zopefoundation/zdaemon/blob/f299b77eaa0cbe7512657188bef8ae14220dae2b/src/zdaemon/tests/tests.py#L189-L192
https://github.com/zopefoundation/zdaemon/blob/master/src/zdaemon/tests/tests.py#L285
There is already a sleep in the code. It runs `cat x`, if this fails, it sleeps 1 second and runs it again. Only if `cat x` does not fail any more, will it return from the 'zdaemon start` command.
This is also the reason for the `sleep(1)` in the daemon code before creating the file, this is on purpose so that it is likely the first `cat x` fails, and the loop actually gets tested.
https://github.com/zopefoundation/zdaemon/blob/f299b77eaa0cbe7512657188bef8ae14220dae2b/src/zdaemon/README.txt#L296-L298
Occasional failures (as I have observed them) is a clear indication
for some kind of race condition.
The race might be elsewhere and the proposed `sleep` might not resolve the
problem.
But I assume, the "1 s" sleep already present is simply not enough:
under special conditions, things take more than 1 s.
Unfortunately, only occasional (in contrast to deterministic)
failures are difficult to analyze.
I suggest, you play with the already present `sleep`.
If my assumption is correct, reducing the sleep time should
increase the failure probability, increasing it should reduce it.
It has been this kind of approach @navytux used to track down
some `ZEO' race conditions. It may help with `zdaemon` as well.
|
***@***.*** wrote at 2024-3-12 12:22 +0100:
Patrick Gerken wrote at 2024-3-12 03:31 -0700:
> ...
...
Occasional failures (as I have observed them) is a clear indication
for some kind of race condition.
The race might be elsewhere and the proposed `sleep` might not resolve the
problem.
But I assume, the "1 s" sleep already present is simply not enough:
under special conditions, things take more than 1 s.
I have now looked in detail at the references you have provided
and I begin to understand your point.
You expect that the `system` call only returns after "cat x" has
been successful.
However, we see that this is ocasionally not the case.
One explanation could be that `zdaemon` does not run in the foreground
but in the background. Then the `system` call could return as soon as
the daemon has been spawned. The documentation tells us that
daemon (aka background) mode is the default.
The configuration file creation you reference does not set
"daemon off".
|
***@***.*** wrote at 2024-3-12 15:16 +0100:
***@***.*** wrote at 2024-3-12 12:22 +0100:
> Patrick Gerken wrote at 2024-3-12 03:31 -0700:
>> ...
> ...
>Occasional failures (as I have observed them) is a clear indication
>for some kind of race condition.
>The race might be elsewhere and the proposed `sleep` might not resolve the
>problem.
>But I assume, the "1 s" sleep already present is simply not enough:
>under special conditions, things take more than 1 s.
I have now looked in detail at the references you have provided
and I begin to understand your point.
You expect that the `system` call only returns after "cat x" has
been successful.
However, we see that this is ocasionally not the case.
One explanation could be that `zdaemon` does not run in the foreground
but in the background. Then the `system` call could return as soon as
the daemon has been spawned. The documentation tells us that
daemon (aka background) mode is the default.
The configuration file creation you reference does not set
"daemon off".
In a local copy, I have added "daemon off" to the created configuration
and the failure is now deterministic:
an exception is raised after 60 s (likely the default "start-timeout").
I can see that the file `x` has been created but apparently
the "start-test-program" check nevertheless was not successful.
I will investigate.
|
I have also tried to reproduce it. In my case, there is no failure however. I injected various faults into the code, but nothing created a similar problem. Also because the daemon restart part of the test does not fail any more. There the |
***@***.*** wrote at 2024-3-12 15:50 +0100:
...
In a local copy, I have added "daemon off" to the created configuration
and the failure is now deterministic:
an exception is raised after 60 s (likely the default "start-timeout").
I can see that the file `x` has been created but apparently
the "start-test-program" check nevertheless was not successful.
I will investigate.
Explanation:
`test-start-program` did not work for non deamon mode
due to a bug in `zdaemon.zdctl.ZDCmd.do_start`.
`do_start` essentially had this structure:
```
get_status() # determine the `zdrun` state
if `zdrun` does not run:
flag = P_NOWAIT id daemon_mode else P_WAIT
spawn `zdrun` with flag
else:
send_action("start")
if daemon_mode:
await_start_condition()
```
This obviously is wrong because in non deamon_mode
we wait for `zdrun` to derminate which will happen only
after it has been explicitly shut down or it gave up to start the program.
I changed this locally to:
```
get_status() # determine the `zdrun` state
if `zdrun` does not run:
spawn `zdrun` with flag P_NOWAIT
else:
send_action("start")
await_start_condition()
```
i.e. we no longer wait for `zdrun` to terminate
and always check the start condition.
It is possible that we want to spawn `zdrun` with flag P_WAIT in
daemon_mode and with flag P_NOWAIT in non daemon_mode,
because in daemon_mode, `zdrun` respawns itself as a true daemon
and then exits.
In non daemon_mode, the files are not redirected to "/dev/null"
and therefore I can observe the `cat` output.
I see:
```
. cat: x: Datei oder Verzeichnis nicht gefunden
. cat: x: Datei oder Verzeichnis nicht gefunden
.
daemon process started, pid=4877
```
The "test-start-program" logic has worked correctly for
the observation above.
Thus, we still do not have an explanation for the occasional test
failures.
The most likely assupmtion is still a race condition -- but not
one we should work around in the test.
The "test-program-logic" works as follows:
`zdrun` maintains a variable `testing`, telling whether the
"start-program" test has not yet been successful.
`zdctl` sends a `get_status` request to `zdrun` to learn
about its state, among them the program's pid and the value of `testing`.
`zdctl` assumes the start condition is satisfied if it gets
a non zero pid and `testing` is false.
Our occasional test failures show that this logic can break
sometimes.
Unfortunately, I do not yet see how this is possible.
I will try to use the `transcript` option (`-t`) to
get information about the activity of a daemonize `zdrun`.
|
@d-maurer do you see a sensible way to transmit exception information from the test code to reach the logs? |
***@***.*** wrote at 2024-3-13 10:37 +0100:
...
Our occasional test failures show that this logic can break
sometimes.
Unfortunately, I do not yet see how this is possible.
I will try to use the `transcript` option (`-t`) to
get information about the activity of a daemonize `zdrun`.
The transcript for a failing test run contains:
```
cat: x: Datei oder Verzeichnis nicht gefunden
unknown pid 5654: exit status 1
cat: x: Datei oder Verzeichnis nicht gefunden
cat: x: Datei oder Verzeichnis nicht gefunden
```
Almost surely, the problem is related to the "unknown pid" message.
It is created in the `zdrun.Daemonizer.reportstatus` method
when a child with an unknown pid (different from `self.proc.pid`)
has terminated.
I will add more debug output in the hope to find out what goes on
in this case.
|
Patrick Gerken wrote at 2024-3-13 02:51 -0700:
@d-maurer do you see a sensible way to transmit exception information from the test code to reach the logs?
I do not know which exception information you speak of.
Exceptions and log records can come from
the test code, `zdctl` and `zdrun`.
Exception information from the test code is typically printed.
Likely, the same happens for exception information from `zdctl`.
Exception information from `zdrun` (the daemon manager) is
lost unless it has written a log entry.
`zope.testrunner` supports logging.
The configuration can come from the file "log.ini" in the current
working directory or from a file named via the envvor
"ZOPE_TESTRUNNER_LOG_INI".
Test code can then use normal logging; processing is
controlled by the configuration above.
The logging configuration above only affects test code;
logging for `zdrun` is controlled by an `eventlog` section
in its configuration file.
|
Are we stumbling over unknown PID from the cat command? |
Patrick Gerken wrote at 2024-3-13 04:02 -0700:
Are we stumbling over unknown PID from the cat command?
That might be possible.
The 'cat` command is activated via `subprocess.call`.
I assume that it contains a `wait` by itself.
However, we might have a race between this `wait`
and the `sigchild` handler. Potentially, sometimes
the `sigchild` handler is activated before of the local `wait` is reached.
A fix could be that we only handle the known pid in `sigchild`.
|
Patrick Gerken wrote at 2024-3-13 04:02 -0700:
Are we stumbling over unknown PID from the cat command?
You are right!
The erroneous behavior can occur in the following situation:
The "SIGCHLD" signal handler is activated for a
"test-start-program" run before its `subprocess.call` has
reached its `wait`.
In this case, the signal handler records the status for
an unknown pid which is later reported as a warning in `reportstatue`.
When `subprocess.call` reaches its `wait`, the child status has
already been fetched by the signal handler.
The `waitpid` call (in `subprocess.Popen._try_wait`) will fail
with a `ChildProcessError` which `try_wait` maps (erroneously) to a successful
call.
Therefore, the test ends even though `x` has not yet been created.
I have verified this hypothesis by changing `subprocess.call`
as follows:
```
def call(*popenargs, timeout=None, **kwargs):
"""Run command with arguments. Wait for command to complete or
timeout, then return the returncode attribute.
The arguments are the same as for the Popen constructor. Example:
retcode = call(["ls", "-l"])
"""
with Popen(*popenargs, **kwargs) as p:
try:
import time; time.sleep(0.01)
return p.wait(timeout=timeout)
except: # Including KeyboardInterrupt, wait handled that.
p.kill()
# We don't call p.wait() again as p.__exit__ does that for us.
raise
```
i.e. I increased the likelyhood that the `SIGCHLD` signal handler
is activated before the `wait` by introducing
`import time; time.sleep(0.01)`.
This makes the failure deterministic.
My idea to let the signal handler only handle the pid it knows about
did not work:
in this case, the signal handler does not remove the SIGCHLD condition
and is activated in an endless loop.
My next idea is to let the SIGCHLD signal handler record all
child exit statuses in a global structure, indexed by pid.
`zdrun.Subprocess.test` no longer uses `subprocess.call`
but instead `subprocess.Popen`. If its `wait` returns "success",
the new structure is consulted whether the true status has been
taken out by the SIGCHLD handler and then this one is used.
|
PROBLEM REPORT
What I did:
Run the tests for PyPy3 on GHA.
What I expect to happen:
green tests
What actually happened:
one test fails, see https://github.com/zopefoundation/zdaemon/actions/runs/7738159321/job/21098442355
What version of Python and Zope/Addons I am using:
Any ideas what is happening here?
The text was updated successfully, but these errors were encountered: