TEST-10-ISSUE-2467 constantly fails on Bionic #19154
On my local machine (Fedora 33 x86_64, socat-1.7.4.1-1.fc33.x86_64), the test succeeds with the following log.
No.
Right.
Ubuntu CI uses socat-1.7.3.2-2ubuntu2, which appears to have been released almost three years ago. See http://changelogs.ubuntu.com/changelogs/pool/main/s/socat/socat_1.7.3.2-2ubuntu2/changelog.
Hmpf, this is strange: it doesn't happen every time, but it happens quite often.
Have you ever seen it on CentOS?
Nope, never seen it before. I'll try to investigate it further though.
I don't see it on Debian either |
The only suspicious thing I see is the
The only workaround which comes to mind would be to use the `cool-write` socat option:

```diff
diff --git a/test/units/testsuite-10.service b/test/units/testsuite-10.service
index 24f0da3..72f19f1 100644
--- a/test/units/testsuite-10.service
+++ b/test/units/testsuite-10.service
@@ -4,4 +4,4 @@ Description=TEST-10-ISSUE-2467
 [Service]
 ExecStartPre=rm -f /failed /testok
 Type=oneshot
-ExecStart=sh -e -x -c 'rm -f /tmp/nonexistent; systemctl start test10.socket; printf x >test.file; socat -t20 OPEN:test.file UNIX-CONNECT:/run/test.ctl; >/testok'
+ExecStart=sh -e -x -c 'rm -f /tmp/nonexistent; systemctl start test10.socket; printf x >test.file; socat -t20 OPEN:test.file UNIX-CONNECT:/run/test.ctl,cool-write; >/testok'
```

This option makes socat treat a failed write (EPIPE or ECONNRESET) as non-fatal.
It depends on whether it's one of our changes that introduced the regression, or if it's environmental - if it's the latter, then it's probably OK to mask it, given the repro environment seems very narrow?
I'll try to bisect it later today to see whether it's the former.
So, after a painfully long bisecting session, everything points to e355fb6. The related PR even mentions the TEST-10 failures: #19129 (comment) - however, I'm puzzled as well why this particular commit causes the issue. I did a bisect between tags v247 and v248 (since I couldn't reproduce the issue on v247), and on every bisected commit I ran TEST-10 ~100 times (the issue is usually reproducible in the first ~60 runs). After reverting e355fb6 I can't reproduce the issue anymore.
bisect log
Thanks for that - yes, I have no idea how log_trace could affect a socket unit and socat. Timing issue? Is LOG_TRACE enabled in these builds? @keszybz any clue?
Looks like a timing issue. Just to be sure, I tested the latest master with
Oh boy... Shall we set -Dlog-trace=true in the CI then, so that at least we get it back to green?
I'd give it a go to see if it indeed resolves the issue (I might've been just extremely lucky).
Ok, pushed: https://salsa.debian.org/systemd-team/systemd/-/commit/b3938f0a1d3ca4d0fc19b941655c0e9a905b9c82 We should see soon if it makes a difference in the next few PRs.
Hah, I was just browsing through CentOS CI journals (looking for a completely different bug) and noticed that this issue is present on Arch Linux as well, but it's far less frequent (and mitigated thanks to the task retry mechanism):
At least one mystery solved :-)
Ah, that's good to know - CIs are back to green now, so our suspicions were correct. Still no idea about the root cause though...
That'd be my guess too.
So the test is broken, right?
I am unsure of what's broken here - the test is quite simple and it seems to me it's not doing anything weird, just running socat. I do not know why the pipe gets broken without tracing.
Yeah, I don't grok this.
This last write triggers the "Incoming data" thing and pid1 starts the service. In the failing case, the write fails with EPIPE. I don't see any race here, because it's the first write that fails. @poettering, any ideas?
Its absence appears to cause TEST-10-ISSUE-2467 to reliably fail. Enable it while we figure it out. It is probably also helpful to have it always turned on in the CI, to get more info out of test runs. See: systemd#19154
As per @mbiebl, it looks like this is still an issue (without tracing) - but I still haven't got the faintest clue why socat fails like that.
I ran with
A "good" run:
A "bad" run:
The main difference:
vs
Which basically matches the current findings. On this particular system, the failure rate is, I'd say, about 20%.
What I don't quite understand is what this test is supposed to check. Shouldn't the test rather be something like this:
I honestly don't understand the test. |
I'm not sure; to check that the socket unit actually did its job? Want to send a PR to implement the change you suggested?
@evverx ^ Could you chime in here? It's quite possible that I'm missing some finer details here, like why the return code of socat is important and why the test was written the way it is now.
What does "did its job" mean in this particular case?
Fwiw, if you want to make the test fail (100%) reliably, you can replace
Since the default
As far as I can remember, both
My guess would be that
@evverx
I don't think they were. They just kind of made the test work at the time, probably by relying on implementation details that apparently shouldn't be relied on.
Yes, I would, but I think that
This seems the most likely explanation. Since the test intentionally triggers a test10.service failure, which then causes test10.socket to fail (and stop), it's just a race condition: when socat performs the connect() to the unit socket, if systemd runs through all the test10.service restart attempts and then stops test10.socket before socat reaches write(), then socat will get EPIPE from systemd, since systemd has already closed its end of the socket. That would seem to match the 'passing' case when log-trace is enabled, which presumably slows down systemd enough that the socket unit doesn't close before socat reaches write().
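The race described above is easy to reproduce outside of systemd and socat. A minimal Python sketch, using a plain AF_UNIX socketpair to stand in for the unit socket (nothing here is systemd API):

```python
import errno
import socket

# Emulate the race: the "server" (systemd) has the connection,
# then closes its end before the "client" (socat) reaches write().
server, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
server.close()  # as if test10.socket was stopped after the trigger limit was hit

try:
    client.send(b"x")                   # socat's single-byte payload write
    result = "ok"
except OSError as e:                    # BrokenPipeError is a subclass of OSError
    result = errno.errorcode[e.errno]   # the write fails with EPIPE
finally:
    client.close()

print(result)
```

Here the first write already fails, matching the observation above that there is no later race between multiple writes: once the peer has closed its end, any write on a stream socket yields EPIPE.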
Yes, since we know that the socket will be closing at some point after the connect(), I don't think we should care about anything that happens to socat/nc after that first connect(). In fact, even if the connect() fails, we shouldn't care about that.
I'm always somewhat concerned about arbitrary delays in test code, since testbed performance can vary so much, but this test really should move very quickly, so a few seconds' sleep should be enough.
Yep, LGTM.
Nod. Fwiw, in this test run it took about 30ms until the trigger limit was hit. We could start with something like 10s to be extra safe, and increase it should we still run into issues.
Depending on the timing, socat will either get ECONNREFUSED or EPIPE from systemd. The latter will cause it to exit(1) and subsequently the test to fail. We are not actually interested in the return code of socat, though. The test is supposed to check whether rate limiting of a socket unit works properly. So ignore any failures from the socat invocation and instead check whether test10.socket is in state "failed" with result "trigger-limit-hit" after it has been triggered. TriggerLimitIntervalSec= is set to 2s by default. A "sleep 10" should give systemd enough time, even on slower machines, to reach the trigger limit. For better readability, break the test into separate ExecStart lines. Fixes systemd#19154.
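Sketched as a unit file, the reworked test could look roughly like this. This is a paraphrase of the commit message, not the actual committed file, and it assumes `systemctl show --property=Result --value` for reading the socket unit's result:

```ini
[Unit]
Description=TEST-10-ISSUE-2467

[Service]
ExecStartPre=rm -f /failed /testok
Type=oneshot
# Trigger the socket; ignore socat's exit status, since depending on
# timing it may get ECONNREFUSED or EPIPE.
ExecStart=sh -e -x -c 'systemctl start test10.socket; printf x >test.file; socat -t20 OPEN:test.file UNIX-CONNECT:/run/test.ctl || true'
# TriggerLimitIntervalSec= defaults to 2s; 10s leaves headroom on slow machines.
ExecStart=sleep 10
# The socket unit should have failed with result "trigger-limit-hit".
ExecStart=sh -e -x -c 'test "$(systemctl show test10.socket --property=Result --value)" = trigger-limit-hit'
ExecStart=sh -c '>/testok'
```

With Type=oneshot, multiple ExecStart= lines run sequentially and each must succeed, so the `test` in the third line is what actually decides pass/fail.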
Writing a byte to test10.socket is actually the root cause of issue systemd#19154: depending on the timing, it's possible that PID1 closes the socket before socat (or nc, it doesn't matter which tool is actually used) tries to write that one byte to it. In this case, writing to the socket returns EPIPE, which causes socat to exit(1) and subsequently makes the test fail. Since we're only interested in connecting to the socket and triggering its rate limit, this patch removes the parts that write the single byte to the socket, which should remove the race for good. Since it shouldn't matter whether the test uses socat or nc, let's switch back to nc and hence remove the sole user of socat. The exit status of nc is ignored, however, because some versions might choke when the socket is closed unexpectedly.
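The connect-only approach can be sketched the same way: trigger by connecting and never write, so there is no write() left for EPIPE to hit even when the peer closes first. Plain Python sockets stand in for test10.socket and nc here; this is an illustration, not systemd code:

```python
import os
import socket
import tempfile

# A listening AF_UNIX socket playing the role of test10.socket.
path = os.path.join(tempfile.mkdtemp(), "test.ctl")
listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
listener.bind(path)
listener.listen(1)

# The "nc" side: connecting is the trigger; nothing is ever written.
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(path)
conn, _ = listener.accept()
conn.close()      # peer vanishes immediately, like pid1 stopping the unit
client.close()    # no write() was attempted, so no EPIPE is possible
listener.close()
os.unlink(path)

result = "triggered-without-epipe"
print(result)
```

Closing a socket whose peer already closed is always safe, which is why dropping the one-byte write removes the race entirely rather than just narrowing its window.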
TEST-10-ISSUE-2467 has been failing for the past week, and as far as I can see it first started in this PR (I might be wrong):
#19116
But I fail to see the correlation.
The journal says:
The test is:
https://github.com/systemd/systemd/blob/main/test/units/testsuite-10.service
https://github.com/systemd/systemd/tree/main/test/testsuite-10.units
As far as I can see, everything is working as intended: the test ensures that with a false condition, socket activation doesn't loop forever.
I think what's new is that socat is failing and returning an error, while before it was not?