
TEST-10-ISSUE-2467 constantly fails on Bionic #19154

Closed
bluca opened this issue Mar 30, 2021 · 41 comments · Fixed by #20407

@bluca
Member

bluca commented Mar 30, 2021

TEST-10-ISSUE-2467 has been failing for the past week, and as far as I can see it first started with this PR (I might be wrong):

#19116

But I fail to see the correlation.

The journal says:

Mar 26 22:13:19 systemd-testsuite sh[29]: + socat -t20 OPEN:test.file UNIX-CONNECT:/run/test.ctl
Mar 26 22:13:19 systemd-testsuite systemd[1]: Bus private-bus-connection: changing state RUNNING → CLOSING
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.socket: Failed to send unit change signal for test10.socket: Connection reset by peer
Mar 26 22:13:19 systemd-testsuite systemd[1]: Bus private-bus-connection: changing state CLOSING → CLOSED
Mar 26 22:13:19 systemd-testsuite systemd[1]: Got disconnect on private connection.
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.socket: Incoming traffic
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: Trying to enqueue job test10.service/start/replace
Mar 26 22:13:19 systemd-testsuite systemd[1]: systemd-hwdb-update.service: Cannot add dependency job, ignoring: Unit systemd-hwdb-update.service is masked.
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: Installed new job test10.service/start as 122
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: Enqueued job test10.service/start as 122
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.socket: Changed listening -> running
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: ConditionPathExistsGlob=/tmp/nonexistent failed.
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: Starting requested but condition failed. Not starting unit.
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: Job 122 test10.service/start finished, result=done
Mar 26 22:13:19 systemd-testsuite systemd[1]: Condition check resulted in test10.service being skipped.
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.socket: Changed running -> listening
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.socket: Incoming traffic
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: Trying to enqueue job test10.service/start/replace
Mar 26 22:13:19 systemd-testsuite systemd[1]: systemd-hwdb-update.service: Cannot add dependency job, ignoring: Unit systemd-hwdb-update.service is masked.
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: Installed new job test10.service/start as 170
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: Enqueued job test10.service/start as 170
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.socket: Changed listening -> running
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: ConditionPathExistsGlob=/tmp/nonexistent failed.
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: Starting requested but condition failed. Not starting unit.
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: Job 170 test10.service/start finished, result=done
Mar 26 22:13:19 systemd-testsuite systemd[1]: Condition check resulted in test10.service being skipped.
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.socket: Changed running -> listening
[... the same cycle repeats, with start jobs 218 through 794 ...]
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.socket: Incoming traffic
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: Trying to enqueue job test10.service/start/replace
Mar 26 22:13:19 systemd-testsuite systemd[1]: systemd-hwdb-update.service: Cannot add dependency job, ignoring: Unit systemd-hwdb-update.service is masked.
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: Installed new job test10.service/start as 842
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: Enqueued job test10.service/start as 842
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.socket: Changed listening -> running
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: ConditionPathExistsGlob=/tmp/nonexistent failed.
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: Starting requested but condition failed. Not starting unit.
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.service: Job 842 test10.service/start finished, result=done
Mar 26 22:13:19 systemd-testsuite sh[33]: 2021/03/26 22:13:19 socat[33] E write(6, 0x561db83cd3e0, 1): Broken pipe
Mar 26 22:13:19 systemd-testsuite systemd[1]: Condition check resulted in test10.service being skipped.
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.socket: Changed running -> listening
[... four more identical cycles, start jobs 890 through 1034 ...]
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.socket: Incoming traffic
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.socket: Trigger limit hit, refusing further activation.
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.socket: Failed with result 'trigger-limit-hit'.
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.socket: Changed listening -> failed
Mar 26 22:13:19 systemd-testsuite systemd[1]: test10.socket: Unit entered failed state.
Mar 26 22:13:19 systemd-testsuite systemd[1]: Received SIGCHLD from PID 29 (sh).
Mar 26 22:13:19 systemd-testsuite systemd[1]: Child 29 (sh) died (code=exited, status=1/FAILURE)
Mar 26 22:13:19 systemd-testsuite systemd[1]: testsuite-10.service: Failed to read oom_kill field of memory.events cgroup attribute: No such file or directory
Mar 26 22:13:19 systemd-testsuite systemd[1]: testsuite-10.service: Child 29 belongs to testsuite-10.service.
Mar 26 22:13:19 systemd-testsuite systemd[1]: testsuite-10.service: Main process exited, code=exited, status=1/FAILURE
Mar 26 22:13:19 systemd-testsuite systemd[1]: testsuite-10.service: Failed with result 'exit-code'.
Mar 26 22:13:19 systemd-testsuite systemd[1]: testsuite-10.service: Service will not restart (restart setting)
Mar 26 22:13:19 systemd-testsuite systemd[1]: testsuite-10.service: Changed start -> failed
Mar 26 22:13:19 systemd-testsuite systemd[1]: testsuite-10.service: Job 74 testsuite-10.service/start finished, result=failed
Mar 26 22:13:19 systemd-testsuite systemd[1]: Failed to start TEST-10-ISSUE-2467.
Mar 26 22:13:19 systemd-testsuite systemd[1]: testsuite-10.service: Unit entered failed state.
Mar 26 22:13:19 systemd-testsuite systemd[1]: testsuite-10.service: Control group is empty.

The test is:

https://github.com/systemd/systemd/blob/main/test/units/testsuite-10.service
https://github.com/systemd/systemd/tree/main/test/testsuite-10.units

As far as I can see, everything is working as intended: the test ensures that with a false condition, socket activation doesn't loop forever.
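Reconstructed from the log above (this is a sketch, not a copy of the actual files, which are linked above), the pair boils down to a socket unit triggering a service whose condition can never pass:

```ini
# test10.socket (sketch)
[Socket]
ListenStream=/run/test.ctl

# test10.service (sketch)
[Unit]
ConditionPathExistsGlob=/tmp/nonexistent

[Service]
ExecStart=/bin/true
```

Each incoming connection re-triggers the service, the condition fails, the socket goes back to listening, and eventually the trigger limit stops the loop, which is exactly the behaviour the test wants to see.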

I think what's new is that socat now fails and returns an error, where before it did not:

socat[33] E write(6, 0x561db83cd3e0, 1): Broken pipe
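For what it's worth, the broken pipe itself is easy to provoke outside systemd: if the accepting side closes the connection without ever reading (which is effectively what PID 1 does here once the service is skipped), the client's next write fails. A minimal Python sketch, with a throwaway socket path standing in for /run/test.ctl (nothing here is taken from the actual test):

```python
import errno
import os
import socket
import tempfile

# A unix stream server that accepts a connection and immediately closes it,
# mimicking a socket-activated service that is skipped and never reads.
path = os.path.join(tempfile.mkdtemp(), "test.ctl")  # hypothetical path

srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(path)
srv.listen(1)

cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(path)

conn, _ = srv.accept()
conn.close()  # the "service" goes away without reading anything

err = None
try:
    for _ in range(4):  # an early write may still be accepted before the error surfaces
        cli.send(b"x")
except OSError as e:
    err = e

# On Linux this typically fails with EPIPE ("Broken pipe"), like socat's write(6, ...) above.
print("client write failed with:", errno.errorcode[err.errno])
```

So the interesting question is not why the write fails, but why older socat versions apparently exited 0 in this situation while 1.7.4.x does not.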
@yuwata
Member

yuwata commented Mar 30, 2021

On my local machine (Fedora 33 x86_64, socat-1.7.4.1-1.fc33.x86_64), the test succeeds with the following log.

Mar 30 20:36:28 systemd-testsuite systemd[1]: test10.socket: Incoming traffic
Mar 30 20:36:28 systemd-testsuite systemd[1]: test10.socket: Trigger limit hit, refusing further activation.
Mar 30 20:36:28 systemd-testsuite systemd[1]: test10.socket: Failed with result 'trigger-limit-hit'.
Mar 30 20:36:28 systemd-testsuite systemd[1]: test10.socket: Changed listening -> failed
Mar 30 20:36:28 systemd-testsuite systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=187 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
Mar 30 20:36:28 systemd-testsuite systemd[1]: test10.socket: Unit entered failed state.
Mar 30 20:36:28 systemd-testsuite systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/test10_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=188 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
Mar 30 20:36:28 systemd-testsuite systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/test10_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=189 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
Mar 30 20:36:28 systemd-testsuite systemd[1]: systemd-journald.service: Received EPOLLHUP on stored fd 53 (stored), closing.
Mar 30 20:36:28 systemd-testsuite systemd[1]: Received SIGCHLD from PID 458 (sh).
Mar 30 20:36:28 systemd-testsuite systemd[1]: Child 458 (sh) died (code=exited, status=0/SUCCESS)
Mar 30 20:36:28 systemd-testsuite systemd[1]: testsuite-10.service: Child 458 belongs to testsuite-10.service.
Mar 30 20:36:28 systemd-testsuite systemd[1]: testsuite-10.service: Main process exited, code=exited, status=0/SUCCESS
Mar 30 20:36:28 systemd-testsuite systemd[1]: testsuite-10.service: Deactivated successfully.
Mar 30 20:36:28 systemd-testsuite systemd[1]: testsuite-10.service: Service will not restart (restart setting)
Mar 30 20:36:28 systemd-testsuite systemd[1]: testsuite-10.service: Changed start -> dead
Mar 30 20:36:28 systemd-testsuite systemd[1]: testsuite-10.service: Job 189 testsuite-10.service/start finished, result=done

@bluca
Member Author

bluca commented Mar 30, 2021

No Broken pipe message, I assume?

@yuwata
Member

yuwata commented Mar 30, 2021

No Broken pipe message, I assume?

Right.

@yuwata
Member

yuwata commented Mar 30, 2021

Ubuntu CI uses socat-1.7.3.2-2ubuntu2, which appears to have been released almost three years ago. See http://changelogs.ubuntu.com/changelogs/pool/main/s/socat/socat_1.7.3.2-2ubuntu2/changelog.

@mrc0mmand
Member

Hmpf, this is strange: it doesn't happen every time, but it happens quite often. Running socat with -dddd wasn't helpful:

[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D malloc() -> 0x559df20e39f0
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D stat64(test.file, 0x7fff4c321eb0)
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D stat64(, {66304,2012,0100644,1,0,0,0,1,4096,8,...}) -> 0
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] N opening regular file "test.file" for reading and writing
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D open("test.file", 02, 0666)
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] I open("test.file", 02, 0666) -> 5
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D isatty(5)
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D isatty() -> 0
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D fcntl(5, 2, 1)
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D fcntl() -> 0
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D calloc(1, 848)
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D calloc() -> 0x559df20e3e00
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D malloc(1024)
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D malloc() -> 0x559df20e44f0
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D malloc(128)
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D malloc() -> 0x559df20e41a0
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] N opening connection to AF=1 "/run/test.ctl"
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D socket(1, 1, 0)
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] I socket(1, 1, 0) -> 6
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D fcntl(6, 2, 1)
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D fcntl() -> 0
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D connect(6, {1,AF=1 "/run/test.ctl"}, 15)
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D connect() -> 0
[ 4992.318560] sh[31]: 2021/04/04 19:15:43 socat[31] D getsockname(6, 0x7fff4c321ae0, 0x7fff4c321ac4{112})
[ 4992.327850] systemd[1]: Bus private-bus-connection: changing state AUTHENTICATING → RUNNING
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] D getsockname(, {AF=1 "<anon>"}, {2}) -> 0
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] N successfully connected from local address AF=1 "\0\0\0\0\0\0\0\0\0\0\0\0\0"
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] I resolved and opened all sock addresses
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] D malloc(16385)
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] D malloc() -> 0x559df20e4900
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] N starting data transfer loop with FDs [5,5] and [6,6]
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] D data loop: sock1->eof=0, sock2->eof=0, closing=0, wasaction=1, total_to={0.000000}
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] D select(7, &0x60, &0x60, &0x0, NULL/0.000000)
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] D select -> (, 0x60, 0x60, 0x0, NULL/0.000000), 4
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] D read(5, 0x559df20e4900, 8192)
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] D read -> 1
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] D write(6, 0x559df20e4900, 1)
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] D write -> -1
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] E write(6, 0x559df20e4900, 1): Broken pipe
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] N exit(1)
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] D starting xioexit()
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] I shutdown(6, 2)
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] D shutdown()  -> 0
[ 4992.327954] sh[31]: 2021/04/04 19:15:43 socat[31] D finished xioexit()

@bluca
Member Author

bluca commented Apr 4, 2021

have you ever seen it on centos?

@mrc0mmand
Member

have you ever seen it on centos?

Nope, never seen it before. I'll try to investigate it further though.

@bluca
Member Author

bluca commented Apr 4, 2021

I don't see it on Debian either

@mrc0mmand
Member

The only suspicious thing I see is the SO_BSDCOMPAT kernel warning, but after reading the description in the socket(7) man page it doesn't seem to be relevant to the issue:

[    4.143094] systemd[1]: Condition check resulted in test10.service being skipped.
[    4.143188] sh[354]: 2021/04/04 19:46:53.094487 systemd-testsuite socat[354] D connect() -> 0
[    4.143188] sh[354]: 2021/04/04 19:46:53.094508 systemd-testsuite socat[354] D getsockname(6, 0x7ffdcf9b6170, 0x7ffdcf9b6154{112})
[    4.143188] sh[354]: 2021/04/04 19:46:53.094517 systemd-testsuite socat[354] D getsockname(, {AF=1 "<anon>"}, {2}) -> 0
[    4.143188] sh[354]: 2021/04/04 19:46:53.094521 systemd-testsuite socat[354] N successfully connected from local address AF=1 "\0\0\0\0\0\0\0\0\0\0\0\0\0"
[    4.143188] sh[354]: 2021/04/04 19:46:53.094527 systemd-testsuite socat[354] I resolved and opened all sock addresses
[    4.143188] sh[354]:   FD  type        device        inode        mode        links        uid        gid        rdev        size        blksize        blocks        atime                                mt>
[    4.143605] systemd[1]: test10.socket: Changed running -> listening
[    4.143691] sh[354]:    5: file        8,1        2012        0100644        1        0        0        0,0        1        4096        8        Sun Apr  4 19:46:48 2021        Sun Apr  4 19:46:53 2021    >
[    4.143691] sh[354]:    6: socket        0,8        20555        0140777        1        0        0        0,0        0        4096        0        Thu Jan  1 00:00:00 1970        Thu Jan  1 00:00:00 1970 >
[    4.143691] sh[354]: 2021/04/04 19:46:53.094942 systemd-testsuite socat[354] D malloc(16385)
[    4.143691] sh[354]: 2021/04/04 19:46:53.094967 systemd-testsuite socat[354] D malloc() -> 0x556b5844c920
[    4.143691] sh[354]: 2021/04/04 19:46:53.094972 systemd-testsuite socat[354] N starting data transfer loop with FDs [5,5] and [6,6]
[    4.143691] sh[354]: 2021/04/04 19:46:53.094976 systemd-testsuite socat[354] D data loop: sock1->eof=0, sock2->eof=0, closing=0, wasaction=1, total_to={0.000000}
[    4.143691] sh[354]: 2021/04/04 19:46:53.094986 systemd-testsuite socat[354] D select(7, &0x60, &0x60, &0x0, NULL/0.000000)
[    4.143691] sh[354]: 2021/04/04 19:46:53.094999 systemd-testsuite socat[354] D select -> (, 0x60, 0x60, 0x0, NULL/0.000000), 4
[    4.143691] sh[354]: 2021/04/04 19:46:53.095005 systemd-testsuite socat[354] D read(5, 0x556b5844c920, 8192)
[    4.143691] sh[354]: 2021/04/04 19:46:53.095019 systemd-testsuite socat[354] D read -> 1
[    4.143691] sh[354]: > 2021/04/04 19:46:53.095023  length=1 from=0 to=0
[    4.143691] sh[354]: x2021/04/04 19:46:53.095081 systemd-testsuite socat[354] D write(6, 0x556b5844c920, 1)
[    4.143691] sh[354]: 2021/04/04 19:46:53.095089 systemd-testsuite socat[354] D write -> -1
[    4.143691] sh[354]: 2021/04/04 19:46:53.095093 systemd-testsuite socat[354] E write(6, 0x556b5844c920, 1): Broken pipe
[    4.143691] sh[354]: 2021/04/04 19:46:53.095093 systemd-testsuite socat[354] N exit(1)
[    4.212447] kernel: sock: process `socat' is using obsolete getsockopt SO_BSDCOMPAT
[    4.148852] systemd[1]: test10.socket: Incoming traffic
[    4.149054] sh[354]: 2021/04/04 19:46:53.095135 systemd-testsuite socat[354] D starting xioexit()
[    4.149054] sh[354]: 2021/04/04 19:46:53.095142 systemd-testsuite socat[354] I shutdown(6, 2)
[    4.149054] sh[354]: 2021/04/04 19:46:53.095147 systemd-testsuite socat[354] D shutdown()  -> 0
[    4.149054] sh[354]: 2021/04/04 19:46:53.095150 systemd-testsuite socat[354] D finished xioexit()

@mrc0mmand
Member

mrc0mmand commented Apr 4, 2021

I have socat 1.7.3.4 on both Fedora 32 and Ubuntu Focal, and the only difference between the two builds is that the Fedora one is compiled with readline support and without libwrap, whereas the Ubuntu one is compiled without readline and with libwrap. However, I can reproduce the issue only on Ubuntu (after a couple hundred retries).

The only workaround that comes to mind would be to use socat's cool-write option, i.e.:

diff --git a/test/units/testsuite-10.service b/test/units/testsuite-10.service
index 24f0da3..72f19f1 100644
--- a/test/units/testsuite-10.service
+++ b/test/units/testsuite-10.service
@@ -4,4 +4,4 @@ Description=TEST-10-ISSUE-2467
 [Service]
 ExecStartPre=rm -f /failed /testok
 Type=oneshot
-ExecStart=sh -e -x -c 'rm -f /tmp/nonexistent; systemctl start test10.socket; printf x >test.file; socat -t20 OPEN:test.file UNIX-CONNECT:/run/test.ctl; >/testok'
+ExecStart=sh -e -x -c 'rm -f /tmp/nonexistent; systemctl start test10.socket; printf x >test.file; socat -t20 OPEN:test.file UNIX-CONNECT:/run/test.ctl,cool-write; >/testok'

This option makes socat treat EPIPE and ECONNRESET as non-fatal errors (and the test seems to work once again with it), but it just masks the original issue...

@bluca
Member Author

bluca commented Apr 5, 2021

It depends on whether it's one of our changes that introduced the regression or if it is environmental - if it's the latter, then it's probably ok to mask it, given the repro environment seems very narrow?

@mrc0mmand
Member

I'll try to bisect it later today to see whether it's the former.

@mrc0mmand
Member

So, after a painfully long bisecting session everything points to e355fb6. The related PR even mentions the TEST-10 failures: #19129 (comment) - however, I'm puzzled as well as to why this particular commit causes the issue.

I did a bisect between tags v247 and v248 (since I couldn't reproduce the issue on v247) and on every bisected commit I ran TEST-10 ~100 times (the issue is usually reproducible in the first ~60 runs). After reverting e355fb6 I can't reproduce the issue anymore.
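This kind of "run the flaky test N times per commit" workflow can be handed to git bisect run directly. The sketch below assumes a hypothetical run_test() placeholder standing in for the actual TEST-10 invocation (not shown here); git bisect run treats exit code 1 as "bad" and 0 as "good":

```python
#!/usr/bin/env python3
# Sketch of a bisect driver for a flaky test: run it up to 100 times on the
# currently checked-out commit, and exit 1 ("bad") on the first reproduction.
# run_test() is a placeholder for the real TEST-10 invocation.
import sys


def run_test() -> bool:
    # Replace with the real test run, e.g. via subprocess.run(...) on the
    # integration test makefile; here it's stubbed to always pass.
    return True


def main() -> int:
    for i in range(1, 101):
        if not run_test():
            print(f"reproduced on run {i}")
            return 1  # git bisect marks this commit "bad"
    print("not reproduced in 100 runs")
    return 0  # git bisect marks this commit "good"


if __name__ == "__main__":
    sys.exit(main())
```

Invoked as `git bisect run ./driver.py` after marking the good/bad endpoints, this automates the per-commit retries.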

bisect log
# git bisect log
git bisect start
# bad: [938bdfc0fa737d86eb3ecc70506e11e5f740e0dc] Merge pull request #19157 from keszybz/read-medium-sized-virtual-file
git bisect bad 938bdfc0fa737d86eb3ecc70506e11e5f740e0dc
# good: [4d484e14bb9864cef1d124885e625f33bf31e91c] Merge pull request #17738 from keszybz/hwdb-news-update-v247
git bisect good 4d484e14bb9864cef1d124885e625f33bf31e91c
# good: [edf370af9e9fafad01393699e7a6f34bf0568dd6] Merge pull request #18432 from yuwata/libude-list-cleanups
git bisect good edf370af9e9fafad01393699e7a6f34bf0568dd6
# good: [b332778b30d23193c792d5f5c5dcccd61f4a489c] Merge pull request #18684 from yuwata/sd-device-cleanups-and-udevadm-fix
git bisect good b332778b30d23193c792d5f5c5dcccd61f4a489c
# good: [2541462f1b6a4a3985c3b2db181c2ab61a9efefb] Merge pull request #18890 from keszybz/fuzz-bus-match
git bisect good 2541462f1b6a4a3985c3b2db181c2ab61a9efefb
# good: [b1eea703e01da1e280e179fb119449436a0c9b8e] resolved: don't flush answer RRs on CNAME redirect too early
git bisect good b1eea703e01da1e280e179fb119449436a0c9b8e
# good: [dfb433f2a77a04d44aaa64d8364805e36cb25f5a] firewall-util: probe firewall backend in fw_ctx_new()
git bisect good dfb433f2a77a04d44aaa64d8364805e36cb25f5a
# good: [3b0754b16c5535fd48f48ca39f69335ef695efe7] config files: recommend systemd-analyze cat-config
git bisect good 3b0754b16c5535fd48f48ca39f69335ef695efe7
# bad: [063a43a1a7e29a6db97c5282435f716d06274fb4] Revert "fix wrong statement JOURNAL_FILE_FORMAT.md doc"
git bisect bad 063a43a1a7e29a6db97c5282435f716d06274fb4
# bad: [a81c7ac8d408a2618d488e708b40530bcdad6bd1] Merge pull request #19129 from keszybz/test-random-range
git bisect bad a81c7ac8d408a2618d488e708b40530bcdad6bd1
# good: [008adf875e356b7c3258b5b0aff55a8f9480e340] Merge pull request #19125 from keszybz/cat-config
git bisect good 008adf875e356b7c3258b5b0aff55a8f9480e340
# bad: [93457c0c7a03f28539f248220659cf6c08bd20f1] test-random-util: modernization
git bisect bad 93457c0c7a03f28539f248220659cf6c08bd20f1
# bad: [e355fb6fb1c926d0ac109c35704b301e37172c4e] basic/log: fix log_trace()
git bisect bad e355fb6fb1c926d0ac109c35704b301e37172c4e
# first bad commit: [e355fb6fb1c926d0ac109c35704b301e37172c4e] basic/log: fix log_trace()

@bluca
Member Author

bluca commented Apr 5, 2021

Thanks for that - yes, I have no idea how log_trace could affect a socket unit and socat. Timing issue? Is LOG_TRACE enabled in these builds? @keszybz any clue?

@mrc0mmand
Member

Thanks for that - yes, I have no idea how log_trace could affect a socket unit and socat. Timing issue? Is LOG_TRACE enabled in these builds? @keszybz any clue?

Looks like a timing issue. Just to be sure, I tested the latest master with -Dlog-trace= set to both true and false: the issue is indeed reproducible when it's set to false, but not when it's set to true. Ugh...

@bluca
Member Author

bluca commented Apr 6, 2021

Oh boy... Shall we set -Dlog-trace=true in the CI then, so that at least we get it back to green?

@mrc0mmand
Member

Oh boy... Shall we set -Dlog-trace=true in the CI then, so that at least we get it back to green?

I'd give it a go to see if it indeed resolves the issue (I might've been just extremely lucky).

@bluca
Member Author

bluca commented Apr 6, 2021

Ok, pushed: https://salsa.debian.org/systemd-team/systemd/-/commit/b3938f0a1d3ca4d0fc19b941655c0e9a905b9c82

( @mbiebl @ddstreet FYI)

We should see soon if it makes a difference in the next few PRs

@mrc0mmand
Member

Hah, I was just browsing through CentOS CI journals (looking for a completely different bug) and noticed that this issue is present on Arch Linux as well, but it's far less frequent (and mitigated thanks to the task retry mechanism):

-- Logs begin at Wed 2021-04-07 20:09:39 CEST, end at Wed 2021-04-07 20:09:45 CEST. --
[    0.000000] kernel: Linux version 5.11.11-arch1-1 (linux@archlinux) (gcc (GCC) 10.2.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Tue, 30 Mar 2021 14:10:17 +0000
...
[ 1451.701112] sh[27]: + printf x
[ 1451.701154] systemd[1]: Registering bus object implementation for path=/org/freedesktop/systemd1/unit iface=org.freedesktop.systemd1.Timer
[ 1451.701201] systemd[1]: Registering bus object implementation for path=/org/freedesktop/LogControl1 iface=org.freedesktop.LogControl1
[ 1451.701247] systemd[1]: Accepted new private connection.
[ 1451.701293] systemd[1]: Bus private-bus-connection: changing state AUTHENTICATING → RUNNING
[ 1451.701340] sh[27]: + socat -t20 OPEN:test.file UNIX-CONNECT:/run/test.ctl
...
[ 1451.710612] systemd[1]: test10.service: Job 517 test10.service/start finished, result=done
[ 1451.710663] systemd[24]: Restricting namespace to: n/a.
[ 1451.710723] sh[31]: 2021/04/07 18:09:45 socat[31] E write(6, 0x55b5d0dd1000, 1): Broken pipe
[ 1451.711039] systemd[1]: Condition check resulted in test10.service being skipped.
...

At least one mystery solved :-)

@bluca
Member Author

bluca commented Apr 7, 2021

Ah that's good to know - CIs are back to green now, so our suspicions were correct. Still no idea about the root cause though...

@keszybz
Member

keszybz commented Apr 8, 2021

Looks like a timing issue.

That'd be my guess too.

sh[31]: 2021/04/07 18:09:45 socat[31] E write(6, 0x55b5d0dd1000, 1): Broken pipe

So the test is broken, right?

@bluca
Member Author

bluca commented Apr 8, 2021

I am unsure of what's broken here - the test is quite simple and it seems to me it's not doing anything weird, just running socat. I do not know why the pipe gets broken without tracing.

@keszybz
Member

keszybz commented Apr 9, 2021

Yeah, I don't grok this.
We start the socket, so pid1 opens /run/test.ctl, and then we call socat, which does the following (here, but it seems to be the same everywhere):

socket(AF_UNIX, SOCK_STREAM, 0) = 6
fcntl(6, F_SETFD, FD_CLOEXEC) = 0
connect(6, {sa_family=AF_UNIX, sun_path="/run/test.ctl"}, 15) = 0
getsockname(6, {sa_family=AF_UNIX}, [112->2]) = 0
write(6, "x", 1)      = 1

This last write triggers the "Incoming data" thing and pid1 starts the service.

In the failing case, the write fails with EPIPE.

I don't see any race here, because it's the first write that fails. @poettering, any ideas?
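For what it's worth, one way a very first write on an AF_UNIX socket can fail with EPIPE is when the listener goes away while the connection is still sitting in its backlog: connect() succeeds immediately, but the embryonic connection is torn down before it is ever accepted. This is only a minimal standalone sketch of that mechanism (hypothetical socket path, not the test itself), not a confirmed explanation of what pid1 does here:

```python
import errno
import os
import socket

path = "/tmp/test-epipe.sock"  # hypothetical path, not the test's /run/test.ctl
try:
    os.unlink(path)
except FileNotFoundError:
    pass

# The listener's role (pid1 holds the socket in the real test): listen with a backlog.
listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
listener.bind(path)
listener.listen(1)

# The client's role (socat in the real test): connect() succeeds right away,
# because the connection only has to land in the listener's backlog.
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(path)

# Close the listener before accepting: the pending connection is torn down.
listener.close()

# Now the client's *first* write fails, matching the failure pattern in the logs.
try:
    client.send(b"x")
    result = "write unexpectedly succeeded"
except OSError as e:
    result = f"write failed: {errno.errorcode[e.errno]}"
print(result)

client.close()
os.unlink(path)
```

If pid1 stopped or closed the socket unit between socat's connect() and its write() (e.g. after a trigger limit was hit), the observable behavior would look like this sketch.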

ddstreet pushed a commit to ddstreet/systemd that referenced this issue Jul 1, 2021
Its absence appears to cause TEST-10-ISSUE-2467 to reliably fail.
Enable it while we figure it out. Also probably it is helpful to have
it always turned on for the CI, to get more info out of test runs.

See: systemd#19154
@bluca
Member Author

bluca commented Aug 9, 2021

As per @mbiebl, it looks like this is still an issue without tracing - but I still haven't got the faintest clue why socat fails like that.

@mbiebl
Contributor

mbiebl commented Aug 9, 2021

I ran with socat -d -d -d -d to get a bit more debug output and see if it's the same issue.

A "good" run

[  200.313094] systemd[1]: Starting testsuite-10.service...
[  200.343635] systemd[1]: Finished systemd-user-sessions.service.
[  200.344173] systemd[1]: Started console-getty.service.
[  200.344359] systemd[1]: Reached target getty.target.
[  200.377495] systemd[1]: Listening on test10.socket.
[  200.381915] sh[32]: + printf x
[  200.405526] socat[33]: 2021/08/09 08:46:50 socat[33] D getpid()
[  200.405747] socat[33]: 2021/08/09 08:46:50 socat[33] D getpid() -> 33
[  200.405832] socat[33]: 2021/08/09 08:46:50 socat[33] D setenv("SOCAT_PID", "33", 1)
[  200.405914] socat[33]: 2021/08/09 08:46:50 socat[33] D setenv() -> 0
[  200.406009] socat[33]: 2021/08/09 08:46:50 socat[33] D setenv("SOCAT_PPID", "33", 1)
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D setenv() -> 0
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] I socat by Gerhard Rieger and contributors - see www.dest-unreach.org
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] I This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit. (http://www.openssl.org/)
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] I This product includes software written by Tim Hudson (tjh@cryptsoft.com)
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D socat version 1.7.4.1 on Feb  3 2021 12:58:17
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D setenv("SOCAT_VERSION", "1.7.4.1", 1)
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D setenv() -> 0
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D running on Linux version #1 SMP Debian 5.10.46-4 (2021-08-03), release 5.10.0-8-amd64, machine x86_64
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D argv[0]: "socat"
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D argv[1]: "-d"
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D argv[2]: "-d"
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D argv[3]: "-d"
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D argv[4]: "-d"
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D argv[5]: "-t20"
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D argv[6]: "OPEN:test.file"
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D argv[7]: "UNIX-CONNECT:/run/test.ctl"
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction(1, 0x7fffda73a450, 0x0)
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction() -> 0
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction(2, 0x7fffda73a450, 0x0)
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction() -> 0
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction(3, 0x7fffda73a450, 0x0)
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction() -> 0
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction(4, 0x7fffda73a450, 0x0)
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction() -> 0
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction(6, 0x7fffda73a450, 0x0)
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction() -> 0
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction(7, 0x7fffda73a450, 0x0)
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction() -> 0
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction(8, 0x7fffda73a450, 0x0)
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction() -> 0
[  200.409602] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction(11, 0x7fffda73a450, 0x0)
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction() -> 0
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction(15, 0x7fffda73a450, 0x0)
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D sigaction() -> 0
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D signal(13, 0x1)
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D signal() -> 0x1
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D atexit(0x562198b78fe0)
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D atexit() -> 0
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D xioopen("OPEN:test.file")
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D calloc(1, 864)
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D calloc() -> 0x56219978ba60
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D malloc(1024)
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D malloc() -> 0x56219978b4c0
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D stat64(test.file, 0x7fffda73a260)
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D stat64(, {66304,1422,0100644,1,0,0,0,1,1024,2,...}) -> 0
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] N opening regular file "test.file" for reading and writing
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D open("test.file", 02, 0666)
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] I open("test.file", 02, 0666) -> 5
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D isatty(5)
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D isatty() -> 0
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D fcntl(5, 2, 1)
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D fcntl() -> 0
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D xioopen("UNIX-CONNECT:/run/test.ctl")
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D calloc(1, 864)
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D calloc() -> 0x56219978be10
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D malloc(1024)
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D malloc() -> 0x56219978c1c0
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D malloc(128)
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D malloc() -> 0x56219978c5d0
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] N opening connection to AF=1 "/run/test.ctl"
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D socket(1, 1, 0)
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] I socket(1, 1, 0) -> 6
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D fcntl(6, 2, 1)
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D fcntl() -> 0
[  200.410921] socat[33]: 2021/08/09 08:46:50 socat[33] D connect(6, {1,AF=1 "/run/test.ctl"}, 15)
[  200.412687] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.413279] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D connect() -> 0
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D getsockname(6, 0x7fffda739e90, 0x7fffda739e74{112})
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D getsockname(, {AF=1 "<anon>"}, {2}) -> 0
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] N successfully connected from local address AF=1 "\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE"
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] I resolved and opened all sock addresses
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D posix_memalign(0x7fffda73a3b8, 4096, 16385)
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D posix_memalign(...) -> 0
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] N starting data transfer loop with FDs [5,5] and [6,6]
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D data loop: sock1->eof=0, sock2->eof=0, closing=0, wasaction=1, total_to={0.000000}
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D select(7, &0x60, &0x60, &0x0, NULL/0.000000)
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D select -> (, 0x20, 0x60, 0x0, NULL/0.000000), 3
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D read(5, 0x56219978d000, 8192)
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D read -> 1
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D write(6, 0x56219978d000, 1)
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D write -> 1
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] I transferred 1 bytes from 5 to 6
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D data loop: sock1->eof=0, sock2->eof=0, closing=0, wasaction=1, total_to={0.000000}
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D select(7, &0x60, &0x40, &0x0, NULL/0.000000)
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D select -> (, 0x20, 0x40, 0x0, NULL/0.000000), 2
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D read(5, 0x56219978d000, 8192)
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D read -> 0
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] N socket 1 (fd 5) is at EOF
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] I shutdown(6, 1)
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D shutdown()  -> 0
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D data loop: sock1->eof=3, sock2->eof=0, closing=1, wasaction=1, total_to={0.000000}
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D select(7, &0x40, &0x0, &0x0, &20.000000)
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D select -> (, 0x40, 0x0, 0x0, &19.997870), 1
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D read(6, 0x56219978d000, 8192)
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] D read -> -1
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] W read(6, 0x56219978d000, 8192): Connection reset by peer
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] N socket 2 to socket 1 is in error
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] N socket 2 (fd 6) is at EOF
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] I shutdown(6, 2)
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] D shutdown()  -> 0
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] N exiting with status 0
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] D exit(0)
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] D starting xioexit()
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] D finished xioexit()
[  200.420790] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.421418] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.427324] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.427943] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.428541] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.429134] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.436408] sh[34]: +
[  200.436858] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.437514] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.438195] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.438792] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.439435] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.440068] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.440660] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.441300] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.445988] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.446596] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.447388] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.448137] systemd[1]: Condition check resulted in test10.service being skipped.
[  200.448316] systemd[1]: test10.socket: Trigger limit hit, refusing further activation.
[  200.448376] systemd[1]: test10.socket: Failed with result 'trigger-limit-hit'.
[  200.457933] systemd[1]: testsuite-10.service: Deactivated successfully.
[  200.458181] systemd[1]: Finished testsuite-10.service.

A "bad" run:

[  234.230406] systemd[1]: Starting testsuite-10.service...
[  234.260635] systemd[1]: Finished systemd-user-sessions.service.
[  234.261997] systemd[1]: Started console-getty.service.
[  234.262937] systemd[1]: Reached target getty.target.
[  234.307519] systemd[1]: Listening on test10.socket.
[  234.315379] sh[32]: + printf x
[  234.336628] socat[33]: 2021/08/09 09:16:56 socat[33] D getpid()
[  234.338640] socat[33]: 2021/08/09 09:16:56 socat[33] D getpid() -> 33
[  234.338734] socat[33]: 2021/08/09 09:16:56 socat[33] D setenv("SOCAT_PID", "33", 1)
[  234.338820] socat[33]: 2021/08/09 09:16:56 socat[33] D setenv() -> 0
[  234.338891] socat[33]: 2021/08/09 09:16:56 socat[33] D setenv("SOCAT_PPID", "33", 1)
[  234.338961] socat[33]: 2021/08/09 09:16:56 socat[33] D setenv() -> 0
[  234.339028] socat[33]: 2021/08/09 09:16:56 socat[33] I socat by Gerhard Rieger and contributors - see www.dest-unreach.org
[  234.339105] socat[33]: 2021/08/09 09:16:56 socat[33] I This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit. (http://www.openssl.org/)
[  234.339174] socat[33]: 2021/08/09 09:16:56 socat[33] I This product includes software written by Tim Hudson (tjh@cryptsoft.com)
[  234.339242] socat[33]: 2021/08/09 09:16:56 socat[33] D socat version 1.7.4.1 on Feb  3 2021 12:58:17
[  234.339310] socat[33]: 2021/08/09 09:16:56 socat[33] D setenv("SOCAT_VERSION", "1.7.4.1", 1)
[  234.339378] socat[33]: 2021/08/09 09:16:56 socat[33] D setenv() -> 0
[  234.339450] socat[33]: 2021/08/09 09:16:56 socat[33] D running on Linux version #1 SMP Debian 5.10.46-4 (2021-08-03), release 5.10.0-8-amd64, machine x86_64
[  234.339593] socat[33]: 2021/08/09 09:16:56 socat[33] D argv[0]: "socat"
[  234.339667] socat[33]: 2021/08/09 09:16:56 socat[33] D argv[1]: "-d"
[  234.339739] socat[33]: 2021/08/09 09:16:56 socat[33] D argv[2]: "-d"
[  234.339811] socat[33]: 2021/08/09 09:16:56 socat[33] D argv[3]: "-d"
[  234.339883] socat[33]: 2021/08/09 09:16:56 socat[33] D argv[4]: "-d"
[  234.339954] socat[33]: 2021/08/09 09:16:56 socat[33] D argv[5]: "-t20"
[  234.340026] socat[33]: 2021/08/09 09:16:56 socat[33] D argv[6]: "OPEN:test.file"
[  234.340107] socat[33]: 2021/08/09 09:16:56 socat[33] D argv[7]: "UNIX-CONNECT:/run/test.ctl"
[  234.340182] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction(1, 0x7ffd013ba290, 0x0)
[  234.340257] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction() -> 0
[  234.340329] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction(2, 0x7ffd013ba290, 0x0)
[  234.340401] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction() -> 0
[  234.340473] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction(3, 0x7ffd013ba290, 0x0)
[  234.340546] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction() -> 0
[  234.340617] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction(4, 0x7ffd013ba290, 0x0)
[  234.340707] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction() -> 0
[  234.340775] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction(6, 0x7ffd013ba290, 0x0)
[  234.340844] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction() -> 0
[  234.340911] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction(7, 0x7ffd013ba290, 0x0)
[  234.340986] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction() -> 0
[  234.341054] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction(8, 0x7ffd013ba290, 0x0)
[  234.341123] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction() -> 0
[  234.341190] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction(11, 0x7ffd013ba290, 0x0)
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction() -> 0
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction(15, 0x7ffd013ba290, 0x0)
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D sigaction() -> 0
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D signal(13, 0x1)
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D signal() -> 0x1
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D atexit(0x5587799c9fe0)
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D atexit() -> 0
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D xioopen("OPEN:test.file")
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D calloc(1, 864)
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D calloc() -> 0x558779d0ba60
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D malloc(1024)
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D malloc() -> 0x558779d0b4c0
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D stat64(test.file, 0x7ffd013ba0a0)
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D stat64(, {66304,1422,0100644,1,0,0,0,1,1024,2,...}) -> 0
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] N opening regular file "test.file" for reading and writing
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D open("test.file", 02, 0666)
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] I open("test.file", 02, 0666) -> 5
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D isatty(5)
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D isatty() -> 0
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D fcntl(5, 2, 1)
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D fcntl() -> 0
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D xioopen("UNIX-CONNECT:/run/test.ctl")
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D calloc(1, 864)
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D calloc() -> 0x558779d0be10
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D malloc(1024)
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D malloc() -> 0x558779d0c1c0
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D malloc(128)
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D malloc() -> 0x558779d0c5d0
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] N opening connection to AF=1 "/run/test.ctl"
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D socket(1, 1, 0)
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] I socket(1, 1, 0) -> 6
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D fcntl(6, 2, 1)
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D fcntl() -> 0
[  234.345271] socat[33]: 2021/08/09 09:16:56 socat[33] D connect(6, {1,AF=1 "/run/test.ctl"}, 15)
[  234.347150] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.347749] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.348320] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.348964] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D connect() -> 0
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D getsockname(6, 0x7ffd013b9cd0, 0x7ffd013b9cb4{112})
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D getsockname(, {AF=1 "<anon>"}, {2}) -> 0
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] N successfully connected from local address AF=1 "\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE\xEE"
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] I resolved and opened all sock addresses
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D posix_memalign(0x7ffd013ba1f8, 4096, 16385)
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D posix_memalign(...) -> 0
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] N starting data transfer loop with FDs [5,5] and [6,6]
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D data loop: sock1->eof=0, sock2->eof=0, closing=0, wasaction=1, total_to={0.000000}
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D select(7, &0x60, &0x60, &0x0, NULL/0.000000)
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D select -> (, 0x60, 0x60, 0x0, NULL/0.000000), 4
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D read(5, 0x558779d0d000, 8192)
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D read -> 1
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D write(6, 0x558779d0d000, 1)
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D write -> -1
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] E write(6, 0x558779d0d000, 1): Broken pipe
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] N exit(1)
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D starting xioexit()
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] I shutdown(6, 2)
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D shutdown()  -> 0
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D finished xioexit()
[  234.360745] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.362436] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.363069] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.363692] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.364331] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.364919] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.369707] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.370357] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.370941] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.371526] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.372110] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.372689] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.377343] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.378000] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.378585] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.379222] systemd[1]: Condition check resulted in test10.service being skipped.
[  234.379394] systemd[1]: test10.socket: Trigger limit hit, refusing further activation.
[  234.379453] systemd[1]: test10.socket: Failed with result 'trigger-limit-hit'.
[  234.379946] systemd[1]: testsuite-10.service: Main process exited, code=exited, status=1/FAILURE
[  234.380007] systemd[1]: testsuite-10.service: Failed with result 'exit-code'.
[  234.381130] systemd[1]: Failed to start testsuite-10.service.

The main difference:

[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] N starting data transfer loop with FDs [5,5] and [6,6]
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D data loop: sock1->eof=0, sock2->eof=0, closing=0, wasaction=1, total_to={0.000000}
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D select(7, &0x60, &0x60, &0x0, NULL/0.000000)
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D select -> (, 0x20, 0x60, 0x0, NULL/0.000000), 3
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D read(5, 0x56219978d000, 8192)
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D read -> 1
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D write(6, 0x56219978d000, 1)
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D write -> 1
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] I transferred 1 bytes from 5 to 6
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D data loop: sock1->eof=0, sock2->eof=0, closing=0, wasaction=1, total_to={0.000000}
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D select(7, &0x60, &0x40, &0x0, NULL/0.000000)
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D select -> (, 0x20, 0x40, 0x0, NULL/0.000000), 2
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D read(5, 0x56219978d000, 8192)
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D read -> 0
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] N socket 1 (fd 5) is at EOF
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] I shutdown(6, 1)
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D shutdown()  -> 0
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D data loop: sock1->eof=3, sock2->eof=0, closing=1, wasaction=1, total_to={0.000000}
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D select(7, &0x40, &0x0, &0x0, &20.000000)
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D select -> (, 0x40, 0x0, 0x0, &19.997870), 1
[  200.418964] socat[33]: 2021/08/09 08:46:50 socat[33] D read(6, 0x56219978d000, 8192)
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] D read -> -1
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] W read(6, 0x56219978d000, 8192): Connection reset by peer
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] N socket 2 to socket 1 is in error
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] N socket 2 (fd 6) is at EOF
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] I shutdown(6, 2)
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] D shutdown()  -> 0
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] N exiting with status 0
[  200.420131] socat[33]: 2021/08/09 08:46:50 socat[33] D exit(0)

vs

[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] N starting data transfer loop with FDs [5,5] and [6,6]
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D data loop: sock1->eof=0, sock2->eof=0, closing=0, wasaction=1, total_to={0.000000}
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D select(7, &0x60, &0x60, &0x0, NULL/0.000000)
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D select -> (, 0x60, 0x60, 0x0, NULL/0.000000), 4
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D read(5, 0x558779d0d000, 8192)
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D read -> 1
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D write(6, 0x558779d0d000, 1)
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] D write -> -1
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] E write(6, 0x558779d0d000, 1): Broken pipe
[  234.359276] socat[33]: 2021/08/09 09:16:56 socat[33] N exit(1)

Which basically matches the current findings. On this particular system, the failure rate is, I'd say, about 20%.

@mbiebl
Copy link
Contributor

mbiebl commented Aug 9, 2021

What I don't quite understand is what this test is supposed to check.
The original issue #2467 was about the rate limit of a socket unit not being applied, with systemd continuously restarting a socket unit when being triggered.
Why do we care about the return code of socat?

Shouldn't the test rather be something like this:

  • Start test10.socket
  • Trigger test10.socket by connecting to the socket
  • Check after a sleep X, if test10.socket is in state trigger-limit-hit
    ?

I honestly don't understand the test.

@bluca
Copy link
Member Author

bluca commented Aug 9, 2021

What I don't quite understand is what this test is supposed to check.
The original issue #2467 was about the rate limit of a socket unit not being applied, with systemd continuously restarting a socket unit when being triggered.
Why do we care about the return code of socat?

Shouldn't the test rather be something like this:

  • Start test10.socket
  • Trigger test10.socket by connecting to the socket
  • Check after a sleep X, if test10.socket is in state trigger-limit-hit
    ?

I honestly don't understand the test.

I'm not sure, to check that the socket unit actually did its job? Want to send a PR to implement the change you suggested?

@mbiebl
Copy link
Contributor

mbiebl commented Aug 9, 2021

What I don't quite understand is what this test is supposed to check.

@evverx ^ could you chime in here.

It's quite possible that I'm missing some finer details here about why the return code of socat is important and why the test was written the way it is now.

@mbiebl
Copy link
Contributor

mbiebl commented Aug 9, 2021

I'm not sure, to check that the socket unit actually did its job?

What does "did its job" mean in this particular case?

@mbiebl
Copy link
Contributor

mbiebl commented Aug 9, 2021

Fwiw, if you want to make the test fail (100%) reliably, you can replace printf x >test.file; with dd if=/dev/zero of=test.file bs=1M count=8;.
Then it is pretty much guaranteed that you hit the "Broken pipe" error on write, even with log-tracing enabled.

@mbiebl
Copy link
Contributor

mbiebl commented Aug 9, 2021

Shouldn't the test rather be something like this:

  • Start test10.socket
  • Trigger test10.socket by connecting to the socket
  • Check after a sleep X, if test10.socket is in state trigger-limit-hit
    ?

Since the default TriggerLimitIntervalSec= is 2s, we could make this a sleep 3, and if we wanted to be super fancy, we could check the journal to see whether we actually hit TriggerLimitBurst=20, i.e. whether the journal contains 20 "systemd[1]: Condition check resulted in test10.service being skipped." messages.
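The counting step suggested above could be sketched like this. It is a runnable sketch only: the grep pattern is the exact journal message from the failing run earlier in this issue, but the log lines here come from a pasted sample, whereas the real test would read them from journalctl.

```shell
# Sample lines pasted from the failing run above; a real check would
# pipe `journalctl` output here instead.
sample='systemd[1]: Condition check resulted in test10.service being skipped.
systemd[1]: Condition check resulted in test10.service being skipped.
systemd[1]: test10.socket: Trigger limit hit, refusing further activation.'

# Count how many times the service activation was skipped.
count=$(printf '%s\n' "$sample" | grep -c 'being skipped')
echo "skipped activations: $count"
# The real test would then require the count to equal TriggerLimitBurst=20.
```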

@evverx
Copy link
Member

evverx commented Aug 9, 2021

Check after a sleep X, if test10.socket is in state trigger-limit-hit

As far as I can remember, both nc and socat got stuck and timed out eventually when the bug was there, so sleep ... (or any other command following socat) wouldn't have been run anyway. The idea was just to make sure that socat doesn't time out, and in this particular case its exit code can be safely ignored, I think.

@evverx
Copy link
Member

evverx commented Aug 9, 2021

but I still haven't got the faintest clue why socat fails like that

My guess would be that systemd closes the receiving end right before socat tries to write that one byte to the socket. FWIW, if the test fails too often and generally annoys everybody, I think it could be removed.
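The closed-peer guess can be reproduced outside of systemd. The sketch below uses a FIFO rather than an AF_UNIX socket (an assumption that a FIFO's closed read end behaves, for this purpose, like the closed socket peer): the reader goes away after the writer has "connected" but before the byte is written, and the write fails with EPIPE, matching socat's "Broken pipe" error in the log above.

```shell
# Reproduce the race: peer closes before the one-byte write.
fifo=$(mktemp -u)
mkfifo "$fifo"
cat "$fifo" >/dev/null &        # stand-in for the accepting side (PID1)
reader=$!
exec 3>"$fifo"                  # "connect": opening for write succeeds
kill "$reader"                  # peer goes away, as when test10.socket stops
wait "$reader" 2>/dev/null
trap '' PIPE                    # ignore SIGPIPE so the write error is observable
if printf x >&3 2>/dev/null; then ok=1; else ok=0; fi
[ "$ok" -eq 0 ] && echo "write failed with EPIPE, like socat's 'Broken pipe'"
exec 3>&-
rm -f "$fifo"
```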

@mbiebl
Copy link
Contributor

mbiebl commented Aug 9, 2021

@evverx
So the attempts to replace nc by echo (9b45c2b) and then socat (b35d6d8) were actually missing the point of this test?
Would you agree that the following would probably be more in line what we actually expect to check for correct behaviour:

  • Start socket: systemctl start test10.socket
  • Trigger activation of test10.socket (via nc or socat, it doesn't matter). We ignore the return code and background it via & since this call might block (fwiw, I don't remember nc doing that, but I'd have to check v228 again...)
  • sleep 4 (2s + some safety buffer)
  • check if systemctl show test10.socket -P Result == trigger-limit-hit and systemctl show test10.socket -P ActiveState == failed
    If not, fail the test.
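The final assertion step of this proposal could look roughly as follows. This is a sketch only: systemctl is stubbed out here (simulating a socket that has just hit its trigger limit) so the shape of the check can run anywhere; the real test would of course call the actual systemctl.

```shell
# Stub standing in for systemctl; simulates a socket in the
# trigger-limit-hit/failed state. Remove in the real test.
systemctl() {
    case "$*" in
        *"-P Result"*)      echo trigger-limit-hit ;;
        *"-P ActiveState"*) echo failed ;;
    esac
}

# The proposed check: socket must have failed with trigger-limit-hit.
result=$(systemctl show test10.socket -P Result)
state=$(systemctl show test10.socket -P ActiveState)

if [ "$result" = trigger-limit-hit ] && [ "$state" = failed ]; then
    echo "test10.socket failed with result trigger-limit-hit, as expected"
else
    echo "unexpected state: Result=$result ActiveState=$state" >&2
    exit 1
fi
```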

@evverx
Copy link
Member

evverx commented Aug 9, 2021

So the attempts to replace nc by echo (9b45c2b) and then socat (b35d6d8) were actually missing the point of this test?

I don't think they were. They just kind of made the test work at the time, probably by relying on implementation details that apparently shouldn't be relied on.

Would you agree that the following would probably be more in line what we actually expect to check for correct behaviour

Yes, I would, but I think that sleep would make the test flaky under ASan (which slows everything down) and probably on slow VMs (where it seems the flakiness usually manifests itself).

@ddstreet
Copy link
Contributor

ddstreet commented Aug 9, 2021

but I still haven't got the faintest clue why socat fails like that

My guess would be that systemd closes the receiving end right before socat tries to write that one byte to the socket. FWIW if the test fails too much and generally annoys everybody I think it could be removed.

this seems the most likely explanation. Since the test intentionally triggers a test10.service failure, which then causes test10.socket to fail (and stop), it's just a race condition: when socat performs the connect() to the unit socket, if systemd runs through all the test10.service restart attempts and stops test10.socket before socat reaches write(), then socat gets EPIPE from systemd, which has already closed its end of the socket. That would seem to match the 'passing' case when log-trace is enabled, which presumably slows systemd down enough that the socket unit doesn't close before socat reaches write().

Start socket: systemctl start test10.socket
Trigger activation of test10.socket (via nc or socat doesn't matter). We ignore the return code and background it via & as this call might potentially block (fwiw, I don't remember nc doing that, but I'd have to check v228 again...)

yes, since we know that the socket will be closing at some point after the connect(), I don't think we should care about anything that happens to socat/nc after that first connect(). In fact, even if the connect() fails, we shouldn't care about that.

sleep 4 (2s + some safety buffer)

I'm always somewhat concerned about arbitrary delays in test code, since testbed performance can vary so much, but this test really should move very quickly, so a few seconds of sleep should be enough.

check if systemctl show test10.socket -P Result == trigger-limit-hit and systemctl show test10.socket -P ActiveState == failed
If not, fail the test.

yep LGTM

@mbiebl
Copy link
Contributor

mbiebl commented Aug 9, 2021

sleep 4 (2s + some safety buffer)

I'm always somewhat concerned about arbitrary delays in test code, since testbed performance can vary so much, but this test really should move very quickly, so a few seconds of sleep should be enough.

Nod. Fwiw, in this test run it took about 30ms until the trigger limit was hit. We could start with something like 10s to be extra safe and increase it should we still run into issues.

mbiebl added a commit to mbiebl/systemd that referenced this issue Aug 9, 2021
Depending on the timing, socat will either get ECONNREFUSED or EPIPE
from systemd. The latter will cause it to exit(1) and subsequently the
test to fail.
We are not actually interested in the return code of socat, though. The
test is supposed to check whether rate limiting of a socket unit works
properly.

So ignore any failures from the socat invocation and instead check
whether test10.socket is in state "failed" with result "trigger-limit-hit"
after it has been triggered.

TriggerLimitIntervalSec= by default is set to 2s. A "sleep 10" should
give systemd enough time even on slower machines, to reach the trigger
limit.

For better readability, break the test into separate ExecStart lines.

Fixes systemd#19154.
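The "separate ExecStart lines" layout this commit message describes might look roughly like the following. This is a hypothetical sketch, not the actual patch: the exact unit contents, paths, and options are assumptions based on the thread above.

```ini
# Hypothetical sketch only -- not the actual patch.
[Service]
Type=oneshot
ExecStart=systemctl start test10.socket
# "-" prefix: ignore socat's exit status (may see ECONNREFUSED or EPIPE)
ExecStart=-socat -t20 OPEN:test.file UNIX-CONNECT:/run/test.ctl
# give the trigger limit (TriggerLimitIntervalSec= defaults to 2s) time to be hit
ExecStart=sleep 10
ExecStart=sh -xc 'test "$(systemctl show test10.socket -P Result)" = trigger-limit-hit'
ExecStart=sh -xc 'test "$(systemctl show test10.socket -P ActiveState)" = failed'
```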
mbiebl added a commit to mbiebl/systemd that referenced this issue Aug 10, 2021
@bluca bluca linked a pull request Aug 10, 2021 that will close this issue
mbiebl added a commit to mbiebl/systemd that referenced this issue Aug 10, 2021
yuwata pushed a commit that referenced this issue Aug 10, 2021
ddstreet pushed a commit to ddstreet/systemd that referenced this issue Aug 20, 2021
Its absence appears to cause TEST-10-ISSUE-2467 to reliably fail.
Enable it while we figure it out. It is also probably helpful to have
it always turned on for the CI, to get more info out of test runs.

See: systemd#19154
ddstreet pushed a commit to ddstreet/systemd that referenced this issue Nov 5, 2021
keszybz pushed a commit to fbuihuu/systemd that referenced this issue Dec 6, 2021
Writing a byte to test10.socket is actually the root cause of issue systemd#19154:
depending on the timing, it's possible that PID1 closes the socket before socat
(or nc, it doesn't matter which tool is actually used) tries to write that one
byte to the socket. In this case writing to the socket returns EPIPE, which
causes socat to exit(1) and subsequently make the test fail.

Since we're only interested in connecting to the socket and triggering the rate
limit of the socket, this patch removes the parts that write the single byte to
the socket, which should remove the race for good.

Since it shouldn't matter whether the test uses socat or nc, let's switch back
to nc and hence remove the sole user of socat.
fbuihuu added a commit to fbuihuu/systemd that referenced this issue Dec 6, 2021
Writing a byte to test10.socket is actually the root cause of issue systemd#19154:
depending on the timing, it's possible that PID1 closes the socket before socat
(or nc, it doesn't matter which tool is actually used) tries to write that one
byte to the socket. In this case writing to the socket returns EPIPE, which
causes socat to exit(1) and subsequently make the test fail.

Since we're only interested in connecting to the socket and triggering the rate
limit of the socket, this patch removes the parts that write the single byte to
the socket, which should remove the race for good.

Since it shouldn't matter whether the test uses socat or nc, let's switch back
to nc and hence remove the sole user of socat. The exit status of nc is however
ignored because some versions might choke when the socket is closed
unexpectedly.
yuwata pushed a commit that referenced this issue Dec 6, 2021
Werkov pushed a commit to Werkov/systemd that referenced this issue Nov 1, 2023
Werkov pushed a commit to Werkov/systemd that referenced this issue Nov 1, 2023
g7 pushed a commit to droidian/systemd that referenced this issue Mar 8, 2024