(#1712524) Change job mode of manager triggered restarts to JOB_REPLACE #3
Merged
systemd-rhel-bot merged 1 commit into redhat-plumbers:master from jsynacek:bz1712524-job-mode-change on Jul 26, 2019
Conversation
Fixes: #11305
Fixes: #3260
Related: #11456

So, here's what happens in the scenario described in #11305. A unit goes down, and that triggers stop jobs for the other two units, as they were bound to it. Now the timer for manager-triggered restarts kicks in and schedules a restart job with the JOB_FAIL job mode. This means there is a stop job installed on those units, and because they are bound to us they also get a restart job enqueued. This, however, is a conflict, as neither can stop merge into restart, nor restart into stop. Yet restart should be able to replace stop in any case: if the stop procedure is ongoing, it can cancel the stop job, install itself, and then, after the unit reaches the dead state, finish and convert itself into a start job. However, if we increase the timer, it can always take those units from inactive -> auto-restart.

We change the job mode to JOB_REPLACE so the restart job cancels the stop job and installs itself.

Also, the original bug could be worked around by bumping RestartSec= to avoid the conflict.

This doesn't seem to be something that is going to break users. For those who already had it working, there can never have been conflicting jobs, as that would have resulted in a destructive transaction by virtue of the job mode used.

After this change, the test case works nicely without issues.

(cherry picked from commit 03ff2dc)

Resolves: #1712524
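To make the job-mode semantics above easier to follow, here is a minimal, self-contained C model of the behaviour being described. It is an editor's sketch, not systemd's job engine: the names (toy_unit, toy_install_job, can_merge) are hypothetical, and the merge rule is reduced to "identical job types merge". It only illustrates why a pending stop job makes a JOB_FAIL restart transaction fail, while JOB_REPLACE lets the restart cancel the stop and install itself.

```c
/* Toy model of the job-mode behaviour described above. This is an
 * illustration only, NOT systemd's implementation; all names here
 * (toy_unit, toy_install_job, can_merge) are hypothetical. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { JOB_START, JOB_STOP, JOB_RESTART } JobType;
typedef enum { JOB_FAIL, JOB_REPLACE } JobMode;

typedef struct {
    bool has_job;       /* is a job currently installed on the unit? */
    JobType installed;  /* type of the installed job, if any */
} toy_unit;

/* Simplified merge rule: only identical job types merge.
 * In particular, stop and restart cannot merge into one another. */
static bool can_merge(JobType a, JobType b) {
    return a == b;
}

/* Returns 0 on success, -1 if the new job conflicts and the mode is JOB_FAIL. */
static int toy_install_job(toy_unit *u, JobType new_type, JobMode mode) {
    if (!u->has_job || can_merge(u->installed, new_type)) {
        u->has_job = true;
        u->installed = new_type;
        return 0;
    }
    if (mode == JOB_REPLACE) {
        /* JOB_REPLACE: cancel the installed job and take its place,
         * which is what the restart job needs to do to a pending stop. */
        u->installed = new_type;
        return 0;
    }
    /* JOB_FAIL: the conflicting transaction is rejected. */
    return -1;
}

int main(void) {
    toy_unit u = { .has_job = true, .installed = JOB_STOP };

    if (toy_install_job(&u, JOB_RESTART, JOB_FAIL) < 0)
        printf("JOB_FAIL: restart conflicts with the pending stop job\n");

    if (toy_install_job(&u, JOB_RESTART, JOB_REPLACE) == 0)
        printf("JOB_REPLACE: restart cancelled the stop job and installed itself\n");

    return 0;
}
```

Compiled and run as an ordinary C program, the model prints one line for each of the two outcomes; in the real manager the restart job, once installed, waits for the unit to reach the dead state and then converts itself into a start job, as described above.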
jsynacek added the pr/needs-review and tracker/unapproved labels on May 22, 2019
systemd-rhel-bot changed the title from "Change job mode of manager triggered restarts to JOB_REPLACE" to "(#11456) Change job mode of manager triggered restarts to JOB_REPLACE" on May 22, 2019
systemd-rhel-bot added and removed the pr/needs-ci label on May 28, 2019
systemd-rhel-bot changed the title from "(#11456) Change job mode of manager triggered restarts to JOB_REPLACE" to "(#11456) (#11456) Change job mode of manager triggered restarts to JOB_REPLACE" on Jun 4, 2019
systemd-rhel-bot changed the title from "(#11456) (#11456) Change job mode of manager triggered restarts to JOB_REPLACE" to "(#1712524) (#11456) (#11456) Change job mode of manager triggered restarts to JOB_REPLACE" on Jun 5, 2019
lnykryn changed the title from "(#1712524) (#11456) (#11456) Change job mode of manager triggered restarts to JOB_REPLACE" to "Change job mode of manager triggered restarts to JOB_REPLACE" on Jun 5, 2019
lnykryn changed the title from "Change job mode of manager triggered restarts to JOB_REPLACE" to "(#1712524) Change job mode of manager triggered restarts to JOB_REPLACE" on Jun 5, 2019
msekletar approved these changes on Jun 14, 2019
LGTM
systemd-rhel-bot added the pr/needs-ci label and removed the pr/needs-review and pr/needs-ci labels on Jul 8, 2019
systemd-rhel-bot added and removed the pr/needs-review label on Jul 15, 2019
systemd-rhel-bot added the pr/needs-review, tracker/missing, and tracker/unapproved labels and removed the tracker/unapproved, pr/needs-review, and tracker/missing labels on Jul 22, 2019
mrc0mmand pushed a commit to mrc0mmand/rhel-8 that referenced this pull request on Nov 27, 2019
This is a follow-up to 8857fb9 that prevents the fuzzer from crashing with:

```
==220==ERROR: AddressSanitizer: ABRT on unknown address 0x0000000000dc (pc 0x7ff4953c8428 bp 0x7ffcf66ec290 sp 0x7ffcf66ec128 T0)
SCARINESS: 10 (signal)
    #0 0x7ff4953c8427 in gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x35427)
    #1 0x7ff4953ca029 in abort (/lib/x86_64-linux-gnu/libc.so.6+0x37029)
    #2 0x7ff49666503a in log_assert_failed_realm /work/build/../../src/systemd/src/basic/log.c:805:9
    #3 0x7ff496614ecf in safe_close /work/build/../../src/systemd/src/basic/fd-util.c:66:17
    #4 0x548806 in server_done /work/build/../../src/systemd/src/journal/journald-server.c:2064:9
    #5 0x5349fa in LLVMFuzzerTestOneInput /work/build/../../src/systemd/src/fuzz/fuzz-journald-kmsg.c:26:9
    #6 0x592755 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/libfuzzer/FuzzerLoop.cpp:571:15
    #7 0x590627 in fuzzer::Fuzzer::RunOne(unsigned char const*, unsigned long, bool, fuzzer::InputInfo*, bool*) /src/libfuzzer/FuzzerLoop.cpp:480:3
    #8 0x594432 in fuzzer::Fuzzer::MutateAndTestOne() /src/libfuzzer/FuzzerLoop.cpp:708:19
    #9 0x5973c6 in fuzzer::Fuzzer::Loop(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, fuzzer::fuzzer_allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) /src/libfuzzer/FuzzerLoop.cpp:839:5
    #10 0x574541 in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/libfuzzer/FuzzerDriver.cpp:764:6
    #11 0x5675fc in main /src/libfuzzer/FuzzerMain.cpp:20:10
    #12 0x7ff4953b382f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
    #13 0x420f58 in _start (/out/fuzz-journald-kmsg+0x420f58)
```

(cherry picked from commit cc55ac0)

Resolves: #1764560
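For readers unfamiliar with the idiom involved, the sketch below illustrates the failure mode visible in the trace: a cleanup path (server_done) passes a close helper (safe_close) an fd value it never actually owned, and the helper treats EBADF from close() as a programming error and aborts. The names here (toy_server, toy_safe_close, toy_server_done) are hypothetical; this is not the journald code or the actual fix in cc55ac0, only an editor's illustration of why fd members are conventionally initialized to -1 so that an unconditional cleanup path stays safe.

```c
/* Editor's illustration of the failure mode in the trace above, not the
 * actual journald code. All names below are hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

typedef struct {
    int kmsg_fd;   /* hypothetical fd member, -1 when not open */
} toy_server;

/* Negative fds are ignored; closing a non-negative fd that is not
 * actually open (EBADF) is treated as a bug, mirroring the assertion
 * failure shown in the trace. */
static int toy_safe_close(int fd) {
    if (fd >= 0) {
        int r = close(fd);
        assert(!(r < 0 && errno == EBADF));
    }
    return -1;
}

static void toy_server_done(toy_server *s) {
    s->kmsg_fd = toy_safe_close(s->kmsg_fd);
}

int main(void) {
    /* Fully initialized: fd is -1, so cleanup is a no-op and cannot abort. */
    toy_server s = { .kmsg_fd = -1 };
    toy_server_done(&s);
    puts("cleanup of an initialized server is safe");

    /* A garbage fd value such as 220 would make close() fail with EBADF
     * and trip the assertion, which is the abort shown in the trace. */
    return 0;
}
```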
systemd-rhel-bot pushed a commit that referenced this pull request on Dec 3, 2019
msekletar pushed a commit to msekletar/systemd-rhel8 that referenced this pull request on Dec 22, 2023
I had a test machine with ulimit -n set to 1073741816 through PAM ("session required pam_limits.so set_all", which copies the limits from PID 1, left over from testing of #10921). test-execute would "hang" and then fail with a timeout when running exec-inaccessiblepaths-proc.service.

It turns out that the problem was in close_all_fds(), which would go to the fallback path of doing close() 1073741813 times. Let's just fail if we hit this case. This only matters for cases where both /proc is inaccessible and the *soft* limit has been raised.

```
(gdb) bt
#0  0x00007f7e2e73fdc8 in close () from target:/lib64/libc.so.6
#1  0x00007f7e2e42cdfd in close_nointr () from target:/home/zbyszek/src/systemd-work3/build-rawhide/src/shared/libsystemd-shared-241.so
#2  0x00007f7e2e42d525 in close_all_fds () from target:/home/zbyszek/src/systemd-work3/build-rawhide/src/shared/libsystemd-shared-241.so
#3  0x0000000000426e53 in exec_child ()
#4  0x0000000000429578 in exec_spawn ()
#5  0x00000000004ce1ab in service_spawn ()
#6  0x00000000004cff77 in service_enter_start ()
#7  0x00000000004d028f in service_enter_start_pre ()
#8  0x00000000004d16f2 in service_start ()
#9  0x00000000004568f4 in unit_start ()
#10 0x0000000000416987 in test ()
#11 0x0000000000417632 in test_exec_inaccessiblepaths ()
#12 0x0000000000419362 in run_tests ()
#13 0x0000000000419632 in main ()
```

(cherry picked from commit 6a461d1)

Related: RHEL-18302
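As an aside, here is a small, self-contained C sketch of the guard described in that commit message: enumerate /proc/self/fd when it is available, and refuse the brute-force close() loop when /proc is inaccessible and the soft RLIMIT_NOFILE is enormous. This is an editor's illustration, not systemd's close_all_fds(); the cut-off TOY_MAX_BRUTE_FORCE_FDS and the -EOPNOTSUPP return value are arbitrary choices made for the example.

```c
/* Editor's sketch of the guard described above, not systemd's
 * close_all_fds(). Names and limits here are hypothetical. */
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical cut-off; the point is only that a ~10^9 close() loop
 * must be rejected rather than attempted. */
#define TOY_MAX_BRUTE_FORCE_FDS 65536

static int toy_close_all_fds(void) {
    DIR *d = opendir("/proc/self/fd");
    if (d) {
        /* Preferred path: close exactly the fds that are actually open. */
        struct dirent *de;
        int dir_fd = dirfd(d);

        while ((de = readdir(d))) {
            if (de->d_name[0] == '.')
                continue;
            int fd = atoi(de->d_name);
            if (fd > STDERR_FILENO && fd != dir_fd)
                close(fd);
        }
        closedir(d);
        return 0;
    }

    /* Fallback path: /proc is inaccessible, so the only upper bound we
     * have is the soft limit. Fail instead of looping a billion times. */
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
        return -errno;
    if (rl.rlim_cur > TOY_MAX_BRUTE_FORCE_FDS)
        return -EOPNOTSUPP;

    for (int fd = STDERR_FILENO + 1; fd < (int) rl.rlim_cur; fd++)
        close(fd);
    return 0;
}

int main(void) {
    int r = toy_close_all_fds();
    printf("toy_close_all_fds() returned %d\n", r);
    return 0;
}
```

With the 1073741816 soft limit from the commit message and /proc unavailable, the fallback branch in this sketch returns an error immediately instead of calling close() roughly a billion times, which mirrors the decision the commit message describes.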
github-actions bot pushed a commit that referenced this pull request on Jan 8, 2024