
job: Don't mark as redundant if deps are relevant #14086

Merged (1 commit) · Jan 3, 2020

Conversation

kkuehlz
Contributor

@kkuehlz kkuehlz commented Nov 19, 2019

In the steps given in #13850, the resulting graph looks like:

C (Anchor) -> B -> A

Since B is inactive, it will be flagged as redundant and removed from
the transaction, causing A to get garbage collected. The proposed fix is
to not mark nodes as redundant if doing so causes a relevant node to be
garbage collected.

Fixes #13850

@kkuehlz
Contributor Author

kkuehlz commented Nov 19, 2019

This is just one approach. Some other ways to skin this would be to still let the job get marked as redundant but flatten the graph before GC, or to not GC units that matter to the anchor, etc.

@mrc0mmand
Member

Something in this patch is killing the CentOS CI machines:

[   10.449867] systemd[1]: segfault at 7ffcc0b4eff8 ip 000055809e50f9df sp 00007ffcc0b4f000 error 6 in systemd[55809e4e0000+bc000]
[   10.451687] Code: 00 bf 00 00 00 00 e8 c0 40 fd ff 55 48 89 e5 41 54 53 4c 8b 67 40 4d 85 e4 0f 84 88 00 00 00 89 f3 49 8b 44 24 08 48 8b 78 08 <e8> 0d 98 04 00 89 c2 85 db 74 14 83 fb 02 75 49 b8 01 00 00 00 83
[   10.580863] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
[   10.582132] CPU: 1 PID: 1 Comm: systemd Not tainted 5.3.11-arch1-1 #1
[   10.583146] Hardware name: Red Hat KVM, BIOS 1.11.1-4.module_el8.0.0+189+f9babebb 04/01/2014
[   10.584430] Call Trace:
[   10.584816]  dump_stack+0x66/0x90
[   10.585335]  panic+0x112/0x2e8
[   10.585807]  do_exit.cold+0x76/0xd3
[   10.586359]  do_group_exit+0x33/0xa0
[   10.586914]  get_signal+0x136/0x8d0
[   10.587451]  do_signal+0x43/0x680
[   10.587976]  exit_to_usermode_loop+0xbe/0x110
[   10.588641]  prepare_exit_to_usermode+0x67/0x90
[   10.589337]  retint_user+0x8/0x8
[   10.589840] RIP: 0033:0x55809e50f9df
[   10.590398] Code: Bad RIP value.
[   10.590904] RSP: 002b:00007ffcc0b4f000 EFLAGS: 00010206
[   10.591701] RAX: 00005580a06e3bf0 RBX: 0000000000000000 RCX: 000055809e50fa95
[   10.592787] RDX: 00000000fffffffe RSI: 0000000000000000 RDI: 00005580a06b14f0
[   10.593880] RBP: 00007ffcc0b4f010 R08: 00005580a05de022 R09: 0000000000000003
[   10.594957] R10: 0000000000000000 R11: 00007ff54e3cca40 R12: 00005580a06e3ca0
[   10.596949] R13: 00007ffcc134e0c8 R14: 00005580a05e24a0 R15: 00005580a06dbf40
[   10.598677] Kernel Offset: 0x39400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   10.600432] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b ]---

@kkuehlz
Contributor Author

kkuehlz commented Nov 20, 2019

@mrc0mmand Apologies. Once I built it onto my system tree I was able to repro the panic. It didn't happen with the mkosi image. Should be fixed now.

@poettering
Member

looks correct. thanks!

@poettering poettering merged commit 097537f into systemd:master Jan 3, 2020
@AdamWill
Contributor

I suspect this may not actually be correct:

https://bugzilla.redhat.com/show_bug.cgi?id=1803293

We're running into an issue in Fedora 32 and Rawhide where plymouth-start.service (and several other services) are running after a TTY is spawned or a login session starts. I did a bit of digging this morning, and the results seem to point to this change as a suspect.

@AdamWill
Contributor

Yep, reverting this change definitely seems to fix that bug.

@kkuehlz
Contributor Author

kkuehlz commented Feb 18, 2020

@AdamWill Gotcha. If this patch is changing the behavior of existing unit files then it should be reverted. Even if there is something that can be changed in the unit, the edge case it fixes is not worth a regression.

@mrc0mmand
Member

This is yet another case where it would be really great if we could, somehow, resurrect the Fedora Rawhide CI for the systemd upstream repo.

@mbiebl
Contributor

mbiebl commented Mar 12, 2020

I can confirm the breakage. See #15091 (I suspect #15090 and #15062 are duplicates).
There is also https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=953670

I ran a git bisect and ended up with 097537f as the first faulty commit. Imho this change should be reverted until a solution is found that doesn't break plymouth.

@AdamWill
Contributor

It doesn't only affect Plymouth, it affects over a dozen other services. See this comment.

@mbiebl
Contributor

mbiebl commented Mar 12, 2020

@AdamWill good to know. I'll be reverting this commit downstream as this seems to hit users hard.
I see that you did the same for Fedora.

@keszybz
Member

keszybz commented Mar 12, 2020

It doesn't only affect Plymouth, it affects over a dozen other services.

Actually the only ones known so far are systemd-network-generator.service, systemd-vconsole-setup.service, and plymouth-start.service. And systemd-network-generator.service is idempotent, so not running it twice is just a minor optimization.

(Bugzilla #1803293 lists some other services, but they already had RemainAfterExit=true, so they would not be invoked again anyway.)

--
@yuwata, I removed the "needs fixing" label, because we don't use that for external projects.

@AdamWill
Contributor

What's with all the other service names listed in that comment, then?

Anyhow, at least systemd-vconsole-setup is also very significant: that one being run again seemed to result in about half of the keypresses on tty1 being eaten somehow. It was almost impossible to log in or type anything. That was downstream bug 1801272.

@keszybz
Member

keszybz commented Mar 13, 2020

What's with all the other service names listed in that comment, then?

Not sure. Most of those units have Type=oneshot, RemainAfterExit=yes. In v244 it is possible that in some scenarios they would not run even though they should, but it would be unexpected if they ran twice. I guess that they had already been started, and those lines in the log are just noise: systemd deciding not to run them again, and it doesn't really matter why. If you paste a full log, we could try to figure it out, but I don't think it is worth the time.
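For reference, the pattern described above looks roughly like this in a unit file (an illustrative example, not one of the affected units):

```ini
# With RemainAfterExit=yes the unit stays "active (exited)" after the
# command finishes, so a later start job is a no-op rather than a
# second invocation of ExecStart=.
[Unit]
Description=Example one-time setup task

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/true
```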

@sergio-correia

So, I have been having weird issues on ArchLinux with systemd 245 recently, and today I was able to bisect and got this as the first bad commit.

Initially I noticed issues when trying to use VMs with virt-manager. For some reason, I would start a VM and nothing happened, it just stayed frozen.

I tried to restart libvirtd and then I saw systemd asking me for my LUKS password. Digging further, it seems any operation using systemctl would request my LUKS password (after boot, wait a few minutes for the password provided in the boot to expire from the keyring, then try to start/restart/stop a service and it would ask me for the LUKS password). I do not have plymouth in this system.

Any ideas what is going on here?

@sergio-correia

And I guess I have found my issue: somehow I had the same device listed twice in crypttab, and one of the systemd-cryptsetup units associated with this device would be marked as failed after boot.

After this commit it seems systemd keeps retrying failed units (question in #15242), which is what caused the issue I had. It is a little subtle in my case: since the password is cached for a little while, the issue won't be reproducible immediately.
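The misconfiguration described above would look something like this in /etc/crypttab (the names and UUID here are made up for illustration):

```
# Same underlying device listed twice. Both entries generate a
# systemd-cryptsetup@.service instance; the second fails at boot
# because the first already set up a mapping for the device, and
# after this commit that failed unit keeps being retried.
luks-data   UUID=0a1b2c3d-1111-2222-3333-444455556666  none  luks
luks-data2  UUID=0a1b2c3d-1111-2222-3333-444455556666  none  luks
```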

Successfully merging this pull request may close these issues.

PartOf= does not work on inactive units