
job: Don't mark as redundant if deps are relevant #14086

Merged (1 commit) · Jan 3, 2020

Conversation

kkuehlz
Contributor

@kkuehlz kkuehlz commented Nov 19, 2019

In the steps given in #13850, the resulting graph looks like:

C (Anchor) -> B -> A

Since B is inactive, it will be flagged as redundant and removed from
the transaction, causing A to get garbage collected. The proposed fix is
to not mark nodes as redundant if doing so causes a relevant node to be
garbage collected.

Fixes #13850

@kkuehlz
Contributor Author

kkuehlz commented Nov 19, 2019

This is just one approach. Some other ways to skin this would be to still let the job get marked as redundant but flatten the graph before GC, or to not GC units that matter to the anchor, etc.

@mrc0mmand
Member

Something in this patch is killing the CentOS CI machines:

[   10.449867] systemd[1]: segfault at 7ffcc0b4eff8 ip 000055809e50f9df sp 00007ffcc0b4f000 error 6 in systemd[55809e4e0000+bc000]
[   10.451687] Code: 00 bf 00 00 00 00 e8 c0 40 fd ff 55 48 89 e5 41 54 53 4c 8b 67 40 4d 85 e4 0f 84 88 00 00 00 89 f3 49 8b 44 24 08 48 8b 78 08 <e8> 0d 98 04 00 89 c2 85 db 74 14 83 fb 02 75 49 b8 01 00 00 00 83
[   10.580863] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
[   10.582132] CPU: 1 PID: 1 Comm: systemd Not tainted 5.3.11-arch1-1 #1
[   10.583146] Hardware name: Red Hat KVM, BIOS 1.11.1-4.module_el8.0.0+189+f9babebb 04/01/2014
[   10.584430] Call Trace:
[   10.584816]  dump_stack+0x66/0x90
[   10.585335]  panic+0x112/0x2e8
[   10.585807]  do_exit.cold+0x76/0xd3
[   10.586359]  do_group_exit+0x33/0xa0
[   10.586914]  get_signal+0x136/0x8d0
[   10.587451]  do_signal+0x43/0x680
[   10.587976]  exit_to_usermode_loop+0xbe/0x110
[   10.588641]  prepare_exit_to_usermode+0x67/0x90
[   10.589337]  retint_user+0x8/0x8
[   10.589840] RIP: 0033:0x55809e50f9df
[   10.590398] Code: Bad RIP value.
[   10.590904] RSP: 002b:00007ffcc0b4f000 EFLAGS: 00010206
[   10.591701] RAX: 00005580a06e3bf0 RBX: 0000000000000000 RCX: 000055809e50fa95
[   10.592787] RDX: 00000000fffffffe RSI: 0000000000000000 RDI: 00005580a06b14f0
[   10.593880] RBP: 00007ffcc0b4f010 R08: 00005580a05de022 R09: 0000000000000003
[   10.594957] R10: 0000000000000000 R11: 00007ff54e3cca40 R12: 00005580a06e3ca0
[   10.596949] R13: 00007ffcc134e0c8 R14: 00005580a05e24a0 R15: 00005580a06dbf40
[   10.598677] Kernel Offset: 0x39400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   10.600432] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b ]---

@kkuehlz
Contributor Author

kkuehlz commented Nov 20, 2019

@mrc0mmand Apologies. Once I built it onto my system tree I was able to repro the panic. It didn't happen with the mkosi image. Should be fixed now.

@poettering
Member

looks correct. thanks!

@poettering poettering merged commit 097537f into systemd:master Jan 3, 2020
@AdamWill
Contributor

I suspect this may not actually be correct:

https://bugzilla.redhat.com/show_bug.cgi?id=1803293

We're running into an issue in Fedora 32 and Rawhide where plymouth-start.service (and several other services) are running after a TTY is spawned or a login session starts. I did a bit of digging this morning, and the results seem to point to this change as a suspect.

@AdamWill
Contributor

Yep, reverting this change definitely seems to fix that bug.

@kkuehlz
Contributor Author

kkuehlz commented Feb 18, 2020

@AdamWill Gotcha. If this patch is changing the behavior of existing unit files then it should be reverted. Even if there is something that can be changed in the unit, the edge case it fixes is not worth a regression.

@mrc0mmand
Member

This is yet another case where it would be really great if we could, somehow, resurrect the Fedora Rawhide CI for the systemd upstream repo.

@mbiebl
Contributor

mbiebl commented Mar 12, 2020

I can confirm the breakage. See #15091 (I suspect #15090 and #15062 are duplicates).
There is also https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=953670

I ran a git bisect and ended up with 097537f as the first faulty commit. Imho this change should be reverted until a solution is found that doesn't break plymouth.

@AdamWill
Contributor

It doesn't only affect Plymouth, it affects over a dozen other services. See this comment.

@mbiebl
Contributor

mbiebl commented Mar 12, 2020

@AdamWill good to know. I'll be reverting this commit downstream as this seems to hit users hard.
I see that you did the same for Fedora.

@keszybz
Member

keszybz commented Mar 12, 2020

It doesn't only affect Plymouth, it affects over a dozen other services.

Actually the only ones known so far are systemd-network-generator.service, systemd-vconsole-setup.service, and plymouth-start.service. And systemd-network-generator.service is idempotent, so not running it twice is just a minor optimization.

(Bugzilla #1803293 lists some other services, but they already had RemainAfterExit=true, so they would not be invoked again anyway.)

--
@yuwata, I removed the "needs fixing" label, because we don't use that for external projects.

@AdamWill
Contributor

What's with all the other service names listed in that comment, then?

Anyhow, at least systemd-vconsole-setup is also very significant: that one being run again seemed to result in about half of the keypresses on tty1 being eaten somehow. It was almost impossible to log in or type anything. That was downstream bug 1801272.

@keszybz
Member

keszybz commented Mar 13, 2020

What's with all the other service names listed in that comment, then?

Not sure. Most of those units have Type=oneshot, RemainAfterExit=yes. In v244 it is possible that in some scenarios they would not run even though they should, but it would be unexpected if they ran twice. I guess that they had already been started, and those lines in the log are just noise: systemd deciding not to run them again, and it doesn't really matter why. If you paste a full log, we could try to figure it out, but I don't think it is worth the time.
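For reference, the pattern described above looks roughly like this in a unit file (an illustrative example, not one of the affected units):

```ini
# With RemainAfterExit=yes the unit stays "active (exited)" after the
# command finishes, so a later start job is a no-op rather than a
# second invocation of ExecStart=.
[Unit]
Description=Example one-time setup task

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/true
```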

@sergio-correia

So, I have been having weird issues on ArchLinux with systemd 245 recently, and today I was able to bisect and got this as the first bad commit.

Initially I noticed issues when trying to use VMs with virt-manager. For some reason, I would start a VM and nothing happened, it just stayed frozen.

I tried to restart libvirtd and then I saw systemd asking me for my LUKS password. Digging further, it seems any operation using systemctl would request my LUKS password (after boot, wait a few minutes for the password provided in the boot to expire from the keyring, then try to start/restart/stop a service and it would ask me for the LUKS password). I do not have plymouth in this system.

Any ideas what is going on here?

@sergio-correia

And I guess I have found my issue: somehow I had the same device listed twice in crypttab, and one of the systemd-cryptsetup units associated with this device would be marked as failed after boot.

After this commit it seems systemd keeps retrying failed units (question in #15242), which is what caused the issue I had. It is a little subtle in my case: since the password is cached for a little while, the issue won't be reproducible immediately.
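The misconfiguration described above would look something like this in /etc/crypttab (the names and UUID here are made up for illustration):

```
# Same underlying device listed twice. Both entries generate a
# systemd-cryptsetup@.service instance; the second fails at boot
# because the first already set up a mapping for the device, and
# after this commit that failed unit keeps being retried.
luks-data   UUID=0a1b2c3d-1111-2222-3333-444455556666  none  luks
luks-data2  UUID=0a1b2c3d-1111-2222-3333-444455556666  none  luks
```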

Successfully merging this pull request may close these issues.

PartOf= does not work on inactive units