
Booting to multi-user.target and doing systemctl isolate graphical.target does not work with 253 #26364

Closed
AdamWill opened this issue Feb 8, 2023 · 15 comments · Fixed by #26388
Labels
bug 🐛 (Programming errors, that need preferential fixing)
downstream/fedora (Tracking bugs for Fedora)
pid1
regression ⚠️ (A bug in something that used to work correctly and broke through some recent commit)
systemctl
Milestone

Comments

@AdamWill
Contributor

AdamWill commented Feb 8, 2023

systemd version the issue has been seen with

253

Used distribution

Fedora Rawhide

Linux kernel version used

6.2.0-0.rc7.20230207git05ecb680708a.51.fc38.x86_64

CPU architectures issue was seen on

x86_64

Component

systemctl, systemd

Expected behaviour you didn't see

On a Fedora Rawhide system with a graphical desktop installed that usually boots successfully to graphical.target, I instead boot to multi-user.target, and run systemctl isolate graphical.target. This should bring up the graphical desktop.

Unexpected behaviour you saw

Instead, the system simply becomes stuck at a blank screen. This started happening when systemd 253 landed in Rawhide; with 252 it was fine.

Steps to reproduce the problem

Install Fedora Rawhide (Workstation or KDE - find images at https://openqa.fedoraproject.org/nightlies.html), boot normally to verify it works, then boot to multi-user.target (by changing the default target or booting with the "3" kernel argument), log in as root, and run systemctl isolate graphical.target.
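For reference, the two boot paths mentioned (changing the default target vs. the "3" kernel argument) look roughly like this; a sketch of standard systemctl usage as root, not output captured from the affected machine:

```shell
# Option A: make text mode the default for subsequent boots
systemctl set-default multi-user.target   # repoints the default.target symlink
systemctl reboot

# Option B: one-off, append "3" (an alias for multi-user.target)
# to the kernel command line from the boot loader menu.

# After booting to multi-user.target and logging in as root:
systemctl isolate graphical.target   # hangs with systemd 253; fine with 252
```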

journal messages from a boot reproducing the issue

Additional program output to the terminal or log subsystem illustrating the issue

No response

@AdamWill AdamWill added the bug 🐛 Programming errors, that need preferential fixing label Feb 8, 2023
@AdamWill
Contributor Author

AdamWill commented Feb 8, 2023

Downstream report: https://bugzilla.redhat.com/show_bug.cgi?id=2165692

@yuwata yuwata added the regression ⚠️ A bug in something that used to work correctly and broke through some recent commit label Feb 8, 2023
@yuwata yuwata added this to the v253 milestone Feb 8, 2023
@dtardon dtardon added the downstream/fedora Tracking bugs for Fedora label Feb 9, 2023
@keszybz
Member

keszybz commented Feb 9, 2023

When isolate is executed, we start stopping all kinds of units, including dbus-broker.service, which essentially brings the machine down.

The first question is what changed. 017a7ba looks a bit suspicious. There aren't any other changes to src/core/ that seem related.

The first answer is that isolate is known-busted. I think we need to keep it limping along, but it'd be better to just not use it.

@AdamWill
Contributor Author

AdamWill commented Feb 9, 2023

I can't actually think of a way to not use it in this case.

The use case is testing update notifications. For Fedora, we have a release criterion that the desktop must not notify the user about updates when running live, but must notify the user about updates when installed. So we need to test that.

Unfortunately, the desktops - GNOME in particular - try to be very clever about when and what to notify about, so in order for the test to be reliable, we first need to prepare some stuff so we're absolutely sure that we're in a scenario where an update notification would be shown, unless we're on the live path and the "we're running live" stipulation should prevent it.

We obviously want to do that before we reach the desktop, otherwise our attempt to fiddle with things is racing with the desktop actually checking for updates and notification timers kicking in and so on.

So the test in question boots to multi-user and does a bunch of prep - setting up a repo and downgrading a package to a dummy version to ensure an update is definitely available, then fiddling with various settings to game all the desktop's heuristics to make sure it ought to notify of the update. On GNOME we even have to set the system clock in some circumstances, because GNOME has a rule that it doesn't show update notifications between midnight and 6am (boy, that was fun to track down).

Once we're done with all of that, we do systemctl isolate graphical.target to get the graphical environment to actually start so we can run the test.
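Roughly, the prep-then-isolate flow described above might look like the following; the package name is hypothetical, and the real openQA test does considerably more (repo setup, per-desktop settings tweaks):

```shell
# Guarantee that an update is available by downgrading something
dnf -y downgrade some-package        # hypothetical package name

# Dodge GNOME's midnight-to-6am notification blackout when necessary
timedatectl set-time "2023-02-09 12:00:00"

# ...per-desktop tweaks to ensure the update check will notify...

# Finally bring up the desktop so the actual test can run
systemctl isolate graphical.target
```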

On the 'installed system' path we could just reboot at that point instead, sure. But on the "live boot" path, we obviously can't - you can't reboot a live system. I can't think of any other practical way to do all this for a live scenario. Can you?

@YHNdnzj
Member

YHNdnzj commented Feb 9, 2023 via email

keszybz added a commit to keszybz/systemd that referenced this issue Feb 9, 2023
This reverts commit 5d71e46.

It turns out that this commit caused a noticeable change in behaviour for
'systemctl isolate graphical.target' in Fedora, as found by git bisect.
Reverting on top of current git also restores behaviour from v252. I don't have
time to analyze this right now, so this is a quick revert to unblock Fedora
and possibly allow us to release v253 in case a full solution is harder.

Fixes systemd#26364.
@keszybz
Member

keszybz commented Feb 9, 2023

The problem is that isolate is a bad idea. The systemd unit model is a mix of dependency-based logic and event-based logic, including hardware changes and user actions like logins. The isolation logic works for a static dependency-based system, but is very hard to reconcile with units started in response to state changes outside of systemd.

We have been papering this over by adding IgnoreOnIsolate on this and that, but this is not a solution. In particular, it would require that we only use isolate for one specific purpose. As soon as you have units that should be stopped in one target that might be isolated but not in some other one, this approach breaks down.

This can be compared with the starting of units: units are grouped into targets, and arbitrary combinations can be started and stopped via Conflicts depending on what is needed.
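For context, the IgnoreOnIsolate papering-over mentioned here takes the form of a unit-file setting. A minimal sketch, using a placeholder unit name and writing to a scratch directory purely for illustration (a real drop-in would live under /etc/systemd/system/example.service.d/ and require root):

```shell
# Create a drop-in that exempts a unit from isolate's stop sweep.
# "example.service" is a placeholder; "demo" is scratch space.
mkdir -p demo/example.service.d
cat > demo/example.service.d/ignore-on-isolate.conf <<'EOF'
[Unit]
# Keep this unit running when another unit is isolated
IgnoreOnIsolate=yes
EOF

cat demo/example.service.d/ignore-on-isolate.conf
```

After editing a real drop-in, a systemctl daemon-reload is needed before the setting takes effect.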

> I can't think of any other practical way to do all this for a live scenario.

Why not just do systemctl start graphical.target?

@AdamWill
Contributor Author

AdamWill commented Feb 9, 2023

well, if that's supported/intended I can certainly do that. I just recall from past experience/documentation that isolate was supposed to be The Right Way to change between targets. If using start is supposed to be OK, though, I can certainly give that a shot.

@AdamWill
Contributor Author

AdamWill commented Feb 9, 2023

OK, using start seems to work, so I changed the test to do that.

@dtardon
Collaborator

dtardon commented Feb 10, 2023

> well, if that's supported/intended I can certainly do that. I just recall from past experience/documentation that isolate was supposed to be The Right Way to change between targets.

Only if one wants to mimic the "runlevels" behavior, i.e., only units needed by the new target should continue to run. If one wants to run something in addition to the current set, then systemctl start ... is the right way. There's no difference in this case anyway, as graphical.target requires multi-user.target. Therefore, a boot into graphical.target and a boot into multi-user.target followed by systemctl start graphical.target should have the same effect. The second one just splits the operation into two steps, that's all.
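The dependency described here can be inspected on any systemd system; exact output varies by distribution, so no specific output is shown:

```shell
# graphical.target is expected to list multi-user.target in Requires=
systemctl show -p Requires graphical.target

# Show the full tree that "systemctl start graphical.target" would pull in
systemctl list-dependencies graphical.target
```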

@poettering
Member

poettering commented Feb 10, 2023

We need debug logs for this, i.e. run systemd-analyze log-level debug before you trigger the issue. Otherwise there's nothing we can do.
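Collecting the requested logs is straightforward with standard tooling (the output file name is arbitrary):

```shell
# Switch the service manager to debug logging for this boot
systemd-analyze log-level debug

# Reproduce the hang:
#   systemctl isolate graphical.target

# Afterwards (e.g. from another VT or over SSH), save the journal
journalctl -b --no-pager > isolate-debug.log
```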

poettering added a commit to poettering/systemd that referenced this issue Feb 10, 2023
…gered by units we keep running

Inspired by: systemd#26364

(this might even "fix" systemd#26364, but without debug logs it's hard to make
such claims)
poettering added a commit to poettering/systemd that referenced this issue Feb 10, 2023
…gered by units we keep running

Inspired by: systemd#26364

(this might even "fix" systemd#26364, but without debug logs it's hard to make
such claims)
@YHNdnzj
Member

YHNdnzj commented Feb 10, 2023

I reproduced this with debug log enabled in a Fedora VM: https://fars.ee/y5Ps

However, I was not able to reproduce this on Arch - dbus.service is not stopped there (tried both freedesktop/dbus and dbus-broker).

@bluca
Member

bluca commented Feb 10, 2023

> I reproduced this with debug log enabled in a Fedora VM: https://fars.ee/y5Ps
>
> However, I was not able to reproduce this on Arch - dbus.service is not stopped there (tried both freedesktop/dbus and dbus-broker).

@YHNdnzj are you able to reproduce it after applying the fix from #26388 ?

@keszybz
Member

keszybz commented Feb 10, 2023

>> well, if that's supported/intended I can certainly do that. I just recall from past experience/documentation that isolate was supposed to be The Right Way to change between targets.
>
> Only if one wants to mimic the "runlevels" behavior, i.e., only units needed by the new target should continue to run.

The analogy with runlevels in sysvinit is not exact. When a runlevel change was triggered, sysvinit would start a bunch of scripts (S* and K*), depending on the runlevel configuration. But stuff that was not covered by those scripts wouldn't generally be touched. There was no notion of "kill everything that doesn't have an S script for this runlevel", just because there was no notion of ownership of processes. So e.g. stuff that would have been launched in response to udev triggers would almost certainly survive a runlevel change. Similarly, stuff spawned from other services, e.g. user sessions or remote logins, would likewise not be touched. This is different from isolate, where things are either in the dependency tree of the new target, or explicitly excluded, or killed.

@YHNdnzj
Member

YHNdnzj commented Feb 10, 2023

>> I reproduced this with debug log enabled in a Fedora VM: https://fars.ee/y5Ps
>> However, I was not able to reproduce this on Arch - dbus.service is not stopped there (tried both freedesktop/dbus and dbus-broker).
>
> @YHNdnzj are you able to reproduce it after applying the fix from #26388 ?

I'm not really familiar with Fedora's build system 🤔

But TBH it feels weird that this doesn't trigger on Arch

@bluca
Member

bluca commented Feb 10, 2023

>>> I reproduced this with debug log enabled in a Fedora VM: https://fars.ee/y5Ps
>>> However, I was not able to reproduce this on Arch - dbus.service is not stopped there (tried both freedesktop/dbus and dbus-broker).
>>
>> @YHNdnzj are you able to reproduce it after applying the fix from #26388 ?
>
> I'm not really familiar with Fedora's build system 🤔
>
> But TBH it feels weird that this doesn't trigger on Arch

@YHNdnzj I have not tried it, but there are instructions to install the packages built by the CI from that PR, so it shouldn't be necessary to build them by hand if you want to give it a shot: https://dashboard.packit.dev/results/copr-builds/614038

@keszybz
Member

keszybz commented Feb 10, 2023

No need, I'm checking Lennart's patch now.

poettering added a commit to poettering/systemd that referenced this issue Feb 10, 2023
…gered by units we keep running

Inspired by: systemd#26364

(this might even "fix" systemd#26364, but without debug logs it's hard to make
such claims)

Fixes: systemd#23055
poettering added a commit to poettering/systemd that referenced this issue Feb 10, 2023
…gered by units we keep running

Inspired by: systemd#26364

(this might even "fix" systemd#26364, but without debug logs it's hard to make
such claims)

Fixes: systemd#23055
bluca pushed a commit that referenced this issue Feb 10, 2023
…gered by units we keep running

Inspired by: #26364

(this might even "fix" #26364, but without debug logs it's hard to make
such claims)

Fixes: #23055
d-hatayama pushed a commit to d-hatayama/systemd that referenced this issue Feb 15, 2023
…gered by units we keep running

Inspired by: systemd#26364

(this might even "fix" systemd#26364, but without debug logs it's hard to make
such claims)

Fixes: systemd#23055
keszybz pushed a commit to keszybz/systemd that referenced this issue Mar 30, 2023
…gered by units we keep running

Inspired by: systemd#26364

(this might even "fix" systemd#26364, but without debug logs it's hard to make
such claims)

Fixes: systemd#23055
(cherry picked from commit 32d6707)
valentindavid pushed a commit to valentindavid/systemd that referenced this issue Aug 8, 2023
…gered by units we keep running

Inspired by: systemd#26364

(this might even "fix" systemd#26364, but without debug logs it's hard to make
such claims)

Fixes: systemd#23055
(cherry picked from commit 32d6707)
(cherry picked from commit c973e22)
Werkov pushed a commit to Werkov/systemd that referenced this issue Nov 1, 2023
…gered by units we keep running

Inspired by: systemd#26364

(this might even "fix" systemd#26364, but without debug logs it's hard to make
such claims)

Fixes: systemd#23055
(cherry picked from commit 32d6707)
(cherry picked from commit c973e22)
(cherry picked from commit bfe6d1d)
(cherry picked from commit 54b580e)