New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
StartServiceByName does not report service starting failure #28877
Comments
Have you tried with dbus-broker instead? |
No, because every single distro I use ships the reference dbus by default. The reference dbus maintainers state that the code to do systemd activation was contributed by systemd maintainers and that they expect someone to send them a patch to fix it. I'm personally suspicious of it being a regression in some form because it seems very unlikely this didn't work originally. n.b. I ran busctl monitor and didn't find any clear evidence for there being a signal emitted by systemd that dbus could use to know of the failure but I may have missed it. |
Which distro is that? I'm pretty sure dbus-broker can be installed on most distributions - fedora, debian, ubuntu, suse |
by default is the keyword here: the distros in question are Arch and NixOS both of which definitely allow using dbus-broker, but they do not do it by default, so the vast majority of their user-bases who do not have dbus daemon opinions are affected by this bug. |
Sure, can you try again the same repro with dbus-broker then, and see if it still happens? |
I think I got it set up correctly. Here's the immediate output:
|
Does that mean you don't see the reported issue (long timeout)? |
Yes, correct, but I am not convinced this is necessarily a dbus reference bug: it could equally be that something on the systemd side is behaving in an unexpected way. I simply don't know. |
dbus-broker fully implements the transient unit start/signal watching protocol, while it seems dbus-daemon does not |
Hm, but this is not a transient unit, and indeed the production use case that was broken (all GTK applications taking 30 seconds to launch due to desktop portal failing to start; pretty painful) also was not a transient unit. |
I would (perhaps naively?) expect that systemd would send an ActivationFailure in this case, just based on the name of the thing. But I don't have full context about the semantics of that message. It looks like dbus-daemon makes that assumption too: https://gitlab.freedesktop.org/dbus/dbus/blob/7a4c47a929f1ae2b0e725f329488ea14c2e230db/bus/driver.c#L3007-L3048 |
The signals are the same across all units actually, with JobRemoved and friends |
Alright, I've looked through the systemd code. Here's what's happening: systemd treats ActivationFailure as a queuing failure but not a startup failure (since it yeets a job out the door without watching its result): Lines 191 to 195 in d936595
but dbus-daemon treats ActivationFailure as any failure: I think this is a simple disagreement about the semantic meaning of this undocumented(??) message, which is nobody's fault in particular, but it needs to be fixed on either side. I understand better why the d-bus folks consider this a systemd bug: they are handling the signal that they expect to be sent when the job fails, but on the systemd side the signal is not being sent except if the actual service failed to parse or some other failure that would cause queuing to fail. The fact that Lennart originally wrote that error path in dbus-daemon probably contributes to this perception ;p My read as someone who is not super deep in the internals of either side is that ActivationFailure from a d-bus perspective does sound like any failure to start the thing. |
That sounds about right, activation is just about enqueueing a job. The result of a job is sent via the JobRemoved signal. We cannot change any of these behaviours as applications are relying on them (I tried). |
Right. I checked also, this seems to line up with how systemd has treated this since 2010 based on git history. Given this information, I think the correct move is to subscribe to JobRemoved signals matching the name of the service, correct? |
Yes, that is correct |
From the d-bus maintainer:
Overall this feels like a plain miscommunication moreso than a software bug 😆 many of those around. The miscommunication dates back to 2010 when the whole lot was first implemented. Can ActivationFailure specifically not be fired in more cases on the systemd side, or is it more generally all the D-Bus API where this is the case? [I went looking for anything on the issue tracker identifying the previous attempt and couldn't find anything] |
I'm afraid we can't change how signals work - I tried, but it's just too ingrained in various clients, and would cause too many breakages |
systemd version the issue has been seen with
253
Used distribution
NixOS
Linux kernel version used
6.1.41
CPU architectures issue was seen on
x86_64
Component
systemd
Expected behaviour you didn't see
Should not timeout; should report failure in a reasonable time.
Unexpected behaviour you saw
Invoking StartServiceByName times out.
Steps to reproduce the problem
/usr/share/dbus-1/services/org.dummy.service
:/usr/lib/systemd/user/dummy.service
Then:
dbus-send --session --print-reply --reply-timeout=30000 --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.StartServiceByName string:'org.dummy' uint32:0
Sorta related to #21520, but that one is condition failure, whereas this is a demonstration that starting the unit failing also does not effectively report failure to dbus-daemon, which is more surprising.
Taken from https://gitlab.freedesktop.org/dbus/dbus/-/issues/472#note_2045734, which identifies this is probably a systemd bug.
Related downstream pain: https://gitlab.gnome.org/GNOME/gtk/-/issues/4830, all the issues marked related on #21520.
Additional program output to the terminal or log subsystem illustrating the issue
quoth the journal: Aug 17 16:42:22 tail-bot systemd[2831]: Starting Dummy service exiting with a failure code... Aug 17 16:42:22 tail-bot systemd[2831]: dummy.service: Main process exited, code=exited, status=1/FAILURE Aug 17 16:42:22 tail-bot systemd[2831]: dummy.service: Failed with result 'exit-code'. Aug 17 16:42:22 tail-bot systemd[2831]: Failed to start Dummy service exiting with a failure code. dbus-send says: Error org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
The text was updated successfully, but these errors were encountered: