Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StartServiceByName does not report service starting failure #28877

Open
lf- opened this issue Aug 17, 2023 · 18 comments
Open

StartServiceByName does not report service starting failure #28877

lf- opened this issue Aug 17, 2023 · 18 comments
Labels
bug 🐛 Programming errors, that need preferential fixing needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer pid1

Comments

@lf-
Copy link
Contributor

lf- commented Aug 17, 2023

systemd version the issue has been seen with

253

Used distribution

NixOS

Linux kernel version used

6.1.41

CPU architectures issue was seen on

x86_64

Component

systemd

Expected behaviour you didn't see

Should not timeout; should report failure in a reasonable time.

Unexpected behaviour you saw

Invoking StartServiceByName times out.

Steps to reproduce the problem

/usr/share/dbus-1/services/org.dummy.service:

[D-BUS Service]
Name=org.dummy
Exec=/usr/bin/false
SystemdService=dummy.service
AssumedAppArmorLabel=unconfined

/usr/lib/systemd/user/dummy.service

[Unit]
Description=Dummy service exiting with a failure code

[Service]
Type=dbus
BusName=org.dummy
ExecStart=/usr/bin/false
Slice=session.slice

Then:

dbus-send --session --print-reply --reply-timeout=30000 --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.StartServiceByName string:'org.dummy' uint32:0


Sorta related to #21520, but that one is condition failure, whereas this is a demonstration that starting the unit failing also does not effectively report failure to dbus-daemon, which is more surprising.

Taken from https://gitlab.freedesktop.org/dbus/dbus/-/issues/472#note_2045734, which identifies this is probably a systemd bug.

Related downstream pain: https://gitlab.gnome.org/GNOME/gtk/-/issues/4830, all the issues marked related on #21520.

Additional program output to the terminal or log subsystem illustrating the issue

quoth the journal:
Aug 17 16:42:22 tail-bot systemd[2831]: Starting Dummy service exiting with a failure code...
Aug 17 16:42:22 tail-bot systemd[2831]: dummy.service: Main process exited, code=exited, status=1/FAILURE
Aug 17 16:42:22 tail-bot systemd[2831]: dummy.service: Failed with result 'exit-code'.
Aug 17 16:42:22 tail-bot systemd[2831]: Failed to start Dummy service exiting with a failure code.



dbus-send says:
Error org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
@lf- lf- added the bug 🐛 Programming errors, that need preferential fixing label Aug 17, 2023
@github-actions github-actions bot added the pid1 label Aug 17, 2023
@bluca
Copy link
Member

bluca commented Aug 18, 2023

Have you tried with dbus-broker instead?

@bluca bluca added the needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer label Aug 18, 2023
@lf-
Copy link
Contributor Author

lf- commented Aug 18, 2023

Have you tried with dbus-broker instead?

No, because every single distro I use ships the reference dbus by default. The reference dbus maintainers state that the code to do systemd activation was contributed by systemd maintainers and that they expect someone to send them a patch to fix it. I'm personally suspicious of it being a regression in some form because it seems very unlikely this didn't work originally.

n.b. I ran busctl monitor and didn't find any clear evidence for there being a signal emitted by systemd that dbus could use to know of the failure but I may have missed it.

@bluca
Copy link
Member

bluca commented Aug 18, 2023

Which distro is that? I'm pretty sure dbus-broker can be installed on most distributions - fedora, debian, ubuntu, suse

@lf-
Copy link
Contributor Author

lf- commented Aug 18, 2023

by default is the keyword here: the distros in question are Arch and NixOS both of which definitely allow using dbus-broker, but they do not do it by default, so the vast majority of their user-bases who do not have dbus daemon opinions are affected by this bug.

@bluca
Copy link
Member

bluca commented Aug 18, 2023

Sure, can you try again the same repro with dbus-broker then, and see if it still happens?

@lf-
Copy link
Contributor Author

lf- commented Aug 18, 2023

I think I got it set up correctly. Here's the immediate output:

Error org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.dummy exited with status 1

@bluca
Copy link
Member

bluca commented Aug 18, 2023

Does that mean you don't see the reported issue (long timeout)?

@lf-
Copy link
Contributor Author

lf- commented Aug 18, 2023

Yes, correct, but I am not convinced this is necessarily a dbus reference bug: it could equally be that something on the systemd side is behaving in an unexpected way. I simply don't know.

@bluca
Copy link
Member

bluca commented Aug 18, 2023

dbus-broker fully implements the transient unit start/signal watching protocol, while it seems dbus-daemon does not

@lf-
Copy link
Contributor Author

lf- commented Aug 18, 2023

Hm, but this is not a transient unit, and indeed the production use case that was broken (all GTK applications taking 30 seconds to launch due to desktop portal failing to start; pretty painful) also was not a transient unit.

@lf-
Copy link
Contributor Author

lf- commented Aug 18, 2023

I would (perhaps naively?) expect that systemd would send an ActivationFailure in this case, just based on the name of the thing. But I don't have full context about the semantics of that message.

It looks like dbus-daemon makes that assumption too: https://gitlab.freedesktop.org/dbus/dbus/blob/7a4c47a929f1ae2b0e725f329488ea14c2e230db/bus/driver.c#L3007-L3048

@bluca
Copy link
Member

bluca commented Aug 18, 2023

The signals are the same across all units actually, with JobRemoved and friends

@lf-
Copy link
Contributor Author

lf- commented Aug 18, 2023

Alright, I've looked through the systemd code. Here's what's happening:

systemd treats ActivationFailure as a queuing failure but not a startup failure (since it yeets a job out the door without watching its result):

systemd/src/core/dbus.c

Lines 191 to 195 in d936595

r = sd_bus_message_new_signal(sd_bus_message_get_bus(message), &reply, "/org/freedesktop/systemd1", "org.freedesktop.systemd1.Activator", "ActivationFailure");
if (r < 0) {
bus_log_create_error(r);
return 0;
}

but dbus-daemon treats ActivationFailure as any failure:
https://gitlab.freedesktop.org/dbus/dbus/blob/7a4c47a929f1ae2b0e725f329488ea14c2e230db/bus/driver.c#L3007-L3048

I think this is a simple disagreement about the semantic meaning of this undocumented(??) message, which is nobody's fault in particular, but it needs to be fixed on either side. I understand better why the d-bus folks consider this a systemd bug: they are handling the signal that they expect to be sent when the job fails, but on the systemd side the signal is not being sent except if the actual service failed to parse or some other failure that would cause queuing to fail. The fact that Lennart originally wrote that error path in dbus-daemon probably contributes to this perception ;p

My read as someone who is not super deep in the internals of either side is that ActivationFailure from a d-bus perspective does sound like any failure to start the thing.

@bluca
Copy link
Member

bluca commented Aug 18, 2023

That sounds about right, activation is just about enqueueing a job. The result of a job is sent via the JobRemoved signal. We cannot change any of these behaviours as applications are relying on them (I tried).

@lf-
Copy link
Contributor Author

lf- commented Aug 18, 2023

Right. I checked also, this seems to line up with how systemd has treated this since 2010 based on git history.

Given this information, I think the correct move is to subscribe to JobRemoved signals matching the name of the service, correct?

@bluca
Copy link
Member

bluca commented Aug 18, 2023

Yes, that is correct

@lf-
Copy link
Contributor Author

lf- commented Aug 18, 2023

From the d-bus maintainer:

If this is important to you, I'd be happy to review a merge request [to d-bus] that worked around the systemd bug by listening to systemd's more elaborate job control signals, but I would consider that to be a workaround. I still think systemd should keep track of which jobs were created as a result of D-Bus activation, and send ActivationFailure if one of those fails.

Overall this feels like a plain miscommunication moreso than a software bug 😆 many of those around. The miscommunication dates back to 2010 when the whole lot was first implemented. Can ActivationFailure specifically not be fired in more cases on the systemd side, or is it more generally all the D-Bus API where this is the case? [I went looking for anything on the issue tracker identifying the previous attempt and couldn't find anything]

@bluca
Copy link
Member

bluca commented Aug 18, 2023

I'm afraid we can't change how signals work - I tried, but it's just too ingrained in various clients, and would cause too many breakages

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐛 Programming errors, that need preferential fixing needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer pid1
Development

No branches or pull requests

2 participants