Fix deadlock between nss-systemd and dbus (dbus call timeout) #22038

Closed
wants to merge 1 commit into from

slyon
Contributor

@slyon slyon commented Jan 7, 2022

The fix from fd63e71 wasn't updated when dynamic user lookup switched away from d-bus.

This solves a long-standing issue where dbus calls would hit a timeout. We have been observing it randomly since systemd v245, and it is really hard to reproduce. Most recently @nowrep came up with a good suggestion that seems to solve the issue!

We're looking for review of this suggested change.

Fixes: #15316

The fix from fd63e71 wasn't updated when dynamic user lookup switched away from d-bus.

Fixes: systemd#15316

Co-authored-by: David Rosca <nowrep@gmail.com>
@bluca
Member

bluca commented Jan 7, 2022

How is varlink causing a deadlock with dbus exactly? The old env var is no longer used for anything and could be removed yes, but could you please explain why skipping resolution of DynamicUser has anything to do with dbus-daemon, which uses a fixed user everywhere (afaik)?

@nowrep

nowrep commented Jan 7, 2022

It's deadlocking because both d-bus and varlink calls are blocking.

@bluca
Member

bluca commented Jan 7, 2022

But how? Which parts are? Dbus is clear as it's dbus-daemon that provides it, so it's a circular dependency. What's blocking the varlink resolution?

@nowrep

nowrep commented Jan 7, 2022

systemd does a blocking GetConnectionUnixUser call, and dbus will then (via nss-systemd) do a blocking varlink call to systemd.

It's exactly the same situation as before; it doesn't matter that nss-systemd no longer uses dbus to communicate with the systemd daemon. It still does a blocking call, so it will deadlock just the same.

@bluca
Member

bluca commented Jan 7, 2022

I see, what's the call stack? The commit message is very vague; it would greatly benefit from having all these details.

@bluca bluca added needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer and removed needs-discussion 🤔 labels Jan 10, 2022
@DemiMarie

> systemd does a blocking GetConnectionUnixUser call, and dbus will then (via nss-systemd) do a blocking varlink call to systemd.
>
> It's exactly the same situation as before; it doesn't matter that nss-systemd no longer uses dbus to communicate with the systemd daemon. It still does a blocking call, so it will deadlock just the same.

Why is systemd doing a blocking call? nss-systemd is stuck having to conform to a blocking interface but systemd is under no such restrictions.

@slyon
Contributor Author

slyon commented Feb 17, 2022

Unfortunately I am still unable to reproduce this issue myself.
@nowrep has been spot on with his suggestion for a workaround, I wonder if you know any more about the call stack that leads to this situation?

@nowrep

nowrep commented Feb 17, 2022

I don't have a stack trace as it only happens early during the boot (in my case it's triggered when starting user session).

The issue is easy to see, however. This is where systemd does the problematic blocking call:

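        /* Blocking call from PID 1 to the bus driver: it does not return until dbus-daemon replies. */
        r = sd_bus_call_method(
                        bus,
                        "org.freedesktop.DBus",
                        "/org/freedesktop/DBus",
                        "org.freedesktop.DBus",
                        "GetConnectionUnixUser",
                        NULL,
                        &reply,
                        "s",
                        unique ?: name);
        if (r < 0)
                return r;

        /* The reply carries the peer's UID as a single uint32. */
        r = sd_bus_message_read(reply, "u", &u);
        if (r < 0)
                return r;

        c->euid = u;
        c->mask |= SD_BUS_CREDS_EUID;

        reply = sd_bus_message_unref(reply);
} /* closes an enclosing block that starts outside this excerpt */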

Eventually dbus will call into nss-systemd while handling this call, and you have a deadlock.

It's a race and it will only deadlock in a rare case where this systemd call is the first time dbus will need to query info about that particular user (dbus has a cache for user database).
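
The circular wait described above can be sketched as a tiny wait-for graph. This is only an illustration, not code from systemd or dbus-daemon: the peer constants and `has_deadlock()` are hypothetical names, and each peer is assumed to block on at most one other peer at a time.

```c
#include <stdbool.h>

/* Hypothetical peers in the scenario; NOBODY means "not blocked on anyone". */
enum { PID1 = 0, DBUS_DAEMON = 1, N_PEERS = 2, NOBODY = -1 };

/* waits_on[p] is the peer that p is synchronously blocked on, or NOBODY.
 * Returns true if following the "blocked on" edges from some peer leads
 * back to that same peer, i.e. nobody in the cycle can ever make progress. */
static bool has_deadlock(const int waits_on[N_PEERS]) {
        for (int start = 0; start < N_PEERS; start++) {
                int cur = waits_on[start];

                /* A cycle through start must close within N_PEERS steps. */
                for (int steps = 0; steps < N_PEERS && cur != NOBODY; steps++) {
                        if (cur == start)
                                return true;
                        cur = waits_on[cur];
                }
        }

        return false;
}
```

With a cold user cache, PID 1 waits on dbus-daemon (`GetConnectionUnixUser`) while dbus-daemon waits on PID 1 (the varlink lookup via nss-systemd), so `has_deadlock((int[]){ DBUS_DAEMON, PID1 })` reports a cycle. With a warm cache dbus-daemon answers without waiting on anyone, and `has_deadlock((int[]){ DBUS_DAEMON, NOBODY })` reports none.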

@poettering
Member

> Why is systemd doing a blocking call? nss-systemd is stuck having to conform to a blocking interface but systemd is under no such restrictions.

Because it's hard. For incoming requests PID 1 often requests metadata (uids, pids, …) about the sender for doing auth and stuff, or to track clients otherwise. But doing cascades of async stuff from async handlers is kinda nasty to write in C and not lose track of what happens. Because of that we so far had the rule that the dbus broker is "special": it's OK from PID 1 to block on requests to the broker's own interfaces, but never OK to block on any of the services reachable through it. i.e. there's a special relationship between PID 1 and the broker anyway, and hence we just said: if it's requests that can be answered by the broker itself as opposed to the services reachable through it, then we are OK with synchronous requests, to keep things simpler on our side.

But of course things are never that easy: dbus-daemon does NSS requests and those are also blocking. So there might be a deadlock when PID 1 blocks on dbus-daemon for some call to the broker's own interface, and at the same time the broker blocks on PID1 for some NSS call that ultimately is varlink.

I think in the long run we have no other option than to make everything async when it comes to dbus, i.e. also query the uid/pid stuff async, even if it makes our stuff more complex... In the meantime we can hackishly turn off the PID1 provided NSS entries the way this commit suggests.

(The deadlock that fd63e71 addressed was a bit different btw, it was about an NSS module loaded into dbus-daemon deadlocking because it wanted to talk to dbus and thus dbus-daemon itself. i.e. it was a deadlock within the same process)

@poettering
Member

(btw, my educated guess is that this is an issue with dbus-daemon only, and not with dbus-broker, since the latter resolves names very differently and at very specific times only. That's why this was never noticed on Fedora)

@slyon
Contributor Author

slyon commented Feb 17, 2022

> (btw, my educated guess is that this is an issue with dbus-daemon only, and not with dbus-broker, since the latter resolves names very differently and at very specific times only. That's why this was never noticed on Fedora)

I can confirm that we did not see this issue when using dbus-broker

> In the meantime we can hackishly turn off the PID1 provided NSS entries the way this commit suggests.

So you feel like we could merge this?

@DemiMarie

> (btw, my educated guess is that this is an issue with dbus-daemon only, and not with dbus-broker, since the latter resolves names very differently and at very specific times only. That's why this was never noticed on Fedora)

How does dbus-broker avoid this deadlock?

> But doing cascades of async stuff from async handlers is kinda nasty to write in C and not lose track of what happens.

Would using Rust and async functions help? What about some sort of coroutine abstraction?

poettering added a commit to poettering/systemd that referenced this pull request Feb 17, 2022
A first step of removing blocking calls to the D-Bus broker from PID 1.
There's a lot more to go (i.e. grep src/core/ for sd_bus_creds
basically), but it's a start.

Removing blocking calls to D-Bus broker deals systematically with
deadlocks caused by dbus-daemon blocking on synchronous IPC calls back
to PID1 (e.g. Varlink calls through nss-systemd). Bugs such as systemd#15316.

Also-see: systemd#22038 (comment)
poettering added a commit to poettering/systemd that referenced this pull request Feb 17, 2022
There's currently a deadlock between PID 1 and dbus-daemon: in some
cases dbus-daemon will do NSS lookups (which are blocking) at the same
time PID 1 synchronously blocks on some call to dbus-daemon. Let's break
that by setting SYSTEMD_NSS_DYNAMIC_BYPASS=1 env var for dbus-daemon,
which will disable synchronously blocking varlink calls from nss-systemd
to PID 1.

In the long run we should fix this differently: remove all synchronous
calls to dbus-daemon from PID 1. This is not trivial however: so far we
had the rule that synchronous calls from PID 1 to the dbus broker are OK
as long as they only go to interfaces implemented by the broker itself
rather than services reachable through it. Given that the relationship
between PID 1 and dbus is kinda special anyway, this was considered
acceptable for the sake of simplicity, since we quite often need
metadata about bus peers from the broker, and the asynchronous logic
would substantially complicate even the simplest method handlers.

This mostly reworks the existing code that sets SYSTEMD_NSS_BYPASS_BUS=
(which is a similar hack to deal with deadlocks between nss-systemd and
dbus-daemon itself) to set SYSTEMD_NSS_DYNAMIC_BYPASS=1 instead. No code
was checking SYSTEMD_NSS_BYPASS_BUS= anymore anyway, and it used to
solve a similar problem, hence it's an obvious piece of code to rework
like this.

Issue originally tracked down by Lukas Märdian. This patch is inspired
and closely based on his patch:

       systemd#22038

Fixes: systemd#15316
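
A minimal sketch of the bypass idea from the commit message above: a lookup path checks an environment variable and returns early instead of making a blocking call. Only the variable name `SYSTEMD_NSS_DYNAMIC_BYPASS` comes from the commit message; `flag_value_is_true()`, `lookup_dynamic_user()`, the result codes, and the accepted value strings are hypothetical and are not meant to reflect how nss-systemd actually parses the variable.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical boolean parser: treats a few common spellings as "true". */
static bool flag_value_is_true(const char *v) {
        if (!v)
                return false;

        return strcmp(v, "1") == 0 ||
               strcmp(v, "yes") == 0 ||
               strcmp(v, "true") == 0 ||
               strcmp(v, "on") == 0;
}

/* Hypothetical outcome of a dynamic-user lookup attempt. */
enum lookup_result {
        LOOKUP_SKIPPED,      /* bypass requested: no IPC to PID 1 is attempted */
        LOOKUP_WOULD_BLOCK,  /* a synchronous call to PID 1 would be made here */
};

/* Stand-in for the NSS lookup path: when the bypass variable is set in the
 * process environment (e.g. for dbus-daemon), skip the blocking call. */
static enum lookup_result lookup_dynamic_user(const char *name) {
        (void) name; /* the sketch does not perform a real lookup */

        if (flag_value_is_true(getenv("SYSTEMD_NSS_DYNAMIC_BYPASS")))
                return LOOKUP_SKIPPED;

        return LOOKUP_WOULD_BLOCK;
}
```

The point of the design is that the decision is made per process through its environment, so only the process that would otherwise complete the circular wait gives up the PID 1 backed entries.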
@poettering
Member

I prepped #22552 now, which closely matches this PR, but is a bit more comprehensive, and starts with converting synchronous calls to the dbus broker into asynchronous code. But there's a lot left to do in that area.

Anyway, let's continue this in #22552. I credited @slyon on the patch (and for tracking down the deadlock) given my patch is just a more comprehensive version of his work.

I hope it's OK if I close this one here.

@poettering poettering closed this Feb 17, 2022
@poettering
Member

> So you feel like we could merge this?

Well, yes, more or less. But the fix should be more comprehensive; to make things easy, see → #22552

@poettering
Member

> > (btw, my educated guess is that this is an issue with dbus-daemon only, and not with dbus-broker, since the latter resolves names very differently and at very specific times only. That's why this was never noticed on Fedora)
>
> How does dbus-broker avoid this deadlock?

IIRC dbus-daemon resolves lazily and dbus-broker resolves ahead of time. The lazy resolution means that it might happen when triggered by a blocking PID 1 operation, but in dbus-broker that's not gonna happen then.

> > But doing cascades of async stuff from async handlers is kinda nasty to write in C and not lose track of what happens.
>
> Would using Rust and async functions help? What about some sort of coroutine abstraction?

Well, the developers of systemd are generally positive on Rust I think, me included. The big issue is build systems: when starting to go Rust with systemd we need some way to build a hybrid codebase effectively, and that for a long time. So far no build system even remotely makes that easy. The projects that mix C and Rust usually have a very clear separation about leaf/stem components and only one side is Rust. In systemd with its large body of interdependent components we need something that can deal with more complex builds, i.e. we want to port some library functions over, but others not, and some services over but others not, and that means you have dependencies from C to rust to C to rust and so on. So far the meson and rust/cargo people were not the biggest of friends, and given there aren't really other projects of this scale who started doing such a transition with a codebase arranged that way, we'd be pioneering this, and we really don't want to be pioneers with this, but just use rust as a tool that already works.

or to say this differently: we build a lot of separate libraries, internally and externally, that consume each other, and a lot of separate ELF binaries that consume them all. This is quite different from the usual area where Rust is used, where you basically just build one big binary and that's it...

in other words, we are waiting for other people to solve the toolchain problems properly before we bother, because we want to spend our time on solving other problems instead.

@DemiMarie

> in other words, we are waiting for other people to solve the toolchain problems properly before we bother, because we want to spend our time on solving other problems instead.

If I recall correctly, Meson has decent Rust support already, but only if you don’t have crates.io dependencies. That would be a blocker for lots of projects, but my understanding is that systemd would not need any such dependencies.

@poettering
Member

if we go rust, we should go rust properly. i.e. we should be able to use crates where it makes sense.

poettering added a commit to poettering/systemd that referenced this pull request Feb 18, 2022
Also-see: systemd#22038 (comment)
poettering added a commit to poettering/systemd that referenced this pull request Feb 18, 2022
Fixes: systemd#15316
Co-authored-by: Lukas Märdian <slyon@ubuntu.com>
@evverx
Member

evverx commented Feb 19, 2022

FWIW apart from build systems I'd add that rust testing infrastructure is far from what I would call ideal (especially when it comes to mixing two languages). It was discussed in #19598 (comment) and as far as I can tell none of those issues have been addressed.

@DemiMarie another issue is https://dl.acm.org/doi/10.1145/3418898 (the paper was published about a year ago so I'm not sure how relevant it is today. If that was addressed somehow I'd appreciate it if you could point me in the right direction) Thanks!

bluca pushed a commit to systemd/systemd-stable that referenced this pull request Mar 5, 2022
Also-see: systemd/systemd#22038 (comment)
(cherry picked from commit e39eb04)
bluca pushed a commit to systemd/systemd-stable that referenced this pull request Mar 5, 2022
Fixes: #15316
Co-authored-by: Lukas Märdian <slyon@ubuntu.com>
(cherry picked from commit de90700)
bluca pushed a commit to bluca/systemd-stable that referenced this pull request Mar 9, 2022
Also-see: systemd/systemd#22038 (comment)
(cherry picked from commit e39eb04)
(cherry picked from commit cf39014)
bluca pushed a commit to bluca/systemd-stable that referenced this pull request Mar 9, 2022
Fixes: #15316
Co-authored-by: Lukas Märdian <slyon@ubuntu.com>
(cherry picked from commit de90700)
(cherry picked from commit 367041a)
bluca pushed a commit to bluca/systemd-stable that referenced this pull request Mar 9, 2022
Also-see: systemd/systemd#22038 (comment)
(cherry picked from commit e39eb04)
(cherry picked from commit cf39014)
bluca pushed a commit to bluca/systemd-stable that referenced this pull request Mar 9, 2022
Fixes: #15316
Co-authored-by: Lukas Märdian <slyon@ubuntu.com>
(cherry picked from commit de90700)
(cherry picked from commit 367041a)
bluca pushed a commit to bluca/systemd-stable that referenced this pull request Mar 10, 2022
Also-see: systemd/systemd#22038 (comment)
(cherry picked from commit e39eb04)
(cherry picked from commit cf39014)
bluca pushed a commit to bluca/systemd-stable that referenced this pull request Mar 10, 2022
Fixes: #15316
Co-authored-by: Lukas Märdian <slyon@ubuntu.com>
(cherry picked from commit de90700)
(cherry picked from commit 367041a)
bluca pushed a commit to systemd/systemd-stable that referenced this pull request Mar 10, 2022
Also-see: systemd/systemd#22038 (comment)
(cherry picked from commit e39eb04)
(cherry picked from commit cf39014)
bluca pushed a commit to systemd/systemd-stable that referenced this pull request Mar 10, 2022
Fixes: #15316
Co-authored-by: Lukas Märdian <slyon@ubuntu.com>
(cherry picked from commit de90700)
(cherry picked from commit 367041a)
bluca pushed a commit to bluca/systemd-stable that referenced this pull request Nov 4, 2022
Also-see: systemd/systemd#22038 (comment)
(cherry picked from commit e39eb04)
(cherry picked from commit cf39014)
(cherry picked from commit 1daa382)
bluca pushed a commit to bluca/systemd-stable that referenced this pull request Nov 4, 2022
Fixes: #15316
Co-authored-by: Lukas Märdian <slyon@ubuntu.com>
(cherry picked from commit de90700)
(cherry picked from commit 367041a)
(cherry picked from commit 0863a55)
fbuihuu pushed a commit to openSUSE/systemd that referenced this pull request Nov 30, 2022
Also-see: systemd/systemd#22038 (comment)
(cherry picked from commit e39eb04)
fbuihuu pushed a commit to openSUSE/systemd that referenced this pull request Nov 30, 2022
Fixes: #15316
Co-authored-by: Lukas Märdian <slyon@ubuntu.com>
(cherry picked from commit de90700)

[fbui: adjust context]
[fbui: fixes bsc#1203857]
Labels
needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer nss
Development

Successfully merging this pull request may close these issues.

"Unexpected error response from GetNameOwner(): Connection terminated" messages + boot takes a lot of time
7 participants