Trying to run as user instance, but $XDG_RUNTIME_DIR is not set #232

Closed
smcv opened this issue Jun 16, 2015 · 22 comments

@smcv
Contributor

smcv commented Jun 16, 2015

I'm working on a semi-embedded distribution which starts a PAM session for a designated user during boot, then instructs that session to start a target that includes Xorg and a GUI. This mostly works fine, but there seems to be a race condition in which the systemd --user run by user@.service doesn't always pick up its $XDG_RUNTIME_DIR correctly:

Jun 16 17:02:25 xxx systemd[1]: Starting User Manager for UID 1000...
Jun 16 17:02:25 xxx systemd[539]: Trying to run as user instance, but $XDG_RUNTIME_DIR is not set.

It seems to fail about 1/3 of the time on my virtual machine setup. I've found one possible reason for the race, which does look like a genuine bug, for which I'll send a pull request; but unfortunately, that commit doesn't actually fix it.

smcv added a commit to smcv/systemd that referenced this issue Jun 16, 2015
Previously, I think this was a race condition during a user's first login.
Some component calls CreateSession (most likely by a PAM service
other than 'systemd-user' running pam_systemd), with the following
results:

- logind:
  * create the user's XDG_RUNTIME_DIR
  * tell pid 1 to create user-UID.slice
  * tell pid 1 to start user@UID.service

Then these two processes race:

- logind:
  * save information including XDG_RUNTIME_DIR to /run/systemd/users/UID

- (subprocesses of) pid 1:
  * start a 'systemd-user' PAM session, which reads XDG_RUNTIME_DIR
    and puts it in the environment
  * run systemd --user, which requires XDG_RUNTIME_DIR in the
    environment

If logind wins the race, everything is fine, but if the subprocesses
of pid 1 win the race, systemd --user exits unsuccessfully.

This is an attempt to fix <systemd#232>,
but unfortunately does not actually do so. It seems to be a genuine bug
anyway, though.
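For context, the reader side of this race can be sketched as below. This is a minimal sketch, not pam_systemd's code. It assumes that /run/systemd/users/UID is a KEY=VALUE file carrying the directory under a RUNTIME key, and that a missing file is tolerated rather than treated as an error. Both assumptions are based on a reading of pam_systemd around v219 and are not stated in this thread. `read_state_key()` and `lookup_runtime_dir()` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: find KEY in a KEY=VALUE file. On success returns 0 and
 * stores a malloc'd copy of the value in *ret, or NULL if KEY is absent.
 * Returns a negative errno value if the file cannot be opened. */
static int read_state_key(const char *path, const char *key, char **ret) {
    size_t keylen = strlen(key);
    char *line = NULL;
    size_t cap = 0;
    int r = 0;
    FILE *f;

    *ret = NULL;
    f = fopen(path, "r");
    if (!f)
        return -errno;

    while (getline(&line, &cap, f) != -1) {
        line[strcspn(line, "\n")] = '\0';   /* drop trailing newline */
        if (strncmp(line, key, keylen) == 0 && line[keylen] == '=') {
            *ret = strdup(line + keylen + 1);
            if (!*ret)
                r = -ENOMEM;
            break;
        }
    }

    free(line);
    fclose(f);
    return r;
}

/* Assumed reader behaviour: if logind has not written the file yet (-ENOENT),
 * this is not an error. The caller simply gets no runtime directory, which is
 * how systemd --user can end up starting without $XDG_RUNTIME_DIR. */
static int lookup_runtime_dir(const char *state_path, char **ret) {
    int r = read_state_key(state_path, "RUNTIME", ret);
    if (r == -ENOENT)
        return 0;   /* *ret stays NULL */
    return r;
}
```

If logind loses the race, `lookup_runtime_dir()` returns success with a NULL result, so nothing puts XDG_RUNTIME_DIR into the environment.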
smcv added a commit to smcv/systemd that referenced this issue Jun 16, 2015
There seems to be some race condition in which logind does not
always communicate the XDG_RUNTIME_DIR to the process that is
about to exec systemd --user
(<systemd#232>).

systemd --user fails if there is no XDG_RUNTIME_DIR; try falling
back to /run/user/UID instead of giving up.
@poettering
Member

The first patch indeed should fix a bug. Can you send a PR for it please?

@poettering
Member

But the second patch looks like a hack indeed... This really shouldn't be necessary.

Have you tried adding an ExecStartPre=/bin/cat /run/systemd/users/1000 to your user@.service, so that that file is logged right before the user instance is started?

@smcv
Contributor Author

smcv commented Jun 17, 2015

> But the second patch looks like a hack indeed... This really shouldn't be necessary.

I know, I deliberately didn't submit a PR for it (but we needed a workaround right now, so we can start X reliably again, and it does work).

I'm continuing to look into the root cause; the ExecStartPre is a good idea, I'll try that.

@smcv
Contributor Author

smcv commented Jun 17, 2015

> The first patch indeed should fix a bug.

Should, but doesn't, because I call:

        user_save(u);

but user_save() has:

        if (!u->started)
                return 0;

and u->started hasn't been set TRUE yet.

I'll try to work out the least problematic way to fix this...

@smcv
Contributor Author

smcv commented Jun 17, 2015

We're doing a "transaction" in which we start the user's user-session. It looks as though we want to only write out /run/systemd/users/ after the transaction has been started successfully; but one of the things we do in the transaction is to StartUnit for user@.service, which can fail with ENOMEM or a D-Bus error, and we need the XDG_RUNTIME_DIR before we can rely on that to work.

We can either:

  • create the file earlier/more often (make the early-return not trigger in this particular case); or
  • have a separate way to communicate the XDG_RUNTIME_DIR to systemd --user

@smcv
Contributor Author

smcv commented Jun 17, 2015

One way to do it would be to turn my workaround into the real "API" used by pam_systemd: if /run/user/UID exists and is owned by UID, use it, and don't bother with /run/systemd/users/UID at all. This is unsuitable, though, if you are more likely to change the location of users' runtime directories than to change how /run/systemd/users/UID works.
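The check proposed here ("exists and is owned by UID") can be sketched as follows. This is a minimal, hypothetical sketch: `check_runtime_dir()` and `pick_runtime_dir()` are invented names, and the `base` parameter (which would be "/run/user" in practice) exists only so the logic can be exercised against a scratch directory.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Accept DIR as UID's runtime directory only if it exists, is a real
 * directory (lstat, so a symlink is rejected), and is owned by UID.
 * Returns 0 on success or a negative errno value. */
static int check_runtime_dir(const char *dir, uid_t uid) {
    struct stat st;

    if (lstat(dir, &st) < 0)
        return -errno;
    if (!S_ISDIR(st.st_mode))
        return -ENOTDIR;
    if (st.st_uid != uid)
        return -EPERM;
    return 0;
}

/* Prefer $XDG_RUNTIME_DIR if it is set; otherwise derive BASE/UID and accept
 * it only if it passes check_runtime_dir(). Writes the chosen path to BUF. */
static int pick_runtime_dir(const char *base, uid_t uid, char *buf, size_t len) {
    const char *env = getenv("XDG_RUNTIME_DIR");
    int n;

    if (env && *env)
        n = snprintf(buf, len, "%s", env);
    else
        n = snprintf(buf, len, "%s/%u", base, (unsigned) uid);
    if (n < 0 || (size_t) n >= len)
        return -ENAMETOOLONG;

    return (env && *env) ? 0 : check_runtime_dir(buf, uid);
}
```

The ownership check is what keeps the fallback from trusting a directory that merely happens to exist at the expected path.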

@smcv
Contributor Author

smcv commented Jun 17, 2015

The other way to do it would be to write out the state file in this one particular case even if u->started is not yet true, because it is about to be true. We shouldn't actually set u->started to true any earlier, because that affects how much we roll back when the user is finalized (we shouldn't emit UserRemoved if there was no UserNew).

@poettering
Copy link
Member

I think we should probably create a function user_get_runtime_dir(uid_t uid, char **ret) in src/basic/login-util.[ch], use it in both the PAM module and logind, and stop reading the /run file from the PAM module. That way both sides use the same logic to come up with the dir, but they no longer need to synchronize.
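A stateless version of the proposed helper might look like the sketch below. Only the name and signature come from the comment above. The body is an assumption (a plain formatting of the UID into /run/user/UID). UID_FMT is defined locally to mirror systemd's own macro rather than being included from systemd.

```c
#include <errno.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Assumed to match systemd's definition; presumes uid_t is 32 bits wide. */
#define UID_FMT "%" PRIu32

/* Sketch of the proposed helper: compute the runtime directory purely from
 * the UID, without reading any state file, so logind and the PAM module
 * always agree and never need to synchronize. On success returns 0 and
 * stores a malloc'd string in *ret; the caller frees it. */
static int user_get_runtime_dir(uid_t uid, char **ret) {
    int n = snprintf(NULL, 0, "/run/user/" UID_FMT, (uint32_t) uid);
    char *p;

    if (n < 0)
        return -EINVAL;
    p = malloc((size_t) n + 1);
    if (!p)
        return -ENOMEM;
    snprintf(p, (size_t) n + 1, "/run/user/" UID_FMT, (uint32_t) uid);

    *ret = p;
    return 0;
}
```

The trade-off, raised later in this thread, is that if the path scheme ever changed, an old PAM module and a new logind (or vice versa) could briefly disagree during an upgrade.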

@smcv
Contributor Author

smcv commented Jun 17, 2015

Do you mean a stateless call that doesn't look at the filesystem, and just returns the equivalent of printf ("/run/user/" UID_FMT, uid)? I'd assumed that if it was that simple, you'd already have done it...

I have a patch for the current way of doing it (as described in my most recent comment here) that needs some testing, but is hopefully ready. But I could go for the other approach if desired.

@smcv
Contributor Author

smcv commented Jun 17, 2015

One situation where the state-file business could matter is if the way we construct XDG_RUNTIME_DIR changes, and an upgrade results in briefly running a new PAM module against XDG_RUNTIME_DIRs created by an old logind, or an old PAM module (in a long-running daemon) against XDG_RUNTIME_DIRs created by a newly restarted logind. I'd assumed that this sort of concern was why we didn't just do (API calls that end up with) the simple printf.

@poettering
Member

> Do you mean a stateless call that doesn't look at the filesystem, and just returns the equivalent of printf ("/run/user/" UID_FMT, uid)?

Yes!

> I'd assumed that if it was that simple, you'd already have done it...

Well, we didn't have src/basic/login-util.c and hence no nice place to put a function shared between the module and logind ;-).

@poettering
Member

> One situation where the state-file business could matter is if the way we construct XDG_RUNTIME_DIR changes, and an upgrade results in briefly running a new PAM module against XDG_RUNTIME_DIRs created by an old logind, or an old PAM module (in a long-running daemon) against XDG_RUNTIME_DIRs created by a newly restarted logind. I'd assumed that this sort of concern was why we didn't just do (API calls that end up with) the simple printf.

Yupp, that was something I thought about, but back when I hacked this up I didn't realize the race you ran into. I think just accepting that upgrades are fucked is nicer in this case than adding more synchronization between the components...

smcv added a commit to smcv/systemd that referenced this issue Jun 17, 2015
Previously, this had a race condition during a user's first login.
Some component calls CreateSession (most likely by a PAM service
other than 'systemd-user' running pam_systemd), with the following
results:

- logind:
  * create the user's XDG_RUNTIME_DIR
  * tell pid 1 to create user-UID.slice
  * tell pid 1 to start user@UID.service

Then these two processes race:

- logind:
  * save information including XDG_RUNTIME_DIR to /run/systemd/users/UID

- the subprocess of pid 1 responsible for user@.service:
  * start a 'systemd-user' PAM session, which reads XDG_RUNTIME_DIR
    and puts it in the environment
  * run systemd --user, which requires XDG_RUNTIME_DIR in the
    environment

If logind wins the race, which usually happens, everything is fine;
but if the subprocesses of pid 1 win the race, which can happen
under load, then systemd --user exits unsuccessfully.

To avoid this race, we have to write out /run/systemd/users/UID
even though the service has not "officially" started yet;
previously this did an early-return without saving anything.
Record its state as OPENING in this case.

Bug: systemd#232
Reviewed-by: Philip Withnall <philip.withnall@collabora.co.uk>
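The essence of that change can be illustrated with a heavily simplified, hypothetical model; this is not logind's actual code. `User`, `user_save_before()` and `user_save_after()` are invented here. The functions write to a `FILE *` instead of /run/systemd/users/UID so the logic can be exercised directly. The "online" state is a placeholder for whatever the real code would compute; only OPENING comes from the commit message.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, heavily simplified stand-in for logind's User object. */
typedef struct User {
    const char *runtime_path;   /* e.g. "/run/user/1000" */
    bool started;               /* true once the start transaction succeeded */
} User;

/* Before: early return, so nothing is written until u->started is set. A
 * PAM session for user@UID.service that runs first finds no state file. */
static int user_save_before(const User *u, FILE *f) {
    if (!u->started)
        return 0;
    fprintf(f, "STATE=online\nRUNTIME=%s\n", u->runtime_path);
    return 0;
}

/* After: always write the file; while the user is not yet started, record
 * the state as "opening" so RUNTIME is already available to the reader. */
static int user_save_after(const User *u, FILE *f) {
    fprintf(f, "STATE=%s\nRUNTIME=%s\n",
            u->started ? "online" : "opening", u->runtime_path);
    return 0;
}
```

Note that the change only affects what gets written. As discussed above, u->started itself is deliberately not set any earlier, so the UserNew/UserRemoved bookkeeping is unchanged.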
@smcv
Contributor Author

smcv commented Jun 17, 2015

> I have a patch for the current way of doing it (as described in my most recent comment here) that needs some testing, but is hopefully ready

That patch is smcv@7116130, and seems to work (a test that failed 2/6 times on a 219 derivative without this patch has worked 10/10 times with it).

@smcv
Contributor Author

smcv commented Jun 17, 2015

> Well, we didn't have src/basic/login-util.c and hence no nice place

I'll do a patch for master that adds the function you suggested, and a 219 backport (since that's what I actually need, and it doesn't have src/basic) that just does the equivalent asprintf like smcv@15b4044 did. It is going to take a little while to get that through continuous integration machinery though.

@poettering
Copy link
Member

I actually like your patch smcv@7116130 better than my idea. If you post a PR for it, I'll merge it.

@smcv
Contributor Author

smcv commented Jun 17, 2015

#265 is the requested PR. Thank you for not forcing me to do another two rounds of testing :-)

@poettering
Member

Closing, as #265 has been merged. Thanks!

keszybz pushed a commit to systemd/systemd-stable that referenced this issue Jul 29, 2015
(cherry picked from commit 7116130)
msekletar pushed a commit to msekletar/systemd-fedora that referenced this issue Dec 4, 2015
(cherry picked from commit 7116130)
fbuihuu pushed a commit to openSUSE/systemd that referenced this issue Sep 22, 2016
(cherry picked from commit 7116130)

[fbui: fixes bsc#996269]
Yamakuzure pushed a commit to elogind/elogind that referenced this issue Dec 30, 2016
Yamakuzure pushed a commit to elogind/elogind that referenced this issue Mar 14, 2017
@gclawes
Contributor

gclawes commented Oct 29, 2018

I'm seeing this issue on systemd 239 (sys-apps/systemd-239-r2 on gentoo)

It looks like /run/user/UID is being created, but /run/systemd/users/UID is not

enterprise ~ # ls -lah /run/user
total 0
drwxr-xr-x  2 root root  40 Oct 29 17:49 .
drwxr-xr-x 24 root root 740 Oct 29 17:45 ..
enterprise ~ # ls -lah /run/systemd/user
ls: cannot access '/run/systemd/user': No such file or directory
Oct 29 17:49:41 enterprise.lan systemd[1]: Created slice User Slice of UID 1000.
-- Subject: Unit user-1000.slice has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit user-1000.slice has finished starting up.
--
-- The start-up result is RESULT.
Oct 29 17:49:41 enterprise.lan systemd[1]: Created slice system-user\x2druntime\x2ddir.slice.
-- Subject: Unit system-user\x2druntime\x2ddir.slice has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit system-user\x2druntime\x2ddir.slice has finished starting up.
--
-- The start-up result is RESULT.
Oct 29 17:49:41 enterprise.lan systemd[1]: Started /run/user/1000 mount wrapper.
-- Subject: Unit user-runtime-dir@1000.service has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit user-runtime-dir@1000.service has finished starting up.
--
-- The start-up result is RESULT.
Oct 29 17:49:41 enterprise.lan systemd[1]: Starting User Manager for UID 1000...
-- Subject: Unit user@1000.service has begun start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit user@1000.service has begun starting up.
Oct 29 17:49:41 enterprise.lan systemd[17543]: pam_unix(systemd-user:session): session opened for user gclawes by (uid=0)
Oct 29 17:49:41 enterprise.lan systemd[17543]: Trying to run as user instance, but $XDG_RUNTIME_DIR is not set.
Oct 29 17:49:41 enterprise.lan systemd[1]: user@1000.service: Failed with result 'protocol'.
Oct 29 17:49:41 enterprise.lan systemd[1]: Failed to start User Manager for UID 1000.
-- Subject: Unit user@1000.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit user@1000.service has failed.
--
-- The result is RESULT.
Oct 29 17:49:41 enterprise.lan systemd[1]: user-runtime-dir@1000.service: Unit not needed anymore. Stopping.
Oct 29 17:49:41 enterprise.lan systemd[1]: Stopping /run/user/1000 mount wrapper...
-- Subject: Unit user-runtime-dir@1000.service has begun shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit user-runtime-dir@1000.service has begun shutting down.
Oct 29 17:49:41 enterprise.lan systemd[1]: Stopped /run/user/1000 mount wrapper.
-- Subject: Unit user-runtime-dir@1000.service has finished shutting down
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit user-runtime-dir@1000.service has finished shutting down.

@gclawes
Contributor

gclawes commented Oct 29, 2018

With ExecStartPre=/bin/cat /run/systemd/users/%i added to user@.service:

-- Unit user@1000.service has begun starting up.
Oct 29 18:16:48 enterprise.lan systemd[1]: Started /run/user/1000 mount wrapper.
Oct 29 18:16:48 enterprise.lan systemd[12809]: pam_unix(systemd-user:session): session opened for user gclawes by (uid=0)
Oct 29 18:16:48 enterprise.lan cat[12809]: /bin/cat: /run/systemd/users/1000: No such file or directory
Oct 29 18:16:48 enterprise.lan systemd[12810]: pam_unix(systemd-user:session): session closed for user gclawes
Oct 29 18:16:48 enterprise.lan systemd[1]: user@1000.service: Control process exited, code=exited status=1
Oct 29 18:16:48 enterprise.lan systemd[1]: user@1000.service: Failed with result 'exit-code'.
Oct 29 18:16:48 enterprise.lan systemd[1]: Failed to start User Manager for UID 1000.
-- Subject: Unit user@1000.service has failed

@gclawes
Contributor

gclawes commented Oct 29, 2018

@poettering should a new issue be opened for this?

@smcv
Contributor Author

smcv commented Oct 30, 2018

> should a new issue be opened for this?

Yes please. If in doubt, open a new issue, and in particular if you have the same symptom in a version a lot later than the one where it's meant to have been fixed, always open a new issue.

If you open too many issues, it's easy for developers to close the duplicates, but if you reuse an existing issue number when the same symptom is seen for a different reason, it can become very hard to disentangle.

@gclawes
Contributor

gclawes commented Oct 30, 2018

Opened: #10574
