journald is unable to attribute messages incoming from processes that exited to their cgroup, due to /proc vs SCM_CREDS race #2913
See #1456 (comment)
See also https://bugs.freedesktop.org/show_bug.cgi?id=50184; #2280 is trying to improve the situation.
@evverx good to know, thanks. Do you know why this doesn't seem to apply to system-level jobs, or was that maybe just a coincidence? Also a super naive question: if systemd created the stdout pipe that bash is writing to, and systemd had all the unit metadata at that time, why does it need to ask bash for that same metadata later when it reads from the pipe?
I think that was a coincidence.
Yeah, the race applies to all kinds of services.
journalctl --user-unit=... is missing log lines

Do you know of a workaround? Like something that could be written in the unit file?
@alban, try setting `SyslogIdentifier=` in the unit and run:

-bash-4.3# systemctl cat test-with-syslog-identifier --no-pager
# /run/systemd/system/test-with-syslog-identifier.service
[Service]
ExecStart=/bin/date
User=1000
SyslogIdentifier=heya-short-lived-date
-bash-4.3# systemctl start test-with-syslog-identifier
-bash-4.3# journalctl --sync
-bash-4.3# journalctl _SYSTEMD_UNIT=test-with-syslog-identifier.service + UNIT=test-with-syslog-identifier.service + SYSLOG_IDENTIFIER=heya-short-lived-date --no-hostname -q | grep -i utc
May 17 04:51:13 heya-short-lived-date[174]: Tue May 17 04:51:13 UTC 2016
-bash-4.3# journalctl -u test-with-syslog-identifier --no-hostname -q | grep -i utc

Anyway, you lose some metadata:
There's a race [1] in systemd that causes the systemd unit name not to be written to the journal for short-lived non-root services. To provide a way to identify an app within a pod in the logs, we set `SyslogIdentifier` to the app name as a workaround. This causes processes forked by the app and pre/post start hooks to be identified as the app, but that was already happening before we removed appexec [2], just with the binary exec'd by the app instead of the app name. In theory we could remove that and we'd get the binary name of each process executed in the pod as the syslog identifier, but then this workaround would not *work*. Setting the app name as the identifier is already an improvement over the previous situation.

[1]: systemd/systemd#2913
[2]: rkt#2493
I have an idea for fixing this more properly, but it involves introducing a new Linux-specific option. With this option enabled, rearrange journald to acquire the process metadata up front. This way ephemeral processes and logs sent immediately before process exit cannot race with the metadata sampling.

Something could also be introduced for fd passes when this option is enabled as well: synchronously wait for in-flight data to be drained prior to performing the fd pass. This would allow journald the opportunity to sample /proc for any log data before the fd is moved to the destination process. There's no requirement that this be on by default.

Just putting this out there, food for thought. I was going to start hacking on this, but #2473 gave me cause to retreat, considering this takes us further down a Linux-specific path.
Also tracked on https://bugs.freedesktop.org/show_bug.cgi?id=50184
Several options imply --syslog, without there being a way to disable it again. This commit adds that option. This is useful, for instance, when running `git daemon` as a systemd service with --inetd. systemd connects stderr to the journal by default, so logging to stderr is useful. On the other hand, log messages sent via syslog also reach the journal eventually, but run the risk of being processed at a time when the `git daemon` process has already exited (especially if the process was very short-lived, e.g. due to client error), so that the journal can no longer read its cgroup and attach the message to the correct systemd unit. See systemd/systemd#2913 [1].

[1]: systemd/systemd#2913

Signed-off-by: Lucas Werkmeister <mail@lucaswerkmeister.de>
Just leaving a comment here: I saw a unit also missing _SYSTEMD_INVOCATION_ID=, which makes perfect sense. It seems journald reads it from what the symlink under /run/systemd/units points to, but that can go away before journald can readlink it, leaving a dangling symlink. We were using _SYSTEMD_INVOCATION_ID= to view logs from only the last invocation of the unit, but the unreliability makes that a lot more difficult (so now we fall back to filtering by _PID in combination with it). This seems easy to fix, because systemd could create the link once and only update it the next time the unit starts and gets a new invocation ID, thus preventing the race.
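For reference, the invocation-ID filtering described above can be expressed like this (the unit name is a placeholder); as noted, entries that lost their metadata to this race won't match and still have to be found via _PID:

```sh
# Show logs from the most recent invocation of a unit. Entries whose
# _SYSTEMD_INVOCATION_ID was dropped because of this race will not match.
journalctl _SYSTEMD_INVOCATION_ID="$(systemctl show --value -p InvocationID example.service)"
```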
I’m a bit confused about when this race condition happens and when it doesn’t happen. I’ll try to summarize my current understanding and hope it’s not too wrong, and then perhaps someone can fill in the missing bits :)

In theory, I don’t see why log messages via stdout/stderr should have to be racy at all. systemd connects them to the journal during setup and can also inform the journal of the unit and other meta-info at the same time. (Note that the “stream logging” section of systemd-journald(8) explicitly states that “metadata for records transferred via such standard output/error streams reflect the metadata of the peer the stream was originally created for. If the stream connection is passed on to other processes … the log records … will continue to describe the original process”, supporting this assumption.) On the other hand,

In practice, this communication of the unit name was added in 62bca2c, first shipped in systemd v186 – but only for system units (later slightly amended in c867611). So I think the original bug report here is actually accurate in saying that, as of systemd v229, for stdout/stderr (as in

Now, enter 22e3a02 / #6392. With this change (see the long description in
I was also diving into the code used for LogExtraFields=, and exploring whether it was possible to use the same mechanism to attach the unit name to messages (because we already use it to do so, and have found it to be race free, with LogExtraFields=PROGRAM_UNIT=%n).
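For illustration, a minimal sketch of that workaround as a drop-in; PROGRAM_UNIT is the commenter's own field name rather than anything systemd defines, and this assumes specifier expansion applies to LogExtraFields= as the comment reports:

```ini
# /etc/systemd/system/example.service.d/10-unit-field.conf (hypothetical path)
[Service]
# Attach the unit name to every journal record from this unit as an indexed
# field, so records stay matchable even when _SYSTEMD_UNIT attribution races.
LogExtraFields=PROGRAM_UNIT=%n
```

Records can then be matched with `journalctl PROGRAM_UNIT=example.service` regardless of whether `_SYSTEMD_UNIT` made it onto the entry.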
Well, we need some clean-up scheme there. Units come and go all the time, and may run a single time only, and we cannot leave data in /run around unbounded. Just think of "systemd-run", for example, which creates a unit that runs exactly once and never again.
This new option can be used to override the implicit --syslog of --inetd, or to disable all logging. (While --detach also implies --syslog, --log-destination=stderr with --detach is useless, since --detach disassociates the process from the original stderr.) --syslog is retained as an alias for --log-destination=syslog. --log-destination always overrides implicit --syslog regardless of option order. This is different from the “last one wins” logic that applies to some implicit options elsewhere in Git, but should hopefully be less confusing. (I also don’t know if *all* implicit options in Git follow “last one wins”.)

The combination of --inetd with --log-destination=stderr is useful, for instance, when running `git daemon` as an instanced systemd service (with an associated socket unit). In this case, log messages sent via syslog are received by the journal daemon, but run the risk of being processed at a time when the `git daemon` process has already exited (especially if the process was very short-lived, e.g. due to client error), so that the journal daemon can no longer read its cgroup and attach the message to the correct systemd unit (see systemd/systemd#2913 [1]). Logging to stderr instead can solve this problem, because systemd can connect stderr directly to the journal daemon, which then already knows which unit is associated with this stream.

[1]: systemd/systemd#2913

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Lucas Werkmeister <mail@lucaswerkmeister.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
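For context, a sketch of the kind of instanced service the commit message describes; the unit name, paths, and User= are illustrative, not taken from the thread:

```ini
# git-daemon@.service (illustrative), paired with a git-daemon.socket that sets Accept=yes
[Service]
# --inetd: serve the connection handed over on stdin/stdout by the socket unit.
# --log-destination=stderr: stderr goes straight to journald, which already
# knows which unit owns this stream, so attribution cannot race.
ExecStart=/usr/bin/git daemon --inetd --log-destination=stderr --export-all --base-path=/srv/git
StandardInput=socket
StandardOutput=inherit
StandardError=journal
User=git
```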
Why keep this option when it's unreliable?
At least in some cases this can be solved by having processes send their own pidfd to systemd-journald.
@DemiMarie journald implements the BSD syslog protocol. There are about a million implementations of the client side for that around, including in glibc. None of those send a pidfd of themselves along.

(There's progress on that front, by the way: some people are prepping a kernel patch adding SCM_PIDFD, which works alongside SCM_CREDENTIALS and allows receivers to get the sending PID as a pidfd. But this is not sufficient to close the issue fully, since it won't pin the /proc/ entry, i.e. while we might safely reference the sending process that way and will never mistake it for something else, if that process died we still cannot read info from it.)
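To make the pidfd idea concrete, here is a rough C sketch of how a cooperating client could pass its own pidfd along with a message over an AF_UNIX socket via SCM_RIGHTS. This is not part of the syslog or journald protocols (as noted above, existing syslog clients do nothing of the sort); it only illustrates the mechanics, and the function name is mine:

```c
/* Sketch only: hand our own pidfd to a receiver alongside a message. */
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

int send_message_with_own_pidfd(int sock, const char *msg) {
    /* pidfd_open() is Linux >= 5.3; go through syscall() to avoid depending
     * on a particular glibc version. */
    int pidfd = (int) syscall(SYS_pidfd_open, getpid(), 0);
    if (pidfd < 0)
        return -1;

    struct iovec iov = { .iov_base = (void *) msg, .iov_len = strlen(msg) };

    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } control = { 0 };

    struct msghdr mh = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = control.buf,
        .msg_controllen = sizeof(control.buf),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&mh);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;            /* pass the pidfd as a plain fd */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &pidfd, sizeof(int));

    ssize_t n = sendmsg(sock, &mh, 0);
    close(pidfd);                            /* the receiver gets its own copy */
    return n < 0 ? -1 : 0;
}
```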
What about having
Seriously? So you can't just use the PIDFD as a dirfd with the *at() family of calls? Surprising? No. Disappointing? Yes. I'm pretty sure pidfd is primarily about sending signals to processes by PID race-free, but we have more pressing needs than just that.
There was a patch from 2019 [1] that adds an ioctl command for this.

[1] https://lore.kernel.org/lkml/20190329155425.26059-3-christian@brauner.io/
Avoid hitting systemd#2913 by adding some more sleeps. This is required to make the test pass when executed with mkosi on my machine.
As a workaround, for services whose code you can change and that use the native journal protocol, you can synchronize with journald's message processing by sending a datagram containing the write end of a pipe, closing your copy of the write end, and blocking on reading from the read end, which will return EOF when journald closes its copy of the write end. Unix datagrams are not reordered, and once sd-event calls into journald's processing the datagrams are handled in order.

I am not affiliated with systemd, this is not API, and this comment is based on my empirical experiments ("works for me") and code reading; there's even less warranty than the usual no warranty.

If you send just the FD, journald will log an error (file not regular), but if you also send data, it logs a warning (too many file descriptors). If it were possible to send an FD that stats as a regular file with size 0 and detect when journald had closed it, that would avoid the warning/error, but I don't believe that's possible. If you send data with the FD, you get the warning rather than the error.
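For concreteness, a C sketch of the trick just described. Nothing in it is a stable interface: /run/systemd/journal/socket is the native journal socket, the warn-instead-of-error behaviour is only as empirically reported above, and the SYNC_MARKER payload field is an arbitrary placeholder of mine:

```c
/* Sketch: block until journald has processed everything sent before this
 * datagram, by having journald close our pipe's write end when it is done. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <unistd.h>

int wait_until_journald_caught_up(void) {
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return -1;

    int sock = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
    if (sock < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    struct sockaddr_un sa = { .sun_family = AF_UNIX };
    strncpy(sa.sun_path, "/run/systemd/journal/socket", sizeof(sa.sun_path) - 1);

    /* Send some payload along with the fd so journald treats it as a (bogus)
     * native-protocol entry and only warns, per the comment above. */
    char payload[] = "SYNC_MARKER=1\n";
    struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) - 1 };

    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } control = { 0 };

    struct msghdr mh = {
        .msg_name = &sa,
        .msg_namelen = sizeof(sa),
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = control.buf,
        .msg_controllen = sizeof(control.buf),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&mh);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &pipefd[1], sizeof(int));  /* hand over the write end */

    ssize_t n = sendmsg(sock, &mh, 0);
    close(sock);
    close(pipefd[1]);                 /* now only journald holds a write end */
    if (n < 0) {
        close(pipefd[0]);
        return -1;
    }

    /* Block until journald has consumed the datagram and closed its copy of
     * the write end; read() then returns 0 (EOF). No data is ever expected. */
    char c;
    while (read(pipefd[0], &c, 1) > 0)
        ;
    close(pipefd[0]);
    return 0;
}
```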
[Arch Linux] It looks like `journalctl --user-unit=...` is dropping log lines that happen immediately before a user job exits. Note that I was only able to repro this with user jobs; system jobs don't seem to have this problem.

I create `~/.config/systemd/user/test.service` with the following:
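The unit's contents did not survive in this capture; a minimal reconstruction consistent with the `before sleep` / `after sleep` output described below might look like this (the exact ExecStart is my assumption, not the reporter's original):

```ini
# ~/.config/systemd/user/test.service (reconstructed for illustration)
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo before sleep; sleep 1; echo after sleep'
```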
In one terminal I run `journalctl --user-unit=test.service --follow`. In another terminal, I run (a bunch of times) `systemctl --user start test.service`.

EXPECTED: I should see both `before sleep` and `after sleep` in the output of `journalctl`.

ACTUAL: I only see `before sleep`.

Here's what shows up in the `-o verbose` logs:

Note that `_SYSTEMD_USER_UNIT` is not defined for the second message.