
journald: Add integration tests #1709

Merged
merged 1 commit from swsnr:journald-integration-tests into tokio-rs:master on Nov 9, 2021

Conversation

@swsnr (Contributor) commented Nov 4, 2021

Per discussion with @hawkw in #1698 I'm adding a few simple integration tests for the journald subscriber, to have some safety net when implementing the actual issue in #1698.

These tests send messages of various complexity to the journal, and then use journalctl's JSON output to get them back out, checking whether each message arrives in the systemd journal as intended.

Motivation

Increase test coverage for the journald subscriber and codify a known good state before approaching a fix for #1698.
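A minimal sketch of the read-back approach (the helper name and exact journalctl arguments here are illustrative assumptions, not necessarily the PR's literal test code; assumes a serde_json dev-dependency):

// Ask journalctl for JSON output restricted to this process, then
// parse one JSON object per line.
use std::process::Command;

fn read_from_journal() -> Vec<serde_json::Value> {
    let output = Command::new("journalctl")
        .args(["--user", "--output=json"])
        // journalctl accepts FIELD=VALUE matches; _PID restricts the
        // query to entries logged by this test process.
        .arg(format!("_PID={}", std::process::id()))
        .output()
        .expect("failed to run journalctl");
    String::from_utf8_lossy(&output.stdout)
        .lines()
        .filter_map(|line| serde_json::from_str(line).ok())
        .collect()
}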

@Ralith (Collaborator) left a comment

Should we also have a test with an internal null?

@hawkw (Member) left a comment

looks good to me, thanks!

Review thread on tracing-journald/tests/journal.rs (resolved)
@swsnr (Contributor, Author) commented Nov 5, 2021

I'll add one 🙂

@swsnr (Contributor, Author) commented Nov 5, 2021

@Ralith I've added another test for an internal null. I had to add a dev-dependency on serde and a custom type for field values, because journalctl treats data with internal nulls as binary and dumps it as a JSON number array, so the journalctl decoding needed a bit more deserialization logic. I hope that's okay 🙂
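The deserializer for such fields might look like this untagged serde enum (a sketch; the type and variant names are assumptions rather than the PR's exact code):

use serde::Deserialize;

// journalctl's JSON output serializes fields containing internal null
// bytes (or other binary data) as an array of byte numbers rather than
// a string, so each field may arrive in either form.
#[derive(Debug, PartialEq, Deserialize)]
#[serde(untagged)]
enum Field {
    Text(String),
    Binary(Vec<u8>),
}

With #[serde(untagged)], serde tries the String variant first and falls back to the byte array, so a test can compare the Binary variant against a readable byte-string literal such as b"hello\0world".to_vec().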

@swsnr requested review from hawkw and Ralith on November 5, 2021 07:17
@Ralith (Collaborator) commented Nov 5, 2021

A number array? Weird. But it looks like you're decoding it as a string? What does it do if you don't ask for JSON to begin with, and pass --all?

@swsnr (Contributor, Author) commented Nov 5, 2021

@Ralith Well, short of base64 strings, an array of numbers is the only way to represent binary data in JSON, I guess 🤷 Anyway, it's documented in man journalctl:

[screenshot: excerpt from the journalctl(1) man page describing how binary journal fields are serialized in JSON output]

See also https://www.freedesktop.org/wiki/Software/systemd/json/ which the [2] refers to.

I'm not decoding it as a String, as far as I can see: it ends up as a Vec<u8> in the Binary variant of the Field enum. In the test I compare it against a byte string; perhaps that's causing some confusion here? I'm using a byte string because it's easier to read in the test than a slice of raw numbers, which makes it easier to correlate the expected output with what's in the debug! macro arguments.

I wasn't aware of --all, but from the above I guess --all won't have any effect on the resulting JSON.

I could drop JSON, but I'm not sure what else I'd use. JSON is one of the two structured output formats in journalctl; the other is export, a binary dump in a custom systemd format. I could parse that with some nom code, but that would hardly make things easier to understand. The verbose output is unstructured multiline text, which would also require a bunch of weird parsing code to pick apart, and the other --output formats don't include extra fields 🤷


@hawkw I'm sorry for the formatting issue; I forgot to enable rustfmt formatting for the project. I'll push a fix shortly.

As for the test failures it looks as if the tests are flaky: They pass on nightly but fail on beta and stable 🤔 I can't reproduce these issues on my machine, and I've had the same tests in libsystemd-rs pass reliably on the Github Ubuntu containers, so I tend to think that tracing itself is to "blame"?

Is logging in tracing asynchronous perchance so we might run into timing issues?

@hawkw (Member) commented Nov 5, 2021

@hawkw I'm sorry for the formatting issue; I forgot to enable rustfmt formatting for the project. I'll push a fix shortly.

No worries, that's why we have a formatting check on CI. :)

As for the test failures it looks as if the tests are flaky: They pass on nightly but fail on beta and stable 🤔 I can't reproduce these issues on my machine, and I've had the same tests in libsystemd-rs pass reliably on the Github Ubuntu containers, so I tend to think that tracing itself is to "blame"?

Is logging in tracing asynchronous perchance so we might run into timing issues?

In these tests, logging should happen synchronously; we're using the synchronous std version of UDS sockets... I'm going to try rerunning the tests to see if they fail consistently on those versions, or if they're just flaky, so we can narrow down the issue here.

Review threads on tracing-journald/tests/journal.rs (outdated, resolved)
@hawkw (Member) commented Nov 5, 2021

I'm running the tests in a loop locally with a quick bash script, to try and see if I can reproduce the flaky failures on my machine:

#!/usr/bin/env bash

RUNS=0
# Report how many successful runs completed before the loop is
# interrupted. (SIGKILL can't be trapped, so trapping KILL is omitted.)
trap 'echo "killed after $RUNS runs!"; exit 0' QUIT TERM INT

while cargo test -p tracing-journald; do
    RUNS=$((RUNS + 1))
done

echo "failed after $RUNS runs!"

After running the tests 520 times (!!), I finally saw two of them fail:

Failure output
    Finished test [unoptimized + debuginfo] target(s) in 0.04s
     Running unittests (target/debug/deps/tracing_journald-1de82bbe4eda16df)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running tests/journal.rs (target/debug/deps/journal-4a6865a009a0bfc7)

running 4 tests
test multiline_message ... ok
test internal_null_byte ... ok
test multiline_message_trailing_newline ... FAILED
test simple_message ... FAILED

failures:

---- multiline_message_trailing_newline stdout ----
thread 'multiline_message_trailing_newline' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `1`', tracing-journald/tests/journal.rs:122:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:517:5
   1: core::panicking::panic_fmt
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:101:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:140:5
   4: journal::multiline_message_trailing_newline::{{closure}}
             at ./tests/journal.rs:122:9
   5: tracing_core::dispatch::with_default
             at /home/eliza/Code/tracing/tracing-core/src/dispatch.rs:263:5
   6: tracing::collect::with_default
             at /home/eliza/Code/tracing/tracing/src/collect.rs:24:5
   7: journal::with_journald
             at ./tests/journal.rs:21:13
   8: journal::multiline_message_trailing_newline
             at ./tests/journal.rs:115:5
   9: journal::multiline_message_trailing_newline::{{closure}}
             at ./tests/journal.rs:114:1
  10: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
  11: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

---- simple_message stdout ----
thread 'simple_message' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `1`', tracing-journald/tests/journal.rs:91:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:517:5
   1: core::panicking::panic_fmt
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:101:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:140:5
   4: journal::simple_message::{{closure}}
             at ./tests/journal.rs:91:9
   5: tracing_core::dispatch::with_default
             at /home/eliza/Code/tracing/tracing-core/src/dispatch.rs:263:5
   6: tracing::collect::with_default
             at /home/eliza/Code/tracing/tracing/src/collect.rs:24:5
   7: journal::with_journald
             at ./tests/journal.rs:21:13
   8: journal::simple_message
             at ./tests/journal.rs:87:5
   9: journal::simple_message::{{closure}}
             at ./tests/journal.rs:86:1
  10: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
  11: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    multiline_message_trailing_newline
    simple_message

test result: FAILED. 2 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.03s

error: test failed, to rerun pass '-p tracing-journald --test journal'
failed after 520 runs!

So, it does appear there's some kind of flakiness here. Interestingly, I saw two of the tests fail, rather than all of them like we saw on CI.

For the record, my journalctl version is

:; journalctl --version
systemd 249 (249.4)
+PAM +AUDIT -SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL -ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY -P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB -ZSTD -XKBCOMMON +UTMP -SYSVINIT default-hierarchy=unified

@swsnr (Contributor, Author) commented Nov 5, 2021

@hawkw Oh, that's a good idea, I'll try this myself. My guess is that we're too fast in querying the journal.

As far as I understand the Unix socket API, send only puts data into the send buffer and doesn't block until the remote side has actually received and digested the data.

So there's a brief window after the log call returns in which the data lingers in the send buffer before journald gets it. On my fast local multi-core system this window is too short to notice because journald picks up the data almost immediately, but on a comparatively slow system with few cores, perhaps only one, such as the CI containers, it probably takes longer, and sometimes we're just too fast 🤷

That could also explain why I've never seen this failure with libsystemd-rs: that crate opens a new socket for every log call and closes it at the end, which perhaps flushes the send buffer immediately, so journald mostly gets the data before the log call returns.

Does this sound reasonable?

@hawkw (Member) commented Nov 5, 2021

As far as I understand the Unix socket API, send only puts data into the send buffer and doesn't block until the remote side has actually received and digested the data.

So there's a brief window after the log call returns in which the data lingers in the send buffer before journald gets it. On my fast local multi-core system this window is too short to notice because journald picks up the data almost immediately, but on a comparatively slow system with few cores, perhaps only one, such as the CI containers, it probably takes longer, and sometimes we're just too fast 🤷

hmm, i wonder if we want to flush after every send call, then!

@swsnr (Contributor, Author) commented Nov 5, 2021

@hawkw I don't know whether you can flush, and the systemd docs actually recommend increasing the buffer to keep blocking to a minimum:

A client probably should bump up the SO_SNDBUF socket option of its AF_UNIX socket towards journald in order to delay blocking I/O as much as possible.

The journald subscriber doesn't actually do that, but this still seems to imply that there's some kind of buffering going on 🤔 But I'm only superficially familiar with domain sockets 😇

@Ralith (Collaborator) commented Nov 5, 2021

binary-in-JSON stuff

Ah, that seems reasonable then!

A client probably should bump up the SO_SNDBUF

Interesting point. Is there a limit to what an unprivileged process can raise it to? What does the reference client implementation do? For IP sockets generally the default is also the maximum, but maybe Unix sockets are different.

@swsnr (Contributor, Author) commented Nov 5, 2021

Interesting point. Is there a limit to what an unprivileged process can raise it to? What does the reference client implementation do? For IP sockets generally the default is also the maximum, but maybe Unix sockets are different.

I honestly don't know 🤷 I'm sorry, I just did some wild guessing about Unix domain sockets based on a superficial understanding of networking code, and read this paragraph in the docs 😄

I'll make a note to find out, but things are piling up a bit now 😬 so I'd like to finish what I came for first, namely #1698 (and also #1710 now), before digging deeper 😇 I hope that's okay for now?

@Ralith (Collaborator) commented Nov 5, 2021

I hope that's okay for now?

Of course! Just food for thought. I, for one, am not producing enough logs for this to be a serious risk regardless.

@swsnr (Contributor, Author) commented Nov 5, 2021

@Ralith I just checked the systemd C source because I can never remember how this ancillary data stuff in socket messages works, and libsystemd does this when opening the socket:

#define SNDBUF_SIZE (8*1024*1024)

static int journal_fd(void) {
        […]
        fd = socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0);
        if (fd < 0)
                return -errno;

        fd_inc_sndbuf(fd, SNDBUF_SIZE);

        […]
 
        return fd;
}

fd_inc_sndbuf comes down to fd_set_sndbuf(fd, SNDBUF_SIZE, true) with:

int fd_set_sndbuf(int fd, size_t n, bool increase) {
        int r, value;
        socklen_t l = sizeof(value);

        if (n > INT_MAX)
                return -ERANGE;

        r = getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &value, &l);
        if (r >= 0 && l == sizeof(value) && increase ? (size_t) value >= n*2 : (size_t) value == n*2)
                return 0;

        /* First, try to set the buffer size with SO_SNDBUF. */
        r = setsockopt_int(fd, SOL_SOCKET, SO_SNDBUF, n);
        if (r < 0)
                return r;

        /* SO_SNDBUF above may set to the kernel limit, instead of the requested size.
         * So, we need to check the actual buffer size here. */
        l = sizeof(value);
        r = getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &value, &l);
        if (r >= 0 && l == sizeof(value) && increase ? (size_t) value >= n*2 : (size_t) value == n*2)
                return 1;

        /* If we have the privileges we will ignore the kernel limit. */
        r = setsockopt_int(fd, SOL_SOCKET, SO_SNDBUFFORCE, n);
        if (r < 0)
                return r;

        return 1;
}

I don't really follow the details but it looks as if systemd tries hard to max out the send buffer for its socket.

Edit: It doesn't seem to make any difference for unprivileged processes; on my system the default is the same as the maximum permitted, about 212 kB, but privileged processes can apparently exceed this, and systemd attempts to go up to 8 MiB if I understand the code correctly.
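For reference, the equivalent attempt from Rust might look roughly like this (a sketch assuming the libc crate; the subscriber doesn't currently do this, and the function name is made up):

use std::os::unix::io::AsRawFd;
use std::os::unix::net::UnixDatagram;

// Sketch: raise SO_SNDBUF on the socket towards journald, as the
// systemd docs suggest. std's UnixDatagram doesn't expose this option,
// so this goes through libc directly; for unprivileged processes the
// kernel silently caps the value at net.core.wmem_max.
fn bump_sndbuf(socket: &UnixDatagram, bytes: libc::c_int) -> std::io::Result<()> {
    let rc = unsafe {
        libc::setsockopt(
            socket.as_raw_fd(),
            libc::SOL_SOCKET,
            libc::SO_SNDBUF,
            &bytes as *const libc::c_int as *const libc::c_void,
            std::mem::size_of::<libc::c_int>() as libc::socklen_t,
        )
    };
    if rc == 0 {
        Ok(())
    } else {
        Err(std::io::Error::last_os_error())
    }
}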

I can make a ticket for this, but I don't actually run any processes where this would make a difference, so it's not my priority, and I tend to leave this exercise to someone who's actually logging megabytes of things to journald 😇

@Ralith (Collaborator) commented Nov 5, 2021

Nice research! Sounds like the impact would indeed be pretty niche, and I see no need to pursue it in the absence of demand.

@swsnr (Contributor, Author) commented Nov 9, 2021

I've rebased on master; is there anything from my side which still blocks this pull request? 🙂

@Ralith (Collaborator) left a comment

LGTM, thanks!

@hawkw (Member) commented Nov 9, 2021

I've rebased on master; is there anything from my side which still blocks this pull request? 🙂

Are we reasonably confident that the flaky test failures won't occur often in the future? Of course, since these are integration tests that rely on communicating with an external process, we can't ever completely rule out spurious failures, but the main blocker on my side is knowing that they should be infrequent enough that we can generally trust the results of these tests.

If you feel confident that the spurious failures aren't going to be a problem in the future, I'm happy to merge this!

@Ralith (Collaborator) commented Nov 9, 2021

We spend up to an entire second on retries now; I don't think we can possibly be more reliable than that without suppressing failures entirely, which comes with a very high risk of silent bitrot, or by mocking journald, which somewhat defeats the point of the integration tests.
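(For illustration, that kind of retry amounts to something like this hypothetical sketch, reusing the read_from_journal helper sketched earlier; not the PR's literal code:)

use std::time::Duration;

// Poll the journal for up to ~1 second before giving up, since journald
// may not have processed the datagram by the time the log call returns.
fn retry_read_from_journal() -> Vec<serde_json::Value> {
    for _ in 0..10 {
        let entries = read_from_journal();
        if !entries.is_empty() {
            return entries;
        }
        std::thread::sleep(Duration::from_millis(100));
    }
    Vec::new()
}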

@swsnr (Contributor, Author) commented Nov 9, 2021

I tend to agree: I'd say that if they still turn out to be flaky, there's a more general issue with journal logging. I've used your script to run the tests a few hundred times and had no further issues.

That said, if they do break, please mark them as ignored in the meantime, and I promise to try and fix them 🙂

@hawkw (Member) commented Nov 9, 2021

Okay, in that case, I'm going to go ahead and merge this. Thanks for working on this!

@hawkw hawkw merged commit fdf8178 into tokio-rs:master Nov 9, 2021
@swsnr (Contributor, Author) commented Nov 9, 2021

@hawkw Oh cool 🙏 I'll bring up #1710 and #1698 next 🙂

@swsnr deleted the journald-integration-tests branch on November 9, 2021 22:16
hawkw pushed a commit that referenced this pull request Dec 2, 2021
@hawkw mentioned this pull request Dec 2, 2021
hawkw pushed a commit that referenced this pull request Dec 19, 2021
hawkw pushed a commit that referenced this pull request Dec 19, 2021
hawkw pushed a commit that referenced this pull request Dec 20, 2021