
journald: Add integration tests #1709

Merged
merged 1 commit from swsnr:journald-integration-tests into tokio-rs:master on Nov 9, 2021

Conversation

@swsnr (Contributor) commented Nov 4, 2021

Per discussion with @hawkw in #1698 I'm adding a few simple integration tests for the journald subscriber, to have some safety net when implementing the actual issue in #1698.

These tests send messages of various complexity to the journal, and then use journalctl's JSON output to get them back out, checking whether each message arrives in the systemd journal as intended.

Motivation

Increase test coverage for the journald subscriber and codify a known good state before approaching a fix for #1698.
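A minimal sketch of the read-back approach (the helper name and exact journalctl arguments here are illustrative assumptions, not necessarily the PR's literal test code; assumes a serde_json dev-dependency):

// Ask journalctl for JSON output restricted to this process, then
// parse one JSON object per line.
use std::process::Command;

fn read_from_journal() -> Vec<serde_json::Value> {
    let output = Command::new("journalctl")
        .args(["--user", "--output=json"])
        // journalctl accepts FIELD=VALUE matches; _PID restricts the
        // query to entries logged by this test process.
        .arg(format!("_PID={}", std::process::id()))
        .output()
        .expect("failed to run journalctl");
    String::from_utf8_lossy(&output.stdout)
        .lines()
        .filter_map(|line| serde_json::from_str(line).ok())
        .collect()
}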

@Ralith (Collaborator) left a comment

Should we also have a test with an internal null?

@hawkw (Member) left a comment

looks good to me, thanks!

Review thread on tracing-journald/tests/journal.rs (resolved)
@swsnr (Contributor, Author) commented Nov 5, 2021

I'll add one 🙂

@swsnr (Contributor, Author) commented Nov 5, 2021

@Ralith I've added another test for an internal null. I had to add a dev-dependency on serde and a custom type for field values, because journalctl treats data with internal nulls as binary and dumps it as a JSON number array, so the journalctl decoding needed a bit more deserialization logic. I hope that's okay 🙂
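The deserializer for such fields might look like this untagged serde enum (a sketch; the type and variant names are assumptions rather than the PR's exact code):

use serde::Deserialize;

// journalctl's JSON output serializes fields containing internal null
// bytes (or other binary data) as an array of byte numbers rather than
// a string, so each field may arrive in either form.
#[derive(Debug, PartialEq, Deserialize)]
#[serde(untagged)]
enum Field {
    Text(String),
    Binary(Vec<u8>),
}

With #[serde(untagged)], serde tries the String variant first and falls back to the byte array, so a test can compare the Binary variant against a readable byte-string literal such as b"hello\0world".to_vec().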

@swsnr requested review from hawkw and Ralith on November 5, 2021 07:17
@Ralith (Collaborator) commented Nov 5, 2021

A number array? Weird. But it looks like you're decoding it as a string? What does it do if you don't ask for JSON to begin with, and pass --all?

@swsnr (Contributor, Author) commented Nov 5, 2021

@Ralith Well, short of base64 strings, an array of numbers is the only way to represent binary data in JSON, I guess 🤷 Anyway, it's documented in man journalctl:

[screenshot: excerpt from the journalctl(1) man page describing how binary journal fields are serialized in JSON output]

See also https://www.freedesktop.org/wiki/Software/systemd/json/ which the [2] refers to.

I'm not decoding it as a String, as far as I can see: it ends up as a Vec<u8> in the Binary variant of the Field enum. In the test I compare it against a byte string; perhaps that's causing some confusion here? I'm using a byte string because it's easier to read in the test than a slice of raw numbers, which makes it easier to correlate the expected output with what's in the debug! macro arguments.

I wasn't aware of --all, but from the above I guess --all won't have any effect on the resulting JSON.

I could drop JSON, but I'm not sure what else I'd use. JSON is one of the two structured output formats in journalctl; the other is export, a binary dump in a custom systemd format. I could parse that with some nom code, but that would hardly make things easier to understand. The verbose output is unstructured multiline text, which would also require a bunch of weird parsing code to pick apart, and the other --output formats don't include extra fields 🤷


@hawkw I'm sorry for the formatting issue; I forgot to enable rustfmt formatting for the project. I'll push a fix shortly.

As for the test failures it looks as if the tests are flaky: They pass on nightly but fail on beta and stable 🤔 I can't reproduce these issues on my machine, and I've had the same tests in libsystemd-rs pass reliably on the Github Ubuntu containers, so I tend to think that tracing itself is to "blame"?

Is logging in tracing asynchronous perchance so we might run into timing issues?

@hawkw (Member) commented Nov 5, 2021

@hawkw I'm sorry for the formatting issue; I forgot to enable rustfmt formatting for the project. I'll push a fix shortly.

No worries, that's why we have a formatting check on CI. :)

As for the test failures it looks as if the tests are flaky: They pass on nightly but fail on beta and stable 🤔 I can't reproduce these issues on my machine, and I've had the same tests in libsystemd-rs pass reliably on the Github Ubuntu containers, so I tend to think that tracing itself is to "blame"?

Is logging in tracing asynchronous perchance so we might run into timing issues?

In these tests, logging should happen synchronously; we're using the synchronous std version of UDS sockets... I'm going to try rerunning the tests to see if they fail consistently on those versions, or if they're just flaky, so we can narrow down the issue here.

Review threads on tracing-journald/tests/journal.rs (outdated, resolved)
@hawkw (Member) commented Nov 5, 2021

I'm running the tests in a loop locally with a quick bash script, to try and see if I can reproduce the flaky failures on my machine:

#!/usr/bin/env bash

RUNS=0
# Report how many successful runs completed before the loop is
# interrupted. (SIGKILL can't be trapped, so trapping KILL is omitted.)
trap 'echo "killed after $RUNS runs!"; exit 0' QUIT TERM INT

while cargo test -p tracing-journald; do
    RUNS=$((RUNS + 1))
done

echo "failed after $RUNS runs!"

After running the tests 520 times (!!), I finally saw two of them fail:

Failure output
    Finished test [unoptimized + debuginfo] target(s) in 0.04s
     Running unittests (target/debug/deps/tracing_journald-1de82bbe4eda16df)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running tests/journal.rs (target/debug/deps/journal-4a6865a009a0bfc7)

running 4 tests
test multiline_message ... ok
test internal_null_byte ... ok
test multiline_message_trailing_newline ... FAILED
test simple_message ... FAILED

failures:

---- multiline_message_trailing_newline stdout ----
thread 'multiline_message_trailing_newline' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `1`', tracing-journald/tests/journal.rs:122:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:517:5
   1: core::panicking::panic_fmt
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:101:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:140:5
   4: journal::multiline_message_trailing_newline::{{closure}}
             at ./tests/journal.rs:122:9
   5: tracing_core::dispatch::with_default
             at /home/eliza/Code/tracing/tracing-core/src/dispatch.rs:263:5
   6: tracing::collect::with_default
             at /home/eliza/Code/tracing/tracing/src/collect.rs:24:5
   7: journal::with_journald
             at ./tests/journal.rs:21:13
   8: journal::multiline_message_trailing_newline
             at ./tests/journal.rs:115:5
   9: journal::multiline_message_trailing_newline::{{closure}}
             at ./tests/journal.rs:114:1
  10: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
  11: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

---- simple_message stdout ----
thread 'simple_message' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `1`', tracing-journald/tests/journal.rs:91:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:517:5
   1: core::panicking::panic_fmt
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:101:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:140:5
   4: journal::simple_message::{{closure}}
             at ./tests/journal.rs:91:9
   5: tracing_core::dispatch::with_default
             at /home/eliza/Code/tracing/tracing-core/src/dispatch.rs:263:5
   6: tracing::collect::with_default
             at /home/eliza/Code/tracing/tracing/src/collect.rs:24:5
   7: journal::with_journald
             at ./tests/journal.rs:21:13
   8: journal::simple_message
             at ./tests/journal.rs:87:5
   9: journal::simple_message::{{closure}}
             at ./tests/journal.rs:86:1
  10: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
  11: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    multiline_message_trailing_newline
    simple_message

test result: FAILED. 2 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.03s

error: test failed, to rerun pass '-p tracing-journald --test journal'
failed after 520 runs!

So, it does appear there's some kind of flakiness here. Interestingly, I saw two of the tests fail, rather than all of them like we saw on CI.

For the record, my journalctl version is

:; journalctl --version
systemd 249 (249.4)
+PAM +AUDIT -SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL -ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY -P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB -ZSTD -XKBCOMMON +UTMP -SYSVINIT default-hierarchy=unified

@swsnr (Contributor, Author) commented Nov 5, 2021

@hawkw Oh, that's a good idea, I'll try this myself. My guess is that we're too fast in querying the journal.

As far as I understand the Unix socket API, send only puts data into the send buffer and doesn't block until the remote side has actually received and digested the data.

So there's a brief window after the log call returns in which the data lingers in the send buffer before journald gets it. On my fast local multi-core system this window is too short to notice because journald picks up the data almost immediately, but on a comparatively slow system with few cores, perhaps only one, such as the CI containers, it probably takes longer, and sometimes we're just too fast 🤷

That could also explain why I've never seen this failure with libsystemd-rs: that crate opens a new socket for every log call and closes it at the end, which perhaps flushes the send buffer immediately, so journald mostly gets the data before the log call returns.

Does this sound reasonable?

@hawkw (Member) commented Nov 5, 2021

As far as I understand the Unix socket API, send only puts data into the send buffer and doesn't block until the remote side has actually received and digested the data.

So there's a brief window after the log call returns in which the data lingers in the send buffer before journald gets it. On my fast local multi-core system this window is too short to notice because journald picks up the data almost immediately, but on a comparatively slow system with few cores, perhaps only one, such as the CI containers, it probably takes longer, and sometimes we're just too fast 🤷

hmm, i wonder if we want to flush after every send call, then!

@swsnr (Contributor, Author) commented Nov 5, 2021

@hawkw I don't know whether you can flush, and the systemd docs actually recommend increasing the buffer to keep blocking to a minimum:

A client probably should bump up the SO_SNDBUF socket option of its AF_UNIX socket towards journald in order to delay blocking I/O as much as possible.

The journald subscriber doesn't actually do that, but this still seems to imply that there's some kind of buffering going on 🤔 But I'm only superficially familiar with domain sockets 😇

@Ralith (Collaborator) commented Nov 5, 2021

binary-in-JSON stuff

Ah, that seems reasonable then!

A client probably should bump up the SO_SNDBUF

Interesting point. Is there a limit to what an unprivileged process can raise it to? What does the reference client implementation do? For IP sockets generally the default is also the maximum, but maybe Unix sockets are different.

@swsnr (Contributor, Author) commented Nov 5, 2021

Interesting point. Is there a limit to what an unprivileged process can raise it to? What does the reference client implementation do? For IP sockets generally the default is also the maximum, but maybe Unix sockets are different.

I honestly don't know 🤷 I'm sorry, I just did some wild guessing about Unix domain sockets based on a superficial understanding of networking code, and read this paragraph in the docs 😄

I'll make a note to find out, but things are piling up a bit now 😬 so I'd like to finish what I came for first, namely #1698 (and also #1710 now), before digging deeper 😇 I hope that's okay for now?

@Ralith (Collaborator) commented Nov 5, 2021

I hope that's okay for now?

Of course! Just food for thought. I, for one, am not producing enough logs for this to be a serious risk regardless.

@swsnr (Contributor, Author) commented Nov 5, 2021

@Ralith I just checked the systemd C source because I can never remember how this ancillary data stuff in socket messages works, and libsystemd does this when opening the socket:

#define SNDBUF_SIZE (8*1024*1024)

static int journal_fd(void) {
        […]
        fd = socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0);
        if (fd < 0)
                return -errno;

        fd_inc_sndbuf(fd, SNDBUF_SIZE);

        […]
 
        return fd;
}

fd_inc_sndbuf comes down to fd_set_sndbuf(fd, SNDBUF_SIZE, true) with:

int fd_set_sndbuf(int fd, size_t n, bool increase) {
        int r, value;
        socklen_t l = sizeof(value);

        if (n > INT_MAX)
                return -ERANGE;

        r = getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &value, &l);
        if (r >= 0 && l == sizeof(value) && increase ? (size_t) value >= n*2 : (size_t) value == n*2)
                return 0;

        /* First, try to set the buffer size with SO_SNDBUF. */
        r = setsockopt_int(fd, SOL_SOCKET, SO_SNDBUF, n);
        if (r < 0)
                return r;

        /* SO_SNDBUF above may set to the kernel limit, instead of the requested size.
         * So, we need to check the actual buffer size here. */
        l = sizeof(value);
        r = getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &value, &l);
        if (r >= 0 && l == sizeof(value) && increase ? (size_t) value >= n*2 : (size_t) value == n*2)
                return 1;

        /* If we have the privileges we will ignore the kernel limit. */
        r = setsockopt_int(fd, SOL_SOCKET, SO_SNDBUFFORCE, n);
        if (r < 0)
                return r;

        return 1;
}

I don't really follow the details but it looks as if systemd tries hard to max out the send buffer for its socket.

Edit: It doesn't seem to make any difference for unprivileged processes; on my system the default is the same as the maximum permitted, about 212 kB, but privileged processes can apparently exceed this, and systemd attempts to go up to 8 MiB if I understand the code correctly.
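For reference, the equivalent attempt from Rust might look roughly like this (a sketch assuming the libc crate; the subscriber doesn't currently do this, and the function name is made up):

use std::os::unix::io::AsRawFd;
use std::os::unix::net::UnixDatagram;

// Sketch: raise SO_SNDBUF on the socket towards journald, as the
// systemd docs suggest. std's UnixDatagram doesn't expose this option,
// so this goes through libc directly; for unprivileged processes the
// kernel silently caps the value at net.core.wmem_max.
fn bump_sndbuf(socket: &UnixDatagram, bytes: libc::c_int) -> std::io::Result<()> {
    let rc = unsafe {
        libc::setsockopt(
            socket.as_raw_fd(),
            libc::SOL_SOCKET,
            libc::SO_SNDBUF,
            &bytes as *const libc::c_int as *const libc::c_void,
            std::mem::size_of::<libc::c_int>() as libc::socklen_t,
        )
    };
    if rc == 0 {
        Ok(())
    } else {
        Err(std::io::Error::last_os_error())
    }
}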

I can make a ticket for this, but I don't actually run any processes where this would make a difference, so it's not my priority, and I tend to leave this exercise to someone who's actually logging megabytes of things to journald 😇

@Ralith (Collaborator) commented Nov 5, 2021

Nice research! Sounds like the impact would indeed be pretty niche, and I see no need to pursue it in the absence of demand.

@swsnr (Contributor, Author) commented Nov 9, 2021

I've rebased on master; is there anything from my side which still blocks this pull request? 🙂

@Ralith (Collaborator) left a comment

LGTM, thanks!

@hawkw (Member) commented Nov 9, 2021

I've rebased on master; is there anything from my side which still blocks this pull request? 🙂

Are we reasonably confident that the flaky test failures won't occur often in the future? Of course, since these are integration tests that rely on communicating with an external process, we can't ever completely rule out spurious failures, but the main blocker on my side is knowing that they should be infrequent enough that we can generally trust the results of these tests.

If you feel confident that the spurious failures aren't going to be a problem in the future, I'm happy to merge this!

@Ralith (Collaborator) commented Nov 9, 2021

We spend up to an entire second on retries now; I don't think we can possibly be more reliable than that without suppressing failures entirely, which comes with a very high risk of silent bitrot, or by mocking journald, which somewhat defeats the point of the integration tests.
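(For illustration, that kind of retry amounts to something like this hypothetical sketch, reusing the read_from_journal helper sketched earlier; not the PR's literal code:)

use std::time::Duration;

// Poll the journal for up to ~1 second before giving up, since journald
// may not have processed the datagram by the time the log call returns.
fn retry_read_from_journal() -> Vec<serde_json::Value> {
    for _ in 0..10 {
        let entries = read_from_journal();
        if !entries.is_empty() {
            return entries;
        }
        std::thread::sleep(Duration::from_millis(100));
    }
    Vec::new()
}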

@swsnr (Contributor, Author) commented Nov 9, 2021

I tend to agree: I'd say that if they still turn out to be flaky, there's a more general issue with journal logging. I've used your script to run the tests a few hundred times and had no further issues.

That said, if they do break, please mark them as ignored in the meantime, and I promise to try and fix them 🙂

@hawkw (Member) commented Nov 9, 2021

Okay, in that case, I'm going to go ahead and merge this. Thanks for working on this!

@hawkw hawkw merged commit fdf8178 into tokio-rs:master Nov 9, 2021
@swsnr (Contributor, Author) commented Nov 9, 2021

@hawkw Oh cool 🙏 I'll bring up #1710 and #1698 next 🙂

@swsnr deleted the journald-integration-tests branch on November 9, 2021 22:16
hawkw pushed a commit that referenced this pull request Dec 2, 2021
@hawkw mentioned this pull request Dec 2, 2021
hawkw pushed a commit that referenced this pull request Dec 19, 2021
hawkw pushed a commit that referenced this pull request Dec 19, 2021
hawkw pushed a commit that referenced this pull request Dec 20, 2021