journald: Add integration tests #1709
Conversation
Should we also have a test with an internal null?
looks good to me, thanks!
I'll add one 🙂
@Ralith I've added another test for an internal null; I had to add a dev-dependency for serde and add a custom struct for field values, because …
A number array? Weird. But it looks like you're decoding it as a string? What does it do if you don't ask for JSON to begin with, and pass …?
@Ralith Well, short of base64 strings, an array of numbers is the only way to represent binary data in JSON, I guess 🤷 Anyway, it's documented in … See also https://www.freedesktop.org/wiki/Software/systemd/json/, which the … I'm not decoding it as …; I wasn't aware of … I can drop JSON, but I'm not sure what else I'd use; JSON is one of two structured output formats in …

@hawkw I'm sorry for the formatting issue; I forgot to enable rustfmt formatting for the project. I'll push a fix shortly. As for the test failures, it looks as if the tests are flaky: they pass on nightly but fail on beta and stable 🤔 I can't reproduce these issues on my machine, and I've had the same tests in libsystemd-rs pass reliably on the GitHub Ubuntu containers, so I tend to think that tracing itself is to "blame"? Is logging in …
No worries, that's why we have a formatting check on CI. :)
In these tests, logging should happen synchronously; we're using the synchronous …
I'm running the tests in a loop locally with a quick bash script, to try and see if I can reproduce the flaky failures on my machine:

```bash
#!/usr/bin/env bash
RUNS=0
# (SIGKILL can't actually be trapped, so it's omitted here.)
trap 'echo "killed after $RUNS runs!"; exit 0' QUIT TERM INT
while cargo test -p tracing-journald; do
    RUNS=$(($RUNS + 1))
done
echo "failed after $RUNS runs!"
```

After running the tests 520 times (!!), I finally saw two of them fail:

Failure output
So, it does appear there's some kind of flakiness here. Interestingly, I saw two of the tests fail, rather than all of them like we saw on CI. For the record, my …
@hawkw Oh, that's a good idea, I'll try this myself. My guess is that we're too fast in asking the journal. As far as I understand the unix socket API, … So there's a brief time after the log call returns in which the data lingers in the send buffer before journald gets it. On my local fast multi-core system this time is too short to matter, because journald gets the data almost immediately, but on a comparatively slow system with few (perhaps only one) cores, such as the CI containers, it probably takes longer, and sometimes we're just too fast 🤷

That could also explain why I've never seen this failure with libsystemd-rs: that crate opens a new socket for every log call and closes it at the end, which perhaps immediately flushes the send buffer, so journald mostly gets the data before the log call returns. Does this sound reasonable?
hmm, i wonder if we want to flush after every …
@hawkw I don't know whether you can flush, and the systemd docs actually recommend increasing the buffer to reduce blocking to a minimum:

The journald subscriber doesn't actually do that, but this still seems to imply that there's some kind of buffering going on 🤔 But I'm only superficially familiar with domain sockets 😇
Ah, that seems reasonable then!
Interesting point. Is there a limit to what an unprivileged process can raise it to? What does the reference client implementation do? For IP sockets the default is generally also the maximum, but maybe Unix sockets are different.
I honestly don't know 🤷 I'm sorry, I just did some wild guessing about unix domain sockets based on some superficial understanding of networking code, and read this paragraph in the docs 😄 I'll make a note to find out, but things are piling up a bit now 😬 so I guess I'd like to finish what I came for first, namely #1698 (and also #1710 now), before digging deeper 😇 I hope that's okay for now?
Of course! Just food for thought. I, for one, am not producing enough logs for this to be a serious risk regardless.
@Ralith I just checked the systemd C source because I can never remember how this ancillary data stuff in socket messages works, and libsystemd does this when opening the socket:

```c
#define SNDBUF_SIZE (8*1024*1024)

static int journal_fd(void) {
        […]
        fd = socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0);
        if (fd < 0)
                return -errno;

        fd_inc_sndbuf(fd, SNDBUF_SIZE);
        […]
        return fd;
}

int fd_set_sndbuf(int fd, size_t n, bool increase) {
        int r, value;
        socklen_t l = sizeof(value);

        if (n > INT_MAX)
                return -ERANGE;

        r = getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &value, &l);
        if (r >= 0 && l == sizeof(value) && increase ? (size_t) value >= n*2 : (size_t) value == n*2)
                return 0;

        /* First, try to set the buffer size with SO_SNDBUF. */
        r = setsockopt_int(fd, SOL_SOCKET, SO_SNDBUF, n);
        if (r < 0)
                return r;

        /* SO_SNDBUF above may set to the kernel limit, instead of the requested size.
         * So, we need to check the actual buffer size here. */
        l = sizeof(value);
        r = getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &value, &l);
        if (r >= 0 && l == sizeof(value) && increase ? (size_t) value >= n*2 : (size_t) value == n*2)
                return 1;

        /* If we have the privileges we will ignore the kernel limit. */
        r = setsockopt_int(fd, SOL_SOCKET, SO_SNDBUFFORCE, n);
        if (r < 0)
                return r;

        return 1;
}
```

I don't really follow the details, but it looks as if systemd tries hard to max out the send buffer for its socket.

Edit: It doesn't seem to make any difference for unprivileged processes; on my system the default is the same as the maximum permitted, about 212kB, but privileged processes can apparently exceed this, and systemd attempts to go to 8MiB if I understand the code correctly. I can make a ticket for this, but I don't actually run any processes where this would make a difference, so it's not my priority, and I tend to leave this exercise to someone who's actually logging megabytes of things to journald 😇
Nice research! Sounds like the impact would indeed be pretty niche, and I see no need to pursue it in the absence of demand. |
I've rebased on master; is there anything from my side which still blocks this pull request? 🙂
LGTM, thanks!
Are we reasonably confident that the flaky test failures won't occur often in the future? Of course, since these are integration tests that rely on communicating with an external process, we can't ever completely rule out spurious failures, but the main blocker on my side is knowing that they should be infrequent enough that we can generally trust the results of these tests. If you feel confident that the spurious failures aren't going to be a problem in the future, I'm happy to merge this!
We spend up to an entire second on retries now; I don't think we can possibly be more reliable than that without suppressing failures entirely, which comes with a very high risk of silent bitrot, or by mocking journald, which somewhat defeats the point of the integration tests.
I tend to agree: I'd say if they still turn out to be flaky, there's a general issue with journal logging. I've used your script to run the tests a few hundred times and had no further issues. That said, if they break, please do mark them as ignored in the meantime, and I promise to try and fix them 🙂
Okay, in that case, I'm going to go ahead and merge this. Thanks for working on this!
Per discussion with @hawkw in #1698 I'm adding a few simple integration tests for the journald subscriber, to have some safety net when implementing the actual issue in #1698. These tests send messages of various complexity to the journal, and then use `journalctl`'s JSON output to get them back out, to check whether the message arrives in the systemd journal as intended.

## Motivation

Increase test coverage for the journald subscriber and codify a known good state before approaching a fix for #1698.